AquaINFRA News

Improving Species Distribution Models with specleanr

November 18th, 2025

High-quality input data is crucial for reliable Species Distribution Modelling (SDM). Yet one often overlooked aspect of data quality in SDM workflows is the presence of environmental outliers: occurrence records associated with unusual or extreme environmental values. These outliers can distort model training and predictions, underscoring the need for automated detection methods in large biodiversity datasets. Recognising this gap, Basooma et al. (2025) have introduced specleanr, a new R package designed to flag and remove environmental outliers from ecological data in a reproducible way. The tool aims to improve model accuracy by ensuring that input data better reflect plausible environmental conditions for each species.

The full publication is available at DOI: 10.1002/ecog.08221, and the package repository at https://github.com/AnthonyBasooma/fwtraits

https://doi.org/10.6084/m9.figshare.29126783.v1

The specleanr R Package: Ensemble Outlier Detection

The specleanr package includes 20 outlier detection methods, drawn from a range of statistical approaches. These methods fall into three broad categories:

  • Species-specific ecological range methods – checks against each species’ known environmental tolerances (for example, flagging records outside a species’ typical climate range).

  • Univariate methods – detection of outliers on single environmental predictor variables (identifying data points with extreme values in any one variable).

  • Multivariate methods – detection of outliers in multi-dimensional environmental space (flagging records that are anomalous combinations of variables).

By applying these methods in parallel, specleanr flags any occurrence record that one or more methods identify as an outlier. The results are then combined: a record flagged by a sufficient share of the methods is designated an absolute outlier. To determine objectively how many flags are enough, specleanr employs a local regression (LOESS) technique that automatically sets an optimal threshold for outlier identification. This makes the process data-driven, removing the need for subjective manual thresholds.
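
The voting idea can be illustrated with a minimal sketch. It is written in Python rather than R and does not use the specleanr API: the three univariate detectors, the fixed 0.6 threshold (standing in for the LOESS-derived one), and the temperature values are all assumptions chosen for illustration.

```python
from statistics import mean, median, quantiles, stdev


def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = k * (q3 - q1)
    return [v < q1 - spread or v > q3 + spread for v in values]


def zscore_flags(values, cutoff=2.5):
    """Flag values more than `cutoff` sample standard deviations from the mean."""
    centre, sd = mean(values), stdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - centre) / sd > cutoff for v in values]


def mad_flags(values, cutoff=3.5):
    """Flag values whose modified z-score (median and MAD based) exceeds `cutoff`."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - centre) / mad > cutoff for v in values]


def ensemble_outliers(values, threshold=0.6):
    """Return (is_absolute_outlier, share_of_methods_flagging) per record.

    `threshold` is a fixed, hypothetical cut-off; specleanr derives its
    threshold from the data instead.
    """
    methods = (iqr_flags, zscore_flags, mad_flags)
    votes = [method(values) for method in methods]
    share = [sum(record_votes) / len(methods) for record_votes in zip(*votes)]
    return [s >= threshold for s in share], share


# Hypothetical water temperatures (degrees C) at occurrence sites of one species.
temps = [11.8, 12.1, 12.4, 12.0, 11.9, 12.3, 12.2, 12.6, 11.7, 12.5, 13.6, 29.0]
absolute, share = ensemble_outliers(temps)
for value, is_outlier, s in zip(temps, absolute, share):
    print(f"{value:5.1f}  flagged by {s:.0%} of methods  absolute outlier: {is_outlier}")
```

In this toy dataset the 13.6 record is flagged by only one of the three detectors and is therefore retained, whereas the 29.0 record is flagged by all three and is treated as an absolute outlier.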

In addition to simply flagging outliers, the package provides a graded assessment of outlier likelihood. Each record is classified into categories such as poor, fair, moderate, very strong, or perfect outlier (or identified as a non-outlier) based on how consistently it is flagged across the methods. This nuanced classification allows researchers to focus on the most extreme outliers first, while retaining borderline cases for further expert review. By quantifying outlier status on a spectrum, specleanr supports more informed decisions about data cleaning; for example, whether to remove only the most extreme outliers or a wider set, depending on the analysis needs.
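
A graded assessment of this kind could be sketched as follows. The category names come from the description above, but the cut-off values are hypothetical and are not taken from specleanr; the sketch is in Python and only illustrates mapping the share of methods that flag a record onto an ordered label.

```python
def grade_record(n_flags, n_methods):
    """Map the share of methods flagging a record to an outlier grade.

    The cut-offs below are illustrative assumptions, not specleanr's values.
    """
    if n_methods <= 0 or not 0 <= n_flags <= n_methods:
        raise ValueError("need 0 <= n_flags <= n_methods and n_methods > 0")
    if n_flags == 0:
        return "non-outlier"
    if n_flags == n_methods:
        return "perfect"
    share = n_flags / n_methods
    if share >= 0.8:
        return "very strong"
    if share >= 0.5:
        return "moderate"
    if share >= 0.25:
        return "fair"
    return "poor"


# Hypothetical flag counts for five records checked with 20 methods.
for flags in (0, 2, 5, 10, 17, 20):
    print(f"{flags:2d}/20 flags -> {grade_record(flags, 20)}")
```

With labels like these, a cleaning step can drop only the records graded "perfect" or "very strong" and pass the remainder to expert review.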

Case Study: Danube River Basin Fish Species

To demonstrate specleanr’s performance, Basooma et al. applied the package to occurrence records of 15 fish species from the Danube River Basin. These included native, alien, threatened, and common species, providing a robust test across different ecological profiles. For each species, an SDM was built using bioclimatic and hydromorphological predictors, and model accuracy was evaluated before and after outlier removal.

The authors tested three outlier-removal scenarios:

  1. Removing outliers identified using the LOESS-derived threshold (all absolute outliers flagged by multiple methods).

  2. Removing only records classified as very strong outliers.

  3. Removing only perfect outliers (the most extreme values).

Across all three approaches, the cleaned datasets produced higher Area Under the Curve (AUC) scores, indicating more robust model performance. Although the effect sizes were generally small to moderate, the improvement was consistent, showing that even modest outlier removal enhances SDM accuracy. The study provides clear evidence that environmental outlier detection should be a standard part of data preparation for ecological modelling.

Broader Impact and Alignment with AquaINFRA

While specleanr was developed with species distribution modelling in mind, its ensemble approach is generalisable across taxonomic groups, data types, and environmental realms. It can also be applied beyond SDM, for example in marine ecology, hydrology, or any field where identifying anomalous observations is critical. By automating and standardising environmental outlier detection, specleanr helps researchers improve data integrity and model reproducibility across scientific disciplines.

Importantly, specleanr was created in the spirit of open, transparent, and reproducible science. The package is fully open-source and includes detailed documentation and vignettes to support users. This aligns closely with the goals of the AquaINFRA project, which is building a Virtual Research Environment (VRE) and Data Discovery and Access System (DDAS) to enhance the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of water and biodiversity data across Europe.

AquaINFRA aims to make data and analytical tools interoperable across marine, freshwater, and environmental research communities. Tools like specleanr, which automate and improve data quality control, are key to this vision. By embedding reproducible data-cleaning processes into AquaINFRA workflows, researchers can ensure their analyses are transparent, high-quality, and easily reused by others.

Reference

Basooma, A., Schmidt-Kloiber, A., Domisch, S., Torres-Cambas, Y., Smederevac-Lalić, M., Bremerich, V., Tschikof, M., Meulenbroek, P., Funk, A., Hein, T., & Borgwardt, F. (2025). specleanr: An R package for automated environmental outlier detection in species distribution modelling and general data analysis. Figshare. https://doi.org/10.6084/m9.figshare.29126783.v1