- Research Note
- Open access
Regional soil salinity analysis using stepwise M5 decision tree
BMC Research Notes volume 18, Article number: 90 (2025)
Abstract
Objective
The study aimed to evaluate the potential of multispectral satellite images for soil salinity assessment using multiple linear regression and the M5 decision tree regression method. To this end, 96 soil samples were collected and correlated with 15 independent variables comprising spectral information and indices derived from Landsat 8 satellite images.
Results
Due to the nonlinear relationship between EC and spectral bands, linear regression results were unsatisfactory, with the highest correlation coefficient of 58% and an RMSE of 0.78. The M5 decision tree regression model provided better results, with a correlation coefficient of 73% and an RMSE of 0.29 after establishing 9 regression relationships, successfully estimating the natural logarithm of EC. The B64, NDII, and S2 indices are the most influential in remotely sensed soil salinity estimation. Furthermore, the M5 model, utilizing six regression equations, demonstrates a 37.18% improvement in accuracy compared to a multivariate linear regression approach. Factors such as vegetation cover, soil moisture, and uneven moisture content of samples during collection contributed to errors in assessing soil salinity using satellite images.
Introduction
Soil salinity, caused by natural processes and human activities, is a significant environmental issue affecting over 6% of the global land area, or about 800 million hectares [1]. It results in the loss of three hectares of agricultural land per minute worldwide, notably in arid and semi-arid regions like Iran, where 90% of the land is vulnerable [2]. An estimated 55.6 million hectares, or 34%, of Iran is at salinity risk [3]. Contributing factors include poor irrigation infrastructure, saline or sodic water use, geological salt levels, seawater intrusion, river salinity, low precipitation, and high evaporation rates due to extreme heat [4].
Saline soils contain high concentrations of soluble salts, which negatively affect soil productivity, stability, and biodiversity. Salinization also reduces water resources by hindering the soil’s ability to retain and release water [5, 6]. Conventional methods for monitoring salt-affected soils involve regular field visits and laboratory tests, making them impractical for extensive areas. Remote sensing (RS) offers a more efficient alternative, allowing for timely observation of large areas. The accuracy of predictions from RS and predictive mapping techniques depends on the quality of the data and models used [7, 8]. RS techniques are popular for assessing soil salinity because they provide quick, non-invasive evaluations over large areas. However, there are limitations concerning accuracy and resolution [9, 10]. Effective management and corrective measures for saline land are essential to enhancing global food security, and they require thorough assessment and quantification [11, 12].
Materials and methods
Study area
Golestan Province is located in northern Iran, spanning from 54°00′ to 54°32′ East longitude and 36°46′ to 37°18′ North latitude (Fig. 1), with an average annual rainfall of 455 mm and a mean temperature of 17 °C [13]. According to the United States soil taxonomy, the soil is classified as Haploxerept [14].
Fig. 1 Location of Golestan Province and the sampling site (a), overlaid on the National Geographic Style Map in Esri ArcGIS (version 10.8), and ground samples (b) [15]
This study investigates the effectiveness of optical Earth observation imagery in estimating soil salinity across the province, from the saline soils near Gomishan to the fertile areas south of Kordkuy. The research methodology is illustrated in Fig. 2.
Datasets
This paper uses two datasets. The first consists of soil Electrical Conductivity (EC) values measured directly on July 5, 2018, from 98 soil samples using the saturation extract method, a widely accepted and accurate technique for determining soil salinity levels [16]. The EC values were then classified into five groups using the “natural breaks” (Jenks) method, a statistical technique that identifies breakpoints by maximizing the differences between classes while minimizing the variation within each class. Detailed information on soil sampling, laboratory analysis, and salinity classification is provided in Appendix A.
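The natural breaks classification can be illustrated with a short Python sketch. This is not the authors' implementation: it is a minimal dynamic-programming search that minimizes the total within-class sum of squared deviations for a one-dimensional sample. The function name `jenks_breaks` and the EC values are hypothetical and chosen only for illustration.

```python
import numpy as np


def jenks_breaks(values, n_classes):
    """Return the upper bound of each class for a natural-breaks split.

    Minimizes the total within-class sum of squared deviations (SSD)
    using dynamic programming over the sorted values.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the SSD of any slice x[i:j] in constant time.
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def ssd(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # cost[k][j]: minimal SSD when the first j values form k classes.
    cost = np.full((n_classes + 1, n + 1), np.inf)
    split = np.zeros((n_classes + 1, n + 1), dtype=int)
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    split[k][j] = i

    # Walk back through the split table to recover class upper bounds.
    bounds = []
    j = n
    for k in range(n_classes, 0, -1):
        bounds.append(float(x[j - 1]))
        j = split[k][j]
    return bounds[::-1]


# Hypothetical EC values (dS/m) forming three well-separated groups.
ec = [1.0, 1.2, 1.1, 8.0, 8.3, 7.9, 30.0, 31.0]
breaks = jenks_breaks(ec, 3)
print(breaks)
```

The cubic-time search is acceptable for roughly a hundred samples; larger datasets would normally use an optimized library routine instead.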
The second dataset is the spectral information derived from Landsat 8 (L8) satellite images. Details of the dataset, its preprocessing, and the indices are provided in Appendix B.
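As a rough illustration of how band ratios and normalized-difference indices are computed from band reflectances, consider the Python sketch below. The exact index definitions used in the study are given in Appendix B, which is not reproduced here, so the formulas are assumptions: a ratio such as B64 is taken to be band 6 divided by band 4, and NDVI is taken in its common form (NIR - Red) / (NIR + Red), with the Landsat 8 OLI numbering (band 4 red, band 5 near-infrared). The function names and reflectance values are hypothetical.

```python
import numpy as np


def band_ratio(numerator, denominator):
    """Element-wise band ratio; pixels with a zero denominator become NaN."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.full(np.broadcast(num, den).shape, np.nan)
    np.divide(num, den, out=out, where=den != 0)
    return out


def normalized_difference(a, b):
    """Normalized difference (a - b) / (a + b); zero sums become NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return band_ratio(a - b, a + b)


# Hypothetical surface reflectance values for two pixels.
b4 = np.array([0.10, 0.20])   # red (assumed OLI band 4)
b5 = np.array([0.30, 0.20])   # near-infrared (assumed OLI band 5)
b6 = np.array([0.25, 0.32])   # shortwave infrared 1 (assumed OLI band 6)

b64 = band_ratio(b6, b4)              # ratio of band 6 to band 4
ndvi = normalized_difference(b5, b4)  # assumed NDVI definition
print(b64, ndvi)
```

Returning NaN for zero denominators keeps masked or no-data pixels from producing infinities that would distort later correlation or regression steps.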
M5 model tree
The M5 model, a type of Decision Tree (DT) designed for regression analysis, was introduced by Quinlan [17]. As a prediction algorithm, the M5 model uses the standard deviation of the values within a subset of the training data to decide how to split the data at each node. It is a piecewise linear model that sits between linear models and nonlinear models such as neural networks. Constructing the model tree involves first growing a tree and then pruning it to prevent overfitting, in which the model closely fits the training data but performs poorly on the test set. Additionally, smoothing addresses abrupt transitions between adjacent linear models at the leaves of the pruned tree. The construction process of the M5 model tree closely parallels that of a traditional DT [18]. A challenge with the M5 model tree is that increasing the number of variables does not necessarily improve accuracy. Consequently, a stepwise implementation is proposed to counteract the inherent greediness of the model tree algorithm [19]. In the M5 process, node branching is repeated many times to improve the model’s accuracy. This results in an extensive tree with numerous branches and leaves, which presents a computational challenge. Pruning is therefore needed to reduce the model to a simpler and more efficient form [20]. Splitting within the DT is guided by the standard deviation reduction (SDR). The expected error reduction of a split can be calculated as:
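\[\mathrm{SDR} = \mathrm{Sd}(T) - \sum_{k} \frac{|T_k|}{|T|}\,\mathrm{Sd}(T_k), \qquad \mathrm{Sd}(T) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\]
where T is the set of input samples at the node, \(T_k\) is the subset of samples assigned to the k-th branch of the split, Sd is the standard deviation, \(y_i\) is the numerical value of the i-th sample, \(\bar{y}\) is the mean of these values, and n is the number of samples in T. For evaluation, the Coefficient of determination (R2) and Root Mean Squared Error (RMSE) are used and calculated as follows:
\[R^2 = 1 - \frac{\sum_{i=1}^{n}\left(Q_i - P_i\right)^2}{\sum_{i=1}^{n}\left(Q_i - \bar{Q}\right)^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - Q_i\right)^2}\]
where \(P_i\) represents the value predicted by the model, and \(Q_i\) represents the measured value. \(\bar{Q}\) is the mean of the measured values, and n is the number of observations.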
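The split criterion and the two evaluation metrics can be sketched in Python as follows. The sketch assumes the commonly used definitions: SDR as the standard deviation of the parent node minus the size-weighted standard deviations of the child subsets, R2 as one minus the ratio of the residual to the total sum of squares, and RMSE as the root of the mean squared difference between predicted and measured values. The function names and the toy values are hypothetical.

```python
import numpy as np


def sdr(y, left_mask):
    """Standard deviation reduction obtained by a binary split of y."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    reduction = y.std()  # population standard deviation of the parent
    for part in (y[left_mask], y[~left_mask]):
        if part.size:
            reduction -= part.size / y.size * part.std()
    return float(reduction)


def rmse(measured, predicted):
    """Root mean squared error between measured and predicted values."""
    q = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((p - q) ** 2)))


def r2(measured, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    q = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    ss_res = np.sum((q - p) ** 2)
    ss_tot = np.sum((q - q.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy target values: a split that separates the two groups removes
# far more variation than a split that mixes them.
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
good_split = sdr(y, [True, True, True, False, False, False])
poor_split = sdr(y, [True, False, True, False, True, False])
print(good_split, poor_split)
```

A tree builder would evaluate `sdr` for every candidate threshold of every variable and keep the split with the largest value.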
Results
The dot plot histogram of the distribution of the target or predicted variable indicated a non-normal distribution of soil salinity values (Fig. 3).
Consequently, a natural logarithmic transformation was applied to normalize the distribution of the values, which are referred to hereafter as Ln EC (Fig. 4).
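The effect of the transformation can be illustrated with a small Python sketch that compares the skewness of a right-skewed sample before and after taking the natural logarithm. The EC values below are hypothetical and are not the study's measurements.

```python
import numpy as np


def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)


# Hypothetical right-skewed EC values (dS/m).
ec = np.array([0.8, 1.1, 1.5, 2.0, 2.7, 4.0, 9.5, 22.0, 60.0])
ln_ec = np.log(ec)

print(skewness(ec), skewness(ln_ec))
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which suits regression models that fit linear relationships.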
Variations in spectral reflectance curves corresponding to different salinity levels
A spectral reflectance curve was generated to examine the changes in spectral reflectance among soils with varying salinity levels, utilizing spectral band information extracted from L8 images (Fig. 5). The curve illustrates minimal differences in reflection across various salinity levels, with consistent reflection levels observed in other bands corresponding to different salinity levels.
Among the spectral indices computed, the Normalized Difference Salinity Index (NDSI) transitioned from positive to negative values as salinity levels increased (Fig. 6). The index value decreased with escalating salinity, although a slight increase was noted in extremely high salinity classes. This trend was also observed in other indices, where the index values generally decreased with rising salinity, but a slight increase was observed in the highest salinity classes. The NDSI index effectively distinguishes between saline and non-saline soils, demonstrating notable differences between non-saline and slightly salty soil.
Various spectral ratios exhibit diverse trends; certain ratios witness a decline in values with escalating salinity, while others experience an increase (Fig. 7). Notably, the spectral ratios of B52, B53, and B54 demonstrate the most substantial variations across distinct salinity classes.
Investigating correlations of spectral bands, indices, ratios, and EC
Table 1 indicates that individual bands do not correlate strongly with EC (see Fig. 8).
Initially, we computed correlation coefficients between soil EC and the spectral bands and their derived indices (Table 2). Subsequently, these correlations were recomputed after applying the natural logarithm to EC. The strongest correlations were identified between the natural logarithm of EC and spectral information, particularly band 4, band 3, band 2, and the spectral ratio of band 6 to band 4. Moreover, three variables exhibited correlations exceeding 60% with Ln EC.
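To show why correlating predictors with Ln EC rather than EC can strengthen a linear correlation, the Python sketch below computes Pearson's correlation coefficient for both cases. The data are synthetic and assume, purely for illustration, that EC depends exponentially on a band ratio; the direction and strength of the relationship are not taken from the study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


# Synthetic band ratio and an EC that depends on it exponentially.
b64 = np.linspace(1.0, 2.5, 20)
ec = np.exp(4.0 - 1.5 * b64)

r_raw = pearson_r(b64, ec)          # correlation with EC
r_log = pearson_r(b64, np.log(ec))  # correlation with Ln EC
print(r_raw, r_log)
```

Because the logarithm turns the exponential dependence into a straight line, the correlation with Ln EC is stronger in magnitude than the correlation with EC.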
Table 3 elucidates the influence of variables on the model performance. It emphasizes the significance of the number of variables in predicting EC and how excluding a specific variable affects the correlation coefficient and prediction error. In the 9-variable tree model, “B64” emerges as the pivotal variable, possessing the highest Standard Deviation Reduction (SDR) value and serving as the root for splitting. Notably, in certain splits, “B64” initiates further divisions. These accurately formed splits in the M5 tree model efficiently segregate data and model non-linear relationships through simple linear relationships.
In the optimal 3-variable tree model, excluding the B64 index increases the error more than removing any other variable (Table 4).
The highest prediction error occurs when variable B7 is excluded in the optimal four-variable model. Notably, the three variables S2, B64, and NDII_NDWI2, forming the best combination of four variables along with B7, are found to be absent in models after the exclusion of B7, resulting in the most significant errors, as indicated in Table 5.
In the 5-variable tree model, excluding the B64 index leads to an elevation in prediction error, as shown in Table 6. Subsequently, the highest prediction error occurs within the optimal 9-variable tree model when the B64 variable is absent. Notably, the absence of the two B7 variables and B32, forming the best 3-variable combination with B64, results in the most substantial error in models after B64.
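The variable-exclusion analysis described above can be sketched as a leave-one-variable-out loop. For brevity the sketch substitutes an ordinary least-squares fit for the M5 model tree, so it illustrates only the procedure, not the study's model. The data are synthetic, the variable names are borrowed from the text as labels only, and the helper names are hypothetical.

```python
import numpy as np


def fit_rmse(X, y):
    """In-sample RMSE of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


def exclusion_errors(X, y, names):
    """RMSE obtained when each variable in turn is left out of the model."""
    errors = {}
    for k, name in enumerate(names):
        kept = [j for j in range(X.shape[1]) if j != k]
        errors[name] = fit_rmse(X[:, kept], y)
    return errors


# Synthetic predictors; the first column is made the dominant one.
rng = np.random.default_rng(0)
names = ["B64", "B7", "S2"]
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.2, size=100)

errors = exclusion_errors(X, y, names)
most_important = max(errors, key=errors.get)
print(errors, most_important)
```

The variable whose removal produces the largest error is the one the model depends on most, which is the reasoning applied to B64 in the tables.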
Regression analysis of M5
The DT regression model significantly improved soil salinity modeling accuracy, increasing the correlation to 73%. This improvement was achieved by partitioning the data space into nine segments and providing a multiple-variable regression equation for each segment. The DT regression process involved the initial division of data into two subsets based on whether the band ratio B64 is greater or less than 1.6 (Fig. 9). Subsequently, each subgroup was divided into two additional subsets based on band 7.
Ultimately, the dataset was divided into six subsets, and the non-linear space of the data was modeled using six multivariate regression equations (Table 7).
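The idea of modeling a non-linear data space with one linear equation per subset can be sketched as a small two-level model tree. This is a simplified illustration, not the fitted model: only the B64 threshold of 1.6 comes from the text, while the B7 threshold of 0.2, the four-leaf layout, the class name `TwoLevelModelTree`, and the synthetic data are assumptions. Pruning and smoothing are omitted.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares coefficients (intercept first) for one leaf."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def leaf_index(X, b64_split=1.6, b7_split=0.2):
    """Assign each row (columns: B64, B7) to one of four leaves."""
    return 2 * (X[:, 0] > b64_split).astype(int) + (X[:, 1] > b7_split).astype(int)


class TwoLevelModelTree:
    """Split on B64, then on B7, and fit a linear model in each leaf."""

    def fit(self, X, y):
        leaves = leaf_index(X)
        self.models = {int(k): fit_linear(X[leaves == k], y[leaves == k])
                       for k in np.unique(leaves)}
        return self

    def predict(self, X):
        leaves = leaf_index(X)
        out = np.full(len(X), np.nan)  # NaN for leaves unseen in training
        for k, coef in self.models.items():
            mask = leaves == k
            if mask.any():
                out[mask] = predict_linear(coef, X[mask])
        return out


def rmse(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))


# Synthetic data with a different linear relationship in each leaf.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(1.0, 2.5, 200), rng.uniform(0.05, 0.4, 200)])
true = np.array([[1.0, 0.5, -2.0], [0.2, 1.5, 1.0],
                 [3.0, -1.0, 0.5], [-1.0, 0.8, 4.0]])
leaves = leaf_index(X)
y = true[leaves, 0] + true[leaves, 1] * X[:, 0] + true[leaves, 2] * X[:, 1]

tree_rmse = rmse(TwoLevelModelTree().fit(X, y).predict(X), y)
linear_rmse = rmse(predict_linear(fit_linear(X, y), X), y)
print(tree_rmse, linear_rmse)
```

On data of this kind a single linear equation cannot follow the changes in slope between subsets, whereas the per-leaf equations can, which is the motivation for the piecewise approach.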
The findings demonstrate that the M5 can estimate EC with enhanced accuracy, achieving a correlation coefficient of 0.85 using information extracted from L8 (Fig. 10).
Evaluation and model comparison
The findings highlight that the M5 regression tree model can estimate soil EC with better accuracy, achieving a correlation coefficient of 0.73 using information extracted from L8 (Table 8).
Discussion
Our findings unveil a shifting trend in the NDVI from positive to negative values as EC increases, aligning with prior research [21]. This implies that the NDVI index, among various spectral reflectance indices, holds the potential to detect and monitor soil salinity using RS techniques.
Various studies have employed spectral indices independently or in conjunction with remote sensing techniques to estimate soil salinity. L8 has been frequently utilized in these studies [22,23,24,25]. Across diverse spatial extents, these investigations have reported R2 values ranging from 0.54 to 0.91. Models with an R2 value exceeding 0.66 can offer an acceptable and approximate prediction [24]. In our study, using Landsat 8, the stepwise M5 and Multiple Linear Regression (MLR) methods yielded R2 values of 0.73 and 0.58, respectively.
Noroozi et al. [26] estimated soil salinity levels in the Garmsar Plain, employing spectral unmixing and principal component analysis based on spectral information extracted from Landsat satellite imagery. Their study used the k-NN algorithm and SVM classifier, yielding R-squared values of 58% and 60%, respectively. Notably, our estimates of soil salinity exhibited higher accuracy. Discrepancies in accuracy may stem from differences in the study area and the methodologies employed for soil salinity estimation.
Table 8 demonstrates that the Stepwise M5 technique proves successful for soil salinity estimation, showcasing its applicability across various studies. Furthermore, the soil salinity indices employed in this research effectively predicted salinity levels in the region. A comprehensive review of multiple case studies [27, 28] reveals that no single spectral index or remote sensing technique can universally and accurately predict soil salinity. Scholars tend to select the most suitable index and remote sensing method based on the studied region’s unique physical and environmental characteristics. Nevertheless, our findings suggest that RS techniques can be valuable for estimating soil salinity levels in extensive areas, notwithstanding associated limitations. Understanding the accuracy and limitations of RS-based soil salinity estimates is crucial for effective soil salinity management, contributing to enhanced crop yields and sustained soil health.
The current study’s results align with the findings of Amini et al. [29], who utilized Landsat satellite imagery for EC estimation with an R2 of 0.62. In alignment with Ghorbani et al. [19], our study similarly achieved optimal results using the M5 model with five to seven variables. In addition, the model’s accuracy and overall performance decreased as the number of variables increased. Given the absence of a significant difference between models employing five and seven variables, a preference was given to the simpler model with fewer variables. These consistent outcomes indicate that RS techniques can accurately estimate soil EC. Nonetheless, it is essential to continue investigating methods to enhance the accuracy and resolution of these approaches for more precise and reliable soil salinity assessment.
Limitations
The limitation of utilizing spectral indices and ratios for soil salinity estimation through L8 images in the M5 model lies in its exclusive reliance on the association between spectral information and soil salinity. It does not consider additional factors influencing soil salinity, including soil type, irrigation practices, or groundwater levels. The model’s accuracy may fluctuate depending on the region and prevailing conditions under investigation. Therefore, validating the model’s results with ground truth data and considering other influential factors when deciding on water resources and agricultural management is imperative.
Conclusion
In this study, soil salinity estimation for specific areas in Golestan province was conducted using a combined model and L8 imagery. The assessments revealed a relatively strong correlation between the calculated soil salinity based on satellite data and the measured salinity values. The proposed algorithm, employing the DT regression model, specifically the M5 variant, demonstrated capability in indicating soil salinity levels to a significant extent. The findings suggest that acknowledging the diverse soil conditions is crucial for accurate soil salinity estimation through satellite imagery. Nonetheless, a reasonable approximation of soil salinity is achievable. The spectral ratio of band 6 to band 4 exhibits the highest correlation among spectral ratios and indices, reaching up to approximately 86% correlation with the natural logarithm of EC. Moreover, these bands’ spectral reflectance shows a substantial correlation with EC. The modeling results were improved by capturing the non-linear relationship between changes in spectral indices, ratios, and bands extracted from Landsat satellite imagery and changes in soil salinity. This enhancement was achieved using a DT regression model with data partitioned into six segments. Satellite imagery serves as valuable auxiliary information, complementing other salinity mapping methods for comprehensive salinity mapping purposes.
The M5 tree model emerges as a competitive alternative to other methods, offering simplicity, interpretability, and the ability to provide simple linear relationships within a specific range of input data.
Data availability
The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.
Abbreviations
- EC: Electrical Conductivity
- RS: Remote Sensing
- ML: Machine Learning
- kNN: K-nearest neighbor
- SVM: Support Vector Machine
- DT: Decision Tree
- RF: Random Forest
- MLR: Multiple Linear Regression
- NDVI: Normalized Difference Vegetation Index
- NDSI: Normalized Difference Salinity Index
- ETM+: Enhanced Thematic Mapper Plus
- B7: Mid-infrared band
- B4: Near-infrared band
- CRSI: Crop Rooting System Index
- CSRI: Canopy Salt Response Index
- GPS: Global Positioning System
- L8: Landsat 8
- OLI: Operational Land Imager
- TIRS: Thermal Infrared Sensor
- TOA: Top-Of-Atmosphere reflectance
- SR: Surface Reflectance
- MNDWI: Modified Normalized Difference Water Index
- NDII: Normalized Difference Infrared Index
- MIR: Mid-infrared
- R2: Coefficient of determination
- RMSE: Root Mean Squared Error
- SDR: Standard Deviation Reduction
- Ln EC: Natural logarithm of Electrical Conductivity
References
1. Sharma D, Singh A. Salinity research in India-achievements, challenges and future prospects. Water Energy Int. 2015;58(6):35–45.
2. Qureshi AS, et al. Managing salinity and waterlogging in the Indus Basin of Pakistan. Agric Water Manage. 2008;95(1):1–10.
3. Arrekhi A, et al. Relationship among plant measurements of Salsola turcomanica (Litv) and soil properties in semi-arid region of Golestan province, Iran. 2022.
4. Emami M, et al. Digital modeling of surface and subsurface soil salinity in Golestan Province, Iran. Geoderma Reg. 2024;37:e00800.
5. Jalali J, Ahmadi A, Abbaspour K. Runoff responses to human activities and climate change in an arid watershed of central Iran. Hydrol Sci J. 2021;66(16):2280–97.
6. Ge H, et al. Estimating soil salinity using multiple spectral indexes and machine learning algorithm in Songnen Plain, China. IEEE J Sel Top Appl Earth Observations Remote Sens. 2023.
7. Guida-Johnson B, Abraham EM, Cony MA. Salinización del suelo en tierras secas irrigadas: perspectivas de restauración en Cuyo, Argentina. Revista de la Facultad de Ciencias Agrarias, Universidad Nacional de Cuyo. 2017;49(1):205–15.
8. Xiao C, et al. Prediction of soil salinity parameters using machine learning models in an arid region of northwest China. Computers and Electronics in Agriculture. 2023;204:107512.
9. Lobell D, et al. Regional-scale assessment of soil salinity in the Red River Valley using multi‐year MODIS EVI and NDVI. J Environ Qual. 2010;39(1):35–41.
10. Guo S, et al. Characterizing the spatiotemporal evolution of soil salinization in Hetao Irrigation District (China) using a remote sensing approach. Int J Remote Sens. 2018;39(20):6805–25.
11. Singh A. Managing the salinization and drainage problems of irrigated areas through remote sensing and GIS techniques. Ecol Ind. 2018;89:584–9.
12. Nachshon U. Cropland soil salinization and associated hydrology: Trends, processes and examples. Water. 2018;10(8):1030.
13. Ghorbani K, Zakerinia M, Hezarjaribi A. The effect of climate change on water requirement of soybean in Gorgan. J Agricultural Meteorol. 2014;1(2):60–72.
14. Roozitalab MH, et al. Major Soils, Properties, and Classification. In: Roozitalab MH, Siadat H, Farshad A, editors. The Soils of Iran. Cham: Springer International Publishing; 2018. pp. 93–147.
15. Esri. Working with basemap layers. 2018. Available from: https://desktop.arcgis.com/en/arcmap/latest/map/working-with-layers/working-with-basemap-layers.htm
16. Bandak S, et al. A longitudinal analysis of soil salinity changes using remotely sensed imageries. Sci Rep. 2024;14(1):10383.
17. Quinlan JR. Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence. World Scientific; 1992.
18. Wang L, et al. A novel 3D tree-modeling method of incorporating small-scale spatial structure parameters in a heterogeneous forest environment. Forests. 2023;14(3):639.
19. Ghorbani K, Mohammadi J, Rezaei Ghaleh L. Annual growth of Fagus orientalis is limited by spring drought conditions in Iran’s Golestan Province. J Forestry Res. 2024;35(1):19.
20. Roshan A, et al. Evaluation of meteorological drought effects on underground water level fluctuations using data mining methods (case study: semi-deep wells of Golestan province). Environ Monit Assess. 2024;196(3):236.
21. Khosravi M, Afshar A, Molajou A. Decision tree-based conditional operation rules for optimal conjunctive use of surface and groundwater. Water Resour Manage. 2022;36(6):2013–25.
22. Peng J, et al. Proximal soil sensing of low salinity in southern Xinjiang, China. Remote Sens. 2022;14(18):4448.
23. Yu H, et al. Mapping soil salinity/sodicity by using Landsat OLI imagery and PLSR algorithm over semiarid West Jilin Province, China. Sensors. 2018;18(4):1048.
24. Farifteh J, et al. Quantitative analysis of salt-affected soil reflectance spectra: a comparison of two adaptive methods (PLSR and ANN). Remote Sens Environ. 2007;110(1):59–78.
25. Jackson ML. Soil chemical analysis-advanced course. 1969.
26. Noroozi AA, Homaee M, Farshad A. Estimating Topsoil Salinity from LANDST Data: a comparison between classic and spatial statistics. J Range Watershed Managment. 2014;66(4):609–20.
27. Melesse AM, et al. River water salinity prediction using hybrid machine learning models. Water. 2020;12(10):2951.
28. Bouaziz M, Matschullat J, Gloaguen R. Improved remote sensing detection of soil salinity from a semi-arid climate in Northeast Brazil. Comptes Rendus Géoscience. 2011;343(11–12):795–803.
29. Amini D, Tavakoli M, Faramarzi M. Investigation of the Relationship between Soil Salinity Trend, Land Use and climatic factors change (case study: Shadegan, Khuzestan). J Environ Sci Technol. 2020;9(22):43–58.
Acknowledgements
This study was supported by Gorgan University of Agricultural Sciences and Natural Resources.
Funding
No funding.
Author information
Contributions
KhGh: Supervision, Sampling and Investigation, Methodology, Analysis, Software. SB: Writing-review & editing. LRGh: Writing-review. SM: Writing-review & editing, Visualizing. AL: Writing-review & editing.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ghorbani, K., Bandak, S., Ghaleh, L.R. et al. Regional soil salinity analysis using stepwise M5 decision tree. BMC Res Notes 18, 90 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13104-025-07097-3
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13104-025-07097-3