PrOsteoporosis: predicting osteoporosis risk using NHANES data and machine learning approach

Abstract

Objectives

Osteoporosis, prevalent among the elderly population, is primarily diagnosed through bone mineral density (BMD) testing, which has limitations in early detection. This study aims to develop and validate a machine learning approach for osteoporosis identification by integrating demographic, laboratory, and questionnaire data, offering a more practical and effective screening alternative.

Methods

In this study, data from the National Health and Nutrition Examination Survey were analyzed to explore factors linked to osteoporosis. After cleaning, 8766 participants with 223 variables were studied. Minimum Redundancy Maximum Relevance and SelectKBest were employed to select the important features. Four machine learning algorithms (random forest, neural network, LightGBM, and XGBoost) were applied to classify osteoporosis, and their performance was compared. Data balancing was done using borderline SMOTE, and metrics such as the F1 score and AUC were evaluated for each algorithm.

Results

The LightGBM model outperformed others with an F1 score of 0.914, an MCC of 0.831, and an AUC of 0.970 on the training set. On the test set, it achieved an F1 score of 0.912, an MCC of 0.826, and an AUC of 0.972. Top predictors for osteoporosis were height, age, and sex.

Conclusions

This study demonstrates the potential of machine learning models in assessing an individual’s risk of developing osteoporosis, a condition that significantly impacts quality of life and imposes substantial healthcare costs. The superior performance of the LightGBM model suggests a promising tool for early detection and personalized prevention strategies. Importantly, identifying height, age, and sex as top predictors offers critical insights into the demographic and physiological factors that clinicians should consider when evaluating patients’ risk profiles.

Peer Review reports

Introduction

Osteoporosis is a systemic skeletal disease characterized by a decrease in bone mass and deterioration of bone microarchitecture, resulting in weakened bones that are more susceptible to fractures [1]. The number of osteoporosis patients aged 50 years or older has increased from 10.2 million in 2010 to 12.3 million in 2020, and it is expected to reach 13.6 million by 2030 [2], indicating that the burden of this disease on society is increasing. The International Osteoporosis Foundation reported that approximately one-third of elderly women and one-fifth of elderly men experience osteoporotic fractures [3, 4]. Osteoporotic fractures, especially hip fractures, are associated with limited mobility, chronic pain, and a decline in quality of life. In addition, approximately 20–30% of osteoporosis patients die within one year after experiencing an osteoporotic fracture [5]. Osteoporosis often does not have symptoms until it causes a fracture, so early screening and detection are crucial for managing the disease.

Measuring bone density is the gold standard for identifying osteoporosis, but its use is limited because of its low sensitivity and high cost [6]. Additionally, patients typically only undergo bone density testing after experiencing symptoms of osteoporosis, which further limits its use [6]. To overcome the challenges of early screening and detection for osteoporosis, some teams have developed clinical assessment tools, such as the FRAX (Fracture Risk Assessment Tool), SCORE (Simple Calculated Osteoporosis Risk Estimation), ORAI (Osteoporosis Risk Assessment Instrument), OSIRIS (Osteoporosis Index of Risk), and OST (Osteoporosis Self-assessment Tool), for identifying patients who may have an increased risk of osteoporosis [5]. Although these tools have some practical value and convenience, their accuracy is limited [7, 8].

With the exponential growth of computing power in the era of big data, machine learning (ML) methods have been rapidly applied in the medical field, including the diagnosis of orthopedic diseases. Compared with existing clinical tools, AI-based approaches have the advantage of analyzing the interrelationships among multiple features, resulting in improved accuracy [9,10,11]. Several studies have explored machine learning for osteoporosis diagnosis and detection, including aspects of fracture risk assessment, prediction, and treatment response [12]. These methods are expected to play an important role in improving the accuracy of osteoporosis screening and diagnosis, but further research and validation are needed. Many studies have used machine learning methods for risk assessment on the basis of medical databases. However, these early attempts revealed several limitations, including overfitting, underrepresentation of the population, a lack of confidence intervals around point estimates, and arbitrary variable selection [13, 14]. In addition, the lack of accuracy has hindered the widespread use of machine learning for osteoporosis risk prediction. These limitations may be due to factors such as data quality, sample size, feature selection, and model design.

Therefore, we developed a machine learning model to screen for osteoporosis risk and used interpretable artificial intelligence techniques to clinically interpret the model results. To demonstrate the performance of our approach, we compared the results of our model with the results of other machine learning models. Our study provides an innovative and feasible method for osteoporosis risk screening, and provides clinicians with more accurate and reliable risk predictions and individualized treatment recommendations. Future studies can further improve and extend this method for better application in clinical practice.

Method

Data sources and preprocessing

The National Health and Nutrition Examination Survey (NHANES, https://www.cdc.gov/nchs/nhanes/index.htm) is a cross-sectional survey conducted by the National Center for Health Statistics (NCHS) to assess the overall health and nutritional status of the U.S. population. This study used the 2013–2014 and 2017–2020 NHANES cycles, including laboratory, demographic, and questionnaire data. From our screening, we defined two categories of features: blood biomarkers (the blood set, from the Laboratory Data files) and clinical features (the clinical set, from the Demographics and Questionnaire Data files). SEQN, a unique identifier for each NHANES participant, was used to merge all the aforementioned datasets into a union set. This allowed us to evaluate whether these routinely collected risk factors could be used to classify osteoporosis.

Assessment of osteoporosis

The medical conditions file was used to define osteoporosis. Participants were asked: “Has a doctor ever told you that you had osteoporosis, sometimes called thin or brittle bones?” Participants who answered “Yes” were considered to have osteoporosis in this study, and participants who answered “No” were considered not to have osteoporosis; only respondents with one of these two responses were included. We excluded variables with nonnumeric values and treated “refused to answer” and “don’t know” responses as missing values. We also excluded variables with missing values in more than 20% of respondents and respondents with missing values in more than 20% of variables. Because missing values strongly affect data analysis, statistical modelling, and machine learning, the remaining missing values were imputed in two ways: mode imputation for categorical variables and mean imputation for continuous variables. The resulting NHANES dataset included 8766 respondents (933 in the osteoporosis group and 7833 in the normal group) with 223 variables. To eliminate the effect of intervariable scaling, we standardized the data via max–min scaling. The data processing workflow is shown in Fig. 1.

Fig. 1

Flow diagram of overall data preprocessing. NHANES: National Health and Nutrition Examination Survey (United States)
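The preprocessing steps above can be summarized in a short, minimal Python sketch; it assumes a merged NHANES DataFrame whose “refused”/“don’t know” codes have already been recoded to missing per the relevant codebooks, and user-supplied lists of categorical and continuous column names (both hypothetical here).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame, categorical_cols, continuous_cols) -> pd.DataFrame:
    # Drop variables missing in more than 20% of respondents,
    # then respondents missing more than 20% of the remaining variables.
    df = df.loc[:, df.isna().mean() <= 0.20]
    df = df.loc[df.isna().mean(axis=1) <= 0.20].copy()
    # Mode imputation for categorical variables, mean imputation for continuous ones.
    for col in categorical_cols:
        if col in df.columns:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    for col in continuous_cols:
        if col in df.columns:
            df[col] = df[col].fillna(df[col].mean())
    # Max-min scaling removes the effect of intervariable scale.
    df[df.columns] = MinMaxScaler().fit_transform(df)
    return df
```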

Borderline SMOTE was applied to balance the dataset

A more detailed examination of the dataset made clear that it is imbalanced; that is, the two classes contain unequal numbers of samples. For successful classification, the two classes should be represented in comparable numbers: when they are not, a model can achieve a high accuracy while the area under the curve (AUC) remains low, and that high accuracy is not real accuracy. To balance the dataset, the proposed method uses borderline SMOTE, an improved oversampling algorithm based on the synthetic minority oversampling technique (SMOTE) [15]. The algorithm synthesizes new samples only from minority-class seed samples lying on the class border, thereby improving the class distribution while avoiding the duplication produced by oversampling every minority sample [15]. After the dataset was balanced, each class contained 7833 samples (7833 osteoporosis and 7833 normal; Fig. 1).
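As a minimal sketch of this balancing step, the borderline SMOTE implementation in the imbalanced-learn package can be applied to the preprocessed feature matrix X and label vector y (the random_state value is illustrative):

```python
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

sampler = BorderlineSMOTE(kind="borderline-1", random_state=42)
X_bal, y_bal = sampler.fit_resample(X, y)
print(Counter(y_bal))  # both classes should now contain 7833 samples
```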

Attribute selection

Attribute selection can reduce the risk of overfitting. Minimum redundancy maximum relevance (mRMR) and SelectKBest were employed to select the important features. mRMR is widely used to select a subset of features from among all the features: the maximum-relevance criterion searches for features whose correlation with the target is as high as possible, whereas the minimum-redundancy criterion keeps the redundancy among the selected features as low as possible. SelectKBest is a commonly used feature selection method that selects the K features from the dataset with the highest correlation with the target variable. It is available in scikit-learn, can be used with a variety of statistical tests, and retains the K features with the highest scores. mRMR was implemented via the pymrmr Python package.
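A minimal sketch of both selection steps is shown below, assuming X_bal is a pandas DataFrame of features and y_bal the corresponding label Series; K = 18 matches the final feature set, and the "MIQ" criterion passed to pymrmr is an illustrative choice.

```python
import pandas as pd
import pymrmr
from sklearn.feature_selection import SelectKBest, f_classif

K = 18  # number of features to retain

# SelectKBest keeps the K features with the highest univariate test statistic.
selector = SelectKBest(score_func=f_classif, k=K).fit(X_bal, y_bal)
kbest_features = X_bal.columns[selector.get_support()]

# pymrmr expects a (discretised) DataFrame whose first column is the class label.
mrmr_input = pd.concat([y_bal.rename("label"), X_bal], axis=1)
mrmr_features = pymrmr.mRMR(mrmr_input, "MIQ", K)
```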

Machine learning techniques

AutoGluon version 0.3.1 [16] is an open-source automated machine learning (AutoML) Python library. The models compared in this study (LightGBM, random forest, neural network, and XGBoost) were constructed via the AutoGluon package.
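A minimal sketch of this construction with the AutoGluon 0.3.x tabular API is shown below; the label column name "osteoporosis" and the restriction to the four model families are assumptions for illustration (GBM, RF, NN, and XGB are AutoGluon's keys for LightGBM, random forest, neural network, and XGBoost).

```python
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label="osteoporosis", eval_metric="roc_auc").fit(
    train_data=train_df,  # balanced training DataFrame including the label column
    hyperparameters={"GBM": {}, "RF": {}, "NN": {}, "XGB": {}},
)
print(predictor.leaderboard(test_df, silent=True))
```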

LightGBM is a gradient boosting framework based on the decision tree algorithm [17] and can be used for machine learning tasks such as classification. Rather than growing trees level by level, it splits leaf nodes with an optimal leaf-wise strategy [17]; for the same number of leaves, the leaf-wise algorithm reduces the loss more than the level-wise algorithm does, resulting in higher accuracy.
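For readers who prefer to train LightGBM directly rather than through AutoGluon, a standalone sketch follows; the hyperparameter values are illustrative defaults, not the tuned settings used in this study.

```python
import lightgbm as lgb

clf = lgb.LGBMClassifier(
    boosting_type="gbdt",
    num_leaves=31,       # leaf-wise growth, capped by the number of leaves
    learning_rate=0.05,
    n_estimators=500,
)
clf.fit(X_train, y_train)
test_proba = clf.predict_proba(X_test)[:, 1]
```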

A neural network is a computational model inspired by the structure and functioning of the human brain [18]. The input layer of a neural network receives the initial input data, which could be features of an image, attributes of a text, or any other type of data. Each neuron in the input layer represents a feature of the input data. The input data are then passed through the hidden layers, where computation and learning take place. The hidden layers perform complex calculations on the input data and gradually extract higher-level features.

XGBoost is also a gradient boosting tree model that generates models sequentially and sums all the models to produce the output [19]. It uses a second-order Taylor expansion to approximate the loss function and exploits the second-derivative information during optimization. It greedily decides whether to split a node on the basis of whether the split decreases the loss function.

Random forest (RF) is a widely used machine learning technique that provides the probability for the predicted class [20]. RF is an ensemble classifier consisting of multiple decision trees.

Explanation of feature importance

To assess which features are the most critical, we used the SHapley Additive exPlanations (SHAP) algorithm, an effective and widely used analytical tool that quantifies how much each feature changes a model's predicted outcome. The technique is particularly well suited to elucidating “black box” models that are otherwise difficult to understand.
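A minimal sketch of this analysis, assuming `model` is the fitted LightGBM classifier and X_test is the feature matrix being explained:

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Global view: features ranked by mean |SHAP value|, coloured by feature value.
shap.summary_plot(shap_values, X_test)
```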

Performance evaluation

The accuracies of the classifiers were evaluated via tenfold cross-validation on the training set. The classification performance of the model was assessed in terms of the following measures:

recall = TP/(TP + FN).

where:

TP: true positives, correctly predicted osteoporosis; FN: false negatives, osteoporosis predicted as normal. In addition, receiver operating characteristic (ROC) analysis provides a visualization of prediction performance, indicating the tradeoff between sensitivity and specificity at different thresholds. The area under the ROC curve (AUC) was used as a measure of goodness of fit for the predictions.
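The reported metrics can be computed with scikit-learn as in the following minimal sketch, assuming y_true, y_pred, and y_proba (the predicted probability of the osteoporosis class) are available:

```python
from sklearn.metrics import f1_score, matthews_corrcoef, recall_score, roc_auc_score

recall = recall_score(y_true, y_pred)    # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
mcc = matthews_corrcoef(y_true, y_pred)  # Matthews correlation coefficient
auc = roc_auc_score(y_true, y_proba)     # area under the ROC curve
```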

Results

Framework of PrOsteoporosis

With the laboratory data, demographic data, and questionnaire data, we present a computational framework for identifying osteoporosis patients, as shown in Fig. 1. First, we preprocessed the data as introduced above. The data were then divided into a training set and a test set at a ratio of 8:2. Multiple widely used machine learning methods, including the LightGBM, random forest (RF), neural network, and XGBoost methods, were subsequently exploited to construct the predictor. Next, we applied feature selection methods to obtain the optimal subset. Finally, feature importance analysis was used to explain the model. The framework for identifying the risk factors of osteoporosis is shown in Fig. 2.

Fig. 2

The flowchart of PrOsteoporosis
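A minimal sketch of the 8:2 split described above; the stratification and random_state are assumptions added for reproducibility, not settings reported in the study.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```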

Evaluation of the borderline SMOTE algorithm

The dataset used is unbalanced, as the number of osteoporosis samples is far smaller than the number of normal samples. To address this, the borderline SMOTE algorithm was used to balance the two classes. With the original 933 osteoporosis samples and 7833 normal samples, the model achieved an AUC of 83% when 18 features were used. After the borderline SMOTE algorithm was applied, the method achieved an AUC of 97.0% on the training set. Overall, our model outperformed the model trained without data balancing (Fig. 3). The Brier score integrates the model’s discrimination and calibration capabilities and is used to assess overall performance; the nearer the Brier score is to 0, the more closely the predicted probabilities align with the actual outcomes. Our model’s Brier score of 0.0828 indicates that the model is effective.

Fig. 3

Performance of the borderline SMOTE algorithm on the test dataset. (A) ROC curve (B) Precision-recall curve (C) Calibration curves
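A minimal sketch of the Brier score computation, assuming y_test and y_proba (the predicted probability of osteoporosis) come from the fitted model on the test set:

```python
from sklearn.metrics import brier_score_loss

brier = brier_score_loss(y_test, y_proba)  # closer to 0 means better-calibrated predictions
```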

Feature selection for constructing the final classifier

An excessive number of features can introduce noise and irrelevant information, making it difficult for the model to identify the most relevant predictors of the outcome; it can also lead to overfitting. Therefore, feature selection is essential for identifying the most informative and relevant features for prediction tasks. Several pairs of features were highly correlated (Additional file 1: Fig. S1), for example, LBXWBCSI (white blood cell count) and LBDLYMNO (lymphocyte number), or LBDSUASI (uric acid, µmol/L) and LBXSUA (uric acid, mg/dL), which could introduce redundant information affecting the decision making and stability of the model.

Thus, minimum redundancy maximum relevance (mRMR) and SelectKBest were applied to search for an optimal feature space. Each feature was evaluated and ranked via the mRMR algorithm, and the importance of each feature is summarized in Table 1. Supported by the literature, we discuss the top 10 features, all of which have been reported to be associated with osteoporosis. Greater height has been associated with increased fracture risk [21]. The second feature is taking prescription medicine in the past month. Previous studies revealed an increasing trend in the use of most drugs that may trigger osteoporosis between 1999 and 2016, and suggested that taking such medicines may play a clinically relevant role in causing osteoporosis [22]. The third feature is thyroid disease. Thyroid hormones play a crucial role in the body’s metabolism and cell differentiation, and the complications that may accompany abnormal thyroid function can interfere with bone metabolism, potentially causing osteoporosis and increasing the risk of fractures [23]. Age, sex, and blood pressure are established risk factors for osteoporosis [24]. It has been reported that cancer therapies can decrease bone density, which increases the likelihood of developing osteoporosis [25]. Osteoporosis may reduce the support provided by the pelvic floor muscles and tissues, which in turn affects urine control [26]. It also weakens bones, increasing their susceptibility to fracture [27]. Regarding hepatitis A, a previous study revealed an association between hepatitis A and bone mineral density, suggesting that hepatitis A may be a risk factor for osteoporosis [28]. After selection, the 18 retained features showed low pairwise correlation (Fig. 4). Overall, these findings indicate that the selected features complement each other to effectively depict osteoporosis, enhancing the overall prediction performance.

Table 1 Ranking of the 18 features using the mRMR algorithm
Fig. 4

Heat map of 18 selected features

Interpretation of LightGBM model via the SHAP method

To investigate whether a feature value has a positive or negative influence on the predicted outcome, we applied the SHapley Additive exPlanations (SHAP) method. Figure 5 illustrates the importance of these features and ranks them according to the magnitude of their influence on the predicted outcomes. Among all the predictor variables, height (WHD010) had the greatest influence. We observed that increased age was positively correlated with the risk of osteoporosis, making the model more inclined to predict osteoporosis. In contrast, height was negatively correlated with risk, so lower height values made the model more likely to predict osteoporosis.

Fig. 5

An overview of feature interpretations for the 18 selected features over the training population, i.e., global explanations. (A) Waterfall plot for a prediction made by LightGBM; the features are sorted by the sum of SHAP values. (B) Dots show the distribution of the impacts that a feature has on the model output and indicate the density when piling up, with colors representing the feature value (red for high, blue for low)

Case study

An 80-year-old woman, 140 cm tall, was found on routine physical examination at Yuebei People’s Hospital to have reduced bone density. According to the SHAP values derived from the LightGBM model, age exhibits a positive correlation with the predicted risk of osteoporosis, whereas height shows a negative correlation. The patient’s advanced age (80 years) significantly elevates her risk score for osteoporosis, and her relatively low stature (140 cm) further increases this risk. Given that the SHAP values indicate both age and height as strong predictors, the physician decided to increase the frequency of bone density assessments for this patient (Supplementary Fig. S2).

Comparison of different machine learning algorithms

In this study, LightGBM was used to construct the model. To evaluate whether this classification algorithm is the most suitable for predicting osteoporosis, we compared its performance with that of several other machine learning algorithms, including the random forest (RF), neural network, and XGBoost algorithms. To ensure comparability, the parameters of each machine learning algorithm were tuned to their optimal values. A comparison of the results is presented in Fig. 6. LightGBM demonstrated the best performance on both the training set (F1 = 0.914, MCC = 0.831, AUC = 0.970, recall = 0.920) and the test set (F1 = 0.912, MCC = 0.826, AUC = 0.972, recall = 0.922). The osteoporosis prediction model constructed with LightGBM therefore demonstrated better generalizability and robustness.

Fig. 6

Evaluation of four machine learning algorithms on the training and test sets. Blue represents the test set, red represents the training set. (A) AUC (B) MCC (C) Recall (D) F1 score
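A minimal sketch of how such a comparison can be produced, assuming `models` maps an algorithm name to a fitted classifier exposing predict and predict_proba:

```python
from sklearn.metrics import f1_score, matthews_corrcoef, recall_score, roc_auc_score

splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
for name, model in models.items():
    for split, (X_, y_) in splits.items():
        pred = model.predict(X_)
        proba = model.predict_proba(X_)[:, 1]
        print(f"{name:12s} {split:5s} "
              f"F1={f1_score(y_, pred):.3f} "
              f"MCC={matthews_corrcoef(y_, pred):.3f} "
              f"AUC={roc_auc_score(y_, proba):.3f} "
              f"recall={recall_score(y_, pred):.3f}")
```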

Comparison of other prediction models

The results of our model were compared with those of existing deep learning-based osteoporosis predictors (Table 2), all of which were developed to predict osteoporosis. Suh’s team used osteoporosis data from the NHANES database with clinically relevant features and applied deep learning, achieving an AUC of 0.851 [5]. Ha et al. constructed a dataset of 3012 samples from CT scans, demographic data, and image data and used deep learning to predict osteoporosis, with an AUC of 0.94 [29]. Although that model achieved good predictive results, it requires contrast-enhanced abdominal CT data, which come at the cost of additional radiation exposure. In addition, there is a time lag in the collected data, and the same patient may be matched with two or more CT images. The data also come from a single healthcare facility, making it difficult to confirm the effect with data from other facilities or CT devices. The method therefore has several limitations. In contrast, our model uses features that are easy to obtain, does not require CT data, and achieves higher accuracy and sensitivity, yielding results superior to those of the other prediction models.

Table 2 Models for predicting osteoporosis using various machine learning methods

Discussion and conclusion

In summary, we developed a machine learning model to diagnose osteoporosis via NHANES data that outperforms traditional clinical assessment tools and existing machine learning models. In addition, we discuss the important features identified via interpretable artificial intelligence techniques. These findings suggest that machine learning can be comprehensively applied to healthcare big data for risk analysis of certain diseases. Our model can also personalize the assessment of osteoporosis risk and explain the contribution of each feature to the model results, and the features needed by the model are easy to obtain. At a later stage, we will collect clinical data as an independent test set for external validation of our machine learning model.

Our study has several limitations that should be acknowledged and considered in future research. (1) The diagnosis of osteoporosis was obtained through self-reported questionnaires, which may introduce recall bias; the precise definition and standardization of osteoporosis diagnoses remain areas for further exploration. (2) Osteoporosis is a multifaceted disorder influenced by genetic, behavioral, dietary, and environmental factors. However, the NHANES database does not explicitly document all relevant factors, and some important variables were excluded during the data cleaning process. This limitation underscores the need for comprehensive data collection and integration of multiple data types to better understand the full spectrum of osteoporosis risk factors. In future research, we aim to explore methods for combining various types of data, such as genetic information, imaging studies, and detailed dietary histories, to construct more robust predictive models. Integrating these diverse data sources will enhance our ability to identify and mitigate potential confounders, ultimately improving the accuracy and applicability of our findings.

Data availability

All data used in this research can be downloaded from the following website: http://www.cdc.gov/nchs/nhanes/

References

  1. Nelson HD, et al. Screening for osteoporosis: an update for the US Preventive Services Task Force. Jurna l. 2010;153(2):99–111.

  2. Sarafrazi N, et al. Osteoporosis or low bone mass in older adults: United States, 2017–2018. Jurna l. 2021.

  3. Curtis EM, et al. Epidemiology of fractures in the United Kingdom 1988–2012: variation with age, sex, geography, ethnicity and socioeconomic status. Jurna l. 2016;87:19–26.

  4. Testa EJ, et al. Osteoporosis and fragility fractures. Jurna l. 2022;105(8):15–21.

  5. Suh B, et al. Interpretable deep-learning approaches for osteoporosis risk screening and individualized feature analysis using large population-based data: model development and performance evaluation. Jurna l. 2023;25:e40179.

  6. Mithal A, et al. The Asia-Pacific regional audit-epidemiology, costs, and burden of osteoporosis in India 2013: a report of the International Osteoporosis Foundation. Jurna l. 2014;18(4):449–54.

  7. Curry SJ, et al. Screening for osteoporosis to prevent fractures: US Preventive Services Task Force recommendation statement. Jurna l. 2018;319(24):2521–31.

  8. Tu J-B, et al. Using machine learning techniques to predict the risk of osteoporosis based on nationwide chronic disease data. Jurna l. 2024;14(1):1–11.

  9. Khan NA, et al. Supervised machine learning for jamming transition in traffic flow with fluctuations in acceleration and braking. Jurna l. 2023;109:108740.

  10. Hussain S, et al. Deep learning-driven analysis of a six-bar mechanism for personalized gait rehabilitation. Jurna l. 2025;25:011001.

  11. Khan NA, et al. Predictive insights into nonlinear nanofluid flow in rotating systems: a machine learning approach. Jurna l. 2024:1–18.

  12. Shim J-G, et al. Application of machine learning approaches for osteoporosis risk prediction in postmenopausal women. Jurna l. 2020;15:1–9.

  13. Smets J, et al. Machine learning solutions for osteoporosis—a review. Jurna l. 2020;36(5):833–51.

  14. Kong SH, et al. A novel fracture prediction model using machine learning in a community-based cohort. Jurna l. 2020;4(3):e10337.

  15. Han H, et al. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing. Springer; 2005. p. 878–87.

  16. Erickson N, et al. AutoGluon-Tabular: robust and accurate AutoML for structured data. Jurna l. 2020.

  17. Ke G, et al. LightGBM: a highly efficient gradient boosting decision tree. Jurna l. 2017;30.

  18. Gupta N. Artificial neural network. Jurna l. 2013;3(1):24–8.

  19. Chen T, et al. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.

  20. Rigatti SJ. Random forest. Jurna l. 2017;47(1):31–9.

  21. Nishikura T, et al. Body mass index, height, and osteoporotic fracture risk in community-dwelling Japanese people aged 40–74 years. Jurna l. 2024;42(1):47–59.

  22. Skjødt M, et al. Long term time trends in use of medications associated with risk of developing osteoporosis: nationwide data for Denmark from 1999 to 2016. Jurna l. 2019;120:94–100.

  23. Apostu D, et al. The influence of thyroid pathology on osteoporosis and fracture risk: a review. Jurna l. 2020;10(3):149.

  24. Xiao P-L, et al. Global, regional prevalence, and risk factors of osteoporosis according to the World Health Organization diagnostic criteria: a systematic review and meta-analysis. Jurna l. 2022;33(10):2137–53.

  25. Shapiro CL. Osteoporosis: a long-term and late-effect of breast cancer treatments. Jurna l. 2020;12(11):3094.

  26. Wei MC, et al. Osteoporosis and stress urinary incontinence in women: a National Health Insurance Database study. Jurna l. 2020;17(12):4449.

  27. Warriner AH, et al. Which fractures are most attributable to osteoporosis? Jurna l. 2011;64(1):46–53.

  28. Yu Z, et al. Association between hepatitis A seropositivity and bone mineral density in adolescents and adults: a cross-sectional study using NHANES data. Jurna l. 2024;142:e2023266.

  29. Ha TJ, et al. Multi-classification of grading stages for osteoporosis using abdominal computed tomography with clinical variables: application of deep learning with a convolutional neural network. Jurna l. 2023.


Acknowledgements

The authors thank all members of our laboratory for their valuable discussions. We acknowledge the data from the National Health and Nutrition Examination Survey (NHANES).

Funding

The study was supported by the Shaoguan City Science and Technology Project (220601224532885) and the Guangdong Provincial Education Science Planning Project (2021GXJK366).

Author information

Authors and Affiliations

Authors

Contributions

ZS carried out design of the study, data management, statistical analysis, and writing of the manuscript. ZS and DZ helped to write and finalize the manuscript. HW conceived the study, participated in its design, and helped to write the manuscript. XZ helped in the study’s design and in writing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaofei Zheng.

Ethics declarations

Ethics approval and consent to participate

Use of data from NHANES 2013–2020 was approved by the National Center for Health Statistics (NCHS) Research Ethics Review Board (ERB): NHANES 2013–2014 (Continuation of Protocol #2011-17) and NHANES 2017–2020 (Continuation of Protocol #2018-01).

Consent for publication

Not applicable

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Si, Z., Zhang, D., Wang, H. et al. PrOsteoporosis: predicting osteoporosis risk using NHANES data and machine learning approach. BMC Res Notes 18, 108 (2025). https://doi.org/10.1186/s13104-025-07089-3


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13104-025-07089-3

Keywords