
Document Type

Article

Abstract

Background: Low birthweight (LBW) is a leading cause of neonatal mortality in the United States (US) and a major contributor to short- and long-term adverse health effects in newborns. To prevent adverse birth outcomes, it is critical to identify early in pregnancy which mothers are at high risk of delivering LBW babies. Previous studies have proposed LBW prediction models built with various machine learning (ML) algorithms, mostly on small datasets. Their performance is limited by data access barriers and data quality issues, with class imbalance as a major technical challenge. To date, few studies have benchmarked the performance of ML models in maternal health, so establishing such benchmarks is needed to advance the use of ML and improve birth outcomes.

Objective: This study aims to establish benchmark ML models for predicting LBW and to systematically apply different data rebalancing methods to a large-scale, extremely imbalanced Medicare and Medicaid claims dataset that links mother and baby data at the state level in the US. We also investigate the risk factors associated with LBW.

Methods: Our dataset consisted of 266,687 birth records spanning six years from one US state. Of these records, 23,019 (8.63%) were labeled as LBW. To set up benchmark models, we applied six classic ML models (Logistic Regression, Naïve Bayes, Random Forest, Extreme Gradient Boosting [XGBoost], Adaptive Boosting [AdaBoost], and Multi-layer Perceptron) in combination with four data rebalancing methods: random under-sampling, random over-sampling, synthetic minority oversampling technique (SMOTE), and weight rebalancing. For ethical reasons, although we also report standard ML evaluation metrics such as accuracy, precision, and F1-score, we primarily used recall to evaluate model performance. Recall is the proportion of actual LBW cases that the model correctly predicts as LBW, and it matters most here because a false negative (an actual LBW case predicted as non-LBW) could be fatal to the patient. We also analyzed feature importance in our best-performing models to examine how much each feature contributes to the prediction.

Results: Random Forest achieved the highest recall score (0.62) with random under-sampling. XGBoost achieved the same recall score with weight rebalancing. The data rebalancing methods significantly improved prediction performance for the LBW group, for example, increasing recall from 0.34 to 0.62. In the feature importance analysis, maternal race, the sum of inpatient hospitalizations in the prior 12 months, the predelivery disease profile, and the social vulnerability index for housing type emerged as important risk factors associated with LBW.

Conclusions: Our findings establish useful ML benchmarks for improving birth outcomes in the maternal health domain. They are informative for identifying minority classes in an extremely imbalanced dataset and have practical implications for personalized early LBW prevention programs and for maternal and infant health policy.
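To make the terms in the abstract concrete, the following is a minimal Python sketch of recall, random under-sampling, and one common form of class weighting. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the inverse-frequency weighting formula is a widely used heuristic that the study's "weight rebalancing" may or may not match.

```python
import random
from collections import Counter


def recall(y_true, y_pred, positive=1):
    """Share of actual positive cases that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class matches
    the size of the smallest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).
    Assumption: a common heuristic, not necessarily the study's scheme."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


if __name__ == "__main__":
    # Class counts taken from the abstract: 23,019 LBW out of 266,687 records.
    labels = [1] * 23019 + [0] * (266687 - 23019)
    rows = list(range(len(labels)))  # stand-in for real feature rows

    _, balanced_labels = random_undersample(rows, labels)
    print("class sizes after under-sampling:", Counter(balanced_labels))
    print("class weights:", balanced_class_weights(labels))
```

With these class counts, under-sampling discards most non-LBW records to reach a 1:1 ratio, whereas class weighting keeps every record and instead penalizes errors on the LBW class more heavily during training.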

Digital Object Identifier (DOI)

https://doi.org/10.2196/preprints.44081

APA Citation

Ren, Y., Wu, D., Tong, Y., López-DeFede, A., & Gareau, S. (2022). Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies (Preprint). https://doi.org/10.2196/preprints.44081

Included in

Public Health Commons
