International Journal of Academic Research in Business and Social Sciences

Optimizing Early Stage Diabetes Detection: A Robust Evaluation of Machine Learning Algorithms

Open access

Hanissah Mohamad @ Sulaiman, Norazlina Abd Razak, Siti Huzaimah Husin, Siti Aisah Mat Junos @ Yunus

Pages 4240-4255 Received: 16 Nov, 2024 Revised: 10 Dec, 2024 Published Online: 30 Dec, 2024

http://dx.doi.org/10.46886/IJARBSS/v14-i12/14192
The rising global incidence of diabetes makes early detection essential for reducing serious health complications. This study assesses the performance and robustness of three machine learning algorithms (Decision Tree, Support Vector Machine (SVM), and Naive Bayes) under three validation schemes: Train-Test Split, K-Fold Cross Validation, and Stratified K-Fold Cross Validation. Models were compared on Accuracy, Precision, Recall, F1-Score, and ROC-AUC, and the standard deviation of these scores was used to measure model stability. SVM consistently outperformed the other algorithms, achieving higher accuracy and reliability across all validation approaches, most notably under Stratified K-Fold Cross Validation. Naive Bayes achieved strong recall, while Decision Tree became more stable when evaluated with cross-validation. The results underline the importance of cross-validation, particularly Stratified K-Fold, for dependable model assessment on imbalanced datasets. Future research should investigate ensemble methods and data augmentation strategies to further improve model robustness.
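The contrast between plain and stratified K-Fold on imbalanced data can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the labels are synthetic (30% positive, sorted by class), the helper names `plain_k_fold`, `stratified_k_fold`, `majority_class`, and `evaluate` are hypothetical, and a majority-class baseline stands in for the Decision Tree, SVM, and Naive Bayes models. The point is only to show how the standard deviation of per-fold accuracy reflects the stability of an evaluation scheme.

```python
import random
import statistics
from collections import defaultdict


def plain_k_fold(n, k):
    """Contiguous, unshuffled folds (assumes n is divisible by k)."""
    fold_size = n // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test_idx = list(range(start, stop))
        train_idx = [j for j in range(n) if not start <= j < stop]
        yield train_idx, test_idx


def stratified_k_fold(labels, k, seed=0):
    """Folds whose test sets keep roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


def majority_class(train_labels):
    """Stand-in 'model': always predict the most frequent training label."""
    return max(set(train_labels), key=train_labels.count)


def evaluate(splits, labels):
    """Return the accuracy of the baseline on each test fold."""
    scores = []
    for train_idx, test_idx in splits:
        prediction = majority_class([labels[i] for i in train_idx])
        correct = sum(labels[i] == prediction for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Synthetic, assumed data: 30 positives followed by 70 negatives.
labels = [1] * 30 + [0] * 70

plain_scores = evaluate(plain_k_fold(len(labels), 5), labels)
stratified_scores = evaluate(stratified_k_fold(labels, 5), labels)

for name, scores in [("plain", plain_scores), ("stratified", stratified_scores)]:
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    print(f"{name:>10}: mean accuracy={mean:.2f}, std={spread:.2f}")
```

In this toy setup both schemes report the same mean accuracy, but the plain folds vary widely because some contain only one class, while every stratified fold holds 6 positives and 14 negatives and scores identically. Real classifiers and real data will show smaller differences, and shuffling before a plain split would also narrow the gap.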
Sulaiman, H. M. @, Razak, N. A., Husin, S. H., & Yunus, S. A. M. J. @. (2024). Optimizing Early Stage Diabetes Detection: A Robust Evaluation of Machine Learning Algorithms. International Journal of Academic Research in Business and Social Sciences, 14(12), 4240–4255.