Cardiovascular disease (CVD), especially coronary heart disease (CHD), accounts for a major portion of global mortality [1]. This has led scientists to collect vast amounts of data on heart disease and other conditions. With this data available, machine learning algorithms can better identify patients who are developing diseases ranging from diabetes to CVD [2]. Research into which supervised learning techniques are best for CVD prediction was still ongoing as of 2024 [3], but we also want to use this data to further develop unsupervised learning techniques, since they can help predict the disease without any labels.
We plan to explore these two datasets:
Both datasets contain 13 features and a target variable specifying whether or not the patient was diagnosed with heart disease. Of the 13 features, 8 are nominal and 5 are numeric, including age, blood pressure, and cholesterol levels.
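Because the features mix nominal and numeric types, they usually need to be brought into a single numeric representation before modeling. The sketch below shows one common approach (z-scoring numeric columns and one-hot encoding nominal ones) in plain Python. It is an illustration only, not the project's chosen pipeline: the column names and the three sample rows are hypothetical and are not taken from the datasets.

```python
from statistics import mean, pstdev

# Hypothetical column names; the real datasets may label them differently.
NUMERIC = ["age", "resting_bp", "cholesterol"]
NOMINAL = ["chest_pain_type"]


def fit_preprocessor(rows):
    """Learn per-column mean/std and the category list from training rows."""
    stats = {}
    for col in NUMERIC:
        values = [r[col] for r in rows]
        stats[col] = (mean(values), pstdev(values))
    cats = {col: sorted({r[col] for r in rows}) for col in NOMINAL}
    return stats, cats


def transform(row, stats, cats):
    """Return a flat vector: z-scores first, then one-hot indicators."""
    vec = []
    for col in NUMERIC:
        mu, sd = stats[col]
        # Guard against constant columns, where the std is zero.
        vec.append((row[col] - mu) / sd if sd > 0 else 0.0)
    for col in NOMINAL:
        vec.extend(1.0 if row[col] == value else 0.0 for value in cats[col])
    return vec


# Made-up rows, for illustration only.
rows = [
    {"age": 40, "resting_bp": 120, "cholesterol": 200, "chest_pain_type": "typical"},
    {"age": 60, "resting_bp": 140, "cholesterol": 260, "chest_pain_type": "atypical"},
    {"age": 50, "resting_bp": 130, "cholesterol": 230, "chest_pain_type": "typical"},
]
stats, cats = fit_preprocessor(rows)
X = [transform(r, stats, cats) for r in rows]
```

In practice the statistics and category lists would be fitted on the training split only and then reused on the test split, so that no information leaks from test data.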
We want to use machine learning models to predict whether someone has cardiovascular disease from various health metrics. Most prior studies [4] focused on supervised learning algorithms for making predictions; our project will cover both unsupervised and supervised learning for more comprehensive results.
We plan to use these data pre-processing methods:
We plan to use these unsupervised learning techniques:
Lastly, we want to use these supervised learning techniques:
To evaluate our supervised learning models, we plan to use the following metrics:
For unsupervised models, we plan to use the following metrics:
Project Goals: Few studies have examined unsupervised learning for this problem, so we want to focus on how accurately unsupervised models cluster patient records. Unsupervised algorithms do not use ground-truth labels during training, but because our datasets include a diagnosis for each patient, we can map each cluster to a label and then apply label-based metrics directly.
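One simple way to build such a cluster-to-label mapping is a majority vote: each cluster is assigned the diagnosis that is most common among its members, and the mapped labels are then scored like ordinary predictions. The sketch below assumes this majority-vote scheme, which is our illustration rather than a method fixed by the proposal. The function names and the toy cluster assignments are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the most common true label among its members."""
    members = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid][label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}


def mapped_accuracy(cluster_ids, true_labels):
    """Accuracy of the cluster assignments after majority-vote mapping."""
    mapping = map_clusters_to_labels(cluster_ids, true_labels)
    predicted = [mapping[cid] for cid in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)


# Toy data: 1 = heart disease, 0 = no heart disease (made up for illustration).
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = [1, 1, 0, 0, 0, 0, 1, 1]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
accuracy = mapped_accuracy(cluster_ids, true_labels)
print(mapping, accuracy)
```

Note that a majority vote can map two clusters to the same label. If a strict one-to-one assignment is needed, an optimal matching such as `scipy.optimize.linear_sum_assignment` applied to the cluster-by-label count matrix is a possible alternative.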
Expected Results: Based on the existing literature, we expect supervised models to reach accuracy scores of 95% or higher. Published papers disagree on which algorithm performs best, so our goal is to perform a similar study and determine the best algorithm using quantitative metrics. Furthermore, we expect clustering methods to accurately separate patients with the disease from those without it.
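To make "best algorithm via quantitative metrics" concrete, the sketch below scores several models on the same held-out labels and ranks them. It assumes accuracy, precision, recall, and F1 as the metrics and F1 as the ranking criterion; these are common choices for binary diagnosis tasks, not necessarily the project's final list. The model names and predictions are hypothetical placeholders, not results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],  # hypothetical model
    "model_b": [1, 1, 1, 1, 0, 1, 1, 0],  # hypothetical model
}

results = {name: binary_metrics(y_true, preds)
           for name, preds in predictions.items()}
best = max(results, key=lambda name: results[name]["f1"])
```

Ranking on a single metric hides trade-offs: in a screening setting, a model with higher recall may be preferable even if its accuracy is lower, so it is worth reporting all metrics side by side.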
See here for our Gantt Chart.
| Name | Contribution |
|---|---|
| Suzan | Website, Methods, Results |
| Natasha | Results, Gantt Chart |
| Kalp | Results & Discussion |
| Chih-Chun | Problem Definition, Methods |
| Eric | Intro/Background |
1. S. Hossain et al., “Machine Learning Approach for predicting cardiovascular disease in Bangladesh: Evidence from a cross-sectional study in 2023 - BMC Cardiovascular Disorders,” BioMed Central, https://bmccardiovascdisord.biomedcentral.com/articles/10.1186/s12872-024-03883-2.
2. A. Dinh, S. Miertschin, A. Young, and S. D. Mohanty, “A data-driven approach to predicting diabetes and cardiovascular disease with Machine Learning - BMC Medical Informatics and Decision making,” SpringerLink, https://link.springer.com/article/10.1186/s12911-019-0918-5/metrics.
3. Ogunpola, A.; Saeed, F.; Basurra, S.; Albarrak, A.M.; Qasem, S.N. Machine Learning-Based Predictive Models for Detection of Cardiovascular Diseases. Diagnostics 2024, 14, 144. https://doi.org/10.3390/diagnostics14020144
4. A. Javaid et al., “Medicine 2032: The Future of Cardiovascular Disease Prevention with Machine Learning and Digital Health Technology,” American Journal of Preventive Cardiology, vol. 12, p. 100379, Dec. 2022. doi:10.1016/j.ajpc.2022.100379.
5. Palechor, Fabio Mendoza et al. “Cardiovascular Disease Analysis Using Supervised and Unsupervised Data Mining Techniques.” J. Softw. 12 (2017): 81-90.