ml-cvd-prediction

Proposal

Video Presentation

Introduction

Cardiovascular disease (CVD), especially coronary heart disease (CHD), accounts for a major portion of global mortality ¹. This has led scientists to collect vast amounts of data related to heart disease and other conditions. With this data available, machine learning algorithms can better identify patients who are developing diseases ranging from diabetes to CVD ². Research into which supervised learning techniques are best for CVD prediction was still ongoing as of 2024 ³, but we also wish to use this data to further explore unsupervised learning techniques, since they can help us detect the disease without any labels.

We plan to explore these two datasets:

Cardiovascular Heart Disease Dataset from the Mendeley database
Heart Disease Cleveland Dataset from the UC Irvine Machine Learning Repository

Both datasets contain 13 features and a target variable specifying whether or not the patient was diagnosed with heart disease. Of the 13 features, 8 are nominal and 5 are numeric, including age, blood pressure, and cholesterol levels.

Problem Definition

We want to use machine learning models to predict whether someone has cardiovascular disease from various health metrics. Most prior studies ⁴ focused on supervised learning algorithms for making predictions; our project will cover both unsupervised and supervised learning for more comprehensive results.

Methods

We plan to use these data pre-processing methods:

Dimensionality Reduction: We can combine correlated features using methods like PCA, which can both reduce computational time and cost and improve model performance.
Data Cleaning: For missing values, we can impute median or mean values computed from the dataset so that our algorithms can run on complete records.
Data Augmentation: We can utilize data augmentation to generate new data if we have too little data for a specific algorithm to work well.
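
As a rough illustration of the cleaning and dimensionality-reduction steps above, here is a minimal sketch using NumPy. The toy matrix, the helper names (`impute_median`, `standardize`, `pca`), and the choice of two components are assumptions made for the example, not part of the proposal; data augmentation is not shown.

```python
import numpy as np

def impute_median(X):
    """Replace each NaN with the median of its column (ignoring NaNs)."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - mean) / std

def pca(X, n_components):
    """Project centered data onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Toy stand-in for three numeric features, with one missing entry.
X_raw = np.array([
    [63.0, 145.0, 233.0],
    [37.0, np.nan, 250.0],
    [41.0, 130.0, 204.0],
    [56.0, 120.0, 236.0],
])

X_clean = impute_median(X_raw)
X_scaled = standardize(X_clean)
X_reduced, explained_ratio = pca(X_scaled, n_components=2)
```

When a train/test split is used, the medians, scaling statistics, and principal components would typically be computed on the training portion only and then applied to the test portion, so that no test information leaks into preprocessing.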

We plan to use these unsupervised learning techniques:

K-means Clustering: This technique will help us understand whether hard clustering is useful for our problem.
Gaussian Mixture Model (GMM): This technique will help us compare how well soft assignment methods work for our project.
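
The difference between hard and soft assignment can be sketched as follows, assuming scikit-learn is available. The synthetic two-blob data is a stand-in for the real patient features, and the choice of two clusters (disease present or absent) is an assumption for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for scaled patient features: two Gaussian blobs.
X = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=2.0, scale=1.0, size=(50, 2)),
])

# Hard clustering: every record gets exactly one cluster id.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
hard_labels = kmeans.fit_predict(X)

# Soft clustering: every record gets a probability for each component.
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)
soft_probs = gmm.predict_proba(X)

# A hard label can still be recovered from the soft assignment if needed.
gmm_labels = soft_probs.argmax(axis=1)
```

The per-record probabilities in `soft_probs` are what distinguish the GMM output: records near the boundary between the two groups receive split membership rather than a single forced label.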

Lastly, we want to use these supervised learning techniques:

Logistic Regression, Neural Networks: These are among the most commonly used classification models and can work on almost any dataset. They can serve as baselines against which to compare all other models.
SVM: This technique usually performs well on datasets which have high dimensions and unstructured data.
Random Forest: This method is often regarded as robust when training on datasets with many missing values.
XGBoost, KNN: From our literature review ³ ⁵, these methods were found to have the best performance on healthcare data.
Decision Tree: This method usually works well when the data is discrete or categorical.
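
One possible way to compare several of the supervised models listed above on equal footing is k-fold cross-validation. The sketch below assumes scikit-learn and uses a synthetic 13-feature dataset from `make_classification` in place of the real data; the hyperparameters are illustrative defaults, not tuned choices, and XGBoost is left out because it lives in a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 13 features (assumption:
# stands in for the heart disease datasets, which are not loaded here).
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline so scaling is fit
# inside each cross-validation fold rather than on the full dataset.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```

The same loop can be repeated with `scoring="f1"`, `"precision"`, or `"recall"` to cover the other metrics.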

(Potential) Results and Discussion

To evaluate our supervised learning models, we plan to use the following metrics:

Accuracy
F1 Score
Precision
Recall
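
For reference, these four metrics can be computed directly from the confusion counts. The sketch below is a plain-Python illustration; the function name `binary_metrics` and the toy labels are made up for the example, with 1 meaning heart disease present.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
# All four metrics equal 0.75 for this example.
```

Recall measures how many patients with the disease are actually flagged, which is why it is reported alongside accuracy.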

For unsupervised models, we plan to use the following metrics:

Completeness Score
Fowlkes-Mallows Score
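
Both scores compare a clustering against known labels without requiring cluster ids to match label values. The plain-Python sketch below follows the standard definitions (pair counting for Fowlkes-Mallows, conditional entropy for completeness); the function names and toy inputs are assumptions for the example, and a library implementation would normally be used in practice.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt

def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both
        elif same_pred:
            fp += 1  # grouped together only by the clustering
        elif same_true:
            fn += 1  # grouped together only by the true labels
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))

def completeness(labels_true, labels_pred):
    """1 - H(cluster | class) / H(cluster); 1.0 if there is one cluster."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    h_clusters = -sum((c / n) * log(c / n) for c in cluster_counts.values())
    if h_clusters == 0.0:
        return 1.0
    h_clusters_given_classes = -sum(
        (c / n) * log(c / class_counts[cls])
        for (cls, _), c in joint_counts.items()
    )
    return 1.0 - h_clusters_given_classes / h_clusters

# Cluster ids are swapped relative to the labels, yet both scores are 1.0.
perfect_fm = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
perfect_completeness = completeness([0, 0, 1, 1], [1, 1, 0, 0])

# One record from class 1 is placed in the wrong cluster.
imperfect_fm = fowlkes_mallows([0, 0, 1, 1], [0, 0, 0, 1])
```

Note that completeness only checks that members of a class end up in the same cluster, so a clustering that puts every record into one cluster also scores 1.0; pairing it with Fowlkes-Mallows guards against that degenerate case.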

Project Goals: Not many studies have looked at unsupervised learning for this problem, so we want to focus on how accurately unsupervised models cluster patient records. Although clustering algorithms do not use ground-truth labels during training, we can map each identified cluster to a label afterward and then apply the label-based metrics directly.
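
One simple way to build such a cluster-to-label mapping is a majority vote within each cluster. The sketch below is only one possible approach; the function name `map_clusters_to_labels` and the toy inputs are assumptions for the example. Because the vote uses the true labels, it is suitable for evaluation only, not for training.

```python
from collections import Counter, defaultdict

def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the most common true label among its members."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {
        cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()
    }

# Toy example: cluster 0 is mostly label 1, cluster 1 is mostly label 0.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = [1, 1, 0, 0, 0, 0, 1, 1]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]

# With clusters translated to labels, supervised metrics apply directly.
accuracy = sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)
# mapping is {0: 1, 1: 0} and accuracy is 0.75 for this example.
```

With more than two clusters, a majority vote can map several clusters to the same label; whether that is acceptable depends on how the clusters are meant to be interpreted.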

Expected Results: Based on the existing literature, we expect supervised models to predict heart disease with accuracy scores of 95% or higher. Many papers report conflicting results on which algorithm is best, so our goal is to perform a similar study and determine the best algorithm via quantitative metrics. Furthermore, we expect clustering methods to give an accurate answer as to whether the disease is present or not.

Timeline

See here for our Gantt Chart.

Contributors

Name	Contribution
Suzan	Website, Methods, Results
Natasha	Results, Gantt Chart
Kalp	Results & Discussion
Chih-Chun	Problem Definition, Methods
Eric	Intro/Background

References

S. Hossain et al., “Machine Learning Approach for predicting cardiovascular disease in Bangladesh: Evidence from a cross-sectional study in 2023 - BMC Cardiovascular Disorders,” BioMed Central, https://bmccardiovascdisord.biomedcentral.com/articles/10.1186/s12872-024-03883-2.
A. Dinh, S. Miertschin, A. Young, and S. D. Mohanty, “A data-driven approach to predicting diabetes and cardiovascular disease with Machine Learning - BMC Medical Informatics and Decision making,” SpringerLink, https://link.springer.com/article/10.1186/s12911-019-0918-5/metrics.
Ogunpola, A.; Saeed, F.; Basurra, S.; Albarrak, A.M.; Qasem, S.N. Machine Learning-Based Predictive Models for Detection of Cardiovascular Diseases. Diagnostics 2024, 14, 144. https://doi.org/10.3390/diagnostics14020144
A. Javaid et al., “Medicine 2032: The Future of Cardiovascular Disease Prevention with Machine Learning and Digital Health Technology,” American Journal of Preventive Cardiology, vol. 12, p. 100379, Dec. 2022. doi:10.1016/j.ajpc.2022.100379.
Palechor, Fabio Mendoza et al. “Cardiovascular Disease Analysis Using Supervised and Unsupervised Data Mining Techniques.” J. Softw. 12 (2017): 81-90.