Cardiovascular diseases (CVDs) are the leading cause of death globally, responsible for an estimated 17.9 million deaths each year—about 31% of all deaths worldwide. The majority of CVD deaths result from heart attacks and strokes, with a significant portion occurring prematurely in individuals under 70 years old. Heart failure, a frequent consequence of CVDs, can be predicted using clinical and demographic features. This project leverages a dataset containing 11 relevant features to analyze and predict heart disease risk. By applying data analysis and machine learning techniques, the goal is to support early detection and management of cardiovascular risk, especially for individuals with risk factors such as hypertension, diabetes, or hyperlipidaemia. The resulting insights and predictive models can assist healthcare professionals in making informed decisions and improving patient outcomes.
Kaggle - Heart Failure Prediction Dataset
This dataset was created by merging five previously independent heart disease datasets over their 11 shared features. According to its Kaggle description, this makes it the largest heart disease dataset available for research at the time of publication.
It includes the following datasets:
Cleveland: 303 observations
Hungarian: 294 observations
Switzerland: 123 observations
Long Beach VA: 200 observations
Statlog (Heart) Data Set: 270 observations
Total: 1190 observations
Duplicated: 272 observations
Final dataset: 918 observations
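The counts above can be checked with a few lines of arithmetic. This is only a sanity check on the figures listed here, not a reproduction of the original merge; the variable names are illustrative.

```python
# Observation counts per source dataset, as listed above.
source_counts = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
    "Statlog (Heart)": 270,
}
duplicates = 272  # duplicated observations reported above

total = sum(source_counts.values())  # rows before de-duplication
final = total - duplicates           # rows left after removing duplicates

print(f"total={total}, final={final}")
```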
Each observation includes the following features:
Age: Age of the patient
Sex: Sex of the patient [M: Male, F: Female]
ChestPainType: Chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: Resting blood pressure [mm Hg]
Cholesterol: Serum cholesterol [mg/dl]
FastingBS: Fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: Resting electrocardiogram results [Normal: Normal, ST: ST-T wave abnormality, LVH: left ventricular hypertrophy]
MaxHR: Maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: Exercise-induced angina [Y: Yes, N: No]
Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
ST_Slope: The slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: Output class [1: heart disease, 0: Normal]
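Several of these features are categorical strings, while scikit-learn estimators expect numeric input. One common option is one-hot encoding, sketched below with pandas. This is an assumption about how the features could be prepared, not a description of this project's actual preprocessing; `encode_features` and `CATEGORICAL` are hypothetical names, and the five rows are the sample rows shown in this document.

```python
import pandas as pd

# Sample rows taken from the sample table in this document.
sample = pd.DataFrame(
    {
        "Age": [40, 49, 37, 48, 54],
        "Sex": ["M", "F", "M", "F", "M"],
        "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
        "RestingBP": [140, 160, 130, 138, 150],
        "Cholesterol": [289, 180, 283, 214, 195],
        "FastingBS": [0, 0, 0, 0, 0],
        "RestingECG": ["Normal", "Normal", "ST", "Normal", "Normal"],
        "MaxHR": [172, 156, 98, 108, 122],
        "ExerciseAngina": ["N", "N", "N", "Y", "N"],
        "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
        "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
        "HeartDisease": [0, 1, 0, 1, 0],
    }
)

# Columns holding string categories (hypothetical constant name).
CATEGORICAL = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]


def encode_features(df):
    """Split off the target and one-hot encode the categorical columns."""
    X = pd.get_dummies(df.drop(columns="HeartDisease"), columns=CATEGORICAL)
    y = df["HeartDisease"]
    return X, y


X, y = encode_features(sample)
print(X.columns.tolist())
```

Note that `get_dummies` only creates columns for categories present in the data it sees, so categories missing from this small sample (for example `TA` or `LVH`) get no column here. On the full dataset all listed categories would be expected to appear.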
Before analysis, the dataset underwent a thorough data cleaning process to ensure accuracy and reliability. The following steps were performed:
Both RestingBP and Cholesterol contain some 0 values (see the Min column of the descriptive statistics), which are not physiologically plausible. These were treated as missing values and replaced with the median of the respective column. A binary indicator was also created to flag the affected records for further analysis.
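A minimal sketch of this imputation step with pandas follows. It assumes the median is computed over the non-zero values only, which the text does not state explicitly. The helper `impute_zeros`, the `_was_zero` flag suffix, and the toy values are all hypothetical and used purely for illustration.

```python
import pandas as pd


def impute_zeros(df, columns=("RestingBP", "Cholesterol")):
    """Treat 0 as missing in `columns`, flag those rows, and fill with the median."""
    out = df.copy()
    for col in columns:
        is_zero = out[col] == 0
        # Binary indicator marking rows whose value was imputed (hypothetical name).
        out[f"{col}_was_zero"] = is_zero.astype(int)
        # Assumption: the median is taken over the valid (non-zero) values only.
        median = out.loc[~is_zero, col].median()
        out[col] = out[col].astype(float).mask(is_zero, median)
    return out


# Made-up toy values, not rows from the real dataset.
toy = pd.DataFrame({"RestingBP": [120, 0, 140, 130], "Cholesterol": [200, 0, 0, 260]})
cleaned = impute_zeros(toy)
print(cleaned)
```

Keeping the indicator columns lets a later analysis check whether the imputed records behave differently from the rest.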
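The cleaned dataset contains 918 observations and 12 columns (the 11 features plus the HeartDisease target). Below are sample rows and key descriptive statistics. The statistics still show a minimum of 0 for RestingBP and Cholesterol, so they appear to have been computed before the zero values were imputed.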
Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
---|---|---|---|---|---|---|---|---|---|---|---|
40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |

Descriptive statistics for the numeric columns:

Feature | Mean | Std | Min | 25% | 50% | 75% | Max |
---|---|---|---|---|---|---|---|
Age | 53.51 | 9.43 | 28 | 47 | 54 | 60 | 77 |
RestingBP | 132.40 | 18.51 | 0 | 120 | 130 | 140 | 200 |
Cholesterol | 198.80 | 109.38 | 0 | 173.25 | 223 | 267 | 603 |
FastingBS | 0.23 | 0.42 | 0 | 0 | 0 | 0 | 1 |
MaxHR | 136.81 | 25.46 | 60 | 120 | 138 | 156 | 202 |
Oldpeak | 0.89 | 1.07 | -2.6 | 0.0 | 0.6 | 1.5 | 6.2 |
HeartDisease | 0.55 | 0.50 | 0 | 0 | 1 | 1 | 1 |
Visualizations can provide valuable insights into the dataset. Below are some examples of visualizations that can be created:
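As one example, a bar chart of heart disease rate per category starts from a grouped summary like the one sketched here. The helper `disease_rate_by` is a hypothetical name, and the five rows are the sample rows from this document, so the output illustrates the mechanics only and says nothing about the full dataset.

```python
import pandas as pd

# Three columns of the sample rows shown in this document.
sample = pd.DataFrame(
    {
        "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
        "Sex": ["M", "F", "M", "F", "M"],
        "HeartDisease": [0, 1, 0, 1, 0],
    }
)


def disease_rate_by(df, column):
    """Share of rows with HeartDisease == 1 within each category of `column`."""
    return df.groupby(column)["HeartDisease"].mean().sort_values(ascending=False)


rates = disease_rate_by(sample, "ChestPainType")
print(rates)
```

With matplotlib installed, `rates.plot.bar()` would draw the corresponding bar chart.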
This project uses a Random Forest classifier to predict heart disease from the dataset features. Random Forest is an ensemble method that builds multiple decision trees and combines their outputs for improved accuracy and robustness. The model's performance can be evaluated using metrics such as accuracy, precision, recall, and F1-score. Cross-validation can then be used to check how well the model generalizes to unseen data.
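The evaluation described above could look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification` as a stand-in for the real features, so the printed scores carry no meaning for heart disease prediction. The hyperparameters (100 trees, 5 stratified folds) are assumptions, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in with the same number of rows and features as the dataset.
X, y = make_classification(
    n_samples=918, n_features=11, n_informative=5, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a rough sense of how stable each metric is.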
Feature importance analysis can help identify the most significant predictors of heart disease.
The plot below shows the relative importance of each feature in predicting heart disease using the Random Forest classifier:
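A ranking of this kind can be read from the fitted model's `feature_importances_` attribute, as sketched below. The data is again synthetic and the column names `f0` to `f10` are placeholders, so the ranking is meaningless for the real features. With the actual dataset the index would be the encoded feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names; with real data these would be the feature columns.
feature_names = [f"f{i}" for i in range(11)]

# Synthetic stand-in data, not the heart disease dataset.
X, y = make_classification(
    n_samples=918, n_features=len(feature_names), n_informative=5, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` is worth considering as a cross-check.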
This project demonstrates the potential of data analysis and machine learning in predicting heart disease risk.
By leveraging a comprehensive dataset and applying appropriate techniques, we can gain valuable insights into the factors contributing to cardiovascular diseases.
The Random Forest classifier provides a robust model for predicting heart disease, with feature importance analysis highlighting key predictors. These findings can assist healthcare professionals in identifying at-risk patients and implementing preventive measures. Future work could involve exploring additional machine learning algorithms, hyperparameter tuning, and incorporating more diverse datasets to enhance model performance and generalizability.
Overall, this project underscores the importance of data-driven approaches in healthcare and the potential for improving patient outcomes through early detection.