
Healthcare Data Analysis Documentation

Overview

Cardiovascular diseases (CVDs) are the leading cause of death globally, responsible for an estimated 17.9 million deaths each year, about 31% of all deaths worldwide. Most CVD deaths result from heart attacks and strokes, and a significant portion occur prematurely in individuals under 70 years old. Heart failure, a frequent consequence of CVDs, can be predicted using clinical and demographic features. This project uses a dataset containing 11 relevant features to analyze and predict heart disease risk. By applying data analysis and machine learning techniques, the goal is to support early detection and management of cardiovascular risk, especially for individuals with risk factors such as hypertension, diabetes, or hyperlipidaemia. The resulting insights and predictive models can assist healthcare professionals in making informed decisions and improving patient outcomes.

Data Sources

Kaggle - Heart Failure Prediction Dataset

This dataset was curated by merging five previously independent heart disease datasets, unified by 11 shared features. According to its curators, this makes it the largest heart disease dataset available so far for research.


It includes the following datasets:


Total: 1190 observations
Duplicated: 272 observations


Final dataset: 918 observations
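The counts above follow from 1190 - 272 = 918. The source does not say how duplicates were identified, so the following is only a minimal sketch that assumes duplicates are exact row matches across the merged sources. The two small tables and their values are made up for illustration.

```python
import pandas as pd

# Two tiny, made-up source tables sharing the same columns (not real patient data).
source_a = pd.DataFrame(
    {"Age": [40, 49, 37], "RestingBP": [140, 160, 130], "HeartDisease": [0, 1, 0]}
)
source_b = pd.DataFrame(
    {"Age": [49, 54], "RestingBP": [160, 150], "HeartDisease": [1, 0]}
)

# Stack the sources, then keep only the first occurrence of each identical row.
merged = pd.concat([source_a, source_b], ignore_index=True)
deduplicated = merged.drop_duplicates().reset_index(drop=True)

n_total = len(merged)
n_duplicated = n_total - len(deduplicated)
print(f"Total: {n_total}, duplicated: {n_duplicated}, final: {len(deduplicated)}")
```

The same pattern would apply to the five real source files once they are loaded with a shared column layout.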

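Each observation includes the following features: Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, and ST_Slope, plus the HeartDisease target.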

Data Cleaning

Before analysis, the dataset underwent a thorough data cleaning process to ensure accuracy and reliability. The following steps were performed:


Continuous Variables Before Cleaning

As shown above, both RestingBP and Cholesterol had some 0 values, which are not physiologically plausible. These were treated as missing values and replaced with the median values for the respective fields. A binary indicator was also created to flag these records for further analysis.
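The zero-handling step described above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the helper name `impute_zeros_with_median` and the `_was_zero` indicator column names are hypothetical, and the median is assumed to be computed over the non-zero readings only. The toy values are made up.

```python
import pandas as pd

def impute_zeros_with_median(df, columns):
    """Flag zero readings, then replace them with the median of the non-zero values."""
    cleaned = df.copy()
    for col in columns:
        is_zero = cleaned[col] == 0
        # Hypothetical indicator column name; the source does not name its flag columns.
        cleaned[f"{col}_was_zero"] = is_zero.astype(int)
        # Assumption: the median is taken over the non-zero readings only.
        median = cleaned.loc[~is_zero, col].median()
        # Cast to float first so a fractional median can be stored without surprises.
        cleaned[col] = cleaned[col].astype(float).mask(is_zero, median)
    return cleaned

# Made-up toy values, not rows from the real dataset.
toy = pd.DataFrame({"RestingBP": [140, 0, 130, 150], "Cholesterol": [289, 180, 0, 0]})
cleaned = impute_zeros_with_median(toy, ["RestingBP", "Cholesterol"])
print(cleaned)
```

Creating the indicator before overwriting the values keeps a record of which rows were imputed, so their influence can be examined later.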

Data Analysis

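The cleaned dataset contains 918 observations and 12 columns: the 11 features plus the HeartDisease target. Below is a summary of the dataset and key descriptive statistics. Note that the statistics table reports a minimum of 0 for RestingBP and Cholesterol, which indicates that it was computed before the zero values were imputed.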


Sample Data

Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
40   M    ATA            140        289          0          Normal      172    N               0.0      Up        0
49   F    NAP            160        180          0          Normal      156    N               1.0      Flat      1
37   M    ATA            130        283          0          ST          98     N               0.0      Up        0
48   F    ASY            138        214          0          Normal      108    Y               1.5      Flat      1
54   M    NAP            150        195          0          Normal      122    N               0.0      Up        0

Feature Overview


Descriptive Statistics

Feature       Mean    Std     Min   25%     50%   75%   Max
Age           53.51   9.43    28    47      54    60    77
RestingBP     132.40  18.51   0     120     130   140   200
Cholesterol   198.80  109.38  0     173.25  223   267   603
FastingBS     0.23    0.42    0     0       0     0     1
MaxHR         136.81  25.46   60    120     138   156   202
Oldpeak       0.89    1.07    -2.6  0.0     0.6   1.5   6.2
HeartDisease  0.55    0.50    0     0       1     1     1
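A table with this layout can be produced with pandas `describe()`. The sketch below uses only the five sample rows listed earlier in this document, so its numbers will differ from the full-dataset statistics above. It is meant to show the mechanics rather than reproduce the table.

```python
import pandas as pd

# The five sample rows from the Sample Data section (a subset of numeric columns).
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "RestingBP": [140, 160, 130, 138, 150],
    "Cholesterol": [289, 180, 283, 214, 195],
    "MaxHR": [172, 156, 98, 108, 122],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
})

# describe() returns count, mean, std, min, quartiles, and max per column;
# transposing puts one feature per row, matching the table layout.
stats = sample.describe().T.drop(columns="count").round(2)
print(stats)
```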

Key Insights

Data Visualization

Visualizations can provide valuable insights into the dataset. Below are some examples of visualizations that can be created:


Figures: Continuous Variables, Binary Variables, Categorical Variables
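One way to draw a panel for each variable type is sketched below with matplotlib. The choice of chart per type (histogram for continuous, count bars for binary and categorical) and the choice of example columns are assumptions, since the original figures are not reproduced here. The data are a few values taken from the sample rows in this document.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few columns from the five sample rows shown in the Sample Data section.
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "FastingBS": [0, 0, 0, 0, 0],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
})

fig, (ax_cont, ax_bin, ax_cat) = plt.subplots(1, 3, figsize=(12, 3.5))

# Continuous variable: histogram of the value distribution.
ax_cont.hist(sample["Age"], bins=5)
ax_cont.set_title("Continuous: Age")

# Binary variable: bar chart of counts per class.
bin_counts = sample["FastingBS"].value_counts().sort_index()
ax_bin.bar([str(v) for v in bin_counts.index], bin_counts.to_numpy())
ax_bin.set_title("Binary: FastingBS")

# Categorical variable: bar chart of counts per category.
cat_counts = sample["ChestPainType"].value_counts()
ax_cat.bar(cat_counts.index.tolist(), cat_counts.to_numpy())
ax_cat.set_title("Categorical: ChestPainType")

fig.tight_layout()
```

Calling `fig.savefig("overview.png")` (the filename is arbitrary) would write the three panels to disk.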

Machine Learning Model

This project uses a Random Forest classifier to predict heart disease based on the dataset features. Random Forest is an ensemble method that builds multiple decision trees and combines their outputs for improved accuracy and robustness. The model's performance can be evaluated using metrics such as accuracy, precision, recall, and F1-score. Cross-validation can then be used to check how well the model generalizes to unseen data.
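The training and evaluation approach described above might look like the following scikit-learn sketch. It runs on synthetic stand-in data with a made-up labelling rule, not on the Kaggle dataset, and the hyperparameters (100 trees, 5 stratified folds) are assumptions rather than the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (NOT the Kaggle dataset), reusing a few of its column names.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Age": rng.integers(28, 78, size=n),
    "MaxHR": rng.integers(60, 203, size=n),
    "Oldpeak": rng.uniform(0.0, 4.0, size=n).round(1),
    "ExerciseAngina": rng.choice(["N", "Y"], size=n),
})
# Made-up labelling rule so the toy problem is learnable.
data["HeartDisease"] = (
    (data["Oldpeak"] > 1.5) | (data["ExerciseAngina"] == "Y")
).astype(int)

X = data.drop(columns="HeartDisease")
y = data["HeartDisease"]

# One-hot encode the categorical column; numeric columns pass through unchanged.
preprocess = ColumnTransformer(
    [("categorical", OneHotEncoder(handle_unknown="ignore"), ["ExerciseAngina"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Stratified 5-fold cross-validation, scoring the four metrics named above.
metrics = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
mean_scores = {name: scores[f"test_{name}"].mean() for name in metrics}
print(mean_scores)
```

Wrapping the encoder and classifier in a single pipeline means the encoding is refit inside each fold, which avoids leaking information from the held-out data.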

Feature importance analysis can help identify the most significant predictors of heart disease.

Feature Importance Visualization

The plot below shows the relative importance of each feature in predicting heart disease using the Random Forest classifier:


Feature Importance Bar Chart for Heart Failure Prediction Dataset
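The values behind such a chart can be read from a fitted forest's `feature_importances_` attribute. The sketch below again uses synthetic stand-in data (not the Kaggle dataset) in which only Oldpeak drives the label, so the ranking it prints says nothing about the real predictors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the Kaggle dataset): only Oldpeak determines the label.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Age": rng.integers(28, 78, size=n),
    "MaxHR": rng.integers(60, 203, size=n),
    "Oldpeak": rng.uniform(0.0, 4.0, size=n),
})
y = (X["Oldpeak"] > 2.0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds impurity-based importances, normalized to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so a permutation-based check is a reasonable complement when interpreting the ranking.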

Conclusion

This project demonstrates the potential of data analysis and machine learning in predicting heart disease risk.

By leveraging a comprehensive dataset and applying appropriate techniques, we can gain valuable insights into the factors contributing to cardiovascular diseases.


The Random Forest classifier provides a robust model for predicting heart disease, with feature importance analysis highlighting key predictors. These findings can assist healthcare professionals in identifying at-risk patients and implementing preventive measures. Future work could involve exploring additional machine learning algorithms, hyperparameter tuning, and incorporating more diverse datasets to enhance model performance and generalizability.


Overall, this project underscores the importance of data-driven approaches in healthcare and the potential for improving patient outcomes through early detection.