Heart Attack Prediction Using Machine Learning Models

Lucinda Liu
4 min read · May 8, 2021


Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States. Here are some astonishing statistics regarding this condition:

  • One person dies every 36 seconds in the United States from cardiovascular disease.
  • About 655,000 Americans die from heart disease each year — that’s 1 in every 4 deaths.

Not only can the disease be lethal, but the bills associated with it can also be extremely burdensome. Heart disease cost the United States about $219 billion each year from 2014 to 2015. This includes the cost of health care services, medicines, and lost productivity due to death.

As this is a serious health condition in the US as well as worldwide, I took this opportunity to look at ways to predict the chance of a heart attack and some of the factors associated with it. The dataset comes from Kaggle and contains numerous attributes that may contribute to the chance of a heart attack. If a functional model can be built, doctors and health care providers could use these indicators to estimate a patient's or visitor's chance of a heart attack, which could in turn support preventative measures and effective warning systems.

Here are the variables included in the dataset:

  • age: Age of the patient
  • sex: Sex of the patient
  • cp: Chest pain type, 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic
  • trtbps: Resting blood pressure (in mm Hg) (130+ = high)
  • chol: Cholesterol in mg/dl fetched via BMI sensor (240+ = high)
  • fbs: (fasting blood sugar > 120 mg/dl), 1 = True, 0 = False
  • restecg: Resting electrocardiographic results, 0 = Normal, 1 = ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalachh: Maximum heart rate achieved
  • oldpeak: Previous peak
  • slp: Slope
  • caa: Number of major vessels
  • thall: Thallium Stress Test result, (0–3)
  • exng: Exercise-induced angina, 1 = Yes, 0 = No
  • output: 0 = less chance of heart attack, 1 = more chance of heart attack

First, I ran a quick visualization to check that the two outcome classes are roughly balanced, so that the analysis is fair. The distribution is fairly even and does not lean toward either class.

Distribution of instances with more and less chance of heart attack
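The same balance check can also be done numerically. The snippet below is a minimal sketch, not the code used for this article: `class_balance` is a hypothetical helper, and the label list is made up to stand in for the dataset's `output` column.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each class label as a fraction of all rows."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical labels standing in for the `output` column (not real data).
example_output = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
balance = class_balance(example_output)
print(balance)
```

Shares close to 0.5 for both classes indicate that plain accuracy is a reasonable metric; a heavily skewed split would call for something like precision and recall instead.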

Then, I plotted some basic visualizations to spot initial patterns.

Males appear more prone to heart attack than females, and non-anginal pain appears to be a leading contributor to a higher chance of heart attack.
Fasting blood sugar > 120 mg/dl does not seem to affect the chance of heart attack much, and ST-T wave abnormality seems to contribute slightly to a higher chance.
Exercise-induced angina and a low number of major vessels do not seem to lead to a higher chance.
A thall stress test result of 2 seems to contribute to a higher chance, while high cholesterol and high blood pressure do not influence the chance much.

Next, I split the data into training and testing sets using a 7/3 split (70% training, 30% testing) and ran it through a few models. The accuracy results are as follows:

  • Logistic Regression: 83.52%
  • Decision Tree: 95.6%
Decision Tree with Depth of 6
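The split-and-compare step could look roughly like the scikit-learn sketch below. It is an illustration under assumptions, not the original notebook: the data is synthetic (generated with `make_classification`, with 300 rows and 13 features chosen to resemble the dataset's shape), and the random seeds and `max_iter` value are arbitrary, so the printed accuracies will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kaggle data: 13 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# Hold out 30% of the rows for testing; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2%}")
```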

We can see that the decision tree has a much higher accuracy, at 95.6%. To avoid overfitting the training data, I regularized the model. One way of doing so is to restrict the maximum depth of the decision tree and to control the leaf size. I changed the maximum depth from 6 to 5 and set the minimum number of samples per leaf to 10. The resulting accuracy score was 84.6%.
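In scikit-learn, these two constraints correspond to the `max_depth` and `min_samples_leaf` parameters of `DecisionTreeClassifier`. The sketch below applies them to synthetic stand-in data (an assumption, as are the seeds), then inspects the fitted tree to confirm that both limits were respected.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kaggle data: 13 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Shallower tree, and every leaf must hold at least 10 training rows.
regularized_tree = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=10, random_state=0
)
regularized_tree.fit(X_train, y_train)

# Leaves are the nodes without a left child (marked with -1).
fitted = regularized_tree.tree_
leaf_sizes = fitted.n_node_samples[fitted.children_left == -1]

print("depth:", regularized_tree.get_depth())
print("smallest leaf:", leaf_sizes.min())
print("test accuracy:", regularized_tree.score(X_test, y_test))
```

A larger minimum leaf size stops the tree from carving out rules that fit only a handful of patients, which is usually where memorized noise ends up.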

Another way of regularizing the model is to use an ensemble method such as Random Forest, which constructs multiple trees and combines them into a collective model. The resulting accuracy of this model was 83.5%.

Some other ensemble methods were used as well: a generic bagging method reached an accuracy of 81%, and AdaBoost classification reached 80%.
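For reference, the three ensemble methods can be compared side by side as in the sketch below. The data is again synthetic, and the `n_estimators` values and seeds are assumptions rather than the settings used for the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle data: 13 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Bagging and AdaBoost use decision trees as their default base learner.
ensembles = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

ensemble_scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    ensemble_scores[name] = model.score(X_test, y_test)
    print(f"{name}: {ensemble_scores[name]:.2%}")
```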

Based on the accuracy scores, the best-fitting model is the Decision Tree Classifier, with an accuracy of 95.6%. I also measured the accuracy on the training data and got 96.2%. Because this is close to the testing accuracy, the model is unlikely to be overfitted.
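The train-versus-test comparison is easy to reproduce. The sketch below fits a depth-6 tree on synthetic stand-in data (an assumption, so the numbers will differ from those above) and reports the gap between the two accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kaggle data: 13 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(max_depth=6, random_state=0)
tree.fit(X_train, y_train)

# score() returns mean accuracy on the data it is given.
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
gap = train_accuracy - test_accuracy

print(f"train: {train_accuracy:.2%}")
print(f"test:  {test_accuracy:.2%}")
print(f"gap:   {gap:.2%}")
```

A small gap suggests the model generalizes about as well as it fits; a training accuracy far above the testing accuracy is the usual sign of overfitting.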

The top 3 most important features in the model are chest pain type, caa, and cholesterol level. Interestingly, it was hard to spot from the graphs any visible impact of cholesterol level on the chance of a heart attack. Hopefully, these findings will be useful for healthcare institutions as a proxy for judgement and as a preventative measure.
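A ranking like this can be read from a fitted tree's `feature_importances_` attribute. In the sketch below, the column names come from the variable list earlier in the article, but the data itself is synthetic, so the resulting order is illustrative only and says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

feature_names = [
    "age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
    "thalachh", "oldpeak", "slp", "caa", "thall", "exng",
]

# Synthetic data: the names are only labels attached to random columns.
X, y = make_classification(
    n_samples=300, n_features=len(feature_names), n_informative=6,
    random_state=0,
)

tree = DecisionTreeClassifier(max_depth=6, random_state=0)
tree.fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_three = ranked[:3]

for name, importance in top_three:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances reflect how much each feature helped the tree split the training data, which is one reason a feature can rank highly without showing an obvious pattern in a simple bar chart.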
