Seeing the Unseen: Predicting Brain Stroke

Authors:

Abhishek Deore
Daniel Shevelev
Kashyap Nadendla
Shreya Kolte
Ram Dheeraj Kamarajugadda

Dataset Description

The Brain Stroke Dataset is taken from Kaggle and consists of 4982 rows and 11 columns.
Columns include gender, age, hypertension, heart disease, average glucose level, BMI, smoking status, ever married, work type, residence type, and the target, stroke.
No null values were found in the dataset.

EDA Plots

Gender
Smoking status and age relation
Age vs average glucose level
Heart disease and strokes
Correlation Heat Map

Gender

Smoking Status and Age relation

Age vs Glucose

Heart Disease and Strokes

Correlation Heat Map

The heatmap shows the pairwise correlations between the numeric variables in the dataset, which helps identify potential patterns and relationships among them.
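As a rough illustration of what each cell of a correlation heat map holds, the sketch below computes a Pearson correlation matrix in plain Python. The column values are made up for illustration and are not taken from the dataset, and the helper name `pearson` is an assumption of this sketch.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up numeric columns (illustrative only, not the real data).
columns = {
    "age": [25, 40, 55, 70, 85],
    "avg_glucose_level": [80, 95, 90, 130, 150],
    "bmi": [22, 27, 30, 26, 24],
}

names = list(columns)
# One value per pair of variables: this is the grid a heat map colors.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. The diagonal is always 1 because each variable is perfectly correlated with itself.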

Question 1: How accurate are various classification models for detecting a stroke in a patient?

Approach

In the initial phase of our brain stroke prediction project, we examined the dataset. The brain.shape and brain.info() checks confirmed the dataset’s integrity, reporting its dimensions, column types, and non-null counts.
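The checks described above can be sketched as follows, assuming `brain` is a pandas DataFrame. The tiny frame here uses made-up values as a stand-in; the real one would be loaded from the Kaggle CSV.

```python
import pandas as pd

# Made-up stand-in for the real dataset (illustrative values only).
brain = pd.DataFrame({
    "age": [34.0, 61.0, 45.0, 78.0],
    "hypertension": [0, 1, 0, 1],
    "avg_glucose_level": [85.2, 140.7, 99.1, 201.3],
    "stroke": [0, 0, 0, 1],
})

# Dimensions as (rows, columns).
n_rows, n_cols = brain.shape
print(n_rows, n_cols)

# Column names, dtypes, and non-null counts.
brain.info()

# Explicit count of missing values per column.
null_counts = brain.isnull().sum()
print(null_counts)
total_nulls = int(null_counts.sum())
```

A `total_nulls` of zero corresponds to the "no null values" observation, so no imputation step is needed before modeling.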

Classification models used:

Logistic Regression
K-Nearest Neighbors (KNN)
Naive Bayes
Decision Tree
Random Forest
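A minimal sketch of how these five models could be compared on accuracy with scikit-learn. The slides do not state the evaluation protocol, so the 5-fold cross-validation, the scaling step, the hyperparameters, and the synthetic data from `make_classification` are all assumptions of this sketch rather than the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the stroke dataset).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

def compare_models(models, X, y, cv=5):
    """Return the mean cross-validated accuracy for each named model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }

scores = compare_models(models, X, y)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Swapping `scoring="accuracy"` for `"recall"` or `"f1"` gives a view that is less dominated by the majority class.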

Model Accuracy Results

Selected Model:

Logistic Regression

Accuracy - 95%

Question 2: How are certain lifestyle changes ranked on the basis of their importance in reducing the possibility of a stroke?

Approach

The coefficients of the logistic regression model provide a feature ranking: the sign and size of each coefficient indicate the direction and strength of that feature's association with the likelihood of stroke.
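One way such a ranking could be produced with scikit-learn is sketched below. The feature names are placeholders, the data is synthetic, and standardizing the inputs is an assumption made here so that coefficient sizes are comparable; the slides do not say how the features were preprocessed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder names and synthetic data (not the stroke dataset).
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Standardize so coefficients are on a comparable scale (an assumption).
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# For binary classification, coef_ has shape (1, n_features).
ranking = sorted(
    zip(feature_names, model.coef_[0]), key=lambda pair: pair[1], reverse=True
)
for name, coef in ranking:
    print(f"{name}\t{coef:.6f}")
```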

Feature ranking:

Feature	Coefficient
age_band	1.187519
avg_glucose_level	0.880860
hypertension	0.551390
heart_disease	0.288187
gender	0.073007
smoking_status	-0.053451
bmi	-0.211967

Here’s an interpretation of the feature ranking:

Age is the strongest factor: age_band has the largest coefficient (1.19), so older age bands are associated with a higher likelihood of stroke.
Average glucose level is the second most significant feature, with a coefficient of 0.88.
BMI has a coefficient of -0.21: in this model, a higher BMI is associated with a lower likelihood of stroke.
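To make the coefficients easier to read, they can be converted to odds ratios, since in logistic regression exp(coefficient) is the multiplicative change in the odds of the target for a one-unit increase in the feature. The sketch below uses the coefficients from the table above. How the features were scaled or encoded is not stated, so "one unit" refers to whatever scale the model was trained on, and ordering by absolute value is a choice made here rather than the ranking used in the table.

```python
from math import exp

# Coefficients as reported in the feature ranking table.
coefficients = {
    "age_band": 1.187519,
    "avg_glucose_level": 0.880860,
    "hypertension": 0.551390,
    "heart_disease": 0.288187,
    "gender": 0.073007,
    "smoking_status": -0.053451,
    "bmi": -0.211967,
}

# exp(coef) > 1 raises the odds of the target; exp(coef) < 1 lowers them.
odds_ratios = {name: exp(coef) for name, coef in coefficients.items()}

# Order by strength of effect regardless of direction.
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)

for name in ranked:
    print(f"{name}: coef={coefficients[name]:+.3f}, odds ratio={odds_ratios[name]:.2f}")
```

Read this way, a one-unit increase in age_band multiplies the odds of stroke by roughly 3.3, while a one-unit increase in bmi multiplies them by roughly 0.81, with the other features held fixed.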

Challenges faced

Class imbalance: one class significantly outnumbers the other (stroke cases are far rarer than non-stroke cases), which can lead to biased models.
Choosing an appropriate model that balances complexity and interpretability.
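Class imbalance matters because accuracy alone can look strong when one class dominates. The sketch below uses hypothetical counts (950 negatives and 50 positives, not the dataset's actual counts) to show a model that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    actual_pos = sum(1 for t in y_true if t == positive)
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical imbalanced labels: 50 positives, 950 negatives.
y_true = [1] * 50 + [0] * 950

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

The trivial model is right on every negative case and wrong on every positive one, so its accuracy is high while its recall is zero. Reporting recall, precision, or F1 next to accuracy, or rebalancing the training data, are common ways to guard against this.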

THANK YOU