Seeing the Unseen

Predicting Brain Stroke in a Patient

Setup

Code

if (!require("pacman")) 
  install.packages("pacman")

pacman::p_load(here,dplyr)

Objective

The central objective of this project is to evaluate the accuracy of various prediction models, shedding light on their effectiveness in early stroke detection, which could significantly impact patient outcomes and healthcare resource allocation.

Dataset

Code

brain_stroke <- read.csv(here("data","brain_stroke.csv"))

The Brain Stroke Dataset is taken from Kaggle datasets and consists of 4982 rows and 11 columns. This dataset provides a substantial pool of patient records and attributes for analysis. Our motivation for this project is rooted in the pressing need to enhance early stroke detection and risk assessment, two critical factors directly impacting patient care and stroke prevention strategies.

Attribute Information

Column	Data Type	Description
Gender	`character`	“Male”, “Female” or “Other”
Age	`integer`	Age of the patient
Hypertension	`boolean`	0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
Ever-married	`character`	“No” or “Yes”
Work Type	`character`	“Children”, “Govt. job”, “Never worked”, “Private” or “Self-employed”
Residence type	`character`	“Rural” or “Urban”
Avg glucose level	`numeric`	Average glucose level in blood
BMI	`numeric`	Body mass index
Smoking status	`character`	“formerly smoked”, “never smoked”, “smokes” or “Unknown”
Stroke	`boolean`	1 if the patient had a stroke or 0 if not

Why we chose this dataset?

We selected the “Brain Stroke Dataset” for its comprehensiveness and relevance to stroke prediction. The 11 columns encompass essential patient information, including demographic details, medical history, lifestyle factors, and the presence or absence of stroke. By harnessing the insights from this dataset, we aim to create a predictive model to accurately classify whether an individual is at a high risk of brain stroke, ultimately enabling timely interventions and tailored preventive measures.

Approach

We will consider the attributes for analysis and perform Exploratory Data Analysis to identify the general patterns in the dataset.
Develop classification models using K-Nearest Neighbors, Random Forest and Decision Tree and pick the best model in identifying the early stage risk of brain stroke in a patient.
Using the best model, we will check the significance of each attribute in classifying the risk level of a patient.

Questions

1) How accurate are various prediction models for detecting a stroke in a patient?

2) How are certain lifestyle changes ranked on the basis of their importance in reducing the possibility of a stroke?

Analysis plan

Question 1 -

Data Pruning and Cleaning
- Implement various Imputation techniques to replace NA values
- Possibly remove various NA values
- Prune data values that are not needed
Exploratory Data Analysis
- Summary statistics of the data
- Create simple graphs such as box plot and histograms
Create Visualizations
- Visualize various variables that could affect brain strokes such as age, heart disease, avg glucose level, hypertension and smoking status
Prediction Model Algorithms
- K-Nearest Neighbors Model implementation
- Random Forest Model implementation
- Decision Tree Model implementation
- Other models if need
Evaluations and Results
- Compare F1 score, Accuracy, Precision and Recall between the models
- Compare error metrics between models to determine which had the best performance
- Choose overall best model and give explanation as to why this model worked the best

Question 2 -

Pick the overall best model from the first question
Calculate performance metrics like F1 score, accuracy, precision and recall excluding each attribute one at a time
Based on the above results estimate how significant each attribute and estimate the ranking order

Plan of Attack

Week	Weekly Tasks	Team members involved
Till November 8^th	Explore the dataset and the problem statement	Everyone
-	Complete the proposal	Everyone
Nov. 9^th - 14^th	Exploratory Data Analysis	Abhishek, Shreya
-	Data cleaning and Data pre-processing based on EDA	Daniel, Ram
Nov. 15^th - 21^st	Data Visualization	Kashyap, Daniel
-	Training and Testing	Abhishek, Shreya
Nov. 22^nd - 29^th	K-Nearest Neighbors Model implementation	Kashyap, Ram
-	Random Forest Model implementation	Daniel, Ram
Nov. 30^th - December 7^th	Model Evaluation	Abhishek, Shreya
-	Model Comparison	Everyone
-	Finding the attribute significance	Kashyap, Ram
Dec. 8^th - 11^th	Refining the code with comments and final changes	Everyone
-	Write-up and presentation for the project	Everyone

Repo Organization

The following are the folders involved in the Project repository.

‘data/’: Used for storing any necessary data files for the project, such as input files.
‘images/’: Used for storing image files used in the project.
‘_extra/’: Used to brainstorm our analysis which won’t impact our project workflow.
‘_freeze/’: This folder is used to store the generated files during the build process. These files represent the frozen state of the website at a specific point in time.
‘.github/’: Folder for storing github templates and workflow.