Seeing the Unseen

Predicting Brain Stroke in a Patient

Setup

Code
if (!require("pacman")) 
  install.packages("pacman")

pacman::p_load(here,dplyr)

Objective

The central objective of this project is to evaluate the accuracy of various prediction models, shedding light on their effectiveness in early stroke detection, which could significantly impact patient outcomes and healthcare resource allocation.

Dataset

Code
brain_stroke <- read.csv(here("data","brain_stroke.csv"))

The Brain Stroke Dataset is taken from Kaggle datasets and consists of 4982 rows and 11 columns. This dataset provides a substantial pool of patient records and attributes for analysis. Our motivation for this project is rooted in the pressing need to enhance early stroke detection and risk assessment, two critical factors directly impacting patient care and stroke prevention strategies.

Attribute Information

Column Data Type Description
Gender character “Male”, “Female” or “Other”
Age integer Age of the patient
Hypertension boolean 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
Ever-married character “No” or “Yes”
Work Type character “Children”, “Govt. job”, “Never worked”, “Private” or “Self-employed”
Residence type character “Rural” or “Urban”
Avg glucose level numeric Average glucose level in blood
BMI numeric Body mass index
Smoking status character “formerly smoked”, “never smoked”, “smokes” or “Unknown”
Stroke boolean 1 if the patient had a stroke or 0 if not

Why we chose this dataset?

We selected the “Brain Stroke Dataset” for its comprehensiveness and relevance to stroke prediction. The 11 columns encompass essential patient information, including demographic details, medical history, lifestyle factors, and the presence or absence of stroke. By harnessing the insights from this dataset, we aim to create a predictive model to accurately classify whether an individual is at a high risk of brain stroke, ultimately enabling timely interventions and tailored preventive measures.

Approach

  • We will consider the attributes for analysis and perform Exploratory Data Analysis to identify the general patterns in the dataset.

  • Develop classification models using K-Nearest Neighbors, Random Forest and Decision Tree and pick the best model in identifying the early stage risk of brain stroke in a patient.

  • Using the best model, we will check the significance of each attribute in classifying the risk level of a patient.

Questions

1) How accurate are various prediction models for detecting a stroke in a patient?

2) How are certain lifestyle changes ranked on the basis of their importance in reducing the possibility of a stroke?

Analysis plan

Question 1 -

  • Data Pruning and Cleaning

    • Implement various Imputation techniques to replace NA values

    • Possibly remove various NA values

    • Prune data values that are not needed

  • Exploratory Data Analysis

    • Summary statistics of the data

    • Create simple graphs such as box plot and histograms

  • Create Visualizations

    • Visualize various variables that could affect brain strokes such as age, heart disease, avg glucose level, hypertension and smoking status
  • Prediction Model Algorithms

    • K-Nearest Neighbors Model implementation

    • Random Forest Model implementation

    • Decision Tree Model implementation

    • Other models if need

  • Evaluations and Results

    • Compare F1 score, Accuracy, Precision and Recall between the models

    • Compare error metrics between models to determine which had the best performance

    • Choose overall best model and give explanation as to why this model worked the best

Question 2 -

  • Pick the overall best model from the first question

  • Calculate performance metrics like F1 score, accuracy, precision and recall excluding each attribute one at a time

  • Based on the above results estimate how significant each attribute and estimate the ranking order

Plan of Attack

Week Weekly Tasks Team members involved
Till November 8th Explore the dataset and the problem statement Everyone
- Complete the proposal Everyone
Nov. 9th - 14th Exploratory Data Analysis Abhishek, Shreya
- Data cleaning and Data pre-processing based on EDA Daniel, Ram
Nov. 15th - 21st Data Visualization Kashyap, Daniel
- Training and Testing Abhishek, Shreya
Nov. 22nd - 29th K-Nearest Neighbors Model implementation Kashyap, Ram
- Random Forest Model implementation Daniel, Ram
Nov. 30th - December 7th Model Evaluation Abhishek, Shreya
- Model Comparison Everyone
- Finding the attribute significance Kashyap, Ram
Dec. 8th - 11th Refining the code with comments and final changes Everyone
- Write-up and presentation for the project Everyone

Repo Organization

The following are the folders involved in the Project repository.

  • ‘data/’: Used for storing any necessary data files for the project, such as input files.

  • ‘images/’: Used for storing image files used in the project.

  • ‘_extra/’: Used to brainstorm our analysis which won’t impact our project workflow.

  • ‘_freeze/’: This folder is used to store the generated files during the build process. These files represent the frozen state of the website at a specific point in time.

  • ‘.github/’: Folder for storing github templates and workflow.