Code
if (!require("pacman"))
install.packages("pacman")
pacman::p_load(here,dplyr)Predicting Brain Stroke in a Patient
The central objective of this project is to evaluate the accuracy of various prediction models, shedding light on their effectiveness in early stroke detection, which could significantly impact patient outcomes and healthcare resource allocation.
The Brain Stroke Dataset is taken from Kaggle datasets and consists of 4982 rows and 11 columns. This dataset provides a substantial pool of patient records and attributes for analysis. Our motivation for this project is rooted in the pressing need to enhance early stroke detection and risk assessment, two critical factors directly impacting patient care and stroke prevention strategies.
| Column | Data Type | Description |
|---|---|---|
| Gender | character |
“Male”, “Female” or “Other” |
| Age | integer |
Age of the patient |
| Hypertension | boolean |
0 if the patient doesn’t have hypertension, 1 if the patient has hypertension |
| Ever-married | character |
“No” or “Yes” |
| Work Type | character |
“Children”, “Govt. job”, “Never worked”, “Private” or “Self-employed” |
| Residence type | character |
“Rural” or “Urban” |
| Avg glucose level | numeric |
Average glucose level in blood |
| BMI | numeric |
Body mass index |
| Smoking status | character |
“formerly smoked”, “never smoked”, “smokes” or “Unknown” |
| Stroke | boolean |
1 if the patient had a stroke or 0 if not |
We selected the “Brain Stroke Dataset” for its comprehensiveness and relevance to stroke prediction. The 11 columns encompass essential patient information, including demographic details, medical history, lifestyle factors, and the presence or absence of stroke. By harnessing the insights from this dataset, we aim to create a predictive model to accurately classify whether an individual is at a high risk of brain stroke, ultimately enabling timely interventions and tailored preventive measures.
We will consider the attributes for analysis and perform Exploratory Data Analysis to identify the general patterns in the dataset.
Develop classification models using K-Nearest Neighbors, Random Forest and Decision Tree and pick the best model in identifying the early stage risk of brain stroke in a patient.
Using the best model, we will check the significance of each attribute in classifying the risk level of a patient.
1) How accurate are various prediction models for detecting a stroke in a patient?
2) How are certain lifestyle changes ranked on the basis of their importance in reducing the possibility of a stroke?
Data Pruning and Cleaning
Implement various Imputation techniques to replace NA values
Possibly remove various NA values
Prune data values that are not needed
Exploratory Data Analysis
Summary statistics of the data
Create simple graphs such as box plot and histograms
Create Visualizations
Prediction Model Algorithms
K-Nearest Neighbors Model implementation
Random Forest Model implementation
Decision Tree Model implementation
Other models if need
Evaluations and Results
Compare F1 score, Accuracy, Precision and Recall between the models
Compare error metrics between models to determine which had the best performance
Choose overall best model and give explanation as to why this model worked the best
Pick the overall best model from the first question
Calculate performance metrics like F1 score, accuracy, precision and recall excluding each attribute one at a time
Based on the above results estimate how significant each attribute and estimate the ranking order
| Week | Weekly Tasks | Team members involved |
|---|---|---|
| Till November 8th | Explore the dataset and the problem statement | Everyone |
| - | Complete the proposal | Everyone |
| Nov. 9th - 14th | Exploratory Data Analysis | Abhishek, Shreya |
| - | Data cleaning and Data pre-processing based on EDA | Daniel, Ram |
| Nov. 15th - 21st | Data Visualization | Kashyap, Daniel |
| - | Training and Testing | Abhishek, Shreya |
| Nov. 22nd - 29th | K-Nearest Neighbors Model implementation | Kashyap, Ram |
| - | Random Forest Model implementation | Daniel, Ram |
| Nov. 30th - December 7th | Model Evaluation | Abhishek, Shreya |
| - | Model Comparison | Everyone |
| - | Finding the attribute significance | Kashyap, Ram |
| Dec. 8th - 11th | Refining the code with comments and final changes | Everyone |
| - | Write-up and presentation for the project | Everyone |
The following are the folders involved in the Project repository.
‘data/’: Used for storing any necessary data files for the project, such as input files.
‘images/’: Used for storing image files used in the project.
‘_extra/’: Used to brainstorm our analysis which won’t impact our project workflow.
‘_freeze/’: This folder is used to store the generated files during the build process. These files represent the frozen state of the website at a specific point in time.
‘.github/’: Folder for storing github templates and workflow.