Exploratory Data Analysis

Author

The Null Wranglers Team

Load Libraries
suppressWarnings(library(tidyverse))
library(knitr)
library(lubridate)
library(ggplot2)

This section covers an Exploratory Data Analysis of the data. All data for this project has been provided by the University of Arizona. This type of analysis is vital in order to gain a full understanding of the data that will guide future analysis techniques. The project will utilize two separate data sets.

Grade Value Data

The first one contains information about the breakdown of what grade values students received in a a course. A description for the columns of the data can be found in the proposal.

Read in Data
# read in data
dew_data <- read.csv("data/clean_dew_data.csv")
kable(head(dew_data))
X College Department Subject.Code Catalog.Number Course.Description Course.Level Total.Student.Count D_GRADE_COUNT FAIL_GRADE_COUNT WITHDRAW_GRADE_COUNT DEW_COUNT PASS_GRADE_COUNT WITHDRAW_FULLMED_GRADE_COUNT INCOMPLETE_UNGRADED_COUNT TERM_LD ACAD_YR_SID Percent.D.Grade Percent.E.Grade Percent.W.Grade Percent.DEW Percent.Passed Per.Full..Medical.Withdrawal Per.Ungraded..Incomplete
1 College of Engineering Aerospace & Mechanical Engr ABE 489A Fab Tech Micro+Nanodevic Upper Division 10 0 0 0 0 10 0 0 Fall 2018 2019 0.0 0.0 0.0 0.0 100.0 0.0 0
2 College of Engineering Aerospace & Mechanical Engr AME 105 Introduction to MATLAB I Lower Division 196 17 2 1 20 173 3 0 Fall 2018 2019 8.7 1.0 0.5 10.2 88.3 1.5 0
3 College of Engineering Aerospace & Mechanical Engr AME 105 Introduction to MATLAB I Lower Division 160 22 17 3 42 115 3 0 Fall 2019 2020 13.8 10.6 1.9 26.3 71.9 1.9 0
4 College of Engineering Aerospace & Mechanical Engr AME 105 Introduction to MATLAB I Lower Division 111 11 4 5 20 82 9 0 Fall 2020 2021 9.9 3.6 4.5 18.0 73.9 8.1 0
5 College of Engineering Aerospace & Mechanical Engr AME 105 Introduction to MATLAB I Lower Division 254 11 5 6 22 231 1 0 Spring 2019 2019 4.3 2.0 2.4 8.7 90.9 0.4 0
6 College of Engineering Aerospace & Mechanical Engr AME 105 Introduction to MATLAB I Lower Division 213 3 9 19 31 182 0 0 Spring 2020 2020 1.4 4.2 8.9 14.6 85.4 0.0 0

This data set contains 13283 observations with a total of 24 variables.

Courses Per Semester and Enrollment

The data set spans over 6 different semester that start from the Fall of 2018 and go until the Spring of 2021. During a cleaning process the winter and summer semester were removed. The table below shows the breakdown of course offerings by college.

summary of total courses
# add the column names
table1<- dew_data %>% group_by(College) %>% count(TERM_LD)
table1 <- table1 %>% pivot_wider(names_from = TERM_LD, values_from = n)
kable(table1, caption = "Total Course Offering for Each College per Semester")
Total Course Offering for Each College per Semester
College Fall 2018 Fall 2019 Fall 2020 Spring 2019 Spring 2020 Spring 2021
College of Agric and Life Sci 219 224 247 230 236 225
College of Applied Sci & Tech 124 117 112 123 119 107
College of Education 89 96 96 92 97 105
College of Engineering 149 157 143 165 171 169
College of Fine Arts 260 259 230 265 259 239
College of Humanities 242 245 257 244 272 276
College of Medicine - Tucson 22 27 28 29 35 36
College of Nursing 12 16 20 11 21 21
College of Public Health 27 31 30 27 31 34
College of Science 274 269 279 290 290 299
College of Social & Behav Sci 540 531 537 553 585 566
Colleges of Letters Arts & Sci 7 7 NA 5 4 NA
Eller College of Management 125 128 121 137 141 125
Graduate College 12 13 16 14 14 13
James E Rogers College of Law 20 20 25 21 24 29
R Ken Coit College of Pharmacy 5 6 7 5 8 8
W.A. Franke Honors College 15 11 10 21 17 18

After examining this data, the decision was made to remove colleges that offered less than 50 courses per semester.

remove low course offering colleges
dew_data <- dew_data %>% filter(College != "College of Medicine - Tucson",
                                College != "College of Nursing",
                                College != "College of Public Health",
                                College != "Colleges of Letters Arts & Sci",
                                College != "Graduate College",
                                College != "James E Rogers College of Law",
                                College != "R Ken Coit College of Pharmacy",
                                College != "W.A. Franke Honors College")

After removing these columns, an examination of the total enrollment in these courses was done.

summary of total enrollment
# total enrollment table
enrollment <- aggregate(Total.Student.Count~College + TERM_LD, data=dew_data, sum)
enrollment <- enrollment %>% pivot_wider(names_from = TERM_LD, values_from = Total.Student.Count)
kable(enrollment, caption =  "Total Student Enrollment for Each College per Semester")
Total Student Enrollment for Each College per Semester
College Fall 2018 Fall 2019 Fall 2020 Spring 2019 Spring 2020 Spring 2021
College of Agric and Life Sci 13310 14409 16574 11688 13396 13945
College of Applied Sci & Tech 2703 3094 3735 2766 2937 3772
College of Education 4630 4947 6200 4484 4935 5274
College of Engineering 9310 9515 8576 8298 8151 7874
College of Fine Arts 9899 10276 9243 9216 9686 8439
College of Humanities 19290 19456 19706 18553 18814 18276
College of Science 39301 39964 39898 35013 34275 34832
College of Social & Behav Sci 33478 31942 31961 31092 31231 30813
Eller College of Management 20425 19985 19586 18923 18712 18047

It was then important to explore the grade value outcomes. In this project, the grade value outcome will be used to determine success and failure for a class. A successful grade outcome will be a letter grade of C or higher. A failure grade will be a letter grade of D, E, or W. An E grade is the representation of an fail grade at the University of Arizona as this institution uses F grade values in Pass/Fail courses only and a W grade represents a withdraw from the course by the student. The data does not contain information for students who drop the course within the allowed drop/add period at the beginning of the semester. The table below shows the D.E.W. grade counts in each college per semester.

summary of total enrollment
dew_total <- aggregate(DEW_COUNT~College + TERM_LD, data=dew_data, sum)
dew_total <- dew_total %>% pivot_wider(names_from = TERM_LD, values_from = DEW_COUNT)
kable(dew_total, caption =  "Total D.E.W. Grade Counts for Each College per Semester")
Total D.E.W. Grade Counts for Each College per Semester
College Fall 2018 Fall 2019 Fall 2020 Spring 2019 Spring 2020 Spring 2021
College of Agric and Life Sci 1405 1458 2041 1160 1039 1625
College of Applied Sci & Tech 308 391 481 349 328 534
College of Education 325 397 606 323 292 429
College of Engineering 723 727 865 690 419 778
College of Fine Arts 832 813 896 669 717 636
College of Humanities 2760 2678 3081 2509 2377 2934
College of Science 7879 7389 7148 6705 3871 6023
College of Social & Behav Sci 4704 4296 5043 4263 3348 4615
Eller College of Management 2501 2175 2203 1906 950 1730

After these summaries were, a decision was made to further reduce the data to just the Top 5 Colleges for both enrollment and D.E.W. counts.

summary of total enrollment
dew_data <- dew_data %>% filter(College != "College of Applied Sci & Tech",
                                College != "College of Education",
                                College != "College of Engineering",
                                College != "College of Fine Arts")

Overall Averages

Once the data was reduced, the remaining data contained 8707 observations with 24 variables. Averages for course enrollment and well as D.E.W. averages were calculated and are displayed below.

average number of students per course
student_per_course <- aggregate(Total.Student.Count ~College + TERM_LD, data=dew_data, mean)
# round to nearest whole digit
student_per_course$Total.Student.Count <- round(student_per_course$Total.Student.Count)
# pivot wider
student_per_course <- student_per_course %>% pivot_wider(names_from = TERM_LD, values_from = Total.Student.Count)
# display table
kable(student_per_course, caption =  "Average Number of Students per Course by College Each Semester")
Average Number of Students per Course by College Each Semester
College Fall 2018 Fall 2019 Fall 2020 Spring 2019 Spring 2020 Spring 2021
College of Agric and Life Sci 61 64 67 51 57 62
College of Humanities 80 79 77 76 69 66
College of Science 143 149 143 121 118 116
College of Social & Behav Sci 62 60 60 56 53 54
Eller College of Management 163 156 162 138 133 144
average number of students per course
dew_per_course <- aggregate(Percent.DEW ~College + TERM_LD, data=dew_data, mean)
# round to 2 digits
dew_per_course$Percent.DEW <- round(dew_per_course$Percent.DEW, 2)
# pivot wider
dew_per_course <- dew_per_course %>% pivot_wider(names_from = TERM_LD, values_from = Percent.DEW)
# display table
kable(dew_per_course, caption =  "Average Number of D, F, and W Grade(percentage) per Course by College Each Semester")
Average Number of D, F, and W Grade(percentage) per Course by College Each Semester
College Fall 2018 Fall 2019 Fall 2020 Spring 2019 Spring 2020 Spring 2021
College of Agric and Life Sci 9.30 8.46 11.42 8.40 7.33 10.28
College of Humanities 13.85 13.62 14.79 13.30 12.92 15.32
College of Science 14.84 14.05 15.08 14.23 9.99 13.83
College of Social & Behav Sci 13.22 13.71 14.50 13.56 11.37 13.64
Eller College of Management 7.66 7.54 6.97 6.76 3.63 5.78

Despite enrollment staying steady, we see a noticeable decrease in DEW outcomes for the spring of 2020. Followed by an increase for the fall of 2020.

Averages by Percentage

As certain courses had total enrollment over 1000, the issue that the D.E.W. count from these courses could skew the data was considered. The decision to base further analysis on the D.E.W. percentage was made.

Mean DEW Counts
dew_data1 <- read.csv("data/study_data.csv")

# Define the order of variables
variable_order <- c( "Percent.Passed", "Percent.D.Grade", "Percent.E.Grade", "Percent.W.Grade", "Percent.DEW")

# Creating a data frame with the means
means_data <- data.frame(
  Variable = factor(variable_order, levels = variable_order),
  Mean_Value = c(
    mean(dew_data1$Percent.Passed),
    mean(dew_data1$Percent.D.Grade),
    mean(dew_data1$Percent.E.Grade),
    mean(dew_data1$Percent.W.Grade),
    mean(dew_data1$Percent.DEW)
  )
)

# Creating a bar plot
mean_percents <- ggplot(means_data, aes(x = Variable, y = Mean_Value, fill = Variable)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Mean_Value, 2)), vjust = -0.5, size = 3) +  # Add text labels
  labs(title = "Mean of Percentages of Pass and D.E.W Grades",
       x = "Variables",
       y = "Mean Value") +
  theme_minimal()+
  theme_linedraw()+
  theme(legend.position = "bottom", panel.grid = element_blank())
ggsave("images/mean_percents.png", plot=mean_percents)
Saving 7 x 5 in image
Mean Table of DEW Counts
# Group by College and calculate the mean for each grade-related column
college_means_data <- dew_data %>%
  group_by(College) %>%
  summarise(PASS_GRADE_COUNT = mean(PASS_GRADE_COUNT),
            D_GRADE_COUNT = mean(D_GRADE_COUNT),
            FAIL_GRADE_COUNT = mean(FAIL_GRADE_COUNT),
            WITHDRAW_GRADE_COUNT = mean(WITHDRAW_GRADE_COUNT),
            DEW_COUNT = mean(DEW_COUNT))

# Remove underscores from column names
names(college_means_data) <- str_replace_all(names(college_means_data), "_", " ")

# Round off mean values to two decimal points
college_means_data[, -1] <- round(college_means_data[, -1], 2)

# Create a kable table for means of each grade-related column for each college
kable(college_means_data, 
               caption = "Means for Total Pass, D, E, W, and Total DEW Counts Across Colleges")
Means for Total Pass, D, E, W, and Total DEW Counts Across Colleges
College PASS GRADE COUNT D GRADE COUNT FAIL GRADE COUNT WITHDRAW GRADE COUNT DEW COUNT
College of Agric and Life Sci 53.17 1.81 2.74 1.77 6.32
College of Humanities 62.16 2.36 4.89 3.38 10.64
College of Science 105.84 7.36 8.82 6.76 22.94
College of Social & Behav Sci 48.42 1.90 3.60 2.43 7.93
Eller College of Management 132.50 5.62 4.97 4.16 14.76

Course Data

The second data set contain information about the attributes of the course. The first data set was merged with this using the subject code, course code and term offered. A full cleaning script can be found in the extra folder in the GitHub repository. The merged data contains 6568 observations and 70 variables.

Load data
# read in data
study_data <- read.csv("data/study_data.csv")
kable(head(study_data))
X Course.Identifier College Department Merged Subject.Code Catalog.Number Course.Description Course.Level Total.Student.Count D_GRADE_COUNT FAIL_GRADE_COUNT WITHDRAW_GRADE_COUNT DEW_COUNT PASS_GRADE_COUNT WITHDRAW_FULLMED_GRADE_COUNT INCOMPLETE_UNGRADED_COUNT TERM_LD ACAD_YR_SID Percent.D.Grade Percent.E.Grade Percent.W.Grade Percent.DEW Percent.Passed Per.Full..Medical.Withdrawal Per.Ungraded..Incomplete P.F.Opt Units Mode Class.. Sections Total.Enroll Max.Enroll Rm.Cap Early_Morning Mid_Morning Early_Afternoon Mid_Afternoon Evening Asynchronous Monday Tuesday Wednesday Thursday Friday Saturday Sunday Laboratory Lecture Colloquim Seminar Workshop Discussion Studio Practicum In_Person Full_Online IntractTV Hybrid Live_Online Reg_Session First_Half_Session Second_Half_Session First_Third_Session Second_Third_Session Third_Third_Session Ten_Week Thirteen_Week Other College_Number
1 Fall 2019_ACBS_102L College of Agric and Life Sci Animal&Biomedical Sciences-Ins ACBS_102L ACBS 102L Intro to Animal Sci Lab Lower Division 250 8 7 3 18 229 3 0 Fall 2019 2020 3.2 2.8 1.2 7.2 91.6 1.2 0 1 In Person 450809 8 250 280 8 0 0 8 0 0 0 0 4 0 4 0 0 0 8 0 0 0 0 0 0 0 8 0 0 0 0 8 0 0 0 0 0 0 0 0 1
2 Fall 2019_ACBS_102R College of Agric and Life Sci Animal&Biomedical Sciences-Ins ACBS_102R ACBS 102R Introd to Animal Science Lower Division 267 7 10 2 19 244 4 0 Fall 2019 2020 2.6 3.7 0.7 7.1 91.4 1.5 0 3 In Person 41166 1 267 299 300 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1
3 Fall 2019_ACBS_142 College of Agric and Life Sci Animal&Biomedical Sciences-Ins ACBS_142 ACBS 142 Intro Anml Racing Indus Lower Division 28 0 2 0 2 26 0 0 Fall 2019 2020 0.0 7.1 0.0 7.1 92.9 0.0 0 2 In Person 25697 1 28 20 80 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1
4 Fall 2019_ACBS_160D1 College of Agric and Life Sci Animal&Biomedical Sciences-Ins ACBS_160D1 ACBS 160D1 Hum+Anml Interl Dom-Pres Lower Division 681 30 72 8 110 561 10 0 Fall 2019 2020 4.4 10.6 1.2 16.2 82.4 1.5 0 3 In Person 95423 2 481 707 912 1 1 0 0 0 0 2 0 2 0 2 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 0 0 0 0 1
5 Fall 2019_ACBS_160D1 College of Agric and Life Sci Animal&Biomedical Sciences-Ins ACBS_160D1 ACBS 160D1 Hum+Anml Interl Dom-Pres Lower Division 681 30 72 8 110 561 10 0 Fall 2019 2020 4.4 10.6 1.2 16.2 82.4 1.5 0 3 FullOnline 67075 1 200 200 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1
6 Fall 2019_ACBS_195F College of Agric and Life Sci Animal&Biomedical Sciences-Ins ACBS_195F ACBS 195F Careers/Veterinary Sci Lower Division 205 11 17 1 29 173 3 0 Fall 2019 2020 5.4 8.3 0.5 14.1 84.4 1.5 0 1 In Person 38050 1 205 190 300 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1

Time and Day Totals

An exploration of the time of day courses were offered was done by each college. It is important to note that courses with more than one section may be offered at different times and are included this way in this data.

Total Courses for Time of day
# Create a table for each time slot
time_slots <- c("Early_Morning", "Mid_Morning", "Early_Afternoon", "Mid_Afternoon", "Evening", "Asynchronous")

tables_list <- lapply(time_slots, function(slot) {
  table_data <- study_data %>%
    group_by(College) %>%
    summarise(Mean_Classes_Attended = round(sum(get(slot), na.rm = TRUE), 2))
  
  names(table_data)[2] <- paste("", slot)
  table_data
})

# Merge tables into a single table by College name
merged_table <- Reduce(function(x, y) merge(x, y, by = "College", all = TRUE), tables_list)

# Display the merged table
kable(merged_table)
College Early_Morning Mid_Morning Early_Afternoon Mid_Afternoon Evening Asynchronous
College of Agric and Life Sci 242 265 281 314 52 733
College of Humanities 250 480 361 324 89 896
College of Science 807 779 569 929 506 446
College of Social & Behav Sci 717 1084 866 940 249 2143
Eller College of Management 239 196 154 263 24 148

We then considered the total number of classes offered on each day of the week and asynchronously.

Total courses for Day of Week
# Create a table for each day of the week
weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "Asynchronous")

tables_list <- lapply(weekdays, function(day) {
  table_data <- study_data %>%
    group_by(College) %>%
    summarise(Mean_Classes_Attended = round(sum(get(day), na.rm = TRUE), 2))
  
  names(table_data)[2] <- paste("Mean", day)
  table_data
})

# Merge tables into a single table by College name
merged_table <- Reduce(function(x, y) merge(x, y, by = "College", all = TRUE), tables_list)

# Display the merged table
kable(merged_table)
College Mean Monday Mean Tuesday Mean Wednesday Mean Thursday Mean Friday Mean Saturday Mean Sunday Mean Asynchronous
College of Agric and Life Sci 429 502 466 497 171 0 0 733
College of Humanities 938 954 950 954 425 0 0 896
College of Science 1158 1429 1509 1385 941 0 0 446
College of Social & Behav Sci 1569 1544 1607 1526 1337 0 0 2143
Eller College of Management 358 391 375 385 84 0 0 148

We removed Saturday and Sunday from future analysis.

Modality

We explored the totals for each mode of delivery for each college and semester.

Total Courses for Mode
# Mean Number of students enrolled in in-person vs online vs … (BC to BH) for different colleges - Utkarsha

# Create a table for each course type
course_types <- c("In_Person", "Full_Online", "IntractTV", "Hybrid", "Live_Online")

tables_list <- lapply(course_types, function(course) {
  table_data <- study_data %>%
    group_by(College) %>%
    summarise(Mean_Enrollment = round(sum(get(course), na.rm = TRUE), 2))
  
  names(table_data)[2] <- paste("Enrollment_", course)
  table_data
})

# Merge tables into a single table by College name
merged_table <- Reduce(function(x, y) merge(x, y, by = "College", all = TRUE), tables_list)

# Display the merged table
kable(merged_table)
College Enrollment_ In_Person Enrollment_ Full_Online Enrollment_ IntractTV Enrollment_ Hybrid Enrollment_ Live_Online
College of Agric and Life Sci 628 702 0 257 300
College of Humanities 935 850 0 94 521
College of Science 2100 429 0 760 747
College of Social & Behav Sci 2175 2118 0 324 1382
Eller College of Management 473 144 0 142 265

After this we no longer considered the mode of IntractTV.

Session

We then looked at the total courses offered for each different session length.

Total Courses for Each Session
# Create a table for each course type
course_types <- c("Reg_Session", "First_Half_Session", "Second_Half_Session", "First_Third_Session", "Second_Third_Session", "Third_Third_Session", "Ten_Week", "Thirteen_Week", "Other")

tables_list <- lapply(course_types, function(course) {
  table_data <- study_data %>%
    group_by(College) %>%
    summarise(Mean_Enrollment = round(sum(get(course), na.rm = TRUE), 2))
  
  names(table_data)[2] <- paste("", course)
  table_data
})

# Merge tables into a single table by College name
merged_table <- Reduce(function(x, y) merge(x, y, by = "College", all = TRUE), tables_list)

# Display the merged table
kable(merged_table)
College Reg_Session First_Half_Session Second_Half_Session First_Third_Session Second_Third_Session Third_Third_Session Ten_Week Thirteen_Week Other
College of Agric and Life Sci 1518 180 176 1 8 0 0 0 4
College of Humanities 1866 212 315 0 0 0 0 0 7
College of Science 3563 111 147 0 0 0 0 0 215
College of Social & Behav Sci 4458 566 846 13 13 14 0 0 89
Eller College of Management 894 65 65 0 0 0 0 0 0

After this, we decided to only consider Regular Session, First Half Session, and Second Half Session in further analysis.

Summary

After exploring the data, we were able to remove some colleges and course attributes from further analysis. We were also able to gain a full understanding of the information in the data and begin to develop ideas for further analysis.