Consumer Behaviour Analysis

INFO 523 - Project Final

Author

Affiliation

Pattern Pioneers - Vishal, Joel, Pranshu, Shashwat, Bharath

School of Information, University of Arizona

Abstract

This project aims to conduct a comprehensive analysis of consumer behavior using the Amazon Consumer Behavior Dataset obtained from Kaggle. By delving into the intricacies of customer interactions, browsing patterns, and reviews, our goal is to unearth nuanced insights that can be harnessed to not only enhance the customer experience but also to strategically optimize business approaches. Leveraging a combination of statistical analyses and advanced visualization techniques, this study aims to provide actionable intelligence for businesses navigating the ever-evolving landscape of e-commerce.

Introduction

In the fast-paced world of online shopping, understanding what customers want is crucial for success. The Amazon Consumer Behavior Dataset is like a goldmine of information about how people shop online. It’s not just numbers; it tells a story about how and why people make choices when shopping on the internet. It goes beyond just purchases, giving us a complete picture of how customers interact and move through their online shopping experience. This analysis focuses on two main things: figuring out why some people leave their shopping carts without buying anything and understanding which groups of people prefer different types of products. The goal is to provide useful information to businesses so they can make smart decisions in the complex world of online commerce.

Question 1: What are the factors influencing the customer’s decision to abandon a purchase in their cart on Amazon?

The first question addresses the factors influencing a customer’s decision to abandon a purchase in their cart on Amazon. The analysis commences with Exploratory Data Analysis (EDA) in R, focusing on key variables such as Purchase Frequency, Product Search Method, and Customer Reviews Importance.

Approach

Our analysis begins with precise exploratory data analysis (EDA) in R, focusing on key variables like Purchase_Frequency, Product_Search_Method, and Customer_Reviews_Importance to understand their impact on cart abandonment. During data cleaning, we ensure proper formatting and address missing values in age, Gender, and Browsing_Frequency. We then apply statistical techniques like neural networks and decision trees to assess the influence of factors such as Personalized_Recommendation_Frequency and Search_Result_Exploration on cart abandonment. Visualization utilizes ggplot2 for creating comprehensive graphics, including correlation heat-maps and bar charts, to reveal insights into the interplay between cart abandonment and customer behavior metrics, aiming to enhance the shopping experience and reduce cart abandonment rates effectively.

Analysis

Our objective is to investigate and pinpoint potential reasons for cart abandonment using the data at hand. Initially, we examined outliers and conducted a preliminary analysis to extract insights from various feature columns. Our exploration covered improvement areas, purchase frequency, and the age distribution of the dataset. Among the available features, we selected purchase frequency, product search method, customer reviews importance, and cart abandonment factors to explore their correlations and identify patterns related to cart abandonment. Analyzing the correlation plot, we observed that product search method exhibits a stronger correlation with cart abandonment factors compared to other features. Subsequently, we utilized this feature to uncover insights using methods such as nnet and decision tree.

Despite the diligent efforts we achieved accuracy of 43 percent indicates that the model’s performance may be limited by various factors. Potential reasons for the performance could include but are not limited to:

Limited Data: The dataset is small or insufficiently representative to capture the complexity of the problem.

Model Complexity: The chosen neural network architecture may not be suitable for the problem at hand. Experimentation with different architectures could be beneficial.

Feature Engineering: Feature selection or engineering techniques might need further exploration to improve the model’s ability to learn relevant patterns.

Moving forward, efforts will be focused on refining the model architecture, exploring different neural network configurations, experimenting with additional features, and potentially acquiring more data to improve the model’s accuracy. This result, while not meeting the desired performance threshold, serves as a valuable starting point for further iterations and enhancements to achieve better predictive capabilities.

Question 2: Which demographic (on the basis of gender and age) is most likely to purchase a particular product category?

The second question focuses on determining which demographic (based on gender and age) is most likely to purchase a particular product category on Amazon. The team follows a structured approach involving data cleaning, summary statistics, visualization, and statistical modeling using logistic regression to predict the likelihood of purchasing different product categories based on demographic variables.

Approach

The approach that we took to solve determine which demographic is most likely to purchase a particular product category is as follows:

Data Cleaning: We checked the data set for any inconsistencies like missing values or incorrect data. Then we tried to handle the outliers if they needed to be removed.
Summary statistics: After data cleaning we performed the summary statistics for important columns like purchase category, age group, gender which helped us understand the characteristics and the distribution of data.
Visualization: After a few steps of data transformation which included splitting the rows having multiple purchase categories into multiple rows with same data in other columns we performed some visualizations using ggplot to represent the data with the bar plots which helped us to figure out some trends and patterns of our data.
Modelling: In the final part we performed statistical modelling to assess the likelihood of purchasing different product categories based on demographic variables like age and gender. We achieved this goal by using logistic regression models to predict the probability of a particular demographic to purchase each category.

Analysis

The analysis of the relationship between age, gender and different purchase categories produces a lot of benefits for companies. It provides a deeper understanding of the preferences and behaviors of the customer, which allows the companies to make informed decisions on their product development, pricing and positioning. It also guides the companies in developing a targeted marketing campaign to reach a specific demographic. By applying the logistic regression we achieved an accuracy of 72.6% to predict the product category a particular demographic is likely to be interested in buying.

Discussion

Companies need to know who is most likely to buy their product so they can advertise to them effectively. By understanding the age and gender of their customers, companies can save money by focusing their marketing on the right people. This can help them beat their competition by getting more of the people they want to buy their product. This will also enhance the user experience of the customers which will lead to higher levels of customer satisfaction which will result in cost optimization of the companies marketing budget, customer loyalty and recurring purchases.

--- title: "Consumer Behaviour Analysis" subtitle: "INFO 523 - Project Final" author: - name: "Pattern Pioneers - Vishal, Joel, Pranshu, Shashwat, Bharath" affiliations: - name: "School of Information, University of Arizona" description: "" format: html: code-tools: true code-overflow: wrap embed-resources: true editor: visual execute: warning: false echo: false --- ```{r install_packages, include=FALSE} ### Loading the required libraries if (!require(pacman)) install.packages(pacman) pacman::p_load(tidyverse, # Data wrangling dlookr, # Exploratory Data Analysis formattable, # Present neat table format gt, # Alternating formatting for the tables gtsummary, here, nnet, janitor, cluster, corrplot, dplyr, caret, formattable, rpart, rpart.plot) ``` ## Abstract This project aims to conduct a comprehensive analysis of consumer behavior using the Amazon Consumer Behavior Dataset obtained from Kaggle. By delving into the intricacies of customer interactions, browsing patterns, and reviews, our goal is to unearth nuanced insights that can be harnessed to not only enhance the customer experience but also to strategically optimize business approaches. Leveraging a combination of statistical analyses and advanced visualization techniques, this study aims to provide actionable intelligence for businesses navigating the ever-evolving landscape of e-commerce. ```{r read_data, include=FALSE} # Using the original data # Loading the csv into a variable using read_csv data <- read_csv(here("data", "Amazon_Customer_Behavior_Survey.csv")) # Removing unwanted column like "timestamp". data <- data %>% select(-Timestamp) %>% clean_names() ``` ## Introduction In the fast-paced world of online shopping, understanding what customers want is crucial for success. The Amazon Consumer Behavior Dataset is like a goldmine of information about how people shop online. It's not just numbers; it tells a story about how and why people make choices when shopping on the internet. It goes beyond just purchases, giving us a complete picture of how customers interact and move through their online shopping experience. This analysis focuses on two main things: figuring out why some people leave their shopping carts without buying anything and understanding which groups of people prefer different types of products. The goal is to provide useful information to businesses so they can make smart decisions in the complex world of online commerce. ## Question 1: What are the factors influencing the customer's decision to abandon a purchase in their cart on Amazon? The first question addresses the factors influencing a customer's decision to abandon a purchase in their cart on Amazon. The analysis commences with Exploratory Data Analysis (EDA) in R, focusing on key variables such as Purchase Frequency, Product Search Method, and Customer Reviews Importance. ### Approach Our analysis begins with precise exploratory data analysis (EDA) in R, focusing on key variables like Purchase_Frequency, Product_Search_Method, and Customer_Reviews_Importance to understand their impact on cart abandonment. During data cleaning, we ensure proper formatting and address missing values in age, Gender, and Browsing_Frequency. We then apply statistical techniques like neural networks and decision trees to assess the influence of factors such as Personalized_Recommendation_Frequency and Search_Result_Exploration on cart abandonment. Visualization utilizes ggplot2 for creating comprehensive graphics, including correlation heat-maps and bar charts, to reveal insights into the interplay between cart abandonment and customer behavior metrics, aiming to enhance the shopping experience and reduce cart abandonment rates effectively. ```{r get_cols, include=FALSE} colnames(data) ``` ```{r eda_1} # Calculate the counts of each category improvement_counts <- data %>% group_by(improvement_areas) %>% summarize(count = n()) %>% arrange(desc(count)) %>% ungroup() # Get the top 4 categories top_improvement_areas <- head(improvement_counts, 4)$improvement_areas # Create a new column 'grouped_improvement_areas' where categories outside the top 4 are labeled 'Others' data <- data %>% mutate(grouped_improvement_areas = ifelse(improvement_areas %in% top_improvement_areas, improvement_areas, 'Others')) # Reorder the factor levels to ensure "Others" is at the end data$grouped_improvement_areas <- factor(data$grouped_improvement_areas, levels = c(top_improvement_areas, "Others")) # Plot with top 4 categories and 'Others' at the end ggplot(data, aes(x=grouped_improvement_areas, fill=grouped_improvement_areas)) + geom_bar(color="black") + theme_minimal(base_size = 16) + theme(axis.text.x = element_text(angle = 0), legend.position = "none") + labs(title = "Distribution of Top 4 Improvement Areas and Others", x = "Improvement Areas", y = "Count") + scale_fill_brewer(palette="Set3") # Using a ColorBrewer palette for fill colors ``` ```{r eda_2} # Distribution of Age Among Respondents with a color ggplot(data, aes(x=age)) + geom_histogram(binwidth = 1, fill="skyblue", color="black") + # Setting fill and outline color for the bars theme_minimal(base_size = 16) + labs(title = "Distribution of Age Among Respondents", x = "Age", y = "Count") # Distribution of Purchase Frequency with colors ggplot(data, aes(x=purchase_frequency, fill=purchase_frequency)) + geom_bar(color="black") + scale_fill_brewer(palette="Set3") + # Using a ColorBrewer palette for fill colors theme_minimal(base_size = 16) + theme(axis.text.x = element_text(angle = 0), legend.position = "none") + labs(title = "Distribution of Purchase Frequency", x = "Purchase Frequency", y = "Count") ``` ```{r new_df, include=FALSE} # creating a new dataframe by selecting only the columns of our focus new_df <- data %>% select( purchase_frequency, product_search_method, customer_reviews_importance, cart_abandonment_factors ) ``` ```{r get_outliers, include=FALSE} # getting some idea on the dataset by visualizing the outliers new_df |> plot_outlier() # Interpretation: There are not many outliers such that data transformation is needed for this data ``` ```{r refactor_new_df, include=FALSE} # modifying the column data type to factor and then to integer new_df$purchase_frequency <- as.factor(new_df$purchase_frequency) new_df$purchase_frequency <- as.integer(new_df$purchase_frequency) unique(new_df$purchase_frequency) # modifying the column data type to factor and then to integer new_df$product_search_method <- as.factor(new_df$product_search_method) new_df$product_search_method <- as.integer(new_df$product_search_method) unique(new_df$product_search_method) # modifying the column data type to factor and then to integer new_df$cart_abandonment_factors <- as.factor(new_df$cart_abandonment_factors) new_df$cart_abandonment_factors <- as.integer(new_df$cart_abandonment_factors) unique(new_df$cart_abandonment_factors) # modifying the column data type to factor and then to integer new_df$customer_reviews_importance <- as.factor(new_df$customer_reviews_importance) new_df$customer_reviews_importance <- as.integer(new_df$customer_reviews_importance) unique(new_df$customer_reviews_importance) # removing missing values from the dataframe new_df = na.omit(new_df) ``` ```{r correlation, include=FALSE} # creating correlation matrix using cor() correlation_matrix <- cor(new_df) # Display the correlation matrix correlation_matrix ``` ```{r corrplot} # Visualize correlation matrix using corrplot corrplot(correlation_matrix, method = 'color') ``` ```{r bar_plot} # generating plot product_search_method vs cart_abandonment_factors data %>% select(product_search_method, cart_abandonment_factors) %>% # removing missing values na.omit() %>% # using ggplot to generate the plot ggplot(aes(y = cart_abandonment_factors, fill = product_search_method)) + # using geom_bar() to generate 100% bar plot geom_bar(position = "fill", width = 0.5, color = "black") + labs(fill = "Product Search Method") + # scaling x axis to have percentage scale_x_continuous("Percentage in each category", labels = scales::percent, expand = c(0, 0)) + # using viridis color scale - colorblind friendly scale_fill_brewer(palette="Set3") + labs(y = NULL, title = "") + theme_minimal(base_size = 20) # theme minimal ``` ### Analysis Our objective is to investigate and pinpoint potential reasons for cart abandonment using the data at hand. Initially, we examined outliers and conducted a preliminary analysis to extract insights from various feature columns. Our exploration covered `improvement areas`, `purchase frequency`, and the `age` distribution of the dataset. Among the available features, we selected `purchase frequency`, `product search method`, `customer reviews importance`, and `cart abandonment factors` to explore their correlations and identify patterns related to cart abandonment. Analyzing the correlation plot, we observed that `product search method` exhibits a stronger correlation with cart abandonment factors compared to other features. Subsequently, we utilized this feature to uncover insights using methods such as `nnet` and `decision tree`. ```{r traning_1, include=FALSE} set.seed(123) # For reproducibility # training the dataset by splitting the data to 80% training train_index <- sample(1:nrow(new_df), 0.8 * nrow(new_df)) # 80% for training # getting train and test data train_data <- new_df[train_index,] test_data <- new_df[-train_index,] ncol(train_data) ``` ```{r model, include=FALSE} # calling the nnet model with Cart_Abandonment_Factors and Product_Search_Method being our focus model <- nnet( cart_abandonment_factors ~ product_search_method, data = train_data, size = 100, maxit = 1000 ) ``` ```{r predictions, include=FALSE} # generating predictions on the test data using the created model above predictions <- predict(model, newdata = test_data) # getting accuracy of our predictions accuracy <- mean(predictions == test_data$cart_abandonment_factors) * 100 accuracy ``` Despite the diligent efforts we achieved accuracy of 43 percent indicates that the model's performance may be limited by various factors. Potential reasons for the performance could include but are not limited to: Limited Data: The dataset is small or insufficiently representative to capture the complexity of the problem. Model Complexity: The chosen neural network architecture may not be suitable for the problem at hand. Experimentation with different architectures could be beneficial. Feature Engineering: Feature selection or engineering techniques might need further exploration to improve the model's ability to learn relevant patterns. Moving forward, efforts will be focused on refining the model architecture, exploring different neural network configurations, experimenting with additional features, and potentially acquiring more data to improve the model's accuracy. This result, while not meeting the desired performance threshold, serves as a valuable starting point for further iterations and enhancements to achieve better predictive capabilities. ```{r dtree, message=FALSE} # Using a decision tree classifier # Convert factors to categorical data$cart_abandonment_factors <- as.factor(data$cart_abandonment_factors) data$product_search_method <- as.factor(data$product_search_method) # Splitting the data (example using 70%-30% split) set.seed(123) # for reproducibility splitIndex <- createDataPartition(data$cart_abandonment_factors, p = 0.70, list = FALSE) train <- data[splitIndex, ] test <- data[-splitIndex, ] # Train the model tree_model <- rpart(cart_abandonment_factors ~ product_search_method, data = train, method = "class") # Evaluate the model (using test set) predictions <- predict(tree_model, test, type = "class") # Model accuracy accuracy <- sum(predictions == test$cart_abandonment_factors) / nrow(test) # Visualize the tree rpart.plot(tree_model) ``` ### ## Question 2: Which demographic (on the basis of gender and age) is most likely to purchase a particular product category? The second question focuses on determining which demographic (based on gender and age) is most likely to purchase a particular product category on Amazon. The team follows a structured approach involving data cleaning, summary statistics, visualization, and statistical modeling using logistic regression to predict the likelihood of purchasing different product categories based on demographic variables. ### Approach The approach that we took to solve determine which demographic is most likely to purchase a particular product category is as follows: - **Data Cleaning:** We checked the data set for any inconsistencies like missing values or incorrect data. Then we tried to handle the outliers if they needed to be removed. - **Summary statistics:** After data cleaning we performed the summary statistics for important columns like purchase category, age group, gender which helped us understand the characteristics and the distribution of data. - **Visualization:** After a few steps of data transformation which included splitting the rows having multiple purchase categories into multiple rows with same data in other columns we performed some visualizations using ggplot to represent the data with the bar plots which helped us to figure out some trends and patterns of our data. - **Modelling:** In the final part we performed statistical modelling to assess the likelihood of purchasing different product categories based on demographic variables like age and gender. We achieved this goal by using logistic regression models to predict the probability of a particular demographic to purchase each category. ```{r check_data, include=FALSE} # Check data structure str(data) # First 6 rows of dataframe data |> head() |> formattable() ``` ```{r diagnose, include=FALSE} data |> diagnose() |> formattable() ``` ```{r outliers_2, include=FALSE} diagnose_outlier(data) |> filter(outliers_ratio > 0) |> formattable() # Selecting desired columns data |> plot_outlier() ``` ```{r plot_data} # Age distribution ggplot(data, aes(x=age)) + geom_histogram(bins=30, fill="blue", color="black") + labs(title="Age Distribution", x="Age", y="Frequency") + theme_minimal(base_size = 16) # Gender distribution ggplot(data, aes(x=gender)) + geom_bar(fill="orange") + labs(title="Gender Distribution", x="Gender", y="Count") + theme_minimal(base_size = 16) ``` ```{r age_group_data, include=FALSE} # Creating afe group based on age data$age_group <- cut(data$age, breaks=c(0, 18, 25, 35, 45, 55, 65, 100), labels=c('0-18', '19-25', '26-35', '36-45', '46-55', '56-65', '65+')) ``` ```{r splitting purchase categories, include=FALSE} # We are going to split the rows having multiple purchase categories into multiple rows with same data in other columns but split up with respect to purchasing categories. # Splitting 'Purchase_Categories' df_expanded <- data %>% separate_rows(purchase_categories, sep=";") %>% rename(purchase_category = purchase_categories) ``` ```{r age_gender} # Plotting a stacked barplot for representing purchase categories across different age groups ggplot(df_expanded, aes(x=purchase_category, fill=age_group)) + geom_bar(position = "fill") + coord_flip() + scale_fill_brewer(palette="Set3") + labs(title="Purchase Categories Across Age Groups", x="Purchase Category", y="Count") + theme_minimal(base_size = 16) # Plotting a stacked barplot for representing purchase categories across different genders ggplot(df_expanded, aes(x=purchase_category, fill=gender)) + geom_bar(position = "fill") + coord_flip() + scale_fill_brewer(palette="Set3") + labs(title="Purchase Categories Across Genders", x="Purchase Category", y="Count") + theme_minimal(base_size = 16) ``` ### Analysis The analysis of the relationship between age, gender and different purchase categories produces a lot of benefits for companies. It provides a deeper understanding of the preferences and behaviors of the customer, which allows the companies to make informed decisions on their product development, pricing and positioning. It also guides the companies in developing a targeted marketing campaign to reach a specific demographic. By applying the logistic regression we achieved an accuracy of 72.6% to predict the product category a particular demographic is likely to be interested in buying. ```{r log_regr, include=FALSE} # Preparing the data for logistic regression df_expanded <- df_expanded |> mutate(purchase_category = ifelse(is.na(purchase_category), NA, tolower(gsub(" ", "_", purchase_category)))) unique_categories <- unique(df_expanded$purchase_category) models <- list() reports <- list() ``` ```{r predictions_2, include=FALSE} for (category in unique_categories) { # Binary variable for each category df_expanded[[category]] <- ifelse(df_expanded$purchase_category == category, 1, 0) # Model model_formula <- as.formula(paste(category, "~ age_group + gender")) model <- multinom(model_formula, data=df_expanded) # Store the model models[[category]] <- model # Evaluation predictions <- predict(model, df_expanded) reports[[category]] <- confusionMatrix(data=factor(predictions, levels=c(0,1)), reference=factor(df_expanded[[category]], levels=c(0,1))) } # Displaying the report for an example category print(reports[[unique_categories[1]]]) ``` ## Discussion Companies need to know who is most likely to buy their product so they can advertise to them effectively. By understanding the age and gender of their customers, companies can save money by focusing their marketing on the right people. This can help them beat their competition by getting more of the people they want to buy their product. This will also enhance the user experience of the customers which will lead to higher levels of customer satisfaction which will result in cost optimization of the companies marketing budget, customer loyalty and recurring purchases.