This project aims to conduct a comprehensive analysis of consumer behavior using the Amazon Consumer Behavior Dataset obtained from Kaggle. By delving into the intricacies of customer interactions, browsing patterns, and reviews, our goal is to unearth nuanced insights that can be harnessed to not only enhance the customer experience but also to strategically optimize business approaches. Leveraging a combination of statistical analyses and advanced visualization techniques, this study aims to provide actionable intelligence for businesses navigating the ever-evolving landscape of e-commerce.
Introduction
In the fast-paced world of online shopping, understanding what customers want is crucial for success. The Amazon Consumer Behavior Dataset is like a goldmine of information about how people shop online. It’s not just numbers; it tells a story about how and why people make choices when shopping on the internet. It goes beyond just purchases, giving us a complete picture of how customers interact and move through their online shopping experience. This analysis focuses on two main things: figuring out why some people leave their shopping carts without buying anything and understanding which groups of people prefer different types of products. The goal is to provide useful information to businesses so they can make smart decisions in the complex world of online commerce.
Question 1: What are the factors influencing the customer’s decision to abandon a purchase in their cart on Amazon?
The first question addresses the factors influencing a customer’s decision to abandon a purchase in their cart on Amazon. The analysis commences with Exploratory Data Analysis (EDA) in R, focusing on key variables such as Purchase Frequency, Product Search Method, and Customer Reviews Importance.
Approach
Our analysis begins with precise exploratory data analysis (EDA) in R, focusing on key variables like Purchase_Frequency, Product_Search_Method, and Customer_Reviews_Importance to understand their impact on cart abandonment. During data cleaning, we ensure proper formatting and address missing values in age, Gender, and Browsing_Frequency. We then apply statistical techniques like neural networks and decision trees to assess the influence of factors such as Personalized_Recommendation_Frequency and Search_Result_Exploration on cart abandonment. Visualization utilizes ggplot2 for creating comprehensive graphics, including correlation heat-maps and bar charts, to reveal insights into the interplay between cart abandonment and customer behavior metrics, aiming to enhance the shopping experience and reduce cart abandonment rates effectively.
Analysis
Our objective is to investigate and pinpoint potential reasons for cart abandonment using the data at hand. Initially, we examined outliers and conducted a preliminary analysis to extract insights from various feature columns. Our exploration covered improvement areas, purchase frequency, and the age distribution of the dataset. Among the available features, we selected purchase frequency, product search method, customer reviews importance, and cart abandonment factors to explore their correlations and identify patterns related to cart abandonment. Analyzing the correlation plot, we observed that product search method exhibits a stronger correlation with cart abandonment factors compared to other features. Subsequently, we utilized this feature to uncover insights using methods such as nnet and decision tree.
Despite the diligent efforts we achieved accuracy of 43 percent indicates that the model’s performance may be limited by various factors. Potential reasons for the performance could include but are not limited to:
Limited Data: The dataset is small or insufficiently representative to capture the complexity of the problem.
Model Complexity: The chosen neural network architecture may not be suitable for the problem at hand. Experimentation with different architectures could be beneficial.
Feature Engineering: Feature selection or engineering techniques might need further exploration to improve the model’s ability to learn relevant patterns.
Moving forward, efforts will be focused on refining the model architecture, exploring different neural network configurations, experimenting with additional features, and potentially acquiring more data to improve the model’s accuracy. This result, while not meeting the desired performance threshold, serves as a valuable starting point for further iterations and enhancements to achieve better predictive capabilities.
Question 2: Which demographic (on the basis of gender and age) is most likely to purchase a particular product category?
The second question focuses on determining which demographic (based on gender and age) is most likely to purchase a particular product category on Amazon. The team follows a structured approach involving data cleaning, summary statistics, visualization, and statistical modeling using logistic regression to predict the likelihood of purchasing different product categories based on demographic variables.
Approach
The approach that we took to solve determine which demographic is most likely to purchase a particular product category is as follows:
Data Cleaning: We checked the data set for any inconsistencies like missing values or incorrect data. Then we tried to handle the outliers if they needed to be removed.
Summary statistics: After data cleaning we performed the summary statistics for important columns like purchase category, age group, gender which helped us understand the characteristics and the distribution of data.
Visualization: After a few steps of data transformation which included splitting the rows having multiple purchase categories into multiple rows with same data in other columns we performed some visualizations using ggplot to represent the data with the bar plots which helped us to figure out some trends and patterns of our data.
Modelling: In the final part we performed statistical modelling to assess the likelihood of purchasing different product categories based on demographic variables like age and gender. We achieved this goal by using logistic regression models to predict the probability of a particular demographic to purchase each category.
Analysis
The analysis of the relationship between age, gender and different purchase categories produces a lot of benefits for companies. It provides a deeper understanding of the preferences and behaviors of the customer, which allows the companies to make informed decisions on their product development, pricing and positioning. It also guides the companies in developing a targeted marketing campaign to reach a specific demographic. By applying the logistic regression we achieved an accuracy of 72.6% to predict the product category a particular demographic is likely to be interested in buying.
Discussion
Companies need to know who is most likely to buy their product so they can advertise to them effectively. By understanding the age and gender of their customers, companies can save money by focusing their marketing on the right people. This can help them beat their competition by getting more of the people they want to buy their product. This will also enhance the user experience of the customers which will lead to higher levels of customer satisfaction which will result in cost optimization of the companies marketing budget, customer loyalty and recurring purchases.
Source Code
---title: "Consumer Behaviour Analysis"subtitle: "INFO 523 - Project Final"author: - name: "Pattern Pioneers - Vishal, Joel, Pranshu, Shashwat, Bharath" affiliations: - name: "School of Information, University of Arizona"description: ""format: html: code-tools: true code-overflow: wrap embed-resources: trueeditor: visualexecute: warning: false echo: false---```{r install_packages, include=FALSE}### Loading the required librariesif (!require(pacman))install.packages(pacman)pacman::p_load(tidyverse, # Data wrangling dlookr, # Exploratory Data Analysis formattable, # Present neat table format gt, # Alternating formatting for the tables gtsummary, here, nnet, janitor, cluster, corrplot, dplyr, caret, formattable, rpart, rpart.plot)```## AbstractThis project aims to conduct a comprehensive analysis of consumer behavior using the Amazon Consumer Behavior Dataset obtained from Kaggle. By delving into the intricacies of customer interactions, browsing patterns, and reviews, our goal is to unearth nuanced insights that can be harnessed to not only enhance the customer experience but also to strategically optimize business approaches. Leveraging a combination of statistical analyses and advanced visualization techniques, this study aims to provide actionable intelligence for businesses navigating the ever-evolving landscape of e-commerce.```{r read_data, include=FALSE}# Using the original data# Loading the csv into a variable using read_csvdata <-read_csv(here("data", "Amazon_Customer_Behavior_Survey.csv"))# Removing unwanted column like "timestamp".data <- data %>%select(-Timestamp) %>%clean_names()```## IntroductionIn the fast-paced world of online shopping, understanding what customers want is crucial for success. The Amazon Consumer Behavior Dataset is like a goldmine of information about how people shop online. It's not just numbers; it tells a story about how and why people make choices when shopping on the internet. It goes beyond just purchases, giving us a complete picture of how customers interact and move through their online shopping experience. This analysis focuses on two main things: figuring out why some people leave their shopping carts without buying anything and understanding which groups of people prefer different types of products. The goal is to provide useful information to businesses so they can make smart decisions in the complex world of online commerce.## Question 1: What are the factors influencing the customer's decision to abandon a purchase in their cart on Amazon?The first question addresses the factors influencing a customer's decision to abandon a purchase in their cart on Amazon. The analysis commences with Exploratory Data Analysis (EDA) in R, focusing on key variables such as Purchase Frequency, Product Search Method, and Customer Reviews Importance.### ApproachOur analysis begins with precise exploratory data analysis (EDA) in R, focusing on key variables like Purchase_Frequency, Product_Search_Method, and Customer_Reviews_Importance to understand their impact on cart abandonment. During data cleaning, we ensure proper formatting and address missing values in age, Gender, and Browsing_Frequency. We then apply statistical techniques like neural networks and decision trees to assess the influence of factors such as Personalized_Recommendation_Frequency and Search_Result_Exploration on cart abandonment. Visualization utilizes ggplot2 for creating comprehensive graphics, including correlation heat-maps and bar charts, to reveal insights into the interplay between cart abandonment and customer behavior metrics, aiming to enhance the shopping experience and reduce cart abandonment rates effectively.```{r get_cols, include=FALSE}colnames(data)``````{r eda_1}# Calculate the counts of each categoryimprovement_counts <- data %>%group_by(improvement_areas) %>%summarize(count =n()) %>%arrange(desc(count)) %>%ungroup()# Get the top 4 categoriestop_improvement_areas <-head(improvement_counts, 4)$improvement_areas# Create a new column 'grouped_improvement_areas' where categories outside the top 4 are labeled 'Others'data <- data %>%mutate(grouped_improvement_areas =ifelse(improvement_areas %in% top_improvement_areas, improvement_areas, 'Others'))# Reorder the factor levels to ensure "Others" is at the enddata$grouped_improvement_areas <-factor(data$grouped_improvement_areas,levels =c(top_improvement_areas, "Others"))# Plot with top 4 categories and 'Others' at the endggplot(data, aes(x=grouped_improvement_areas, fill=grouped_improvement_areas)) +geom_bar(color="black") +theme_minimal(base_size =16) +theme(axis.text.x =element_text(angle =0),legend.position ="none") +labs(title ="Distribution of Top 4 Improvement Areas and Others", x ="Improvement Areas", y ="Count") +scale_fill_brewer(palette="Set3") # Using a ColorBrewer palette for fill colors``````{r eda_2}# Distribution of Age Among Respondents with a colorggplot(data, aes(x=age)) +geom_histogram(binwidth =1, fill="skyblue", color="black") +# Setting fill and outline color for the barstheme_minimal(base_size =16) +labs(title ="Distribution of Age Among Respondents", x ="Age", y ="Count")# Distribution of Purchase Frequency with colorsggplot(data, aes(x=purchase_frequency, fill=purchase_frequency)) +geom_bar(color="black") +scale_fill_brewer(palette="Set3") +# Using a ColorBrewer palette for fill colorstheme_minimal(base_size =16) +theme(axis.text.x =element_text(angle =0),legend.position ="none") +labs(title ="Distribution of Purchase Frequency", x ="Purchase Frequency", y ="Count")``````{r new_df, include=FALSE}# creating a new dataframe by selecting only the columns of our focusnew_df <- data %>%select( purchase_frequency, product_search_method, customer_reviews_importance, cart_abandonment_factors )``````{r get_outliers, include=FALSE}# getting some idea on the dataset by visualizing the outliersnew_df |>plot_outlier()# Interpretation: There are not many outliers such that data transformation is needed for this data``````{r refactor_new_df, include=FALSE}# modifying the column data type to factor and then to integernew_df$purchase_frequency <-as.factor(new_df$purchase_frequency)new_df$purchase_frequency <-as.integer(new_df$purchase_frequency)unique(new_df$purchase_frequency)# modifying the column data type to factor and then to integernew_df$product_search_method <-as.factor(new_df$product_search_method)new_df$product_search_method <-as.integer(new_df$product_search_method)unique(new_df$product_search_method)# modifying the column data type to factor and then to integernew_df$cart_abandonment_factors <-as.factor(new_df$cart_abandonment_factors)new_df$cart_abandonment_factors <-as.integer(new_df$cart_abandonment_factors)unique(new_df$cart_abandonment_factors)# modifying the column data type to factor and then to integernew_df$customer_reviews_importance <-as.factor(new_df$customer_reviews_importance)new_df$customer_reviews_importance <-as.integer(new_df$customer_reviews_importance)unique(new_df$customer_reviews_importance)# removing missing values from the dataframenew_df =na.omit(new_df)``````{r correlation, include=FALSE}# creating correlation matrix using cor()correlation_matrix <-cor(new_df)# Display the correlation matrixcorrelation_matrix``````{r corrplot}# Visualize correlation matrix using corrplotcorrplot(correlation_matrix, method ='color')``````{r bar_plot}# generating plot product_search_method vs cart_abandonment_factorsdata %>%select(product_search_method, cart_abandonment_factors) %>%# removing missing valuesna.omit() %>%# using ggplot to generate the plotggplot(aes(y = cart_abandonment_factors, fill = product_search_method)) +# using geom_bar() to generate 100% bar plotgeom_bar(position ="fill", width =0.5, color ="black") +labs(fill ="Product Search Method") +# scaling x axis to have percentagescale_x_continuous("Percentage in each category", labels = scales::percent,expand =c(0, 0)) +# using viridis color scale - colorblind friendlyscale_fill_brewer(palette="Set3") +labs(y =NULL, title ="") +theme_minimal(base_size =20) # theme minimal```### AnalysisOur objective is to investigate and pinpoint potential reasons for cart abandonment using the data at hand. Initially, we examined outliers and conducted a preliminary analysis to extract insights from various feature columns. Our exploration covered `improvement areas`, `purchase frequency`, and the `age` distribution of the dataset. Among the available features, we selected `purchase frequency`, `product search method`, `customer reviews importance`, and `cart abandonment factors` to explore their correlations and identify patterns related to cart abandonment. Analyzing the correlation plot, we observed that `product search method` exhibits a stronger correlation with cart abandonment factors compared to other features. Subsequently, we utilized this feature to uncover insights using methods such as `nnet` and `decision tree`.```{r traning_1, include=FALSE}set.seed(123) # For reproducibility# training the dataset by splitting the data to 80% trainingtrain_index <-sample(1:nrow(new_df), 0.8*nrow(new_df)) # 80% for training# getting train and test datatrain_data <- new_df[train_index,]test_data <- new_df[-train_index,]ncol(train_data)``````{r model, include=FALSE}# calling the nnet model with Cart_Abandonment_Factors and Product_Search_Method being our focusmodel <-nnet( cart_abandonment_factors ~ product_search_method,data = train_data,size =100,maxit =1000)``````{r predictions, include=FALSE}# generating predictions on the test data using the created model abovepredictions <-predict(model, newdata = test_data)# getting accuracy of our predictionsaccuracy <-mean(predictions == test_data$cart_abandonment_factors) *100accuracy```Despite the diligent efforts we achieved accuracy of 43 percent indicates that the model's performance may be limited by various factors. Potential reasons for the performance could include but are not limited to:Limited Data: The dataset is small or insufficiently representative to capture the complexity of the problem.Model Complexity: The chosen neural network architecture may not be suitable for the problem at hand. Experimentation with different architectures could be beneficial.Feature Engineering: Feature selection or engineering techniques might need further exploration to improve the model's ability to learn relevant patterns.Moving forward, efforts will be focused on refining the model architecture, exploring different neural network configurations, experimenting with additional features, and potentially acquiring more data to improve the model's accuracy. This result, while not meeting the desired performance threshold, serves as a valuable starting point for further iterations and enhancements to achieve better predictive capabilities.```{r dtree, message=FALSE}# Using a decision tree classifier# Convert factors to categoricaldata$cart_abandonment_factors <-as.factor(data$cart_abandonment_factors)data$product_search_method <-as.factor(data$product_search_method)# Splitting the data (example using 70%-30% split)set.seed(123) # for reproducibilitysplitIndex <-createDataPartition(data$cart_abandonment_factors, p =0.70, list =FALSE)train <- data[splitIndex, ]test <- data[-splitIndex, ]# Train the modeltree_model <-rpart(cart_abandonment_factors ~ product_search_method, data = train, method ="class")# Evaluate the model (using test set)predictions <-predict(tree_model, test, type ="class")# Model accuracyaccuracy <-sum(predictions == test$cart_abandonment_factors) /nrow(test)# Visualize the treerpart.plot(tree_model)```### ## Question 2: Which demographic (on the basis of gender and age) is most likely to purchase a particular product category?The second question focuses on determining which demographic (based on gender and age) is most likely to purchase a particular product category on Amazon. The team follows a structured approach involving data cleaning, summary statistics, visualization, and statistical modeling using logistic regression to predict the likelihood of purchasing different product categories based on demographic variables.### ApproachThe approach that we took to solve determine which demographic is most likely to purchase a particular product category is as follows:- **Data Cleaning:** We checked the data set for any inconsistencies like missing values or incorrect data. Then we tried to handle the outliers if they needed to be removed.- **Summary statistics:** After data cleaning we performed the summary statistics for important columns like purchase category, age group, gender which helped us understand the characteristics and the distribution of data.- **Visualization:** After a few steps of data transformation which included splitting the rows having multiple purchase categories into multiple rows with same data in other columns we performed some visualizations using ggplot to represent the data with the bar plots which helped us to figure out some trends and patterns of our data.- **Modelling:** In the final part we performed statistical modelling to assess the likelihood of purchasing different product categories based on demographic variables like age and gender. We achieved this goal by using logistic regression models to predict the probability of a particular demographic to purchase each category.```{r check_data, include=FALSE}# Check data structurestr(data)# First 6 rows of dataframedata |>head() |>formattable()``````{r diagnose, include=FALSE}data |>diagnose() |>formattable()``````{r outliers_2, include=FALSE}diagnose_outlier(data) |>filter(outliers_ratio >0) |>formattable()# Selecting desired columns data |>plot_outlier()``````{r plot_data}# Age distributionggplot(data, aes(x=age)) +geom_histogram(bins=30, fill="blue", color="black") +labs(title="Age Distribution", x="Age", y="Frequency") +theme_minimal(base_size =16)# Gender distributionggplot(data, aes(x=gender)) +geom_bar(fill="orange") +labs(title="Gender Distribution", x="Gender", y="Count") +theme_minimal(base_size =16)``````{r age_group_data, include=FALSE}# Creating afe group based on agedata$age_group <-cut(data$age, breaks=c(0, 18, 25, 35, 45, 55, 65, 100),labels=c('0-18', '19-25', '26-35', '36-45', '46-55', '56-65', '65+'))``````{r splitting purchase categories, include=FALSE}# We are going to split the rows having multiple purchase categories into multiple rows with same data in other columns but split up with respect to purchasing categories.# Splitting 'Purchase_Categories'df_expanded <- data %>%separate_rows(purchase_categories, sep=";") %>%rename(purchase_category = purchase_categories)``````{r age_gender}# Plotting a stacked barplot for representing purchase categories across different age groupsggplot(df_expanded, aes(x=purchase_category, fill=age_group)) +geom_bar(position ="fill") +coord_flip() +scale_fill_brewer(palette="Set3") +labs(title="Purchase Categories Across Age Groups", x="Purchase Category", y="Count") +theme_minimal(base_size =16)# Plotting a stacked barplot for representing purchase categories across different gendersggplot(df_expanded, aes(x=purchase_category, fill=gender)) +geom_bar(position ="fill") +coord_flip() +scale_fill_brewer(palette="Set3") +labs(title="Purchase Categories Across Genders", x="Purchase Category", y="Count") +theme_minimal(base_size =16)```### AnalysisThe analysis of the relationship between age, gender and different purchase categories produces a lot of benefits for companies. It provides a deeper understanding of the preferences and behaviors of the customer, which allows the companies to make informed decisions on their product development, pricing and positioning. It also guides the companies in developing a targeted marketing campaign to reach a specific demographic. By applying the logistic regression we achieved an accuracy of 72.6% to predict the product category a particular demographic is likely to be interested in buying.```{r log_regr, include=FALSE}# Preparing the data for logistic regressiondf_expanded <- df_expanded |>mutate(purchase_category =ifelse(is.na(purchase_category), NA, tolower(gsub(" ", "_", purchase_category))))unique_categories <-unique(df_expanded$purchase_category)models <-list()reports <-list()``````{r predictions_2, include=FALSE}for (category in unique_categories) {# Binary variable for each category df_expanded[[category]] <-ifelse(df_expanded$purchase_category == category, 1, 0)# Model model_formula <-as.formula(paste(category, "~ age_group + gender")) model <-multinom(model_formula, data=df_expanded)# Store the model models[[category]] <- model# Evaluation predictions <-predict(model, df_expanded) reports[[category]] <-confusionMatrix(data=factor(predictions, levels=c(0,1)), reference=factor(df_expanded[[category]], levels=c(0,1)))}# Displaying the report for an example categoryprint(reports[[unique_categories[1]]])```## DiscussionCompanies need to know who is most likely to buy their product so they can advertise to them effectively. By understanding the age and gender of their customers, companies can save money by focusing their marketing on the right people. This can help them beat their competition by getting more of the people they want to buy their product. This will also enhance the user experience of the customers which will lead to higher levels of customer satisfaction which will result in cost optimization of the companies marketing budget, customer loyalty and recurring purchases.