Consumer Behaviour Analysis

Proposal

Data Mining

Author: Pattern Pioneers - Vishal, Joel, Pranshu, Shashwat, Bharath

Affiliation: School of Information, University of Arizona

Load required packages

Installed Packages

### GETTING THE LIBRARIES
if (!require(pacman))
  install.packages("pacman")

pacman::p_load(tidyverse,   # Data wrangling
               dlookr,      # Exploratory Data Analysis
               formattable, # Present neat table format
               gt,          # Alternative formatting for the tables
               gtsummary)   # Summary for the tables

Goal

We selected the Amazon Consumer Behaviour Dataset, which we came across on Kaggle, to uncover customer insights that can be used to enhance the customer experience or improve business practices.

Dataset

# Using the original data
# Loading the csv into a variable using read_csv

data <- read_csv("data/Amazon_Customer_Behavior_Survey.csv")

# Removing unwanted column like "timestamp".

data <- data %>% select(-Timestamp)

This dataset, collected from Kaggle, supports behavioral analysis of Amazon’s consumers. It consists of a comprehensive collection of customer interactions and browsing patterns within the Amazon ecosystem, and it includes a wide range of variables such as customer demographics, user interaction, and reviews. The dataset aims to provide insights into customer preferences, shopping habits, and decision-making processes on the Amazon platform. By analyzing it, researchers and analysts can gain a deeper understanding of consumer behavior, identify trends, optimize marketing strategies, and improve the overall customer experience on Amazon. The dataset contains N = 602 observations.

Examine data

Using dlookr’s describe() and diagnose() - some basic EDA

# Summary statistics of the numerical columns

data |>
  describe() |>
  formattable()

described_variables	n	mean	sd	se_mean	IQR	skewness	kurtosis	p00	p01	p05	p10	p20	p25	p30	p40	p50	p60	p70	p75	p80	p90	p95	p99	p100
age	602	30.790698	10.1932760	0.41544691	13	1.0078573	0.5563217	3	16	21	22	23	23	23	25	26	32	34	36	39.8	45	50	60	67
Customer_Reviews_Importance	602	2.480066	1.1852257	0.04830619	2	0.3033064	-0.7098681	1	1	1	1	1	1	2	2	3	3	3	3	3.0	4	5	5	5
Personalized_Recommendation_Frequency…18	602	2.699336	1.0420284	0.04246991	1	0.2272009	-0.2877537	1	1	1	1	2	2	2	2	3	3	3	3	3.0	4	5	5	5
Rating_Accuracy	602	2.672757	0.8997441	0.03667083	1	0.1827622	0.2772995	1	1	1	2	2	2	2	3	3	3	3	3	3.0	4	4	5	5
Shopping_Satisfaction	602	2.463455	1.0121525	0.04125225	1	0.2785844	-0.4033100	1	1	1	1	2	2	2	2	2	3	3	3	3.0	4	4	5	5

# Using diagnose() from dlookr for a per-column summary
data |> 
  diagnose() |>
  formattable()

variables	types	missing_count	missing_percent	unique_count	unique_rate
age	numeric	0	0.0000000	50	0.083056478
Gender	character	0	0.0000000	4	0.006644518
Purchase_Frequency	character	0	0.0000000	5	0.008305648
Purchase_Categories	character	0	0.0000000	29	0.048172757
Personalized_Recommendation_Frequency…6	character	0	0.0000000	3	0.004983389
Browsing_Frequency	character	0	0.0000000	4	0.006644518
Product_Search_Method	character	2	0.3322259	5	0.008305648
Search_Result_Exploration	character	0	0.0000000	2	0.003322259
Customer_Reviews_Importance	numeric	0	0.0000000	5	0.008305648
Add_to_Cart_Browsing	character	0	0.0000000	3	0.004983389
Cart_Completion_Frequency	character	0	0.0000000	5	0.008305648
Cart_Abandonment_Factors	character	0	0.0000000	4	0.006644518
Saveforlater_Frequency	character	0	0.0000000	5	0.008305648
Review_Left	character	0	0.0000000	2	0.003322259
Review_Reliability	character	0	0.0000000	5	0.008305648
Review_Helpfulness	character	0	0.0000000	3	0.004983389
Personalized_Recommendation_Frequency…18	numeric	0	0.0000000	5	0.008305648
Recommendation_Helpfulness	character	0	0.0000000	3	0.004983389
Rating_Accuracy	numeric	0	0.0000000	5	0.008305648
Shopping_Satisfaction	numeric	0	0.0000000	5	0.008305648
Service_Appreciation	character	0	0.0000000	8	0.013289037
Improvement_Areas	character	0	0.0000000	18	0.029900332
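
The diagnose() output above reports two missing values in Product_Search_Method. The following is a minimal sketch, in base R, of how such gaps could be counted and handled before modelling. It uses a small made-up data frame (`toy`) rather than the survey data, the search-method values are invented for illustration, and the "Unknown" label is only one possible convention.

```r
# Toy stand-in for the survey data (values are made up for illustration).
toy <- data.frame(
  age = c(23, 34, 45, 29),
  Product_Search_Method = c("Keyword", NA, "Filter", "categories"),
  stringsAsFactors = FALSE
)

# Count missing values per column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Option 1: keep only rows that have a search method recorded.
toy_complete <- toy[!is.na(toy$Product_Search_Method), ]

# Option 2: keep every row and label the gap explicitly
# ("Unknown" is a hypothetical label, not a value from the dataset).
toy_labelled <- toy
toy_labelled$Product_Search_Method[is.na(toy_labelled$Product_Search_Method)] <- "Unknown"
```

Which option is preferable depends on the analysis: dropping rows loses the other answers of those respondents, while an explicit label keeps them available for models that accept an extra category.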

Checking the number of rows and columns with nrow and ncol:

# Number of rows in the data
nrow(data)

[1] 602

In total, the Amazon consumer behavior dataset has 602 data points. One important point to note here is that some rows contain multiple entries for

# Number of columns in the data
ncol(data)

[1] 22

The dataset has 22 columns.

Categorical variable summary

Using gtsummary’s tbl_summary() to summarize selected categorical columns:

# Selecting the required columns for the summary
new_data <- data %>% select(Browsing_Frequency, Purchase_Frequency, Purchase_Categories)

# Using gtsummary

new_data |>
  gtsummary::tbl_summary()

Characteristic	N = 602¹
Browsing_Frequency
Few times a month	199 (33%)
Few times a week	249 (41%)
Multiple times a day	77 (13%)
Rarely	77 (13%)
Purchase_Frequency
Few times a month	203 (34%)
Less than once a month	124 (21%)
Multiple times a week	56 (9.3%)
Once a month	107 (18%)
Once a week	112 (19%)
Purchase_Categories
Beauty and Personal Care	106 (18%)
Beauty and Personal Care;Clothing and Fashion	46 (7.6%)
Beauty and Personal Care;Clothing and Fashion;Home and Kitchen	42 (7.0%)
Beauty and Personal Care;Clothing and Fashion;Home and Kitchen;others	8 (1.3%)
Beauty and Personal Care;Clothing and Fashion;others	12 (2.0%)
Beauty and Personal Care;Home and Kitchen	21 (3.5%)
Beauty and Personal Care;Home and Kitchen;others	5 (0.8%)
Beauty and Personal Care;others	7 (1.2%)
Clothing and Fashion	106 (18%)
Clothing and Fashion;Home and Kitchen	27 (4.5%)
Clothing and Fashion;Home and Kitchen;others	16 (2.7%)
Clothing and Fashion;others	14 (2.3%)
Groceries and Gourmet Food	14 (2.3%)
Groceries and Gourmet Food;Beauty and Personal Care	7 (1.2%)
Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion	10 (1.7%)
Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion;Home and Kitchen	14 (2.3%)
Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion;Home and Kitchen;others	32 (5.3%)
Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion;others	1 (0.2%)
Groceries and Gourmet Food;Beauty and Personal Care;Home and Kitchen	4 (0.7%)
Groceries and Gourmet Food;Beauty and Personal Care;others	3 (0.5%)
Groceries and Gourmet Food;Clothing and Fashion	6 (1.0%)
Groceries and Gourmet Food;Clothing and Fashion;Home and Kitchen	4 (0.7%)
Groceries and Gourmet Food;Clothing and Fashion;Home and Kitchen;others	3 (0.5%)
Groceries and Gourmet Food;Clothing and Fashion;others	2 (0.3%)
Groceries and Gourmet Food;Home and Kitchen	5 (0.8%)
Groceries and Gourmet Food;Home and Kitchen;others	6 (1.0%)
Home and Kitchen	24 (4.0%)
Home and Kitchen;others	9 (1.5%)
others	48 (8.0%)
¹ n (%)
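
The Purchase_Categories rows in the summary above are semicolon-separated combinations, so a single category such as "Home and Kitchen" is spread over many rows. The sketch below, written in base R on a small made-up data frame (`toy`), shows one way the combinations could be split into one row per respondent and category before counting. The column name `Purchase_Category` for the split result is an assumption introduced here.

```r
# Toy stand-in for the survey data (made-up rows, real column names).
toy <- data.frame(
  Gender = c("Female", "Male", "Female"),
  Purchase_Categories = c(
    "Beauty and Personal Care;Clothing and Fashion",
    "Home and Kitchen",
    "Clothing and Fashion;Home and Kitchen;others"
  ),
  stringsAsFactors = FALSE
)

# Split each response on ";" into its individual categories.
parts <- strsplit(toy$Purchase_Categories, ";", fixed = TRUE)

# Repeat each respondent's row once per category (long format).
long <- data.frame(
  Gender = rep(toy$Gender, lengths(parts)),
  Purchase_Category = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Count how often each single category was selected.
category_counts <- sort(table(long$Purchase_Category), decreasing = TRUE)
print(category_counts)
```

In a tidyverse pipeline, tidyr::separate_rows() with sep = ";" would be a possible alternative for the splitting step.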

Questions

Question 1

Our first question is “What are the factors influencing the customer’s decision to abandon a purchase in their cart on Amazon?” We want to understand why customers abandon purchases in their cart, so that Amazon can increase its conversion rate (the percentage of users who actually complete a purchase). The answer will also help enhance the customer experience by making the application or website more user-friendly and intuitive, so that users can find the right product and complete their purchase effortlessly.

Question 2

Our second question is “Which demographic (on the basis of gender and age) is most likely to purchase a particular product category?” We want to determine which demographic group, defined by age and gender, is most likely to purchase each product category. This will help companies tailor their marketing strategies so that their messages reach the right group of customers, which optimizes the marketing budget. By identifying the right demographic to target, Amazon can gain a competitive advantage by attracting a larger share of the target audience. Better targeting may also lead to higher customer satisfaction, and in turn to customer loyalty and recurring purchases.

Analysis plan

Approach for question 1

Our analysis will begin with an exploratory data analysis (EDA) in R, focusing on key variables such as Purchase_Frequency, Product_Search_Method, and Customer_Reviews_Importance to gauge their potential impact on cart abandonment. During data cleaning, we will ensure that columns like age, Gender, and Browsing_Frequency are correctly formatted and free of missing values. We will then apply statistical techniques, such as logistic regression or decision trees, to assess the influence of factors like Personalized_Recommendation_Frequency and Search_Result_Exploration on cart abandonment behavior.

For visualization, we will use R’s ggplot2 package to create interpretable graphics, such as correlation heatmaps and bar charts, that show how cart abandonment relates to various customer behavior metrics. These insights should help us identify ways to improve the shopping experience and reduce cart abandonment rates.
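
As an illustration of the logistic regression option, here is a minimal base R sketch on simulated data. It is not the final model: the binary `abandoned` flag is hypothetical and would have to be derived from the survey's cart-related columns, the factor levels are made up, and the coefficients used to simulate the outcome are arbitrary. Since Cart_Abandonment_Factors has four distinct values in the diagnose() output, a multinomial model may turn out to be the better fit for the real data.

```r
set.seed(42)
n <- 200

# Simulated stand-in data; the binary `abandoned` flag is hypothetical.
toy <- data.frame(
  Customer_Reviews_Importance = sample(1:5, n, replace = TRUE),
  Search_Result_Exploration = factor(
    sample(c("First page", "Multiple pages"), n, replace = TRUE)
  )
)

# Simulate an outcome whose log-odds rise with review importance
# (the intercept and slope are arbitrary illustration values).
log_odds <- -1.5 + 0.5 * toy$Customer_Reviews_Importance
toy$abandoned <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit a logistic regression of abandonment on the two predictors.
fit <- glm(
  abandoned ~ Customer_Reviews_Importance + Search_Result_Exploration,
  data = toy,
  family = binomial()
)

# Odds ratios: values above 1 indicate higher odds of abandonment.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

Reporting odds ratios rather than raw coefficients keeps the output interpretable for a business audience, which matches the goal of identifying actionable factors.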

Approach for question 2

For the second question, we will focus mainly on the key features age, Gender, Purchase_Categories, and Cart_Completion_Frequency to predict which product categories are of particular interest to each demographic group. Again, we will make sure the data for these features is clean and address any missing or incorrect values.

After that, we will compute summary statistics for key variables such as Purchase_Categories. Once the data is clean, we will use statistical analysis tools to assess the likelihood of purchasing each product category based on the demographic variables (age and gender). For visualization, we will use heatmaps and bar charts to illustrate the relationship between the target variables.
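
One simple way to summarize purchases by demographic group is a cross-tabulation of age band and gender against category. The base R sketch below uses made-up rows in long format (one row per respondent and category); the age bands and the column name `Purchase_Category` are assumptions chosen for illustration, not decisions already made for the project.

```r
# Toy long-format data: one row per respondent and purchased category
# (made-up values for illustration).
toy <- data.frame(
  age = c(19, 24, 31, 38, 45, 52, 27, 33),
  Gender = c("Female", "Male", "Female", "Male",
             "Female", "Male", "Female", "Male"),
  Purchase_Category = c(
    "Clothing and Fashion", "Home and Kitchen", "Beauty and Personal Care",
    "Home and Kitchen", "Beauty and Personal Care", "Groceries and Gourmet Food",
    "Clothing and Fashion", "Clothing and Fashion"
  ),
  stringsAsFactors = FALSE
)

# Bin age into hypothetical bands.
toy$age_group <- cut(
  toy$age,
  breaks = c(0, 25, 35, 50, Inf),
  labels = c("<=25", "26-35", "36-50", "51+")
)

# Count respondents per demographic group (gender x age band) and category.
counts <- table(
  group = interaction(toy$Gender, toy$age_group, drop = TRUE),
  category = toy$Purchase_Category
)

# Share of each demographic group within every category (columns sum to 1).
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))
```

The resulting matrix of shares is also a natural input for the planned heatmap, with demographic groups on one axis and categories on the other.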

Variables of focus for both questions

Variable	Description
Purchase_Frequency	How frequently does the user make purchases on Amazon?
Product_Search_Method	How does the user search for products on Amazon?
Customer_Reviews_Importance	How important are customer reviews in the user’s decision-making process?
age	Age of the user
Gender	Gender of the user
Browsing_Frequency	How often does the user browse Amazon’s website or app?
Purchase_Categories	What product categories does the user typically purchase on Amazon?
Cart_Completion_Frequency	How often does the user complete the purchase after adding products to their cart?
Personalized_Recommendation_Frequency	How often does the user receive personalized product recommendations from Amazon?
Search_Result_Exploration	Does the user tend to explore multiple pages of search results or focus on the first page?

Organization

Plan of Attack

Week	Weekly Tasks	Persons in Charge	Backup
Until November 8th	Explore and finalize the dataset and the problem statements	Everyone	Everyone
-	Complete the proposal and assign some high-level tasks	Everyone	Everyone
November 9th to 15th	Exploratory Data Analysis	Shashwat	Bharath
-	Data cleaning and data pre-processing based on EDA	Bharath	Pranshu
-	Question-specific exploration and identification of initial trends and patterns	Joel	Vishal
November 16th to 22nd	Model training for Q1	Vishal	Shashwat
-	Model training for Q2	Pranshu	Joel
November 23rd to 29th	Continue model training and testing for Q1 and Q2	Vishal	Pranshu
-	Improving the models if there is a need	Joel	Bharath
November 30th to December 6th	Refining the code for code review with comments	Bharath	Vishal
-	Generate insights from the model output	Shashwat	Joel
December 7th to 13th	Review the generated models	Pranshu	Shashwat
-	Write-up and presentation for the project	Everyone	Everyone

Repo Organization

The following are the folders involved in the Project repository.

‘data/’: Used for storing any necessary data files for the project, such as input files.
‘images/’: Used for storing image files used in the project.
‘presentation_files/’: Used for storing presentation-related files.
‘_extra/’: Used to brainstorm our analysis which won’t impact our project workflow.
‘_freeze/’: This folder is used to store the generated files during the build process. These files represent the frozen state of the website at a specific point in time.
‘_site/’: Folder used to store the generated static website files after the site generator processes the Quarto document.
‘.github/’: Folder for storing GitHub templates and workflows.

We will create a few folders inside the images/ folder to store the question-specific and presentation-related images generated throughout the project: images/Q1, images/Q2, and images/Presentation.
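
The image sub-folders could be created from R rather than by hand. This is a small sketch under stated assumptions: `project_root` is a placeholder that points to a temporary directory here so the code can be run without touching a real repository, and it would be replaced by the project's root path in practice.

```r
# `project_root` is a placeholder; a temporary directory is used here.
project_root <- tempdir()

# Build the three planned sub-folder paths under images/.
image_dirs <- file.path(project_root, "images", c("Q1", "Q2", "Presentation"))

for (d in image_dirs) {
  # recursive = TRUE also creates the parent images/ folder when needed;
  # showWarnings = FALSE keeps reruns quiet if the folder already exists.
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that all three folders now exist.
print(dir.exists(image_dirs))
```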

Note:

These are our planned approaches to the problem statements we have defined. Parts of the approach might change in the final implementation.