Consumer Behaviour Analysis

Proposal

Data Mining
Author
Affiliation
Pattern Pioneers - Vishal, Joel, Pranshu, Shashwat, Bharath

School of Information, University of Arizona

Load required packages

Installed Packages
### GETTING THE LIBRARIES
if (!require(pacman))
  install.packages(pacman)

pacman::p_load(tidyverse,   # Data wrangling
               dlookr,      # Exploratory Data Analysis
               formattable, # Present neat table format
               gt,          # Alternating formatting for the tables
               gtsummary)   # Summary for the tables

Goal

Our main motivation for selecting the dataset Amazon consumer Behaviour Dataset that we came across in Kaggle was to unveil some customer insights which can be used for enhancing the customer experience or the improving the business implementation after analyzing this dataset.

Dataset

Code
# Using the original data
# Loading the csv into a variable using read_csv

data <- read_csv("data/Amazon_Customer_Behavior_Survey.csv")

# Removing unwanted column like "timestamp".

data <- data %>% select(-Timestamp)

This is a dataset collected from kaggle for analyzing the behavioral analysis of Amazon’s consumers consists of a comprehensive collection of customer interactions, browsing patterns within the Amazon ecosystem. It includes a wide range of variables such as customer demographics, user interaction, and reviews. The dataset aims to provide insights into customer preferences, shopping habits, and decision-making processes on the Amazon platform. By analyzing this dataset, researchers and analysts can gain a deeper understanding of consumer behavior, identify trends, optimize marketing strategies, and improve the overall customer experience on Amazon. The Dataset contains N=602 observations.

Examine data

Using dlookr’s describe() and diagnose() - some basic EDA

Code
# Summary statistics of numerical column

data |> 
    describe() |>
  formattable()
described_variables n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20 p25 p30 p40 p50 p60 p70 p75 p80 p90 p95 p99 p100
age 602 0 30.790698 10.1932760 0.41544691 13 1.0078573 0.5563217 3 16 21 22 23 23 23 25 26 32 34 36 39.8 45 50 60 67
Customer_Reviews_Importance 602 0 2.480066 1.1852257 0.04830619 2 0.3033064 -0.7098681 1 1 1 1 1 1 2 2 3 3 3 3 3.0 4 5 5 5
Personalized_Recommendation_Frequency…18 602 0 2.699336 1.0420284 0.04246991 1 0.2272009 -0.2877537 1 1 1 1 2 2 2 2 3 3 3 3 3.0 4 5 5 5
Rating_Accuracy 602 0 2.672757 0.8997441 0.03667083 1 0.1827622 0.2772995 1 1 1 2 2 2 2 3 3 3 3 3 3.0 4 4 5 5
Shopping_Satisfaction 602 0 2.463455 1.0121525 0.04125225 1 0.2785844 -0.4033100 1 1 1 1 2 2 2 2 2 3 3 3 3.0 4 4 5 5
Code
# using Diagnose from dlookr to look for column summary
data |> 
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
age numeric 0 0.0000000 50 0.083056478
Gender character 0 0.0000000 4 0.006644518
Purchase_Frequency character 0 0.0000000 5 0.008305648
Purchase_Categories character 0 0.0000000 29 0.048172757
Personalized_Recommendation_Frequency…6 character 0 0.0000000 3 0.004983389
Browsing_Frequency character 0 0.0000000 4 0.006644518
Product_Search_Method character 2 0.3322259 5 0.008305648
Search_Result_Exploration character 0 0.0000000 2 0.003322259
Customer_Reviews_Importance numeric 0 0.0000000 5 0.008305648
Add_to_Cart_Browsing character 0 0.0000000 3 0.004983389
Cart_Completion_Frequency character 0 0.0000000 5 0.008305648
Cart_Abandonment_Factors character 0 0.0000000 4 0.006644518
Saveforlater_Frequency character 0 0.0000000 5 0.008305648
Review_Left character 0 0.0000000 2 0.003322259
Review_Reliability character 0 0.0000000 5 0.008305648
Review_Helpfulness character 0 0.0000000 3 0.004983389
Personalized_Recommendation_Frequency…18 numeric 0 0.0000000 5 0.008305648
Recommendation_Helpfulness character 0 0.0000000 3 0.004983389
Rating_Accuracy numeric 0 0.0000000 5 0.008305648
Shopping_Satisfaction numeric 0 0.0000000 5 0.008305648
Service_Appreciation character 0 0.0000000 8 0.013289037
Improvement_Areas character 0 0.0000000 18 0.029900332

Checking the number of rows and columns with nrow and ncol:

Code
# Number of rows in the data
nrow(data)
[1] 602
  • So we have totally 602 data points in the Amazon Consumer behavior dataset. One important point to note here is that some rows contain multiple entries for
Code
# Number of columns in the data
ncol(data)
[1] 22
  • And we have 22 columns in the dataset.

Categorical variable summary

Using gtsummary for table summary (tbl_summary())of selected categorical columns:

Code
# Selecting the required columns for summary 
 new_data <-data %>% select(Browsing_Frequency,Purchase_Frequency,Purchase_Categories)
  
# Using gtsummary
 
new_data |>
  gtsummary::tbl_summary()
Characteristic N = 6021
Browsing_Frequency
    Few times a month 199 (33%)
    Few times a week 249 (41%)
    Multiple times a day 77 (13%)
    Rarely 77 (13%)
Purchase_Frequency
    Few times a month 203 (34%)
    Less than once a month 124 (21%)
    Multiple times a week 56 (9.3%)
    Once a month 107 (18%)
    Once a week 112 (19%)
Purchase_Categories
    Beauty and Personal Care 106 (18%)
    Beauty and Personal Care;Clothing and Fashion 46 (7.6%)
    Beauty and Personal Care;Clothing and Fashion;Home and Kitchen 42 (7.0%)
    Beauty and Personal Care;Clothing and Fashion;Home and Kitchen;others 8 (1.3%)
    Beauty and Personal Care;Clothing and Fashion;others 12 (2.0%)
    Beauty and Personal Care;Home and Kitchen 21 (3.5%)
    Beauty and Personal Care;Home and Kitchen;others 5 (0.8%)
    Beauty and Personal Care;others 7 (1.2%)
    Clothing and Fashion 106 (18%)
    Clothing and Fashion;Home and Kitchen 27 (4.5%)
    Clothing and Fashion;Home and Kitchen;others 16 (2.7%)
    Clothing and Fashion;others 14 (2.3%)
    Groceries and Gourmet Food 14 (2.3%)
    Groceries and Gourmet Food;Beauty and Personal Care 7 (1.2%)
    Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion 10 (1.7%)
    Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion;Home and Kitchen 14 (2.3%)
    Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion;Home and Kitchen;others 32 (5.3%)
    Groceries and Gourmet Food;Beauty and Personal Care;Clothing and Fashion;others 1 (0.2%)
    Groceries and Gourmet Food;Beauty and Personal Care;Home and Kitchen 4 (0.7%)
    Groceries and Gourmet Food;Beauty and Personal Care;others 3 (0.5%)
    Groceries and Gourmet Food;Clothing and Fashion 6 (1.0%)
    Groceries and Gourmet Food;Clothing and Fashion;Home and Kitchen 4 (0.7%)
    Groceries and Gourmet Food;Clothing and Fashion;Home and Kitchen;others 3 (0.5%)
    Groceries and Gourmet Food;Clothing and Fashion;others 2 (0.3%)
    Groceries and Gourmet Food;Home and Kitchen 5 (0.8%)
    Groceries and Gourmet Food;Home and Kitchen;others 6 (1.0%)
    Home and Kitchen 24 (4.0%)
    Home and Kitchen;others 9 (1.5%)
    others 48 (8.0%)
1 n (%)

Questions

Question 1

In our first question “What are the factors influencing the customer’s decision to abandon a purchase in their cart on Amazon?” we are attempting to understand the reasons behind the customer abandoning the purchase in their cart for increasing the conversion rate(the percentage of users who actually complete a purchase) for amazon. It will also help us in enhancing the customer experience by making the application or the website more user-friendly and intuitive so that the user is able to find the right product and proceed to complete his purchase in an effortless manner.

Question 2

For our second question “Which demographic (on the basis of gender and age) is most likely to purchase a particular product category?” we attempt to determine the demographic which is most likely to purchase a particular product category on the basis of their age and gender which will help companies to tailor their marketing strategies so that their messages are able to reach the right group of customers, leading to cost optimization of their marketing budget. By identifying the right demographic to target amazon can gain a competitive advantage by attracting a larger share of the target audience. This will also lead to higher levels of customer satisfaction which will result in customer loyalty and recurring purchases.

Analysis plan

Approach for question 1

Our analysis will commence with a precise exploratory data analysis (EDA) using R, where we’ll focus on key variables such as Purchase_Frequency, Product_Search_Method, and Customer_Reviews_Importance, to unearth their potential impact on cart abandonment. During the data cleaning stage, we will ensure that columns like age, Gender, and Browsing_Frequency are correctly formatted and free of missing values. Subsequently, statistical techniques—such as logistic regression or decision trees—will be applied to assess the influence of factors like Personalized_Recommendation_Frequency and Search_Result_Exploration on cart abandonment behavior.

For visualization, we will harness the capabilities of R’s ggplot2 package to create comprehensive and interpretable graphics, such as correlation heat-maps and bar charts, showcasing the interplay between cart abandonment and various customer behavior metrics. This meticulous approach is designed to provide us with robust insights, empowering us to enhance the shopping experience and curtail cart abandonment rates effectively.

Approach for question 2

We will further continue our analysis where we will be focusing majorly on the key features such as age, Gender, Purchase_Categories and Cart_Completion_Frequency to predict which product is of particular interest to a demography. Again, we will make sure the data is clean for the concerned features and address any missing or incorrect data.

Post that we will identify important summary statistics for key variables such as Purchase_Categories. Once the data is clean, we will employ Statistical Analysis tools to assess the likelihood of purchasing product categories based on demographic variables (age and gender). For visualization, we will be using heat-maps and bar charts to illustrate the relationship between the target variables.

Variables of focus for both questions

Variable Description
Purchase_Frequency How frequently does the user make purchases on Amazon?
Product_Search_Method How does the user search for products on Amazon?
Customer_Reviews_Importance How important are customer reviews in users decision-making process?
age age
Gender gender
Browsing_Frequency How often does the user browse Amazon’s website or app?
Purchase_Categories What product categories does the user typically purchase on Amazon?
Cart_Completion_Frequency How often do user complete the purchase after adding products to their cart?
Personalized_Recommendation_Frequency How often do user receive personalized product recommendations from Amazon?
Search_Result_Exploration Does the user tend to explore multiple pages of search results or focus on the first page?

Organization

Plan of Attack

Week Weekly Tasks Persons in Charge Backup
until November 8th Explore and finalize the dataset and the problem statements Everyone Everyone
- Complete the proposal and assign some high-level tasks Everyone Everyone
November 9th to 15th Exploratory Data Analysis Shashwat Bharath
- Data cleaning and Data pre-processing based on EDA Bharath Pranshu
- Question specific exploration and identify initial trends and patterns Joel Vishal
November 16th to 22nd Model training for Q1 Vishal Shashwat
- Model training for Q2 Pranshu Joel
November 23rd to 29th Continue Model training and testing for Q1 and Q2 Vishal Pranshu
- Improving the models if there is a need Joel Bharath
November 30th to December 6th Refining the code for code review with comments Bharath Vishal
- Generate insights from the model output Shashwat Joel
December 7th to 13th Review the generated models Pranshu Shashwat
- Write-up and presentation for the project Everyone Everyone

Repo Organization

The following are the folders involved in the Project repository.

  • ‘data/’: Used for storing any necessary data files for the project, such as input files.

  • ‘images/’: Used for storing image files used in the project.

  • ‘presentation_files/’: Folder for having presentation related files.

  • ‘_extra/’: Used to brainstorm our analysis which won’t impact our project workflow.

  • ‘_freeze/’: This folder is used to store the generated files during the build process. These files represent the frozen state of the website at a specific point in time.

  • ‘_site/’: Folder used to store the generated static website files after the site generator processes the quarto document.

  • ‘.github/’: Folder for storing github templates and workflow.

We will be creating few folders inside images/ folder for storing question specific images and presentation related images which are generated through out the plot. We will be creating images/Q1, images/Q2 and images/Presentation for those respective files.

Note:

These are the planned approaches, and we intend to explore and solve the problem statement which we came up with. Parts of our approach might change in the final implementation.