Machine Learning Made Easy: October 2020

Survival Analyses on Retail Coupon Redemption

Introduction

Brand promotions are a common sight in retail stores. Each customer is targeted with product information and is normally incentivised with a coupon for participating in the activity. Once coupons have been rolled out to different individuals, it becomes important to know how they were redeemed over time. This helps us understand whether the campaign was able to drive brand awareness and generate revenue. Coupons that were never redeemed, or were redeemed late, suggest the campaign missed the bus with those customers. This blog focuses on identifying which customer features impacted coupon redemption, using survival analysis plots.

Survival analysis is mostly used in clinical studies to determine the probability of survival within cohorts of patients after receiving a drug treatment. The same idea carries over to a marketing setup: each respondent (patient) gets a coupon (drug), and how long the respondent takes to redeem it is studied to judge campaign effectiveness. Because survival analysis presents its results as a plot, many assume it is a descriptive statistic. Through this blog I would like to show that survival analysis is inferential: it uses the log-rank test to test the null hypothesis that the survival curves are identical across the different levels (unique values) of a categorical variable.
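To make the log-rank test concrete, here is a minimal sketch using survdiff() from the survival package. The toy data frame and its column names (days, redeemed, segment) are hypothetical and only for illustration; they are not part of the completejourney data.

```r
library(survival)

# Hypothetical toy data: days until redemption for two customer segments.
# redeemed = 1 means the redemption (the event) was observed.
toy <- data.frame(
  days     = c(2, 4, 5, 7, 9, 12, 15, 20, 22, 30),
  redeemed = rep(1, 10),
  segment  = rep(c("A", "B"), each = 5)
)

# Log-rank test: the null hypothesis is that both segments share one survival curve
lr <- survdiff(Surv(days, redeemed) ~ segment, data = toy)
print(lr)

# p-value from the chi-square statistic, with (number of groups - 1) degrees of freedom
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
print(p_value)
```

A small p-value is evidence against the null hypothesis of identical curves. A large one only means the test found no detectable difference.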

Step 1: Installing libraries

For this blog, we will install the completejourney library, which houses the campaign dataset for 2469 households (cohort) that participated in the study. The survival and survminer packages will be used to perform the survival analysis.

package.name <- c("dplyr", "tidyr", "completejourney", "survival", "survminer")

# Install any package that is missing, then attach it
for (i in package.name) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}

Step 2: Importing the dataset

We will import the coupon redemption dataset from the completejourney package. It has 4 columns, namely:

household_id: Uniquely identifies each household
coupon_upc: Uniquely identifies each coupon (unique to household and campaign)
campaign_id: Uniquely identifies each campaign
redemption_date: Date when the coupon was redeemed

In the subsequent section, we will merge this dataset with the demographic data to get additional insights into customer behaviour.

interim.df<-coupon_redemptions
head(interim.df)

# A tibble: 6 x 4
  household_id coupon_upc  campaign_id redemption_date
  <chr>        <chr>       <chr>       <date>         
1 1029         51380041013 26          2017-01-01     
2 1029         51380041313 26          2017-01-01     
3 165          53377610033 26          2017-01-03     
4 712          51380041013 26          2017-01-07     
5 712          54300016033 26          2017-01-07     
6 2488         51200092776 26          2017-01-10

Step 3: Calculating the redemption time

Taking 1st Jan 2017 as the starting date, we will now calculate the difference in time between the starting date and the redemption date. This gives the number of days that elapsed before the coupon was used. We will also join it with the demographics table to get the household-related features.

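# Importing the demographics dataset
demo.df <- demographics

final.df <- interim.df %>%
  # Days elapsed between the starting date and the redemption date
  mutate(diff_days = as.numeric(redemption_date - as.Date("2017-01-01"))) %>%
  # Event indicator: every row in this dataset is a redeemed coupon
  mutate(CNSR = 1) %>%
  # Attach household-level demographic features
  left_join(demo.df, by = "household_id")

head(final.df)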

# A tibble: 6 x 13
  household_id coupon_upc campaign_id redemption_date diff_days  CNSR age  
  <chr>        <chr>      <chr>       <date>              <dbl> <dbl> <ord>
1 1029         513800410~ 26          2017-01-01              0     1 <NA> 
2 1029         513800413~ 26          2017-01-01              0     1 <NA> 
3 165          533776100~ 26          2017-01-03              2     1 55-64
4 712          513800410~ 26          2017-01-07              6     1 65+  
5 712          543000160~ 26          2017-01-07              6     1 65+  
6 2488         512000927~ 26          2017-01-10              9     1 45-54
# ... with 6 more variables: income <ord>, home_ownership <ord>,
#   marital_status <ord>, household_size <ord>, household_comp <ord>,
#   kids_count <ord>

It can be seen that a CNSR variable was added to the final.df data frame. It is passed as the event argument of Surv() and flags records that were not tracked to the end or for which there is no follow-up information. CNSR can take two values:

'0': The individual was censored; the event was not observed, so the record contributes to the analysis only up to its last known time
'1': The event (redemption) was observed for the individual

For this particular example, this is a housekeeping variable: coupon_redemptions only lists coupons that were redeemed, so CNSR is set to 1 for all the records.
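The effect of the event indicator can be seen in a small sketch. The vectors below are hypothetical toy values, not taken from the dataset, and they include one censored record purely to show how Surv() and survfit() treat it.

```r
library(survival)

# Hypothetical toy data: three coupons, the second one censored
days  <- c(5, 8, 12)
event <- c(1, 0, 1)   # 1 = redeemed, 0 = censored (redemption never observed)

toy_surv <- Surv(time = days, event = event)
print(toy_surv)       # censored times are printed with a trailing "+"

# Kaplan-Meier estimate for the single toy group
km <- survfit(toy_surv ~ 1)
data.frame(time = km$time, at_risk = km$n.risk, survival = km$surv)
```

The censored coupon does not move the curve at day 8, but it still counts in the at-risk set up to that day.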

Step 4: Coupon redemption by household_comp

Let's look at how different household compositions differ in terms of coupon redemption.

surv_object <- Surv(time = final.df$diff_days, event = final.df$CNSR)

fit1 <- survfit(surv_object ~ household_comp, data = final.df)
ggsurvplot(fit1, data = final.df, pval = TRUE,
           legend = "right")

Insights from the graph:

The graph starts from a survival probability of 1 (no coupon has been redeemed yet, so the probability of redemption is 0 at the start)
As we move along Time (X-axis), the survival probability drops (which means the probability of a coupon having been redeemed increases)
A curve that reaches 0 on the Y-axis quickly belongs to a group with a greater propensity to redeem
We can see from the plot that the various household compositions are more or less similar in the way they redeem coupons.
- This suggests that household composition is not a useful metric while designing campaigns
A p-value of 0.41 on the graph means the log-rank test found no statistically significant difference between household compositions in the time taken to redeem coupons
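Since CNSR is 1 for every record, the Kaplan-Meier curve that survfit() draws is simply one minus the share of coupons already redeemed. The sketch below checks this on hypothetical toy redemption times (not taken from the dataset).

```r
library(survival)

# Hypothetical redemption times in days; every coupon is redeemed (event = 1)
days <- c(2, 6, 6, 9, 15)
km <- survfit(Surv(days, rep(1, length(days))) ~ 1)

# Share of coupons already redeemed by each observed time
redeemed_share <- ecdf(days)(km$time)

# survival and redeemed add up to 1 at every time point
data.frame(time = km$time, survival = km$surv, redeemed = redeemed_share)
```

This is why a curve that falls faster corresponds to quicker redemption.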

Now we will perform the same step for one more variable to drive home the method used:

Step 5: Coupon redemption by age

surv_object <- Surv(time = final.df$diff_days, event = final.df$CNSR)

fit1 <- survfit(surv_object ~ age, data = final.df)
ggsurvplot(fit1, data = final.df, pval = TRUE,
           legend = "right")

We can clearly see that individuals in the age bracket 19-24 (represented by the red line) tend to redeem their coupons fairly quickly in comparison to folks from other age brackets. The difference between the age brackets is also statistically significant, as suggested by a p-value of 0.0011. Hence it can be concluded that age impacts the propensity to redeem coupons and can be made an important indicator while designing campaigns.
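One way to put a number on "fairly quickly" is the median redemption time per group, which summary() reports for a survfit object. The sketch below uses a hypothetical toy data frame that only mimics the column names of final.df; the values are made up for illustration.

```r
library(survival)

# Hypothetical toy data with the same column names as final.df
toy <- data.frame(
  diff_days = c(3, 5, 8, 10, 20, 30),
  CNSR      = 1,
  age       = rep(c("19-24", "25-34"), each = 3)
)

fit <- survfit(Surv(diff_days, CNSR) ~ age, data = toy)

# Median time to redemption for each age bracket
medians <- summary(fit)$table[, "median"]
print(medians)
```

Swapping toy for final.df should give the median days to redemption for each real age bracket, assuming the columns are named as in Step 3.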

Link to Previous R Blogs

https://www.aimlmadeeasy.com/2020/06/r-complete-guide.html

List of Datasets for Practise

https://hofmann.public.iastate.edu/data_in_r_sortable.html

https://vincentarelbundock.github.io/Rdatasets/datasets.html


Saturday, October 17, 2020

Blog 38: Survival Analyses in R