Tuesday, October 27, 2020
Saturday, October 17, 2020
Blog 38: Survival Analyses in R
Survival Analyses on Retail Coupon Redemption
Parag Verma
Introduction
Brand promotions are a common sight in retail stores. Each customer is targeted with product information and is usually incentivised with a coupon for participating in the activity. Coupons are rolled out to many individuals, so it becomes important to know how they were redeemed over time. This helps us understand whether the campaign was able to drive brand awareness and generate revenue. It also implies that coupons which were never redeemed, or which were redeemed late, missed the bus. This blog focuses on identifying which customer features impacted coupon redemption, using survival analysis plots.
Survival analysis is mostly used in clinical studies to determine the probability of survival within cohorts of patients after receiving a drug treatment. The same idea carries over to a marketing setup: each respondent (patient cohort) gets a coupon (drug), and how he redeems the coupon is studied to gauge campaign effectiveness. Many feel that because survival analysis presents its results as a plot, it is a descriptive statistic. Through this blog I would like to state that survival analysis is an inferential technique: it uses the log-rank test to test the null hypothesis that the survival curves are the same across the different levels (unique values) of a categorical variable.
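To make the inferential side concrete, here is a minimal sketch of a log-rank test on simulated data. The two segments, their redemption rates and the object names (sim, lr, p_value) are made up for illustration and do not come from the campaign dataset; survdiff() from the survival package performs the test.

```r
library(survival)

set.seed(42)  # fixed seed so the simulated data are reproducible

# Hypothetical data: days until redemption for two made-up customer segments
sim <- data.frame(
  days    = c(rexp(60, rate = 1 / 20), rexp(60, rate = 1 / 45)),
  event   = 1,                          # every simulated coupon is redeemed
  segment = rep(c("A", "B"), each = 60)
)

# Log-rank test: the null hypothesis is that both segments share one survival curve
lr <- survdiff(Surv(days, event) ~ segment, data = sim)

# The test statistic is chi-squared with (number of groups - 1) degrees of freedom
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)

print(lr)
p_value
```

A small p_value would lead us to reject the null hypothesis of identical curves, which is the same p value that ggsurvplot() prints when pval = TRUE.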
Step 1: Installing libraries
For this blog, we will install the completejourney library, which houses the campaign dataset for 2469 households (cohort) that participated in the study. The survival and survminer packages will be used to perform the survival analyses
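# Packages used in this blog
package.name <- c("dplyr", "tidyr", "completejourney", "survival", "survminer")
for (i in package.name) {
  # Install the package only if it is not already available
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}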
Step 2: Importing the dataset
We will import the coupon redemption dataset from the completejourney package. It has 4 columns, namely:
- household_id: Uniquely identifies each household
- coupon_upc: Uniquely identifies each coupon (unique to household and campaign)
- campaign_id: Uniquely identifies each campaign
- redemption_date: Date when the coupon was redeemed
In the subsequent section, we will merge this dataset with the demographic data to get additional insights into customer behaviour
interim.df<-coupon_redemptions
head(interim.df)
# A tibble: 6 x 4
household_id coupon_upc campaign_id redemption_date
<chr> <chr> <chr> <date>
1 1029 51380041013 26 2017-01-01
2 1029 51380041313 26 2017-01-01
3 165 53377610033 26 2017-01-03
4 712 51380041013 26 2017-01-07
5 712 54300016033 26 2017-01-07
6 2488 51200092776 26 2017-01-10
Step 3: Calculating the redemption time
Taking 1st Jan 2017 as the starting date, we will now calculate the difference in time between the starting date and the redemption date. This gives us the number of days that elapsed before the coupon was used. We will also join it with the demographics table to get the household-related features
# Importing the demographics dataset
demo.df <- demographics
final.df <- interim.df %>%
  # Days elapsed between the starting date and the redemption date
  mutate(diff_days = as.numeric(redemption_date - as.Date("2017-01-01"))) %>%
  # Event indicator: 1 for every record
  mutate(CNSR = 1) %>%
  left_join(demo.df, by = "household_id")
head(final.df)
# A tibble: 6 x 13
household_id coupon_upc campaign_id redemption_date diff_days CNSR age
<chr> <chr> <chr> <date> <dbl> <dbl> <ord>
1 1029 513800410~ 26 2017-01-01 0 1 <NA>
2 1029 513800413~ 26 2017-01-01 0 1 <NA>
3 165 533776100~ 26 2017-01-03 2 1 55-64
4 712 513800410~ 26 2017-01-07 6 1 65+
5 712 543000160~ 26 2017-01-07 6 1 65+
6 2488 512000927~ 26 2017-01-10 9 1 45-54
# ... with 6 more variables: income <ord>, home_ownership <ord>,
# marital_status <ord>, household_size <ord>, household_comp <ord>,
# kids_count <ord>
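The <NA> in the age column for household 1029 is what a left join produces when a household has no matching row in the demographics table. The sketch below reproduces that behaviour on toy stand-ins for the two tables; the data frames redemptions and demo are hypothetical and only mimic the shape of the real data.

```r
library(dplyr)

# Toy stand-ins for coupon_redemptions and demographics (illustrative values only)
redemptions <- data.frame(
  household_id    = c("1029", "165", "712"),
  redemption_date = as.Date(c("2017-01-01", "2017-01-03", "2017-01-07"))
)
demo <- data.frame(
  household_id = c("165", "712"),   # household 1029 is deliberately absent
  age          = c("55-64", "65+")
)

# left_join keeps every redemption; unmatched households get NA demographics
joined <- left_join(redemptions, demo, by = "household_id")
joined

# Share of rows with no demographic information
mean(is.na(joined$age))
```

Rows with a missing grouping variable are, under R's default na.action, left out when a model formula is fitted, so it is worth checking this share before comparing groups.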
It can be seen that a CNSR variable was added to the final.df data frame. It is passed as the event argument of Surv() and records whether the event of interest (redemption) was actually observed for a record. Records that were not tracked to the end, or for which no follow-up information is available, are said to be censored. CNSR can take two values:
- '0': The record is censored. The redemption was not observed, so the record contributes to the analysis only up to its last known time
- '1': The redemption was observed at the recorded time
In this particular example every row of coupon_redemptions is an actual redemption, so CNSR is a housekeeping variable and is set to 1 for all the records
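For contrast, here is a minimal sketch of how the event indicator would be built if unredeemed coupons were also part of the data. The coupons data frame and the study_end date are assumptions made purely for illustration; the completejourney table used in this blog contains only redeemed coupons.

```r
library(survival)

start_date <- as.Date("2017-01-01")
study_end  <- as.Date("2017-12-31")   # hypothetical end of follow-up

# Toy data: the third coupon was never redeemed
coupons <- data.frame(
  coupon          = c("c1", "c2", "c3"),
  redemption_date = as.Date(c("2017-01-03", "2017-01-10", NA))
)

# 1 = redemption observed, 0 = censored at the end of follow-up
coupons$event <- as.integer(!is.na(coupons$redemption_date))

# Censored coupons are followed until study_end
end_date <- coupons$redemption_date
end_date[is.na(end_date)] <- study_end
coupons$time <- as.numeric(end_date - start_date)

# Censored times print with a trailing "+"
Surv(coupons$time, coupons$event)
```

The unredeemed coupon still enters the analysis with a time of 364 days, but it is marked as censored rather than as a redemption.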
Step 4: Coupon redemption by household_comp
Let's look at how different household compositions differ in terms of coupon redemption
# Kaplan-Meier curves of time to redemption, one per household composition
fit1 <- survfit(Surv(diff_days, CNSR) ~ household_comp, data = final.df)
# pval = TRUE prints the log-rank p value on the plot
ggsurvplot(fit1, data = final.df, pval = TRUE,
           legend = "right")
Insights from the graph:
- The graph starts from a probability of 1 (which means that the probability of redemption is 0 at the start)
- As we move along Time (X-axis), the probability reduces (which essentially means the probability of having redeemed the coupon increases)
- A curve that reaches 0 on the Y-axis quickly belongs to a group with a greater propensity to redeem
- We can see from the plot that the various household compositions are more or less similar in the way they redeem coupons
- This suggests that household composition is unlikely to be a useful metric while designing campaigns
- A p value of 0.41 on the graph means the log-rank test found no statistically significant difference between household compositions in the time taken to redeem coupons
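The way the curve is read can be checked by hand on a tiny example. The sketch below uses five made-up redemption times (not taken from the campaign data) and shows the Kaplan-Meier estimate next to its complement, the probability of having redeemed by that day.

```r
library(survival)

# Hypothetical redemption times in days; every coupon is redeemed (event = 1)
days <- c(2, 4, 4, 7, 10)
fit  <- survfit(Surv(days, rep(1, 5)) ~ 1)
km   <- summary(fit)

# still_unredeemed is the Y-axis of the survival plot
data.frame(
  day              = km$time,
  still_unredeemed = km$surv,       # 0.8, 0.4, 0.2, 0.0
  redeemed_by_then = 1 - km$surv    # 0.2, 0.6, 0.8, 1.0
)
```

With no censoring, the estimate is simply the share of coupons not yet redeemed: one of five is redeemed by day 2 (0.8 remain), three of five by day 4 (0.4 remain), and so on until the curve reaches 0.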
Now we will perform the same step for one more variable to drive home the method used:
Step 5: Coupon redemption by age
# Kaplan-Meier curves of time to redemption, one per age bracket
fit1 <- survfit(Surv(diff_days, CNSR) ~ age, data = final.df)
# pval = TRUE prints the log-rank p value on the plot
ggsurvplot(fit1, data = final.df, pval = TRUE,
           legend = "right")
We can clearly see that individuals in the age bracket 19-24 (represented by the red line) tend to redeem their coupons fairly quickly in comparison to folks from other age brackets. Also, the difference between the age brackets is significant, as suggested by a p value of 0.0011. Hence it can be concluded that age impacts the propensity to redeem coupons and can be made an important indicator while designing campaigns.
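A significant overall p value only says that at least one group differs from the others. One possible follow-up, sketched here on simulated data, is to run the log-rank test for each pair of groups and adjust the p values for multiple comparisons. The brackets, rates and object names below are assumptions for illustration and are not results from the campaign data; survminer also offers pairwise_survdiff() for the same purpose.

```r
library(survival)

set.seed(7)  # fixed seed so the simulated data are reproducible

# Hypothetical redemption times for three made-up age brackets
sim <- data.frame(
  days  = c(rexp(40, 1 / 15), rexp(40, 1 / 40), rexp(40, 1 / 45)),
  event = 1,
  age   = rep(c("19-24", "25-34", "35-44"), each = 40)
)

# All pairs of brackets
pairs <- combn(unique(sim$age), 2, simplify = FALSE)

# Log-rank p value for each pair (two groups, so 1 degree of freedom)
raw_p <- vapply(pairs, function(pr) {
  sub <- sim[sim$age %in% pr, ]
  lr  <- survdiff(Surv(days, event) ~ age, data = sub)
  pchisq(lr$chisq, df = 1, lower.tail = FALSE)
}, numeric(1))
names(raw_p) <- vapply(pairs, paste, character(1), collapse = " vs ")

# Benjamini-Hochberg adjustment for multiple comparisons
adj_p <- p.adjust(raw_p, method = "BH")
adj_p
```

Pairs with a small adjusted p value are the ones whose redemption curves differ, which helps decide which brackets deserve a separate campaign design.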
Link to Previous R Blogs
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html