Demand Forecasting using Xgboost
Parag Verma
27th Dec, 2022
Introduction
All industries either sell product or services or both.In the case of products, there is always a constant question around how much to produce. Successful anticipation of demand helps to remove uncertainty in business operations and enables to streamline operations. Its impact on total revenue and profit is also profound as the process of predicting demand helps to remove guess work from daily operations thereby reducing cost and increasing service levels
To get a hang of things, we will look at how to leverage the data for liquor sales and create a forecasting model. We will be using xgboost model to forecast as it is more efficient than some of the classical models such as ARIMA, SARIMA, ARIMAX etc.
For our analysis, we will be using the liquor_sales data set from ialiquor library.
Step 0: Installing libraries
package.name<-c("dplyr","lubridate","openxlsx","tidyr",
"ialiquor","stringr",
"mlr", # for hyper parameter tuning
"xgboost")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
Step 1: Importing the dataset
Importing the liquor data set
data("liquor_sales")
df<-liquor_sales
head(df)
# A tibble: 6 × 10
year year_month county popul…¹ type categ…² state…³ state…⁴ bottl…⁵
<dbl> <dttm> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2015 2015-01-01 00:00:00 adair 7145 vodka 100 pr… 254. 381. 54
2 2015 2015-01-01 00:00:00 adair 7145 other americ… 54 81 6
3 2015 2015-01-01 00:00:00 adair 7145 liqu… americ… 89.0 134. 18
4 2015 2015-01-01 00:00:00 adair 7145 cock… americ… 182. 277. 26
5 2015 2015-01-01 00:00:00 adair 7145 liqu… americ… 346. 519. 36
6 2015 2015-01-01 00:00:00 adair 7145 gin americ… 99.6 149. 24
# … with 1 more variable: volume <dbl>, and abbreviated variable names
# ¹population, ²category, ³state_cost, ⁴state_revenue, ⁵bottles_sold
The attributes are as follows:
- year:Year of sales
- year_month: time of sale in year and month
- county:Name of the county
- population:Total population in county
- type:Type of liquor
- category:Category of liquor
- state_cost:Total cost to the state for preparing the liquor
- state_revenue:Total revenue obtained by selling the volume of liquor
- bottles_sold:Number of bottles sold
- volume:Total volume sold
For this blog, we will focus on predicting the total volume sold and try to create a model with rest of the variables available in the model
If you unable to import the data, then you can download the liquor_sales dataset form my github repo repository.
Step 2: Freq profiling
If you look at the data set, then there are only a few selected fields that we can leverage for forecasting. To predict volume for future months, we would need the future values of independent variables. Variables such as cost, revenue and population are non-deterministic and hence cant be taken into model building.Instead, we will focus on using time features such as year, month and categorical attributes such as county and type.
Lets look at the distribution of these variables using freq profiling
# Converting year month to categorical data for profiling purposes
df$year_month2<-str_c("_",df$year_month)
# Independent variables
nm<-c("year","year_month2","county","type","category")
# Dependent variables
dv<-"volume"
df.interim<-df%>%
select(c(nm,dv))
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(nm)
# Now:
data %>% select(all_of(nm))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(dv)
# Now:
data %>% select(all_of(dv))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
head(df.interim)
# A tibble: 6 × 6
year year_month2 county type category volume
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 2015 _2015-01-01 adair vodka 100 proof vodka 58.5
2 2015 _2015-01-01 adair other american alcohol 4.5
3 2015 _2015-01-01 adair liqueur american amaretto 19.5
4 2015 _2015-01-01 adair cocktail american cocktails 45.5
5 2015 _2015-01-01 adair liqueur american cordials & liqueur 27
6 2015 _2015-01-01 adair gin american dry gins 15
Looking the total number of records for each level of a variable along with total volume across it
l1<-lapply(nm,function(x){
z<-df%>%
select(x,dv)
colnames(z)[1]<-"Level"
z1<-z%>%
group_by(Level)%>%
summarise(Total=n(),
Total_Volume=sum(volume))%>%
ungroup()%>%
mutate(Feature=x)%>%
select(Feature,Level,everything())
return(z1)
})
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(x)
# Now:
data %>% select(all_of(x))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
df.final<-do.call(rbind.data.frame,l1)
head(df.final)
# A tibble: 6 × 4
Feature Level Total Total_Volume
<chr> <chr> <int> <dbl>
1 year 2015 59064 19609391.
2 year 2016 53129 19528006.
3 year 2017 43487 20670621.
4 year 2018 43830 21868138.
5 year 2019 43961 22262339.
6 year 2020 37265 19985800.
write.xlsx(df.final,"Freq_profile.xlsx")
Now if you pivot the data from the excel output and create the chart, it will somthing like the following charts
There is variation in total liquor consumption across years with the volume peaking around 2018 and 2019 periods. We should definitely take year into our model as it would be indicative of the trend in consumption
Monthly demand alternates through cycles of increase and decrease. More or less it settles around the 1700k mark as shown in the graph.This variable is also a good indicator of seasonality
Also if you look closely, the date values correspond to first of every month.Hence taking the week information from the year_month field will not make sense as the week that we get out of it will be like 1,5,9,13 and so on. Hence we can see that the weeks value will not be available for all the 52 weeks of a given year
Most the consumption is done in the top 10 counties.Hence for our model, we can just focus on the top 10 counties.
Whisky along with Vodka sells the most.In order ot develop the model for a given product, lets focus on Vodka initially.The modelling concepts that we develop for whiskey can be extended to vodka as well
Step 3: Derive Seasonality: An illustration
Seasonality is something that repeats regulary for each time period.In our case, it could be some pattern that repeats every year.Now we have already seen from the plot of year_month that the volume goes through a cycle of increase and decrease. What we intend to do here is to derive that pattern for every county across a single product(Vodka).For illustration, lets look at polk county and Vodka
polk county and Vodka
polk.df<-df%>%
select(year,year_month,county,category,volume)%>%
mutate(Month_Nm = month(year_month))%>% # extracting the month info
filter(county %in% c("polk"),
category=="american vodka")%>%
arrange(year,Month_Nm)%>%
filter(year != 2016) # removing 2016 data so that we start from Jan 2017
head(polk.df)
# A tibble: 6 × 6
year year_month county category volume Month_Nm
<dbl> <dttm> <chr> <chr> <dbl> <dbl>
1 2017 2017-01-01 00:00:00 polk american vodka 64247. 1
2 2017 2017-02-01 00:00:00 polk american vodka 73150. 2
3 2017 2017-03-01 00:00:00 polk american vodka 77927. 3
4 2017 2017-04-01 00:00:00 polk american vodka 81923. 4
5 2017 2017-05-01 00:00:00 polk american vodka 85994. 5
6 2017 2017-06-01 00:00:00 polk american vodka 102798. 6
We will be converting the volume into time series and then decompose the time series to get trend, seasonality and random components
# Converting volume into a time series
ts_val<-ts(polk.df$volume,frequency = 12)
# Decomposing into trend, seasonality and random components
plot(decompose(ts_val))
Observations from the Plot:
- The seasonality component repeats for each time period
- There is an upward trend in volume.We will use the year column as a proxy for trend
- Observed and random components we dont want in our model
# Extracting the seasonality component from the decomposed part
ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)
# Normalizing the component so that it becomes a ratio instead of absolute values
seas_factor<-ts_seasonal/max(ts_seasonal)
seas_factor
[1] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[7] 0.2120918 0.9767894 -0.6340173 0.0751423 0.1862966 -0.3123180
[13] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[19] 0.2120918 0.9767894 -0.6340173 0.0751423 0.1862966 -0.3123180
[25] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[31] 0.2120918 0.9767894 -0.6340173 0.0751423 0.1862966 -0.3123180
[37] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[43] 0.2120918 0.9767894 -0.6340173 0.0751423
Adding the column into the data frame for polk
polk.df%>%
mutate(Seasonality=seas_factor)
# A tibble: 46 × 7
year year_month county category volume Month_Nm Seasonality
<dbl> <dttm> <chr> <chr> <dbl> <dbl> <dbl>
1 2017 2017-01-01 00:00:00 polk american vodka 64247. 1 -0.795
2 2017 2017-02-01 00:00:00 polk american vodka 73150. 2 -0.662
3 2017 2017-03-01 00:00:00 polk american vodka 77927. 3 -0.336
4 2017 2017-04-01 00:00:00 polk american vodka 81923. 4 -0.164
5 2017 2017-05-01 00:00:00 polk american vodka 85994. 5 1
6 2017 2017-06-01 00:00:00 polk american vodka 102798. 6 0.453
7 2017 2017-07-01 00:00:00 polk american vodka 84088. 7 0.212
8 2017 2017-08-01 00:00:00 polk american vodka 98003. 8 0.977
9 2017 2017-09-01 00:00:00 polk american vodka 81482. 9 -0.634
10 2017 2017-10-01 00:00:00 polk american vodka 89401. 10 0.0751
# … with 36 more rows
We saw how we were able to create the seasonality column for one county and product.Now the goal is to create seasonality column for more counties.
Step 4: Derive Seasonality for more counties
cnty<-c('polk','linn','scott',
'johnson','black hawk','woodbury',
'dubuque','story','pottawatta',
'dallas','cerro gord','dickinson',
'clinton','des moines','lee',
'webster','muscatine',
'marshall','warren','wapello')
# Running through data for various counties to get seasonality
l2<-list()
l2<-lapply(cnty,function(x){
county.df<-df%>%
select(year,year_month,county,category,volume)%>%
mutate(Month_Nm = month(year_month))%>% # extracting the month info
filter(county %in% x,
category=="american vodka")%>%
arrange(year,Month_Nm)%>%
filter(year != 2016) # removing 2016 data so that we start from Jan 2017
# Converting volume into a time series
ts_val<-ts(county.df$volume,frequency = 12)
# Decomposing into trend, seasonality and random components
plot(decompose(ts_val))
# Extracting the seasonality component from the decomposed part
ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)
# Normalizing the component so that it becomes a ratio instead of absolute values
seas_factor<-ts_seasonal/max(ts_seasonal)
seas_df<-county.df%>%
mutate(Seasonality=seas_factor)
return(seas_df)
})