Sunday, March 5, 2023

Demand Forecasting using Xgboost Model

Demand Forecasting using Xgboost


Introduction

All industries either sell product or services or both.In the case of products, there is always a constant question around how much to produce. Successful anticipation of demand helps to remove uncertainty in business operations and enables to streamline operations. Its impact on total revenue and profit is also profound as the process of predicting demand helps to remove guess work from daily operations thereby reducing cost and increasing service levels

To get a hang of things, we will look at how to leverage the data for liquor sales and create a forecasting model. We will be using xgboost model to forecast as it is more efficient than some of the classical models such as ARIMA, SARIMA, ARIMAX etc.


For our analysis, we will be using the liquor_sales data set from ialiquor library.


Step 0: Installing libraries

package.name<-c("dplyr","lubridate","openxlsx","tidyr",
                "ialiquor","stringr",
                "mlr", # for hyper parameter tuning
                "xgboost")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}


Step 1: Importing the dataset

Importing the liquor data set

data("liquor_sales")
df<-liquor_sales
head(df)
# A tibble: 6 × 10
   year year_month          county popul…¹ type  categ…² state…³ state…⁴ bottl…⁵
  <dbl> <dttm>              <chr>    <dbl> <chr> <chr>     <dbl>   <dbl>   <dbl>
1  2015 2015-01-01 00:00:00 adair     7145 vodka 100 pr…   254.     381.      54
2  2015 2015-01-01 00:00:00 adair     7145 other americ…    54       81        6
3  2015 2015-01-01 00:00:00 adair     7145 liqu… americ…    89.0    134.      18
4  2015 2015-01-01 00:00:00 adair     7145 cock… americ…   182.     277.      26
5  2015 2015-01-01 00:00:00 adair     7145 liqu… americ…   346.     519.      36
6  2015 2015-01-01 00:00:00 adair     7145 gin   americ…    99.6    149.      24
# … with 1 more variable: volume <dbl>, and abbreviated variable names
#   ¹​population, ²​category, ³​state_cost, ⁴​state_revenue, ⁵​bottles_sold

The attributes are as follows:

  • year:Year of sales
  • year_month: time of sale in year and month
  • county:Name of the county
  • population:Total population in county
  • type:Type of liquor
  • category:Category of liquor
  • state_cost:Total cost to the state for preparing the liquor
  • state_revenue:Total revenue obtained by selling the volume of liquor
  • bottles_sold:Number of bottles sold
  • volume:Total volume sold

For this blog, we will focus on predicting the total volume sold and try to create a model with rest of the variables available in the model

If you unable to import the data, then you can download the liquor_sales dataset form my github repo repository.


Step 2: Freq profiling

If you look at the data set, then there are only a few selected fields that we can leverage for forecasting. To predict volume for future months, we would need the future values of independent variables. Variables such as cost, revenue and population are non-deterministic and hence cant be taken into model building.Instead, we will focus on using time features such as year, month and categorical attributes such as county and type.

Lets look at the distribution of these variables using freq profiling

# Converting year month to categorical data for profiling purposes
df$year_month2<-str_c("_",df$year_month)

# Independent variables
nm<-c("year","year_month2","county","type","category")

# Dependent variables
dv<-"volume"

df.interim<-df%>%
  select(c(nm,dv))
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(nm)

  # Now:
  data %>% select(all_of(nm))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(dv)

  # Now:
  data %>% select(all_of(dv))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
head(df.interim)
# A tibble: 6 × 6
   year year_month2 county type     category                    volume
  <dbl> <chr>       <chr>  <chr>    <chr>                        <dbl>
1  2015 _2015-01-01 adair  vodka    100 proof vodka               58.5
2  2015 _2015-01-01 adair  other    american alcohol               4.5
3  2015 _2015-01-01 adair  liqueur  american amaretto             19.5
4  2015 _2015-01-01 adair  cocktail american cocktails            45.5
5  2015 _2015-01-01 adair  liqueur  american cordials & liqueur   27  
6  2015 _2015-01-01 adair  gin      american dry gins             15  

Looking the total number of records for each level of a variable along with total volume across it

l1<-lapply(nm,function(x){
  
  z<-df%>%
    select(x,dv)
  
  colnames(z)[1]<-"Level"
  
  z1<-z%>%
    group_by(Level)%>%
    summarise(Total=n(),
              Total_Volume=sum(volume))%>%
    ungroup()%>%
    mutate(Feature=x)%>%
    select(Feature,Level,everything())
  
  return(z1)
})
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(x)

  # Now:
  data %>% select(all_of(x))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
df.final<-do.call(rbind.data.frame,l1)
head(df.final)
# A tibble: 6 × 4
  Feature Level Total Total_Volume
  <chr>   <chr> <int>        <dbl>
1 year    2015  59064    19609391.
2 year    2016  53129    19528006.
3 year    2017  43487    20670621.
4 year    2018  43830    21868138.
5 year    2019  43961    22262339.
6 year    2020  37265    19985800.
write.xlsx(df.final,"Freq_profile.xlsx")

Now if you pivot the data from the excel output and create the chart, it will somthing like the following charts


There is variation in total liquor consumption across years with the volume peaking around 2018 and 2019 periods. We should definitely take year into our model as it would be indicative of the trend in consumption


Monthly demand alternates through cycles of increase and decrease. More or less it settles around the 1700k mark as shown in the graph.This variable is also a good indicator of seasonality

Also if you look closely, the date values correspond to first of every month.Hence taking the week information from the year_month field will not make sense as the week that we get out of it will be like 1,5,9,13 and so on. Hence we can see that the weeks value will not be available for all the 52 weeks of a given year


Most the consumption is done in the top 10 counties.Hence for our model, we can just focus on the top 10 counties.


Whisky along with Vodka sells the most.In order ot develop the model for a given product, lets focus on Vodka initially.The modelling concepts that we develop for whiskey can be extended to vodka as well


Step 3: Derive Seasonality: An illustration

Seasonality is something that repeats regulary for each time period.In our case, it could be some pattern that repeats every year.Now we have already seen from the plot of year_month that the volume goes through a cycle of increase and decrease. What we intend to do here is to derive that pattern for every county across a single product(Vodka).For illustration, lets look at polk county and Vodka


polk county and Vodka

polk.df<-df%>%
  select(year,year_month,county,category,volume)%>%
  mutate(Month_Nm = month(year_month))%>% # extracting the month info
  filter(county %in% c("polk"),
         category=="american vodka")%>%
  arrange(year,Month_Nm)%>%
  filter(year != 2016) # removing 2016 data so that we start from Jan 2017

head(polk.df)
# A tibble: 6 × 6
   year year_month          county category        volume Month_Nm
  <dbl> <dttm>              <chr>  <chr>            <dbl>    <dbl>
1  2017 2017-01-01 00:00:00 polk   american vodka  64247.        1
2  2017 2017-02-01 00:00:00 polk   american vodka  73150.        2
3  2017 2017-03-01 00:00:00 polk   american vodka  77927.        3
4  2017 2017-04-01 00:00:00 polk   american vodka  81923.        4
5  2017 2017-05-01 00:00:00 polk   american vodka  85994.        5
6  2017 2017-06-01 00:00:00 polk   american vodka 102798.        6


We will be converting the volume into time series and then decompose the time series to get trend, seasonality and random components

# Converting volume into a time series
ts_val<-ts(polk.df$volume,frequency = 12)

# Decomposing into trend, seasonality and random components
plot(decompose(ts_val))


Observations from the Plot:

  • The seasonality component repeats for each time period
  • There is an upward trend in volume.We will use the year column as a proxy for trend
  • Observed and random components we dont want in our model
# Extracting the seasonality component from the decomposed part
ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)

# Normalizing the component so that it becomes a ratio instead of absolute values
seas_factor<-ts_seasonal/max(ts_seasonal)
seas_factor
 [1] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
 [7]  0.2120918  0.9767894 -0.6340173  0.0751423  0.1862966 -0.3123180
[13] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
[19]  0.2120918  0.9767894 -0.6340173  0.0751423  0.1862966 -0.3123180
[25] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
[31]  0.2120918  0.9767894 -0.6340173  0.0751423  0.1862966 -0.3123180
[37] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
[43]  0.2120918  0.9767894 -0.6340173  0.0751423


Adding the column into the data frame for polk

polk.df%>%
  mutate(Seasonality=seas_factor)
# A tibble: 46 × 7
    year year_month          county category        volume Month_Nm Seasonality
   <dbl> <dttm>              <chr>  <chr>            <dbl>    <dbl>       <dbl>
 1  2017 2017-01-01 00:00:00 polk   american vodka  64247.        1     -0.795 
 2  2017 2017-02-01 00:00:00 polk   american vodka  73150.        2     -0.662 
 3  2017 2017-03-01 00:00:00 polk   american vodka  77927.        3     -0.336 
 4  2017 2017-04-01 00:00:00 polk   american vodka  81923.        4     -0.164 
 5  2017 2017-05-01 00:00:00 polk   american vodka  85994.        5      1     
 6  2017 2017-06-01 00:00:00 polk   american vodka 102798.        6      0.453 
 7  2017 2017-07-01 00:00:00 polk   american vodka  84088.        7      0.212 
 8  2017 2017-08-01 00:00:00 polk   american vodka  98003.        8      0.977 
 9  2017 2017-09-01 00:00:00 polk   american vodka  81482.        9     -0.634 
10  2017 2017-10-01 00:00:00 polk   american vodka  89401.       10      0.0751
# … with 36 more rows

We saw how we were able to create the seasonality column for one county and product.Now the goal is to create seasonality column for more counties.


Step 4: Derive Seasonality for more counties

cnty<-c('polk','linn','scott',
       'johnson','black hawk','woodbury',
       'dubuque','story','pottawatta',
       'dallas','cerro gord','dickinson',
       'clinton','des moines','lee',
       'webster','muscatine',
       'marshall','warren','wapello')

# Running through data for various counties to get seasonality

l2<-list()
l2<-lapply(cnty,function(x){
  
  county.df<-df%>%
  select(year,year_month,county,category,volume)%>%
  mutate(Month_Nm = month(year_month))%>% # extracting the month info
  filter(county %in% x,
         category=="american vodka")%>%
  arrange(year,Month_Nm)%>%
  filter(year != 2016) # removing 2016 data so that we start from Jan 2017
  
  
  # Converting volume into a time series
  ts_val<-ts(county.df$volume,frequency = 12)
  
  # Decomposing into trend, seasonality and random components
  plot(decompose(ts_val))
  
  # Extracting the seasonality component from the decomposed part
  ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)
  
  # Normalizing the component so that it becomes a ratio instead of absolute values
  seas_factor<-ts_seasonal/max(ts_seasonal)
  
  
  seas_df<-county.df%>%
  mutate(Seasonality=seas_factor)
  
  
  return(seas_df)
  
})