Sunday, March 5, 2023

Demand Forecasting using Xgboost Model



Introduction

All industries sell products, services, or both. In the case of products, there is always the question of how much to produce. Successfully anticipating demand removes uncertainty from business operations and helps streamline them. Its impact on total revenue and profit is also profound: predicting demand removes guesswork from daily operations, thereby reducing cost and increasing service levels.

To get a hang of things, we will look at how to leverage liquor sales data and create a forecasting model. We will use an xgboost model for forecasting, as it handles multiple categorical drivers and non-linear effects more easily than classical models such as ARIMA, SARIMA and ARIMAX.


For our analysis, we will be using the liquor_sales data set from the ialiquor package.


Step 0: Installing libraries

package.name<-c("dplyr","lubridate","openxlsx","tidyr",
                "ialiquor","stringr",
                "mlr", # for hyper parameter tuning
                "xgboost")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}


Step 1: Importing the dataset

Importing the liquor data set

data("liquor_sales")
df<-liquor_sales
head(df)
# A tibble: 6 × 10
   year year_month          county popul…¹ type  categ…² state…³ state…⁴ bottl…⁵
  <dbl> <dttm>              <chr>    <dbl> <chr> <chr>     <dbl>   <dbl>   <dbl>
1  2015 2015-01-01 00:00:00 adair     7145 vodka 100 pr…   254.     381.      54
2  2015 2015-01-01 00:00:00 adair     7145 other americ…    54       81        6
3  2015 2015-01-01 00:00:00 adair     7145 liqu… americ…    89.0    134.      18
4  2015 2015-01-01 00:00:00 adair     7145 cock… americ…   182.     277.      26
5  2015 2015-01-01 00:00:00 adair     7145 liqu… americ…   346.     519.      36
6  2015 2015-01-01 00:00:00 adair     7145 gin   americ…    99.6    149.      24
# … with 1 more variable: volume <dbl>, and abbreviated variable names
#   ¹​population, ²​category, ³​state_cost, ⁴​state_revenue, ⁵​bottles_sold

The attributes are as follows:

  • year: Year of sales
  • year_month: Time of sale in year and month
  • county: Name of the county
  • population: Total population in the county
  • type: Type of liquor
  • category: Category of liquor
  • state_cost: Total cost to the state for preparing the liquor
  • state_revenue: Total revenue obtained by selling the volume of liquor
  • bottles_sold: Number of bottles sold
  • volume: Total volume sold

For this blog, we will focus on predicting the total volume sold and try to build a model using the rest of the variables available in the data set.

If you are unable to import the data, you can download the liquor_sales dataset from my GitHub repository.


Step 2: Frequency profiling

If you look at the data set, there are only a few fields that we can leverage for forecasting. To predict volume for future months, we would need the future values of the independent variables. Variables such as cost, revenue and population are non-deterministic (their future values are not known in advance) and hence can't be used for model building. Instead, we will focus on time features such as year and month, and categorical attributes such as county and type.

Let's look at the distribution of these variables using frequency profiling.

# Converting year month to categorical data for profiling purposes
df$year_month2<-str_c("_",df$year_month)

# Independent variables
nm<-c("year","year_month2","county","type","category")

# Dependent variables
dv<-"volume"

df.interim<-df%>%
  select(all_of(c(nm,dv)))
head(df.interim)
# A tibble: 6 × 6
   year year_month2 county type     category                    volume
  <dbl> <chr>       <chr>  <chr>    <chr>                        <dbl>
1  2015 _2015-01-01 adair  vodka    100 proof vodka               58.5
2  2015 _2015-01-01 adair  other    american alcohol               4.5
3  2015 _2015-01-01 adair  liqueur  american amaretto             19.5
4  2015 _2015-01-01 adair  cocktail american cocktails            45.5
5  2015 _2015-01-01 adair  liqueur  american cordials & liqueur   27  
6  2015 _2015-01-01 adair  gin      american dry gins             15  

Let's look at the total number of records for each level of a variable, along with the total volume across it:

l1<-lapply(nm,function(x){
  
  z<-df%>%
    select(all_of(c(x,dv)))
  
  colnames(z)[1]<-"Level"
  
  z1<-z%>%
    group_by(Level)%>%
    summarise(Total=n(),
              Total_Volume=sum(volume))%>%
    ungroup()%>%
    mutate(Feature=x)%>%
    select(Feature,Level,everything())
  
  return(z1)
})
df.final<-do.call(rbind.data.frame,l1)
head(df.final)
# A tibble: 6 × 4
  Feature Level Total Total_Volume
  <chr>   <chr> <int>        <dbl>
1 year    2015  59064    19609391.
2 year    2016  53129    19528006.
3 year    2017  43487    20670621.
4 year    2018  43830    21868138.
5 year    2019  43961    22262339.
6 year    2020  37265    19985800.
write.xlsx(df.final,"Freq_profile.xlsx")

Now if you pivot the data from the Excel output and create a chart, it will look something like the following charts.
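
If you prefer to stay in R instead of pivoting in Excel, the same summaries can be charted directly from df.final. A minimal sketch, assuming ggplot2 is installed (it is not part of the package list above):

library(ggplot2)

# Total volume by year, from the frequency profile built above
df.final %>%
  filter(Feature == "year") %>%
  ggplot(aes(x = Level, y = Total_Volume)) +
  geom_col() +
  labs(x = "Year", y = "Total volume")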


There is variation in total liquor consumption across years, with volume peaking around 2018 and 2019. We should definitely include year in our model as it is indicative of the trend in consumption.


Monthly demand alternates through cycles of increase and decrease, settling more or less around the 1700k mark as shown in the graph. This variable is a good indicator of seasonality.

Also, if you look closely, the date values all correspond to the first of every month. Hence, taking week information from the year_month field will not make sense: the weeks we get out of it will only be values like 1, 5, 9, 13 and so on, rather than covering all 52 weeks of a given year, as the quick check below shows.
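
A quick check of this, a small sketch using lubridate's week() on the dates we already have:

# Since every date is the first of a month, week() yields only a handful of
# values such as 1, 5, 9, 13, ... rather than all 52 weeks
sort(unique(week(df$year_month)))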


Most of the consumption happens in the top 10 counties, which can be pulled from df.final as shown below. Hence, for our model, we can focus on just the top 10 counties.
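
For reference, a small sketch that pulls the top 10 counties by total volume from the frequency profile built above:

df.final %>%
  filter(Feature == "county") %>%
  arrange(desc(Total_Volume)) %>%
  head(10)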


Whiskey and vodka sell the most. To develop the model for a single product, let's focus on vodka initially. The modelling concepts we develop for vodka can then be extended to whiskey and the other categories as well.


Step 3: Derive Seasonality: An illustration

Seasonality is a pattern that repeats regularly each time period; in our case, it could be a pattern that repeats every year. We have already seen from the plot of year_month that volume goes through cycles of increase and decrease. What we intend to do here is derive that pattern for every county for a single product (vodka). For illustration, let's look at Polk county and vodka.


polk county and Vodka

polk.df<-df%>%
  select(year,year_month,county,category,volume)%>%
  mutate(Month_Nm = month(year_month))%>% # extracting the month info
  filter(county %in% c("polk"),
         category=="american vodka")%>%
  arrange(year,Month_Nm)%>%
  filter(year != 2016) # removing 2016 data so that we start from Jan 2017

head(polk.df)
# A tibble: 6 × 6
   year year_month          county category        volume Month_Nm
  <dbl> <dttm>              <chr>  <chr>            <dbl>    <dbl>
1  2017 2017-01-01 00:00:00 polk   american vodka  64247.        1
2  2017 2017-02-01 00:00:00 polk   american vodka  73150.        2
3  2017 2017-03-01 00:00:00 polk   american vodka  77927.        3
4  2017 2017-04-01 00:00:00 polk   american vodka  81923.        4
5  2017 2017-05-01 00:00:00 polk   american vodka  85994.        5
6  2017 2017-06-01 00:00:00 polk   american vodka 102798.        6


We will convert the volume into a time series and then decompose it to obtain the trend, seasonality and random components.

# Converting volume into a time series
ts_val<-ts(polk.df$volume,frequency = 12)

# Decomposing into trend, seasonality and random components
plot(decompose(ts_val))


Observations from the Plot:

  • The seasonality component repeats for each time period
  • There is an upward trend in volume. We will use the year column as a proxy for the trend
  • We do not want the observed and random components in our model
# Extracting the seasonality component from the decomposed part
ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)

# Normalizing the component so that it becomes a ratio instead of absolute values
seas_factor<-ts_seasonal/max(ts_seasonal)
seas_factor
 [1] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
 [7]  0.2120918  0.9767894 -0.6340173  0.0751423  0.1862966 -0.3123180
[13] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
[19]  0.2120918  0.9767894 -0.6340173  0.0751423  0.1862966 -0.3123180
[25] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
[31]  0.2120918  0.9767894 -0.6340173  0.0751423  0.1862966 -0.3123180
[37] -0.7946703 -0.6616139 -0.3358038 -0.1644492  1.0000000  0.4525524
[43]  0.2120918  0.9767894 -0.6340173  0.0751423
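
As a side note, decompose() uses an additive model by default, so the three components should add back up to the observed series. A quick sanity check, a sketch reusing ts_val from above (dec, recon and ok are just illustrative names):

dec <- decompose(ts_val)

# trend + seasonal + random reconstructs the observed values,
# apart from the NAs at both ends of the moving-average trend
recon <- as.numeric(dec$trend + dec$seasonal + dec$random)
ok <- !is.na(recon)
all.equal(recon[ok], as.numeric(ts_val)[ok])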


Adding the column into the data frame for polk

polk.df%>%
  mutate(Seasonality=seas_factor)
# A tibble: 46 × 7
    year year_month          county category        volume Month_Nm Seasonality
   <dbl> <dttm>              <chr>  <chr>            <dbl>    <dbl>       <dbl>
 1  2017 2017-01-01 00:00:00 polk   american vodka  64247.        1     -0.795 
 2  2017 2017-02-01 00:00:00 polk   american vodka  73150.        2     -0.662 
 3  2017 2017-03-01 00:00:00 polk   american vodka  77927.        3     -0.336 
 4  2017 2017-04-01 00:00:00 polk   american vodka  81923.        4     -0.164 
 5  2017 2017-05-01 00:00:00 polk   american vodka  85994.        5      1     
 6  2017 2017-06-01 00:00:00 polk   american vodka 102798.        6      0.453 
 7  2017 2017-07-01 00:00:00 polk   american vodka  84088.        7      0.212 
 8  2017 2017-08-01 00:00:00 polk   american vodka  98003.        8      0.977 
 9  2017 2017-09-01 00:00:00 polk   american vodka  81482.        9     -0.634 
10  2017 2017-10-01 00:00:00 polk   american vodka  89401.       10      0.0751
# … with 36 more rows

We saw how to create the seasonality column for one county and one product. Now the goal is to create the seasonality column for more counties.


Step 4: Derive Seasonality for more counties

cnty<-c('polk','linn','scott',
       'johnson','black hawk','woodbury',
       'dubuque','story','pottawatta',
       'dallas','cerro gord','dickinson',
       'clinton','des moines','lee',
       'webster','muscatine',
       'marshall','warren','wapello')

# Running through data for various counties to get seasonality

l2<-list()
l2<-lapply(cnty,function(x){
  
  county.df<-df%>%
  select(year,year_month,county,category,volume)%>%
  mutate(Month_Nm = month(year_month))%>% # extracting the month info
  filter(county %in% x,
         category=="american vodka")%>%
  arrange(year,Month_Nm)%>%
  filter(year != 2016) # removing 2016 data so that we start from Jan 2017
  
  
  # Converting volume into a time series
  ts_val<-ts(county.df$volume,frequency = 12)
  
  # Decomposing into trend, seasonality and random components
  plot(decompose(ts_val))
  
  # Extracting the seasonality component from the decomposed part
  ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)
  
  # Normalizing the component so that it becomes a ratio instead of absolute values
  seas_factor<-ts_seasonal/max(ts_seasonal)
  
  
  seas_df<-county.df%>%
  mutate(Seasonality=seas_factor)
  
  
  return(seas_df)
  
})

seas.final.df<-do.call(rbind.data.frame,l2)
head(seas.final.df)
# A tibble: 6 × 7
   year year_month          county category        volume Month_Nm Seasonality
  <dbl> <dttm>              <chr>  <chr>            <dbl>    <dbl>       <dbl>
1  2017 2017-01-01 00:00:00 polk   american vodka  64247.        1      -0.795
2  2017 2017-02-01 00:00:00 polk   american vodka  73150.        2      -0.662
3  2017 2017-03-01 00:00:00 polk   american vodka  77927.        3      -0.336
4  2017 2017-04-01 00:00:00 polk   american vodka  81923.        4      -0.164
5  2017 2017-05-01 00:00:00 polk   american vodka  85994.        5       1    
6  2017 2017-06-01 00:00:00 polk   american vodka 102798.        6       0.453

Now if you look closely at the seas.final.df data frame, there are text values in it. However, for modelling with xgboost, all the data needs to be numeric since the model cannot understand text. Hence we will convert the text columns into numeric ones using one-hot encoding.


Step 5: Converting text values into Numeric data

interim.df<- seas.final.df%>%
  select(-year_month)%>%
  gather(key="Attributes",value = "Values",county:category)%>%
  mutate(Count=1,
         Index=1:n())%>%
  spread(Values,Count,fill=0)%>%
  select(-Attributes)%>%
  group_by(year,Month_Nm,volume,Seasonality)%>%
  summarise(across(everything(),list(sum)))%>%
  select(-'american vodka_1',-Index_1) # these columns (a constant and a row index) won't add any info to the model
`summarise()` has grouped output by 'year', 'Month_Nm', 'volume'. You can
override using the `.groups` argument.
head(interim.df)
# A tibble: 6 × 24
# Groups:   year, Month_Nm, volume [6]
   year Month_Nm volume Season…¹ black…² cerro…³ clint…⁴ dalla…⁵ des m…⁶ dicki…⁷
  <dbl>    <dbl>  <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1  2017        1  1817.   -0.215       0       0       0       0       0       0
2  2017        1  2656.   -0.725       0       0       0       0       0       0
3  2017        1  3007.   -0.124       0       0       0       0       0       0
4  2017        1  3217.   -0.288       0       0       0       0       0       0
5  2017        1  3275.   -0.863       0       0       0       0       0       0
6  2017        1  3861.   -0.811       0       0       0       0       1       0
# … with 14 more variables: dubuque_1 <dbl>, johnson_1 <dbl>, lee_1 <dbl>,
#   linn_1 <dbl>, marshall_1 <dbl>, muscatine_1 <dbl>, polk_1 <dbl>,
#   pottawatta_1 <dbl>, scott_1 <dbl>, story_1 <dbl>, wapello_1 <dbl>,
#   warren_1 <dbl>, webster_1 <dbl>, woodbury_1 <dbl>, and abbreviated variable
#   names ¹​Seasonality, ²​`black hawk_1`, ³​`cerro gord_1`, ⁴​clinton_1,
#   ⁵​dallas_1, ⁶​`des moines_1`, ⁷​dickinson_1

We now have individual columns for each of the counties, and the data in the table is all numeric, so it can be used for modelling.
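
As a side note, the same one-hot encoding of the county column can also be produced with base R's model.matrix(); a minimal sketch (county_dummies is just an illustrative name, and the column names will differ slightly from the gather/spread approach used above):

# One dummy column per county, no intercept
county_dummies <- model.matrix(~ county - 1, data = seas.final.df)
head(county_dummies)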


Step 6: Introduction to Xgboost Model

For more background, refer to:

  • https://www.geeksforgeeks.org/xgboost/
  • https://analyticsindiamag.com/xgboost-internal-working-to-make-decision-trees-and-deduce-predictions/

Xgboost builds decision trees sequentially, where each subsequent tree tries to reduce the errors (residuals) left by the trees before it. The final prediction is the initial estimate plus the shrunken contributions of all the trees.

This is different from a random forest, where the trees are built independently and their outputs are averaged at the end.
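
To make the boosting idea concrete, here is a minimal sketch of the principle using rpart trees. This is an illustration only, not xgboost's exact internals (which add regularization, second-order gradients and pruning), and boost_sketch with its arguments is a hypothetical name for this example:

library(rpart)

# X: a data.frame of predictors, y: numeric target
boost_sketch <- function(X, y, n_trees = 50, eta = 0.3) {
  pred  <- rep(mean(y), length(y))      # model M0: start from the mean
  trees <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    dat        <- cbind(X, resid = y - pred)           # residuals become the new target
    trees[[i]] <- rpart(resid ~ ., data = dat)         # fit a small tree to the residuals
    pred       <- pred + eta * predict(trees[[i]], X)  # add the shrunken correction
  }
  list(intercept = mean(y), trees = trees, eta = eta)
}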

Before we go into an illustration, we should note a few parameters used in the model:

  • \(\lambda\): the regularization parameter
  • eta: the learning rate, which controls how quickly the model converges (let's say it is 0.3)
  • \(\gamma\): the parameter controlling automatic tree pruning

Let's look at a small example to understand this.

Let's say we are trying to predict IQ using xgboost. As the first model, let's use the mean IQ as the estimate. The mean IQ from the table above is around 30. Let's call this model M0.

Hence the error, which is the difference between the actual IQ and the mean IQ, can be represented as shown below.

Let's calculate the Similarity Score for M0, which is equal to \((\text{sum of residuals})^2 / (N + \lambda)\), where N is the total number of residuals. For now, let's set \(\lambda\) to 0:

Similarity Score = \((-10 + 4 + 8)^2 / (3 + 0) = 4/3 \approx 1.33\)


Now let's create the next decision tree and call it M1.

Let's use Age as the splitting variable and split at Age > 10. As a consequence, the decision tree will look as shown below.

Let's calculate the Similarity Score for each of these branches.

Branch to the left: \((-10)^2 / 1 = 100\)

Branch to the right: \((4 + 8)^2 / 2 = 144/2 = 72\)

Now the Gain as a result of the split is:

Gain = \((100 + 72) - 1.33 = 170.67\)

Now let's say that the threshold for creating a branch is 130 (this threshold is specified by the \(\gamma\) pruning parameter). Branching only takes place if the gain is greater than this threshold. Since our gain of 170.67 is greater than 130, the branch is created. Higher values of the pruning parameter prevent branching and prune the tree.

Now, the interesting part about xgboost is that each successive tree uses the residuals as its dependent variable. Hence, the new predictions are:

For the left branch: new prediction = old prediction + learning rate × average residual = \(30 + 0.3 \times (-10) = 30 - 3 = 27\)

For both records in the right branch: new prediction = old prediction + learning rate × average residual = \(30 + 0.3 \times 6 = 30 + 1.8 = 31.8\) (the average residual in the right branch is \((4 + 8)/2 = 6\))

The new table now becomes:

You can see that the residuals have decreased in comparison to model M0.

knitr::include_graphics("ex1_5.PNG")
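
The toy numbers above can be reproduced with a few lines of R, assuming residuals of -10, 4 and 8 and \(\lambda\) = 0 (the variable names below are just for illustration):

residuals   <- c(-10, 4, 8)
root_score  <- sum(residuals)^2 / (length(residuals) + 0)    # ~1.33
left_score  <- (-10)^2 / 1                                   # 100
right_score <- (4 + 8)^2 / 2                                 # 72
gain        <- left_score + right_score - root_score         # ~170.67

new_pred_left  <- 30 + 0.3 * (-10)          # 27
new_pred_right <- 30 + 0.3 * mean(c(4, 8))  # 31.8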


We will now look at how to tune the parameters of an xgboost model. First, we will divide the data set into train and test sets.


Step 7: Train and Test Dataset

# Cleaning column names

nm<-gsub("_1","",colnames(interim.df))
colnames(interim.df)<-nm

nm2<-gsub(" ","",colnames(interim.df))
colnames(interim.df)<-nm2

set.seed(1)
smpl<-sample(1:nrow(interim.df),0.70*nrow(interim.df),replace = F)

# Create the training dataset
train.df<-interim.df[smpl,]%>%
  as.data.frame()%>%
  as.matrix()
train.df2<-train.df[,colnames(train.df) != "volume"]

class(train.df)
[1] "matrix" "array" 
# Create the test dataset
test.df<-as.matrix(interim.df[-smpl,])
test.df2<-test.df[,colnames(test.df) != "volume"]

y_train <- train.df%>%
  as.data.frame()%>%
  select(volume)%>%
  pull(volume)%>%
  as.numeric()

y_test <- test.df%>%
  as.data.frame()%>%
  select(volume)%>%
  pull(volume)%>%
  as.numeric()

dim(train.df2) # 644 records
[1] 644  23
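
As a side note, the matrices can optionally be wrapped in xgb.DMatrix objects, xgboost's native data structure; a minimal sketch (dtrain and dtest are just illustrative names). The training step below sticks with the plain matrix interface, which xgboost() also accepts.

dtrain <- xgb.DMatrix(data = train.df2, label = y_train)
dtest  <- xgb.DMatrix(data = test.df2,  label = y_test)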

Step 8: Hyper parameter tuning

Before we create the model, we first have to tune its parameters to get the best results out of it. You can think of it like tuning a radio to reach a particular channel.

There is a library in R called mlr that can be used to tune the parameters of most ML models. There are a few standard steps for tuning the parameters of a model. They are:

  • Initiate a learner
  • Set a grid
  • Controlling the number of iterations
  • Setting up cross validation
  • Create a task
  • Tune Parameters

Step 8A: Initiate a regression learner

xgb_learner <- makeLearner(
  "regr.xgboost",
  predict.type = "response",
  par.vals = list(
    objective = "reg:squarederror",
    eval_metric = "rmse",
    nrounds = 200
  )
)

Since the problem is a regression one, we have used regr.xgboost and set predict.type to response.

Step 8B: Set grid parameters

xgb_params <- makeParamSet(
  # stop if the evaluation metric does not improve for this many rounds
  makeIntegerParam("early_stopping_rounds", lower = 5, upper = 15),
  # the number of trees in the model (each one built sequentially)
  makeIntegerParam("nrounds", lower = 100, upper = 500),
  # maximum depth of each tree
  makeIntegerParam("max_depth", lower = 5, upper = 10),
  # "shrinkage" - prevents overfitting
  makeNumericParam("eta", lower = .001, upper = .25)
)

Step 8C: Controlling the number of iterations

control <- makeTuneControlRandom(maxit = 100)

Step 8D: Setting up Cross Validation (CV)

# Create a description of the resampling plan
resample_desc <- makeResampleDesc("CV", iters = 5)

Step 8E: Create a task

forecast_data_task<-makeRegrTask(data=train.df%>%
                                             as.data.frame(),target="volume")

Step 8F: Tune Parameters

tuned_params <- tuneParams(
  learner = xgb_learner,
  task = forecast_data_task,
  resampling = resample_desc,
  par.set = xgb_params,
  control = control
)
[Tune] Started tuning learner regr.xgboost for parameter set:
                         Type len Def        Constr Req Tunable Trafo
early_stopping_rounds integer   -   -       5 to 15   -    TRUE     -
nrounds               integer   -   -    100 to 500   -    TRUE     -
max_depth             integer   -   -       5 to 10   -    TRUE     -
eta                   numeric   -   - 0.001 to 0.25   -    TRUE     -
With control class: TuneControlRandom
Imputation value: Inf
[Tune-x] 1: early_stopping_rounds=7; nrounds=276; max_depth=9; eta=0.0874
[Tune-y] 1: mse.test.mean=9759633.5783240; time: 0.2 min
[Tune-x] 2: early_stopping_rounds=6; nrounds=340; max_depth=10; eta=0.0978
[Tune-y] 2: mse.test.mean=9919169.3346103; time: 0.3 min
[Tune-x] 3: early_stopping_rounds=6; nrounds=423; max_depth=5; eta=0.191
[Tune-y] 3: mse.test.mean=9131655.5997733; time: 0.1 min
[Tune-x] 4: early_stopping_rounds=6; nrounds=442; max_depth=8; eta=0.149
[Tune-y] 4: mse.test.mean=9381837.0573715; time: 0.2 min
[Tune-x] 5: early_stopping_rounds=14; nrounds=447; max_depth=5; eta=0.118
[Tune-y] 5: mse.test.mean=9061322.5056555; time: 0.1 min
[Tune-x] 6: early_stopping_rounds=8; nrounds=430; max_depth=8; eta=0.11
[Tune-y] 6: mse.test.mean=9722901.3475809; time: 0.2 min
[Tune-x] 7: early_stopping_rounds=5; nrounds=199; max_depth=7; eta=0.0287
[Tune-y] 7: mse.test.mean=9405569.1843093; time: 0.1 min
[Tune-x] 8: early_stopping_rounds=9; nrounds=195; max_depth=10; eta=0.0575
[Tune-y] 8: mse.test.mean=10109152.9012836; time: 0.1 min
[Tune-x] 9: early_stopping_rounds=13; nrounds=387; max_depth=5; eta=0.12
[Tune-y] 9: mse.test.mean=9051325.2836154; time: 0.1 min
[Tune-x] 10: early_stopping_rounds=6; nrounds=205; max_depth=6; eta=0.0469
[Tune-y] 10: mse.test.mean=8706463.4047389; time: 0.1 min
[Tune-x] 11: early_stopping_rounds=5; nrounds=474; max_depth=5; eta=0.102
[Tune-y] 11: mse.test.mean=8919214.0957784; time: 0.1 min
[Tune-x] 12: early_stopping_rounds=13; nrounds=142; max_depth=7; eta=0.0841
[Tune-y] 12: mse.test.mean=8938006.7405959; time: 0.1 min
[Tune-x] 13: early_stopping_rounds=13; nrounds=315; max_depth=9; eta=0.2
[Tune-y] 13: mse.test.mean=9644585.0316658; time: 0.2 min
[Tune-x] 14: early_stopping_rounds=7; nrounds=286; max_depth=6; eta=0.137
[Tune-y] 14: mse.test.mean=9299234.3889818; time: 0.1 min
[Tune-x] 15: early_stopping_rounds=15; nrounds=159; max_depth=7; eta=0.197
[Tune-y] 15: mse.test.mean=9351119.0973083; time: 0.1 min
[Tune-x] 16: early_stopping_rounds=11; nrounds=316; max_depth=10; eta=0.193
[Tune-y] 16: mse.test.mean=9675316.8635050; time: 0.3 min
[Tune-x] 17: early_stopping_rounds=9; nrounds=476; max_depth=8; eta=0.143
[Tune-y] 17: mse.test.mean=9588756.3034094; time: 0.3 min
[Tune-x] 18: early_stopping_rounds=6; nrounds=255; max_depth=10; eta=0.0561
[Tune-y] 18: mse.test.mean=10065097.1353350; time: 0.2 min
[Tune-x] 19: early_stopping_rounds=7; nrounds=213; max_depth=5; eta=0.0459
[Tune-y] 19: mse.test.mean=8586267.9754261; time: 0.1 min
[Tune-x] 20: early_stopping_rounds=8; nrounds=358; max_depth=10; eta=0.0187
[Tune-y] 20: mse.test.mean=9830458.6477461; time: 0.2 min
[Tune-x] 21: early_stopping_rounds=5; nrounds=116; max_depth=7; eta=0.0638
[Tune-y] 21: mse.test.mean=9000524.7231902; time: 0.0 min
[Tune-x] 22: early_stopping_rounds=6; nrounds=445; max_depth=5; eta=0.138
[Tune-y] 22: mse.test.mean=9211781.8761617; time: 0.1 min
[Tune-x] 23: early_stopping_rounds=10; nrounds=224; max_depth=9; eta=0.187
[Tune-y] 23: mse.test.mean=9340147.3757327; time: 0.2 min
[Tune-x] 24: early_stopping_rounds=15; nrounds=245; max_depth=7; eta=0.114
[Tune-y] 24: mse.test.mean=9511333.8930402; time: 0.1 min
[Tune-x] 25: early_stopping_rounds=12; nrounds=312; max_depth=5; eta=0.216
[Tune-y] 25: mse.test.mean=8966000.4567943; time: 0.1 min
[Tune-x] 26: early_stopping_rounds=11; nrounds=461; max_depth=10; eta=0.225
[Tune-y] 26: mse.test.mean=9849430.1026096; time: 0.3 min
[Tune-x] 27: early_stopping_rounds=12; nrounds=435; max_depth=7; eta=0.158
[Tune-y] 27: mse.test.mean=9449230.0497988; time: 0.2 min
[Tune-x] 28: early_stopping_rounds=9; nrounds=488; max_depth=6; eta=0.245
[Tune-y] 28: mse.test.mean=9543746.5124885; time: 0.2 min
[Tune-x] 29: early_stopping_rounds=12; nrounds=464; max_depth=9; eta=0.104
[Tune-y] 29: mse.test.mean=10074564.2291761; time: 0.3 min
[Tune-x] 30: early_stopping_rounds=15; nrounds=159; max_depth=5; eta=0.145
[Tune-y] 30: mse.test.mean=8409795.4878290; time: 0.0 min
[Tune-x] 31: early_stopping_rounds=10; nrounds=216; max_depth=6; eta=0.152
[Tune-y] 31: mse.test.mean=9338310.5182170; time: 0.1 min
[Tune-x] 32: early_stopping_rounds=9; nrounds=216; max_depth=7; eta=0.206
[Tune-y] 32: mse.test.mean=9111410.5612743; time: 0.1 min
[Tune-x] 33: early_stopping_rounds=7; nrounds=495; max_depth=8; eta=0.208
[Tune-y] 33: mse.test.mean=9486996.2727286; time: 0.3 min
[Tune-x] 34: early_stopping_rounds=7; nrounds=409; max_depth=5; eta=0.129
[Tune-y] 34: mse.test.mean=9038940.1075297; time: 0.1 min
[Tune-x] 35: early_stopping_rounds=14; nrounds=495; max_depth=7; eta=0.118
[Tune-y] 35: mse.test.mean=9376826.7168538; time: 0.2 min
[Tune-x] 36: early_stopping_rounds=9; nrounds=467; max_depth=7; eta=0.236
[Tune-y] 36: mse.test.mean=9388504.9499630; time: 0.2 min
[Tune-x] 37: early_stopping_rounds=6; nrounds=326; max_depth=9; eta=0.00114
[Tune-y] 37: mse.test.mean=315762911.3693092; time: 0.1 min
[Tune-x] 38: early_stopping_rounds=11; nrounds=248; max_depth=5; eta=0.169
[Tune-y] 38: mse.test.mean=9010457.1250567; time: 0.1 min
[Tune-x] 39: early_stopping_rounds=11; nrounds=138; max_depth=7; eta=0.0255
[Tune-y] 39: mse.test.mean=11772205.7411346; time: 0.0 min
[Tune-x] 40: early_stopping_rounds=13; nrounds=427; max_depth=7; eta=0.00164
[Tune-y] 40: mse.test.mean=177810567.9447607; time: 0.1 min
[Tune-x] 41: early_stopping_rounds=11; nrounds=337; max_depth=10; eta=0.0991
[Tune-y] 41: mse.test.mean=10377173.8212600; time: 0.3 min
[Tune-x] 42: early_stopping_rounds=15; nrounds=202; max_depth=5; eta=0.132
[Tune-y] 42: mse.test.mean=8712549.6542992; time: 0.1 min
[Tune-x] 43: early_stopping_rounds=9; nrounds=309; max_depth=5; eta=0.15
[Tune-y] 43: mse.test.mean=9374192.4683403; time: 0.1 min
[Tune-x] 44: early_stopping_rounds=12; nrounds=320; max_depth=8; eta=0.0748
[Tune-y] 44: mse.test.mean=9541312.5051237; time: 0.2 min
[Tune-x] 45: early_stopping_rounds=8; nrounds=341; max_depth=6; eta=0.229
[Tune-y] 45: mse.test.mean=9379919.8475423; time: 0.1 min
[Tune-x] 46: early_stopping_rounds=15; nrounds=399; max_depth=7; eta=0.233
[Tune-y] 46: mse.test.mean=9131112.9625156; time: 0.2 min
[Tune-x] 47: early_stopping_rounds=11; nrounds=257; max_depth=6; eta=0.18
[Tune-y] 47: mse.test.mean=9323293.8399238; time: 0.1 min
[Tune-x] 48: early_stopping_rounds=15; nrounds=377; max_depth=10; eta=0.0579
[Tune-y] 48: mse.test.mean=10170544.1343835; time: 0.3 min
[Tune-x] 49: early_stopping_rounds=11; nrounds=183; max_depth=8; eta=0.242
[Tune-y] 49: mse.test.mean=9636149.5644085; time: 0.1 min
[Tune-x] 50: early_stopping_rounds=13; nrounds=181; max_depth=7; eta=0.207
[Tune-y] 50: mse.test.mean=9518563.4519001; time: 0.1 min
[Tune-x] 51: early_stopping_rounds=9; nrounds=258; max_depth=5; eta=0.0348
[Tune-y] 51: mse.test.mean=8506692.6637379; time: 0.1 min
[Tune-x] 52: early_stopping_rounds=12; nrounds=125; max_depth=10; eta=0.0293
[Tune-y] 52: mse.test.mean=10249077.1249290; time: 0.1 min
[Tune-x] 53: early_stopping_rounds=11; nrounds=110; max_depth=5; eta=0.0479
[Tune-y] 53: mse.test.mean=10886102.2646692; time: 0.0 min
[Tune-x] 54: early_stopping_rounds=13; nrounds=434; max_depth=9; eta=0.0709
[Tune-y] 54: mse.test.mean=9872622.3394087; time: 0.3 min
[Tune-x] 55: early_stopping_rounds=12; nrounds=484; max_depth=8; eta=0.108
[Tune-y] 55: mse.test.mean=10033723.8212178; time: 0.3 min
[Tune-x] 56: early_stopping_rounds=12; nrounds=169; max_depth=6; eta=0.0976
[Tune-y] 56: mse.test.mean=8765298.9059857; time: 0.1 min
[Tune-x] 57: early_stopping_rounds=5; nrounds=211; max_depth=9; eta=0.0828
[Tune-y] 57: mse.test.mean=9675908.9168039; time: 0.1 min
[Tune-x] 58: early_stopping_rounds=11; nrounds=310; max_depth=10; eta=0.134
[Tune-y] 58: mse.test.mean=10057024.4673807; time: 0.2 min
[Tune-x] 59: early_stopping_rounds=6; nrounds=411; max_depth=5; eta=0.177
[Tune-y] 59: mse.test.mean=9137584.3335131; time: 0.1 min
[Tune-x] 60: early_stopping_rounds=5; nrounds=413; max_depth=9; eta=0.139
[Tune-y] 60: mse.test.mean=9888009.8595234; time: 0.3 min
[Tune-x] 61: early_stopping_rounds=14; nrounds=184; max_depth=6; eta=0.169
[Tune-y] 61: mse.test.mean=9019567.7280693; time: 0.1 min
[Tune-x] 62: early_stopping_rounds=9; nrounds=364; max_depth=5; eta=0.0865
[Tune-y] 62: mse.test.mean=8657003.9123326; time: 0.1 min
[Tune-x] 63: early_stopping_rounds=13; nrounds=239; max_depth=5; eta=0.216
[Tune-y] 63: mse.test.mean=8986833.9177502; time: 0.1 min
[Tune-x] 64: early_stopping_rounds=11; nrounds=464; max_depth=9; eta=0.137
[Tune-y] 64: mse.test.mean=9720535.4019169; time: 0.3 min
[Tune-x] 65: early_stopping_rounds=10; nrounds=239; max_depth=8; eta=0.238
[Tune-y] 65: mse.test.mean=9675939.5121680; time: 0.1 min
[Tune-x] 66: early_stopping_rounds=7; nrounds=142; max_depth=6; eta=0.046
[Tune-y] 66: mse.test.mean=9312890.9201391; time: 0.0 min
[Tune-x] 67: early_stopping_rounds=10; nrounds=421; max_depth=7; eta=0.0452
[Tune-y] 67: mse.test.mean=9292563.8669782; time: 0.2 min
[Tune-x] 68: early_stopping_rounds=9; nrounds=426; max_depth=10; eta=0.0442
[Tune-y] 68: mse.test.mean=10332404.8956508; time: 0.3 min
[Tune-x] 69: early_stopping_rounds=8; nrounds=382; max_depth=10; eta=0.108
[Tune-y] 69: mse.test.mean=10496271.1894498; time: 0.3 min
[Tune-x] 70: early_stopping_rounds=6; nrounds=426; max_depth=9; eta=0.227
[Tune-y] 70: mse.test.mean=9243843.7546483; time: 0.3 min
[Tune-x] 71: early_stopping_rounds=12; nrounds=283; max_depth=5; eta=0.183
[Tune-y] 71: mse.test.mean=9225009.7384530; time: 0.1 min
[Tune-x] 72: early_stopping_rounds=15; nrounds=114; max_depth=5; eta=0.0997
[Tune-y] 72: mse.test.mean=8329617.3171094; time: 0.0 min
[Tune-x] 73: early_stopping_rounds=6; nrounds=145; max_depth=6; eta=0.146
[Tune-y] 73: mse.test.mean=8705292.4750901; time: 0.0 min
[Tune-x] 74: early_stopping_rounds=6; nrounds=298; max_depth=7; eta=0.102
[Tune-y] 74: mse.test.mean=9198731.7818394; time: 0.1 min
[Tune-x] 75: early_stopping_rounds=8; nrounds=270; max_depth=7; eta=0.207
[Tune-y] 75: mse.test.mean=9553370.3012008; time: 0.1 min
[Tune-x] 76: early_stopping_rounds=8; nrounds=275; max_depth=5; eta=0.242
[Tune-y] 76: mse.test.mean=9734527.9340654; time: 0.1 min
[Tune-x] 77: early_stopping_rounds=8; nrounds=111; max_depth=7; eta=0.16
[Tune-y] 77: mse.test.mean=8824551.0876736; time: 0.0 min
[Tune-x] 78: early_stopping_rounds=8; nrounds=494; max_depth=5; eta=0.0329
[Tune-y] 78: mse.test.mean=8394268.7106736; time: 0.1 min
[Tune-x] 79: early_stopping_rounds=7; nrounds=185; max_depth=10; eta=0.087
[Tune-y] 79: mse.test.mean=10225026.2280932; time: 0.1 min
[Tune-x] 80: early_stopping_rounds=9; nrounds=315; max_depth=6; eta=0.102
[Tune-y] 80: mse.test.mean=9109585.7187011; time: 0.1 min
[Tune-x] 81: early_stopping_rounds=9; nrounds=229; max_depth=5; eta=0.243
[Tune-y] 81: mse.test.mean=9370137.1522283; time: 0.1 min
[Tune-x] 82: early_stopping_rounds=6; nrounds=319; max_depth=10; eta=0.216
[Tune-y] 82: mse.test.mean=10259877.1528035; time: 0.3 min
[Tune-x] 83: early_stopping_rounds=11; nrounds=100; max_depth=6; eta=0.218
[Tune-y] 83: mse.test.mean=8880339.1142502; time: 0.0 min
[Tune-x] 84: early_stopping_rounds=10; nrounds=463; max_depth=10; eta=0.0794
[Tune-y] 84: mse.test.mean=10397728.9055980; time: 0.4 min
[Tune-x] 85: early_stopping_rounds=5; nrounds=264; max_depth=9; eta=0.238
[Tune-y] 85: mse.test.mean=9735613.3164361; time: 0.2 min
[Tune-x] 86: early_stopping_rounds=12; nrounds=307; max_depth=9; eta=0.213
[Tune-y] 86: mse.test.mean=9787244.9161651; time: 0.2 min
[Tune-x] 87: early_stopping_rounds=14; nrounds=389; max_depth=7; eta=0.11
[Tune-y] 87: mse.test.mean=9470438.5209672; time: 0.2 min
[Tune-x] 88: early_stopping_rounds=14; nrounds=254; max_depth=9; eta=0.0204
[Tune-y] 88: mse.test.mean=9476739.6403603; time: 0.1 min
[Tune-x] 89: early_stopping_rounds=14; nrounds=106; max_depth=5; eta=0.123
[Tune-y] 89: mse.test.mean=8350854.8763840; time: 0.0 min
[Tune-x] 90: early_stopping_rounds=12; nrounds=343; max_depth=10; eta=0.0776
[Tune-y] 90: mse.test.mean=10408315.5189620; time: 0.3 min
[Tune-x] 91: early_stopping_rounds=10; nrounds=395; max_depth=5; eta=0.171
[Tune-y] 91: mse.test.mean=9375547.4927445; time: 0.1 min
[Tune-x] 92: early_stopping_rounds=14; nrounds=247; max_depth=5; eta=0.13
[Tune-y] 92: mse.test.mean=8958574.7415755; time: 0.1 min
[Tune-x] 93: early_stopping_rounds=6; nrounds=461; max_depth=6; eta=0.156
[Tune-y] 93: mse.test.mean=9340260.2677384; time: 0.2 min
[Tune-x] 94: early_stopping_rounds=8; nrounds=416; max_depth=10; eta=0.172
[Tune-y] 94: mse.test.mean=10602911.5919019; time: 0.3 min
[Tune-x] 95: early_stopping_rounds=5; nrounds=484; max_depth=7; eta=0.136
[Tune-y] 95: mse.test.mean=9354475.2557501; time: 0.2 min
[Tune-x] 96: early_stopping_rounds=5; nrounds=164; max_depth=9; eta=0.0813
[Tune-y] 96: mse.test.mean=9768378.0063813; time: 0.1 min
[Tune-x] 97: early_stopping_rounds=11; nrounds=338; max_depth=5; eta=0.0991
[Tune-y] 97: mse.test.mean=9006327.1094235; time: 0.1 min
[Tune-x] 98: early_stopping_rounds=8; nrounds=273; max_depth=7; eta=0.187
[Tune-y] 98: mse.test.mean=9441931.2334022; time: 0.1 min
[Tune-x] 99: early_stopping_rounds=15; nrounds=187; max_depth=9; eta=0.00784
[Tune-y] 99: mse.test.mean=50989605.8255714; time: 0.1 min
[Tune-x] 100: early_stopping_rounds=15; nrounds=492; max_depth=9; eta=0.167
[Tune-y] 100: mse.test.mean=9521843.6192693; time: 0.3 min
[Tune] Result: early_stopping_rounds=15; nrounds=114; max_depth=5; eta=0.0997 : mse.test.mean=8329617.3171094

Step 9: Getting the optimal parameters

smpl<-0.5 # subsample ratio used during training (note: this overwrites the sampling indices created earlier)

# Extracting the optimal parameter values (tuned_params$x) from the tune result
stping_rounds<-tuned_params[[3]][[1]] # early_stopping_rounds
eta_par<-tuned_params[[3]][[4]]       # eta
nrounds_par<-tuned_params[[3]][[2]]   # nrounds
max_depth_par<-tuned_params[[3]][[3]] # max_depth

txt<-str_c("subsample:",smpl,"_early_stopping_rounds:",stping_rounds,"_max_depth:",
           max_depth_par,"_eta:",eta_par,"_nrounds:",nrounds_par)

txt
[1] "subsample:0.5_early_stopping_rounds:15_max_depth:5_eta:0.0997008429991547_nrounds:114"


Step 10: Training the model

xgb.model.best=   xgboost(params =list(subsample = smpl),
                          data=train.df2,     #train sparse matrix
                          label= y_train,          #output vector to be predicted
                          eval.metric= 'rmse',        #model minimizes Root Mean Squared Error
                          objective= "reg:squarederror",     #regression
                          #tuning parameters
                          early_stopping_rounds = stping_rounds,
                          max_depth = max_depth_par,
                          eta = eta_par,
                          nrounds = nrounds_par,
                          verbose=1
                          
)
[1] train-rmse:22784.155878 
Will train until train_rmse hasn't improved in 15 rounds.

[2] train-rmse:20790.280270 
[3] train-rmse:18973.552349 
[4] train-rmse:17267.477319 
[5] train-rmse:15798.425832 
[6] train-rmse:14487.459047 
[7] train-rmse:13290.274150 
[8] train-rmse:12261.317795 
[9] train-rmse:11287.366067 
[10]    train-rmse:10436.551972 
[11]    train-rmse:9695.970194 
[12]    train-rmse:9011.618926 
[13]    train-rmse:8410.848331 
[14]    train-rmse:7846.456868 
[15]    train-rmse:7302.792459 
[16]    train-rmse:6821.362806 
[17]    train-rmse:6454.191725 
[18]    train-rmse:6137.399849 
[19]    train-rmse:5854.895449 
[20]    train-rmse:5566.010415 
[21]    train-rmse:5295.302757 
[22]    train-rmse:5057.472774 
[23]    train-rmse:4811.764895 
[24]    train-rmse:4628.713444 
[25]    train-rmse:4457.816523 
[26]    train-rmse:4330.500426 
[27]    train-rmse:4190.613580 
[28]    train-rmse:4063.126750 
[29]    train-rmse:3933.762081 
[30]    train-rmse:3801.123928 
[31]    train-rmse:3684.602996 
[32]    train-rmse:3588.009602 
[33]    train-rmse:3507.782160 
[34]    train-rmse:3420.027090 
[35]    train-rmse:3359.482143 
[36]    train-rmse:3277.035913 
[37]    train-rmse:3195.524083 
[38]    train-rmse:3129.046678 
[39]    train-rmse:3057.615364 
[40]    train-rmse:3001.290765 
[41]    train-rmse:2930.848441 
[42]    train-rmse:2875.330856 
[43]    train-rmse:2826.674819 
[44]    train-rmse:2774.621092 
[45]    train-rmse:2726.212941 
[46]    train-rmse:2680.299123 
[47]    train-rmse:2633.844855 
[48]    train-rmse:2589.091061 
[49]    train-rmse:2537.431850 
[50]    train-rmse:2504.764415 
[51]    train-rmse:2469.064400 
[52]    train-rmse:2430.357590 
[53]    train-rmse:2400.879781 
[54]    train-rmse:2363.350657 
[55]    train-rmse:2333.978394 
[56]    train-rmse:2299.104966 
[57]    train-rmse:2268.021843 
[58]    train-rmse:2228.533381 
[59]    train-rmse:2199.257141 
[60]    train-rmse:2177.466456 
[61]    train-rmse:2132.963736 
[62]    train-rmse:2109.978926 
[63]    train-rmse:2093.805951 
[64]    train-rmse:2063.849087 
[65]    train-rmse:2040.386483 
[66]    train-rmse:2016.678816 
[67]    train-rmse:1993.706914 
[68]    train-rmse:1978.022977 
[69]    train-rmse:1948.613613 
[70]    train-rmse:1932.409334 
[71]    train-rmse:1916.209449 
[72]    train-rmse:1902.606214 
[73]    train-rmse:1886.405835 
[74]    train-rmse:1872.948578 
[75]    train-rmse:1860.896393 
[76]    train-rmse:1848.343301 
[77]    train-rmse:1835.585713 
[78]    train-rmse:1826.229335 
[79]    train-rmse:1816.928760 
[80]    train-rmse:1804.911331 
[81]    train-rmse:1794.877595 
[82]    train-rmse:1785.904686 
[83]    train-rmse:1771.771758 
[84]    train-rmse:1762.537625 
[85]    train-rmse:1743.674746 
[86]    train-rmse:1727.092388 
[87]    train-rmse:1716.174311 
[88]    train-rmse:1707.267788 
[89]    train-rmse:1701.654570 
[90]    train-rmse:1693.814856 
[91]    train-rmse:1670.058637 
[92]    train-rmse:1644.101291 
[93]    train-rmse:1637.279127 
[94]    train-rmse:1620.925174 
[95]    train-rmse:1616.062892 
[96]    train-rmse:1609.234957 
[97]    train-rmse:1593.808599 
[98]    train-rmse:1586.103599 
[99]    train-rmse:1581.036361 
[100]   train-rmse:1572.868570 
[101]   train-rmse:1559.990474 
[102]   train-rmse:1547.864086 
[103]   train-rmse:1537.398933 
[104]   train-rmse:1528.824548 
[105]   train-rmse:1523.308226 
[106]   train-rmse:1515.679599 
[107]   train-rmse:1504.320164 
[108]   train-rmse:1500.319942 
[109]   train-rmse:1494.758685 
[110]   train-rmse:1488.237479 
[111]   train-rmse:1472.578346 
[112]   train-rmse:1466.255644 
[113]   train-rmse:1458.498172 
[114]   train-rmse:1453.800531 

Step 11: Checking accuracy on train data

fitted<-xgb.model.best%>%
  predict(train.df2)


final.df2<-cbind(train.df2,y_train,fitted)
head(final.df2)
     year Month_Nm Seasonality blackhawk cerrogord clinton dallas desmoines
[1,] 2020        6  0.21297876         1         0       0      0         0
[2,] 2019       10  0.05222602         0         0       0      0         0
[3,] 2017        7  1.00000000         0         0       0      0         0
[4,] 2019        2 -0.72872995         0         0       1      0         0
[5,] 2018       12 -0.63933027         0         0       0      0         0
[6,] 2018        3 -0.61333964         0         0       0      0         0
     dickinson dubuque johnson lee linn marshall muscatine polk pottawatta
[1,]         0       0       0   0    0        0         0    0          0
[2,]         0       0       0   0    1        0         0    0          0
[3,]         1       0       0   0    0        0         0    0          0
[4,]         0       0       0   0    0        0         0    0          0
[5,]         0       0       0   0    0        0         0    0          0
[6,]         0       0       0   0    1        0         0    0          0
     scott story wapello warren webster woodbury  y_train    fitted
[1,]     0     0       0      0       0        0 23488.39 22384.041
[2,]     0     0       0      0       0        0 40141.11 41048.602
[3,]     0     0       0      0       0        0  9014.65  8287.662
[4,]     0     0       0      0       0        0  5852.30  5302.017
[5,]     0     1       0      0       0        0  9530.67 10278.403
[6,]     0     0       0      0       0        0 31092.09 30795.586


Gathering the county columns into a single column so that accuracy can be computed per county:

train.mape <- final.df2%>%
  as.data.frame()%>%
  ungroup()%>%
  # head()%>%
  mutate(Mape=(y_train - fitted)/y_train)%>%
  mutate(Mape_Prec=abs(Mape))%>%
  gather(key="County",value = "Values",blackhawk :scott)%>%
  filter(Values==1)%>%
  group_by(County)%>%
  summarise(Accuracy_Train=1 - mean(Mape_Prec  ))

head(train.mape)
# A tibble: 6 × 2
  County    Accuracy_Train
  <chr>              <dbl>
1 blackhawk          0.957
2 cerrogord          0.887
3 clinton            0.895
4 dallas             0.806
5 desmoines          0.905
6 dickinson          0.871
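
Besides the per-county view, an overall training accuracy can be computed in one line (fitted here still holds the training predictions; overall_train_accuracy is just an illustrative name):

# Overall accuracy on the training data = 1 - mean absolute percentage error
overall_train_accuracy <- 1 - mean(abs(y_train - fitted) / y_train)
overall_train_accuracy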


Step 12: Checking accuracy on test data

fitted<-xgb.model.best%>%
  predict(test.df2)


final.df2<-cbind(test.df2,y_test,fitted)
head(final.df2)
     year Month_Nm Seasonality blackhawk cerrogord clinton dallas desmoines
[1,] 2017        1  -0.7251846         0         0       0      0         0
[2,] 2017        1  -0.3864800         0         0       1      0         0
[3,] 2017        1  -0.2766859         0         0       0      1         0
[4,] 2017        1  -0.4690164         0         0       0      0         0
[5,] 2017        1   0.2058128         0         0       0      0         0
[6,] 2017        1  -0.5343577         0         0       0      0         0
     dickinson dubuque johnson lee linn marshall muscatine polk pottawatta
[1,]         0       0       0   0    0        1         0    0          0
[2,]         0       0       0   0    0        0         0    0          0
[3,]         0       0       0   0    0        0         0    0          0
[4,]         0       0       0   0    0        0         0    0          0
[5,]         0       0       0   0    0        0         0    0          0
[6,]         0       0       1   0    0        0         0    0          0
     scott story wapello warren webster woodbury   y_test    fitted
[1,]     0     0       0      0       0        0  2655.64  3891.091
[2,]     0     0       0      0       0        0  4618.92  5409.865
[3,]     0     0       0      0       0        0  4842.17  6098.023
[4,]     0     0       0      0       0        1  7790.48  9749.588
[5,]     0     1       0      0       0        0  9406.41 12686.065
[6,]     0     0       0      0       0        0 20782.37 21141.416
# Mape on test data
test.mape <- final.df2%>%
  as.data.frame()%>%
  ungroup()%>%
  # head()%>%
  mutate(Mape=(y_test - fitted)/y_test)%>%
  mutate(Mape_Prec=abs(Mape))%>%
  gather(key="County",value = "Values",blackhawk :scott)%>%
  filter(Values==1)%>%
  group_by(County)%>%
  summarise(Accuracy_Test=1 - mean(Mape_Prec  ))

head(test.mape)
# A tibble: 6 × 2
  County    Accuracy_Test
  <chr>             <dbl>
1 blackhawk         0.925
2 cerrogord         0.891
3 clinton           0.882
4 dallas            0.656
5 desmoines         0.890
6 dickinson         0.869


Step 13: Comparing train and test accuracy

# Combining Both
final.df<-train.mape%>%
  left_join(test.mape,"County")%>%
  mutate(Diff=round(Accuracy_Train - Accuracy_Test , 2))%>%
  arrange(desc(Diff))

final.df
# A tibble: 15 × 4
   County     Accuracy_Train Accuracy_Test  Diff
   <chr>               <dbl>         <dbl> <dbl>
 1 dallas              0.806         0.656  0.15
 2 marshall            0.830         0.687  0.14
 3 scott               0.944         0.862  0.08
 4 johnson             0.933         0.858  0.07
 5 linn                0.958         0.898  0.06
 6 polk                0.979         0.928  0.05
 7 pottawatta          0.944         0.889  0.05
 8 blackhawk           0.957         0.925  0.03
 9 dubuque             0.928         0.902  0.03
10 lee                 0.841         0.824  0.02
11 clinton             0.895         0.882  0.01
12 desmoines           0.905         0.890  0.01
13 muscatine           0.868         0.854  0.01
14 cerrogord           0.887         0.891  0   
15 dickinson           0.871         0.869  0   
