Monday, May 31, 2021

Blog 42 Pivot Using multiple Columns in R

Pivot Multiple Columns in R


Introduction

There are certain cases where we need to create a pivot/widen the data based on multiple features.This cant be done simply and requires a two step approach.We will look at this using Medicaid dataset present in the AER package.

package.name<-c("dplyr","tidyr","stringr","AER")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

data("Medicaid1986")
df<-Medicaid1986
head(df)
  visits exposure children age income health1 health2 access married gender
1      0      100        1  24 14.500   0.495  -0.854   0.50      no female
2      1       90        3  19  6.000   0.520  -0.969   0.17      no female
3      0      106        4  17  8.377  -1.227   0.317   0.42      no female
4      0      114        2  29  6.000  -1.524   0.457   0.33      no female
5     11      115        1  26  8.500   0.173  -0.599   0.67      no female
6      3      102        1  22  6.000  -0.905   0.062   0.25      no female
  ethnicity school enroll program
1      cauc     13    yes    afdc
2      cauc     11    yes    afdc
3      cauc     12    yes    afdc
4      cauc     12    yes    afdc
5      cauc     16    yes    afdc
6     other     12    yes    afdc

The following columns are categorical in nature:

  • married
  • gender
  • ethnicity
  • enroll
  • program

Lets say we want to pivot the dataset based on the above features.In order to do this, we need to use the following two step approach:

  • Step 1: unpivot the data on the mentioned categorical data
  • Step 2: then pivot the data based on the key:value pair obtained

Step 1: unpivot the data on the mentioned categorical data

interim.df<-df%>%
  mutate(Index=1:n())%>%
  gather(key = "attributes",value = "values",
         c("married","gender","ethnicity","enroll","program"))%>%
  mutate(key_value=str_c(attributes ,"_",values))%>%
  select(Index,everything())%>%
  unique()
Warning: attributes are not identical across measure variables;
they will be dropped
head(interim.df)
  Index visits exposure children age income health1 health2 access school
1     1      0      100        1  24 14.500   0.495  -0.854   0.50     13
2     2      1       90        3  19  6.000   0.520  -0.969   0.17     11
3     3      0      106        4  17  8.377  -1.227   0.317   0.42     12
4     4      0      114        2  29  6.000  -1.524   0.457   0.33     12
5     5     11      115        1  26  8.500   0.173  -0.599   0.67     16
6     6      3      102        1  22  6.000  -0.905   0.062   0.25     12
  attributes values  key_value
1    married     no married_no
2    married     no married_no
3    married     no married_no
4    married     no married_no
5    married     no married_no
6    married     no married_no


Step 2: then pivot the data based on the key:value pair obtained

final.df<-interim.df%>%
  mutate(key_value=str_c(attributes ,"_",values))%>%
  mutate(TotalCount=1)%>%
  select(-attributes,-values)%>% # Most important step
  spread(key_value,TotalCount,fill=NA)
  
head(final.df)
  Index visits exposure children age income health1 health2 access school
1     1      0      100        1  24 14.500   0.495  -0.854   0.50     13
2     2      1       90        3  19  6.000   0.520  -0.969   0.17     11
3     3      0      106        4  17  8.377  -1.227   0.317   0.42     12
4     4      0      114        2  29  6.000  -1.524   0.457   0.33     12
5     5     11      115        1  26  8.500   0.173  -0.599   0.67     16
6     6      3      102        1  22  6.000  -0.905   0.062   0.25     12
  enroll_no enroll_yes ethnicity_cauc ethnicity_other gender_female gender_male
1        NA          1              1              NA             1          NA
2        NA          1              1              NA             1          NA
3        NA          1              1              NA             1          NA
4        NA          1              1              NA             1          NA
5        NA          1              1              NA             1          NA
6        NA          1             NA               1             1          NA
  married_no married_yes program_afdc program_ssi
1          1          NA            1          NA
2          1          NA            1          NA
3          1          NA            1          NA
4          1          NA            1          NA
5          1          NA            1          NA
6          1          NA            1          NA

final.df contains the pivotted dataset or the wide form dataset.

Final Note

We saw how it is extremely easy to create a pivoted or a wide form dataset using multiple keys in R

My Youtube Channel

Web Scraping Tutorial 2 - Getting the Avg Rating and Reviews Count

Web Scrapping Tutorial 2: Getting Overall rating and number of reviews ...