Machine Learning Made Easy: April 2020

Histograms in R

Introduction

While analysing datasets, it is important to represent summary statistics using appropriate graphs. In this series, we will look at how to create the most commonly used plots using the ggplot2 library. Real-world scenarios will be used to understand the finer details of the implementation.

Installing the libraries: dplyr, tidyr, Ecdat and ggplot2

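package.name<-c("dplyr","tidyr","Ecdat","ggplot2")

for(i in package.name){

  # install the package only if it is not already available
  if(!requireNamespace(i,quietly = TRUE)){

    install.packages(i)
  }
  # attach the package so that its functions can be called directly
  library(i,character.only = TRUE)

}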


# Ecdat package has the 'Health Insurance and Hours Worked By Wives' data
data(HI)
df<-HI
head(df)

  whrswk hhi whi hhi2  education  race hispanic experience kidslt6 kids618
1      0  no  no   no 13-15years white       no       13.0       2       1
2     50  no yes   no 13-15years white       no       24.0       0       1
3     40 yes  no  yes    12years white       no       43.0       0       0
4     40  no yes  yes 13-15years white       no       17.0       0       1
5      0 yes  no  yes  9-11years white       no       44.5       0       0
6     40 yes yes  yes    12years white       no       32.0       0       0
   husby       region   wght
1 11.960 northcentral 214986
2  1.200 northcentral 210119
3 31.275 northcentral 219955
4  9.000 northcentral 210317
5  0.000 northcentral 219955
6 15.690 northcentral 208148

Step 1: Frequency Profile of the Variables

Let's look at the count of records for the different levels of the categorical variables

interim.df<-df%>%
  select(hhi,whi,hhi2,education,race,hispanic,kidslt6,kids618,region)
  
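l1<-lapply(colnames(interim.df),function(x){

  # keep one column at a time and tag it with its own name
  z<-interim.df%>%
    select(all_of(x))%>%
    mutate(Feature=x)
  
  colnames(z)<-c("Level","Feature")
  
  # count the records for each level of the feature
  z1<-z%>%
    group_by(Feature,Level)%>%
    summarise(Total=n(),.groups="drop")
  
  # convert Level to character so that factor and numeric columns can be stacked
  z1[["Level"]]<-as.character(z1[["Level"]])
  
  return(z1)
})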

df.final<-do.call(rbind.data.frame,l1)%>%
  as.data.frame()
row.names(df.final)<-NULL
head(df.final)

  Feature Level Total
1     hhi    no 11219
2     hhi   yes 11053
3     whi    no 13961
4     whi   yes  8311
5    hhi2    no  8696
6    hhi2   yes 13576
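
As a cross-check on the counts above, the same Feature/Level/Total layout can be sketched in base R with table(). This is only an illustrative sketch: the toy.df data frame and its values are made up here and stand in for the categorical columns of HI, so that the snippet runs without any package.

```r
# Toy data standing in for a few categorical columns of HI
# (hypothetical values, used only for illustration)
toy.df <- data.frame(
  hhi     = c("no", "yes", "no", "yes", "no"),
  whi     = c("no", "no", "yes", "yes", "no"),
  kidslt6 = c(0, 1, 0, 2, 0),
  stringsAsFactors = FALSE
)

# Build one small Feature/Level/Total table per column and stack them
freq.list <- lapply(names(toy.df), function(col) {
  counts <- table(toy.df[[col]])        # frequency of each level
  data.frame(
    Feature = col,
    Level   = names(counts),            # levels are returned as character
    Total   = as.integer(counts),
    stringsAsFactors = FALSE
  )
})

freq.profile <- do.call(rbind, freq.list)
row.names(freq.profile) <- NULL
print(freq.profile)
```

If the totals from table() and from the dplyr pipeline disagree for a column, that usually points to missing values, which table() drops by default.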

Histograms (Bar Plots) of all the Categorical Variables

lapply(unique(df.final[["Feature"]]),function(y){

p<-ggplot(data=df.final%>%
            as.data.frame()%>%
            filter(Feature==y), aes(x=Level, y=Total)) +
  geom_bar(stat="identity",fill = "orange")+
  ggtitle("Barplot") +
  xlab(y)+ylab("Frequency Count")+
  theme(plot.title = element_text(hjust = 0.5))# hjust value of 0.5 centre aligns the title
p

  
})

[Output: a list of nine bar plots, one for each categorical variable: hhi, whi, hhi2, education, race, hispanic, kidslt6, kids618 and region]
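
If one combined figure is preferred over a list of separate plots, a possible alternative is to facet on Feature. The sketch below is not part of the original workflow: toy.freq is a made-up table in the same layout as df.final, and the counts in it are hypothetical.

```r
library(ggplot2)

# Toy frequency table in the same Feature/Level/Total layout as df.final
# (the counts below are made up for illustration only)
toy.freq <- data.frame(
  Feature = c("hhi", "hhi", "race", "race", "race"),
  Level   = c("no", "yes", "white", "black", "other"),
  Total   = c(12, 8, 14, 4, 2),
  stringsAsFactors = FALSE
)

# geom_col() is equivalent to geom_bar(stat = "identity");
# facet_wrap() draws one panel per Feature, and scales = "free"
# lets each panel keep its own levels and its own count range
p.all <- ggplot(toy.freq, aes(x = Level, y = Total)) +
  geom_col(fill = "orange") +
  facet_wrap(~ Feature, scales = "free") +
  ggtitle("Barplot") +
  xlab("Level") + ylab("Frequency Count") +
  theme(plot.title = element_text(hjust = 0.5))

p.all
```

Replacing toy.freq with df.final should give one panel per categorical variable, at the cost of smaller individual plots.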

Final Comments

The plots above show how a frequency summary can be represented visually using ggplot2.

Link to Previous R Blogs

https://ml-withparag.com/

List of Datasets for Practice

https://hofmann.public.iastate.edu/data_in_r_sortable.html

https://vincentarelbundock.github.io/Rdatasets/datasets.html


Sunday, April 12, 2020

Blog 23: Histogram using ggplot