Handling NA (Missing Values) in R
Parag Verma
27th Dec, 2019
NA Values: What Are They?
Missing values in R are represented as NA. R gives NA special treatment, and many built-in functions are designed to detect and handle it. We will look at some examples of how to handle NA values within a vector and within a data frame.
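As a small illustration of what NA is, the sketch below checks its class in a few contexts. The variable name typed_nas is made up for this example; everything else is base R.

```r
# NA on its own is a logical constant, but it takes on the type
# of the vector it is placed in.
class(NA)            # "logical"
class(c(1, NA))      # "numeric"
class(c("a", NA))    # "character"

# Base R also provides typed NA constants for each atomic type.
typed_nas <- c(class(NA_integer_), class(NA_real_), class(NA_character_))
typed_nas

# NaN ("not a number") is treated as missing by is.na(),
# but NA itself is not NaN.
is.na(NaN)
is.nan(NA)
```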
 
NA values within a vector
Let's create a vector that contains an NA value.
s = c(1,2,3,NA) 
s
[1]  1  2  3 NA
‘s’ is a numeric vector with a single NA value. Let's check the class of ‘s’ to see whether NA affects the data type.
class(s)
[1] "numeric"
NA has no impact on the class of ‘s’.
Get the total Count of NA in ‘s’
The is.na() function checks for the presence of NA. It returns a logical vector with TRUE at the positions where NA values are present and FALSE elsewhere.
is.na(s)
[1] FALSE FALSE FALSE  TRUE
Because TRUE counts as 1 and FALSE as 0, the sum of is.na(s) gives the total number of NA values.
sum(is.na(s))
[1] 1
Get the Index position of NA in ‘s’
Here we can use the which() function.
which(is.na(s))
[1] 4
This is useful when we want to replace (impute) the NA values within the vector.
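A minimal sketch of such a replacement is given here. It assumes we want to fill the gap with the mean of the observed values, which is only one possible choice; the names na_idx, fill_value and s_imputed are made up for this example.

```r
s <- c(1, 2, 3, NA)

# Positions of the missing values
na_idx <- which(is.na(s))

# Assumption: impute with the mean of the non-missing values.
# na.rm = TRUE drops the NA values before averaging.
fill_value <- mean(s, na.rm = TRUE)

# Work on a copy so the original vector stays untouched
s_imputed <- s
s_imputed[na_idx] <- fill_value
s_imputed
```

Keeping the original vector and writing to a copy makes it easy to compare the values before and after imputation.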
Mathematical Operation on ‘s’
We will now look at the impact of NA on vector operations.
t<-c(10,NA,12,NA)
Let's add ‘s’ to ‘t’.
s + t
[1] 11 NA 15 NA
We can draw the following inferences:
- NA added to a number gives NA
- NA added to NA gives NA
The example uses addition, but the same holds for the other arithmetic operators.
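The same behaviour can be checked for a few other operators. This is a short sketch using the vectors defined above; the result names (diff_st, prod_st, ratio_st, eq_na) are made up for illustration.

```r
s <- c(1, 2, 3, NA)
t <- c(10, NA, 12, NA)

# Any position where either operand is NA yields NA
diff_st  <- s - t
prod_st  <- s * t
ratio_st <- s / t

# Comparisons propagate NA as well, which is why `s == NA`
# cannot be used to test for missing values; use is.na(s) instead.
eq_na <- s == NA

prod_st
eq_na
```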
Inbuilt Mathematical function on ‘s’
Let's apply a built-in arithmetic function such as mean() to ‘s’.
mean(s)
[1] NA
The result is NA. To compute the mean while ignoring NA, we pass the argument na.rm=T to mean(). This explicitly asks the function to remove the NA values from ‘s’ before computing the mean.
mean(s,na.rm=T)
[1] 2
Logical Operation on ‘s’
s1<-c(T,NA,F,NA,NA)
t1<-c(T,F,F,NA,T)
Let's AND ‘s1’ with ‘t1’.
s1 & t1
[1]  TRUE FALSE FALSE    NA    NA
We can draw the following inferences:
- NA AND F gives F
- NA AND NA gives NA
- NA AND T gives NA
Let's OR ‘s1’ with ‘t1’.
s1 | t1
[1]  TRUE    NA FALSE    NA  TRUE
We can draw the following inferences:
- NA OR F gives NA
- NA OR NA gives NA
- NA OR T gives T
Let's NOT ‘s1’.
!s1
[1] FALSE    NA  TRUE    NA    NA
We can draw the following inference:
- NOT on NA gives NA
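One way to read these rules is to treat NA as "unknown": the result is NA only when the unknown value could change the answer. The sketch below builds the full truth tables with outer(); the names vals, and_table, or_table and not_vals are made up for this example.

```r
# The three possible logical values, labelled for readable tables
vals <- c(TRUE, NA, FALSE)
names(vals) <- c("TRUE", "NA", "FALSE")

# outer() applies the operator to every pair of values
and_table <- outer(vals, vals, "&")
or_table  <- outer(vals, vals, "|")
not_vals  <- !vals

and_table   # NA & FALSE is FALSE: the result is FALSE whatever NA stands for
or_table    # NA | TRUE is TRUE: the result is TRUE whatever NA stands for
not_vals    # !NA stays NA
```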
Practical Use Case: NA and Data Frames
In a data frame, we can select rows, columns, or both, so we will essentially be working with subsets of rows and/or columns. Let's load a data frame to work with.
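# Install dplyr if it is missing; require() loads it when it is already installed
if(!require("dplyr")){
  
  install.packages("dplyr")
  library(dplyr)
}
# List the datasets bundled with dplyr; starwars is one of them
data(package = "dplyr")
df<-starwars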
colnames(df)
 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
[11] "films"      "vehicles"   "starships" 
head(df[,1:4])
# A tibble: 6 x 4
  name           height  mass hair_color 
  <chr>           <int> <dbl> <chr>      
1 Luke Skywalker    172    77 blond      
2 C-3PO             167    75 <NA>       
3 R2-D2              96    32 <NA>       
4 Darth Vader       202   136 none       
5 Leia Organa       150    49 brown      
6 Owen Lars         178   120 brown, grey
We can see NA values in the hair_color column of this preview; other columns, such as gender, contain them as well. Let's create a small report showing the NA count for each variable.
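l1<-list()
for(i in colnames(df)){
  
  # Count the NA values in column i
  Total_Count<-sum(is.na(df[[i]]))
  # One-row data frame; data.frame() turns the name 'Sum of NA' into Sum.of.NA
  temp.df<-data.frame(Variable=i,'Sum of NA'=Total_Count,stringsAsFactors = F)
  l1[[i]]<-temp.df
  
}
# Stack the one-row data frames into a single summary table
df_NA<-do.call(rbind.data.frame,l1)
row.names(df_NA)<-NULL
df_NA
     Variable Sum.of.NA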
1        name         0
2      height         6
3        mass        28
4  hair_color         5
5  skin_color         0
6   eye_color         0
7  birth_year        44
8      gender         3
9   homeworld        10
10    species         5
11      films         0
12   vehicles         0
13  starships         0
NA Imputations
The summary above makes it clear that we need to replace the NA values with suitable values before deriving any insights from the data. First, we identify the columns that need imputation.
required.columns<-df_NA[df_NA$Sum.of.NA > 0,][['Variable']]
required.columns
[1] "height"     "mass"       "hair_color" "birth_year" "gender"    
[6] "homeworld"  "species"   
height and mass are numeric columns (as is birth_year), while the others are categorical. The imputation logic should take the column type into account.
for(j in required.columns){
  
  if(is.numeric(df[[j]])){
    
    # Numeric columns (height, mass, birth_year): use the column mean
    temp<-mean(df[[j]],na.rm=T)
    
  }else{
    
    # Categorical columns: use the most frequent value (the mode)
    temp<-names(sort(table(df[[j]]),decreasing=T)[1])
    
  }
  # Assign through df[[j]] so the fill value and the column type stay compatible
  df[[j]][is.na(df[[j]])]<-temp
  
}
df no longer contains NA values in these columns: each one has been replaced according to whether its column is numeric or categorical. We can check this using the code below.
l1_Check<-list()
for(i in colnames(df)){
  
  # Recount the NA values in column i after imputation
  Total_Count<-sum(is.na(df[[i]]))
  temp.df<-data.frame(Variable=i,'Sum of NA'=Total_Count,stringsAsFactors = F)
  l1_Check[[i]]<-temp.df
  
}
df_NA_Check<-do.call(rbind.data.frame,l1_Check)
row.names(df_NA_Check)<-NULL
df_NA_Check
     Variable Sum.of.NA
1        name         0
2      height         0
3        mass         0
4  hair_color         0
5  skin_color         0
6   eye_color         0
7  birth_year         0
8      gender         0
9   homeworld         0
10    species         0
11      films         0
12   vehicles         0
13  starships         0
Final Comments
In this blog we have seen how to analyse NA values within a vector and a data frame in R. R has many built-in functions that help us count NA values and compute summary statistics such as the mean in their presence. We also saw how to use for loops to build a summary of the NA count per variable, and how to impute the missing values.
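For readers who prefer to avoid explicit loops, the same two steps can be sketched more compactly. This is an illustrative variant and not the approach used in the post: the toy data frame is made up, and impute_column is a hypothetical helper name.

```r
# Toy data frame (made up for illustration) with one numeric
# and one character column, each containing an NA
toy <- data.frame(
  height = c(172, NA, 96, 202),
  hair   = c("brown", NA, "brown", "none"),
  stringsAsFactors = FALSE
)

# NA count per column without a for loop
na_counts <- colSums(is.na(toy))
na_counts

# Hypothetical helper: column mean for numeric columns,
# most frequent value (mode) for everything else
impute_column <- function(x) {
  if (!anyNA(x)) return(x)
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- names(sort(table(x), decreasing = TRUE))[1]
  }
  x
}

# Apply the helper to every column while keeping the data frame structure
toy_imputed <- toy
toy_imputed[] <- lapply(toy, impute_column)
colSums(is.na(toy_imputed))
```

Mean and mode imputation are simple defaults; whether they are appropriate depends on the data and on the analysis that follows.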
Links to Previous R Blogs
Blog 1 - Vectors, Matrices, Lists and Data Frames in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Handling NA in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
List of Datasets for Practise https://hofmann.public.iastate.edu/data_in_r_sortable.html
 