Machine Learning Made Easy: 2020

Introduction

In data analysis we often need to extract a certain portion of a text based on some pattern in it. The text generally resides in a column of a data frame or is stored in a vector. In this blog, we will look at how to use regular expressions in R along with some useful string functions.

Step 1:Installing libraries

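For this blog, we will install the stringr and stringi libraries, which house most of the functions for matching patterns and extracting text. The same loop also installs dplyr and tidyr; dplyr supplies the %>% pipe and the mutate, select and rename functions used in the later steps.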

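package.name <- c("dplyr", "tidyr", "stringr", "stringi")

for (i in package.name) {
  # require() returns FALSE when the package is not installed yet
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  # attach the package (has no further effect if require() already attached it)
  library(i, character.only = TRUE)
}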

Step 2:str_detect function

Let's check whether a particular pattern exists in a text string.

str_detect("Roger is great","Roger")

[1] TRUE

Here we have the string "Roger is great" and we check whether the word "Roger" is present in it. str_detect (the function for detecting a pattern in a string) tells us whether the pattern occurs in the string. The result is a logical vector.

Let's provide a vector of strings and check again.

txt<-c("Roger is great","Roger is from Switzerland","He has won 20 grand slam titles")
str_detect(txt,"Roger")

[1]  TRUE  TRUE FALSE

There are 3 elements in the txt vector. The first two contain the word Roger but the third doesn't. Hence we get a logical vector with TRUE in the first two places and FALSE in the third.
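So far the pattern has been a literal word. As a minimal sketch of how the same function behaves with actual regular expression syntax, the example below uses a slightly modified txt vector (the lowercase "roger" in the second element is an assumption made for illustration, not part of the original data).

```r
library(stringr)

txt <- c("Roger is great", "roger is from Switzerland", "He has won 20 grand slam titles")

# Case-insensitive match: regex() wraps the pattern together with matching options
has_roger <- str_detect(txt, regex("roger", ignore_case = TRUE))

# "^" anchors the pattern to the start of the string
starts_with_he <- str_detect(txt, "^He")

# A character class followed by "+" matches one or more digits anywhere in the string
has_number <- str_detect(txt, "[0-9]+")

# A logical vector can be used directly to subset the input
txt[has_number]
```

Because the result is a logical vector of the same length as the input, it works as a filter for vectors and, in the same way, for data frame rows.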

Step 3:str_extract function

In certain cases, instead of detecting the presence of a pattern, we would like to extract certain information from the text input. In those cases we normally use str_extract and str_extract_all.

In a slight modification of the previous example, let's try to extract a number from the txt vector.

txt<-c("Roger is great","Roger is from Switzerland","He has won 20 grand slam titles plus 37 ATP titles")
output<-str_extract(txt,"[0-9]+")
output

[1] NA   NA   "20"

str_extract returns the first occurrence of a number within each string, or NA when there is no match. The output is a character vector. If we want all the numbers within the text, we should use str_extract_all.

txt<-c("Roger is great","Roger is from Switzerland","He has won 20 grand slam titles plus 37 ATP titles")
output<-str_extract_all(txt,"[0-9]+")
output

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "20" "37"

As we can see, we have extracted 20 and 37 from the last element of the txt vector. The output is a list with one character vector per input element; elements with no match give character(0).
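Since the result is a list of character vectors, a little post-processing is usually needed before the numbers can be used. The following is a small sketch of one way to do that; the variable names matches, n_found and totals are made up for this illustration.

```r
library(stringr)

txt <- c("Roger is great", "Roger is from Switzerland",
         "He has won 20 grand slam titles plus 37 ATP titles")

matches <- str_extract_all(txt, "[0-9]+")

# Number of matches found in each element of txt
n_found <- lengths(matches)

# Convert each list element to numeric and add the values up;
# sum() of an empty vector is 0, so elements without a match give 0
totals <- sapply(matches, function(x) sum(as.numeric(x)))
```

Keeping the list structure until the last step preserves the link between each input string and its matches.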

Step 4:str_split

Let's say we want to split the text based on a certain pattern. We can do it using str_split.

txt<-c("Roger Federer,20","Rafael Nadal,20","Novak Djokovic,17")
output<-str_split(txt,"[,]")
output

[[1]]
[1] "Roger Federer" "20"           

[[2]]
[1] "Rafael Nadal" "20"          

[[3]]
[1] "Novak Djokovic" "17"

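We can create a meaningful data frame out of the above output. Because sapply names its result after the input strings, the original text shows up as the row names.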

txt<-c("Roger Federer,20","Rafael Nadal,20","Novak Djokovic,17")
player.df<-data.frame(Name=sapply(txt,function(x){
  
  str_split(x,"[,]")[[1]][1]
  
}),
'Grand Slam Title'=sapply(txt,function(x){
  
  str_split(x,"[,]")[[1]][2]
  
}))

player.df

                            Name Grand.Slam.Title
Roger Federer,20   Roger Federer               20
Rafael Nadal,20     Rafael Nadal               20
Novak Djokovic,17 Novak Djokovic               17
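When every string splits into the same number of pieces, stringr's str_split_fixed may be a more direct route, since it returns a character matrix rather than a list. The sketch below is an alternative to the sapply approach above, not a replacement for it; converting the title count to integer is an extra step added here for illustration.

```r
library(stringr)

txt <- c("Roger Federer,20", "Rafael Nadal,20", "Novak Djokovic,17")

# Split each element into exactly 2 pieces; the result is a character matrix
parts <- str_split_fixed(txt, ",", n = 2)

player.df <- data.frame(
  Name = parts[, 1],
  Grand.Slam.Title = as.integer(parts[, 2]),  # store the count as a number
  stringsAsFactors = FALSE
)

player.df
```

This version has plain numeric row names, because the matrix columns carry no names from the input strings.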

Step 5:Replace Function using gsub

If we want to replace a certain portion of the text with some other value, we can do that using the base R gsub function.

txt<-c("Novak Djokovic is at the top of ATP rankings")
gsub("Novak Djokovic","Roger Federer",txt)

[1] "Roger Federer is at the top of ATP rankings"
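The first argument of gsub is itself a regular expression, so the replacement is not limited to fixed text. Here is a short sketch using base R only; the second sentence in txt and the replacement text "N" are assumptions added for illustration.

```r
txt <- c("Novak Djokovic is at the top of ATP rankings",
         "Roger Federer has 20 grand slam titles")

# gsub() replaces every match of the pattern; here each run of digits becomes "N"
masked <- gsub("[0-9]+", "N", txt)

# Parentheses create capture groups; "\\1" and "\\2" refer back to them.
# sub() replaces only the first match, so just the leading two words are swapped.
swapped <- sub("^([A-Za-z]+) ([A-Za-z]+)", "\\2, \\1", txt)

masked
swapped
```

Strings without a match are returned unchanged, which is why the first element of masked is identical to its input.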

Step 6:Standardise dates

Regular expressions are especially useful when dates are stored as character strings and we can't apply standard date functions to them directly. We can leverage regular expressions to standardise such dates. In the example shown below, we append a time stamp to the date if one is not already present in the value.

data.df<-data.frame(Value=c("2020-12-19 00:57:40","2020-10-19 00:58:40",
                            "2020-12-18","2012-9-22"))
temp.df<-data.df%>%
  # ":[0-9]{2}:" matches two digits between colons, i.e. the minutes of a time stamp
  mutate(NewValue=ifelse(str_detect(Value,":[0-9]{2}:"),Value,paste0(Value," 00:00:00")))

temp.df

                Value            NewValue
1 2020-12-19 00:57:40 2020-12-19 00:57:40
2 2020-10-19 00:58:40 2020-10-19 00:58:40
3          2020-12-18 2020-12-18 00:00:00
4           2012-9-22  2012-9-22 00:00:00
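Once every value has the same layout, it can be handed to the standard date functions. The sketch below shows one possible next step with base R's as.POSIXct; the UTC time zone is an assumption, the vector is a shortened copy of the data above, and the detection pattern is written in a slightly more compact form.

```r
library(stringr)

value <- c("2020-12-19 00:57:40", "2020-12-18", "2012-9-22")

# Same idea as above: append midnight when no time component is present
standardised <- ifelse(str_detect(value, ":[0-9]{2}:"),
                       value,
                       paste0(value, " 00:00:00"))

# With a uniform layout, the strings can be parsed as date-times
parsed <- as.POSIXct(standardised, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Formatting the parsed values also zero-pads the single-digit month
format(parsed, "%Y-%m-%d %H:%M:%S")
```

After parsing, the values behave as real date-times, so they can be sorted, compared and subtracted.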

Step 7:Extract Year and Month Info

Let's extract the year and the month from the date.

data.df<-data.frame(Value=c("2020-12-19 00:57:40","2020-10-19 00:58:40",
                            "2020-12-18","2012-9-22"))
temp.df<-data.df%>%
  mutate(Year=str_extract(Value,"^[0-9]{4}"),
         Month=str_extract(Value,"(-[0-9][0-9])|(-[0-9])"),
         MonthNew=gsub('-',"",Month))%>%
  select(-Month)


temp.df

                Value Year MonthNew
1 2020-12-19 00:57:40 2020       12
2 2020-10-19 00:58:40 2020       10
3          2020-12-18 2020       12
4           2012-9-22 2012        9
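The month pattern above lists the two-digit and one-digit cases as alternatives. A quantifier can express the same thing in one piece, as in this sketch; it works on a plain vector, and the names month_chr, year and month are made up for the example.

```r
library(stringr)

value <- c("2020-12-19 00:57:40", "2020-10-19 00:58:40", "2020-12-18", "2012-9-22")

# "{1,2}" means one or two digits, so a single pattern covers "-12" and "-9";
# str_extract returns only the first match, which is the month part
month_chr <- str_remove(str_extract(value, "-[0-9]{1,2}"), "-")

# Convert the extracted text to integers so it can be sorted or compared
year  <- as.integer(str_extract(value, "^[0-9]{4}"))
month <- as.integer(month_chr)
```

The quantifier is greedy, so it takes two digits whenever two are available and falls back to one otherwise.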

Step 8:Extract Date Info

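Let's extract the day of the month from the field. We first split the value on "-" and take the third piece, then split that piece on a space to drop any time component.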

data.df<-data.frame(Value=c("2020-12-19 00:57:40","2020-10-19 00:58:40",
                            "2020-12-18","2012-9-22"))
temp.df<-data.df%>%
  mutate(Year=str_extract(Value,"^[0-9]{4}"),
         Month=str_extract(Value,"(-[0-9][0-9])|(-[0-9])"),
         MonthNew=gsub('-',"",Month))%>%
  select(-Month)%>%
  mutate(DateNew=sapply(Value,function(x){
    
    z<-str_split(x,"[-]")[[1]][3]
    return(z)
    
  }),
  DateNew2=sapply(DateNew,function(x){
    
     z<-str_split(x," ")[[1]][1]
    return(z)
    
  }))


temp.df

                Value Year MonthNew     DateNew DateNew2
1 2020-12-19 00:57:40 2020       12 19 00:57:40       19
2 2020-10-19 00:58:40 2020       10 19 00:58:40       19
3          2020-12-18 2020       12          18       18
4           2012-9-22 2012        9          22       22
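The year, month and day can also be pulled out in a single pass with capture groups. The following is a sketch using stringr's str_match, offered as an alternative to the step-by-step approach above; the data frame name date.df is hypothetical.

```r
library(stringr)

value <- c("2020-12-19 00:57:40", "2020-10-19 00:58:40", "2020-12-18", "2012-9-22")

# Each pair of parentheses is a capture group: year, month, day
parts <- str_match(value, "^([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})")

# Column 1 holds the full match; columns 2 to 4 hold the three groups
date.df <- data.frame(
  Value = value,
  Year  = parts[, 2],
  Month = parts[, 3],
  Date  = parts[, 4],
  stringsAsFactors = FALSE
)

date.df
```

A value that does not fit the pattern gives a row of NA in parts, which makes malformed dates easy to spot.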

Step 9:Combining Year, Month and Date Info into a Data Frame

final.df<-temp.df%>%
  select(Value,Year,MonthNew,DateNew2)%>%
  rename(Month=MonthNew)%>%
  rename(Date=DateNew2)

final.df

                Value Year Month Date
1 2020-12-19 00:57:40 2020    12   19
2 2020-10-19 00:58:40 2020    10   19
3          2020-12-18 2020    12   18
4           2012-9-22 2012     9   22

In this blog we saw simple yet effective examples of how regular expressions can be used in conjunction with the str_ functions to extract a meaningful summary from text data.