Introduction
In data analysis we often need to extract a portion of a text based on some pattern in it. The text generally resides in a column of a data frame or is stored in a vector. In this blog, we will look at how to use regular expressions in R along with some useful functions.
Step 1: Installing libraries
For this blog, we will install the stringr and stringi libraries, which house most of the functions for matching patterns and extracting text. We also load dplyr and tidyr; dplyr provides the %>% pipe, mutate, select and rename used in later steps.
package.name<-c("dplyr","tidyr","stringr","stringi")
for(i in package.name){
  # install the package only if it cannot be loaded
  if(!require(i,character.only = TRUE)){
    install.packages(i)
  }
  library(i,character.only = TRUE)
}
Step 2: str_detect function
Let's check whether a particular pattern exists in a text string.
str_detect("Roger is great","Roger")
[1] TRUE
Here we have the string "Roger is great" and check whether the word "Roger" is present in it. str_detect (the function for detecting a pattern in a string) tells us whether the pattern is present. The result is a logical vector.
Let's provide a vector of strings and check again.
txt<-c("Roger is great","Roger is from Switzerland","He has won 20 grand slam titles")
str_detect(txt,"Roger")
[1] TRUE TRUE FALSE
There are 3 elements in the txt vector. The first two contain the word Roger but the third doesn't. Hence we get a logical vector with TRUE in the first two places and FALSE in the third.
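Because str_detect returns a logical vector, it can be used directly to subset the original vector. The following is a minimal sketch that reuses the txt vector from above; the case-insensitive variant is an illustrative addition and not part of the original example.

```r
library(stringr)

txt <- c("Roger is great", "Roger is from Switzerland",
         "He has won 20 grand slam titles")

# Keep only the elements that contain the pattern
with_roger <- txt[str_detect(txt, "Roger")]

# str_subset() is a shortcut for the same operation
with_roger2 <- str_subset(txt, "Roger")

# Case-insensitive matching via the regex() helper
any_case <- str_detect(txt, regex("roger", ignore_case = TRUE))
```

Both with_roger and with_roger2 hold the first two elements of txt.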
Step 3: str_extract function
In certain cases, instead of detecting the presence of a pattern, we would like to extract certain information from the text input. In those cases we normally use str_extract and str_extract_all.
In a slight modification of the previous example, let's try to extract a number from the txt vector.
txt<-c("Roger is great","Roger is from Switzerland","He has won 20 grand slam titles plus 37 ATP titles")
output<-str_extract(txt,"[0-9]+")
output
[1] NA NA "20"
str_extract returns the first occurrence of a number within each string, or NA when there is no match. The output is a character vector. If we want all the numbers within the text, we should use str_extract_all.
txt<-c("Roger is great","Roger is from Switzerland","He has won 20 grand slam titles plus 37 ATP titles")
output<-str_extract_all(txt,"[0-9]+")
output
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] "20" "37"
As we can see, we have extracted 20 and 37 from the last element of the txt vector. The class of 'output' is a list, with one character vector per input element.
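Since the result is a list, a little post-processing is usually needed before the numbers can be used. A small sketch, assuming we want counts and totals per element (the variable names here are illustrative):

```r
library(stringr)

txt <- c("Roger is great", "Roger is from Switzerland",
         "He has won 20 grand slam titles plus 37 ATP titles")

matches <- str_extract_all(txt, "[0-9]+")

# Number of matches found in each element
n_matches <- lengths(matches)

# Flatten the list and convert the matches to numbers
all_numbers <- as.numeric(unlist(matches))

# Sum the numbers found in each element (0 when there is no match)
totals <- sapply(matches, function(m) sum(as.numeric(m)))
```

Note that the matches are character strings, so as.numeric is needed before doing any arithmetic on them.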
Step 4: str_split
Let's say we want to split the text based on a certain pattern. We can do it using str_split.
txt<-c("Roger Federer,20","Rafael Nadal,20","Novak Djokovic,17")
output<-str_split(txt,"[,]")
output
[[1]]
[1] "Roger Federer" "20"
[[2]]
[1] "Rafael Nadal" "20"
[[3]]
[1] "Novak Djokovic" "17"
We can create a meaningful data frame out of the above output.
txt<-c("Roger Federer,20","Rafael Nadal,20","Novak Djokovic,17")
player.df<-data.frame(Name=sapply(txt,function(x){
  str_split(x,"[,]")[[1]][1]
}),
'Grand Slam Title'=sapply(txt,function(x){
  str_split(x,"[,]")[[1]][2]
}))
player.df
                            Name Grand.Slam.Title
Roger Federer,20   Roger Federer               20
Rafael Nadal,20     Rafael Nadal               20
Novak Djokovic,17 Novak Djokovic               17
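When every element splits into the same number of pieces, stringr's str_split_fixed can be a shorter route to the same data frame, because it returns a character matrix instead of a list. A hedged sketch; the column name GrandSlamTitles is a hypothetical choice, not taken from the example above.

```r
library(stringr)

txt <- c("Roger Federer,20", "Rafael Nadal,20", "Novak Djokovic,17")

# str_split_fixed() returns a character matrix with exactly n columns
parts <- str_split_fixed(txt, ",", n = 2)

player.df <- data.frame(
  Name = parts[, 1],
  GrandSlamTitles = as.integer(parts[, 2]),  # hypothetical column name
  stringsAsFactors = FALSE
)
```

This version also avoids the long row names seen above, since the columns are built from plain unnamed vectors.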
Step 5: Replace function using gsub
If we want to replace a certain portion of the text with some other value, we can do that using the gsub function.
txt<-c("Novak Djokovic is at the top of ATP rankings")
gsub("Novak Djokovic","Roger Federer",txt)
[1] "Roger Federer is at the top of ATP rankings"
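gsub replaces every match, while its sibling sub replaces only the first one. The pattern is also treated as a regular expression unless fixed = TRUE is set, which matters for characters such as "." that have a special meaning. A short base R sketch with made-up input strings:

```r
txt <- "ATP rankings: ATP points decide ATP seeding"

# sub() replaces only the first match
first_only <- sub("ATP", "WTA", txt)

# gsub() replaces every match
all_matches <- gsub("ATP", "WTA", txt)

# "." is a regex wildcard, so every character gets replaced here
as_regex <- gsub(".", "-", "2020.12.19")

# fixed = TRUE treats the pattern as a literal string
as_literal <- gsub(".", "-", "2020.12.19", fixed = TRUE)
```

Here as_literal is "2020-12-19", whereas as_regex turns the whole string into hyphens.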
Step 6: Standardise dates
Regular expressions are especially useful when dates are stored as character strings in inconsistent formats, so standard date functions cannot be applied to them directly. We can use regular expressions to standardise such dates. In the example below, we append a time stamp to the date if one is not already present.
data.df<-data.frame(Value=c("2020-12-19 00:57:40","2020-10-19 00:58:40",
                            "2020-12-18","2012-9-22"))
# ":[0-9]{2}:" matches the minutes between two colons, i.e. a time stamp is present
temp.df<-data.df%>%
  mutate(NewValue=ifelse(str_detect(Value,":[0-9]{2}:"),Value,paste0(Value," 00:00:00")))
temp.df
                Value            NewValue
1 2020-12-19 00:57:40 2020-12-19 00:57:40
2 2020-10-19 00:58:40 2020-10-19 00:58:40
3          2020-12-18 2020-12-18 00:00:00
4           2012-9-22  2012-9-22 00:00:00
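The last row still has a single-digit month. One possible next step, sketched here in base R, is to zero-pad it with a backreference and then parse the strings into date-time objects. The values vector copies the NewValue column shown above, and the UTC time zone is an assumption.

```r
values <- c("2020-12-19 00:57:40", "2020-10-19 00:58:40",
            "2020-12-18 00:00:00", "2012-9-22 00:00:00")

# Pad a single-digit month: "-9-" becomes "-09-".
# \\1 refers back to the digit captured by the parentheses.
padded <- sub("-([0-9])-", "-0\\1-", values)

# Parse into date-time objects; tz = "UTC" is an assumption
parsed <- as.POSIXct(padded, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

This pattern only pads the month; a single-digit day would need a similar, separate replacement.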
Step 7: Extract Year and Month Info
Let's extract the year and the month from the date.
data.df<-data.frame(Value=c("2020-12-19 00:57:40","2020-10-19 00:58:40",
                            "2020-12-18","2012-9-22"))
temp.df<-data.df%>%
  mutate(Year=str_extract(Value,"^[0-9]{4}"),
         Month=str_extract(Value,"(-[0-9][0-9])|(-[0-9])"),
         MonthNew=gsub('-',"",Month))%>%
  select(-Month)
temp.df
                Value Year MonthNew
1 2020-12-19 00:57:40 2020       12
2 2020-10-19 00:58:40 2020       10
3          2020-12-18 2020       12
4           2012-9-22 2012        9
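Step 8: Extract Date Info
Let's extract the day of the month from the field. We first split on "-" and take the third piece, then split that piece on a space to drop the time stamp.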
data.df<-data.frame(Value=c("2020-12-19 00:57:40","2020-10-19 00:58:40",
                            "2020-12-18","2012-9-22"))
temp.df<-data.df%>%
  mutate(Year=str_extract(Value,"^[0-9]{4}"),
         Month=str_extract(Value,"(-[0-9][0-9])|(-[0-9])"),
         MonthNew=gsub('-',"",Month))%>%
  select(-Month)%>%
  mutate(DateNew=sapply(Value,function(x){
    z<-str_split(x,"[-]")[[1]][3]
    return(z)
  }),
  DateNew2=sapply(DateNew,function(x){
    z<-str_split(x," ")[[1]][1]
    return(z)
  }))
temp.df
                Value Year MonthNew     DateNew DateNew2
1 2020-12-19 00:57:40 2020       12 19 00:57:40       19
2 2020-10-19 00:58:40 2020       10 19 00:58:40       19
3          2020-12-18 2020       12          18       18
4           2012-9-22 2012        9          22       22
Step 9: Combining Year, Month and Date Info into a Data Frame
final.df<-temp.df%>%
  select(Value,Year,MonthNew,DateNew2)%>%
  rename(Month=MonthNew)%>%
  rename(Date=DateNew2)
final.df
                Value Year Month Date
1 2020-12-19 00:57:40 2020    12   19
2 2020-10-19 00:58:40 2020    10   19
3          2020-12-18 2020    12   18
4           2012-9-22 2012     9   22
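Steps 7 to 9 can also be sketched as a single call to str_match, which returns one column per capture group. This is an alternative under the assumption that every value starts with a year-month-day pattern; the name parts.df is illustrative.

```r
library(stringr)

values <- c("2020-12-19 00:57:40", "2020-10-19 00:58:40",
            "2020-12-18", "2012-9-22")

# Column 1 holds the full match; columns 2 to 4 hold the capture groups
m <- str_match(values, "^([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})")

parts.df <- data.frame(
  Value = values,
  Year  = as.integer(m[, 2]),
  Month = as.integer(m[, 3]),
  Date  = as.integer(m[, 4])
)
```

Unlike the step-by-step version, this stores the three parts as integers rather than character strings.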
In this blog we saw simple yet effective examples of how regular expressions can be used together with the str_ functions to extract meaningful information from text data.
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html