Parsing Pseudo Codes
Parag Verma
Introduction
In this blog, we will look at how to parse simple text in R. The tidytext library can break text into a tidy format, with one token per row. We can then extract entities from the text by matching the tokens against specific dictionaries. We will write a simple line of pseudo code and execute it in R using text parsing and entity extraction.
Installing libraries
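Let's install and load the tidytext library along with the other packages used in this blog.
# Packages used in this blog
package.name <- c("tidytext", "textstem", "dplyr", "tidyr", "stringr")

# Install any package that is missing, then load it
for (i in package.name) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}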
Step 1: Creation of Pseudo Code to calculate mean
The dataset used in this blog is 'mtcars'. We will calculate the mean of the hp column. The pseudo code is first stored in a character vector and then placed in a data frame with one row per line of pseudo code.
# mtcars dataset
df <- mtcars

# Pseudo code to be parsed
pscode <- "Mean of 'hp'"

# One row per line of pseudo code
text_df <- data.frame(line = seq_along(pscode), text = pscode, stringsAsFactors = FALSE)
text_df
  line         text
1    1 Mean of 'hp'
Step 2: Initialising dictionaries
Here we will set up the dictionaries that are used to identify key entities in the pseudo code: the column names of the dataset, and a table that maps words describing a mathematical operation to the R function that performs it.
# Identifying columns
column.identifiers <- colnames(mtcars)

# Identifying mathematical functions
action.identifiers <- data.frame(
  word = c("mean", "average", "sum", "summation", "total"),
  WithinR = c("mean", "mean", "sum", "sum", "sum"),
  stringsAsFactors = FALSE
)
action.identifiers
       word WithinR
1      mean    mean
2   average    mean
3       sum     sum
4 summation     sum
5     total     sum
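To make the role of this dictionary concrete, here is a minimal base R sketch of how a word from the pseudo code resolves to an R function name. The helper lookup_action is a hypothetical name used only for illustration; it is not part of the steps that follow.

```r
# Same dictionary as above: words mapped to R function names
action.identifiers <- data.frame(
  word = c("mean", "average", "sum", "summation", "total"),
  WithinR = c("mean", "mean", "sum", "sum", "sum"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the R function name for a word,
# or NA when the word is not in the dictionary
lookup_action <- function(word) {
  action.identifiers$WithinR[match(tolower(word), action.identifiers$word)]
}

lookup_action("Average")  # "mean"
lookup_action("total")    # "sum"
lookup_action("median")   # NA, because "median" is not in the dictionary
```

Any operation that should be recognised has to be added to the dictionary first, which keeps the set of callable functions explicit.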
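Step 3: Breaking pseudo code into chunks
We will now break the pseudo code into individual words (tokens) and arrange them in a single column, one token per row. Note that unnest_tokens converts the text to lower case and strips punctuation by default, so 'hp' becomes hp.
# Split the text column into one word per row
token.df <- text_df %>%
  unnest_tokens(word, text)
row.names(token.df) <- NULL
head(token.df)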
  line word
1    1 mean
2    1   of
3    1   hp
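The lower-casing and the removal of the quotes around hp matter, because the dictionaries are matched on exact strings. The base R sketch below is only a rough approximation of the default word tokenisation, not the actual tidytext implementation, but it shows why the three tokens come out the way they do.

```r
pscode <- "Mean of 'hp'"

# Rough approximation (an assumption, not tidytext's internals):
# drop punctuation, convert to lower case, split on whitespace
cleaned <- tolower(gsub("[[:punct:]]", "", pscode))
tokens <- strsplit(cleaned, "\\s+")[[1]]
tokens
# [1] "mean" "of"   "hp"
```

One consequence is that a column name containing upper-case letters would not match its lower-cased token, so the approach as written suits datasets such as mtcars whose column names are all lower case.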
Step 4: Extracting Action Entity
Using the snippet below, we match the individual tokens of the pseudo code against the dictionary of mathematical functions and keep the name of the matching R function.
# Extract action to be performed
extract.action <- token.df %>%
  left_join(action.identifiers, by = "word") %>%
  filter(!is.na(WithinR)) %>%
  pull(WithinR)
extract.action
[1] "mean"
Step 5: Extract Column identifier
Next we identify the column on which the mathematical function will be applied by keeping only the tokens that are column names of the dataset.
# Extract the column to operate on
extract.column <- token.df %>%
  filter(word %in% column.identifiers) %>%
  pull(word)
extract.column
[1] "hp"
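Both extraction steps assume that the pseudo code contains exactly one action and exactly one column. If nothing matches, or several words match, the evaluation in the next step may fail or return something misleading. A minimal guard could look like the sketch below; check_single_match is a hypothetical helper name, not something from tidytext or dplyr.

```r
# Hypothetical helper: stop unless exactly one entity was extracted
check_single_match <- function(x, label) {
  if (length(x) == 0) {
    stop("No ", label, " found in the pseudo code")
  }
  if (length(x) > 1) {
    stop("More than one ", label, " found: ", paste(x, collapse = ", "))
  }
  x
}

check_single_match("hp", "column")             # returns "hp"
# check_single_match(character(0), "column")   # would raise an error
# check_single_match(c("hp", "wt"), "column")  # would raise an error
```

The same check can be applied to the extracted action and the extracted column before they are combined.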
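Step 6: Evaluating the pseudo code
We will combine extract.action and extract.column into a single string with the paste0 function, here mean(df[['hp']]), then turn that string into an R expression with parse and evaluate it with eval.
# Build the call as text, parse it and evaluate it
output.value <- eval(parse(text = paste0(extract.action, "(df[['", extract.column, "']])")))
output.value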
[1] 146.6875
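Since eval(parse(text = ...)) runs whatever string it is given, an alternative worth considering is to look the function up by name and call it directly. The sketch below uses base R's match.fun and assumes the values obtained in Step 4 and Step 5; the variable name action.fun is introduced here only for illustration.

```r
df <- mtcars
extract.action <- "mean"  # value obtained in Step 4
extract.column <- "hp"    # value obtained in Step 5

# Look the function up by name and call it on the column directly,
# without building and parsing a string of code
action.fun <- match.fun(extract.action)
output.value <- action.fun(df[[extract.column]])
output.value
# [1] 146.6875
```

Because the function name always comes from the WithinR column of the dictionary, only functions listed there can ever be called.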
Final Comments
In this blog we saw a simple example of how a line of pseudo code can be parsed using the tidytext and dplyr libraries and evaluated with the help of custom dictionaries. The same logic can be extended to more complex pseudo code by adding entries to the dictionaries and handling more than one action or column.
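As a starting point for reuse, the six steps can be collected into a single function. This is a minimal sketch under stated assumptions: run_pseudocode is a hypothetical name, the pseudo code is a single line with exactly one action and one column, and the column names of the dataset are lower case.

```r
library(dplyr)
library(tidytext)

# Dictionary of actions, as in Step 2
action.identifiers <- data.frame(
  word = c("mean", "average", "sum", "summation", "total"),
  WithinR = c("mean", "mean", "sum", "sum", "sum"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around Steps 1 to 6
run_pseudocode <- function(pscode, data) {
  # Steps 1 and 3: store the pseudo code and split it into tokens
  tokens <- data.frame(line = 1L, text = pscode, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, text)

  # Step 4: action entity
  action <- tokens %>%
    inner_join(action.identifiers, by = "word") %>%
    pull(WithinR)

  # Step 5: column entity
  column <- tokens %>%
    filter(word %in% colnames(data)) %>%
    pull(word)

  # Assumption: exactly one action and one column per line of pseudo code
  stopifnot(length(action) == 1, length(column) == 1)

  # Step 6: call the function by name instead of parsing a string
  match.fun(action)(data[[column]])
}

run_pseudocode("Average of 'mpg'", mtcars)  # same as mean(mtcars$mpg)
run_pseudocode("Total of 'wt'", mtcars)     # same as sum(mtcars$wt)
```

Pseudo code with several actions or columns would need additional rules for pairing each action with its column, which this sketch does not attempt.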
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html