Machine Learning Made Easy: June 2020

The Race of My Life: An Autobiography by Milkha Singh
My rating: 4 of 5 stars

No matter what hand life deals you, you should always strive to stay positive and take life head-on. This is the gist of Milkha Singh's life. From his torn childhood during Partition, through life in the army, to becoming Asia's number one athlete, Milkha Singh gives us the strength and positivity to combat the mundane existence of ordinary life. Hope and despair are deeply entwined in his early years, when he tries to pick himself up from the ruins of Partition. He drifted through most of his childhood and youth, and it was only when he joined the army that he found some sense of direction. There he was identified and lauded for his sporting acumen and the laurels it brought to his regiment.
Fast forward to 1956 and he is part of the Indian Olympic contingent to Australia. Having never been exposed to the brilliance of world-class athletes, he failed to make a mark. It was after the 1956 Olympics that he set himself the clear goal of winning the 400 metres race. En route, he won many races, including those at the National Games and the Commonwealth Games, where he set an altogether new record. Going into the 1960 Olympics, most anticipated that Milkha Singh would win the 400 metres event and better the all-time record. The fairytale eventually ended with him finishing 4th. After 1960, his life as a runner took a back seat and faded into oblivion. After his voluntary retirement from the army, he joined the Punjab government as a sports administrator with a vision to identify and nurture bright sportsmen from a very young age. During this time, he married Nimmi and settled down in Chandigarh. Even after his retirement, he continued to push for an overhaul of the sporting infrastructure and a result-oriented model for coaching staff.
From the various stages of his life, one can learn to never give up and to pick oneself up after every setback. It is through the waxing and waning of life that one has to wade and make a mark.

Term Frequency Inverse Document Frequency

Introduction

In this blog we will look at how to use the TF-IDF metric to analyse text. We will take the text example from the previous blog to understand the key differences between the Term Frequency and the Term Frequency Inverse Document Frequency approaches. Broadly, we will look at the following topics:

What is TF-IDF
Calculating TF-IDF on text data
Creating the Document Term Matrix (DTM)

Installing the libraries: tidytext and textstem, along with the dplyr, tidyr and stringr packages

package.name <- c("tidytext", "textstem", "dplyr", "tidyr", "stringr")

# Install any package that is not yet available, then load it
for (i in package.name) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}

What is Term Frequency Inverse Document Frequency (TF-IDF)

It is a metric calculated by multiplying the term frequency (TF) of a word by its inverse document frequency (IDF). IDF decreases the weight of commonly used terms and increases the importance of words that appear in few documents. TF * IDF (TF-IDF) is therefore the frequency of a term adjusted for how rarely it is used across documents.

We already know the formula for TF. Let's look at how to calculate IDF.

IDF is equal to \(\mathrm{ln}(n/n_{word})\) where
\(n\) is the total number of documents (records) in the data set
\(n_{word}\) is the number of documents in which the word appears
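
To get a feel for the formula, the short base R sketch below evaluates it for a hypothetical collection of 100 documents. The helper name idf and the document counts are illustrative assumptions and are not part of tidytext.

```r
# Inverse document frequency: ln(n / n_word)
#   n      = total number of documents
#   n_word = number of documents that contain the word
idf <- function(n, n_word) log(n / n_word)

n_docs <- 100                                              # hypothetical corpus size
doc_counts <- c(rare = 1, common = 50, everywhere = 100)   # hypothetical counts

idf_values <- idf(n_docs, doc_counts)
print(round(idf_values, 3))
```

A word that appears in every document gets an IDF of 0, so its TF-IDF is 0 no matter how often it occurs, while a word found in a single document gets the largest weight.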

Create a sample text data set

Let's create sample text data to understand the key concepts better.

string_txt <- c("Roger Federer is undoubtedly the Greatest tennis player of all times",
                "His legacy is not in the number of grand slam championships",
                " he has won.",
                "He will defintely be remembered for the longevity of his career",
                " and how he was able to take care of his body over the years",
                "His return in 2017 and winning the Autralian open against his",
                " arch rival Nadal is considered to be a modern day spectacle",
                "The only thing left to achieve is the elusive",
                " Olympic gold in Tennis singles")




# In order to analyze this we need to convert it into a data frame
text_df<-data.frame(line=1:length(string_txt),text=string_txt,stringsAsFactors = F)
text_df

  line                                                                 text
1    1 Roger Federer is undoubtedly the Greatest tennis player of all times
2    2          His legacy is not in the number of grand slam championships
3    3                                                          he has won.
4    4      He will defintely be remembered for the longevity of his career
5    5          and how he was able to take care of his body over the years
6    6        His return in 2017 and winning the Autralian open against his
7    7          arch rival Nadal is considered to be a modern day spectacle
8    8                        The only thing left to achieve is the elusive
9    9                                       Olympic gold in Tennis singles

Step 1: Tokenization, lemmatization and removing stop words

custom_words<-c("legacy")
Stopword_custom<-data.frame(word=custom_words,stringsAsFactors = F)%>%
  cbind("lexicon"="SMART")%>%
  rbind.data.frame(stop_words)

token.df<-text_df %>%
  unnest_tokens(word, text)%>%
  mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))%>%
  select(-word)%>%
  rename(word=word2)%>%
  anti_join(Stopword_custom, by = "word")

head(token.df)

  line        word
1    1       roger
2    1     federer
3    1 undoubtedly
4    1      tennis
5    1      player
6    1        time

We can see that the text has been broken into individual chunks. These chunks are known as tokens. The column named line holds the row number of the original text and can be used for grouping frequency-related metrics at row level. We have also used lemmatization and stop word removal to standardise the text data.
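
To make the tokenization and stop word steps concrete, here is a minimal base R sketch of the same idea on the first line of the sample text. The four-word stop_list is a hypothetical stand-in for the much longer tidytext stop_words table, and lemmatization is deliberately left out, so the result differs slightly from the tidytext output above.

```r
# First line of the sample data set
txt <- "Roger Federer is undoubtedly the Greatest tennis player of all times"

# A tiny, hypothetical stop word list (the tidytext one is far longer)
stop_list <- c("is", "the", "of", "all")

# Tokenize: lower-case, then split on anything that is not a letter or digit
tokens <- strsplit(tolower(txt), "[^a-z0-9]+")[[1]]
tokens <- tokens[tokens != ""]          # drop empty strings, if any

# Remove the stop words
tokens_clean <- tokens[!tokens %in% stop_list]
print(tokens_clean)
```

Because no lemmatizer is applied here, times is not reduced to time as it is in the tidytext output.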

Step 2: Calculating TF-IDF for unigrams

tf_idf.unigram<-token.df%>%
  group_by(line,word)%>%
  summarise(Total_Count=n())%>%
  bind_tf_idf(word, line, Total_Count)
  
head(tf_idf.unigram,8)

# A tibble: 8 x 6
# Groups:   line [2]
   line word         Total_Count    tf   idf tf_idf
  <int> <chr>              <int> <dbl> <dbl>  <dbl>
1     1 federer                1 0.167  2.20  0.366
2     1 player                 1 0.167  2.20  0.366
3     1 roger                  1 0.167  2.20  0.366
4     1 tennis                 1 0.167  1.50  0.251
5     1 time                   1 0.167  2.20  0.366
6     1 undoubtedly            1 0.167  2.20  0.366
7     2 championship           1 0.25   2.20  0.549
8     2 grand                  1 0.25   2.20  0.549

Let's take line 1 and go through the values of tf and idf for the word ‘federer’:
* TF: There are 6 words in line 1 and each appears only once. Hence the term frequency of each is 1/6, which is around 0.167.
* IDF: There are a total of 9 documents (rows of text data) and federer appears only in the first one, so ln(9/1) is around 2.20.
* TF-IDF: (1/6) x ln(9) gives around 0.366.

Similar inferences can be made for the other words as well.
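
The arithmetic above can be checked by hand in base R. The sketch below uses only numbers taken from the sample data: 9 lines, 6 cleaned tokens in line 1, federer present in one line and tennis present in two (lines 1 and 9). The variable names are illustrative.

```r
n_docs        <- 9   # rows of text in the sample data
words_in_line <- 6   # tokens left in line 1 after cleaning

# Number of lines each word appears in (tennis occurs in lines 1 and 9)
docs_with_word <- c(federer = 1, tennis = 2)

tf     <- 1 / words_in_line              # each word occurs once in line 1
idf    <- log(n_docs / docs_with_word)   # natural log, as in the formula
tf_idf <- tf * idf

print(round(data.frame(tf = tf, idf = idf, tf_idf = tf_idf), 3))
```

The tf_idf values round to 0.366 for federer and 0.251 for tennis, matching the bind_tf_idf output for line 1.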

Step 2.b: Calculating TF-IDF for unigrams, bigrams and trigrams

Here we will calculate TF-IDF scores individually for unigrams, bigrams and trigrams and then combine all three results together

# collapse = "line" pastes the tokens of each line back together before the
# n-grams are built; since tidytext 0.2.7 text is no longer collapsed by default
unigram.df<-token.df%>%
  unnest_tokens(features, word, token = "ngrams", n = 1, collapse = "line")%>%
  group_by(line,features)%>%
  summarise(Total_Count=n())%>%
  bind_tf_idf(features, line, Total_Count)


bigram.df<-token.df%>%
  unnest_tokens(features, word, token = "ngrams", n = 2, collapse = "line")%>%
  group_by(line,features)%>%
  summarise(Total_Count=n())%>%
  bind_tf_idf(features, line, Total_Count)

trigram.df<-token.df%>%
  unnest_tokens(features, word, token = "ngrams", n = 3, collapse = "line")%>%
  group_by(line,features)%>%
  summarise(Total_Count=n())%>%
  bind_tf_idf(features, line, Total_Count)

ngram.df<-rbind.data.frame(unigram.df,bigram.df,trigram.df)%>%
  arrange(desc(tf_idf))
  
head(ngram.df,20)

# A tibble: 20 x 6
# Groups:   line [7]
    line features                     Total_Count    tf   idf tf_idf
   <int> <chr>                              <int> <dbl> <dbl>  <dbl>
 1     5 care body                              1 1      2.20  2.20 
 2     8 leave achieve elusive                  1 1      2.20  2.20 
 3     3 win                                    1 1      1.50  1.50 
 4     5 body                                   1 0.5    2.20  1.10 
 5     5 care                                   1 0.5    2.20  1.10 
 6     8 achieve elusive                        1 0.5    2.20  1.10 
 7     8 leave achieve                          1 0.5    2.20  1.10 
 8     2 grand slam championship                1 0.5    2.20  1.10 
 9     2 numb grand slam                        1 0.5    2.20  1.10 
10     4 defintely remember longevity           1 0.5    2.20  1.10 
11     4 remember longevity career              1 0.5    2.20  1.10 
12     6 2017 win autralian                     1 0.5    2.20  1.10 
13     6 return 2017 win                        1 0.5    2.20  1.10 
14     9 gold tennis single                     1 0.5    2.20  1.10 
15     9 olympic gold tennis                    1 0.5    2.20  1.10 
16     8 achieve                                1 0.333  2.20  0.732
17     8 elusive                                1 0.333  2.20  0.732
18     8 leave                                  1 0.333  2.20  0.732
19     2 grand slam                             1 0.333  2.20  0.732
20     2 numb grand                             1 0.333  2.20  0.732

As you can see, the custom stop word legacy has been removed and does not appear in the features column.
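
To see where the bigram and trigram rows come from, the base R sketch below builds n-grams for line 8 by sliding a window over its cleaned tokens. The helper make_ngrams is a hypothetical illustration and not the tokenizer that tidytext uses internally.

```r
# Build word n-grams from a character vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))   # line too short for this n
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Cleaned tokens of line 8, as they appear in the output above
line8 <- c("leave", "achieve", "elusive")

bigrams  <- make_ngrams(line8, 2)
trigrams <- make_ngrams(line8, 3)

# Term frequency within the line: count of each n-gram / number of n-grams
tf_bigrams <- table(bigrams) / length(bigrams)

print(bigrams)
print(trigrams)
print(tf_bigrams)
```

Line 8 yields two bigrams with a term frequency of 0.5 each and a single trigram with a term frequency of 1, which is why the trigram ends up with the highest TF-IDF score. A line with a single token, such as line 3, produces no bigrams at all in this sketch.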

Step 3: Creating the DTM

Let's keep only the features with a TF-IDF value greater than 2

features<-ngram.df%>%
  ungroup()%>%
  filter(tf_idf > 2)%>%
  filter(!is.na(features))%>%
  select(features)

features

# A tibble: 2 x 1
  features             
  <chr>                
1 care body            
2 leave achieve elusive

Once the features have been shortlisted, we can go ahead and create the document term matrix, where each row represents a text record and each column represents one of the features identified. Essentially, we are converting unstructured text data into a structured format.

feature.df<-ngram.df%>%
  select(line,features,tf_idf)%>%
  inner_join(features,by="features")


head(feature.df)

# A tibble: 2 x 3
# Groups:   line [2]
   line features              tf_idf
  <int> <chr>                  <dbl>
1     5 care body               2.20
2     8 leave achieve elusive   2.20

You can see that in the process of mapping the text data to the features, we have lost rows 1, 2, 3, 4, 6, 7 and 9, because none of their features passed the filter. To avoid dropping these records, let's add a “dummy” feature for every line to the above data frame

feature.df<-ngram.df%>%
  select(line,features,tf_idf)%>%
  rbind.data.frame(data.frame(line=1:nrow(text_df),features="dummy",tf_idf=1))%>%
  inner_join(rbind.data.frame(features,"dummy"),by="features")

feature.df

# A tibble: 11 x 3
# Groups:   line [9]
    line features              tf_idf
   <int> <chr>                  <dbl>
 1     5 care body               2.20
 2     8 leave achieve elusive   2.20
 3     1 dummy                   1   
 4     2 dummy                   1   
 5     3 dummy                   1   
 6     4 dummy                   1   
 7     5 dummy                   1   
 8     6 dummy                   1   
 9     7 dummy                   1   
10     8 dummy                   1   
11     9 dummy                   1

Now we can use the spread function to convert the above data frame from long to wide format (that is, from unpivoted to pivoted)

DTM<-feature.df%>%
  spread(features,"tf_idf",fill=0)%>%
  select(-dummy)

DTM

# A tibble: 9 x 3
# Groups:   line [9]
   line `care body` `leave achieve elusive`
  <int>       <dbl>                   <dbl>
1     1        0                       0   
2     2        0                       0   
3     3        0                       0   
4     4        0                       0   
5     5        2.20                    0   
6     6        0                       0   
7     7        0                       0   
8     8        0                       2.20
9     9        0                       0
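
As a cross-check on the long-to-wide step, the same matrix can be sketched in base R with xtabs. This is an alternative illustration, not the approach used in this blog: long_df is a hand-typed copy of the two selected features, with tf_idf values rounded as printed above, and declaring all nine lines as factor levels plays the role of the dummy rows.

```r
# Long-format scores for the two selected features (rounded values from above)
long_df <- data.frame(
  line     = c(5, 8),
  features = c("care body", "leave achieve elusive"),
  tf_idf   = c(2.20, 2.20),
  stringsAsFactors = FALSE
)

# Declaring all nine lines as levels keeps the rows that have no feature
long_df$line <- factor(long_df$line, levels = 1:9)

# Cross-tabulate: rows = lines, columns = features, cells = summed tf_idf
dtm_base <- as.data.frame.matrix(xtabs(tf_idf ~ line + features, data = long_df))
print(dtm_base)
```

Every line keeps its row, and the cells for lines without a selected feature are filled with 0, just as with spread and fill=0.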

Final Comments

We saw how text data can be easily analysed using the TF-IDF metric by running through a simple example and understanding the calculation behind the metric. The next blog will focus on a use case around using a topic model to divide text data into meaningful topics.

Link to Previous R Blogs

https://www.aimlmadeeasy.com/2020/06/r-complete-guide.html

List of Datasets for Practise

https://hofmann.public.iastate.edu/data_in_r_sortable.html

https://vincentarelbundock.github.io/Rdatasets/datasets.html

Sunday, June 28, 2020

Book Review- The Race of my Life: An Autobiography

Saturday, June 20, 2020

Blog 28: Analysing text using TF-IDF