Web Scraping Tutorial 2: Getting the Overall Rating and Number of Reviews
2024-04-23
Introduction
In the first tutorial, we looked at how to use RSelenium to extract content from the web. We specifically looked at how to leverage the XPath of a web element (such as a store name) to scrape information from Google reviews. We used the following functions to extract data:
- web_driver$navigate(l1)
- web_driver$findElements
- getElementAttribute("href")
- web_driver$findElements(using = "xpath", value = nm)[[1]]$getElementText()
Moving on, in this blog we will understand how to extract the average Google rating and the total number of reviews for each store (from the previous examples).
Step 0: How RSelenium would do web scraping for these two stores
We would use the following steps to get the information
- Start a headless browser
- Navigate to the Google Maps page (shown above)
- Get the URL (link) of each of these stores
- Navigate to each of these links
- Get the XPath for the store name and address
- For each of these XPaths (names and addresses), get the element sitting at that location
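The steps above map directly onto a handful of RSelenium calls. As a rough skeleton (commented out because the URLs and XPaths are placeholders to be filled in by inspection, as shown in the rest of this post):

```r
# 1. Start a headless browser (client + server)
# driver <- rsDriver(browser = "firefox", ...)
# web_driver <- driver[["client"]]

# 2. Navigate to the Google Maps page
# web_driver$navigate(maps_url)

# 3-4. Collect the store links and extract the href of each one
# stores <- web_driver$findElements(using = "xpath", value = links_xpath)
# urls   <- sapply(stores, function(s) s$getElementAttribute("href")[[1]])

# 5-6. On each store page, read out the elements of interest
# web_driver$navigate(urls[1])
# web_driver$findElements(using = "xpath", value = name_xpath)[[1]]$getElementText()[[1]]
```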
So as the first step, we will start a headless browser. Firefox works fine on my system, so I will go with the Firefox browser. Before this, let's import the required libraries.
package.name <- c("tidyverse", "RSelenium")
for(i in package.name){
  if(!require(i, character.only = T)){
    install.packages(i)
  }
  library(i, character.only = T)
}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: RSelenium
Step 1: Start a headless Firefox browser
The syntax for initiating a headless Firefox browser is shown below
driver <- rsDriver(
browser = c("firefox"),
chromever = NULL,
verbose = F,
extraCapabilities = list("firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]
Once I execute this, a Firefox browser session starts in the background, as shown below.
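When a scraping session is finished, it is good practice to close the client and stop the Selenium server so no orphan processes are left behind. A standard RSelenium cleanup, using the `driver` object created above:

```r
# Close the browser window, then stop the Selenium server process
web_driver$close()
driver[["server"]]$stop()
```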
Step 3: Getting the URL for each store
We can see that there are just two stores here. We will get the store name and corresponding address in a data frame.
For this, we will have to follow a two-step process:
- Get the URL for each store
- Once you get the URL, access the URL link and then get the name and address
Get the XML path of the URL link through inspection. The XML paths for the two stores would look like the below:
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[4]/div/a
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[6]/div/a
The difference between the two is only with respect to the penultimate div element. For the first store it is div[4] and for the second store it is div[6].
Now we will use this information to extract all the links. For each of these XML paths, we need to get the href (URL).
The penultimate div, which is the only difference between the store 1 and store 2 XML paths, will be specified as div (instead of div[4] or div[6]) so that one query matches both stores. Then, for each of the matched elements, we will extract the href using getElementAttribute.
link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")
l1 <- list()
for(i in seq_along(link_Store)){
  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]
}
l1
[[1]]
[1] "https://www.google.co.id/maps/place/JUMBO+KING+MULUND/data=!4m7!3m6!1s0x3be7b9f243892907:0xc1c58fde55a52ab8!8m2!3d19.1716851!4d72.9552551!16s%2Fg%2F11c6lk8nbl!19sChIJBymJQ_K55zsRuCqlVd6PxcE?authuser=0&hl=en&rclk=1"
[[2]]
[1] "https://www.google.co.id/maps/place/Jumbo+Vada+Pav/data=!4m7!3m6!1s0x3be7b9ee2763f83f:0x6a56910364c6346b!8m2!3d19.1722852!4d72.9559067!16s%2Fg%2F11t10_mhq2!19sChIJP_hjJ-655zsRazTGZAORVmo?authuser=0&hl=en&rclk=1"
We can see that the URLs for the two stores are now stored in the l1 list. We will use these links to navigate to each individual store page and then extract the store name and address.
Step 4: Getting the store name and address
Now we will navigate to each of these store links, get the XML path for the store name and address, and extract the corresponding elements.
Step 4a: Getting store name and address for store 1
web_driver$navigate(l1[[1]])
# The XML path where the store name is located; it is the same for both stores
nm1_name<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"
# Getting the Store Name
store_nm1 <- web_driver$findElements(using = "xpath", value = nm1_name)[[1]]$getElementText()[[1]]
# The XML path where the store address is located
nm1_add<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[9]/div[3]/button/div/div[2]/div[1]"
# Getting the Store address
store_add1 <- web_driver$findElements(using = "xpath", value = nm1_add)[[1]]$getElementText()[[1]]
store1.df<-data.frame(Store_Name=store_nm1,
Store_Address=store_add1)
store1.df
Store_Name
1 JUMBO KING MULUND
Store_Address
1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
Step 4b: Getting store name and address for store 2
web_driver$navigate(l1[[2]])
# The XML path where the store name is located; it is the same for both stores
nm2_name<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"
# Getting the Store Name
store_nm2 <- web_driver$findElements(using = "xpath", value = nm2_name)[[1]]$getElementText()[[1]]
# The XML path where the store address is located
nm2_add<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[7]/div[3]/button/div/div[2]/div[1]"
# Getting the Store address
store_add2 <- web_driver$findElements(using = "xpath", value = nm2_add)[[1]]$getElementText()[[1]]
store2.df<-data.frame(Store_Name=store_nm2,
Store_Address=store_add2)
store2.df
Store_Name
1 Jumbo Vada Pav
Store_Address
1 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
Let's combine the two data frames.
interim.df<-rbind.data.frame(store1.df,store2.df)
interim.df
Store_Name
1 JUMBO KING MULUND
2 Jumbo Vada Pav
Store_Address
1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
2 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
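Since the name XPath is identical on both store pages and only the address XPath differs, the per-store steps above can be collapsed into a single loop over l1. A sketch, assuming the two address XPaths collected above remain valid (the Sys.sleep pause is my addition, to give each page time to load):

```r
# XPath for the store name (same on both store pages)
name_xpath <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"

# Address XPaths differ per store, so keep them in a vector aligned with l1
addr_xpaths <- c(
  "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[9]/div[3]/button/div/div[2]/div[1]",
  "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[7]/div[3]/button/div/div[2]/div[1]"
)

interim.df <- data.frame()
for(i in seq_along(l1)){
  web_driver$navigate(l1[[i]])
  Sys.sleep(5)  # give the page time to render before querying elements
  nm <- web_driver$findElements(using = "xpath", value = name_xpath)[[1]]$getElementText()[[1]]
  ad <- web_driver$findElements(using = "xpath", value = addr_xpaths[i])[[1]]$getElementText()[[1]]
  interim.df <- rbind.data.frame(interim.df, data.frame(Store_Name = nm, Store_Address = ad))
}
interim.df
```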
Step 5: Getting the logic for the overall average rating
For this, we will first use web_driver$navigate(l1[[1]]) to go to the URL for store 1. It would look something like this:
Now we will highlight the 3.6 rating, inspect the element, and get the corresponding XML path.
The XML path for rating is: /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]
Similarly, we can get the XML path for the total number of respondents.
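Putting this together, the rating can be extracted the same way as the name and address. A sketch, using the rating XPath inspected above; the respondents part is left commented out because its XPath has not been inspected yet, and the count usually comes back as text like "(1,234)", whose punctuation must be stripped before conversion:

```r
# XPath for the average rating (from the inspection above)
rating_xpath <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]"

web_driver$navigate(l1[[1]])
rating_txt <- web_driver$findElements(using = "xpath", value = rating_xpath)[[1]]$getElementText()[[1]]
avg_rating <- as.numeric(rating_txt)

# Once the respondents XPath is obtained through the same inspection process:
# reviews_txt <- web_driver$findElements(using = "xpath", value = reviews_xpath)[[1]]$getElementText()[[1]]
# n_reviews   <- as.numeric(gsub("[^0-9]", "", reviews_txt))  # drop "(", ")", ","
```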