Thursday, May 23, 2024

Web Scraping Tutorial 3 - Getting Detailed Google Reviews, Star Rating and Time Stamp

Web Scraping Tutorial 3: Scrolling Down and Expanding Reviews by Pressing More

Introduction

In this tutorial, we will look at how we can use RSelenium to:

  • Scroll down the review page
  • Expand reviews by pressing More

We will also extract the review text along with the time stamp and star rating.

Step 0: Installing Libraries

package.name <- c("tidyverse", "RSelenium")

for (i in package.name) {
  # Install the package if it is not already available, then load it
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: RSelenium


Step 1: Start a headless Firefox browser

The syntax for initiating a headless Firefox browser is shown below.

driver <- rsDriver(
  browser = "firefox",
  chromever = NULL,
  verbose = FALSE,
  # geckodriver expects the headless flag under the "moz:firefoxOptions" capability
  extraCapabilities = list("moz:firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]


Once I execute this, Firefox browser would pop up in the background as shown below.
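Before moving on, note that when the scraping session is finished, the browser and the background Selenium server should be shut down. This is a standard RSelenium teardown, shown here for completeness:

```r
# Run at the end of the session: close the browser window,
# then stop the background Selenium server started by rsDriver()
web_driver$close()
driver$server$stop()
```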


Step 2: Navigate to the web page

Once you see the above page, we now have to search for “patanjali store in powai” in Google Maps. To do this, we will use the following lines of code:

nm <- "patanjali store in powai"
ad_url <- str_c("https://www.google.co.id/maps/search/", nm)

# Navigate the browser to the search URL
web_driver$navigate(ad_url)


Once I execute the above, the Firefox browser would go to the search results page for Patanjali stores in Powai.


Step 3: Getting the URL for each store

We can see that there are 4 stores here. We will get the information for, let's say, one of the stores to understand the process in more detail. Once we are familiar with the process, we can replicate it for the other stores as well.

For this, we will have to follow a two-step process:

  • Get the URL for each store
  • Once you have the URL, access the link and then get the review details

Get the XPath of the URL link through inspection. The XPaths for the first three stores look like the below:

  • /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[3]/div/a
  • /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[5]/div/a
  • /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[7]/div/a

The only difference between the three is the penultimate div element: for the first store it is div[3], for the second store div[5], and for the third div[7].


Now we will use this pattern to extract all the links. The penultimate div, which is the only difference between the store XPaths, is specified simply as div (instead of div[3], div[5] or div[7]). findElements then returns an element for every store, and for each element we extract the href attribute:

link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")

# Extract the href attribute of each matched element
l1 <- list()
for (i in seq_along(link_Store)) {

  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]

}
l1
[[1]]
[1] "https://www.google.co.id/maps/place/Anmol+Patanjali+Store/data=!4m7!3m6!1s0x3be7c9e659d3c0e7:0xcd3cb45a143317ea!8m2!3d19.1202659!4d72.8897626!16s%2Fg%2F11p5jw_gg8!19sChIJ58DTWebJ5zsR6hczFFq0PM0?authuser=0&hl=en&rclk=1"

[[2]]
[1] "https://www.google.co.id/maps/place/Patanjali+Chikitsalay+Powai/data=!4m7!3m6!1s0x3be7c7f214bc8cdf:0x445cf1e34b310805!8m2!3d19.1259382!4d72.9193655!16s%2Fg%2F1pwfbqzfr!19sChIJ34y8FPLH5zsRBQgxS-PxXEQ?authuser=0&hl=en&rclk=1"

[[3]]
[1] "https://www.google.co.id/maps/place/Patanjali+Powai/data=!4m7!3m6!1s0x3be7c7e3010cb5d5:0xa06d2c38a41eb003!8m2!3d19.1186185!4d72.9039256!16s%2Fg%2F11f_j2zbxn!19sChIJ1bUMAePH5zsRA7AepDgsbaA?authuser=0&hl=en&rclk=1"

[[4]]
[1] "https://www.google.co.id/maps/place/Patanjali+Chikitsalay/data=!4m7!3m6!1s0x3be7c7a9b2ec66a7:0xfcf4c5119bd05d3b!8m2!3d19.118694!4d72.903835!16s%2Fg%2F11ssjv5bc1!19sChIJp2bssqnH5zsRO13QmxHF9Pw?authuser=0&hl=en&rclk=1"

We can see that the URLs for the four stores are now stored in the list l1.
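As noted earlier, once the per-store process is familiar it can be replicated for every store. A minimal sketch of such a loop, where visit_store() is a hypothetical helper standing in for the per-store steps (open the reviews, scroll, expand, extract):

```r
# Visit each store URL in turn; visit_store() is a hypothetical placeholder
# for the per-store steps described in this tutorial
for (url in l1) {
  web_driver$navigate(url)
  Sys.sleep(2)           # give the page a moment to load
  # visit_store(web_driver)
}
```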


Step 4: Navigating to the Second Store

Let's take the second store for our illustration. It has an average rating of 3.3 and 21 reviews. Let's navigate to this store first.

web_driver$navigate(l1[[2]])

Once we are on Store 2, we have to do the following things:

  • Click the Reviews tab so that the reviews are visible
  • Scroll down all the way to the bottom so that all the reviews are loaded
  • For some reviews, a More button is present; we need to press it
  • Extract the element which has the info for each review
  • Extract the relevant info, such as text, rating and time stamp


Step 4a: Clicking the Reviews Tab

Review_button_xpath_store2 <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[3]/div/div/button[2]/div[2]/div[2]"

# Locate the Reviews tab and click it
Review_button_element_store2 <- web_driver$findElement(using = "xpath",
                                                       value = Review_button_xpath_store2)

Review_button_element_store2$clickElement()


Once we execute the above, we will be inside the review page as shown below



Step 4b: Scrolling Down the Page

In this step, we need to get the XPath of the scrollable pane.


This is a little tricky, as it might be difficult to identify the element through inspection: you will find multiple elements that look like the scroll bar. We have to proceed as shown in the screenshot below.



You just have to search for scroll in the search bar. You will get some options; go to the element which has event and scroll in it. Hack: I have found that for most Google reviews, the scrollable element is the same, so you can reuse the XPath we get here in other scenarios.

We also see that we might have to scroll down multiple times to load the entire review page. One scroll loads roughly ten reviews, so to cover all 21 reviews we will scroll down 4 times (one extra scroll as a buffer).

# Scroll to the end
scrollable_div <-
  try(web_driver$findElements(using = "xpath",
                              value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]"))

num_reviews <- 21
# Note the parentheses: 1:(ceiling(num_reviews / 10) + 1) gives 4 iterations;
# without them, 1:ceiling(...) + 1 would shift the sequence instead
for (i in 1:(ceiling(num_reviews / 10) + 1)) {

  # findElements returns a list, so it can be passed directly as the args list;
  # arguments[0] is then the scrollable element
  try(web_driver$executeScript("arguments[0].scrollTop = arguments[0].scrollHeight",
                               scrollable_div))
  Sys.sleep(0.5)
  print("Yes")
}
[1] "Yes"
[1] "Yes"
[1] "Yes"
[1] "Yes"


The executeScript call sets the element's scrollTop to its scrollHeight, which is the standard JavaScript way to jump a scrollable container to its bottom (I got this snippet from a Google search). Running it once per loop iteration, with a short pause in between, takes us progressively down the page.

After executing this, we will be at the bottom of the page as shown below
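The next task from the list above is expanding the truncated reviews by pressing More. A minimal sketch, assuming the expand buttons can be located through their aria-label (both the "See more" label and the selector used here are assumptions that should be verified through inspection on the live page):

```r
# Find every "More" button among the loaded reviews and click it so the
# full review text becomes visible; the aria-label selector is an assumption
more_buttons <- try(web_driver$findElements(using = "xpath",
                                            value = "//button[@aria-label='See more']"))
for (btn in more_buttons) {
  try(btn$clickElement())
  Sys.sleep(0.2)   # brief pause between clicks
}
```

Wrapping each click in try() keeps the loop going even if one button has gone stale after the page re-renders.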