Tuesday, April 23, 2024

Web Scraping Tutorial 2 - Getting the Avg Rating and Reviews Count

Introduction

In the first tutorial, we looked at how we can use RSelenium to extract content from the web. We specifically looked at how to leverage the XPath of a web element (such as the store name) to scrape information from Google reviews. We used the following functions to extract data:

  • web_driver$navigate(l1)
  • web_driver$findElements
  • getElementAttribute("href")
  • web_driver$findElements(using = "xpath", value = nm)[[1]]$getElementText()
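Strung together, those calls follow a single pattern: navigate, find an element by XPath, then read an attribute or its text. A minimal recap sketch, assuming an active web_driver session, a stored link l1, and an XPath string nm (this is not runnable without a live browser session, and the variable names are just the ones used in the first tutorial):

```r
# Assumes web_driver is an active RSelenium client session,
# l1 is a URL collected earlier, and nm holds an XPath string
web_driver$navigate(l1)                                       # go to the stored link
elems <- web_driver$findElements(using = "xpath", value = nm) # locate matching elements
link  <- elems[[1]]$getElementAttribute("href")               # read the href attribute
text  <- elems[[1]]$getElementText()                          # read the visible text
```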

Moving on, in this blog we will see how to extract the average Google rating and the total number of reviews given for each store (from the previous examples).

Step 0: How would RSelenium do web scraping for these two stores

We will use the following steps to get the information:

  • Start a headless browser
  • Navigate to the Google Maps page (shown above)
  • Get the URL (link) of each of these stores
  • Navigate to each of these links
  • Get the XPath for the average rating and the number of reviews
  • For each of these XPaths, get the element sitting at that location

So as the first step, we will start a headless browser. Firefox works fine on my system, so I will go with the Firefox browser. Before this, let's import the required libraries.

package.name <- c("tidyverse", "RSelenium")

for (i in package.name) {
  # Install the package if it is not already available
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  # Load the package
  library(i, character.only = TRUE)
}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: RSelenium

Step 1: Start a headless Firefox browser

The syntax for initiating a headless Firefox browser is shown below

driver <- rsDriver(
  browser = "firefox",
  chromever = NULL,   # skip the Chrome driver; we only need Firefox
  verbose = FALSE,
  # geckodriver expects the headless flag under "moz:firefoxOptions"
  extraCapabilities = list("moz:firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]


Once I execute this, a Firefox browser session starts up in the background as shown below.


Step 2: Navigate to the web page

Once you see the above page, we now have to go to the “Jumbo wada pav mulund west” page in Google Maps. To do this, we will use the following lines of code:

nm <- "Jumbo wada pav mulund west"
ad_url <- str_c("https://www.google.co.id/maps/search/", nm)

# Now navigate to the URL. This tells the browser to go to that location
web_driver$navigate(ad_url)
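For reference, str_c here simply glues the base search URL and the store name into one string; base R's paste0() does the same thing. A quick sketch (with stray whitespace trimmed from the pieces):

```r
# Same concatenation as str_c(), using base R's paste0()
nm     <- "Jumbo wada pav mulund west"
ad_url <- paste0("https://www.google.co.id/maps/search/", nm)
print(ad_url)
```

The spaces inside the store name are fine: the browser URL-encodes them when it navigates.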


Once I execute the above, the Firefox browser navigates to the Jumbo Vada Pav page.
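Once the rating and review-count elements have been located with findElements, getElementText() hands back plain strings, so a final parsing step is needed. A minimal sketch, using hypothetical example strings of the kind Google Maps returns (the actual text must come from the live page):

```r
# Hypothetical strings of the kind getElementText() returns for a store page
rating_text  <- "4.6"
reviews_text <- "1,234 reviews"

# The rating converts directly; for the review count,
# strip every non-digit character before converting
avg_rating  <- as.numeric(rating_text)
num_reviews <- as.numeric(gsub("[^0-9]", "", reviews_text))
```

The gsub() call keeps only the digits, so thousands separators and the trailing word "reviews" both drop away before the conversion.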