Web Scraping Tutorial 2: Getting the Overall Rating and Number of Reviews
2024-04-23
Introduction
In the first tutorial, we looked at how to use RSelenium to extract content from the web. We specifically looked at how to leverage the XPath of a web element (such as a store name) to scrape information from Google reviews. We used the following functions to extract data:
- web_driver$navigate(l1)
- web_driver$findElements
- getElementAttribute("href")
- web_driver$findElements(using = "xpath", value = nm)[[1]]$getElementText()
Moving on, in this blog we will understand how to extract the average Google rating and the total number of reviews for each store (from the previous examples).
Step 0: How RSelenium would do web scraping for these two stores
We would use the following steps to get the information
- Start a headless browser
- Navigate to the Google Maps page (shown above)
- Get the URL (link) of each of these stores
- Navigate to each of these links
- Get the XPath for the store name and address
- For each of these XPaths (names and addresses), get the element sitting at that location
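The steps above map directly onto a handful of RSelenium calls. As a rough skeleton (commented out because the URLs and XPaths are placeholders to be filled in by inspection, as shown in the rest of this post):

```r
# 1. Start a headless browser (client + server)
# driver <- rsDriver(browser = "firefox", ...)
# web_driver <- driver[["client"]]

# 2. Navigate to the Google Maps page
# web_driver$navigate(maps_url)

# 3-4. Collect the store links and extract the href of each one
# stores <- web_driver$findElements(using = "xpath", value = links_xpath)
# urls   <- sapply(stores, function(s) s$getElementAttribute("href")[[1]])

# 5-6. On each store page, read out the elements of interest
# web_driver$navigate(urls[1])
# web_driver$findElements(using = "xpath", value = name_xpath)[[1]]$getElementText()[[1]]
```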
So as the first step, we will start a headless browser. Firefox works fine on my system, so I will go with the Firefox browser. Before this, let's import the required libraries.
package.name <- c("tidyverse", "RSelenium")
for(i in package.name){
  if(!require(i, character.only = T)){
    install.packages(i)
  }
  library(i, character.only = T)
}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: RSelenium
Step 1: Start a headless Firefox browser
The syntax for initiating a headless Firefox browser is shown below
driver <- rsDriver(
browser = c("firefox"),
chromever = NULL,
verbose = F,
extraCapabilities = list("firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]
Once I execute this, a Firefox browser session starts in the background, as shown below.
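When a scraping session is finished, it is good practice to close the client and stop the Selenium server so no orphan processes are left behind. A standard RSelenium cleanup, using the `driver` object created above:

```r
# Close the browser window, then stop the Selenium server process
web_driver$close()
driver[["server"]]$stop()
```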
Step 3: Getting the URL for each store
We can see that there are just two stores here. We will get the store name and corresponding address in a data frame.
For this, we will have to follow a two-step process:
- Get the URL for each store
- Once you get the URL, access the URL link and then get the name and address
Get the XML path of the URL link through inspection. The XML paths for the two stores would look like the below:
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[4]/div/a
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[6]/div/a
The difference between the two is only with respect to the penultimate div element. For the first store it is div[4] and for the second store it is div[6].
Now we will use this information to extract all the links. For each of these XML paths, we need to get the href (URL).
The penultimate div, which is the only difference between the store 1 and store 2 XML paths, will be specified as div (instead of div[4] or div[6]) so that one query matches both stores. Then, for each of the matched elements, we will extract the href using getElementAttribute.
link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")
l1 <- list()
for(i in seq_along(link_Store)){
  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]
}
l1
[[1]]
[1] "https://www.google.co.id/maps/place/JUMBO+KING+MULUND/data=!4m7!3m6!1s0x3be7b9f243892907:0xc1c58fde55a52ab8!8m2!3d19.1716851!4d72.9552551!16s%2Fg%2F11c6lk8nbl!19sChIJBymJQ_K55zsRuCqlVd6PxcE?authuser=0&hl=en&rclk=1"
[[2]]
[1] "https://www.google.co.id/maps/place/Jumbo+Vada+Pav/data=!4m7!3m6!1s0x3be7b9ee2763f83f:0x6a56910364c6346b!8m2!3d19.1722852!4d72.9559067!16s%2Fg%2F11t10_mhq2!19sChIJP_hjJ-655zsRazTGZAORVmo?authuser=0&hl=en&rclk=1"
We can see that the URLs for the two stores are now stored in the l1 list. We will use these links to navigate to each individual store page and then extract the store name and address.
Step 4: Getting the store name and address
Now we will navigate to each of these store links, get the XML path for the store name and address, and extract the corresponding elements.
Step 4a: Getting store name and address for store 1
web_driver$navigate(l1[[1]])
# The XML path where the store name is located; it is the same for both stores
nm1_name<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"
# Getting the Store Name
store_nm1 <- web_driver$findElements(using = "xpath", value = nm1_name)[[1]]$getElementText()[[1]]
# The XML path where the store address is located
nm1_add<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[9]/div[3]/button/div/div[2]/div[1]"
# Getting the Store address
store_add1 <- web_driver$findElements(using = "xpath", value = nm1_add)[[1]]$getElementText()[[1]]
store1.df<-data.frame(Store_Name=store_nm1,
Store_Address=store_add1)
store1.df
Store_Name
1 JUMBO KING MULUND
Store_Address
1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
Step 4b: Getting store name and address for store 2
web_driver$navigate(l1[[2]])
# The XML path where the store name is located; it is the same for both stores
nm2_name<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"
# Getting the Store Name
store_nm2 <- web_driver$findElements(using = "xpath", value = nm2_name)[[1]]$getElementText()[[1]]
# The XML path where the store address is located
nm2_add<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[7]/div[3]/button/div/div[2]/div[1]"
# Getting the Store address
store_add2 <- web_driver$findElements(using = "xpath", value = nm2_add)[[1]]$getElementText()[[1]]
store2.df<-data.frame(Store_Name=store_nm2,
Store_Address=store_add2)
store2.df
Store_Name
1 Jumbo Vada Pav
Store_Address
1 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
Let's combine the two data frames.
interim.df<-rbind.data.frame(store1.df,store2.df)
interim.df
Store_Name
1 JUMBO KING MULUND
2 Jumbo Vada Pav
Store_Address
1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
2 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
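Since the name XPath is identical on both store pages and only the address XPath differs, the per-store steps above can be collapsed into a single loop over l1. A sketch, assuming the two address XPaths collected above remain valid (the Sys.sleep pause is my addition, to give each page time to load):

```r
# XPath for the store name (same on both store pages)
name_xpath <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"

# Address XPaths differ per store, so keep them in a vector aligned with l1
addr_xpaths <- c(
  "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[9]/div[3]/button/div/div[2]/div[1]",
  "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[7]/div[3]/button/div/div[2]/div[1]"
)

interim.df <- data.frame()
for(i in seq_along(l1)){
  web_driver$navigate(l1[[i]])
  Sys.sleep(5)  # give the page time to render before querying elements
  nm <- web_driver$findElements(using = "xpath", value = name_xpath)[[1]]$getElementText()[[1]]
  ad <- web_driver$findElements(using = "xpath", value = addr_xpaths[i])[[1]]$getElementText()[[1]]
  interim.df <- rbind.data.frame(interim.df, data.frame(Store_Name = nm, Store_Address = ad))
}
interim.df
```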
Step 5: Getting the logic for the overall average rating
For this, we will first use web_driver$navigate(l1[[1]]) to go to the URL for store 1. It would look something like this:
Now we will highlight the 3.6 rating, inspect the element, and get the corresponding XML path.
The XML path for rating is: /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]
Similarly, we can get the XML path for the total number of respondents.
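Putting this together, the rating can be extracted the same way as the name and address. A sketch, using the rating XPath inspected above; the respondents part is left commented out because its XPath has not been inspected yet, and the count usually comes back as text like "(1,234)", whose punctuation must be stripped before conversion:

```r
# XPath for the average rating (from the inspection above)
rating_xpath <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]"

web_driver$navigate(l1[[1]])
rating_txt <- web_driver$findElements(using = "xpath", value = rating_xpath)[[1]]$getElementText()[[1]]
avg_rating <- as.numeric(rating_txt)

# Once the respondents XPath is obtained through the same inspection process:
# reviews_txt <- web_driver$findElements(using = "xpath", value = reviews_xpath)[[1]]$getElementText()[[1]]
# n_reviews   <- as.numeric(gsub("[^0-9]", "", reviews_txt))  # drop "(", ")", ","
```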