Web Scraping Tutorial 3: Scrolling Down and Expanding Reviews by Pressing More
2024-05-23
Introduction
In this tutorial, we will look at how we can use RSelenium to:
- Scroll down the review page
- Expand reviews by pressing More
We will also extract the review text along with its timestamp and the rating given. A rough sketch of the scrolling and expanding operations is shown below.
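Here is a minimal sketch of what these two operations look like in RSelenium, assuming the web_driver client we create in Step 1. The CSS selector for the More buttons is a placeholder, not the actual selector on the review page; we will locate the real elements later in the tutorial.
# Scroll the page down by 1000 pixels via JavaScript
web_driver$executeScript("window.scrollBy(0, 1000);")
# Find every "More" button and click it to expand the full review text
# (the CSS selector "button.more" is a placeholder for the real one)
more_buttons <- web_driver$findElements(using = "css selector", value = "button.more")
for(btn in more_buttons){
  btn$clickElement()
}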
Step 0: Installing Libraries
# Install (if missing) and load the required packages
package.name <- c("tidyverse", "RSelenium")
for(i in package.name){
  if(!require(i, character.only = T)){
    install.packages(i)
  }
  library(i, character.only = T)
}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: RSelenium
Step 1: Start a headless Firefox browser
The syntax for initiating a headless Firefox browser is shown below:
driver <- rsDriver(
  browser = "firefox",
  chromever = NULL,
  verbose = F,
  # "moz:firefoxOptions" is the capability name geckodriver expects
  extraCapabilities = list("moz:firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]
Once we execute this, a headless Firefox session starts running in the background.
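Note that before we can look for stores, the browser has to be on the Google Maps search results page. A minimal sketch of that navigation, where the search query is an assumption (use whatever query produced your results):
# Navigate to the Google Maps search results page
# (the search URL below is an assumption, not the tutorial's exact query)
web_driver$navigate("https://www.google.co.id/maps/search/patanjali+store+powai")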
Step 3: Getting the URL for each store
We can see that there are four stores here. We will get the information for, let's say, one of the stores to understand the process in more detail. Once we are familiar with the process, we can replicate it for the other stores as well.
For this, we will have to follow a two-step process:
- Get the URL for each store
- Once you get the URL, access the URL link and then get the name and address (a rough sketch of this second step is given after this list)
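For the second step, the idea is simply to navigate to a store's URL and read the text of the elements holding the name and address. A minimal sketch, where store_url is one of the links we collect below and both XPaths are placeholders to be replaced with the ones you find through Inspection:
# Open one store's page (store_url is one of the links collected below)
web_driver$navigate(store_url)
# Read the store name and address; both XPaths are placeholders,
# not the actual paths on the Google Maps page
store_name <- web_driver$findElement(using = "xpath", value = "//h1")$getElementText()[[1]]
store_address <- web_driver$findElement(using = "xpath", value = "//div[@class='address']")$getElementText()[[1]]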
Get the XPath of the URL link through Inspection. The XPaths for the first three stores look like the below:
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[3]/div/a
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[5]/div/a
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[7]/div/a
The difference between the three is only with respect to the penultimate div element: for the first store it is div[3], for the second store it is div[5], and for the third it is div[7].
Now we will use this information to extract all the links. For each of these XPaths, we need to get the href (URL).
The penultimate div, which is the only difference between the store XPaths, will be specified simply as div (instead of div[3], div[5], or div[7]), so that a single XPath matches all the store links at once. For each of the matched elements, we will then extract the href using getElementAttribute():
# Match all store links with one XPath by leaving the penultimate div unindexed
link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")
# Extract the href attribute (the store URL) from each matched element
l1 <- list()
for(i in 1:length(link_Store)){
  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]
}
l1
[[1]]
[1] "https://www.google.co.id/maps/place/Anmol+Patanjali+Store/data=!4m7!3m6!1s0x3be7c9e659d3c0e7:0xcd3cb45a143317ea!8m2!3d19.1202659!4d72.8897626!16s%2Fg%2F11p5jw_gg8!19sChIJ58DTWebJ5zsR6hczFFq0PM0?authuser=0&hl=en&rclk=1"
[[2]]
[1] "https://www.google.co.id/maps/place/Patanjali+Chikitsalay+Powai/data=!4m7!3m6!1s0x3be7c7f214bc8cdf:0x445cf1e34b310805!8m2!3d19.1259382!4d72.9193655!16s%2Fg%2F1pwfbqzfr!19sChIJ34y8FPLH5zsRBQgxS-PxXEQ?authuser=0&hl=en&rclk=1"
[[3]]
[1] "https://www.google.co.id/maps/place/Patanjali+Powai/data=!4m7!3m6!1s0x3be7c7e3010cb5d5:0xa06d2c38a41eb003!8m2!3d19.1186185!4d72.9039256!16s%2Fg%2F11f_j2zbxn!19sChIJ1bUMAePH5zsRA7AepDgsbaA?authuser=0&hl=en&rclk=1"
[[4]]
[1] "https://www.google.co.id/maps/place/Patanjali+Chikitsalay/data=!4m7!3m6!1s0x3be7c7a9b2ec66a7:0xfcf4c5119bd05d3b!8m2!3d19.118694!4d72.903835!16s%2Fg%2F11ssjv5bc1!19sChIJp2bssqnH5zsRO13QmxHF9Pw?authuser=0&hl=en&rclk=1"
We can see that the URLs for the four stores are now stored in the l1 list.
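Since each element of l1 holds a single string, it is convenient to flatten the list into a character vector before looping over the stores:
# Flatten the list of URLs into a character vector for easy iteration
store_urls <- unlist(l1)
store_urls[1]  # the first store's URL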