Web Scraping using RSelenium
Parag Verma and Wen Long
2023-12-13
Basics of Web Scraping
Web scraping is the process of extracting useful information from a website or URL. This information can be in the form of text, tables, embedded links, ratings, etc. It is a very handy tool when one wants to supplement existing information on country demographics, customer preferences, store locations, etc.
How to scrape data from a website
Data can be scraped in two ways:
- From a static website (one that doesn't change often). Examples include Wikipedia pages, government websites, and company e-commerce sites. Here the content sits in plain HTML, so it can be read straight from the tags (see the short sketch after this list).
- From a dynamic website (such as Google pages, Shopee, etc.). Here the content is rendered through JavaScript and jQuery, so we can't use the plain-HTML way (reading specific tags) of extracting information.
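For the static case, a handful of lines is usually enough, because the content sits directly in the page's HTML tags. The sketch below uses the rvest package (not part of this post's setup; shown purely for contrast), with a Wikipedia page as an illustrative target.

# Static scraping sketch with rvest (illustrative only): tables on a
# Wikipedia page live in plain HTML tags, so no browser automation is needed
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/Mumbai")
tables <- page %>%
  html_elements("table") %>%   # grab every table tag on the page
  html_table()                 # parse each one into a tibble

length(tables)                 # number of tables found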
In this blog, we will look at how to scrape data in the second case. We will use what is known as headless browsing. A headless browser loads a website without a GUI, and all actions are driven through a command-line interface. I won't go into the details of how it works, as that would obscure the purpose of the blog. In R, the RSelenium package supports headless browsing.
Step 0: Importing the libraries
# Install (if needed) and load the required packages
package.name <- c("tidyverse", "RSelenium")
for(i in package.name){
  if(!require(i, character.only = TRUE)){
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: RSelenium
Step 1: Extracting the names and addresses of all Jumbo Vada Pav stores in Mulund West, Mumbai
Let's say we want to extract the names and locations of all the stores in Mulund West. Let's see what we get when we search for this on Google.
We can see from the image that there are two Jumbo Vada Pav stores in Mulund West, Mumbai. I have taken this example because the concept is easy to grasp with just two stores.
Step 2: How would RSelenium do web scraping for these two stores
RSelenium would perform the following basic steps (a compact sketch of the whole flow follows the list):
- Start a headless browser
- Navigate to the Google Maps page (shown above)
- Get the URL (link) of each of these stores
- Navigate to each of these links
- Get the XPath for the store name and address
- For each of the XPaths (names and addresses), get the element sitting at that location
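Stitched together, the flow looks roughly like the sketch below. Each call is developed in the steps that follow; the XPath string is a stand-in for the real path we copy later, and the search URL is abbreviated.

# Roadmap of the whole flow; each call is fleshed out in the sections below
driver <- rsDriver(browser = "firefox", chromever = NULL, verbose = FALSE)  # start a browser (headless options come in Step 3)
web_driver <- driver[["client"]]

web_driver$navigate("https://www.google.co.id/maps/search/Jumbo wada pav")  # go to the page

# "PLACEHOLDER_XPATH" stands in for the XPath copied from the browser inspector
elems <- web_driver$findElements(using = "xpath", value = "PLACEHOLDER_XPATH")
texts <- sapply(elems, function(x) x$getElementText()[[1]])                 # read each element

driver[["server"]]$stop()                                                   # shut everything down when done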
So, as the first step, we will start a headless browser. Firefox works fine on my system, so I will go with the Firefox browser.
Step 3: Start a headless Firefox browser
The syntax for initiating a headless Firefox browser is shown below.
driver <- rsDriver(
  browser = "firefox",
  chromever = NULL,   # we are not using Chrome, so no chromedriver is needed
  verbose = FALSE,
  # "moz:firefoxOptions" is the W3C capability geckodriver expects;
  # the "--headless" argument runs Firefox without a GUI
  extraCapabilities = list(
    "moz:firefoxOptions" = list(args = list("--headless"))
  )
)
web_driver <- driver[["client"]]   # the client object is what drives the browser
Once I execute this, a Firefox browser session starts up in the background, as shown below.
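Because headless mode shows no window, it can be reassuring to confirm the session is actually alive. A minimal check, assuming the standard remoteDriver methods:

# Ask the Selenium server for its status; the structure of the returned
# list varies by Selenium version, but getting a response means the
# session is up and reachable
web_driver$getStatus()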
Step 4: Navigate to the Google Maps page for Jumbo Vada Pav
We will now use the Firefox browser to navigate to the Google Maps page for Jumbo Vada Pav.
nm<-"Jumbo wada pav mulund west "
ad_url<-str_c("https://www.google.co.id/maps/search/ ",nm)
# Now navigate to the URL.This is for the browser to go to that location
web_driver$navigate(ad_url)
Once I execute the above, the Firefox browser goes to the Jumbo Vada Pav results page.
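Because Google Maps renders its results with JavaScript, it is safer to give the page a moment to finish loading before querying any elements. A fixed pause is the crudest option; the 5-second value here is an arbitrary choice, not something the page requires.

# Give the JavaScript-rendered results time to appear before scraping;
# tune the pause for your connection speed
Sys.sleep(5)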
Step 5: Get the URL (link) of each of these stores
In order to get the link, we have to right-click on the first store and click on Inspect.
Once you click on Inspect, you will be directed to the highlighted portion. The a tag indicates that it is a link.
Right-click on the highlighted portion and copy the XPath as shown below.
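The copied XPath can then be fed to findElements() to pull out the link itself. In the sketch below, store_xpath is a placeholder for whatever path you copied from the inspector, not the actual value:

# Substitute the XPath copied from the browser inspector here
store_xpath <- '//*[@id="placeholder"]/a'

# Find the matching a element(s) and extract the href attribute from each
link_elems <- web_driver$findElements(using = "xpath", value = store_xpath)
store_links <- sapply(link_elems, function(x) x$getElementAttribute("href")[[1]])

store_links   # one URL per store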