Web Scraping using Rselenium
Parag Verma and Wen Long
2023-12-13
Basics of Web Scraping
Web Scraping is process of extracting useful information from a website or URL.This information can be in the form of text, tables, embedded links,ratings, etc. It is a very handy tool when one wants to supplement the existing information of country demographics, customer preferences, store location, etc.
How to scrape data from a website
Data can be scrapped in two ways:
- One from a static website(which doesnt change often).Examples of
this include wikipedia page, govt websites, Company e-site.
- Second from a dynamic website(such as google pages,shopee, etc).Here the content is masked through Java script and Jqeury and hence we cant use the html way(using specific tags) of extracting information
In this blog, we will look at how to scrape data for the second option.We will use what is known as headless browsing.A headless browser enables you to load a website without a GUI and all the actions are implemented using a command line interface. I wont go into the detail of it as it would murk the purpose of the blog. In R, we use Rselenium package that helps in headless browsing
Step 0: Importing the libraries
package.name<-c("tidyverse","RSelenium")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: RSelenium
Step 1: Extracting names and address of all Jumbo Vada Pav stores in Mulund West, Mumbai
Lets say we want to extract the names of all the stores and their location from Mulund West. Lets see what we get when we search this on google