Wednesday, December 13, 2023

Web Scraping using RSelenium - Tutorial 1

Basics of Web Scraping

Web scraping is the process of extracting useful information from a website or URL. This information can be in the form of text, tables, embedded links, ratings, etc. It is a very handy tool when one wants to supplement existing information on country demographics, customer preferences, store locations, etc.


How to scrape data from a website

Data can be scraped from two kinds of websites:

  • Static websites, whose content is served directly in the HTML and doesn't change often. Examples include Wikipedia pages, government websites, and company websites.
  • Dynamic websites (such as Google pages, Shopee, etc.). Here the content is rendered through JavaScript and jQuery, so we can't extract the information the HTML way (by targeting specific tags in the page source).
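To make the difference concrete, here is a minimal sketch in base R. Both page sources are made-up strings (not taken from any real site), and extract_h1 is a hypothetical helper written only for this illustration. A regular expression is used to keep the example dependency-free; for real pages a proper HTML parser would be the safer choice.

```r
# Hypothetical page sources, invented for illustration only.
static_html  <- "<html><body><h1 class='store'>Store A</h1><h1 class='store'>Store B</h1></body></html>"
dynamic_html <- "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

# Hypothetical helper: return the text inside every <h1>...</h1> pair.
extract_h1 <- function(html) {
  # Find all <h1 ...>text</h1> fragments in the source string
  matches <- regmatches(html, gregexpr("<h1[^>]*>[^<]*</h1>", html))[[1]]
  # Strip the tags and keep only the text
  gsub("<[^>]+>", "", matches)
}

static_result  <- extract_h1(static_html)   # store names are present in the raw source
dynamic_result <- extract_h1(dynamic_html)  # empty: the content only exists after JavaScript runs
```

The static source already contains the names, so tag-based extraction works. The dynamic source contains only an empty container and a script reference, which is why a real browser is needed to render the page first.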

In this blog, we will look at how to scrape data for the second option. We will use what is known as headless browsing. A headless browser lets you load a website without a GUI, with all the actions driven from code or a command line interface. I won't go into the details here, as that would obscure the purpose of the blog. In R, we use the RSelenium package, which supports headless browsing.
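As a rough idea of what headless browsing looks like with RSelenium, the sketch below builds the capabilities that ask Firefox to start without a window and wraps the launch in a function. This is an assumption-laden sketch, not necessarily the setup used later in this tutorial: it assumes Firefox is installed and that RSelenium can obtain a matching geckodriver, and start_headless_firefox is a hypothetical helper name, not part of the package.

```r
# Capabilities asking Firefox to run without a visible window.
# "moz:firefoxOptions" is the capability key read by geckodriver;
# "--headless" is the Firefox command-line flag for headless mode.
headless_caps <- list(
  "moz:firefoxOptions" = list(args = list("--headless"))
)

# Hypothetical helper (not part of RSelenium): start a Selenium server
# plus a headless Firefox session. Assumes Firefox is installed locally.
start_headless_firefox <- function(port = 4567L) {
  RSelenium::rsDriver(
    browser = "firefox",
    port = port,
    verbose = FALSE,
    extraCapabilities = headless_caps
  )
}
```

A typical session would then be: driver <- start_headless_firefox(), followed by remDr <- driver$client, remDr$navigate("https://www.google.com") and remDr$getTitle(). When finished, remDr$close() and driver$server$stop() shut down the browser and the server.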

Step 0: Importing the libraries

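# Packages needed for this tutorial
package.name <- c("tidyverse", "RSelenium")

for (i in package.name) {
  # Install the package only if it is not already available
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  # Attach the package (a no-op if require() already attached it)
  library(i, character.only = TRUE)
}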
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: RSelenium


Step 1: Extracting names and addresses of all Jumbo Vada Pav stores in Mulund West, Mumbai

Let's say we want to extract the names of all the stores in Mulund West along with their locations. Let's see what we get when we search for this on Google.