Why doesn't Cheteshwar Pujara come in one down in a 50-over match? Why doesn't Ashwin bowl flighted deliveries in limited-overs matches? Despite being the 'Hitman' of ODIs, Rohit Sharma seldom makes the cut in a five-day contest. Are the skill sets required for a Test match different from those for a one-dayer or, for that matter, a T20? All of this can be answered with a simple phrase: horses for courses. You pick players that fit the bill, and every selection has to be made in accordance with the requirements. That's precise, isn't it? The same idea carries over to the choice of software for ML. This blog explores the scenarios that are more conducive to using Python rather than R, and vice versa.
Before we start, a little background on R and Python in the context of Machine Learning (ML) is necessary. Python is a general-purpose programming language. In the wake of recent advances in ML, the Python community has contributed several libraries that enable one to play around with data. R, on the other hand, is a statistical programming language, much like Matlab. It was developed to cater to the math community in the first place. There is a lot of debate on which language is best and which to prefer for ML. The exasperation is aptly shown in the image below.
The table below highlights the key differences between Python and R with respect to certain commonly used practices in ML. The entries in each cell indicate the library and/or function used to meet the requirement. A colored cell indicates the superiority of a given language over the other. In case of a tie, both cells are colored.
| Functionality | Python | R |
| --- | --- | --- |
| Data Slicing and Summary | pandas | dplyr, data.table |
| Visualization | matplotlib | ggplot2 |
| Data Set Repositories | NA | Econometric Data: AER library |
| Linear Models (Regression family) | scikit-learn | car, glm |
| Hyperparameter Tuning | GridSearchCV | makeLearner |
| Natural Language Processing (NLP) | NLTK, gensim | tidyverse, topicmodels |
| Web Scraping | Beautiful Soup | rvest |
| Interfacing with other Systems (like Outlook) | pywin32 | RDCOMClient |
| Read JSON | json | rjson |
| Pickling | pickle | saveRDS, readRDS |
| Web App (Especially for Proof of Concept) | Django | R Shiny |
Below is an explanation of the contents in the table:
- Data Slicing and Summary: Data filtering, sorting, summarization, etc. are required in every ML exercise. In R, one can do this using functions from the dplyr and data.table libraries. The pipe operator (%>%) from dplyr is especially useful, as it improves the readability of a cascaded operation and helps in debugging. Python, on the other hand, has pandas, which doesn't have a pipe operator, so cascaded operations on data can become hard to manage.
- Visualization: ggplot2 and associated libraries in R help to create highly useful plots such as histograms, geographical heat maps, and interactive and animated graphs. Python has the matplotlib library for creating graphs, but it doesn't provide the enhancements that ggplot2 does.
- Data Set Repositories: There are a lot of data repositories in R. Users can invoke these from several libraries, play around with the data, and gain understanding. Some useful repositories include the AER library, which has useful census data. Python, on the other hand, doesn't have any.
- Linear Models: R and Python both have libraries that help in applying regression models. However, there is one aspect where R stands out as a clear winner: the treatment of categorical variables. N-1 encoding (one dummy column per level, with one level dropped as the reference) is taken care of automatically in R, but in Python it is left to the discretion of the user.
- Hyperparameter Tuning: Both languages offer extraction of optimal parameters using hyperparameter tuning. However, in Python one can tune a larger number of parameters than in R. For instance, in R, for a Random Forest algorithm, one can only tune the number of trees, nodes, and leaf size, whereas in Python one can also tune the sample split parameter. Tuning more parameters can lead to better accuracy.
- Natural Language Processing (NLP): Both R and Python have libraries to handle text. A lot of users will vouch for Python here, but having used both, I didn't find any difference between the two.
- Web Scraping: Python has methods from the Beautiful Soup library to extract any element that has an HTML tag. Things are more clearly and precisely defined in Python. R, however, doesn't offer a one-stop solution for extraction. A lot of libraries with no clear examples leave much to soul-searching.
- Interfacing with other Systems (like Outlook): Since Python is a general-purpose programming language, its system integration is quite mature. One can use Python to communicate between two different systems, such as Outlook and a Python terminal. The protocols that govern such communication are already in place. R, on the other hand, doesn't have well-defined functions to do this.
- Read JSON: Python takes less time than R to read and process a JSON file. Also, since the text inside a JSON file resembles a dictionary, using Python to read and parse it makes a lot of sense.
- Pickling: This can be done in both Python and R.
- Web App (Especially for Proof of Concept): This can be done in both Python and R; however, the time to create an app in R is less.
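
To make the Read JSON and Pickling points above a little more concrete, here is a minimal Python sketch using only the standard library. The JSON payload and its field names (`model`, `params`, `n_trees`, `features`) are made up purely for illustration and are not taken from any real project. It only shows that a JSON object lands in Python as a plain dictionary and that the same object can be round-tripped through `pickle`, roughly the role `saveRDS`/`readRDS` play in R.

```python
import json
import pickle

# Hypothetical JSON text, invented for this example only.
raw = '{"model": "random_forest", "params": {"n_trees": 100}, "features": ["age", "income"]}'

# json.loads parses a JSON object into an ordinary Python dict,
# so nested values are reached with normal dictionary indexing.
record = json.loads(raw)
print(type(record).__name__)
print(record["params"]["n_trees"])

# pickle.dumps serializes the object to bytes; pickle.loads restores it.
blob = pickle.dumps(record)
restored = pickle.loads(blob)
print(restored == record)
```

In practice one would more likely read from a file with `json.load` and write the pickle to disk with `pickle.dump`; the in-memory versions are used here only to keep the sketch self-contained.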