Wednesday, October 24, 2018

Unsupervised Vs Supervised: Battle of ML

Batman vs Superman: who's better? DC fans the world over break their heads over it. While one is a demigod, the other is a master strategist. In many comic and movie adaptations, Batman has beaten the Man of Steel hands down. Despite the Man of Steel's Kryptonian powers and Herculean machismo, the bat vigilante never misses a beat to pip him in all departments of combat. This is because adventure stories demand romance at the expense of powerful characters. Depicting someone overcoming hurdles and going the distance brings hope into an otherwise despairing world. There is something beautiful about how an underdog finds his way up the world order. It is indeed poetic to find something so powerless yet so powerful.

This comparison raises an important point: all confrontations are unfair. Even though the Man of Steel and the Dark Knight are part of the same cinematic universe and work side by side in Gotham, comparing them is grossly unjustified. In the context of defending the city against goons, Batman's solutions are practical and seem to work. Superman may be brute power and adrenaline, but his fights are mostly against extraterrestrials or his own kind. This makes his powers and methods irrelevant for run-of-the-mill problems. Even though the above is modeled around comic book characters, the same lesson holds when deciding which Machine Learning method is most relevant and sought after. In the course of this discussion, we will look at how Unsupervised ML fares against Supervised ML and what problems a typical Data Scientist faces at the workplace.



Course curricula the world over teach us to tackle structured problems. But problems are seldom structured in nature. That's why there is a huge gap between academia and industry. A person needs to acquire an altogether different set of skills to survive in an industry setting. Especially in a field like Data Science, which is continuously evolving, a person needs the skill to identify THE PROBLEM and convert it into a structured one. Most of us are given a very open-ended problem where the stakeholder wants to do something with the data. This has been depicted below in the form of a caricature.


In these situations, Supervised Machine Learning algorithms seldom help, because applying them requires a precisely defined problem with a set of independent features affecting one or more given targets. For almost all such problems, Unsupervised ML methods provide some respite. These algorithms are based on identifying patterns, customer segments, dimension reduction, association rules, etc. This ease of application, however, also brings in subjectivity. There are very few diagnostic measures that can be used to ascertain the effectiveness of an Unsupervised ML method. In the absence of metrics like accuracy, MAPE, p-value, etc., the onus of ensuring that the method delivers the desired results lies with the Data Scientist.
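One of the few diagnostics that does exist for clustering is the silhouette coefficient, which compares how close a point is to its own cluster against how close it is to the nearest other cluster. Below is a minimal sketch in plain Python. The function name `silhouette_score`, the toy customer points and both labelings are made up purely for illustration and do not come from any real project.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient (range -1 to 1).

    Assumes at least two distinct cluster labels and no exact duplicate points.
    """
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:
            # p is alone in its cluster; its silhouette is taken as 0.
            scores.append(0.0)
            continue
        a = mean(by_cluster[own])  # cohesion: mean distance inside own cluster
        b = min(mean(d) for lab, d in by_cluster.items() if lab != own)  # separation
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Hypothetical customers: (annual spend, visits per month), already scaled.
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (8.5, 9.5), (7.8, 8.7)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the two visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # arbitrary assignment

good_score = silhouette_score(points, good_labels)
bad_score = silhouette_score(points, bad_labels)
print(round(good_score, 2), round(bad_score, 2))
```

A score close to 1 suggests compact, well separated segments, while a score near or below 0 suggests the segments overlap. Even so, whether the segments make business sense is still a call the Data Scientist has to make.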

I am listing some of the widely used Unsupervised ML methods along with the relevant industry and use case.

| Algorithm/Method | Domain | Use Case |
|---|---|---|
| Clustering | Marketing | Identify natural groups within customers to customize marketing campaigns |
| Principal Component Analysis (PCA) | Marketing | Generally used on survey data to reduce the number of variables |
| Market Basket Analysis | Retail | Identify product based rules and associations between items |
| Multi Collaborative Filtering | Retail | Identify product based rules and associations between items, and the impact of demographics |
| Topic Modelling | Sales | Identify the category into which a particular purchase falls, based on the description of the item |
| Density Based Methods | Finance/HR | Identify fraudulent expense reports submitted by an employee |
| Histogram based Outlier Scoring (HBOS) | Finance | Identify anomalous transactions |
| KNN | Retail | Recommend similar items based on user profile |
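To make one row of the table concrete, here is a rough sketch of the three numbers usually quoted for a Market Basket rule: support, confidence and lift. The helper `rule_metrics` and the five baskets are invented for illustration, and the sketch assumes the antecedent and the consequent each appear in at least one basket.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)           # baskets containing the antecedent
    count_c = sum(1 for t in transactions if c <= t)           # baskets containing the consequent
    count_both = sum(1 for t in transactions if (a | c) <= t)  # baskets containing both
    support = count_both / n
    confidence = count_both / count_a
    lift = confidence / (count_c / n)
    return support, confidence, lift


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

support, confidence, lift = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence, lift)
```

For these baskets the rule "bread -> butter" has a support of 0.6, a confidence of 0.75 and a lift of 1.25. A lift above 1 hints that the two items are bought together more often than chance alone would suggest.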

Monday, October 15, 2018

Machine Learning and AI: A Big Tech Bubble


Geoffrey Chaucer said "All good things must come to an end". The euphoria associated with the nice things in life is normally followed by a lull. Everyone experiences sapped energy levels and a lack of any activity whatsoever. Overall, it is marked by clogged and cluttered opinions about oneself and one's surroundings. Could this ebb be generalized to other facets of life? Can we identify cycles of crest and trough in all phenomena? In recent years, Machine Learning has gained a lot of traction and has been touted as 'The Next Big Thing'. Through the course of this blog we will look at some famous tech bubbles and see whether Machine Learning fits the bill.

A Tech Bubble is characterized by an unrealistic increase in the perceived value of something. It is normally speculative in nature, with companies following a herd mentality and trying to jump on the bandwagon. There is a collective belief in a 'Utopia' where all will gain and prosper and the frenzy will be perennial. New concepts evolve to justify the actions while fundamentals take a backseat. Market corrections eventually catch up, as very little materializes from the stack of forlorn promises.


                                         As real as it gets

Let us now look at Machine Learning as a tech bubble and draw some similarities with the previous ones.

Dot Com bubble: It started in 1995 with a surge in the stock prices of a lot of internet based companies such as boo.com, Pets.com, etc. Even though these companies showed no profit, a lot of money was being invested in them. All bounds of rationality were crossed in creating the hype around the 'coming of age' companies. This stirred an insatiable itch in the general public to cash in now or never. The stock market rallied initially, much to the satisfaction of investors, with indices such as the NASDAQ Composite and the S&P 500 peaking just before the collapse. Eventually, in 2000, a market correction got the better of the stock market and most of these companies had to file for bankruptcy or settle for a depreciated valuation. The turn of events has been beautifully captured in the book 'boo hoo' by Ernst Malmsten.

Crypto Bubble: This one has caught the attention of most Nobel laureates in economics. It started in 2009 and has been a bone of contention ever since, with most monetary agencies dismissing it as 'worth nothing'. The price of a bitcoin (crypto's favorite son) saw an unprecedented increase over the last few years, peaking in late 2017. Ever since then it has nosedived and continues to wither away. What led to such an unimaginable surge in price? Possibly it captured the imagination of the general public as a way to increase their earnings. Word of mouth created a domino effect, leading more subscribers to risk their hard earned money. The increased price can only be explained by the dynamics of supply and demand, where more people were willing to take the plunge with few exchanges available. In spite of its immense popularity, several regulators discounted it, with a few even suggesting that it has no intrinsic value beyond black-market use. This bad reputation eventually led to an exodus of money from it.

Machine Learning (ML): This has been touted to change the course of mankind. It will supposedly solve complex problems, increase profits, reduce cost, enhance automation and what not. But does it really have the firepower to deliver? Sure it does. But it depends upon what means are used to achieve the end, and I will come to that later. How it fits the bill of being a bubble is discussed in the points below:

  1. Machine Learning requires an upscale in IT infrastructure. A lot of companies have pumped huge amounts of money into it (very similar to IT and Telecom companies during the Dot Com bubble)
  2. Companies/VCs that have invested huge amounts believe that there will be a payoff at some point which will justify the humongous cost
  3. Top MNCs and start-ups have embraced ML and have hired a workforce by doling out heavy paychecks. People with very few years of experience are being paid extremely well.
    1. The issue with this is that there are very few who actually know how to put Machine Learning to any practical use
    2. Skills such as ETL, reporting and warehousing, which come under the umbrella of Analytics (which also includes ML), are still more relevant than ML, as the work is more defined and the workforce is more seasoned
    3. The workforce often lacks the necessary skill to identify use cases specific to the industry. Without a proper use case, ML has no relevance
    4. People talk about Deep Learning and Computer Vision but still can't show how they bring any significant improvement over results obtained from simple ML/analysis/plots
  4. Clients/stakeholders don't really buy the results of an ML as they are still comfortable with Excel tables and charts. Tuning the alpha and beta of an algorithm is not really their cup of tea
  5. As the industry matures, VCs and big companies will eventually realize that ML doesn't really result in any significant improvement in the offering, and that will be the start of a real market correction/normalization as far as the workforce is concerned (the bubble will burst)





Tuesday, October 9, 2018

Machine Learning: What's that?

Machine Learning

Ever come across the term 'AI'? Seen the 2004 movie 'I, Robot', where robots evolve into a destructive force and constantly hound Will Smith? A more advanced take on AI is seen in the Marvel cinematic universe, where 'Jarvis' helps Tony Stark finish off evil for good. And who can forget the entire 'Matrix' series, where AI powered machines hunt 'Neo' in a post-apocalyptic world? It is pretty fascinating to see AI used as the subject of many sci-fi movies and capturing people's imagination, but what exactly is AI (Artificial Intelligence)? To get the hang of AI, it is imperative to first walk the bylanes of Machine Learning (ML).
ML is a combination of steps in which certain patterns/trends are extracted from data, and the patterns are then applied to similar unseen data to make predictions or generate scores. The data can be related to Sales, Marketing, Order History, Tweets, Financial Transactions, Expense Reports, etc. The steps that make up ML include (but are not restricted to) the following:
  1. Data Collection and Storage
  2. Data Manipulation
  3. Identification of appropriate algorithm/method for the problem
  4. Extraction of patterns/summaries/trends/sensitivities from the data (Learning from data)
  5. Pickling: Storing the extracted learning for later retrieval
  6. Applying the learning on unseen data
  7. Publishing the result in the form of exception reports
A block diagram to represent the above steps is shown below

Different components of the diagram are explained below:


  • Historical Data: The entire analysis is done on this piece of information. Based on the trends, patterns, summaries, etc. extracted from this data, the appropriate use case and the algorithm to use are identified
  • Storage into an RDBMS structure: The data is fed into a relational database, which enables it to be queried by an external query based system. This helps in quick retrieval of historical data
  • Data manipulation: Most of the input data is unfit for carrying out any useful analysis. The data has to be manipulated in order to expose more of its hidden structure. Some commonly used data manipulation techniques include (but are not limited to) Dummy Variable Coding, Level Pruning, Log Transformation, Normalization, etc.
  • ML Application and Pattern Extraction: Based on the appropriate business use case, a suitable ML algorithm is identified that can be applied to the input data. The output of the algorithm is normally sensitivity scores, effects, densities, etc.
    • The extracted patterns, also known as the learned part of the Model, are normally stored on disk for later retrieval. This process is commonly referred to as Pickling
  • The learned part of the Model is then applied to the live data (or the unseen data). This normally results in scores, predictions, etc.
    • Normally, some part of the output is sent back to the 'ML Application and Pattern Extraction' phase, where it is used for course correction. A typical example of this is seen in processes where a Data Steward validates the output of the Model and directs their findings to the Model Development team. This helps in better tuning of the ML
  • Publish Results: The results of the entire exercise are published in the form of exception reports. These are normally consumed by the end user
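The components above can be sketched end to end in a few lines of Python. This is only a toy under stated assumptions: the 'learning' is just a mean and a standard deviation, the function names `learn` and `score`, the sales figures and the threshold of 3 are all made up for illustration, and the pickled model is kept in memory with `pickle.dumps` instead of being written to a drive.

```python
import pickle
from statistics import mean, stdev


def learn(historical):
    """'Learning' here is simply summarizing the historical data."""
    return {"mean": mean(historical), "std": stdev(historical)}


def score(model, live):
    """Apply the stored learning to unseen data: one z-score per value."""
    return [(x - model["mean"]) / model["std"] for x in live]


# Historical data (hypothetical daily sales figures).
historical_sales = [100, 102, 98, 101, 99, 103, 97, 100]
model = learn(historical_sales)

# Pickling: serialize the learned part so it can be retrieved later.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# Apply the learning to live (unseen) data and publish an exception report.
live_sales = [101, 99, 140, 100]
z_scores = score(restored, live_sales)
exceptions = [x for x, z in zip(live_sales, z_scores) if abs(z) > 3]
print("Exception report:", exceptions)
```

With these numbers only the value 140 lands in the exception report. In a real set up the blob would typically be written to a file with `pickle.dump` and read back by the scoring job with `pickle.load`.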

Machine Learning by itself: Far from reality

In most places there is a lot of emphasis on an ML automatically doing the magic of learning. By now we are pretty clear about what learning is: extraction of trends, patterns and summaries from the historical data so that we can mimic its properties at some later point. The automatic part, however, is grossly exaggerated. Every ML has to be appropriately set up in order to achieve a decent level of precision. For instance, if my goal is to analyse Sales data, I have to decide what range of Sales data I can use. This will involve removal of outliers, some transformation, binning, etc. The steps leading up to an ML are so application dependent that each time I handle Sales data, I have to come up with a different sanity check. All these myriad things make automatic learning a difficult proposition. In other words, an ML has to be assisted to get wonders out of it.
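As an illustration of the kind of hand holding described above, here is a rough sketch of outlier removal, a log transformation and binning on a toy Sales series. The 1.5 x IQR rule, the bin edges and the helper names `iqr_bounds` and `prepare` are my own assumptions for this example. A different Sales dataset would likely need different choices, which is exactly the point.

```python
from math import log
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def prepare(values, bin_edges):
    """Drop outliers, log-transform, and assign each kept value to a bin."""
    low, high = iqr_bounds(values)
    kept = [v for v in values if low <= v <= high]   # outlier removal
    logged = [log(v) for v in kept]                  # assumes strictly positive sales
    # Bin index = number of edges the value exceeds (0 = lowest bin).
    bins = [sum(v > edge for edge in bin_edges) for v in kept]
    return kept, logged, bins


# Hypothetical daily sales with one suspicious entry.
daily_sales = [120, 135, 128, 142, 131, 126, 138, 2500]
kept, logged, bins = prepare(daily_sales, bin_edges=[125, 135])
print(kept)
print(bins)
```

Here the entry of 2500 is dropped before anything else is computed. Whether that entry is a data error or a genuine bulk order is something no algorithm can decide on its own.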

Next we will discuss what most IT aficionados label ML (or AI) to be: a big bubble. Just like the dot com bubble, it is supposedly destined to swell until its ultimate collapse. But is there a modicum of truth in this supposition, or is it purely over-speculation?

