Basics of Decision Tree

Introduction to Decision Trees

Decision Tree is a Supervised Machine Learning algorithm which can be used for classification as well as regression problems.It uses a tree representation of the dataset at hand where each record/case is assigned to the node of a tree.The arrangement represents a tree like structure but with roots at the top.The tree is read from root to the leaf node which contains the classification of the instance

Simple Decision Tree

Lets consider a situation where we want to play tennis and we have a model to help us out here. Now depending on the weather condition, we will be using a decision tree to predict whether we can play or not.A simple illustration is represented below

Tree classifies Sunday morning according to whether they are suitable for playing tennis
Classification is made by traversing from root node to leaf node
Each node represents an attribute
Each branch descending from the node represents one of the possible values of the attribute
- Sunny, Overcast and Rain are the unique values for 'Outlook'
If Outlook = Sunny,Humidity = High.This will be classified as a negative example
The output expression are presented only with respect to yes cases:
- (Outlook = Sunny; Humidity = Normal )
- (Outlook = Overcast)
- (Outlook = Rain ; Wind = Weak)

ID3 Algorithm

Most implementations of decision trees are based on the ID3 algorithm.Salient features of the ID3 are:

A decision tree is constructed from top to bottom
Root node is determined by attribute that best splits the data
A descendant of the root node is created using all possible values of the attribute and the entire process is repeated
It is called greedy as it never goes back to the previous choices

Which attribute to pick ?

Attribute that is most useful for classifying examples should be used at the node.This is quantitatively determined by Information Gain criteria.To understand Information gain we should first study Entropy that measures purity of an arbitrary collection of examples

Entropy

To understand entropy,the following illustration should be kept in mind

Given a sample S having some positive and negative examples Entropy is given by
- S = -(p+)(log2p+) - (p-)(log2p-)
- p+ represent proportion of +ve examples in S and p- represent proportion of -ve examples
Suppose in a sample of 14 cases,there are 9 +ve and 5 –ve instances
- S = -(9/14) log2(9/14) - (5/14) log2 (5/14)
- S = 0.94
S = 0 when all member belong to the same class
S = 1 when the proportions are equal
S is between 0 and 1 when sample contains unequal number of +ve and –ve cases

Information Gain

Measures the expected reduction in Entropy.Entropy measures the impurity in a sample. Higher the impurity, higher is the entropy of the system and conversely lower is the information gain.This is used by the algorithm to determine which attribute to pick for splitting.

As a general rule of thumb,If we partition the sample based on a attribute, then there should be reduction in entropy.The attribute that results in maximum reduction in entropy is used for splitting.

Gain (S,A) = Entropy (S) - ∑ (|Sv|/|S|)* (Entropy Sv) for all values in A(A is an attribute)

Dataset for our Analysis

The example that we discussed above about playing Tennis on a Sunday morning us based on the dataset shown below.There are a total of 9 instances where we can play tennis and there are 5 instances where we cant.

Select Attribute for root node

Now lets determine which attribute should be selected to first split the data. The attribute selected should result in reduction of entropy or conversely should result in maximum Information gain.To start, we have to calculate the reference Entropy of the system.The value calculated in the Entropy section as 0.94 is the reference entropy.

Wind as Root Node

Lets select wind as a candidate and calcualte information gain

Humidity as Root Node

Lets select Humidity as a candidate and calcualte information gain

Outlook as Root Node

Lets select Outlook as a candidate and calcualte information gain

Temp as Root Node

Lets select Temp as a candidate and calcualte information gain

The information gain for the attributes is summarised in the below table.As shown, Outlook has the highest Info gain value and hence should be taken as the root node

First Layer of Decision Tree

Based on selection of Outlook for root node, we can create the first layer of branching as shown below

Subsequent Branching

Lets say we want to further split the left most branch(Sunny) into subsequent branches.We will use the same criteria for selecting attributes and calculating their Information Gain.Lets look at the calculation and identify the attribute for shorlist here

So by picking Humidity for splitting the left most branch, the new tree would look like as shown below

As can be seen, the branching resulted in branches with all classes belonging to only category:either yes or no.This illustrates that the branching results in groups that are as homogenous as possible.

Final Note

In this blog we looked at general introduction of a Decision Tree, how it uses entropy and infomration gain to branch. We also used a working example to understand the splitting process in some detail.In the next blog, we will look at implementation of Decision Tree using R for credit risk dataset

Machine Learning Made Easy

Friday, April 9, 2021

Blog 40: Introduction to Decision Trees