Hospital Data Analysis using Decision Tree
Parag Verma
Introduction
In this blog we will discuss the application of a Decision Tree to analyse the length of stay (LOS) dataset. We will identify the rules that classify 'Death' for patients based on the other features in the data. The rpart library will be used to create the decision tree.
Step 1: Read Data
The LOS_model dataset is part of the NHSRdatasets package. The rpart package is used to create the decision tree. Let's import the dataset and look at the first few records.
package.name <- c("dplyr", "tidyr", "stringr",
                  "MLmetrics",
                  "NHSRdatasets",
                  "rpart", "rpart.plot")

for(i in package.name){
  # Install the package if it is not already available, then load it
  if(!require(i, character.only = TRUE)){
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
# NHSRdatasets package has the LOS_model dataset
data(LOS_model)
df <- LOS_model %>%
  select(-ID)
head(df)
# A tibble: 6 x 4
Organisation Age LOS Death
<ord> <int> <int> <int>
1 Trust1 55 2 0
2 Trust2 27 1 0
3 Trust3 93 12 0
4 Trust4 45 3 1
5 Trust5 70 11 0
6 Trust6 60 7 0
It is a simulated dataset containing information on 300 patients across 10 different hospitals (trusts). A brief description of the columns is given below:
- ID: Patient identifier (dropped above)
- Organisation: Hospital (trust) name
- Age: Age of the patient
- LOS: Length of stay in hospital
- Death: 1 or 0 indicating whether the patient died
We will attempt to model Death using Organisation, Age and LOS by creating decision trees.
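Before modelling, it is worth checking how imbalanced the Death outcome is, since that affects which metrics are informative later on. A minimal sketch, assuming the dataset is loaded as in Step 1:

```r
library(NHSRdatasets)
library(dplyr)

data(LOS_model)
df <- LOS_model %>% select(-ID)

# Proportion of deaths: the positive class is a minority,
# which is why accuracy alone can be misleading later on
round(prop.table(table(df$Death)), 2)
```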
Step 2: Cleaning Trust column
Long level names within the 'Organisation' column can make the decision tree difficult to read. To simplify this, we will replace the text 'Trust' with 'T'.
df[["Organisation"]]<-gsub("Trust","T",df[["Organisation"]])
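One side effect worth knowing: Organisation is a factor in LOS_model, and gsub() returns a plain character vector. A small self-contained check (a sketch, rebuilding df from the raw data):

```r
library(NHSRdatasets)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]

df[["Organisation"]] <- gsub("Trust", "T", df[["Organisation"]])

class(df[["Organisation"]])         # the factor has become a character vector
sort(unique(df[["Organisation"]]))  # T1, T10, T2, ... (alphabetical order)
```

rpart will re-encode the character column internally when fitting, so this is harmless here, but it is worth being aware of in pipelines that rely on factor levels.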
Step 3: Splitting into Train (70%) and Test (30%) Datasets
We will split the dataset into train and test sets and then:
- Fit model on train dataset
- Check its accuracy on test dataset
set.seed(1)
train.sample <- sample(nrow(df), 0.70 * nrow(df), replace = FALSE)
train.df <- df[train.sample, ]
# Use the remaining 30% of rows as the test set, so that the
# train and test sets do not overlap
test.df <- df[-train.sample, ]
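After splitting, it is good practice to confirm that the Death prevalence is broadly similar in both partitions. A self-contained sketch, assuming a 70/30 split where the test set is the complement of the training rows:

```r
library(NHSRdatasets)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]

set.seed(1)
train.sample <- sample(nrow(df), 0.70 * nrow(df), replace = FALSE)
train.df <- df[train.sample, ]
test.df  <- df[-train.sample, ]   # complement of the training rows

# Death prevalence in each partition; a simple random split should
# give broadly similar rates, though small samples can drift
round(c(train = mean(train.df$Death), test = mean(test.df$Death)), 2)
```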
Step 4: Training a Decision Tree
Let's create a decision tree and discuss the parameters used to set it up.
m1 <- rpart(Death ~ ., data = train.df, method = "class",
            control = rpart.control(minsplit = 15, cp = 0.001,
                                    xval = 10, maxdepth = 6))
rpart.plot(m1, type = 3)
It can be seen that minsplit is given a value of 15. It is the minimum number of observations that must exist in a node for a split to be attempted, so it controls how far the tree can expand. maxdepth also limits the size of the tree and interacts with minsplit to a certain extent. After some trial and error, a value of 15 was chosen, as it balances the accuracy and sensitivity metrics. In the later sections, we will see how these metrics change as minsplit changes.
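The effect of minsplit on tree size can be seen directly by counting how many splits the fitted tree makes for different values. A sketch (n_splits is a helper introduced here; exact counts depend on the data and seed, so none are shown):

```r
library(NHSRdatasets)
library(rpart)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]

set.seed(1)
train.df <- df[sample(nrow(df), 0.70 * nrow(df)), ]

n_splits <- function(ms) {
  m <- rpart(Death ~ ., data = train.df, method = "class",
             control = rpart.control(minsplit = ms, cp = 0.001,
                                     xval = 10, maxdepth = 6))
  sum(m$frame$var != "<leaf>")  # count internal (splitting) nodes
}

# Smaller minsplit lets the tree keep splitting small nodes,
# so it generally grows a larger tree
c(minsplit5 = n_splits(5), minsplit45 = n_splits(45))
```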
Let's now look at the rules generated by the decision tree model.
rules.df<-rpart.rules(m1,cover = TRUE)
written_rules<-gsub(',',"",apply(rules.df,1,toString))
names(written_rules)<-NULL
written_rules
[1] "0.07 when Age >= 73 & Organisation is T4 or T6 or T9 7%"
[2] "0.10 when Age >= 88 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS >= 5 5%"
[3] "0.11 when Age < 73 71%"
[4] "0.20 when Age is 76 to 88 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS is 5 to 8 5%"
[5] "0.50 when Age >= 76 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS < 5 6%"
[6] "0.83 when Age is 76 to 88 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS >= 8 3%"
[7] "0.86 when Age is 73 to 76 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 3%"
The values 0.07, 0.10, 0.11, 0.20 and so on represent the probability of Death for each branch (leaf) of the decision tree, and the final percentage on each rule is the share of training observations it covers.
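The same leaf probabilities come back from predict() with type = "prob". For example, scoring a single hypothetical patient (the feature values below are made up purely for illustration):

```r
library(NHSRdatasets)
library(rpart)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]
df[["Organisation"]] <- gsub("Trust", "T", df[["Organisation"]])

set.seed(1)
train.df <- df[sample(nrow(df), 0.70 * nrow(df)), ]
m1 <- rpart(Death ~ ., data = train.df, method = "class",
            control = rpart.control(minsplit = 15, cp = 0.001,
                                    xval = 10, maxdepth = 6))

# A hypothetical 80-year-old at trust T4 with a 6-day stay
new_patient <- data.frame(Organisation = "T4", Age = 80L, LOS = 6L)
predict(m1, new_patient, type = "prob")  # columns "0" and "1" sum to 1
```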
Step 5: Checking Accuracy on Train data
# Class probabilities for each training row; predict the class with the highest probability
predicted_matrix <- predict(m1, train.df)
default_predicted <- colnames(predicted_matrix)[apply(predicted_matrix, 1, which.max)]
train.df[["Death_predicted"]] <- default_predicted
Getting the Accuracy metric
xtab <- as.matrix(table(train.df[["Death_predicted"]], train.df[["Death"]]))
xtab
0 1
0 170 27
1 2 11
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.8619048
So on the train dataset, we have an accuracy of 86%. Let's also compute a second metric on the train data, and then repeat both checks on the test dataset.
# Note: rows of xtab are predictions, so this is TP/(TP+FP),
# i.e. precision, rather than sensitivity (which would be TP/(TP+FN))
sens<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
sens
[1] 0.8461538
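Since the same two numbers are needed again for the test data, they can be wrapped in a small helper (conf_metrics is a name introduced here, not from any package). Note that with table(predicted, actual), the quantity the post computes as sens is TP/(TP+FP), i.e. precision:

```r
# Metrics from a confusion matrix laid out as table(predicted, actual)
conf_metrics <- function(xtab) {
  c(accuracy  = (xtab[1, 1] + xtab[2, 2]) / sum(xtab),
    precision = xtab[2, 2] / (xtab[2, 2] + xtab[2, 1]))  # TP / (TP + FP)
}

# Example with the train confusion matrix from above
xtab <- matrix(c(170, 2, 27, 11), nrow = 2,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
round(conf_metrics(xtab), 4)  # accuracy 0.8619, precision 0.8462
```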
Step 6: Checking Accuracy on Test data
# Repeat the prediction step on the test data
predicted_matrix_test <- predict(m1, test.df)
default_predicted_test <- colnames(predicted_matrix_test)[apply(predicted_matrix_test, 1, which.max)]
test.df[["Death_predicted_test"]] <- default_predicted_test
Getting the Accuracy metric
xtab <- as.matrix(table(test.df[["Death_predicted_test"]], test.df[["Death"]]))
xtab
0 1
0 71 14
1 1 4
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.8333333
So on the test dataset we get an accuracy of 83%, which is pretty close to the 86% on train. The model is therefore generalising reasonably well.
# As on the train data, this is TP/(TP+FP), i.e. precision
sens<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
sens
[1] 0.8
Step 7: Impact of changing minsplit on Accuracy and Sensitivity
Creating the train and test data
set.seed(1)
train.sample <- sample(nrow(df), 0.70 * nrow(df), replace = FALSE)
train.df <- df[train.sample, ]
# As before, the test set is the complement of the training rows
test.df <- df[-train.sample, ]
l1 <- list()
for(i in c(5, 10, 15, 20, 25, 30, 35, 40, 45)){
  m1 <- rpart(Death ~ ., data = train.df, method = "class",
              control = rpart.control(minsplit = i, cp = 0.001,
                                      xval = 10, maxdepth = 6))
  # Predictions on the train data
  predicted_matrix <- predict(m1, train.df)
  death_predicted <- colnames(predicted_matrix)[apply(predicted_matrix, 1, which.max)]
  train.df[["Death_predicted"]] <- death_predicted
  xtab <- as.matrix(table(train.df[["Death_predicted"]], train.df[["Death"]]))
  accuracy_train <- (xtab[1, 1] + xtab[2, 2]) / sum(xtab)
  sens_train <- xtab[2, 2] / (xtab[2, 2] + xtab[2, 1])

  # Predictions on the test data
  predicted_matrix_test <- predict(m1, test.df)
  death_predicted_test <- colnames(predicted_matrix_test)[apply(predicted_matrix_test, 1, which.max)]
  test.df[["Death_predicted_test"]] <- death_predicted_test
  xtab_test <- as.matrix(table(test.df[["Death_predicted_test"]], test.df[["Death"]]))
  accuracy_test <- (xtab_test[1, 1] + xtab_test[2, 2]) / sum(xtab_test)
  sens_test <- xtab_test[2, 2] / (xtab_test[2, 2] + xtab_test[2, 1])

  l1[[i]] <- data.frame(Minsplit = i,
                        Accuracy_train = round(accuracy_train, 2),
                        Accuracy_test = round(accuracy_test, 2),
                        Sensitivity_train = round(sens_train, 2),
                        Sensitivity_test = round(sens_test, 2))
  # Drop the prediction columns so they are not picked up as features
  # (Death ~ .) in the next iteration's fit
  train.df$Death_predicted <- NULL
  test.df$Death_predicted_test <- NULL
}
final.df <- do.call(rbind.data.frame, l1)
head(final.df)
Minsplit Accuracy_train Accuracy_test Sensitivity_train Sensitivity_test
1 5 0.92 0.89 0.87 0.83
2 10 0.90 0.91 0.76 0.92
3 15 0.86 0.83 0.85 0.80
4 20 0.84 0.81 0.86 1.00
5 25 0.84 0.80 0.62 0.50
6 30 0.84 0.80 0.67 0.50
We can see that at minsplit values of 5 and 15, the differences between the train and test values of Accuracy and Sensitivity are quite small. At the beginning of the blog we took minsplit as 15; 5 could be used as well, depending on which metric we are trying to maximise.
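One simple way to compare the candidates, considering accuracy alone, is to rank them by the absolute train/test gap. A sketch using the accuracy values printed in the table above, re-entered by hand for illustration:

```r
# Accuracy values from the Step 7 comparison table
final.df <- data.frame(
  Minsplit       = c(5, 10, 15, 20, 25, 30),
  Accuracy_train = c(0.92, 0.90, 0.86, 0.84, 0.84, 0.84),
  Accuracy_test  = c(0.89, 0.91, 0.83, 0.81, 0.80, 0.80)
)

# A small absolute gap suggests the tree is not overfitting the train data
final.df$Accuracy_gap <- abs(final.df$Accuracy_train - final.df$Accuracy_test)
final.df[order(final.df$Accuracy_gap), c("Minsplit", "Accuracy_gap")]
```

The gap is only one lens; in practice you would weigh it against the absolute level of whichever metric matters most for the application.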
Final Comments
In this blog we saw a practical application of Decision Trees in the healthcare domain. In coming blogs we will continue to explore further, discussing the drawbacks of decision trees and how Random Forest can be used to overcome them.
Link to Youtube Channel
https://www.youtube.com/playlist?list=PL6ruVo_0cYpV2Otl1jR9iDe6jt1jLXgAq