Hospital Data Analysis using Decision Tree
Parag Verma
Introduction
In this blog we will discuss the application of a Decision Tree to analyse the length of stay (LOS) dataset. We will identify the rules that classify 'Death' for patients based on the other features in the data. The rpart library will be used to create the decision tree.
Step 1: Read Data
The LOS_model dataset is part of the NHSRdatasets package. The rpart package is used to create the decision tree. Let's import the dataset and look at the first few records.
package.name <- c("dplyr", "tidyr", "stringr",
                  "MLmetrics",
                  "NHSRdatasets",
                  "rpart", "rpart.plot")

for(i in package.name){
  # Install the package if it is not already available, then load it
  if(!require(i, character.only = TRUE)){
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
# NHSRdatasets package has the LOS_model dataset
data(LOS_model)
df <- LOS_model %>%
  select(-ID)
head(df)
# A tibble: 6 x 4
Organisation Age LOS Death
<ord> <int> <int> <int>
1 Trust1 55 2 0
2 Trust2 27 1 0
3 Trust3 93 12 0
4 Trust4 45 3 1
5 Trust5 70 11 0
6 Trust6 60 7 0
It is a simulated dataset containing information on 300 patients across 10 different hospitals (trusts). A brief description of the columns is given below:
- ID: Patient identifier (dropped above)
- Organisation: Hospital (trust) name
- Age: Age of the patient
- LOS: Length of stay in hospital
- Death: 1 or 0 indicating whether the patient died
We will attempt to model Death using Organisation, Age and LOS by creating decision trees.
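Before modelling, it is worth checking how imbalanced the Death outcome is, since that affects which metrics are informative later on. A minimal sketch, assuming the dataset is loaded as in Step 1:

```r
library(NHSRdatasets)
library(dplyr)

data(LOS_model)
df <- LOS_model %>% select(-ID)

# Proportion of deaths: the positive class is a minority,
# which is why accuracy alone can be misleading later on
round(prop.table(table(df$Death)), 2)
```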
Step 2: Cleaning Trust column
Long level names within the 'Organisation' column can make the decision tree difficult to read. To simplify this, we will replace the text 'Trust' with 'T'.
df[["Organisation"]]<-gsub("Trust","T",df[["Organisation"]])
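One side effect worth knowing: Organisation is a factor in LOS_model, and gsub() returns a plain character vector. A small self-contained check (a sketch, rebuilding df from the raw data):

```r
library(NHSRdatasets)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]

df[["Organisation"]] <- gsub("Trust", "T", df[["Organisation"]])

class(df[["Organisation"]])         # the factor has become a character vector
sort(unique(df[["Organisation"]]))  # T1, T10, T2, ... (alphabetical order)
```

rpart will re-encode the character column internally when fitting, so this is harmless here, but it is worth being aware of in pipelines that rely on factor levels.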
Step 3: Splitting into Train (70%) and Test (30%) Datasets
We will split the dataset into train and test sets and then:
- Fit model on train dataset
- Check its accuracy on test dataset
set.seed(1)
train.sample <- sample(nrow(df), 0.70 * nrow(df), replace = FALSE)
train.df <- df[train.sample, ]
# Use the remaining 30% of rows as the test set, so that the
# train and test sets do not overlap
test.df <- df[-train.sample, ]
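After splitting, it is good practice to confirm that the Death prevalence is broadly similar in both partitions. A self-contained sketch, assuming a 70/30 split where the test set is the complement of the training rows:

```r
library(NHSRdatasets)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]

set.seed(1)
train.sample <- sample(nrow(df), 0.70 * nrow(df), replace = FALSE)
train.df <- df[train.sample, ]
test.df  <- df[-train.sample, ]   # complement of the training rows

# Death prevalence in each partition; a simple random split should
# give broadly similar rates, though small samples can drift
round(c(train = mean(train.df$Death), test = mean(test.df$Death)), 2)
```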
Step 4: Training a Decision Tree
Let's create a decision tree and discuss the parameters used to set it up.
m1 <- rpart(Death ~ ., data = train.df, method = "class",
            control = rpart.control(minsplit = 15, cp = 0.001,
                                    xval = 10, maxdepth = 6))
rpart.plot(m1, type = 3)
It can be seen that minsplit is given a value of 15. It is the minimum number of observations that must exist in a node for a split to be attempted, so it controls how far the tree can expand. maxdepth also limits the size of the tree and interacts with minsplit to a certain extent. After some trial and error, a value of 15 was chosen, as it balances the accuracy and sensitivity metrics. In the later sections, we will see how these metrics change as minsplit changes.
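The effect of minsplit on tree size can be seen directly by counting how many splits the fitted tree makes for different values. A sketch (n_splits is a helper introduced here; exact counts depend on the data and seed, so none are shown):

```r
library(NHSRdatasets)
library(rpart)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]

set.seed(1)
train.df <- df[sample(nrow(df), 0.70 * nrow(df)), ]

n_splits <- function(ms) {
  m <- rpart(Death ~ ., data = train.df, method = "class",
             control = rpart.control(minsplit = ms, cp = 0.001,
                                     xval = 10, maxdepth = 6))
  sum(m$frame$var != "<leaf>")  # count internal (splitting) nodes
}

# Smaller minsplit lets the tree keep splitting small nodes,
# so it generally grows a larger tree
c(minsplit5 = n_splits(5), minsplit45 = n_splits(45))
```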
Let's now look at the rules generated by the decision tree model.
rules.df<-rpart.rules(m1,cover = TRUE)
written_rules<-gsub(',',"",apply(rules.df,1,toString))
names(written_rules)<-NULL
written_rules
[1] "0.07 when Age >= 73 & Organisation is T4 or T6 or T9 7%"
[2] "0.10 when Age >= 88 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS >= 5 5%"
[3] "0.11 when Age < 73 71%"
[4] "0.20 when Age is 76 to 88 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS is 5 to 8 5%"
[5] "0.50 when Age >= 76 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS < 5 6%"
[6] "0.83 when Age is 76 to 88 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 & LOS >= 8 3%"
[7] "0.86 when Age is 73 to 76 & Organisation is T1 or T10 or T2 or T3 or T5 or T7 or T8 3%"
The values 0.07, 0.10, 0.11, 0.20 and so on represent the probability of Death for each branch (leaf) of the decision tree, and the final percentage on each rule is the share of training observations it covers.
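The same leaf probabilities come back from predict() with type = "prob". For example, scoring a single hypothetical patient (the feature values below are made up purely for illustration):

```r
library(NHSRdatasets)
library(rpart)
data(LOS_model)
df <- LOS_model[, c("Organisation", "Age", "LOS", "Death")]
df[["Organisation"]] <- gsub("Trust", "T", df[["Organisation"]])

set.seed(1)
train.df <- df[sample(nrow(df), 0.70 * nrow(df)), ]
m1 <- rpart(Death ~ ., data = train.df, method = "class",
            control = rpart.control(minsplit = 15, cp = 0.001,
                                    xval = 10, maxdepth = 6))

# A hypothetical 80-year-old at trust T4 with a 6-day stay
new_patient <- data.frame(Organisation = "T4", Age = 80L, LOS = 6L)
predict(m1, new_patient, type = "prob")  # columns "0" and "1" sum to 1
```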
Step 5: Checking Accuracy on Train data
# Class probabilities for each training row; predict the class with the highest probability
predicted_matrix <- predict(m1, train.df)
default_predicted <- colnames(predicted_matrix)[apply(predicted_matrix, 1, which.max)]
train.df[["Death_predicted"]] <- default_predicted
Getting the Accuracy metric
xtab <- as.matrix(table(train.df[["Death_predicted"]], train.df[["Death"]]))
xtab
0 1
0 170 27
1 2 11
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.8619048
So on the train dataset, we have an accuracy of 86%. Let's also compute a second metric on the train data, and then repeat both checks on the test dataset.
# Note: rows of xtab are predictions, so this is TP/(TP+FP),
# i.e. precision, rather than sensitivity (which would be TP/(TP+FN))
sens<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
sens
[1] 0.8461538
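Since the same two numbers are needed again for the test data, they can be wrapped in a small helper (conf_metrics is a name introduced here, not from any package). Note that with table(predicted, actual), the quantity the post computes as sens is TP/(TP+FP), i.e. precision:

```r
# Metrics from a confusion matrix laid out as table(predicted, actual)
conf_metrics <- function(xtab) {
  c(accuracy  = (xtab[1, 1] + xtab[2, 2]) / sum(xtab),
    precision = xtab[2, 2] / (xtab[2, 2] + xtab[2, 1]))  # TP / (TP + FP)
}

# Example with the train confusion matrix from above
xtab <- matrix(c(170, 2, 27, 11), nrow = 2,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
round(conf_metrics(xtab), 4)  # accuracy 0.8619, precision 0.8462
```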
Step 6: Checking Accuracy on Test data
# Repeat the prediction step on the test data
predicted_matrix_test <- predict(m1, test.df)
default_predicted_test <- colnames(predicted_matrix_test)[apply(predicted_matrix_test, 1, which.max)]
test.df[["Death_predicted_test"]] <- default_predicted_test
Getting the Accuracy metric
xtab <- as.matrix(table(test.df[["Death_predicted_test"]], test.df[["Death"]]))
xtab
0 1
0 71 14
1 1 4
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.8333333
So on the test dataset we get an accuracy of 83%, which is pretty close to the 86% on train. The model is therefore generalising reasonably well.
# As on the train data, this is TP/(TP+FP), i.e. precision
sens<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
sens
[1] 0.8
Step 7: Impact of changing minsplit on Accuracy and Sensitivity
Creating the train and test data
set.seed(1)
train.sample <- sample(nrow(df), 0.70 * nrow(df), replace = FALSE)
train.df <- df[train.sample, ]
# As before, the test set is the complement of the training rows
test.df <- df[-train.sample, ]
l1 <- list()
for(i in c(5, 10, 15, 20, 25, 30, 35, 40, 45)){
  m1 <- rpart(Death ~ ., data = train.df, method = "class",
              control = rpart.control(minsplit = i, cp = 0.001,
                                      xval = 10, maxdepth = 6))
  # Predictions on the train data
  predicted_matrix <- predict(m1, train.df)
  death_predicted <- colnames(predicted_matrix)[apply(predicted_matrix, 1, which.max)]
  train.df[["Death_predicted"]] <- death_predicted
  xtab <- as.matrix(table(train.df[["Death_predicted"]], train.df[["Death"]]))
  accuracy_train <- (xtab[1, 1] + xtab[2, 2]) / sum(xtab)
  sens_train <- xtab[2, 2] / (xtab[2, 2] + xtab[2, 1])

  # Predictions on the test data
  predicted_matrix_test <- predict(m1, test.df)
  death_predicted_test <- colnames(predicted_matrix_test)[apply(predicted_matrix_test, 1, which.max)]
  test.df[["Death_predicted_test"]] <- death_predicted_test
  xtab_test <- as.matrix(table(test.df[["Death_predicted_test"]], test.df[["Death"]]))
  accuracy_test <- (xtab_test[1, 1] + xtab_test[2, 2]) / sum(xtab_test)
  sens_test <- xtab_test[2, 2] / (xtab_test[2, 2] + xtab_test[2, 1])

  l1[[i]] <- data.frame(Minsplit = i,
                        Accuracy_train = round(accuracy_train, 2),
                        Accuracy_test = round(accuracy_test, 2),
                        Sensitivity_train = round(sens_train, 2),
                        Sensitivity_test = round(sens_test, 2))
  # Drop the prediction columns so they are not picked up as features
  # (Death ~ .) in the next iteration's fit
  train.df$Death_predicted <- NULL
  test.df$Death_predicted_test <- NULL
}
final.df <- do.call(rbind.data.frame, l1)
head(final.df)
Minsplit Accuracy_train Accuracy_test Sensitivity_train Sensitivity_test
1 5 0.92 0.89 0.87 0.83
2 10 0.90 0.91 0.76 0.92
3 15 0.86 0.83 0.85 0.80
4 20 0.84 0.81 0.86 1.00
5 25 0.84 0.80 0.62 0.50
6 30 0.84 0.80 0.67 0.50
We can see that at minsplit values of 5 and 15, the differences between the train and test values of Accuracy and Sensitivity are quite small. At the beginning of the blog we took minsplit as 15; 5 could be used as well, depending on which metric we are trying to maximise.
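One simple way to compare the candidates, considering accuracy alone, is to rank them by the absolute train/test gap. A sketch using the accuracy values printed in the table above, re-entered by hand for illustration:

```r
# Accuracy values from the Step 7 comparison table
final.df <- data.frame(
  Minsplit       = c(5, 10, 15, 20, 25, 30),
  Accuracy_train = c(0.92, 0.90, 0.86, 0.84, 0.84, 0.84),
  Accuracy_test  = c(0.89, 0.91, 0.83, 0.81, 0.80, 0.80)
)

# A small absolute gap suggests the tree is not overfitting the train data
final.df$Accuracy_gap <- abs(final.df$Accuracy_train - final.df$Accuracy_test)
final.df[order(final.df$Accuracy_gap), c("Minsplit", "Accuracy_gap")]
```

The gap is only one lens; in practice you would weigh it against the absolute level of whichever metric matters most for the application.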
Final Comments
In this blog we saw a practical application of Decision Trees in the healthcare domain. In coming blogs we will continue to explore further, discussing the drawbacks of decision trees and how Random Forest can be used to overcome them.
Link to Youtube Channel
https://www.youtube.com/playlist?list=PL6ruVo_0cYpV2Otl1jR9iDe6jt1jLXgAq