Decision Trees

yueyuan
Oct 9, 2020 · 5 min read

To determine which separation is best, we need a way to measure and compare “impurity”.

Yes/No questions

Calculate the Gini impurity for each leaf node: Gini = 1 − P(yes)² − P(no)²

The total Gini impurity for using Chest Pain to separate patients with and without heart disease is the weighted average of the leaf node impurities.
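As a rough sketch, the per-leaf Gini impurity and the weighted total could be computed like this in Python (the leaf counts below are made-up numbers for illustration, not the actual patient data):

```
def gini(yes, no):
    """Gini impurity of a leaf: 1 - P(yes)^2 - P(no)^2."""
    total = yes + no
    p_yes, p_no = yes / total, no / total
    return 1 - p_yes**2 - p_no**2

def weighted_gini(leaves):
    """Total Gini impurity for a split: weighted average of the leaf impurities."""
    total = sum(yes + no for yes, no in leaves)
    return sum((yes + no) / total * gini(yes, no) for yes, no in leaves)

# Hypothetical leaf counts for the Chest Pain split: (heart disease, no heart disease)
chest_pain_leaves = [(105, 39), (34, 125)]
print(weighted_gini(chest_pain_leaves))  # total impurity for this candidate split
```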

Good blood circulation has the lowest impurity (it separates patients with and without heart disease the best).

When we divided all of the patients using good blood circulation, we ended up with “impure” leaf nodes.

Each leaf contained a mixture of patients with and without heart disease.

Numerical Data

For example, suppose we want to use weight to predict heart disease.

Step 1) Sort the patients by weight, lowest to highest.

Step 2) Calculate the average weight for all adjacent patients.

Step 3) Calculate the impurity values for each average weight.

Step 4) Use the average weight with the lowest impurity as the threshold to separate the patients (a sketch of all four steps follows below).
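A minimal sketch of the four steps, assuming the weights and yes/no heart disease labels are given as plain lists (the numbers are hypothetical):

```
import numpy as np

def gini(labels):
    """Gini impurity of a group of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)  # fraction of "yes"
    return 1 - p**2 - (1 - p)**2

def best_weight_threshold(weights, labels):
    """Steps 1-4: sort by weight, average adjacent weights, score each candidate."""
    weights, labels = np.asarray(weights, dtype=float), np.asarray(labels)
    order = np.argsort(weights)                    # step 1: sort by weight
    weights, labels = weights[order], labels[order]
    candidates = (weights[:-1] + weights[1:]) / 2  # step 2: adjacent averages
    best = None
    for t in candidates:                           # step 3: impurity per candidate
        left, right = labels[weights < t], labels[weights >= t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best                                    # step 4: lowest-impurity threshold

# Hypothetical weights, with 1 = heart disease and 0 = no heart disease
print(best_weight_threshold([155, 180, 190, 220, 225, 260], [0, 1, 0, 1, 1, 1]))
```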

Ranked data

Ranked data is handled like numerical data, except that now we calculate impurity scores for all of the possible ranks.

So if people could rank my jokes from 1 to 4 (4 being the funniest) to predict if they like something, we could calculate the following impurity scores:

joke rank ≤ 1, joke rank ≤ 2, joke rank ≤ 3 (there is no "joke rank ≤ 4" since that would include everyone)

Multiple choices

When a feature has multiple choices, like "color can be blue, green, or red", we calculate an impurity score for each option as well as for each possible combination:

color choice: blue, color choice: green, color choice: red, color choice: blue or green, color choice: blue or red, color choice: green or red
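A small sketch of how these candidate splits could be enumerated, using the joke ranks and colors from the examples above:

```
from itertools import combinations

# Ranked data: candidate splits are "rank <= r" for every rank except the largest
ranks = [1, 2, 3, 4]
rank_splits = [f"joke rank <= {r}" for r in ranks[:-1]]
print(rank_splits)  # ['joke rank <= 1', 'joke rank <= 2', 'joke rank <= 3']

# Multiple-choice data: each single value plus every combination that leaves something out
colors = ["blue", "green", "red"]
color_splits = [list(c) for size in range(1, len(colors))
                for c in combinations(colors, size)]
print(color_splits)
# [['blue'], ['green'], ['red'], ['blue', 'green'], ['blue', 'red'], ['green', 'red']]
```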

Feature selection

If a feature never produces a low enough Gini impurity, it is not included in the tree, so building a tree is a type of automatic feature selection.

We could also set a threshold so that a split is only made when the impurity reduction is large enough to make a big difference.

As a result, we end up with simpler trees that are not “overfit”.

"Overfit" means our tree does well with the original data (the data we used to build the tree) but doesn't do well with any other data set.

Decision trees have the downside of often being overfit. Requiring each split to make a large reduction in impurity helps keep a tree from being overfit.
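In scikit-learn, for example, this kind of requirement roughly corresponds to the `min_impurity_decrease` parameter; the sketch below uses synthetic data rather than the heart disease example:

```
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Only keep a split if it reduces the weighted impurity by at least 0.01
tree = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=0.01,
                              random_state=0).fit(X, y)
print(tree.get_n_leaves())  # fewer leaves than an unconstrained tree would have
```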

Missing data

For some patients, we may not know whether they have blocked arteries. If, overall, "yes" occurred more often than "no", we could fill in "yes" here.

Alternatively, we could find another column that has the highest correlation with blocked arteries and use that as a guide.

For numerical data, suppose we are missing some weight data. We could fill it in with the mean or median. Or we could find the column with the highest correlation with weight, fit a linear regression on the two columns, and use the least squares line to predict the missing weight.
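A sketch of both ideas for a missing weight value, using a hypothetical height column as the most correlated feature:

```
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one patient's weight is missing (np.nan)
height = np.array([150, 160, 170, 180, 190], dtype=float)
weight = np.array([55, 62, np.nan, 80, 88], dtype=float)
missing = np.isnan(weight)

# Option 1: fill with the mean (or median) of the observed weights
mean_fill = np.nanmean(weight)

# Option 2: regress weight on the most correlated column (here, height)
# and use the least squares line to predict the missing value
model = LinearRegression().fit(height[~missing].reshape(-1, 1), weight[~missing])
regression_fill = model.predict(height[missing].reshape(-1, 1))

print(mean_fill, regression_fill)
```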

Regression Tree

In a regression tree, each leaf represents a numeric value.

In contrast, classification trees have true or false in their leaves.

Since each leaf corresponds to the average drug effectiveness in a different cluster of observations, the tree does a better job reflecting the data than the straight line.

When we have only one predictor, drawing a graph makes it easy to see what is going on.

When we have multiple predictors, a regression tree easily accommodates the additional predictors.

For each point in the data, we can draw the residual, the difference between the observed and predicted values, and we can use the residuals to quantify the quality of these predictions.

We can make a plot:

x-axis -> dosage threshold for tree

y-axis -> sum of squared residuals

And pick the threshold with the least sum of squared residuals.

In summary, we split the data into two groups by finding the threshold that gave us the smallest sum of squared residuals.
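A minimal sketch of that search, with made-up dosage and effectiveness numbers:

```
import numpy as np

def best_dosage_threshold(dosage, effectiveness):
    """Try each adjacent-average dosage as a threshold and keep the one
    with the smallest sum of squared residuals (SSR)."""
    dosage = np.asarray(dosage, dtype=float)
    effectiveness = np.asarray(effectiveness, dtype=float)
    order = np.argsort(dosage)
    dosage, effectiveness = dosage[order], effectiveness[order]
    candidates = (dosage[:-1] + dosage[1:]) / 2
    best = None
    for t in candidates:
        left, right = effectiveness[dosage < t], effectiveness[dosage >= t]
        # The prediction in each group is the group's average effectiveness
        ssr = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or ssr < best[1]:
            best = (t, ssr)
    return best  # (threshold, SSR)

# Hypothetical dosages and drug effectiveness (%)
print(best_dosage_threshold([3, 5, 10, 15, 25, 30], [0, 0, 5, 100, 100, 20]))
```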

When a group of observations all have the same drug effectiveness, we don't need to split them into smaller groups; they become a leaf node.

When a model fits the training data perfectly, it probably means it is overfit and will not perform well with new data. The model has no bias, but potentially large variance.

How to prevent overfitting:

The simplest approach is to only split a node when it contains more than some minimum number of observations. Typically, the minimum number of observations required to allow a split is 20.

However, since this example doesn't have many observations, I set the minimum to 7. So if a node contains only 6 observations, we will not split it any further.

For multiple predictors, we grow the tree just like before, except now we compare the lowest sum of squared residuals from each predictor and pick the candidate with the smallest sum of squared residuals to be the root. And just like before, when a node has fewer than the minimum number of observations (usually 20, but we are using 7), we stop trying to divide it. We repeat the process until we cannot split the remaining nodes into smaller groups.
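If we were using scikit-learn, the minimum number of observations to allow a split roughly corresponds to the `min_samples_split` parameter; this sketch uses synthetic data rather than the original example:

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for dosage/effectiveness data, plus extra predictors
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(60, 3))
y = np.where((X[:, 0] > 10) & (X[:, 0] < 25), 100.0, 0.0) + rng.normal(0, 5, 60)

# Do not split a node that has fewer than 7 observations
tree = DecisionTreeRegressor(min_samples_split=7, random_state=0).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```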

How to Prune a Regression Tree

Cost complexity pruning (weakest link pruning)

Calculate the sum of squared residuals for each tree:

first for the whole tree, then for the sub-tree with one fewer leaf, and so on, down to the tree with two leaves and finally the tree with just one leaf.

But each time we remove a leaf, the sum of squared residuals gets larger and larger.

Weakest Link Pruning works by calculating a tree score that is based on the sum of squared residuals (SSR) for the tree or sub-tree and a tree complexity penalty that is a function of the number of leaves or terminal nodes in the tree or sub-tree:

Tree Score = SSR + alpha * T, where T is the number of leaves

The tree complexity penalty compensates for the difference in the number of leaves. alpha is a tuning parameter that we find using cross validation; for example, alpha = 10,000. The more leaves, the larger the penalty.

We pick the sub-tree with the lowest tree score. Thus the value for alpha makes a difference in our choice of sub-tree.
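A tiny sketch of how the tree scores could be compared, with made-up SSR values and leaf counts:

```
def tree_score(ssr, n_leaves, alpha):
    """Tree Score = SSR + alpha * T, where T is the number of leaves."""
    return ssr + alpha * n_leaves

# Hypothetical (SSR, number of leaves) for the full tree and its pruned versions
subtrees = [(300, 4), (5000, 3), (19000, 2), (55000, 1)]
alpha = 10000
scores = [tree_score(ssr, T, alpha) for ssr, T in subtrees]
print(scores, "-> pick sub-tree", scores.index(min(scores)))
# With this alpha the 3-leaf sub-tree wins; a different alpha can change the choice
```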

Step 1) Use all of the data to build a full-sized regression tree. Note: this full-sized tree is different from before because it is fit to all of the data, not just the training data.

When alpha = 0, the tree complexity penalty becomes 0

Tree Score = SSR + 0 * T = SSR

And as we saw earlier, all of the sub-trees will have larger sum of squared residuals.

So we put alpha = 0 next to the full tree to remind us that it has the lowest tree score when alpha = 0.

Then we increase alpha until pruning leaves gives us a lower tree score.

And so on.

In the end, different values for alpha give us a sequence of trees, from full sized to just a leaf.

Step 2) Divide the full data set into training and testing data sets.

Then, using just the training data, use the alpha values we found before to build a full tree and a sequence of sub-trees that minimize the tree score.

Then calculate the sum of squared residuals for each new tree using only the testing data.

Then we create new training data and new testing data.

We just keep repeating until we have done 10-fold cross validation. The value for alpha that, on average, gave us the lowest sum of squared residuals with the testing data is the final value for alpha.

Step 3) We go back to the original trees and sub-trees made from the full data and pick the tree that corresponds to the value for alpha we selected. This sub-tree is the final, pruned tree.
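As a rough analogue of these three steps, scikit-learn's cost complexity pruning path can generate the candidate alpha values, cross validation can pick one, and the final tree is refit with that alpha; the data below is synthetic:

```
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic dosage/effectiveness-style data
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(100, 1))
y = np.where((X[:, 0] > 10) & (X[:, 0] < 25), 100.0, 0.0) + rng.normal(0, 5, 100)

# Step 1: compute the pruning path on all of the data to get candidate alphas
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Step 2: 10-fold cross validation to find the alpha with the lowest average SSR
# (negative mean squared error is used here, so "higher is better")
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=10, scoring="neg_mean_squared_error").mean()
             for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]

# Step 3: the final pruned tree uses the chosen alpha on the full data set
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```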
