Decision Tree¶
An interpretable and intuitive model.
Information Theory¶
Entropy¶
Measures the impurity of the label distribution; it is used to select the attribute to split on.
Lower entropy -> the distribution is more skewed (less uniform) -> more information for prediction
$$ H(Y) = - \sum_{y \in Y} P(y) \log_2 P(y) $$
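A minimal sketch of this computation, assuming labels arrive as a plain Python list (the `entropy` helper name is ours):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum over y of P(y) * log2 P(y), with P(y) estimated by counting."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A pure node has entropy 0; a 50/50 binary node has the maximum, 1 bit.
assert entropy(["yes", "no", "yes", "no"]) == 1.0
```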
Conditional Entropy¶
$$ H(Y|X) = -\sum_{x \in X} P(x) \sum_{y \in Y} P(y|x) \log_2 P(y|x) $$ For a decision tree, \(X\) is the attribute \(A\) being tested and \(Y\) is the class label of the data in dataset \(S\) (of the current branch); this quantity is then written \(H(S|A)\).
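Under the same assumptions, a matching sketch of the conditional entropy: group the labels by attribute value, then weight each group's entropy by the probability of that value (`conditional_entropy` is our name):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum over x of P(x) * H(Y | X = x)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)  # labels of the rows with attribute value x
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())
```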
Information Gain¶
\[\begin{split}Gain(S, A) &= \text{expected reduction in entropy due to branching on attribute A} \\
&= \text{original entropy} - \text{entropy after branching} \\
&= H(S) - H(S|A)\end{split}\]
Compute the gain for every attribute; lower conditional entropy -> higher gain: $$\arg\max_X Gain(X) = \arg\max_X \left[H(Y) - H(Y|X)\right] = \arg\min_X H(Y|X)$$
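Putting the pieces together, a sketch of attribute selection (the list-of-dicts row format and every function name here are our assumptions):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    return entropy(ys) - conditional_entropy(xs, ys)

def best_attribute(rows, ys, attributes):
    # argmax Gain == argmin H(Y|X): H(Y) is the same for every attribute.
    return max(attributes, key=lambda a: information_gain([r[a] for r in rows], ys))

rows = [{"windy": True,  "outlook": "sunny"},
        {"windy": False, "outlook": "sunny"},
        {"windy": True,  "outlook": "rain"},
        {"windy": False, "outlook": "rain"}]
print(best_attribute(rows, ["no", "no", "yes", "yes"], ["windy", "outlook"]))
# -> "outlook": it separates the labels perfectly (gain = 1 bit)
```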
Continuous Attributes¶
\[\begin{split}H(Y|X:t) &= P(X < t)\, H(Y|X < t) + P(X \ge t)\, H(Y|X \ge t) \\
Gain(Y|X:t) &= H(Y) - H(Y|X:t) \\
Gain^*(Y|X) &= \max_t Gain(Y|X:t)\end{split}\]
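A sketch of the threshold search; taking candidate thresholds at midpoints between consecutive sorted values is our choice, the notes only specify the max over \(t\):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, ys):
    """Return (t, gain) maximizing Gain(Y|X:t) for a continuous attribute X."""
    pairs = sorted(zip(values, ys))
    n, base = len(pairs), entropy(ys)
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can fall between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_threshold([1.0, 2.0, 3.0, 4.0], ["no", "no", "yes", "yes"]))  # (2.5, 1.0)
```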
Gain Ratio¶
One problem with Gain is that it favors attributes with many values, which split the data into numerous small partitions.
\[\frac{Gain(S,A)}{SplitInfo(S,A)}\]
\[\begin{split}SplitInfo(S,A)=-\sum^J_{j=1}\frac{|S_j|}{|S|}\log\left(\frac{|S_j|}{|S|}\right)\\ \text{where } |S_j| = \text{the number of examples in split branch } j\end{split}\]
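Since SplitInfo is exactly the entropy of the branch proportions \(|S_j|/|S|\), a sketch can reuse the entropy helper (names are ours; the notes leave the log base unspecified, we use base 2 like the other formulas):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(gain, xs):
    """Gain(S, A) / SplitInfo(S, A); xs are the values of attribute A over S."""
    split_info = entropy(xs)  # entropy of the |S_j| / |S| proportions
    return gain / split_info if split_info > 0 else 0.0
```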
Pruning¶
Prune the tree to avoid overfitting.
Methods¶
- Chi-square automatic interaction detection, CHAID (1974)
- Classification and regression tree, CART (1984)
- Iterative Dichotomiser 3, ID3 (1986)
- successor of ID3, C4.5 (1993)
CHAID¶
CART¶
- Gini index
- binary splits
- Regression
- (usually) predict the average output of the data in the current branch
- (usually) objective function: squared error
- minimize the objective function
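A sketch of these two CART ingredients, the Gini index for classification splits and the mean-prediction / squared-error pair for regression leaves (function names are ours):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over y of P(y)^2; 0 for a pure node, like entropy."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def regression_leaf(targets):
    """Predict the average output of the current branch and report the
    squared-error objective that the split search (usually) minimizes."""
    pred = sum(targets) / len(targets)
    sse = sum((t - pred) ** 2 for t in targets)
    return pred, sse

print(gini(["yes", "no", "yes", "no"]))  # 0.5 -- the maximum for two classes
print(regression_leaf([1.0, 2.0, 3.0]))  # (2.0, 2.0)
```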
ID3¶
- Information Gain
- Calculate the entropy of every attribute \({\displaystyle a}\) of the data set \({\displaystyle S}\).
- Partition (“split”) the set \({\displaystyle S}\) into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum.
- Make a decision tree node containing that attribute.
- Recurse on subsets using the remaining attributes.
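A compact sketch of this recursion (the data layout, the dict-based node representation, and all names are our assumptions; base cases are reduced to the two shown):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * log2(c / n) for c in Counter(ys).values())

def info_gain(rows, ys, attr):
    groups = defaultdict(list)
    for r, y in zip(rows, ys):
        groups[r[attr]].append(y)
    return entropy(ys) - sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def id3(rows, ys, attrs):
    # Base cases: pure node, or no attributes left -> majority-label leaf.
    if len(set(ys)) == 1 or not attrs:
        return Counter(ys).most_common(1)[0][0]
    # Split on the attribute with maximum information gain ...
    best = max(attrs, key=lambda a: info_gain(rows, ys, a))
    node = {"attr": best, "children": {}}
    buckets = defaultdict(lambda: ([], []))
    for r, y in zip(rows, ys):
        buckets[r[best]][0].append(r)
        buckets[r[best]][1].append(y)
    # ... and recurse on each subset with the remaining attributes.
    for value, (sub_rows, sub_ys) in buckets.items():
        node["children"][value] = id3(sub_rows, sub_ys, attrs - {best})
    return node

rows = [{"outlook": "sunny", "windy": False},
        {"outlook": "sunny", "windy": True},
        {"outlook": "rain",  "windy": False}]
print(id3(rows, ["no", "no", "yes"], {"outlook", "windy"}))
# -> {'attr': 'outlook', 'children': {'sunny': 'no', 'rain': 'yes'}}
```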
C4.5¶
- Check for the base cases.
- For each attribute a, find the normalized information gain ratio from splitting on a.
- Let a_best be the attribute with the highest normalized information gain.
- Create a decision node that splits on a_best.
- Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.
- Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
- Pruning trees after creation - C4.5 goes back through the tree once it’s been created and attempts to remove branches that do not help by replacing them with leaf nodes.
- Options
- leave the tree as is
- replace that part of the tree with a leaf corresponding to the most frequent label in the data \(S\) reaching that part of the tree
- replace that part of the tree with one of its subtrees, corresponding to the most common branch in the split
- For each option, compute the upper bound on the probability of error at a fixed confidence level $\alpha=0.25$ (see the sketch after this list)
- Select the option with the smallest upper bound
- Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
- Handling attributes with differing costs.
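A hedged sketch of the error bound used in the pruning step above: C4.5-style pessimistic pruning takes the upper confidence limit of the binomial error rate at confidence $\alpha=0.25$. We use the standard normal-approximation form of that limit; the exact formula in the lecture notes may differ.

```python
from math import sqrt
from statistics import NormalDist

def error_upper_bound(errors, n, alpha=0.25):
    """Upper confidence limit on the true error rate, given `errors` training
    mistakes out of `n` examples reaching a node (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided z-value, ~0.6745 for 0.25
    f = errors / n                       # observed error rate
    num = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

# Compare the bound of a candidate leaf against the example-weighted bounds of
# the subtree's leaves; keep whichever option has the smallest bound.
print(error_upper_bound(2, 14))
```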
ref: MIT15_097S12_lec08