CART
Introduction
A tree in which leaf nodes represent predicted values and non-leaf nodes represent decision rules.
To predict the outcome, follow the splits from the root down to a leaf.
Example
If , we follow the path and predict the outcome as Gray.
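The traversal described above can be sketched in plain Python. The tree below is a made-up illustration (its features, thresholds, and colors are hypothetical, not the example's actual tree): internal nodes hold a decision rule, leaves hold a predicted value.

```python
# Hypothetical CART tree: internal nodes carry a rule (feature, threshold),
# leaves carry the predicted value. All values here are invented.
tree = {
    "feature": "x1", "threshold": 0.5,
    "left": {"leaf": "Gray"},        # taken when x1 < 0.5
    "right": {                       # taken when x1 >= 0.5
        "feature": "x2", "threshold": 2.0,
        "left": {"leaf": "Blue"},
        "right": {"leaf": "Gray"},
    },
}

def predict(node, x):
    """Follow the splits from the root down to a leaf, then return its value."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(tree, {"x1": 0.2, "x2": 3.0}))  # Gray
```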
minbucket
The minimum number of observations allowed in a leaf node. If a split would leave fewer than minbucket observations in a leaf, the tree will not split further.
The most complicated tree has minbucket = 1 (each leaf node contains only one observation).
- Low minbucket: more flexibility in splitting → deeper tree → higher risk of overfitting.
- High minbucket: fewer splits allowed → shallower tree → more generalized model, but potential underfitting if set too high.
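The effect of minbucket can be sketched with a toy one-dimensional regression tree. This is an illustrative sketch, not rpart's implementation: it greedily picks the split with the lowest sum of squared errors, and rejects any split that would leave a child with fewer than minbucket points.

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(points, minbucket):
    """Grow a toy tree on (x, y) pairs sorted by x, honoring minbucket."""
    best = None
    for i in range(1, len(points)):
        if i < minbucket or len(points) - i < minbucket:
            continue  # this split would violate the minbucket constraint
        cost = sse([y for _, y in points[:i]]) + sse([y for _, y in points[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None or best[0] >= sse([y for _, y in points]):
        return {"leaf": sum(y for _, y in points) / len(points)}
    i = best[1]
    return {"split_x": points[i][0],
            "left": grow(points[:i], minbucket),
            "right": grow(points[i:], minbucket)}

def depth(node):
    if "leaf" in node:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))

# Noisy alternating target: low minbucket chases the noise (deep tree),
# high minbucket forbids small leaves (shallow tree).
data = [(x, x % 2) for x in range(10)]
print("depth with minbucket=1:", depth(grow(data, minbucket=1)))
print("depth with minbucket=5:", depth(grow(data, minbucket=5)))
```

With minbucket = 1 the tree keeps splitting until every leaf holds a single observation, the "most complicated" tree described above.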
Cross validation
A method to evaluate the performance of a model by dividing the data into k subsets, training the model on k-1 subsets, and testing it on the remaining subset. This process is repeated k times, with each subset used as the test set exactly once.
The final performance metric is the average of the performance metrics from the k folds.
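The procedure above can be sketched as follows. The model_score callback is a hypothetical placeholder standing in for "train on the training folds, evaluate on the held-out fold"; any train/evaluate routine fits there.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs; each index is held out once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(data, k, model_score):
    """Average the per-fold scores: the final performance metric."""
    scores = [model_score([data[i] for i in tr], [data[i] for i in te])
              for tr, te in k_fold_splits(len(data), k)]
    return sum(scores) / len(scores)

# Toy "model": predict the training mean; score: mean absolute error.
def mean_model_mae(train, test):
    pred = sum(train) / len(train)
    return sum(abs(y - pred) for y in test) / len(test)

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(cross_validate(data, k=5, model_score=mean_model_mae))  # 1.5
```

With k equal to the number of observations this reduces to leave-one-out cross-validation.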
cp
Similar to minbucket, but for the entire tree: the complexity parameter cp controls the overall size of the tree.
- Low cp: more splits allowed → deeper tree → higher risk of overfitting.
- High cp: fewer splits allowed → shallower tree → more generalized model, but potential underfitting if set too high.
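A sketch of how cp gates tree growth, in the spirit of rpart's complexity parameter: a split is kept only if it reduces the error by at least cp times the root error, so the threshold applies to the whole tree rather than to a single leaf. This is an illustrative toy, not rpart's actual implementation.

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(points, cp, root_sse):
    """Grow a toy tree on (x, y) pairs sorted by x; a split must pay for itself."""
    parent = sse([y for _, y in points])
    best = None
    for i in range(1, len(points)):
        cost = (sse([y for _, y in points[:i]]) +
                sse([y for _, y in points[i:]]))
        if best is None or cost < best[0]:
            best = (cost, i)
    # cp acts tree-wide: the improvement is measured against the ROOT error,
    # so the same threshold applies at every depth.
    if best is None or parent - best[0] < cp * root_sse:
        return {"leaf": sum(y for _, y in points) / len(points)}
    i = best[1]
    return {"split_x": points[i][0],
            "left": grow(points[:i], cp, root_sse),
            "right": grow(points[i:], cp, root_sse)}

def n_leaves(node):
    if "leaf" in node:
        return 1
    return n_leaves(node["left"]) + n_leaves(node["right"])

data = [(x, x % 2) for x in range(10)]  # noisy alternating target
root = sse([y for _, y in data])
print("leaves with cp=0.01:", n_leaves(grow(data, cp=0.01, root_sse=root)))
print("leaves with cp=0.30:", n_leaves(grow(data, cp=0.30, root_sse=root)))
```

Low cp lets the tree chase tiny improvements and grow large; high cp rejects all of this data's weak splits and keeps the tree at a single leaf.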
Random forest
Enhances the prediction accuracy of CART. Works by building a large number of CART trees and combining their predictions: