CART
Introduction
A tree whose leaf nodes represent predicted values and whose non-leaf nodes represent decision rules.
To predict the outcome, follow splits in the tree.
Example
If , we follow the path and predict the outcome as Gray.
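The traversal described above can be sketched in Python. The tree, feature names, and thresholds below are hypothetical, chosen only to illustrate following splits down to a leaf:

```python
def predict(tree, x):
    """Walk the tree: non-leaf nodes hold a decision rule, leaves a value."""
    while "leaf" not in tree:
        feature, threshold = tree["feature"], tree["threshold"]
        tree = tree["left"] if x[feature] < threshold else tree["right"]
    return tree["leaf"]

# Non-leaf nodes carry decision rules; leaf nodes carry predicted values.
tree = {
    "feature": "x1", "threshold": 0.5,
    "left": {"leaf": "Gray"},
    "right": {
        "feature": "x2", "threshold": 2.0,
        "left": {"leaf": "Blue"},
        "right": {"leaf": "Gray"},
    },
}

print(predict(tree, {"x1": 0.2, "x2": 1.0}))  # x1 < 0.5, so we go left and predict Gray
```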
`minbucket`: the minimum number of observations in a leaf node. If a split would leave a leaf node with fewer than `minbucket` observations, the tree will not split further.
The most complicated tree has `minbucket = 1` (each leaf node has only one observation).
- Low `minbucket`: more flexibility in splitting → deeper tree → higher risk of overfitting.
- High `minbucket`: fewer splits allowed → shallower tree → more generalized model, but potential underfitting if too high.
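The effect of `minbucket` on depth can be seen with a toy model (not rpart's actual algorithm, just its stopping rule): split each node roughly in half until a child would fall below `minbucket`, and count how deep the tree gets.

```python
def grow_depth(n_obs, minbucket, depth=0):
    """Toy model of the minbucket stopping rule: split a node of n_obs
    observations in half until either child would drop below minbucket.
    Returns the resulting depth along the left spine."""
    half = n_obs // 2
    if half < minbucket or n_obs - half < minbucket:
        return depth  # split refused: a child would be too small
    return grow_depth(half, minbucket, depth + 1)

print(grow_depth(100, minbucket=1))   # low minbucket: deep tree
print(grow_depth(100, minbucket=25))  # high minbucket: shallow tree
```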
Random forests
Enhance the prediction accuracy of CART.
They work by building a large number of CART trees and combining their predictions:
- Each tree can split on only a random subset of the variables
- Each tree is built from a "bagged"/"bootstrapped" sample of the data, by selecting observations randomly with replacement.
This process is controlled by two parameters:
- `nodesize`: the minimum number of observations in a leaf node (analogous to `minbucket`).
- `ntree`: the number of trees to build; it should not be too small, or bagging may miss some observations entirely.
Example
Consider the following dataset, with 5 variables and 5 observations:
Obs | Var1 | Var2 | Var3 | Var4 | Var5 | Outcome |
---|---|---|---|---|---|---|
1 | 2 | 3 | 1 | 5 | 4 | A |
2 | 1 | 2 | 3 | 4 | 3 | B |
3 | 4 | 5 | 2 | 1 | 2 | A |
4 | 3 | 4 | 5 | 2 | 1 | B |
5 | 5 | 1 | 4 | 3 | 5 | A |
Step 1: Bootstrapping
Generate a random sample with replacement. For example:
- Sample: Obs 1, Obs 3, Obs 5, Obs 1, Obs 4.
Step 2: Random selection of variables
For each tree, randomly select a subset of variables to consider at each split (e.g., `Var2`, `Var4`, `Var5`).
Step 3: Build the tree
Construct a CART tree using the bootstrapped sample and the selected variables. For example:
- Root node splits on `Var4` > 3.
- Left child splits on `Var2` ≤ 2.
- Right child splits on `Var5` > 4.
Step 4: Repeat
Repeat Steps 1–3 to build `ntree` trees (e.g., 100 trees).
Step 5: Combine predictions
Each tree votes on the outcome for new data. The final prediction is the majority vote (for classification).
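Steps 1–5 can be sketched on the five-observation dataset above. The per-tree model here is a deliberate stub (it predicts the majority class of its bootstrap sample) so the ensemble mechanics stay visible; a real implementation would grow a full CART tree on the selected variables.

```python
import random
from collections import Counter

# The dataset from the example table: (variables, outcome) pairs.
data = [
    ({"Var1": 2, "Var2": 3, "Var3": 1, "Var4": 5, "Var5": 4}, "A"),
    ({"Var1": 1, "Var2": 2, "Var3": 3, "Var4": 4, "Var5": 3}, "B"),
    ({"Var1": 4, "Var2": 5, "Var3": 2, "Var4": 1, "Var5": 2}, "A"),
    ({"Var1": 3, "Var2": 4, "Var3": 5, "Var4": 2, "Var5": 1}, "B"),
    ({"Var1": 5, "Var2": 1, "Var3": 4, "Var4": 3, "Var5": 5}, "A"),
]

def build_stub_tree(sample, variables):
    # Step 3 placeholder: a real CART tree would split on `variables`;
    # this stub just memorizes the sample's majority class.
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def random_forest(data, ntree=100, mtry=3, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(ntree):                                      # Step 4: repeat ntree times
        sample = [rng.choice(data) for _ in range(len(data))]   # Step 1: bootstrap with replacement
        variables = rng.sample(list(data[0][0]), mtry)          # Step 2: random variable subset
        trees.append(build_stub_tree(sample, variables))        # Step 3: build a tree
    return trees

def predict(trees, x):
    # Step 5: every tree votes; the majority wins (classification).
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

trees = random_forest(data)
print(predict(trees, data[0][0]))  # most bootstrap samples keep the dataset's A majority
```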
Cross validation
A method to evaluate the performance of a model by dividing the data into k
subsets, training the model on k-1
subsets, and testing it on the remaining subset. This process is repeated k
times, with each subset being used as the test set once.
The final performance metric is the average of the performance metrics from each fold.
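The k-fold procedure can be sketched with a hypothetical stand-in model: predict the mean of the training folds, and score each test fold by mean squared error. `k_fold_cv` and both lambdas are made up for illustration:

```python
def k_fold_cv(data, k, fit, score):
    """Split data into k folds; each fold serves as the test set once.
    Returns the average score across the k folds."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(train)          # train on the other k-1 folds
        scores.append(score(model, test))
    return sum(scores) / k          # average metric over all folds

# Hypothetical model: predict the training mean; metric: mean squared error.
fit = lambda train: sum(train) / len(train)
score = lambda mean, test: sum((x - mean) ** 2 for x in test) / len(test)

print(k_fold_cv([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3, fit=fit, score=score))
```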
`cp` (complexity parameter): a parameter that controls the size of the tree; cross validation is used to select its value.
- Low `cp`: more splits allowed → deeper tree → higher risk of overfitting.
- High `cp`: fewer splits allowed → shallower tree → more generalized model, but potential underfitting if too high.
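The `cp` trade-off can be illustrated with a toy rule (not rpart's exact cost-complexity computation, and the gain values are made up): splitting stops at the first split whose improvement in fit falls below `cp`, so a lower `cp` keeps more splits.

```python
def count_splits(improvements, cp):
    """improvements: per-split gain in fit, in decreasing order.
    Splitting stops at the first split whose gain falls below cp."""
    n = 0
    for gain in improvements:
        if gain < cp:
            break   # this split does not improve fit enough to keep
        n += 1
    return n

gains = [0.40, 0.15, 0.05, 0.01]
print(count_splits(gains, cp=0.01))  # low cp: all 4 splits kept (deeper tree)
print(count_splits(gains, cp=0.10))  # high cp: only 2 splits kept (shallower tree)
```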