CART
Introduction
A tree whose leaf nodes represent predicted values and whose non-leaf (internal) nodes represent decision rules.
To predict an outcome, follow the splits from the root down to a leaf.
Example
If an observation satisfies the split conditions along a path ending in the Gray leaf, we follow that path and predict the outcome as Gray.
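Following splits down to a leaf can be sketched in a few lines of Python. The node structure, feature indices, and class labels below are hypothetical, chosen only to illustrate the traversal:

```python
# Minimal sketch of CART prediction: each internal node holds a decision
# rule (feature, threshold), each leaf holds a predicted value.
# The tree and features here are made up for illustration.

def predict(node, x):
    """Follow splits from the root until a leaf is reached."""
    while "value" not in node:  # internal node: apply its decision rule
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]

# A tiny hand-built tree: split on x[0], then on x[1] in the left branch.
tree = {
    "feature": 0, "threshold": 0.5,
    "left": {
        "feature": 1, "threshold": 2.0,
        "left": {"value": "Gray"},
        "right": {"value": "Brown"},
    },
    "right": {"value": "White"},
}

print(predict(tree, [0.3, 1.5]))  # left, then left → Gray
```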
minbucket: the minimum number of observations in a leaf node. If a split would leave a leaf node with fewer than minbucket observations, the tree will not split further.
Most complicated tree: minbucket = 1 (each leaf node has only one observation).
- Low minbucket: More flexibility in splitting → Deeper tree → Higher risk of overfitting.
- High minbucket: Fewer splits allowed → Shallower tree → More generalized model, but potential underfit if too high.
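The effect of minbucket on tree depth can be illustrated with a toy recursive splitter that refuses any split leaving a child smaller than minbucket. This is a simplified sketch, not rpart's actual splitting algorithm:

```python
# Sketch: how minbucket limits tree depth. We split a list of values at
# its midpoint, but only if both halves would contain at least
# `minbucket` observations. Illustrative only, not rpart's algorithm.

def grow(values, minbucket):
    """Return the depth of the tree grown over `values`."""
    mid = len(values) // 2
    left, right = values[:mid], values[mid:]
    if len(left) < minbucket or len(right) < minbucket:
        return 1  # stop: this node becomes a leaf
    return 1 + max(grow(left, minbucket), grow(right, minbucket))

data = list(range(16))
print(grow(data, minbucket=1))  # deepest tree, one observation per leaf → 5
print(grow(data, minbucket=8))  # splits stop early, shallow tree → 2
```

Lowering minbucket lets the recursion continue further, which is exactly the deeper-tree/overfitting trade-off described above.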
Random forests
A method that enhances the prediction accuracy of CART. It works by building a large number of CART trees and combining their predictions:
- Each tree can split on only a random subset of the variables
- Each tree is built from a "bagged"/"bootstrapped" sample of the data, by selecting observations randomly with replacement.
This process is controlled by two parameters:
- nodesize: minimum number of observations in a terminal (leaf) node.
- ntree: number of trees to build; should not be too small, as bagging may miss observations.
Example
Consider the following dataset, with 5 variables and 5 observations:
| Obs | Var1 | Var2 | Var3 | Var4 | Var5 | Outcome |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 1 | 5 | 4 | A |
| 2 | 1 | 2 | 3 | 4 | 3 | B |
| 3 | 4 | 5 | 2 | 1 | 2 | A |
| 4 | 3 | 4 | 5 | 2 | 1 | B |
| 5 | 5 | 1 | 4 | 3 | 5 | A |
Step 1: Bootstrapping
Generate a random sample with replacement. For example:
- Sample: Obs 1, Obs 3, Obs 5, Obs 1, Obs 4.
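Step 1 can be sketched with the stdlib's `random.choices`, which samples with replacement (observation numbers follow the toy dataset above; the seed is an arbitrary choice for reproducibility):

```python
import random

# Sketch of bootstrapping: draw n observations with replacement from the
# original n observations. Some observations appear more than once,
# others not at all.
random.seed(42)  # arbitrary seed, for reproducibility only

obs = [1, 2, 3, 4, 5]
bootstrap_sample = random.choices(obs, k=len(obs))  # with replacement
print(bootstrap_sample)
```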
Step 2: Random selection of variables
For each tree, randomly select a subset of variables to consider at each split (e.g., Var2, Var4, Var5).
Step 3: Build the tree
Construct a CART tree using the bootstrapped sample and the selected variables. For example:
- Root node splits on Var4 > 3.
- Left child splits on Var2 ≤ 2.
- Right child splits on Var5 > 4.
Step 4: Repeat
Repeat Steps 1–3 to build ntree trees (e.g., 100 trees).
Step 5: Combine predictions
Each tree votes on the outcome for new data. The final prediction is based on majority vote (classification).
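Step 5's majority vote can be sketched with `collections.Counter`. The per-tree votes below are made-up stand-ins for what real trees would predict for one new observation:

```python
from collections import Counter

# Sketch of combining predictions by majority vote (classification).
# Each entry is one tree's predicted class for the same new observation;
# the values are hypothetical.
tree_votes = ["A", "B", "A", "A", "B", "A", "A"]

# most_common(1) returns the (class, count) pair with the most votes.
final_prediction, n_votes = Counter(tree_votes).most_common(1)[0]
print(final_prediction, n_votes)  # → A 5
```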
Cross validation
A method to evaluate the performance of a model by dividing the data into k subsets, training the model on k-1 subsets, and testing it on the remaining subset. This process is repeated k times, with each subset being used as the test set once.
The final performance metric is the average of the performance metrics from each fold.
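The k-fold procedure described above can be sketched in pure Python: partition the observation indices into k folds, hold each fold out once, and average the per-fold scores. The `train_and_score` callback is a placeholder for fitting a real model and computing a real metric:

```python
# Sketch of k-fold cross validation over observation indices.
# `train_and_score` stands in for fitting a model on the training
# indices and scoring it on the held-out test indices.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    fold_size = n // k
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
    folds[-1].extend(range(k * fold_size, n))  # leftovers go to the last fold
    return folds

def cross_validate(n, k, train_and_score):
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the k-1 other folds, test on the held-out fold.
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k  # average metric across folds

# Dummy scorer: reports the held-out fold's share of the data.
avg = cross_validate(10, 5, lambda tr, te: len(te) / 10)
print(avg)  # → 0.2
```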

cp (complexity parameter): a parameter, typically tuned via cross validation, that controls the size of the tree.
- Low cp: More splits allowed → Deeper tree → Higher risk of overfitting.
- High cp: Fewer splits allowed → Shallower tree → More generalized model, but potential underfit if too high.
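The role of cp can be sketched as a stopping rule: a split is kept only if it improves the fit by at least cp. This is a simplification of rpart's cost-complexity criterion, and the improvement values below are made up:

```python
# Sketch: cp as a threshold on the improvement a split must achieve.
# `improvements` lists the (hypothetical) error reduction of each
# successive candidate split, from most useful to least.

def n_splits_kept(improvements, cp):
    """Count splits until one fails to improve the fit by at least cp."""
    kept = 0
    for gain in improvements:
        if gain < cp:
            break  # further splits don't pay for their added complexity
        kept += 1
    return kept

gains = [0.30, 0.12, 0.05, 0.01]
print(n_splits_kept(gains, cp=0.10))   # higher cp stops early → 2 splits
print(n_splits_kept(gains, cp=0.001))  # very low cp keeps all → 4 splits
```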