Notes@HKU by Jax

CART

Introduction

Classification and Regression Trees (CART)

A tree whose leaf nodes represent predicted values and whose non-leaf nodes represent decision rules (splits on predictor values).

To predict the outcome, follow splits in the tree.

Example

If X = 75, Y = 10, we follow the splits down the tree and predict the outcome as Gray.

Minbucket

The minimum number of observations in a leaf node. If the number of observations in a leaf node is less than minbucket, the tree will not split further.

Most complex tree: minbucket = 1 (each leaf node may contain a single observation).

  • Low minbucket: More flexibility in splitting → Deeper tree → Higher risk of overfitting.
  • High minbucket: Fewer splits allowed → Shallower tree → More generalized model, but potential underfit if too high.
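The effect of minbucket on tree depth can be sketched with rpart on the built-in iris data (an illustrative example, not from the notes; the variable names are made up):

```r
# Compare tree sizes for a low vs. high minbucket (cp = 0 so that
# minbucket is the only constraint on splitting)
library(rpart)

small = rpart(Species ~ ., data = iris, method = "class",
              minbucket = 1, cp = 0)   # very flexible: deep tree
large = rpart(Species ~ ., data = iris, method = "class",
              minbucket = 50, cp = 0)  # restrictive: shallow tree

# Count leaf nodes in a fitted rpart tree
leaves = function(fit) sum(fit$frame$var == "<leaf>")
leaves(small) > leaves(large)  # the low-minbucket tree has more leaves
```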

Cross validation

K-fold cross validation

A method to evaluate the performance of a model by dividing the data into k subsets, training the model on k-1 subsets, and testing it on the remaining subset. This process is repeated k times, with each subset being used as the test set once.

The final performance metric is the average of the performance metrics from each fold.
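The procedure can be sketched in base R (synthetic data and a toy classifier, both made up for illustration; caret automates this, as shown later):

```r
# Minimal sketch of k-fold cross validation with accuracy as the metric
set.seed(1)
n = 100
x = rnorm(n)
y = ifelse(x + rnorm(n) > 0, 1, 0)  # synthetic binary outcome

k = 5
folds = sample(rep(1:k, length.out = n))  # assign each row to one of k folds

accs = sapply(1:k, function(i) {
  train_x = x[folds != i]               # train on k-1 folds
  test_x  = x[folds == i]               # test on the held-out fold
  test_y  = y[folds == i]
  # toy classifier: predict 1 when x exceeds the training mean
  pred = ifelse(test_x > mean(train_x), 1, 0)
  mean(pred == test_y)                  # accuracy on this fold
})
mean(accs)  # final metric: average accuracy across the k folds
```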

Complexity parameter (cp)

Similar to minbucket, but applied to the entire tree: a split is kept only if it improves the fit by at least a factor of cp. It controls the size of the tree, with cp ∈ [0, 1].

  • Low cp: More splits allowed → Deeper tree → Higher risk of overfitting.
  • High cp: Fewer splits allowed → Shallower tree → More generalized model, but potential underfit if too high.
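In rpart, a tree grown with a low cp can be cut back with prune() at a higher cp, which removes the weaker splits (an illustrative sketch on the built-in iris data, not from the notes):

```r
# Grow a deep tree with a very low cp, then prune at a higher cp
library(rpart)
full   = rpart(Species ~ ., data = iris, method = "class", cp = 0.001)
pruned = prune(full, cp = 0.2)  # drop splits that improve fit by < 0.2

nrow(pruned$frame) <= nrow(full$frame)  # pruned tree is no larger
```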

Random forest

Enhances the prediction accuracy of CART.

Works by building a large number of CART trees and combining their predictions (e.g., by majority vote for classification).
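The idea can be sketched by bagging rpart trees by hand (an illustrative simplification on the built-in iris data: a real random forest, e.g. the randomForest package, also samples a random subset of predictors at each split, which is omitted here):

```r
# Bagged CART: fit many trees on bootstrap samples, combine by majority vote
library(rpart)
set.seed(42)
n_trees = 25

preds = sapply(1:n_trees, function(i) {
  boot = iris[sample(nrow(iris), replace = TRUE), ]  # bootstrap sample
  fit  = rpart(Species ~ ., data = boot, method = "class", minbucket = 5)
  as.character(predict(fit, newdata = iris, type = "class"))
})

# Majority vote across the trees for each observation
vote = apply(preds, 1, function(p) names(which.max(table(p))))
mean(vote == iris$Species)  # ensemble accuracy (here, on the training data)
```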

CART in R-lang

Fitting a CART model by minbucket

# Split data randomly into training and test sets
library(caTools)
set.seed(123) # for reproducibility
split = sample.split(data$DV, SplitRatio = 0.7)
Train = subset(data, split == TRUE)
Test = subset(data, split == FALSE)
 
# Fit a CART model
library(rpart)
Tree = rpart(DV ~ ., data = Train, method = "class", minbucket = 5)
# Plot the tree
library(rpart.plot)
prp(Tree)

Making predictions

# Make predictions
Predictions = predict(Tree, newdata = Test, type = "class")
table(Test$DV, Predictions) # Print confusion matrix
   Predictions
      0  1
  0  41 36
  1  12 31
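From this confusion matrix, accuracy is the number of correct predictions (the diagonal) over the total:

```r
# Accuracy from the confusion matrix above: correct / total
(41 + 31) / (41 + 36 + 12 + 31)  # = 72 / 120 = 0.6
```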

Fitting a CART model by cp

# K-fold cross validation
library(caret)
numFolds = trainControl(method = "cv", number = 10)
cpGrid = expand.grid(.cp = seq(0.01, 0.5, 0.01)) # candidate cp values from 0.01 to 0.5 in steps of 0.01
# Perform cross validation
train(DV ~ ., data = Train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
# Output of cross-validation
CART 
 
396 samples  
    6 predictors  
    2 classes: '0', '1'  
 
No pre-processing  
Resampling: Cross-Validated (10-fold)  
Summary of sample sizes: 357, 356, 356, 356, 356, 357, ...  
 
Resampling results across tuning parameters:  
 
| cp   | Accuracy | Kappa       |
|------|----------|-------------|
| 0.01 | 0.6010   | 0.1869      |
| 0.02 | 0.6185   | 0.2159      |
| 0.03 | 0.6134   | 0.2142      |
...
 
Accuracy was used to select the optimal model using the largest value.  
 
The final value used for the model was cp = 0.19.
# Create CART model with cp
CART = rpart(DV ~ ., data = Train, method = "class", cp = 0.19)
