Logistic Regression
Introduction
Logistic regression is a statistical method used for binary classification problems, where the outcome variable is categorical and can take on two possible values (e.g., success/failure, yes/no, 1/0). It models the relationship between one or more independent variables and the probability of the dependent variable being in a particular category.
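The outcome of a logistic regression model is $P(y = 1)$, which is the probability of the dependent variable being 1 given the independent variables $x_1, x_2, \ldots, x_k$:

$$P(y = 1) = \frac{1}{1 + e^{-\text{logit}}}, \qquad \text{logit} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

For each unit increase in $x_j$, with the other variables held fixed, the logit increases by $\beta_j$.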
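The relationship between the logit and the probability can be checked numerically. The sketch below assumes a single independent variable and hypothetical coefficients (`beta0`, `beta1` are illustrative values, not taken from any fitted model).

```r
# Hypothetical coefficients for a model with one independent variable x.
beta0 <- -1.5
beta1 <- 0.8

# Linear predictor (logit) and its logistic transform to a probability.
logit <- function(x) beta0 + beta1 * x
prob  <- function(x) 1 / (1 + exp(-logit(x)))

# A one-unit increase in x always adds beta1 to the logit ...
logit_step <- logit(3) - logit(2)

# ... but the resulting change in probability depends on where x starts.
prob_step_low  <- prob(1) - prob(0)
prob_step_high <- prob(6) - prob(5)

print(c(logit_step = logit_step,
        prob_step_low = prob_step_low,
        prob_step_high = prob_step_high))
```

The logit is linear in `x`, while the probability follows an S-shaped curve, so the same step in `x` moves the probability less as it approaches 0 or 1.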
Example
Suppose , and we have observations of independent variables :
Making predictions
The threshold $t$ is a probability at which we classify the outcome as 1 or 0.
- If $P(y = 1) \geq t$, we classify the outcome as 1.
- If $P(y = 1) < t$, we classify the outcome as 0.
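The thresholding rule can be sketched in a few lines of R. The probabilities in `p_hat` and the helper `classify` are hypothetical, and the sketch assumes that a probability exactly equal to the threshold is classified as 1.

```r
# Hypothetical predicted probabilities P(y = 1) for five observations.
p_hat <- c(0.10, 0.45, 0.50, 0.70, 0.95)

# Classify as 1 when the probability reaches the threshold, otherwise 0.
classify <- function(p, threshold) as.integer(p >= threshold)

pred_050 <- classify(p_hat, 0.5)
pred_080 <- classify(p_hat, 0.8)

print(rbind(p_hat, pred_050, pred_080))
```

Raising the threshold makes the model more reluctant to predict 1, so fewer observations are classified as positive.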
A confusion matrix (also called a classification matrix) is a table that summarizes the performance of a classification model by comparing the predicted values with the actual values.
Actual \ Predicted | 0 | 1 |
---|---|---|
0 | True Negative (TN) | False Positive (FP) |
1 | False Negative (FN) | True Positive (TP) |
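Accuracy is the share of correct predictions, $(TP + TN) / (TP + TN + FP + FN)$.

Note: If the confusion matrix is built from predictions on data that were not used to fit the model, the accuracy it gives is the out-of-sample accuracy.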
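As an illustration, the four cells and the rates derived from them can be computed with base R. The label vectors below are hypothetical, and the variable names (`cm`, `tpr`, `fpr`) are chosen only for this sketch.

```r
# Hypothetical actual labels and predicted classes for ten observations.
actual    <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
predicted <- c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0)

# Fix both factor levels so the table is always 2 x 2,
# even if one class is never predicted.
cm <- table(Actual    = factor(actual, levels = c(0, 1)),
            Predicted = factor(predicted, levels = c(0, 1)))
print(cm)

TN <- cm["0", "0"]; FP <- cm["0", "1"]
FN <- cm["1", "0"]; TP <- cm["1", "1"]

accuracy <- (TP + TN) / sum(cm)  # share of correct predictions
tpr      <- TP / (TP + FN)       # true positive rate
fpr      <- FP / (FP + TN)       # false positive rate
```

Rows are the actual classes and columns the predicted classes, matching the layout of the table above.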
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the true positive rate (TPR) against the false positive rate (FPR) at different values of the threshold $t$. The plot always starts at $(0, 0)$ and ends at $(1, 1)$.
The ideal ROC curve passes close to the top-left corner $(0, 1)$, which means a true positive rate of 1 and a false positive rate of 0.
The area under the ROC curve (AUC) measures the model's ability to distinguish between the two outcomes. It typically ranges from $0.5$ to $1$ (perfect classifier). An AUC of $0.5$ means the model is no better than random guessing.
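The ROC points and the AUC can be computed by hand in base R. This is a minimal sketch on hypothetical labels and probabilities; the helper `roc_point` and the trapezoidal-rule AUC are illustrative choices, and dedicated packages offer the same functionality.

```r
# Hypothetical actual labels and predicted probabilities.
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
p_hat  <- c(0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)

# FPR and TPR when classifying as 1 whenever p_hat >= threshold.
roc_point <- function(threshold) {
  predicted <- as.integer(p_hat >= threshold)
  c(fpr = sum(predicted == 1 & actual == 0) / sum(actual == 0),
    tpr = sum(predicted == 1 & actual == 1) / sum(actual == 1))
}

# Sweep thresholds from above the largest score (nothing predicted as 1)
# down to 0 (everything predicted as 1).
thresholds <- c(Inf, sort(unique(p_hat), decreasing = TRUE), 0)
roc <- t(sapply(thresholds, roc_point))

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(roc[, "fpr"]) *
           (head(roc[, "tpr"], -1) + tail(roc[, "tpr"], -1)) / 2)

print(roc)
print(auc)
```

Calling `plot(roc[, "fpr"], roc[, "tpr"], type = "l")` afterwards draws the curve from the first row, $(0, 0)$, to the last row, $(1, 1)$.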
Logistic regression in R-lang
Below is an example of performing logistic regression in R:
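A minimal sketch of such a fit, assuming the built-in `airquality` data set and a hypothetical definition of `Bad_air` as an ozone reading above 60 ppb. Both the data set and the cutoff are illustrative assumptions and are not necessarily those behind the results discussed in this section.

```r
# Assumption: built-in airquality data, with a hypothetical label
# Bad_air = 1 when Ozone exceeds 60 ppb. Rows with missing Ozone are dropped.
air <- subset(airquality, !is.na(Ozone))
air$Bad_air <- as.integer(air$Ozone > 60)

# family = binomial selects logistic regression (logit link by default).
model <- glm(Bad_air ~ Wind + Temp, data = air, family = binomial)

# Coefficient table: estimates, standard errors, z values and p-values.
print(summary(model))
```

The `Estimate` column of the summary holds the intercept and the coefficients of `Wind` and `Temp` on the logit scale.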
The model can be expressed as:
Recall that $P(y = 1) = \frac{1}{1 + e^{-\text{logit}}}$, so we can compute the probability of `Bad_air` given `Wind` and `Temp`.
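The p-value is computed as $2 \cdot P(Z > |z|)$, where $z$ is the coefficient estimate divided by its standard error and $Z$ follows the standard normal distribution. The value of $P(Z > |z|)$ can be found by referencing the standard normal distribution table.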
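Instead of a table lookup, the same quantity can be obtained from `pnorm`. The estimate and standard error below are hypothetical values for illustration only.

```r
# Hypothetical coefficient estimate and standard error.
estimate  <- 0.25
std_error <- 0.10

# z statistic and two-sided p-value from the standard normal distribution.
z <- estimate / std_error
p_value <- 2 * pnorm(-abs(z))

print(c(z = z, p_value = p_value))
```

Using `pnorm(-abs(z))` takes the lower tail directly, which equals the upper tail $P(Z > |z|)$ by symmetry of the normal distribution.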
The significance codes are the same as in linear regression.
Showing the confusion matrix in R:
Each predicted probability is compared with the chosen threshold. For example, a predicted probability of 0.7 for `Bad_air` is above the threshold used here, so we classify it as `1` (bad air).
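A sketch of how such a matrix might be produced, again assuming the built-in `airquality` data, a hypothetical `Bad_air` label (ozone above 60 ppb), and an illustrative threshold of 0.5. None of these are necessarily the values used for the results in this section.

```r
# Assumptions: built-in airquality data, hypothetical label Bad_air
# (Ozone > 60 ppb), and an illustrative threshold of 0.5.
air <- subset(airquality, !is.na(Ozone))
air$Bad_air <- as.integer(air$Ozone > 60)
model <- glm(Bad_air ~ Wind + Temp, data = air, family = binomial)

# type = "response" returns probabilities rather than logits.
p_hat <- predict(model, type = "response")

threshold <- 0.5
cm <- table(Actual    = air$Bad_air,
            Predicted = factor(as.integer(p_hat >= threshold), levels = c(0, 1)))
print(cm)

# In-sample accuracy: share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

Because the predictions are made on the same rows used to fit the model, this accuracy is in-sample. Passing held-out rows through `newdata` in `predict` would give the out-of-sample counterpart.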
The values in the confusion matrix are: