Linear Regression
Single variable linear regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
- Dependent variable: The variable we are trying to predict (y)
- Independent variable: The variable we are using to make predictions (x)
The population model of y with one independent variable x is given by:

$$y = \beta_0 + \beta_1 x + \epsilon$$

where:
- $\beta_0$ is the y-intercept
- $\beta_1$ is the slope
- $\epsilon$ is the error term
Taking the expected value, we can express the model as a regression function:

$$E(y \mid x) = \beta_0 + \beta_1 x$$

where $E(y \mid x)$ is the expected value of y given x. From the straight-line formula, $\beta_0$ is the y-intercept and $\beta_1$ is the slope of the line.
Estimating the regression model from n observations gives:

$$\hat{y} = b_0 + b_1 x$$

where:
- $b_0$ is the estimated y-intercept
- $b_1$ is the estimated slope
A residual is the difference between the observed value of the dependent variable and the predicted value: $e_i = y_i - \hat{y}_i$.
The coefficient of determination $R^2$, ranging from 0 to 1, gives the proportion of the variance in y that is explained by the regression model:

$$R^2 = 1 - \frac{SSR}{SST}$$

where:
- SSR is the sum of squared residuals, $\sum_i (y_i - \hat{y}_i)^2$
- SST is the total sum of squares, $\sum_i (y_i - \bar{y})^2$
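A minimal sketch of these quantities in R, using the built-in `cars` dataset purely as an assumed example:

```r
# Fit a one-variable model and recompute R^2 by hand.
model <- lm(dist ~ speed, data = cars)

e   <- residuals(model)                       # residuals e_i = y_i - y_hat_i
SSR <- sum(e^2)                               # sum of squared residuals
SST <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares

1 - SSR / SST   # equals summary(model)$r.squared
```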
Multi-variable linear regression
The population model of y with k independent variables is given by:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$

where:
- $y$ is the dependent variable
- $x_1, \dots, x_k$ are the independent variables
- $\beta_0$ is the y-intercept
- $\beta_1, \dots, \beta_k$ are the slopes
We would want the independent variables (IVs) to be uncorrelated with each other, but in the real world they are often collinear. If the correlations are too large, the estimated coefficients will be unreliable. This is called multicollinearity.
The variance inflation factor (VIF) checks for multicollinearity of an IV. Use `vif(model)` from the `car` package. If the value is below a common threshold (often 5, sometimes 10), there is no problem; otherwise, consider dropping the variable.
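A minimal sketch, assuming the `car` package and R's built-in `airquality` dataset:

```r
# Check the VIF of each IV in a fitted model; vif() comes from the car package.
library(car)

model <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
vif(model)   # one VIF per IV; values above the threshold flag multicollinearity
```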
Adjusted $R^2$ is a modified version of $R^2$ that adjusts for the number of IVs (produced in `summary(model)`).
Making predictions
- In-sample R square: The proportion of variance explained by the model on the training data.
- Out-of-sample R square: The proportion of variance explained by the model on the test data. The value can be negative, indicating that the model is worse than a simple mean prediction (see the sketch below).
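A minimal sketch of the out-of-sample computation, again assuming the built-in `airquality` dataset; note that SST uses the training mean as the baseline prediction:

```r
# Split into train/test, fit on train, evaluate R^2 on test.
df <- na.omit(airquality)
set.seed(1)
idx   <- sample(nrow(df), 0.7 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

model <- lm(Ozone ~ Solar.R + Wind + Temp, data = train)
pred  <- predict(model, newdata = test)

SSE <- sum((test$Ozone - pred)^2)
SST <- sum((test$Ozone - mean(train$Ozone))^2)  # baseline: training mean
1 - SSE / SST   # negative => worse than predicting the mean
```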
Linear regression in R-lang
Below is an example of performing linear regression in R:
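A minimal sketch (the data frame name `air` is an assumption, and the original model may have included further IVs alongside `Temp`):

```r
# Fit a linear model predicting Air_quality from Temp; `air` is an assumed
# data frame holding the observations.
model <- lm(Air_quality ~ Temp, data = air)
summary(model)   # coefficient table with Pr(>|t|), R^2, F-statistic
```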
The model can be expressed using the estimated coefficients reported by `summary(model)`; here the coefficient on `Temp` is 1.8402. Hence, for example, if `Temp` increases by 2, the predicted `Air_quality` will increase by 1.8402 * 2 = 3.6804, all else being equal.
The F-test is used to test the overall significance of the regression model. It tests whether at least one of the independent variables is significantly related to the dependent variable.
The null and alternative hypotheses of the F-test are:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
$$H_1: \text{at least one } \beta_j \neq 0$$
As the p-value is less than 0.05, we reject the null hypothesis and conclude that at least one of the independent variables is significant.
The significance code is computed by comparing the p-value (`Pr(>|t|)`) against the following thresholds:
- `***`: 0.001
- `**`: 0.01
- `*`: 0.05
- `.`: 0.1
Other functions
`dim(data)` returns the dimensions of a data frame as rows × columns. The number of rows is the number of observations, and the number of columns is the number of variables.
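For example, on R's built-in `airquality` data frame:

```r
dim(airquality)    # 153 6 -> 153 observations, 6 variables
nrow(airquality)   # rows only
ncol(airquality)   # columns only
```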
Difference in differences
Correlation does not mean causation. We use difference in differences (DiD) to estimate the causal effect (causation) of a treatment (intervention).
T is the treatment group and C is the control group.
Example
The treatment group is the group that receives the treatment, and the control group is the group that does not receive the treatment.
| Time Period | Year | Control Group (Sales) [C] | Treated Group (Sales) [T] |
|---|---|---|---|
| Before implementation | Year 1 | 1500 | 1400 |
| After implementation | Year 2 | 990 | 1000 |
| Change [Δ] | | -510 | -400 |
Compute DiD:

$$\text{DiD} = \Delta_T - \Delta_C = (-400) - (-510) = 110$$

So the treatment is estimated to have increased sales by 110 relative to what would have happened without it.
Using linear regression:

$$y = \beta_0 + \beta_1 T + \beta_2 \, Post + \beta_3 (T \times Post) + \epsilon$$

where:
- $T$ is the treatment group dummy (1 if treated, 0 if not)
- $Post$ is the post-treatment period dummy (1 if post, 0 if pre)
- $T \times Post$ is the interaction term (1 if treated AND post, 0 otherwise)
- $\beta_0$ is the baseline level (control group pre-treatment)
- $\beta_0 + \beta_1$ is the treatment group pre-treatment level
- $\beta_0 + \beta_2$ is the control group post-treatment level
- $\beta_3$ is the treatment effect (DiD)
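A minimal sketch fitting this regression in R on the four cells of the table above (the data frame construction is illustrative):

```r
# Each row is one cell of the table: (treated, post) -> sales.
sales <- data.frame(
  y       = c(1500, 1400, 990, 1000),
  treated = c(0, 1, 0, 1),
  post    = c(0, 0, 1, 1)
)

model <- lm(y ~ treated * post, data = sales)  # treated + post + treated:post
coef(model)["treated:post"]                    # beta_3, the DiD estimate: 110
```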
Example
...