Notes@HKU by Jax

Linear Regression

Single variable linear regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.

Dependent and independent variables

  • Dependent variable: The variable we are trying to predict (y)
  • Independent variable: The variable we are using to make predictions (x)

Simple regression model

The population model of y with one independent variable x is given by:

y = \beta_0 + \beta_1 x

where:

  • \beta_0 is the y-intercept
  • \beta_1 is the slope

We can express the model as a regression function:

E[y|x] = \beta_0 + \beta_1 x

where E[y|x] is the expected value of y given x. From the straight-line formula, \beta_0 is the y-intercept and \beta_1 is the slope of the line.

Estimated regression function

The regression model estimated from n observations (x_i, y_i) is:

\hat{y} = \hat{\beta_0} + \hat{\beta_1} x

where:

  • \hat{\beta_0} is the estimated y-intercept
  • \hat{\beta_1} is the estimated slope
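
A minimal R sketch of obtaining these estimates (assuming a hypothetical data frame df with columns x and y):

# Fit a simple linear regression of y on x (df is a hypothetical data frame)
model <- lm(y ~ x, data = df)

# Estimated intercept and slope (beta_0 hat and beta_1 hat)
coef(model)

# Predicted values y hat for each observation
head(fitted(model))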

Residuals

The residual is the difference between the observed value of the dependent variable and the predicted value:

e_i = y_i - \hat{y_i}

Sum of Squared Errors (SSE)

SSE = \sum_{i=1}^{n} (y_i - \hat{y_i})^2

Total Sum of Squares (SST)

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

where:

  • \bar{y} is the mean of the dependent variable

Coefficient of Determination (R square)

The coefficient of determination R^2, which ranges from 0 to 1, gives the proportion of the variance in y that is explained by the regression model:

R^2 = 1 - \frac{SSE}{SST}

where:

  • SSE is the sum of squared errors
  • SST is the total sum of squares
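
A minimal R sketch of computing SSE, SST, and R^2 by hand (continuing the hypothetical model and data frame df from above):

# Residuals e_i = y_i - y_i hat
e <- residuals(model)

# Sum of squared errors and total sum of squares
SSE <- sum(e^2)
SST <- sum((df$y - mean(df$y))^2)

# Coefficient of determination
R2 <- 1 - SSE / SST
R2

# Should match the value reported by summary()
summary(model)$r.squared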

Multi-variable linear regression

Multiple regression model

The population model of y with k independent variables is given by:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k

where:

  • y is the dependent variable
  • x_1, x_2, ..., x_k are the independent variables
  • \beta_0 is the y-intercept
  • \beta_1, \beta_2, ..., \beta_k are the slopes

Multicollinearity

We would want IVs to be uncorrelated with each other, but in the real world they are often collinear. If the correlations are too large, the estimated coefficients become unreliable. This is called multicollinearity.

Variance inflation factors

Checks for multicollinearity of an IV. Use vif(model): if the value is < 10 there is no problem; otherwise, drop the variable.
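
A minimal sketch (assuming a fitted lm() model with several IVs; vif() comes from the car package):

library(car)   # provides vif()

# Variance inflation factor for each IV in the fitted model
vif(model)

# Rule of thumb from above: values < 10 are acceptable;
# otherwise consider dropping the offending variable.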

Adjusted R square

A modified version of R^2 that adjusts for the number of IVs (reported by summary(model)).
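
It can also be extracted directly from the summary object; a quick sketch:

# Adjusted R-squared of a fitted lm() model
summary(model)$adj.r.squared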

Making predictions

In-sample and out-of-sample R square

  • In-sample R square: The proportion of variance explained by the model on the training data.
  • Out-of-sample R square: The proportion of variance explained by the model on the test data. The value can be negative, indicating that the model is worse than a simple mean prediction.
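
A minimal R sketch of computing out-of-sample R^2 (assuming hypothetical train and test data frames with the same columns; here the training-set mean is used as the baseline, which is one common convention):

# Fit on the training data only
model <- lm(y ~ x, data = train)

# Predict on the held-out test data
pred <- predict(model, newdata = test)

# Out-of-sample R^2
SSE <- sum((test$y - pred)^2)
SST <- sum((test$y - mean(train$y))^2)
1 - SSE / SST   # can be negative if the model is worse than predicting the mean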

Linear regression in R-lang

Below is an example of performing linear regression in R, followed by the summary output it produces:
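
A sketch of the commands that could produce this output (assuming a data frame data with columns Air_quality, Wind and Temp):

# Fit Air_quality on Wind and Temp, then print the model summary
model <- lm(Air_quality ~ Wind + Temp, data = data)
summary(model)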

Call:
lm(formula = Air_quality ~ Wind + Temp, data = data)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-41.251  -13.695  -2.856   11.390  100.367 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -71.0332    23.5780  -3.013  0.0032 ** 
Wind          -3.0555     0.6633  -4.607  0.00001 ***
Temp           1.8402     0.2500   7.362  0.00000 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 21.85 on 113 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.5687,	Adjusted R-squared:  0.5611 
F-statistic: 74.5 on 2 and 113 DF,  p-value: < 2.2e-16

Interpreting the estimates

The model can be expressed as:

\text{Air quality} = -71.0332 - 3.0555 \, \text{Wind} + 1.8402 \, \text{Temp}

Hence, for example, if Temp increases by 2, Air_quality will increase by 1.8402 * 2 = 3.6804, all else being equal.

F-test

The F-test is used to test the overall significance of the regression model. It tests whether at least one of the independent variables is significantly related to the dependent variable.

The null and alternative hypotheses of the F-test are:

H_0: \beta_1 = \beta_2 = 0 \\ H_1: \exists i \in \{1,2\} \text{ such that } \beta_i \neq 0

As the p-value is less than 0.05, we reject the null hypothesis, and conclude that at least one of the independent variables is significant.

Multiple R-squared

The proportion of variation in the dependent variable explained by the model; here 0.5687, i.e. about 56.9%.

t-value

The t-value is computed as Estimate / Std. Error.
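
For example, taking the Wind row of the output above: t = -3.0555 / 0.6633 \approx -4.607, which matches the reported t value.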

Significant codes

The significance code is determined by comparing the p-value (Pr(>|t|)) against the following thresholds:

  • ***: 0.001
  • **: 0.01
  • *: 0.05
  • .: 0.1
  • ‘ ’: 1

Other functions

Dimension of data

> dim(data)
[1] 151  3

The output is rows, then columns. The number of rows is the number of observations, and the number of columns is the number of variables.

Difference in differences

Correlation does not imply causation. We use DiD to estimate the causal effect (causation) of a treatment (intervention).

DiD

DiD = \Delta T - \Delta C

T is the treatment group and C is the control group.

Example

The treatment group is the group that receives the treatment, and the control group is the group that does not receive the treatment.

                      | Time Period | Control Group (Sales) [C] | Treated Group (Sales) [T]
Before implementation | Year 1      | 1500                      | 1400
After implementation  | Year 2      | 990                       | 1000
Change [\Delta]       |             | -510                      | -400

Compute DiD: DiD = \Delta T - \Delta C = -400 - (-510) = 110

DiD by linear regression

Using linear regression:

y = \beta_0 + \beta_1 T + \beta_2 P + \beta_3 (T * P)

where:

  • T is the treatment group indicator (1 if treated, 0 if not)

  • P is the post-treatment period indicator (1 if post, 0 if pre)

  • T * P is the interaction term (1 if treated AND post, 0 otherwise)

  • \beta_0 = E[y | T=0, P=0] (control group, pre-treatment)

  • \beta_1 = E[y | T=1, P=0] - E[y | T=0, P=0] (pre-treatment difference between treatment and control groups)

  • \beta_2 = E[y | T=0, P=1] - E[y | T=0, P=0] (change in the control group from pre- to post-treatment)

  • \beta_3 = E[y | T=1, P=1] - E[y | T=0, P=1] - E[y | T=1, P=0] + E[y | T=0, P=0] (the DiD estimate)
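
A minimal R sketch of estimating the DiD coefficient with lm() (assuming a hypothetical data frame did_data with 0/1 columns T and P and outcome y):

# T * P in the formula expands to T + P + T:P,
# so beta_3 is the coefficient on the T:P interaction term
did_model <- lm(y ~ T * P, data = did_data)
summary(did_model)

# The DiD estimate (beta_3)
coef(did_model)["T:P"]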

Example

...

Requirements for DiDs

  1. Both "pre" and "post" periods must be observed for both treatment and control groups.
  2. Must have a control and a treated group, where the control group should exhibit a parallel trend before the treatment.
