Notes@HKU by Jax

Statistics

Random variables

A numerical description of the outcome of a random experiment is called a random variable. It can be either discrete or continuous.

Notation

Random variables are denoted with capital letters $X$
The possible outcomes are denoted with lowercase letters $x$
The probability that the outcome of $X$ is $x$ is denoted by $P(X=x)$

Expected value and Variance

$E(X^n)=\sum\left(x^n\cdot P(X=x)\right)$ gives the $n$-th moment; with $n=1$, $E(X)=\sum x\cdot P(X=x)$ is the expected value, which represents the mean value (outcome) of the random variable.
$Var(X)=E(X^2)-E(X)^2$ gives the variance, which is a measure of the variability of the random variable's outcomes.
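As a concrete illustration, these formulas can be computed directly. The helper names `moment` and `variance` below are hypothetical, and a fair six-sided die is used as an example distribution:

```python
# E(X^n) = sum of x^n * P(X = x), and Var(X) = E(X^2) - E(X)^2,
# illustrated with a fair six-sided die.
def moment(dist, n):
    """n-th moment E(X^n) of a discrete distribution {outcome: probability}."""
    return sum(x ** n * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - E(X)^2."""
    return moment(dist, 2) - moment(dist, 1) ** 2

die = {x: 1 / 6 for x in range(1, 7)}  # P(X = x) = 1/6 for x = 1..6
print(moment(die, 1))  # E(X) ≈ 3.5
print(variance(die))   # Var(X) = 35/12 ≈ 2.917
```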

Operations of E(X) and Var(X)

$E(X+Y)=E(X)+E(Y)$; expectation is linear, so any sum or difference inside $E(\cdot)$ can be expanded term by term.

If XX and YY are independent:

$Var(X+Y)=Var(X)+Var(Y),\quad\quad E(XY)=E(X)\cdot E(Y)$
For $Y=aX+b$:
$E(Y)=aE(X)+b,\quad\quad Var(Y)=a^2Var(X)$
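These identities can be verified numerically. The small distribution below is hypothetical, chosen only for illustration:

```python
# Checking E(aX + b) = aE(X) + b and Var(aX + b) = a^2 Var(X)
# for a small discrete distribution (hypothetical values).
def expectation(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    ex2 = sum(x ** 2 * p for x, p in dist.items())
    return ex2 - expectation(dist) ** 2

X = {0: 0.2, 1: 0.5, 2: 0.3}
a, b = 3, 4
Y = {a * x + b: p for x, p in X.items()}  # distribution of Y = aX + b

assert abs(expectation(Y) - (a * expectation(X) + b)) < 1e-9
assert abs(variance(Y) - a ** 2 * variance(X)) < 1e-9
```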

Probability distributions

A continuous random variable has probabilities given by areas under its density curve. Hence, $P(X=n)$ for any single outcome $n$ is $0$. A discrete random variable assigns a specific probability to each outcome.

Operation on continuous ranges

$P(X<x)=P(X\leq x)$
$P(X>x)=1 - P(X<x)$
$P(a\leq X\leq b)=P(X<b)-P(X<a)$

Operation on discrete ranges

$P(X<x)=P(X\leq x-1)$
$P(X>x)=1-P(X\leq x)$
$P(a\leq X\leq b)=P(X\leq b)-P(X\leq a-1)$
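These identities can be checked numerically; as a hypothetical example, take a fair six-sided die:

```python
# Discrete range identities for a fair die: P(X = x) = 1/6 for x = 1..6.
pmf = {x: 1 / 6 for x in range(1, 7)}

def cdf(x):
    """P(X <= x): the sum of point probabilities up to and including x."""
    return sum(p for k, p in pmf.items() if k <= x)

p_less_4 = sum(p for k, p in pmf.items() if k < 4)
assert abs(p_less_4 - cdf(3)) < 1e-12          # P(X < 4) = P(X <= 3)
assert abs((1 - cdf(4)) - 2 / 6) < 1e-12       # P(X > 4) = 1 - P(X <= 4)
assert abs((cdf(5) - cdf(1)) - 4 / 6) < 1e-12  # P(2 <= X <= 5) = P(X <= 5) - P(X <= 1)
```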

Probability distribution functions

A p.d.f. (probability density function) is a continuous function whose values are probability densities, not probabilities: probabilities are obtained by integrating it over a range. The following is an example of a p.d.f.:

$X\sim f(x) = \begin{cases} Kx^2 & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}$

Using the function

$P(X=x)=0$
$P(a<X<b)=P(a\leq X\leq b)=\int_{a}^{b}f(x)\,dx$

Note that $\int_{-\infty}^{\infty}f(x)\,dx = 1$, as the total probability of any event is 1. This condition must hold for $f(x)$ to be a valid p.d.f. Hence, for the example, $K=3$.

Expected value

$E(X^n)=\int_{-\infty}^{\infty}x^nf(x)\,dx$
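A numeric check of the example p.d.f. (assuming $K=3$), using a simple midpoint Riemann sum as a stand-in for the integral:

```python
# Verifying that f(x) = 3x^2 on [0, 1] integrates to 1 (so K = 3),
# and computing E(X) = integral of x * f(x) dx = 3/4.
def integrate(f, a, b, steps=100_000):
    """Midpoint Riemann sum approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

f = lambda x: 3 * x ** 2
total = integrate(f, 0, 1)
ex = integrate(lambda x: x * f(x), 0, 1)
print(total)  # ≈ 1.0: f is a valid p.d.f., confirming K = 3
print(ex)     # ≈ 0.75 = 3/4
```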

Cumulative distribution function

To obtain the c.d.f. $F(x)$, where $F(x) = P(X \leq x)$, all we have to do is integrate the p.d.f.:

$F(x)=\int_{-\infty}^{x} f(t)\,dt$

Using our example:

$F(x)=\begin{cases} 1 & x \geq 1 \\ x^3 & 0 \leq x < 1 \\ 0 & x < 0 \end{cases}$

And to convert a c.d.f. back to its p.d.f., all we have to do is differentiate $F(x)$:

$f(x)=\frac{d}{dx}F(x)$
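Applying this to the example c.d.f.: differentiating $F(x)=x^3$ on the interval where it is non-trivial recovers the original p.d.f.:

```latex
f(x) = \frac{d}{dx}F(x) = \frac{d}{dx}\,x^3 = 3x^2, \qquad 0 \leq x \leq 1
```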

Normal distribution

The Normal distribution is given by the following formula (which you don't have to memorize):

$X\sim N(\mu, \sigma^2)$ has density $f(x)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},\quad\quad\text{where }\underbrace{\mu}_{\text{mean}},\ \underbrace{\sigma^2}_{\text{variance}}$

To calculate $P(X<x)$, we standardize using $Z\sim N(0, 1)$: $P(X<x)=P\left(Z<\frac{x-\mu}{\sigma}\right)$. Here $Z$ is the standard normal variable.
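The standardization can be checked with Python's standard library. The values of $\mu$, $\sigma$, and $x$ below are hypothetical:

```python
# P(X < x) for X ~ N(mu, sigma^2) via standardization,
# using the stdlib NormalDist (Python 3.8+). Values are hypothetical.
from statistics import NormalDist

mu, sigma = 50, 10
x = 65
z = (x - mu) / sigma             # standardize: z = (x - mu) / sigma
p = NormalDist().cdf(z)          # P(Z < z) from the standard normal
print(round(z, 2), round(p, 4))  # z = 1.5, p ≈ 0.9332
```

`NormalDist(mu, sigma).cdf(x)` gives the same probability directly, without manual standardization.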

Reading the Z-table

For $P(Z<z)=p$: to find $p$, locate the row (leftmost column) and column (header) in the z-table whose values sum to $z$. The intersecting cell is $p$.
$Z_p$ gives the value $z$ such that $P(Z<z)=p$

Sampling

Sampling notation

For a given population of size $N$, $\mu$ is the true mean and $\sigma^2$ is the true variance.

A sample is a subset of the population of size $n$. The sample mean is $\bar{x}$ and the sample variance is $s^2$. They can be calculated as:

$\bar{x}=\frac{1}{n}\sum x_i,\quad\quad s^2=\frac{1}{n-1}\sum(x_i-\bar{x})^2$
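A sketch of these formulas in Python, checked against the standard library's `statistics` module (whose `variance` also uses the $n-1$ divisor). The data values are hypothetical:

```python
# Sample mean and sample variance (n-1 divisor), computed by hand
# and cross-checked against the statistics module.
from statistics import mean, variance

data = [4.0, 7.0, 6.0, 5.0, 8.0]  # hypothetical sample
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)

assert xbar == mean(data)
assert abs(s2 - variance(data)) < 1e-12
print(xbar, s2)  # 6.0 2.5
```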

Definitions

The statistics computed from a random sample of the population (e.g. $\bar{x}$, $s^2$) are themselves random variables.

The sampling distribution is the distribution of the statistics of the random samples.

Central Limit Theorem

The random variable $\bar{X}$ is the mean of a random sample of $n$ observations, where:

  1. The observations are independent.
  2. The observations are identically distributed.
  3. $n \geq 30$ (required for the normal approximation when the population itself is not normal).
$\bar{X}=\frac{X_1+X_2+\dots+X_n}{n}$

If $X$ is normally distributed (or $n \geq 30$, by the CLT):

$\bar{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right),\quad \frac{\sigma}{\sqrt{n}}\text{ is the standard error (the standard deviation of }\bar{X}\text{)}$

The standard error captures how far the sample statistic tends to be from the true population value. It decreases as $n$ increases.

The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.
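A small simulation illustrating the theorem, using a (non-normal) uniform population as a hypothetical example:

```python
# Means of samples drawn from Uniform(0, 1) cluster around mu = 0.5
# with spread close to sigma / sqrt(n), where sigma^2 = 1/12.
import random
from statistics import mean, stdev

random.seed(0)
n, trials = 30, 2000
sample_means = [mean(random.uniform(0, 1) for _ in range(n)) for _ in range(trials)]

print(round(mean(sample_means), 3))   # close to mu = 0.5
print(round(stdev(sample_means), 3))  # close to sqrt(1/(12*30)) ≈ 0.053
```

A histogram of `sample_means` would look bell-shaped even though the underlying population is flat.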

Hypothesis testing

Test values

Test statistic: a value calculated from the sample

Null hypothesis $H_0$: the hypothesis assumed correct by default

Alternative hypothesis $H_1$: the claim about the parameter accepted if the assumption is shown wrong

Conducting tests

  1. Define null and alternative hypotheses
    • Upper tail test: $H_1:\mu > \mu_0$
    • Lower tail test: $H_1:\mu < \mu_0$
    • Two-tailed test: $H_1:\mu \neq \mu_0$
    • $H_0$ is the opposite of $H_1$ (here $H_0:\mu = \mu_0$)
  2. Identify the variables $\bar{x}, \mu_0, n, \alpha$, and $\sigma$ OR $s$
  3. Calculate the test statistic:
    • Given population standard deviation $\sigma$: z-score $z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$
    • Given sample standard deviation $s$: t-score $t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}$ (with $n-1$ degrees of freedom)
  4. Calculate the p-value (use the corresponding table)
    • Upper tail: $1-P(Z<z)$ or $1-P(T<t)$
    • Lower tail: $P(Z<z)$ or $P(T<t)$
    • Two-tailed: $2P(Z<-|z|)$ or $2P(T<-|t|)$
  5. Compare the p-value with $\alpha$:
    • If $p<\alpha$, reject $H_0$ (evidence against $H_0$)
    • If $p\geq\alpha$, do not reject $H_0$ (no evidence against $H_0$)
  6. State the conclusion

Note that the p-value does not give the chance that $H_0$ is true or false, but rather the chance of observing sample data at least this extreme if $H_0$ is true.
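The steps above can be sketched as a worked upper-tail z-test. All the numbers here are hypothetical:

```python
# Upper-tail z-test: H0: mu = 100 vs H1: mu > 100,
# with known sigma = 15, n = 36, xbar = 105, alpha = 0.05 (hypothetical values).
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n, alpha = 105, 100, 15, 36, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))  # step 3: test statistic
p = 1 - NormalDist().cdf(z)           # step 4: upper-tail p-value
print(round(z, 2), round(p, 4))       # z = 2.0, p ≈ 0.0228
print("reject H0" if p < alpha else "do not reject H0")  # step 5: p < 0.05, reject
```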
