Notes@HKU by Jax

Statistics

Random variables

A numerical description of the outcome of a random experiment is called a random variable. It can be either discrete or continuous.

Notation

Random variables are denoted with capital letters X
The possible outcomes are denoted with lowercase letters x
The probability that the outcome of X is x is denoted by P(X=x)

Expected value and Variance

E(X^n)=\sum(x^n\cdot P(X=x)); with n=1 this gives the expected value, which represents the mean value (outcome) of the random variable.
Var(X)=E(X^2)-E(X)^2 gives the variance, which is a measure of the variability of the random variable's outcomes.
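As a minimal sketch of the two formulas (the fair six-sided die here is an illustrative choice, not from the notes):

```python
# Expected value and variance of a discrete random variable.
# Illustrative example: a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

E_X = sum(x * p for x, p in zip(outcomes, probs))      # E(X) = 3.5
E_X2 = sum(x**2 * p for x, p in zip(outcomes, probs))  # E(X^2) = 91/6
Var_X = E_X2 - E_X**2                                  # Var(X) = E(X^2) - E(X)^2 = 35/12

print(round(E_X, 4), round(Var_X, 4))
```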

Operations of E(X) and Var(X)

E(X+Y)=E(X)+E(Y); any addition/subtraction inside E(\cdot) can be expanded this way.

If XX and YY are independent:

Var(X+Y) = Var(X) + Var(Y),\quad\quad E(XY) = E(X)\cdot E(Y)
For Y=aX+b:
E(Y)=aE(X)+b,\quad\quad Var(Y)=a^2Var(X)
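The linear-transformation rules can be checked by direct enumeration (the small distribution and a=2, b=3 are illustrative values):

```python
# Checking E(aX + b) = aE(X) + b and Var(aX + b) = a^2 Var(X)
# on a small discrete distribution (illustrative values).
outcomes = [0, 1, 2]
probs = [0.2, 0.5, 0.3]
a, b = 2, 3

E_X = sum(x * p for x, p in zip(outcomes, probs))
Var_X = sum(x**2 * p for x, p in zip(outcomes, probs)) - E_X**2

# Distribution of Y = aX + b: same probabilities, transformed outcomes
E_Y = sum((a * x + b) * p for x, p in zip(outcomes, probs))
Var_Y = sum((a * x + b)**2 * p for x, p in zip(outcomes, probs)) - E_Y**2

assert abs(E_Y - (a * E_X + b)) < 1e-12
assert abs(Var_Y - a**2 * Var_X) < 1e-12
```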

Probability distributions

A continuous random variable has probabilities as areas under its curve. Hence, P(X=n) for any single outcome n is 0. A discrete random variable has specific probabilities assigned to each outcome.

Operation on continuous ranges

P(X<x)=P(X\leq x)
P(X>x)=1 - P(X<x)
P(a\leq X\leq b)=P(X<b)-P(X<a)

Operation on discrete ranges

P(X<x)=P(X\leq x-1)
P(X>x)=1-P(X\leq x)
P(a\leq X\leq b)=P(X\leq b)-P(X\leq a-1)
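The discrete-range identities can be sketched with a cumulative function F(k) = P(X \leq k); the choice X ~ Binomial(10, 0.5) is illustrative:

```python
# Discrete-range identities, using the c.d.f. F(k) = P(X <= k).
# Illustrative choice: X ~ Binomial(10, 0.5).
from math import comb

def F(k, n=10, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_lt4 = F(3)          # P(X < 4)       = P(X <= 3)
p_gt4 = 1 - F(4)      # P(X > 4)       = 1 - P(X <= 4)
p_2to6 = F(6) - F(1)  # P(2 <= X <= 6) = P(X <= 6) - P(X <= 1)
print(round(p_lt4, 4), round(p_gt4, 4), round(p_2to6, 4))
```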

Probability distribution functions

A p.d.f. is a continuous function whose values give the probability density of each outcome; probabilities are obtained as areas under its curve. The following is an example of a p.d.f.:

X\sim f(x) = \begin{cases} Kx^2 & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}

Using the function

P(X=x)=0
P(a<X<b)=P(a\leq X\leq b)=\int_{a}^{b}f(x)dx

Note that \int_{-\infty}^{\infty}f(x)dx = 1, as the total probability of any event is 1. This condition must hold for f(x) to be a valid p.d.f. Hence for the example, K=3.
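The normalization K=3 can be checked numerically; a simple midpoint Riemann sum is enough here (the grid size is an arbitrary choice):

```python
# Numerically checking that K = 3 makes f(x) = Kx^2 on [0, 1] a valid
# p.d.f. (total area 1), using a midpoint Riemann sum.
def f(x, K=3):
    return K * x**2 if 0 <= x <= 1 else 0.0

N = 100_000
dx = 1 / N
area = sum(f((i + 0.5) * dx) * dx for i in range(N))
print(round(area, 6))  # ≈ 1.0
```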

Expected value

E(X^n)=\int_{-\infty}^{\infty}x^nf(x)dx
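For the example p.d.f., E(X) = \int_0^1 x\cdot 3x^2\,dx = 3/4; a quick numeric check (same midpoint-sum sketch as above):

```python
# E(X) for the example p.d.f. f(x) = 3x^2 on [0, 1]:
# E(X) = integral of x * 3x^2 dx from 0 to 1 = 3/4.
N = 100_000
dx = 1 / N
E_X = sum(3 * ((i + 0.5) * dx)**3 * dx for i in range(N))
print(round(E_X, 4))  # ≈ 0.75
```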

Cumulative distribution function

To obtain the c.d.f. F(x), where F(x) = P(X \leq x) (equal to P(X < x) for a continuous variable), all we have to do is integrate the p.d.f.:

F(x)=\int_{-\infty}^{x}f(t)\,dt

Using our example:

F(x)=\begin{cases} 1 & x > 1 \\ x^3 & 0 \leq x \leq 1 \\ 0 & x < 0 \end{cases}

And to convert a c.d.f back to its p.d.f, all we have to do is differentiate F(x)F(x):

f(x)=\frac{d}{dx}F(x)
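Once the c.d.f. is known, interval probabilities come from simple subtraction; a sketch using the example F(x) = x^3 (the interval 0.2 to 0.5 is an arbitrary choice):

```python
# Evaluating probabilities with the example c.d.f. F(x) = x^3 on [0, 1].
def F(x):
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x**3

# P(0.2 < X < 0.5) = F(0.5) - F(0.2) = 0.125 - 0.008 = 0.117
print(round(F(0.5) - F(0.2), 3))  # 0.117
```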

Normal distribution

The Normal distribution is given by the following formula (which you don't have to memorize):

X\sim N(\mu, \sigma^2):\quad f(x)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},\quad\quad\text{where }\underbrace{\mu}_{\text{mean}},\ \underbrace{\sigma^2}_{\text{variance}}

To calculate P(X<x), we standardize using the standard normal variable Z\sim N(0, 1): P(X<x)=P\left(Z<\frac{x-\mu}{\sigma}\right).
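The standardization step can be sketched in code, with the Z-table lookup replaced by math.erf (the parameters mu = 50, sigma = 10, x = 65 are illustrative):

```python
# Standardizing X ~ N(mu, sigma^2) and computing P(X < x) with the
# standard normal c.d.f. Phi via math.erf (replacing a Z-table lookup).
from math import erf, sqrt

def Phi(z):
    """P(Z < z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 50, 10  # illustrative parameters
x = 65
z = (x - mu) / sigma     # z = (x - mu) / sigma = 1.5
print(round(Phi(z), 4))  # P(Z < 1.5) ≈ 0.9332
```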

Reading the Z-table

For P(Z<z)=p, to find p, locate the header and leftmost column in the z-table such that their sum is z. The corresponding intersecting cell is p.
Z_p gives the value z in P(Z<z)=p

Sampling

Sampling notation

For a given population of size N, \mu is the true mean and \sigma^2 is the true variance.

A sample is a subset of the population of size n. The sample mean is \bar{x} and the sample variance is s^2. They can be calculated as:

\bar{x}=\frac{1}{n}\sum x_i,\quad\quad s^2=\frac{1}{n-1}\sum(x_i-\bar{x})^2
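A sketch of both formulas on an illustrative sample (note the n - 1 denominator in the sample variance, which Python's statistics.variance also uses):

```python
# Sample mean and sample variance (n - 1 denominator), matching the
# formulas above; statistics.variance uses n - 1 as well.
import statistics

data = [4.1, 5.0, 3.8, 4.6, 4.9]  # an illustrative sample
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar)**2 for x in data) / (n - 1)

print(round(xbar, 2), round(s2, 3))
assert abs(s2 - statistics.variance(data)) < 1e-9
```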

Definitions

The statistics computed from each random sample of the population are themselves random.

The sampling distribution is the distribution of the statistics of the random samples.

Central Limit Theorem

The random variable \bar{X} is the mean of a random sample of n observations:

\bar{X}=\frac{X_1+X_2+\dots+X_n}{n}

The central limit theorem requires that:

  1. The observations are independent.
  2. The observations are identically distributed.
  3. n \geq 30.

If XX is normally distributed:

\bar{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right),\quad \frac{\sigma}{\sqrt{n}}\text{ is the standard error (deviation)}

The standard error captures how far the sample statistic typically is from the true population value. It decreases as n increases.

The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.
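A small simulation illustrates the theorem (the uniform distribution, n = 36, and 2000 trials are illustrative choices): sample means from a non-normal population still cluster around \mu with spread close to \sigma/\sqrt{n}.

```python
# Simulating the CLT: means of n = 36 draws from a non-normal
# Uniform(0, 1) distribution (mu = 0.5, sigma^2 = 1/12) cluster
# around mu with spread close to the standard error sigma / sqrt(n).
import random
from statistics import mean, stdev

random.seed(0)
n, trials = 36, 2000
sample_means = [mean(random.random() for _ in range(n)) for _ in range(trials)]

se = (1 / 12)**0.5 / n**0.5  # standard error ≈ 0.0481
print(abs(mean(sample_means) - 0.5) < 0.01)  # True: centred on mu
print(abs(stdev(sample_means) - se) < 0.01)  # True: spread ≈ se
```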

Hypothesis testing

Test values

Test Statistic: Result that is calculated from the sample

Null Hypothesis H_0: The hypothesis assumed correct

Alternative Hypothesis H_1: The hypothesis accepted if the assumption is shown wrong

Types of tests

  • H_1 in the form H_1: p < \text{ or } p > → "one-tailed"

  • H_1 in the form H_1: p \neq → "two-tailed"

  • Over / under / increase / decrease = one-tailed

  • Change / not = two-tailed

Carrying out tests

  1. Identify the available data:

    • Test statistic, X
    • Significance level
    • One tail or two tail
  2. Execution:

    • Define X – the test statistic
    • Define H_0 & H_1
    • Assume H_0 true, substitute p into the distribution model → compute the tail probability at the test value, e.g. P(X \leq x)
    • Compare: if this probability < s.l. → reject H_0
    • Conclude in context of the question

Example of Hypothesis Testing

A seed producer claims that 96% of its beans turn golden. A random sample of 75 bean seeds was planted and 66 of these seeds turned golden. Test at 1% significance whether the producer is overstating the probability of the seeds turning golden.

  1. Data identified:

    • n = 75, p = 0.96
    • Model hence is Binomial
    • Test statistic value x = 66
    • s.l. = 0.01
    • One tail
  2. Define X – the test statistic:

    • Let X = the number of seeds that turn golden
  3. Define H_0 & H_1:

    • H_0: p = 0.96, H_1: p < 0.96
  4. Assume H_0 true, substitute p into the distribution model → compute the tail probability:

    • P(X \leq 66) = 0.00303 < 0.01
  5. Compare: if the tail probability < s.l. → reject H_0:

    • H_0 rejected
  6. Conclude in context of the question:

    • Producer is overstating …
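The tail probability in step 4 can be reproduced exactly with the binomial c.d.f.; a sketch using math.comb:

```python
# Exact binomial tail probability for the example:
# X ~ Binomial(75, 0.96) under H_0, p-value = P(X <= 66).
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_value = binom_cdf(66, 75, 0.96)
print(round(p_value, 5))  # ≈ 0.003, below the 1% significance level
```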

Example of a Two-Tailed Test

Test at 1% significance whether the producer is lying about the probability of the seeds turning golden.

  1. Define X – the test statistic:

    • Let X = the number of seeds that turn golden
  2. Define H_0 & H_1:

    • H_0: p = 0.96, H_1: p \neq 0.96
  3. Assume H_0 true, substitute p into the distribution model → compute both tail probabilities P(X \leq x) and P(X \geq x) since the test is two-tailed:

    • P(X \leq 66) = 0.00303 < 0.005
    • P(X \geq 66) = 1 - 0.00303 > 0.005
  4. Compare: if either tail probability < s.l. ÷ 2 → reject H_0:

    • H_0 rejected (first tail)
  5. Conclude in context of the question:

    • Producer is lying and overstating…
