Statistics
Random variables
A numerical description of the outcome of a random experiment is called a random variable. It can be either discrete or continuous.
Random variables are denoted with capital letters, e.g. $X$.
The possible outcomes are denoted with lowercase letters, e.g. $x$.
The probability that the outcome of $X$ is $x$ is denoted by $P(X = x)$.
$E(X)$ gives the expected value, which represents the mean value (outcome) of the random variable.
$\operatorname{Var}(X)$ gives the variance, which is a measure of the variability of the random variable's outcomes.
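$E(aX \pm bY) = aE(X) \pm bE(Y)$, so any addition or subtraction inside $E$ can be expanded term by term.
If $X$ and $Y$ are independent: $\operatorname{Var}(aX \pm bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y)$.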
For :
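The expected value and variance described above can be checked numerically for a small discrete distribution. The sketch below is only an illustration: the fair six-sided die and the helper names `expectation` and `variance` are assumptions, not part of the original notes. It also checks that the variances of two independent dice add.

```python
from itertools import product


def expectation(dist):
    """E(X) for a discrete distribution given as {outcome: probability}."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) computed as E(X^2) - E(X)^2."""
    mean = expectation(dist)
    return sum(x * x * p for x, p in dist.items()) - mean ** 2


# Hypothetical example: a fair six-sided die.
die = {x: 1 / 6 for x in range(1, 7)}

# Distribution of the sum of two independent dice:
# independence lets us multiply the individual probabilities.
two_dice = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0.0) + px * py

print("E(X)   =", expectation(die))
print("Var(X) =", variance(die))
print("E(X+Y)   =", expectation(two_dice))
print("Var(X+Y) =", variance(two_dice))
```

For one die the exact values are $E(X) = 3.5$ and $\operatorname{Var}(X) = 35/12$; the sum of two dice has twice the mean and twice the variance.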
Probability distributions
A continuous random variable has probabilities given by the area under its density curve. Hence, $P(X = x)$ for any single outcome $x$ is $0$. A discrete random variable has a specific probability assigned to each outcome.
Probability distribution functions
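A p.d.f. is a function $f(x)$ of a continuous random variable that gives the probability density at a given outcome; probabilities are obtained as areas under it. The following is an example of a p.d.f.:
Note that $\int_{-\infty}^{\infty} f(x)\,dx = 1$, as the total probability over all outcomes is 1. This condition must be true for $f(x)$ to be a valid p.d.f. Hence for the example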
Cumulative distribution function
To obtain the c.d.f. $F(x)$, where $F(x) = P(X \le x)$, all we have to do is integrate the p.d.f.: $F(x) = \int_{-\infty}^{x} f(t)\,dt$
Using our example:
And to convert a c.d.f. back to its p.d.f., all we have to do is differentiate $F(x)$: $f(x) = \frac{d}{dx}F(x)$
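The relationship between a p.d.f. and its c.d.f. can be sketched numerically. The density used here, $f(x) = 2x$ on $[0, 1]$, is a hypothetical stand-in and not the example from these notes; `integrate` is an assumed helper implementing a simple midpoint rule.

```python
def f(x):
    """Hypothetical p.d.f.: f(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0


def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


def cdf(x):
    """F(x) = P(X <= x): integrate the p.d.f. up to x (support starts at 0)."""
    if x <= 0:
        return 0.0
    return integrate(f, 0.0, min(x, 1.0))


# Validity check: the total area under a p.d.f. must be 1.
total_area = integrate(f, 0.0, 1.0)

# c.d.f. at 0.5; analytically F(x) = x^2, so this should be close to 0.25.
f_half = cdf(0.5)

# Differentiating the c.d.f. numerically should recover the p.d.f.
h = 1e-4
slope = (cdf(0.5 + h) - cdf(0.5 - h)) / (2 * h)

print(total_area, f_half, slope, f(0.5))
```

The central difference at the end is a numerical counterpart of differentiating the c.d.f.: `slope` should agree with `f(0.5)` up to rounding error.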
Normal distribution
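The Normal distribution $X \sim N(\mu, \sigma^2)$ is given by the following formula (which you don't have to memorize): $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$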
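To calculate $P(X \le x)$, we have to standardize $X$ by $Z = \frac{X - \mu}{\sigma}$, $Z \sim N(0, 1)$. Here $Z$ is the standard normal variable.
For $Z \sim N(0, 1)$, to find $\Phi(z) = P(Z \le z)$, locate the column header and the leftmost-column entry in the z-table such that their sum is $z$. The corresponding intersecting cell is $\Phi(z)$.
$\Phi^{-1}(p)$ gives the value $z$ in $P(Z \le z) = p$.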
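As a rough illustration of standardizing and then looking up a probability, the sketch below uses `statistics.NormalDist` from the Python standard library (3.8+) in place of a printed z-table. The parameters $\mu = 100$, $\sigma = 15$ and the outcome $x = 115$ are made-up values chosen only for the example.

```python
from statistics import NormalDist

# Hypothetical parameters, not taken from the notes.
mu, sigma = 100.0, 15.0
x = 115.0

# Standardize: Z = (X - mu) / sigma.
z = (x - mu) / sigma

# P(Z <= z) from the standard normal N(0, 1); this plays the role of the z-table.
standard_normal = NormalDist()  # mean 0, standard deviation 1
p = standard_normal.cdf(z)

# The same probability computed directly on X ~ N(mu, sigma^2).
p_direct = NormalDist(mu, sigma).cdf(x)

# Inverse lookup: recover z from the probability.
z_back = standard_normal.inv_cdf(p)

print(f"z = {z:.2f}, P(X <= {x}) = {p:.4f}, inverse gives z = {z_back:.2f}")
```

With a printed table, a value such as $z = 1.23$ would be read from the row labelled 1.2 and the column labelled 0.03.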
Sampling
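For a given population of size $N$, $\mu$ is the true mean and $\sigma^2$ is the true variance.
A sample is a subset, of size $n$, of the population. The sample mean is $\bar{x}$ and the sample variance is $s^2$. They can be calculated as: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$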
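A minimal sketch of computing the sample mean and sample variance follows. It assumes the $n - 1$ divisor for the sample variance (the same convention as `statistics.variance`), and the data values are invented for illustration.

```python
from statistics import mean, variance

# Hypothetical sample of n = 8 observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

# Sample mean: sum of the observations divided by n.
x_bar = sum(sample) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
s_squared = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# The standard library gives the same results.
print(x_bar, mean(sample))
print(s_squared, variance(sample))
```

If the notes intend the divisor $n$ instead, `statistics.pvariance` is the matching library call.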
The statistics calculated from a random sample are themselves random, because they change from one sample to the next.
The sampling distribution is the distribution of a statistic across the possible random samples.
The random variable $\bar{X}$ is the mean of a random sample of $n$ observations $X_1, X_2, \dots, X_n$ if:
- The observations are independent.
- The observations are identically distributed.
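- $\bar{X} = \frac{1}{n}(X_1 + X_2 + \dots + X_n)$.
If $X$ is normally distributed, $X \sim N(\mu, \sigma^2)$: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
The standard error, $\frac{\sigma}{\sqrt{n}}$, captures how far the sample mean typically falls from the true population mean. The value decreases as $n$ increases.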
The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.
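The standard error and the central limit theorem can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be exponential with rate 1 (mean 1, standard deviation 1, strongly skewed), and the sample size, number of repetitions and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 50              # size of each sample
num_samples = 2000  # number of repeated samples

# Draw many samples from a skewed population and record each sample mean.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

# Standard error predicted by sigma / sqrt(n), with sigma = 1 here.
theoretical_se = 1.0 / sqrt(n)
observed_se = stdev(sample_means)

# For an approximately normal sampling distribution, roughly 68% of the
# sample means should lie within one standard error of the true mean.
within_one_se = sum(abs(m - 1.0) <= theoretical_se for m in sample_means) / num_samples

print(f"mean of sample means: {mean(sample_means):.3f}")
print(f"observed SE: {observed_se:.3f}, theoretical SE: {theoretical_se:.3f}")
print(f"fraction within one SE: {within_one_se:.3f}")
```

Increasing `n` should shrink both the observed and theoretical standard errors, in line with the $1/\sqrt{n}$ behaviour.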
Hypothesis testing
Test Statistic: Result that is calculated from the sample
Null Hypothesis $H_0$: Hypothesis assumed correct
Alternative Hypothesis $H_1$: Hypothesis about the parameter that is accepted if the assumption is shown wrong
- $H_1$ in the form $<$ or $>$ → "one-tailed"
- $H_1$ in the form $\neq$ → "two-tailed"
- Over / under / increase / decrease = one-tailed
- Change / not = two-tailed
- Identify the available data:
  - Test statistic
  - Significance level
  - One tail or two tail
- Execution:
  - Define $X$ – the test statistic
  - Define $H_0$ & $H_1$
  - Assume $H_0$ true, substitute into the distribution model → find the tail probability (e.g. $P(X \le x)$) to give the p-value.
  - Compare: if p-value $<$ significance level → reject $H_0$
  - Conclude in context of the question
Example of Hypothesis Testing
A seed producer claims that 96% of its beans turn golden. A random sample of 75 bean seeds was planted and 66 of these seeds turned golden. Test at 1% significance whether the producer is overstating the probability of the seeds turning golden.
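- Data identified:
  - Model is Binomial: $X \sim B(75, p)$
  - Test statistic value: $x = 66$
  - One tail
- Define $X$ – the test statistic:
  - Let $X$ = the number of seeds, out of 75, that turn golden
- Define $H_0$ & $H_1$:
  - $H_0: p = 0.96$, $H_1: p < 0.96$
- Assume $H_0$ true, substitute into the distribution model → find $P(X \le 66)$:
  - $X \sim B(75, 0.96)$, so $P(X \le 66) \approx 0.003$
- Compare: if $P(X \le 66) < 0.01$ → reject $H_0$:
  - $0.003 < 0.01$, so $H_0$ is rejected
- Conclude in context of the question:
  - Producer is overstating …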
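The one-tailed test above can be reproduced with exact binomial probabilities. The sketch below uses only `math.comb` from the standard library; the helper names `binomial_pmf` and `lower_tail` are assumptions introduced for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def lower_tail(x, n, p):
    """P(X <= x) for X ~ B(n, p), summing the exact probabilities."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))


# Values from the seed example: 75 seeds, 66 turned golden,
# claimed probability 0.96, significance level 1%.
n, p0, observed, alpha = 75, 0.96, 66, 0.01

# Assume H0 (p = 0.96) is true and find the lower-tail probability.
p_value = lower_tail(observed, n, p0)

# One-tailed decision: reject H0 if the tail probability is below alpha.
reject_h0 = p_value < alpha

print(f"P(X <= {observed}) = {p_value:.4f}, reject H0: {reject_h0}")
```

The tail probability comes out at roughly 0.003, which is below 0.01, so the sketch reaches the same decision as the worked example.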
Example of a Two-Tailed Test
Test at 1% significance whether the producer is lying about the probability of the seeds turning golden.
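- Define $X$ – the test statistic:
  - Let $X$ = the number of seeds, out of 75, that turn golden
- Define $H_0$ & $H_1$:
  - $H_0: p = 0.96$, $H_1: p \neq 0.96$
- Assume $H_0$ true, substitute into the distribution model → find $P(X \le 66)$ & $P(X \ge 66)$ if two-tailed:
  - $P(X \le 66) \approx 0.003$, $P(X \ge 66) \approx 0.999$
- Compare: if either tail probability $< 0.005$ (half of 1%) → reject $H_0$:
  - $0.003 < 0.005$, so $H_0$ is rejected (first tail)
- Conclude in context of the question:
  - Producer is lying and overstating…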
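For the two-tailed version, one common convention (assumed here, since the notes do not spell it out) is to split the significance level equally and compare each tail probability with half of it. The sketch below computes both tails for the same data; `binomial_pmf` is an assumed helper name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Same data as the seed example, now tested for any change in p.
n, p0, observed, alpha = 75, 0.96, 66, 0.01

# Lower tail P(X <= 66) and upper tail P(X >= 66) under H0: p = 0.96.
lower = sum(binomial_pmf(k, n, p0) for k in range(observed + 1))
upper = sum(binomial_pmf(k, n, p0) for k in range(observed, n + 1))

# Two-tailed decision: each tail is compared with alpha / 2.
reject_h0 = min(lower, upper) < alpha / 2

print(f"P(X <= {observed}) = {lower:.4f}")
print(f"P(X >= {observed}) = {upper:.4f}")
print(f"reject H0 at {alpha:.0%} (two-tailed): {reject_h0}")
```

Only the lower tail is small here, which matches the "first tail" remark in the example: the observed count is unusually low, not unusually high.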