Statistics Notes
Short notes on the AQA statistics A level.
Split by module, then by topic.
SS02
Time series analysis
Seasonal variation
When asked for the type of variation:
Seasonal variation: if the variations from the fit line appear to follow a pattern
Random variation: if they don’t
Moving averages and seasonal effects
Moving averages For
Centered moving average The average of the current moving average and the surrounding moving averages
The seasonal effect is the average of the values of the difference between the actual values and the centered moving average.
Take the centered moving average and subtract it from the actual value for each item in a group. Then calculate the average of these values.
Estimation from seasonal effects
Once the seasonal effects have been calculated, the fit line can be used to find a value for a given time, and the seasonal effect for that season can then be added to this value to produce an estimate.
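The calculation above can be sketched in Python. This is a minimal sketch, assuming quarterly data (an even period of 4), so each centred moving average is taken as the mean of two adjacent moving averages. The function names and the data are made up for illustration.

```python
def moving_averages(values, period):
    """Average of each run of `period` consecutive values."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


def centred_moving_averages(values, period):
    """For an even period, average each adjacent pair of moving averages.

    Returns (index, cma) pairs, where index is the position in `values`
    that the centred moving average lines up with.
    """
    ma = moving_averages(values, period)
    half = period // 2
    return [(i + half, (ma[i] + ma[i + 1]) / 2) for i in range(len(ma) - 1)]


def seasonal_effects(values, period):
    """Mean of (actual - centred moving average) for each season."""
    diffs = {season: [] for season in range(period)}
    for index, cma in centred_moving_averages(values, period):
        diffs[index % period].append(values[index] - cma)
    return {s: sum(d) / len(d) for s, d in diffs.items() if d}


# Hypothetical quarterly figures: a steady upward trend plus a repeating pattern.
values = [13, 10, 8, 15, 17, 14, 12, 19, 21, 18, 16, 23]
effects = seasonal_effects(values, period=4)
print(effects)
```

To estimate a future value, read the trend value for that time off the fit line and add the seasonal effect for the matching quarter.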
Sampling
Simple random samples
- Assign a range of values to the data
- Choose random values from the number table, starting from a random position
Each item has the same probability of being chosen.
If the data is in sections (strata), some sections may not be represented, so the sample may not reflect the make-up of the population.
Stratified random sampling
There may often be factors which divide up the population into groups (strata), and we may expect the measurement of interest to vary among the different groups. This can be accounted for when we select a sample from the population in order that we obtain a sample that is representative of the population.
We generally require that the proportion of each stratum in the sample should be the same as in the population.
Stratified sampling techniques are generally used when the population is heterogeneous, or dissimilar, but certain homogeneous sub-populations can be isolated.
Some reasons for using stratified sampling over simple random sampling are:
- The cost per observation may be reduced
- Estimates of the population parameters may be wanted for each sub-population
- Increased accuracy at a given cost
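Proportional stratified sampling can be sketched as below. This is an illustrative sketch only: the function name, the staff groups and the sample size are assumptions, and each stratum is sampled by simple random sampling without replacement.

```python
import random


def stratified_sample(strata, sample_size, seed=None):
    """Draw a proportional stratified sample.

    `strata` maps a stratum name to the list of its members. Each stratum
    contributes a share of the sample proportional to its share of the
    population (rounded to the nearest whole number).
    """
    rng = random.Random(seed)
    population = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        quota = round(sample_size * len(members) / population)
        sample[name] = rng.sample(members, quota)
    return sample


# Hypothetical population: 60 juniors, 30 seniors, 10 managers.
staff = {
    "junior": [f"J{i}" for i in range(60)],
    "senior": [f"S{i}" for i in range(30)],
    "manager": [f"M{i}" for i in range(10)],
}
picked = stratified_sample(staff, sample_size=20, seed=1)
print({name: len(members) for name, members in picked.items()})
```

With these numbers the strata contribute 12, 6 and 2 members. When the proportions do not divide evenly, the rounded quotas may not add up to exactly the requested sample size and need adjusting by hand.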
Cluster, quota, and systematic sampling
Cluster sampling
The data is divided into separate groups. Then a simple random sample of these groups (clusters) is selected from the population.
The population is divided into groups, called clusters
The researcher randomly selects clusters to include in the sample
The number of observations within each cluster is known.
One-stage sampling: all of the elements within selected clusters are included in the sample
Two-stage sampling: a subset of elements within selected clusters is randomly selected for inclusion in the sample
Sometimes, the cost per sample point is less for cluster sampling than for other sampling methods. Given a fixed budget, the researcher may be able to use a bigger sample with cluster sampling than with the other methods. When the increased sample size is sufficient to offset the loss in precision, cluster sampling may be the best choice.
Application:
- Select a cluster grouping as a sampling frame
- Mark each cluster with a unique number
- Choose a sample of clusters applying probability sampling
Quota sampling
Quota sampling requires that representative individuals are chosen from a specific subgroup.
Advantages:
- Primary collection can be done in a short time
- The application of quota sampling can save costs and time
- Quota sampling does not depend on having a sampling frame.
Disadvantages:
- It is not possible to calculate the sampling error
- Other important characteristics may be disproportionately represented in the final sample group
- There is a potential for researcher bias, and the quality of work may suffer due to researcher incompetence
Systematic sampling
A sampling method in which the first position in the data set is randomly chosen, and every $k$th position after this is also chosen, where $k$ is a fixed sampling interval
Advantages:
- Allows the researcher to add a degree of system or process into the random selection
- Known and equal probability of selection
- The assurance that the population will be evenly sampled, whereas simple random sampling can produce a clustered selection of subjects
Disadvantages:
- The process of selection can interact with a hidden periodic trait within the population
- The process can also hide a periodic trait
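A minimal sketch of systematic sampling, assuming the sampling interval is the population size divided by the sample size (rounded down) and that the sample is no larger than the population. The function name is made up for illustration.

```python
import random


def systematic_sample(items, sample_size, seed=None):
    """Pick a random start in the first interval, then every k-th item."""
    rng = random.Random(seed)
    k = len(items) // sample_size        # sampling interval
    start = rng.randrange(k)             # random position in 0..k-1
    return [items[start + i * k] for i in range(sample_size)]


# Hypothetical population of 100 numbered items, sample of 10.
sample = systematic_sample(list(range(100)), 10, seed=0)
print(sample)
```

Because every chosen item is exactly `k` positions after the previous one, any pattern in the data that repeats every `k` items will be over- or under-represented, which is the periodic-trait risk listed above.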
Discrete probability distributions
Expectations and variance
Poisson
(Same as MS02)
Interpretation of data
Pie charts
Line diagrams
Box and whisker plots
Frequency diagrams
Scatter diagrams
Histograms not required
Hypothesis testing
Tests for means
Errors
Type I error: Rejecting a true null hypothesis
Type II error: Accepting a false null hypothesis
If the question says test, state the hypotheses!
Tests
TV is the test value (the test statistic, or a probability for tests compared against the significance level), CV is the critical value and SL is the significance level.
Test | Details | Case to reject | |
---|---|---|---|
SS02 | |||
Z/T test | Determines whether two population means are different | TV > CV | |
SS03 | |||
Contingency tables | A test for independence | TV > CV | |
Sign test | Test for difference in medians | TV < CV | |
Wilcoxon | Test for difference in mean or median | TV < CV | |
Mann-Whitney | Test for equality of population | TV < CV | |
Kruskal-Wallis | Test for equality of population of two or more samples | TV > CV | |
Correlation coefficient | Test for existence of correlation between two random variables | TV > CV | |
SS04 | |||
Poisson | Test for change in a Poisson variable | TV < SL | |
Proportion | Test whether sample proportion represents the population | TV < SL | |
SS05 | |||
Variance | Tests a sample for a given population variance | TV > CV(Upper) or TV < CV(Lower) | |
Variance equality of samples (F) | Test for equality of the variances of the populations of two samples of two normally distributed random variables | TV > CV | |
Difference in mean (Two sample Z) | Test for the difference in the means of two independent populations | \|TV\| > \|CV\| | |
Difference in mean (Two sample T) | Test for the difference in the means of two independent populations with unknown variances | \|TV\| > \|CV\| | |
Goodness of fit | Test for the fit of a sample to a particular distribution | TV > CV | |
SS06 | |||
Paired comparisons | Analysis of the difference between pairs of values sampled from two normal populations | TV > CV | |
Analysis of variance | An extension of F tests with more than 2 populations | TV > CV | |
Two way analysis of variance | An analysis of variance which accounts for a second factor | TV > CV | |
Latin squares |
SS03
Contingency tables
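- State hypotheses (in context) i. $H_0$: No association ii. $H_1$: Association
- Calculate the expected values, $E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$. If any expected value is less than 5, merge the rows or columns
- Calculate the test statistic $X^2 = \sum \frac{(O - E)^2}{E}$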
- Find the critical value for degrees of freedom. This can either be done from the table or with
- If the test statistic is greater than the critical value, reject $H_0$ in context
Yates’ correction
For a 2 by 2 table, Yates’ correction is used.
Rather than $\sum \frac{(O - E)^2}{E}$ the corrected formula is $\sum \frac{(|O - E| - 0.5)^2}{E}$
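The contingency table calculation, with and without Yates’ correction, can be sketched as follows. The function names and the 2 by 2 table are made up for illustration; the critical value still has to be read from the tables.

```python
def expected_values(observed):
    """Expected frequency = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_statistic(observed, yates=False):
    """Sum of (|O - E| - correction)^2 / E over every cell."""
    expected = expected_values(observed)
    correction = 0.5 if yates else 0.0
    return sum((abs(o - e) - correction) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


def degrees_of_freedom(observed):
    """(rows - 1) * (columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Hypothetical 2 by 2 table of observed frequencies.
table = [[20, 30], [30, 20]]
print(chi_squared_statistic(table))              # uncorrected
print(chi_squared_statistic(table, yates=True))  # with Yates' correction
print(degrees_of_freedom(table))
```

For this table every expected value is 25, the uncorrected statistic is 4 and the corrected statistic is 3.24. With 1 degree of freedom the 5% critical value is 3.841, so the correction changes the conclusion here.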
Distribution free methods
Test | For | Use case |
---|---|---|
Sign test | Median | When Wilcoxon cannot be used as data is not symmetrical or is non-numeric |
Wilcoxon signed-rank test | Median or Mean | When a z or t test cannot be used |
Mann-Whitney U test | Equality of populations of two samples | There are two samples |
Kruskal-Wallis test | Equality of populations of two or more samples | There are more than two samples |
Sign test
The sign test checks for a difference in the median value by comparing each pair. It does not require a symmetric distribution and could be used on non-numeric data so long as the data can be assigned to two groups (e.g. boolean values of opinions)
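- State hypotheses i. $H_0$: that population medians are equal ii. $H_1$: that population medians are not equal, or that one is greater than the other for a one-tailed test
- Find the differences between each pair, ignoring any equal values
- Count the number of positive differences and the number of negative differences
- Find the value of $P(X \le r)$ where $X \sim B(n, 0.5)$, $n$ is the number of non-zero differences and $r$ is the smaller of the two counts
- Compare the value above with the significance level (half the significance level for a two-tailed test). If it is less, reject $H_0$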
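The probability needed for the sign test can be sketched as below, assuming the test value is the one-tail binomial probability $P(X \le r)$ with $X \sim B(n, 0.5)$. The function name and the differences are made up for illustration.

```python
from math import comb


def sign_test_probability(differences):
    """P(X <= r) for X ~ B(n, 0.5), ignoring zero differences.

    n is the number of non-zero differences and r is the smaller of the
    counts of positive and negative differences (one-tail probability).
    """
    positives = sum(1 for d in differences if d > 0)
    negatives = sum(1 for d in differences if d < 0)
    n = positives + negatives
    r = min(positives, negatives)
    return sum(comb(n, k) for k in range(r + 1)) / 2 ** n


# Hypothetical paired differences: 8 positive, 1 negative, 1 zero (ignored).
differences = [2, 3, -1, 4, 0, 5, 1, 2, 6, 3]
p = sign_test_probability(differences)
print(p)
```

Here $n = 9$ and $r = 1$, so the probability is $10/512 \approx 0.0195$.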
Wilcoxon Signed-rank test
The Wilcoxon test is similar to the sign test except that it ranks the differences ignoring their signs.
- State the hypotheses i. $H_0$: Population average difference (in mean or median) of 0 ii. $H_1$: Population average difference not 0
- Rank the absolutes of the differences, giving each rank the sign of its respective difference
- Calculate $T_+$ and $T_-$, the sums of the positive and negative ranks respectively. Let $T = \min(T_+, T_-)$
- Find the critical value from the table
- Compare $T$ to the critical value, rejecting $H_0$ if $T$ is smaller
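The ranking step of the Wilcoxon signed-rank test can be sketched as follows. This is an illustrative sketch: zero differences are dropped and tied values share the mean of their ranks, both of which are assumptions not stated in these notes. The function names and data are made up.

```python
def average_ranks(values):
    """Rank from 1 upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def wilcoxon_statistic(differences):
    """Return (T_plus, T_minus, T) with T = min(T_plus, T_minus)."""
    nonzero = [d for d in differences if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    t_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return t_plus, t_minus, min(t_plus, t_minus)


# Hypothetical paired differences.
differences = [5, -2, 8, 3, -1, 6, 4, 7]
t_plus, t_minus, t = wilcoxon_statistic(differences)
print(t_plus, t_minus, t)
```

The two negative differences have ranks 1 and 2, so the negative rank sum is 3, the positive rank sum is 33, and the test statistic is 3.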
Mann-Whitney U-test
The Mann-Whitney U test tests whether two samples were taken from the same population. It is used when a t test cannot be applied as the data is not normal
- State the hypotheses i. $H_0$: The samples are from the same population ii. $H_1$: The samples are from different populations
- Rank the entire dataset and calculate the sum of the ranks for each set
- Calculate the test statistic $U = T - \frac{n(n+1)}{2}$ for each set, where $T$ is the sum of the ranks of the set and $n$ is the size of the set. Let $U$ be the smaller of the two values
- Find the critical value for the significance level and the size of each data set
- If $U$ is less than the critical value, reject $H_0$
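A minimal sketch of the Mann-Whitney calculation, assuming the statistic for each sample is its rank sum minus $n(n+1)/2$ and that tied values share the mean of their ranks. The function names and data are made up for illustration.

```python
def average_ranks(values):
    """Rank from 1 upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b, U) where U = min(U_a, U_b)."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    t_a = sum(ranks[:n_a])               # rank sum of the first sample
    t_b = sum(ranks[n_a:])               # rank sum of the second sample
    u_a = t_a - n_a * (n_a + 1) / 2
    u_b = t_b - n_b * (n_b + 1) / 2
    return u_a, u_b, min(u_a, u_b)


# Hypothetical samples.
a = [12, 15, 11, 18]
b = [20, 22, 17, 25, 19]
u_a, u_b, u = mann_whitney_u(a, b)
print(u_a, u_b, u)
```

A quick check on the arithmetic is that the two statistics always add up to the product of the sample sizes, here $1 + 19 = 4 \times 5$.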
Kruskal-Wallis test
The Kruskal-Wallis test is a non-parametric version of the ANOVA test. It tests for a difference between samples
- State the hypotheses i. $H_0$: All samples are from the same population ii. $H_1$: Samples are from different populations
- Rank the entire dataset and calculate the sum of the ranks of each set
- Calculate the test statistic $H = \frac{12}{N(N+1)} \sum \frac{T_i^2}{n_i} - 3(N+1)$ where $N$ is the sum of all sample sizes, and $T_i$ and $n_i$ are the rank sums and sizes of each sample
- Find the degrees of freedom, the number of samples minus one
- Calculate the critical value for the given significance level either from the tables or with
- If the test statistic is larger than the critical value, reject $H_0$
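The Kruskal-Wallis statistic can be sketched as below, assuming the usual form $H = \frac{12}{N(N+1)} \sum \frac{T_i^2}{n_i} - 3(N+1)$ with tied values sharing the mean of their ranks. The function names and data are made up for illustration.

```python
def average_ranks(values):
    """Rank from 1 upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def kruskal_wallis_h(samples):
    """Return (H, degrees of freedom) for a list of samples."""
    pooled = [x for sample in samples for x in sample]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for sample in samples:
        rank_sum = sum(ranks[start:start + len(sample)])
        total += rank_sum ** 2 / len(sample)
        start += len(sample)
    h = 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h, len(samples) - 1


# Hypothetical samples with no overlap at all between groups.
h, df = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h, df)
```

Here the rank sums are 6, 15 and 24, giving $H = 7.2$ with 2 degrees of freedom. The 5% critical value of $\chi^2$ with 2 degrees of freedom is 5.991, so the null hypothesis would be rejected.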
Correlation
Spearman’s Rank Correlation Coefficient
Spearman’s rank correlation coefficient is used when the data is ranked.
The value is given by $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$ where $d$ is the difference between the ranks of a pair of values, and $n$ is the (equal) size of each dataset
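A short sketch of the calculation, assuming the formula $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, which is exact only when there are no tied ranks. The function names and data are made up for illustration.

```python
def average_ranks(values):
    """Rank from 1 upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman_rank_coefficient(xs, ys):
    """r_s = 1 - 6 * sum(d^2) / (n(n^2 - 1)), d = difference in ranks."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired data.
xs = [10, 20, 30, 40, 50]
ys = [12, 25, 19, 48, 41]
r_s = spearman_rank_coefficient(xs, ys)
print(r_s)
```

The rank differences are 0, -1, 1, -1, 1, so $\sum d^2 = 4$ and $r_s = 1 - \frac{24}{120} = 0.8$.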
Testing with the correlation coefficient
A test can be carried out to determine whether there is a correlation between two random variables, assuming that the correlation coefficient, $r$, is already known.
- State the hypotheses i. $H_0: \rho = 0$ (independent) ii. $H_1: \rho \neq 0$ (not independent)
- Find the critical value from the table
- If the magnitude of the correlation coefficient is greater than the critical value, reject $H_0$
SS04
Linear combinations of independent normal variables
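If the letters $X$ and $Y$ are variables and the letters $a$ and $b$ are constants then $E(aX \pm bY) = aE(X) \pm bE(Y)$ and, if $X$ and $Y$ are independent, $\mathrm{Var}(aX \pm bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y)$
- A linear combination of independent, normal variables will itself be normally distributed
Given a normal distribution $X \sim N(\mu, \sigma^2)$, if an event is given which is $a$ times $X$ then this new event has a distribution $N(a\mu, a^2\sigma^2)$. The standard deviation of the new event is then $|a|\sigma$.
Given two independent normal distributions $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$, the normally distributed random variable $X \pm Y$ has the distribution $N(\mu_X \pm \mu_Y, \sigma_X^2 + \sigma_Y^2)$.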
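These rules can be checked with `NormalDist` from Python's standard `statistics` module, which supports scaling by a constant and adding or subtracting independent normal variables. The parameter values are made up for illustration.

```python
from statistics import NormalDist

x = NormalDist(mu=50, sigma=3)    # X ~ N(50, 3^2)
y = NormalDist(mu=20, sigma=4)    # Y ~ N(20, 4^2), assumed independent of X

doubled = x * 2       # 2X: mean doubles, standard deviation doubles
total = x + y         # X + Y: means add, variances add (9 + 16 = 25)
difference = x - y    # X - Y: means subtract, variances still add

print(doubled.mean, doubled.stdev)
print(total.mean, total.stdev)
print(difference.mean, difference.stdev)
```

Note that the standard deviation of both the sum and the difference is 5, because it is the variances that add, not the standard deviations.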
Approximating distributions
- The purpose of making an approximation is:
- To reduce the amount of calculation
- To allow tables to be used where they otherwise could not
- To calculate confidence intervals
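- The binomial distribution $B(n, p)$ may be approximated by the Poisson distribution with mean $np$ if $n$ is large and $p$ is small
- The conditions for the approximations are rules of thumb. They are not sharp dividing lines between good approximations and bad approximations
- The binomial distribution $B(n, p)$ may be approximated by the normal distribution with mean $np$ and variance $np(1-p)$ if $np$ and $n(1-p)$ are both reasonably large
- The Poisson distribution with mean $\lambda$ may be approximated by the normal distribution with mean $\lambda$ and variance $\lambda$ if $\lambda$ is reasonably large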
Confidence intervals
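- An estimate of a population standard deviation calculated from a random sample of size $n$ has $n - 1$ degrees of freedom
- If $\bar{x}$ is the mean of a random sample of size $n$ from a normal distribution with mean $\mu$, a confidence interval for $\mu$ is given by $\bar{x} \pm t \frac{s}{\sqrt{n}}$, where $s$ is the estimate of the population standard deviation and $t$ is the appropriate percentage point of the $t$ distribution with $n - 1$ degrees of freedom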
Further confidence intervals
- If $x$ is an observation from a Poisson distribution with mean $\lambda$ then an approximate confidence interval for $\lambda$ is given by $x \pm z\sqrt{x}$, provided that $x$ is reasonably large
- If $r$ is an observation from a binomial distribution with parameters $n, p$ then an approximate confidence interval for $p$ is given by $\hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$ where $\hat{p} = \frac{r}{n}$, provided $n$ is reasonably large
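The two approximate intervals can be sketched as below, assuming the usual normal-approximation forms $x \pm z\sqrt{x}$ for a Poisson mean and $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ for a binomial proportion. The function names and observations are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_value(confidence):
    """Two-sided standard normal percentage point, about 1.96 for 95%."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def poisson_interval(x, confidence=0.95):
    """Approximate interval for a Poisson mean: x +/- z * sqrt(x)."""
    margin = z_value(confidence) * sqrt(x)
    return x - margin, x + margin


def proportion_interval(r, n, confidence=0.95):
    """Approximate interval for a binomial p from r successes in n trials."""
    p_hat = r / n
    margin = z_value(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical observations.
print(poisson_interval(100))          # roughly 80.4 to 119.6
print(proportion_interval(40, 100))   # roughly 0.304 to 0.496
```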
Further hypothesis testing for means
To carry out a hypothesis test for a mean based on a sample from a normal distribution with an unknown standard deviation:
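The test statistic is $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ where $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, with $n - 1$ degrees of freedom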
Hypothesis tests for proportions and for the mean of a Poisson distribution
- To test hypotheses about a binomial population proportion, $p$, either:
  a. Determine the cumulative binomial probability of a result at least as extreme as the one observed. State $H_0$, that the proportion remains the same, and $H_1$, that it exhibits the expected change. Then find the probability that this value occurs, and reject $H_0$ if the probability is lower than the level of the test.
  b. Use $z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}$. The hypotheses are then dealt with like a regular normal hypothesis test.
- To test hypotheses about a Poisson population mean, $\lambda$, either:
  a. Determine the cumulative Poisson probability of a result at least as extreme as the one observed, or
  b. Use $z = \frac{x - \lambda}{\sqrt{\lambda}}$
SS05
Continuous probability distributions
- The random variable $X$ having probability density function $f(x) = \frac{1}{b - a}$ for $a < x < b$ (and 0 otherwise), where $a$ and $b$ are constants, is said to follow a rectangular distribution
- The mean of $X$ is $\frac{a + b}{2}$ and the variance of $X$ is $\frac{(b - a)^2}{12}$
- The exponential distribution has probability density function $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$
- The exponential distribution with parameter $\lambda$ has mean $\frac{1}{\lambda}$ and standard deviation $\frac{1}{\lambda}$
- $P(X \le x)$ is known as the cumulative distribution function and is usually denoted $F(x)$
- For the exponential distribution with parameter $\lambda$, $F(x) = 1 - e^{-\lambda x}$
- If $a$ and $b$ are two constants and $a < b$, the probability that $X$ takes a value between $a$ and $b$ is $F(b) - F(a)$
- The intervals between successive events from a Poisson distribution with mean $\lambda$ are distributed according to the exponential distribution with parameter $\lambda$
Estimation
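- If $s^2$ denotes the variance estimate from a random sample of size $n$ from a normal population with variance $\sigma^2$, then $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$
- The $\chi^2$ distribution is not symmetric so both lower and upper percentage points need to be read from tables
- A confidence interval for a normal population variance, $\sigma^2$, is given by $\frac{(n-1)s^2}{\chi^2_{\text{upper}}}$ and $\frac{(n-1)s^2}{\chi^2_{\text{lower}}}$, where $\chi^2_{\text{upper}}$ and $\chi^2_{\text{lower}}$ are the upper and lower percentage points of $\chi^2_{n-1}$
- Confidence limits for a normal population standard deviation, $\sigma$, are found by taking the square root of those calculated for the population variance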
Hypothesis testing: one sample tests
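- To test hypotheses about a normal population variance, $\sigma^2$, or standard deviation, $\sigma$, use $\chi^2 = \frac{(n-1)s^2}{\sigma^2}$ with $n - 1$ degrees of freedom
- To test hypotheses about a normal population with mean, $\mu$, use $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ with $n - 1$ degrees of freedom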
Hypothesis testing: two-sample tests
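- To test hypotheses about the equality of two normal population variances, or standard deviations, use $F = \frac{s_1^2}{s_2^2}$ with $n_1 - 1$ and $n_2 - 1$ degrees of freedom
- To test hypotheses about the equality of (or given difference in) two normal population means, based upon independent random samples and known population variances use $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Note that for large $n_1$ and $n_2$ the requirement for normal populations can be relaxed and/or sample variances can be used as estimates of the population variances
- To test hypotheses about the equality of (or given difference in) two normal population means, based upon independent random samples and unknown but equal population variances use $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Testing for goodness of fit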
- $X^2 = \sum \frac{(O - E)^2}{E}$ may be approximated by a $\chi^2$ distribution provided that
- The $O$s are frequencies,
- The $E$s are at least five,
- The classes form a sample space, that is, every possible observation fits into one and only one class
- The number of degrees of freedom is the number of classes, minus the number of independent pieces of information derived from the $O$s in order to calculate the $E$s
- If there are $k$ classes and any necessary parameters are estimated from the data the number of degrees of freedom is $k - 2$ for a Poisson, binomial, or exponential distribution, and $k - 3$ for a normal distribution
SS06
Experimental design
- Experimental error is the effect of factors other than those controlled by the experimenter
- In a paired comparison, experimental error is reduced by applying both treatments to the same subjects or in the same conditions
- The purpose of randomisation is to eliminate bias
- Blocking is used to reduce experimental error by applying treatments (usually more than two) to the same subjects or in the same conditions
- If a new treatment is applied to an experimental group, a control group, which receives no treatment or the standard treatment, is needed to act as a measure of the effect of not applying the new treatment
- A placebo is a pill or treatment which contains no active ingredient
- In a blind trial subjects do not know whether they are receiving the treatment or a placebo
- In a double blind trial neither the subject nor the person administering the treatment knows whether a placebo or an active drug is being given
Analysis of paired comparisons
If $\bar{d}$ and $s_d$ denote the mean and standard deviation, respectively, of a random sample of $n$ differences that can be assumed to be normally distributed with mean $\mu_d$ then $\frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} \sim t_{n-1}$
Analysis of variance (ANOVA)
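- The assumptions for the three models considered, one and two factor ANOVAs, and Latin square designs, are that:
  a. The observations are obtained independently and randomly from populations at each factor level (combination)
  b. These populations are (approximately) normally distributed with common variance $\sigma^2$
  c. When two or more factors are involved, there is no interaction between them
- One way ANOVA table, for $k$ samples and $n$ observations in total
Source of variation | Sum of squares | Degrees of freedom | Mean square | F ratio |
---|---|---|---|---|
Between samples | $SS_B$ | $k - 1$ | $MS_B = \frac{SS_B}{k - 1}$ | $\frac{MS_B}{MS_W}$ |
Within samples | $SS_W$ | $n - k$ | $MS_W = \frac{SS_W}{n - k}$ | |
Total | $SS_T$ | $n - 1$ | | |
- Two way ANOVA table, for $r$ rows and $c$ columns
Source of variation | Sum of squares | Degrees of freedom | Mean square | F ratio |
---|---|---|---|---|
Between rows | $SS_R$ | $r - 1$ | $MS_R = \frac{SS_R}{r - 1}$ | $\frac{MS_R}{MS_E}$ |
Between columns | $SS_C$ | $c - 1$ | $MS_C = \frac{SS_C}{c - 1}$ | $\frac{MS_C}{MS_E}$ |
Error | $SS_E$ | $(r - 1)(c - 1)$ | $MS_E = \frac{SS_E}{(r - 1)(c - 1)}$ | |
Total | $SS_T$ | $rc - 1$ | | |
Provided in the formulae booklet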
Statistical process control
- Statistical process control may be used when a large number of similar items are being produced. Its purpose is to give a signal when the process mean has moved away from the target value or when item-to-item variability has increased
- For control charts for means:
- Sample mean between warning limits: no action
- Sample mean between warning and action limits: take another sample immediately. If the new sample mean is outside the warning limits, take action
- Sample mean outside action limits: take action
- The warning limits are set at $\mu \pm 1.96\frac{\sigma}{\sqrt{n}}$, and the action limits at $\mu \pm 3.09\frac{\sigma}{\sqrt{n}}$, where $\mu$ is the target value, $\sigma$ is the short-term standard deviation, and $n$ is the sample size
- Variability may be controlled by plotting the sample ranges or standard deviations on control charts. The limits for these charts are found by multiplying the process short-term standard deviation by factors found in the control charts for variability table (Table 12)
- When the standard deviation must be estimated from a number of small samples the average sample range can be calculated and a factor from Table 12 applied. Alternatively the standard deviation can be calculated for each sample and these combined into a pooled estimate
- If the tolerance width exceeds six standard deviations the process should be able to meet the tolerances consistently, provided the mean is kept on target
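- For charts for proportion non-conforming, $p$, providing the sample size $n$ is reasonably large:
- The warning limits are $p \pm 1.96\sqrt{\frac{p(1 - p)}{n}}$
- The action limits are $p \pm 3.09\sqrt{\frac{p(1 - p)}{n}}$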
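The decision rule for a chart of sample means can be sketched as below. This is a sketch under assumptions: the multipliers 1.96 (warning) and 3.09 (action) are the commonly used values and are passed as parameters so they can be changed, and the function names and process figures are made up.

```python
from math import sqrt


def mean_chart_limits(target, sigma, n, warning_z=1.96, action_z=3.09):
    """Warning and action limits for a chart of sample means."""
    se = sigma / sqrt(n)                 # standard error of the sample mean
    return {
        "warning": (target - warning_z * se, target + warning_z * se),
        "action": (target - action_z * se, target + action_z * se),
    }


def classify_sample_mean(sample_mean, limits):
    """Apply the three-way rule: no action, resample, or take action."""
    lo_w, hi_w = limits["warning"]
    lo_a, hi_a = limits["action"]
    if lo_w <= sample_mean <= hi_w:
        return "no action"
    if lo_a <= sample_mean <= hi_a:
        return "take another sample"
    return "take action"


# Hypothetical process: target 100, short-term standard deviation 2, samples of 4.
limits = mean_chart_limits(target=100, sigma=2, n=4)
for mean in (101, 102.5, 104):
    print(mean, classify_sample_mean(mean, limits))
```

With these figures the standard error is 1, so the warning limits are 98.04 and 101.96 and the action limits are 96.91 and 103.09.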
Acceptance sampling
- Acceptance sampling may be applied to large batches of similar items. It is the process of deciding whether or not the batch is acceptable by testing a small sample of the items
- The operating characteristic for an acceptance sampling by attributes plan is a graph of probability of acceptance against proportion non-conforming in the batch
- The probabilities may be found from the binomial distribution provided the sample is random and the sample size is small compared to the batch
- In double sampling, the number of non-conforming items in the first sample will determine whether a decision is made immediately or whether it is delayed until a second sample has been inspected
- For acceptance sampling by variables the operating characteristic is a graph of probability of acceptance against batch mean
MS03
Bayes’ theorem
Example
Event | P(A) | P(B) |
---|---|---|
C | 0.3 | 0.8 |
D | 0.5 | 0.1 |
E | 0.2 | 0.4 |
After event A, event B occurred. Find the probability that event E occurred.
Let F be the probability that event B occurred.
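The example can be worked through numerically. This sketch assumes one reading of the table: the P(A) column gives the probability of each of C, D and E at the first stage, and the P(B) column gives the probability of B given that outcome. The variable names are made up.

```python
# Assumed reading of the table: probability of each first-stage outcome,
# and the probability of B given that outcome.
prior = {"C": 0.3, "D": 0.5, "E": 0.2}
b_given = {"C": 0.8, "D": 0.1, "E": 0.4}

# Total probability that B occurs (the quantity the notes call F).
p_b = sum(prior[k] * b_given[k] for k in prior)

# Bayes' theorem: P(E | B) = P(E) * P(B | E) / P(B)
p_e_given_b = prior["E"] * b_given["E"] / p_b

print(round(p_b, 4), round(p_e_given_b, 4))
```

Under that reading, $F = 0.24 + 0.05 + 0.08 = 0.37$ and $P(E \mid B) = \frac{0.08}{0.37} \approx 0.2162$.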
Linear combinations of random variables
Covariance is a measure of the joint variability of two random variables.
The covariance can be used to find the product moment correlation coefficient of two random variables: $\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
The variance can also be computed as
Distributional approximations
Mean and variance of binomial and Poisson distributions
Proof of $E(X) = np$ for binomial
Proof of $\mathrm{Var}(X) = np(1 - p)$ for binomial
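One possible derivation of the binomial mean and variance is sketched below. It is not necessarily the proof these notes originally contained. It assumes $X \sim B(n, p)$ is written as a sum of $n$ independent indicator variables $I_1, \dots, I_n$, each equal to 1 with probability $p$ and 0 otherwise.

```latex
% X = I_1 + I_2 + ... + I_n, where the I_k are independent and
% P(I_k = 1) = p, P(I_k = 0) = 1 - p. Note that I_k^2 = I_k.
\begin{align*}
E(I_k) &= 1 \cdot p + 0 \cdot (1 - p) = p \\
\operatorname{Var}(I_k) &= E(I_k^2) - E(I_k)^2 = p - p^2 = p(1 - p) \\
E(X) &= \sum_{k=1}^{n} E(I_k) = np \\
\operatorname{Var}(X) &= \sum_{k=1}^{n} \operatorname{Var}(I_k) = np(1 - p)
\end{align*}
```

The last line relies on independence, since variances only add for independent variables.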
