Statistics Notes

Short notes on the AQA statistics A level.

Split into sections for each module and topic.

SS02

Time series analysis

Seasonal variation

When asked for the type of variation

Seasonal variation: if the variations from the fit line appear to follow a pattern

Random variation: if they don’t

Moving averages and seasonal effects

Moving averages For

Centered moving average: the average of the current moving average and the surrounding moving averages

The seasonal effect is the average of the differences between the actual values and the centered moving average.

Take the centered moving average and subtract it from the actual value for each item in a group. Then calculate the average of these values.
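The calculation above can be sketched in Python. This is a minimal illustration, assuming quarterly data (an even period of 4) and assuming each centered value is the mean of two adjacent moving averages; the function names and the `sales` figures are hypothetical.

```python
from statistics import mean

def centered_moving_averages(values, period=4):
    """Return {index: centered moving average} for an even period (assumed)."""
    # Plain moving averages over `period` consecutive values
    moving = [mean(values[i:i + period])
              for i in range(len(values) - period + 1)]
    # Averaging two adjacent moving averages lines the result up with
    # index i + period // 2 of the original data
    return {i + period // 2: (moving[i] + moving[i + 1]) / 2
            for i in range(len(moving) - 1)}

def seasonal_effects(values, period=4):
    """Average of (actual - centered moving average) for each season."""
    differences = {}
    for i, cma in centered_moving_averages(values, period).items():
        differences.setdefault(i % period, []).append(values[i] - cma)
    return {season: mean(d) for season, d in sorted(differences.items())}

# Hypothetical quarterly figures: a steady trend plus a repeating pattern
sales = [10, 20, 30, 40, 14, 24, 34, 44, 18, 28, 38, 48]
effects = seasonal_effects(sales)
print(effects)
```

Each key of `effects` is the position within the cycle (0 for the first quarter of each year), so the matching effect can be added to a fit-line value for that quarter.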

Estimation from seasonal effects

Once the seasonal effects have been calculated, the fit line can be used to find a value for a given time, and the seasonal effect can then be applied to this value to produce an estimate.

Sampling

Simple random samples

  1. Assign a range of values to the data
  2. Choose random values from the number table, starting from a random position

Each item has the same probability of being chosen.

If the population falls into sections (strata), a simple random sample may not represent each section in its correct proportion, and some sections may not be represented at all.

Stratified random sampling

There may often be factors which divide up the population into groups (strata), and we may expect the measurement of interest to vary among the different groups. This can be accounted for when we select a sample from the population in order that we obtain a sample that is representative of the population.

We generally require that the proportion of each stratum in the sample should be the same as in the population.
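A proportional allocation of this kind might look like the sketch below. The function name, the strata labels, and the sizes are hypothetical, and rounding each stratum's share is an assumption that can leave the total slightly different from the requested sample size.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample from each stratum, in proportion to its size."""
    rng = random.Random(seed)
    population = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Share of the sample matches the stratum's share of the population
        share = round(sample_size * len(members) / population)
        sample[name] = rng.sample(members, share)
    return sample

# Hypothetical population of 100 people split into three strata
strata = {
    "year 12": list(range(0, 60)),
    "year 13": list(range(60, 90)),
    "staff": list(range(90, 100)),
}
sample = stratified_sample(strata, sample_size=20, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
```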

Stratified sampling techniques are generally used when the population is heterogeneous, or dissimilar, where certain homogeneous sub-populations can be isolated.

Some reasons for using stratified sampling over simple random sampling are:

  • The cost per observation may be reduced
  • Estimates of the population parameters may be wanted for each sub-population
  • Increased accuracy at a given cost

Cluster, quota, and systematic sampling

Cluster sampling

The population is divided into separate groups, called clusters. A simple random sample of clusters is then selected from the population.

The number of observations within each cluster is known.

One-stage sampling: all of the elements within selected clusters are included in the sample

Two-stage sampling: a subset of elements within selected clusters is randomly selected for inclusion in the sample

Sometimes, the cost per sample point is less for cluster sampling than for other sampling methods. Given a fixed budget, the researcher may be able to use a bigger sample with cluster sampling than with the other methods. When the increased sample size is sufficient to offset the loss in precision, cluster sampling may be the best choice.

Application:

  1. Select a cluster grouping as a sampling frame
  2. Mark each cluster with a unique number
  3. Choose a sample of clusters applying probability sampling

Quota sampling

Quota sampling requires that representative individuals are chosen from a specific subgroup

Advantages:

  • Primary collection can be done in a short time
  • The application of quota sampling can save costs and time
  • Quota sampling is not dependent on the presence of the sampling frames.

Disadvantages:

  • It is not possible to calculate the sampling error
  • Other important characteristics may be disproportionately represented in the final sample group
  • There is a potential for researcher bias and the quality of work may suffer due to researcher incompetency

Systematic sampling

A sampling method in which the first position in the data set is randomly chosen, and every $k$th position after this is then chosen
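As a rough sketch, a systematic sample takes a random starting position within the first `step` items and then every `step`-th item after it. The function name and the population of 100 numbered items are hypothetical.

```python
import random

def systematic_sample(items, step, seed=None):
    """Pick a random start within the first `step` items, then every `step`-th item."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random position from 0 to step - 1
    return items[start::step]

# Hypothetical population of 100 numbered items, sampling every 10th
population = list(range(1, 101))
chosen = systematic_sample(population, step=10, seed=3)
print(chosen)
```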

Advantages:

  • Allows the researcher to add a degree of system or process into the random selection
  • Known and equal probability of selection
  • Assurance that the population will be evenly sampled, whereas simple random sampling can produce a clustered selection of subjects

Disadvantages:

  • The process of selection can interact with a hidden periodic trait within the population
  • The process can also hide a periodic trait

Discrete probability distributions

Expectations and variance

Poisson

(Same as MS02)

Interpretation of data

Pie charts

Line diagrams

Box and whisker plots

Frequency diagrams

Scatter diagrams

Histograms not required

Hypothesis testing

Tests for means

Errors

Type I error: Rejecting a true null hypothesis

Type II error: Accepting a false null hypothesis

If the question says test, state the hypotheses!

Tests

| Test | Details | Case to reject |
| --- | --- | --- |
| SS02 | | |
| Z/T test | Determines whether two population means are different | TV > CV |
| SS03 | | |
| Contingency tables | A test for independence | TV > CV |
| Sign test | Test for difference in medians | TV < CV |
| Wilcoxon | Test for difference in mean or median | TV < CV |
| Mann-Whitney | Test for equality of population | TV < CV |
| Kruskal-Wallis | Test for equality of population of two or more samples | TV > CV |
| Correlation coefficient | Test for existence of correlation between two random variables | TV > CV |
| SS04 | | |
| Poisson | Test for change in a Poisson variable | TV < SL |
| Proportion | Test whether sample proportion represents the population | TV < SL |
| SS05 | | |
| Variance | Tests a sample for a given population variance | TV > CV(Upper) or TV < CV(Lower) |
| Variance equality of samples (F) | Test for equality of the variances of the populations of two samples of two normally distributed random variables | TV > CV |
| Difference in mean (Two sample Z) | Test for the difference in the means of two independent populations | \|TV\| > \|CV\| |
| Difference in mean (Two sample T) | Test for the difference in the means of two independent populations with unknown variances | \|TV\| > \|CV\| |
| Goodness of fit | Test for the fit of a sample to a particular distribution | TV > CV |
| SS06 | | |
| Paired comparisons | Analysis of the difference between pairs of values sampled from two normal populations | TV > CV |
| Analysis of variance | An extension of F tests with more than 2 populations | TV > CV |
| Two way analysis of variance | An analysis of variance which accounts for a second factor | TV > CV |
| Latin squares | | |

SS03

Contingency tables

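  1. State hypotheses (in context) i. $H_0$: no association ii. $H_1$: association
  2. Calculate the expected values, $E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$.

    If any expected value is less than 5, merge the rows or columns

  3. Calculate the test statistic $X^2 = \sum \frac{(O - E)^2}{E}$
  4. Find the critical value for $(r - 1)(c - 1)$ degrees of freedom, where $r$ and $c$ are the numbers of rows and columns. This can either be done from the table or with
  5. If $\text{test statistic} > \text{critical value}$, reject $H_0$ in context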

Yates’ correction

For a 2 by 2 table, Yates’ correction is used.

Rather than $\sum \frac{(O - E)^2}{E}$ the corrected formula is $\sum \frac{(|O - E| - 0.5)^2}{E}$
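The contingency table calculation, with and without Yates’ correction, could be sketched as follows. It assumes the usual statistic $\sum \frac{(O - E)^2}{E}$ with expected values from row total times column total over grand total, and the usual correction of subtracting 0.5 from each $|O - E|$; the function name and the observed counts are hypothetical.

```python
def chi_squared_statistic(observed, yates=False):
    """Return (test statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence
            e = row_totals[i] * col_totals[j] / grand_total
            # Yates' correction reduces each absolute difference by 0.5
            diff = abs(o - e) - 0.5 if yates else o - e
            statistic += diff ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical 2 by 2 table of observed counts
table = [[20, 30], [30, 20]]
plain, dof = chi_squared_statistic(table)
corrected, _ = chi_squared_statistic(table, yates=True)
print(plain, corrected, dof)
```

The statistic is then compared with the critical value for `dof` degrees of freedom from the tables.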

Distribution free methods

| Test | For | Use case |
| --- | --- | --- |
| Sign test | Median | When Wilcoxon cannot be used as data is not symmetrical or is non-numeric |
| Wilcoxon signed-rank test | Median or mean | When a z or t test cannot be used |
| Mann-Whitney U test | Equality of populations of two samples | There are two samples |
| Kruskal-Wallis test | Equality of populations of two or more samples | There are more than two samples |

Sign test

The sign test checks for a difference in the median value by comparing each pair. It does not require a symmetric distribution and could be used on non-numeric data so long as the data can be assigned to two groups (e.g. boolean values of opinions)

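  1. State hypotheses i. $H_0$: that population medians are equal ii. $H_1$: that population medians are not equal, or that one is greater than the other (one-tailed)
  2. Find the differences between each pair, ignoring any equal values
  3. Count the number of positive differences and the number of negative differences
  4. Find the value of $P(X \le x)$, where $X \sim B(n, 0.5)$, $n$ is the number of non-zero differences, and $x$ is the smaller of the two counts
  5. Compare the value above with the significance level. If it is less, reject $H_0$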

Wilcoxon Signed-rank test

The Wilcoxon test is similar to the sign test except that it ranks the differences ignoring their signs.

  1. State the hypotheses i. $H_0$: Population average difference (in mean or median) of 0 ii. $H_1$: Population average difference not 0
  2. Rank the absolute values of the differences, giving each rank the sign of its respective difference
  3. Calculate $T_+$ and $T_-$, the sums of the positive and negative ranks respectively. Let $T = \min(T_+, T_-)$
  4. Find the critical value from the table
  5. Compare $T$ to the critical value, rejecting $H_0$ if $T$ is smaller
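The ranking steps can be sketched as below. The helper names are hypothetical, and two choices are assumptions rather than something the notes state: zero differences are dropped, and tied values share their average rank.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    ordered = sorted(values)
    # (first position + last position) / 2, with positions counted from 1
    return [(ordered.index(v) + 1 + len(ordered) - ordered[::-1].index(v)) / 2
            for v in values]

def wilcoxon_t(differences):
    """Return (T_plus, T_minus, T), where T is the smaller of the two rank sums."""
    nonzero = [d for d in differences if d != 0]  # assumption: drop zeros
    ranks = average_ranks([abs(d) for d in nonzero])
    t_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)
    return t_plus, t_minus, min(t_plus, t_minus)

# Hypothetical paired differences
t_plus, t_minus, t = wilcoxon_t([5, -2, 3, 8, -1, 6])
print(t_plus, t_minus, t)
```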

Mann-Whitney U-test

The Mann-Whitney U test tests whether two samples were taken from the same population. It is used when a t test cannot be applied as the data is not normal

  1. State the hypotheses i. $H_0$: The samples are from the same population ii. $H_1$: The samples are from different populations
  2. Rank the entire dataset and calculate the sum of the ranks for each set
  3. Calculate the test statistic for each set, $U_i = T_i - \frac{n_i(n_i + 1)}{2}$, where $T_i$ is the sum of the ranks of the set and $n_i$ is the size of the set.
    Let $U = \min(U_1, U_2)$
  4. Find the critical value for the significance level and the size of each data set
  5. If $U$ is less than the critical value, reject $H_0$
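A sketch of the Mann-Whitney calculation, assuming the statistic for each set is its rank sum minus $\frac{n(n + 1)}{2}$ and that the smaller of the two values is used. The helper names and sample values are hypothetical, and tied values are given their average rank.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    ordered = sorted(values)
    return [(ordered.index(v) + 1 + len(ordered) - ordered[::-1].index(v)) / 2
            for v in values]

def mann_whitney_u(sample_a, sample_b):
    """Return the smaller U statistic of two samples (given as lists)."""
    ranks = average_ranks(sample_a + sample_b)  # rank the pooled data
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    rank_sum_b = sum(ranks[n_a:])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = rank_sum_b - n_b * (n_b + 1) / 2
    return min(u_a, u_b)

# Hypothetical samples
u = mann_whitney_u([12, 15, 18, 20], [14, 16, 22, 25, 30])
print(u)
```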

Kruskal-Wallis test

The Kruskal-Wallis test is a non-parametric version of the ANOVA test. It tests for a difference between samples

  1. State the hypotheses i. $H_0$: All samples are from the same population ii. $H_1$: Samples are from different populations
  2. Rank the entire dataset and calculate the sum of the ranks of each set
  3. Calculate the test statistic $H = \frac{12}{N(N + 1)} \sum \frac{T_i^2}{n_i} - 3(N + 1)$, where $N$ is the sum of all sample sizes, and $T_i$ and $n_i$ are the rank sums and sizes of each sample
  4. Find the degrees of freedom, the number of samples minus one
  5. Calculate the critical value for the given significance level either from the tables or with
  6. If the test statistic is larger than the critical value, reject $H_0$
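The Kruskal-Wallis statistic could be computed as in this sketch, which assumes the standard form $H = \frac{12}{N(N + 1)} \sum \frac{T_i^2}{n_i} - 3(N + 1)$. The helper names and data are hypothetical, and tied values share their average rank.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    ordered = sorted(values)
    return [(ordered.index(v) + 1 + len(ordered) - ordered[::-1].index(v)) / 2
            for v in values]

def kruskal_wallis_h(samples):
    """Return (H, degrees of freedom) for a list of samples."""
    pooled = [x for sample in samples for x in sample]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for sample in samples:
        # Rank sum of this sample, squared and divided by its size
        rank_sum = sum(ranks[start:start + len(sample)])
        weighted += rank_sum ** 2 / len(sample)
        start += len(sample)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h, len(samples) - 1

# Hypothetical samples with no overlap between groups
h, dof = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h, dof)
```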

Correlation

Spearman’s Rank Correlation Coefficient

Spearman’s rank correlation coefficient is used when the data is ranked.

The value is given by $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, where $d$ is the difference between the ranks of a pair of values, and $n$ is the (equal) size of each dataset
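As an illustration, the coefficient can be computed directly from two lists of ranks, assuming the standard formula $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$ and no tied ranks. The function name and the ranks are hypothetical.

```python
def spearman_rank(ranks_x, ranks_y):
    """Spearman's coefficient from two equally sized lists of ranks."""
    n = len(ranks_x)
    # d is the difference between the two ranks of each pair
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks given by two judges to five entries
r_s = spearman_rank([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r_s)
```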

Testing with the correlation coefficient

A test can be carried out to determine whether there is a correlation between two random variables, assuming that the correlation coefficient, $r$, is already known.

  1. State the hypotheses i. $H_0: \rho = 0$ (independent) ii. $H_1: \rho \neq 0$ (not independent)
  2. Find the critical value from the table
  3. If the correlation coefficient is greater than the critical value, reject $H_0$

SS04

Linear combinations of independent normal variables

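If the letters $X$ and $Y$ are variables and the letters $a$ and $b$ are constants then $E(aX + bY) = aE(X) + bE(Y)$ and, if $X$ and $Y$ are independent, $\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y)$

  1. A linear combination of independent, normal variables will itself be normally distributed

Given a normal distribution $X \sim N(\mu, \sigma^2)$, if a variable is given which is $a$ times $X$ then this new variable has the distribution $aX \sim N(a\mu, a^2\sigma^2)$. The standard deviation of the new variable is then $|a|\sigma$.

Given two independent normal distributions $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$, the normally distributed random variable $X + Y$ has the distribution $N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$.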
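A short worked example may help. The distributions $X \sim N(10, 4)$ and $Y \sim N(6, 9)$ are hypothetical and assumed independent; the multiple $2X$ and the sum $X + Y$ then follow the usual rules for means and variances.

```latex
\begin{align*}
2X &\sim N(2 \times 10,\; 2^2 \times 4) = N(20, 16), \quad \text{standard deviation } 4 \\
X + Y &\sim N(10 + 6,\; 4 + 9) = N(16, 13), \quad \text{standard deviation } \sqrt{13} \approx 3.61
\end{align*}
```

Note that the variances add, not the standard deviations.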

Approximating distributions

  1. The purpose of making an approximation is:
    • To reduce the amount of calculation
    • To allow tables to be used where they otherwise could not
    • To calculate confidence intervals
  2. The binomial distribution $B(n, p)$ may be approximated by the Poisson distribution if $n$ is large and $p$ is small (a common rule of thumb is $n \ge 50$ and $p \le 0.1$)
  3. The conditions for the approximations are rules of thumb. They are not sharp dividing lines between good approximations and bad approximations
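  4. The binomial distribution $B(n, p)$ may be approximated by the normal distribution if $np$ and $n(1 - p)$ are both large enough (a common rule of thumb is that both exceed 5)
  5. The Poisson distribution may be approximated by the normal distribution if its mean $\lambda$ is large (a common rule of thumb is $\lambda > 10$)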

Confidence intervals

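  1. An estimate of a population standard deviation calculated from a random sample of size $n$ has $n - 1$ degrees of freedom
  2. If $\bar{x}$ is the mean of a random sample of size $n$ from a normal distribution with mean $\mu$, a $100(1 - \alpha)\%$ confidence interval for $\mu$ is given by $\bar{x} \pm t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$, where $s$ is the sample estimate of the standard deviation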

Further confidence intervals

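  1. If $x$ is an observation from a Poisson distribution with mean $\lambda$ then an approximate $100(1 - \alpha)\%$ confidence interval for $\lambda$ is given by $x \pm z_{\alpha/2}\sqrt{x}$, provided that $x$ is reasonably large
  2. If $x$ is an observation from a binomial distribution with parameters $n$ and $p$ then an approximate $100(1 - \alpha)\%$ confidence interval for $p$ is given by $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$, where $\hat{p} = \frac{x}{n}$, provided $n$ is reasonably large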
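Both approximate intervals can be sketched with the standard library, assuming the usual normal-approximation forms $x \pm z\sqrt{x}$ for a Poisson mean and $\hat{p} \pm z\sqrt{\hat{p}(1 - \hat{p})/n}$ for a binomial proportion. The function names and the observed values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def poisson_interval(x, level=0.95):
    """Approximate interval for a Poisson mean from one observation x."""
    z = NormalDist().inv_cdf((1 + level) / 2)  # about 1.96 for 95%
    half_width = z * sqrt(x)
    return x - half_width, x + half_width

def proportion_interval(x, n, level=0.95):
    """Approximate interval for a binomial proportion from x successes in n trials."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    p_hat = x / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical observations
print(poisson_interval(100))
print(proportion_interval(40, 100))
```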

Further hypothesis testing for means

To carry out a hypothesis test for a mean based on a sample from a normal distribution with an unknown standard deviation:

The test statistic is $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ where $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Hypothesis tests for proportions and for the mean of a Poisson distribution

  1. To test hypotheses about a binomial population proportion, $p$, either:
    a. Determine the cumulative binomial probability of the observed value. State $H_0$, that the proportion remains the same, and $H_1$, that it exhibits the expected change. Then find the probability of a result at least as extreme as the one observed, and reject $H_0$ if the probability is lower than the level of the test.
    b. Use $z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}}$. The hypotheses are then dealt with like a regular normal hypothesis test.
  2. To test hypotheses about a Poisson population mean, $\lambda$, either:
    a. Determine the cumulative Poisson probability of the observed value, or
    b. Use $z = \frac{x - \lambda}{\sqrt{\lambda}}$

SS05

Continuous probability distributions

  1. If the random variable $X$ has probability density function $f(x) = \frac{1}{b - a}$ for $a < x < b$ (and 0 otherwise), where $a$ and $b$ are constants, it is said to follow a rectangular distribution
  2. The mean of $X$ is $\frac{a + b}{2}$ and the variance of $X$ is $\frac{(b - a)^2}{12}$
  3. The exponential distribution has probability density function $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$
  4. The exponential distribution with parameter $\lambda$ has mean $\frac{1}{\lambda}$ and standard deviation $\frac{1}{\lambda}$
  5. $P(X \le x)$ is known as the cumulative distribution function and is usually denoted $F(x)$
  6. For the exponential distribution with parameter $\lambda$, $F(x) = 1 - e^{-\lambda x}$
  7. If $a$ and $b$ are two constants and $a < b$, the probability that $X$ takes a value between $a$ and $b$ is $F(b) - F(a)$
  8. The intervals between successive events from a Poisson distribution with mean $\lambda$ are distributed according to the exponential distribution with parameter $\lambda$
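The exponential results above can be checked numerically. This sketch assumes the standard cumulative distribution function $F(x) = 1 - e^{-\lambda x}$; the function names and the rate of 2 events per hour are hypothetical.

```python
from math import exp

def exponential_cdf(x, rate):
    """F(x) = P(X <= x) for an exponential distribution with parameter `rate`."""
    return 1 - exp(-rate * x) if x > 0 else 0.0

def probability_between(a, b, rate):
    """P(a < X < b) = F(b) - F(a) for a < b."""
    return exponential_cdf(b, rate) - exponential_cdf(a, rate)

# Hypothetical Poisson process with mean 2 events per hour: the wait between
# events is exponential with parameter 2, so the mean wait is 1 / 2 hour
rate = 2
print(probability_between(0.5, 1.0, rate))  # wait of between 30 and 60 minutes
```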

Estimation

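  1. If $s^2$ denotes the variance estimate from a random sample of size $n$ from a normal population with variance $\sigma^2$, then $\frac{(n - 1)s^2}{\sigma^2} \sim \chi^2_{n-1}$
  2. The $\chi^2$ distribution is not symmetric so both lower and upper percentage points need to be read from tables
  3. A confidence interval for a normal population variance, $\sigma^2$, is given by $\frac{(n - 1)s^2}{\chi^2_U}$ and $\frac{(n - 1)s^2}{\chi^2_L}$, where $\chi^2_U$ and $\chi^2_L$ are the upper and lower percentage points of $\chi^2_{n-1}$
  4. Confidence limits for a normal population standard deviation, $\sigma$, are found by taking the square root of those calculated for the population variance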

Hypothesis testing: one sample tests

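  1. To test hypotheses about a normal population variance, $\sigma^2$, or standard deviation, $\sigma$, use $\chi^2 = \frac{(n - 1)s^2}{\sigma^2}$ with $n - 1$ degrees of freedom
  2. To test hypotheses about a normal population with mean, $\mu$, use $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ with $n - 1$ degrees of freedom when the population standard deviation is unknown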

Hypothesis testing: two-sample tests

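  1. To test hypotheses about the equality of two normal population variances, or standard deviations, use $F = \frac{s_1^2}{s_2^2}$ with $n_1 - 1$ and $n_2 - 1$ degrees of freedom
  2. To test hypotheses about the equality of (or given difference in) two normal population means, based upon independent random samples and known population variances use $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
    Note that for large $n_1$ and $n_2$ (a common rule of thumb is both at least 30) the requirement for normal populations can be relaxed and/or sample variances can be used as estimates of the population variances
  3. To test hypotheses about the equality of (or given difference in) two normal population means, based upon independent random samples and unknown but equal population variances use $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
    where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$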
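The two-sample t statistic with a pooled variance estimate can be sketched as follows, assuming the standard pooled form $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_1, sample_2, difference=0.0):
    """Return (t, degrees of freedom), assuming equal population variances."""
    n1, n2 = len(sample_1), len(sample_2)
    dof = n1 + n2 - 2
    # statistics.variance uses the n - 1 divisor, giving s_1^2 and s_2^2
    pooled = ((n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)) / dof
    standard_error = sqrt(pooled * (1 / n1 + 1 / n2))
    t = (mean(sample_1) - mean(sample_2) - difference) / standard_error
    return t, dof

# Hypothetical samples
t, dof = pooled_t_statistic([10, 12, 14], [7, 9, 11, 13])
print(t, dof)
```

The value of `t` is then compared with the critical value of the t distribution with `dof` degrees of freedom.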

Testing for goodness of fit

  1. $\sum \frac{(O - E)^2}{E}$ may be approximated by a $\chi^2$ distribution provided that
    • The $O$s are frequencies,
    • The $E$s are at least five,
    • The classes form a sample space, that is, every possible observation fits into one and only one class
  2. The number of degrees of freedom is the number of classes, minus the number of independent pieces of information derived from the $O$s in order to calculate the $E$s
  3. If there are $k$ classes and any necessary parameters are estimated from the data the number of degrees of freedom is $k - 2$ for a Poisson, binomial, or exponential distribution, and $k - 3$ for a normal distribution

SS06

Experimental design

  1. Experimental error is the effect of factors other than those controlled by the experimenter
  2. In a paired comparison, experimental error is reduced by applying both treatments to the same subjects or in the same conditions
  3. The purpose of randomisation is to eliminate bias
  4. Blocking is used to reduce experimental error by applying treatments (usually more than two) to the same subjects or in the same conditions
  5. If a new treatment is applied to an experimental group, a control group, which receives no treatment or the standard treatment, is needed to act as a measure of the effect of not applying the new treatment
  6. A placebo is a pill or treatment which contains no active ingredient
  7. In a blind trial subjects do not know whether they are receiving the treatment or a placebo
  8. In a double blind trial neither the subject nor the person administering the treatment knows whether a placebo or an active drug is being given

Analysis of paired comparisons

If $\bar{d}$ and $s_d$ denote the mean and standard deviation, respectively, of a random sample of $n$ differences that can be assumed to be normally distributed with mean $\mu_d$ then $\frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} \sim t_{n-1}$

Analysis of variance (ANOVA)

  1. The assumptions for the three models considered, one and two factor ANOVAs, and Latin square designs, are that:
    a. The observations are obtained independently and randomly from populations at each factor level (combination)
    b. These populations are (approximately) normally distributed with common variance $\sigma^2$
    c. When two or more factors are involved, there is no interaction between them

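  2. One way ANOVA table, for $k$ samples and $n$ observations in total

    | Source of variation | Sum of squares | Degrees of freedom | Mean square | F ratio |
    | --- | --- | --- | --- | --- |
    | Between samples | $SS_B$ | $k - 1$ | $MS_B = \frac{SS_B}{k - 1}$ | $\frac{MS_B}{MS_W}$ |
    | Within samples | $SS_W$ | $n - k$ | $MS_W = \frac{SS_W}{n - k}$ | |
    | Total | $SS_T$ | $n - 1$ | | |
  3. Two way ANOVA table, for $r$ rows and $c$ columns with one observation per cell

    | Source of variation | Sum of squares | Degrees of freedom | Mean square | F ratio |
    | --- | --- | --- | --- | --- |
    | Between rows | $SS_R$ | $r - 1$ | $MS_R = \frac{SS_R}{r - 1}$ | $\frac{MS_R}{MS_E}$ |
    | Between columns | $SS_C$ | $c - 1$ | $MS_C = \frac{SS_C}{c - 1}$ | $\frac{MS_C}{MS_E}$ |
    | Error | $SS_E$ | $(r - 1)(c - 1)$ | $MS_E = \frac{SS_E}{(r - 1)(c - 1)}$ | |
    | Total | $SS_T$ | $rc - 1$ | | |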

Provided in the formulae booklet

Statistical process control

  1. Statistical process control may be used when a large number of similar items are being produced. Its purpose is to give a signal when the process mean has moved away from the target value or when item-to-item variability has increased
  2. For control charts for means:
    • Sample mean between warning limits- No action
    • Sample mean between warning and action limits- Take another sample immediately. If new sample mean outside warning limits take action
    • Sample mean outside action limits- Take action
  3. The warning limits are set at $\mu \pm 1.96\frac{\sigma}{\sqrt{n}}$, and the action limits at $\mu \pm 3.09\frac{\sigma}{\sqrt{n}}$, where $\mu$ is the target value, $\sigma$ is the short-term standard deviation, and $n$ is the sample size
  4. Variability may be controlled by plotting the sample ranges or standard deviations on control charts. The limits for these charts are found by multiplying the process short-term standard deviation by factors found in the control charts for variability (Table 12)
  5. When the standard deviation must be estimated from a number of small samples the average sample range can be calculated and a factor from Table 12 applied.
    Alternatively $s^2$ can be calculated for each sample and the formula $\hat{\sigma} = \sqrt{\frac{\sum s^2}{k}}$ evaluated, where $k$ is the number of samples
  6. If the tolerance width exceeds six standard deviations the process should be able to meet the tolerances consistently, provided the mean is kept on target
  7. For charts for proportion non-conforming, $p$, providing $n$ is reasonably large:
    • The warning limits are $p \pm 1.96\sqrt{\frac{p(1 - p)}{n}}$
    • The action limits are $p \pm 3.09\sqrt{\frac{p(1 - p)}{n}}$
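The decision rule for a means chart could be sketched as below. The multipliers 1.96 and 3.09 are an assumption (the values commonly used for warning and action limits), and the function names, target, and standard deviation are hypothetical.

```python
from math import sqrt

WARNING_Z = 1.96  # assumed multiplier for warning limits
ACTION_Z = 3.09   # assumed multiplier for action limits

def mean_chart_limits(target, sigma, n):
    """Warning and action limits for the mean of a sample of size n."""
    standard_error = sigma / sqrt(n)
    return {
        "warning": (target - WARNING_Z * standard_error,
                    target + WARNING_Z * standard_error),
        "action": (target - ACTION_Z * standard_error,
                   target + ACTION_Z * standard_error),
    }

def classify_sample_mean(sample_mean, target, sigma, n):
    """Apply the three-way rule for a single sample mean."""
    deviation = abs(sample_mean - target) / (sigma / sqrt(n))
    if deviation <= WARNING_Z:
        return "no action"
    if deviation <= ACTION_Z:
        return "take another sample"
    return "take action"

# Hypothetical process: target 50, short-term standard deviation 2, samples of 4
limits = mean_chart_limits(50, 2, 4)
print(limits)
print(classify_sample_mean(52.5, 50, 2, 4))
```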

Acceptance sampling

  1. Acceptance sampling may be applied to large batches of similar items. It is the process of deciding whether or not the batch is acceptable by testing a small sample of the items
  2. The operating characteristic for an acceptance sampling by attributes plan is a graph of probability of acceptance against proportion non-conforming in the batch
  3. The probabilities may be found from the binomial distribution provided the sample is random and the sample size is small compared to the batch
  4. In double sampling, the number of non-conforming items in the first sample will determine whether a decision is made immediately or whether it is delayed until a second sample has been inspected
  5. For acceptance sampling by variables the operating characteristic is a graph of probability of acceptance against batch mean
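Points on an operating characteristic for an attributes plan can be computed from the binomial distribution. The sketch below assumes a hypothetical single sampling plan: take a sample of `n` items and accept the batch if at most `c` are non-conforming. The function name and plan values are illustrative only.

```python
from math import comb

def probability_of_acceptance(p, n, c):
    """P(at most c non-conforming in a sample of n) when the batch proportion is p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c + 1))

# Hypothetical plan: sample 20 items, accept if 1 or fewer are non-conforming
for proportion in (0.0, 0.05, 0.10, 0.20):
    print(proportion, round(probability_of_acceptance(proportion, n=20, c=1), 4))
```

Plotting these probabilities against the proportion non-conforming gives the operating characteristic.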

MS03

Bayes’ theorem

Example

| Event | P(A) | P(B) |
| --- | --- | --- |
| C | 0.3 | 0.8 |
| D | 0.5 | 0.1 |
| E | 0.2 | 0.4 |

After event A, event B occurred. Find the probability that event E occurred.

Let F be the probability that event B occurred.
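One way to finish the example, under an assumed reading of the table that the notes do not state explicitly: the P(A) column gives the probabilities of outcomes C, D and E, and the P(B) column gives the probability of B given each of those outcomes.

```latex
\begin{align*}
F = P(B) &= 0.3 \times 0.8 + 0.5 \times 0.1 + 0.2 \times 0.4 = 0.24 + 0.05 + 0.08 = 0.37 \\
P(E \mid B) &= \frac{P(E)\,P(B \mid E)}{P(B)} = \frac{0.2 \times 0.4}{0.37} = \frac{0.08}{0.37} \approx 0.216
\end{align*}
```

If the columns are meant differently, the same steps apply with the roles of the numbers swapped accordingly.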

Linear combinations of random variables

Covariance is a measure of the joint variability of two random variables.

The covariance can be used to find the product moment correlation coefficient of two random variables: $\rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}$

The variance can also be computed as

Distributional approximations

Mean and variance of binomial and Poisson distributions

Proof of $E(X) = np$ for binomial

Proof of $\text{Var}(X) = np(1 - p)$ for binomial