Hypothesis testing is a formal statistical procedure used to assess whether observed data provides sufficient evidence to reject a claim about a population parameter. It is one of the most important topics in A-Level Statistics, tested by all major exam boards (AQA, Edexcel, OCR), and it underpins much of the statistical reasoning used in science, medicine, business, and social research.
At A-Level, hypothesis testing is introduced using the binomial distribution (and later the normal distribution at A2 for some boards). You will learn to set up hypotheses, choose significance levels, calculate test statistics or probabilities, and draw conclusions in context. The language and logic of hypothesis testing are precise — examiners reward students who use correct terminology and structure their arguments clearly.
This guide covers the full A-Level hypothesis testing framework, with a focus on binomial tests, critical regions, and the interpretation of results.
Core Concepts
What is a Hypothesis Test?
A hypothesis test starts with a claim (or assumption) about a population parameter — typically a probability $p$ or a mean $\mu$. We collect data and ask: "Is this data consistent with the claim, or does it provide evidence against it?"
The test follows a structured process:
- Define the hypotheses.
- Choose a significance level.
- Collect data and calculate the test statistic.
- Compare with the critical value (or calculate the $p$-value).
- Draw a conclusion in context.
Null and Alternative Hypotheses
The null hypothesis $H_0$ is the default assumption — typically that nothing has changed or that a parameter takes a specified value. For example: $H_0\colon p = 0.5$.
The alternative hypothesis $H_1$ specifies what we suspect might be true instead. Writing $p_0$ for the value specified in $H_0$, it can take one of three forms:
- One-tailed (upper): $H_1\colon p > p_0$ (we suspect $p$ is larger)
- One-tailed (lower): $H_1\colon p < p_0$ (we suspect $p$ is smaller)
- Two-tailed: $H_1\colon p \neq p_0$ (we suspect $p$ is different, but don't specify the direction)
The choice of $H_1$ depends on the context of the problem and must be decided before looking at the data.
Significance Level
The significance level $\alpha$ is the probability of incorrectly rejecting $H_0$ when it is actually true (a Type I error). Common significance levels are:
- $\alpha = 0.05$ (5%) — the most common
- $\alpha = 0.01$ (1%) — more stringent
- $\alpha = 0.10$ (10%) — more lenient
The smaller the significance level, the stronger the evidence needed to reject $H_0$.
The Binomial Test
At A-Level, many hypothesis tests involve a binomial distribution. If we observe $X$ successes in $n$ independent trials, each with probability $p$ of success, then under $H_0$: $X \sim B(n, p_0)$,
where $p_0$ is the value of $p$ specified in $H_0$.
To carry out the test, we calculate the probability of obtaining a result as extreme as (or more extreme than) the observed value $x$, assuming $H_0$ is true.
For a one-tailed test ($H_1\colon p > p_0$):
Calculate $P(X \geq x)$ under $H_0$. If this probability is less than $\alpha$, reject $H_0$.
For a one-tailed test ($H_1\colon p < p_0$):
Calculate $P(X \leq x)$ under $H_0$. If this probability is less than $\alpha$, reject $H_0$.
For a two-tailed test ($H_1\colon p \neq p_0$):
Calculate the probability in the relevant tail and compare with $\alpha/2$ (since the significance level is split between both tails).
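These tail probabilities follow directly from the binomial formula $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. A minimal Python sketch (function names are illustrative, using only the standard library):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail(x, n, p):
    """P(X <= x): used when H1 is p < p0."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

def upper_tail(x, n, p):
    """P(X >= x): used when H1 is p > p0."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

# Illustration: X ~ B(20, 0.5), probability of 15 or more successes.
print(round(upper_tail(15, 20, 0.5), 4))  # 0.0207
```

Note that `lower_tail(x - 1, ...)` and `upper_tail(x, ...)` sum to 1, which is a useful sanity check when working with cumulative tables.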
Critical Regions and Critical Values
The critical region is the set of values of the test statistic that lead to rejection of $H_0$. The critical value is the boundary of this region.
For a binomial test with $H_1\colon p < p_0$ at the 5% level, the critical region consists of all values $x$ such that $P(X \leq x) \leq 0.05$. The largest such $x$ is the critical value.
For $H_1\colon p > p_0$, the critical region is in the upper tail: values $x$ such that $P(X \geq x) \leq 0.05$.
When the observed value falls in the critical region, we reject $H_0$. When it falls outside, we do not reject $H_0$.
The Actual Significance Level
Because the binomial distribution is discrete, we usually cannot achieve a significance level of exactly $\alpha$. The actual significance level is the probability of the critical region, which is as close to $\alpha$ as possible without exceeding it.
For example, if $X \sim B(20, 0.5)$ and the critical region is $X \leq 5$, then the actual significance level is $P(X \leq 5) \approx 0.0207$, not exactly $0.05$.
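Finding a lower-tail critical region amounts to scanning for the largest value whose cumulative probability stays at or below $\alpha$. A Python sketch (function names are illustrative):

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def lower_critical_region(n, p0, alpha):
    """Largest c with P(X <= c) <= alpha under H0, plus the actual significance level."""
    c = -1
    for x in range(n + 1):
        if binom_cdf(x, n, p0) <= alpha:
            c = x
        else:
            break  # the CDF is increasing, so no larger x can qualify
    actual = binom_cdf(c, n, p0) if c >= 0 else 0.0
    return c, actual

c, actual = lower_critical_region(20, 0.5, 0.05)
print(c, round(actual, 4))  # 5 0.0207
```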
Type I and Type II Errors
A Type I error occurs when we reject $H_0$ when it is actually true. The probability of a Type I error equals the significance level $\alpha$ (or, for a discrete test, the actual significance level).
A Type II error occurs when we fail to reject $H_0$ when it is actually false. The probability of a Type II error depends on the true value of the parameter and is harder to calculate.
| | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error | Correct decision |
| Don't reject $H_0$ | Correct decision | Type II error |
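The Type I error rate can be seen empirically by simulation: if $H_0$ is actually true, the long-run proportion of wrongly rejected tests matches the actual significance level. A sketch, assuming an illustrative test with $X \sim B(20, 0.3)$ and lower-tail critical region $X \leq 2$ (actual significance level $\approx 0.0355$):

```python
import random

random.seed(1)

n, p0 = 20, 0.3
critical = 2  # assumed lower-tail critical region X <= 2 at the 5% level

# Simulate many samples with H0 actually true (p = p0) and count
# how often we wrongly reject: this estimates the Type I error rate.
trials = 100_000
rejections = sum(
    1 for _ in range(trials)
    if sum(random.random() < p0 for _ in range(n)) <= critical
)
rate = rejections / trials
print(rate)  # close to 0.0355, not to the nominal 0.05
```

This also illustrates why the actual significance level, not the nominal $\alpha$, is the true Type I error probability for a discrete test.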
Writing Conclusions
Conclusions must be written in context and with appropriate language:
- Reject $H_0$: "There is sufficient evidence at the [x%] significance level to reject $H_0$ and conclude that [contextual statement about $H_1$]."
- Do not reject $H_0$: "There is insufficient evidence at the [x%] significance level to reject $H_0$. There is no significant evidence that [contextual statement about $H_1$]."
Important: we never say we "accept $H_0$" — we only say we "do not reject" it, because failing to find evidence against $H_0$ is not the same as proving it true.
Strategy Tips
Tip 1: Read the Context Carefully
The wording of the question tells you which alternative hypothesis to use. Phrases like "believes the proportion has increased" suggest $H_1\colon p > p_0$; "claims it has changed" suggests $H_1\colon p \neq p_0$.
Tip 2: Set Up Hypotheses Before Calculating
Always write down $H_0$ and $H_1$ before doing any calculations. This ensures you test the correct tail and use the correct comparison.
Tip 3: Use the Correct Tail Probability
For $H_1\colon p < p_0$, calculate $P(X \leq x)$ (lower tail). For $H_1\colon p > p_0$, calculate $P(X \geq x)$ (upper tail). Mixing these up is one of the most common errors.
Tip 4: State the Distribution Under
Explicitly write "Under $H_0$, $X \sim B(n, p_0)$". This earns a method mark and shows the examiner you understand the test framework.
Tip 5: Always Conclude in Context
A conclusion that says only "reject $H_0$" without reference to the real-world situation will lose marks. Always relate your answer back to the scenario described in the question.
Worked Example: Example 1
A manufacturer claims that 10% of items produced are defective. A quality inspector tests a random sample of 20 items and finds 4 defective. Test, at the 5% significance level, whether the proportion of defective items is greater than 10%.
$H_0\colon p = 0.1$ (the proportion of defective items is 10%)
$H_1\colon p > 0.1$ (the proportion is greater than 10%)
Significance level: $\alpha = 0.05$ (one-tailed test).
Under $H_0$: $X \sim B(20, 0.1)$, where $X$ is the number of defective items.
Observed value: $x = 4$.
Calculate $P(X \geq 4)$ under $H_0$:
Using binomial tables or a calculator: $P(X \geq 4) = 1 - P(X \leq 3) \approx 1 - 0.8670 = 0.1330$
Since $0.1330 > 0.05$, we do not reject $H_0$.
Conclusion: There is insufficient evidence at the 5% significance level to conclude that the proportion of defective items is greater than 10%.
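As a check, the tail probability can be computed in a few lines of Python, taking the example's figures to be $n = 20$, $p_0 = 0.1$, $x = 4$ (illustrative values):

```python
from math import comb

n, p0, x = 20, 0.1, 4  # illustrative figures for this example

# Upper-tail p-value: P(X >= x) under H0, with X ~ B(n, p0)
p_value = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x, n + 1))
print(round(p_value, 4))  # 0.133, which exceeds 0.05, so do not reject H0
```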
Worked Example: Example 2
A coin is suspected of being biased. It is tossed 20 times and lands on heads 15 times. Test at the 5% significance level whether the coin is biased towards heads.
$H_0\colon p = 0.5$ (the coin is fair)
$H_1\colon p > 0.5$ (the coin is biased towards heads)
Significance level: $\alpha = 0.05$ (one-tailed).
Under $H_0$: $X \sim B(20, 0.5)$.
Observed value: $x = 15$, giving $P(X \geq 15) \approx 0.0207$.
Since $0.0207 < 0.05$, we reject $H_0$.
Conclusion: There is sufficient evidence at the 5% significance level to conclude that the coin is biased towards heads.
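The same check in Python, taking the example's figures to be $n = 20$, $p_0 = 0.5$, $x = 15$ (illustrative values):

```python
from math import comb

n, p0, x, alpha = 20, 0.5, 15, 0.05  # illustrative figures for this example

# Upper-tail p-value: P(X >= x) under H0, with X ~ B(n, p0)
p_value = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x, n + 1))
print(round(p_value, 4))  # 0.0207
print("reject H0" if p_value < alpha else "do not reject H0")  # reject H0
```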
Worked Example: Example 3
Historically, 60% of students at a school achieve a grade A in maths. After introducing a new teaching method, a random sample of 20 students is taken and 15 achieve grade A. Test at the 5% significance level whether there is evidence that the proportion has changed.
$H_0\colon p = 0.6$, $H_1\colon p \neq 0.6$ (two-tailed test)
Significance level: $\alpha = 0.05$, so each tail has $0.025$.
Under $H_0$: $X \sim B(20, 0.6)$.
Observed value: $x = 15$. Since $15 > np_0 = 12$, we test the upper tail.
Using a calculator: $P(X \geq 15) \approx 0.1256$ (approximately)
Since $0.1256 > 0.025$ (the tail probability exceeds $\alpha/2$, the comparison value for each tail in a two-tailed test), we do not reject $H_0$.
Conclusion: There is insufficient evidence at the 5% significance level to conclude that the proportion of students achieving grade A has changed following the new teaching method.
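The two-tailed comparison can be verified in Python, taking the example's figures to be $n = 20$, $p_0 = 0.6$, $x = 15$ at the 5% level (illustrative values):

```python
from math import comb

n, p0, x = 20, 0.6, 15  # illustrative figures for this example
alpha = 0.05            # two-tailed, so each tail is compared with alpha / 2

# Observed x = 15 lies above the mean n * p0 = 12, so test the upper tail.
upper = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x, n + 1))
print(round(upper, 4))    # 0.1256
print(upper < alpha / 2)  # False, so do not reject H0
```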
Worked Example: Example 4
Find the critical region for a test of $H_0\colon p = 0.25$ against $H_1\colon p < 0.25$ using $n = 30$ at the 5% significance level.
We need the largest value $c$ such that $P(X \leq c) \leq 0.05$ under $H_0$, where $X \sim B(30, 0.25)$.
$P(X \leq 3) \approx 0.0374 \leq 0.05$ ✓
$P(X \leq 4) \approx 0.0979 > 0.05$ ✗
So the critical region is $X \leq 3$, i.e., $\{0, 1, 2, 3\}$.
The actual significance level is $0.0374$ ($3.74\%$).
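The search for the boundary can be automated in Python, taking the test to be $H_0\colon p = 0.25$, $H_1\colon p < 0.25$ with $n = 30$ (illustrative values):

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

n, p0, alpha = 30, 0.25, 0.05  # illustrative figures for this example

# Largest c with P(X <= c) <= alpha gives the lower-tail critical region X <= c.
c = max(x for x in range(n + 1) if binom_cdf(x, n, p0) <= alpha)
print(c, round(binom_cdf(c, n, p0), 4))  # 3 0.0374
```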
Practice Problems
Problem 1
A die is thought to be biased towards six. In 20 rolls, 7 sixes are observed. Test at the 5% level whether the die is biased towards six. ($H_0\colon p = \tfrac{1}{6}$, $H_1\colon p > \tfrac{1}{6}$.) [Hint: $X \sim B(20, \tfrac{1}{6})$ where $X$ is the number of sixes]
Problem 2
A charity claims that 30% of households donate. A survey of 25 households finds 4 donors. Test at the 5% level whether the proportion is less than 30%. [Answer: $P(X \leq 4) \approx 0.0905 > 0.05$, do not reject $H_0$]
Problem 3
Find the critical region for testing $H_0\colon p = 0.4$ against $H_1\colon p > 0.4$ with $n = 20$ at the 5% significance level. [Answer: $X \geq 13$, actual significance $\approx 2.1\%$]
Problem 4
A factory's defect rate has historically been 20%. After maintenance, a sample of 30 items reveals 2 defects. Is there evidence at the 5% level that the defect rate has decreased? [Hint: one-tailed test, calculate $P(X \leq 2)$ with $X \sim B(30, 0.2)$]
Problem 5
Explain what is meant by a Type I error in the context of Problem 1 above. State its probability.
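As a self-check, the answers to Problems 2 and 3 can be reproduced in Python, taking the figures to be $n = 25$, $p_0 = 0.3$, $x = 4$ and $n = 20$, $p_0 = 0.4$ respectively (illustrative values):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Problem 2: H0: p = 0.3, H1: p < 0.3, n = 25, observed x = 4, 5% level.
p_value = sum(binom_pmf(k, 25, 0.3) for k in range(0, 5))
print(round(p_value, 4))  # 0.0905, greater than 0.05, so do not reject H0

# Problem 3: H0: p = 0.4, H1: p > 0.4, n = 20, 5% level.
# The smallest c with P(X >= c) <= 0.05 gives the critical region X >= c.
for c in range(21):
    tail = sum(binom_pmf(k, 20, 0.4) for k in range(c, 21))
    if tail <= 0.05:
        print(c, round(tail, 4))  # 13 0.021
        break
```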
Common Mistakes
- Saying "accept $H_0$" instead of "do not reject $H_0$". This is a critical language error. We never prove $H_0$ true — we merely find insufficient evidence to reject it.
- Using the wrong tail. If $H_1\colon p > p_0$, you need the upper tail probability $P(X \geq x)$, not $P(X \leq x)$. Read carefully to determine the correct direction.
- Forgetting to halve $\alpha$ for two-tailed tests. In a two-tailed test, compare the tail probability with $\alpha/2$, not $\alpha$. Forgetting this effectively doubles the significance level.
- Not writing the distribution under $H_0$. Always state $X \sim B(n, p_0)$ explicitly. This is a required step in the method and earns marks.
- Vague or non-contextual conclusions. "Reject $H_0$" alone is not sufficient. You must relate the conclusion to the real-world scenario described in the question.
- Confusing $P(X = x)$ with $P(X \leq x)$. The $p$-value for a lower-tailed test is the cumulative probability $P(X \leq x)$, not the probability of that single value.
Frequently Asked Questions
Why don't we "accept" the null hypothesis?
Because failing to reject $H_0$ does not prove it is true. It merely means we did not find enough evidence against it. A different sample might yield different results. The correct phrase is "there is insufficient evidence to reject $H_0$".
How do I decide between a one-tailed and two-tailed test?
If the question suggests a specific direction of change (e.g., "believes the proportion has increased"), use a one-tailed test. If it says "test whether the proportion has changed" without specifying direction, use a two-tailed test.
What if my $p$-value exactly equals the significance level?
Convention varies, but at A-Level, if the $p$-value equals $\alpha$, we are on the boundary of the critical region. Most exam mark schemes treat this as "reject $H_0$" (the critical region includes the boundary), but read the question carefully.
Do I need to calculate binomial probabilities by hand?
You should be able to use the binomial probability formula and cumulative probabilities. In practice, many exam boards provide statistical tables or expect calculator use. Check your board's guidance.
What is the actual significance level, and why does it differ from $\alpha$?
The actual significance level is the exact probability of the critical region. Because the binomial distribution is discrete, we cannot always achieve exactly $\alpha$. The actual significance level is the largest possible probability that does not exceed $\alpha$.
Key Takeaways
Hypothesis testing follows a rigid structure. Define $H_0$ and $H_1$, state the significance level, identify the distribution under $H_0$, compute the probability, compare, and conclude in context.
$H_0$ represents the status quo. The null hypothesis is what we assume to be true unless the data provides sufficient evidence against it.
The significance level controls the Type I error. Choosing $\alpha = 0.05$ means we accept a 5% chance of incorrectly rejecting a true $H_0$.
Critical regions define rejection boundaries. If the observed test statistic falls in the critical region, we reject $H_0$. Otherwise, we do not.
Language matters enormously. Use "sufficient evidence to reject" and "insufficient evidence to reject" — never "accept $H_0$" or "prove $H_0$".
Context is king. Every conclusion must be expressed in terms of the original problem. Statistical jargon alone does not earn full marks.
