Probability and Sampling: Hypothesis Testing Errors and p-Values

Learning Outcomes

This article explains hypothesis testing errors and p-values in a CFA Level I context, including:

  • distinguishing clearly between Type I and Type II errors, and linking each error type to incorrect investment decisions and risk assessment;
  • interpreting the significance level (alpha) as the probability of a Type I error and explaining how changing alpha affects test conclusions and overall error trade-offs;
  • defining the power of a test, relating it to the probability of avoiding a Type II error, and recognizing how sample size, variance, and effect size influence power;
  • interpreting p-values as evidence against the null hypothesis, and determining when results are statistically significant for common alpha levels such as 1%, 5%, and 10%;
  • applying these concepts to CFA-style scenarios involving performance evaluation, trading strategies, and risk management, and justifying whether to reject or not reject the null hypothesis in exam questions;
  • evaluating the practical versus statistical significance of results, and avoiding common candidate mistakes when explaining findings in both written responses and multiple-choice items.

CFA Level 1 Syllabus

For the CFA Level 1 exam, you are required to understand hypothesis testing errors and how p-values assist in interpreting statistical significance, with a focus on the following syllabus points:

  • Define and distinguish between Type I and Type II errors in hypothesis testing.
  • Interpret the significance level (alpha), confidence level, and power of a test.
  • Explain what a p-value represents and how it relates to hypothesis test conclusions.
  • Explain statistical versus economic significance in an investment context.
  • Apply these concepts to make and justify statistical decisions in investment problems.

Test Your Knowledge

Attempt these questions before reading this article. If you find any of them difficult or cannot remember the answers, make a note to look more closely at that area during your revision.

  1. In testing whether a portfolio’s excess return is different from zero, a Type I error occurs when the analyst:
    1. concludes the excess return is zero when it actually is zero.
    2. concludes the excess return is different from zero when it actually is zero.
    3. concludes the excess return is zero when it is actually non-zero.
    4. uses the wrong test statistic for the hypothesis.
  2. The significance level α in a hypothesis test is best described as:
    1. the probability that the null hypothesis is true.
    2. the probability of making a Type II error.
    3. the probability of rejecting a true null hypothesis.
    4. the probability that the sample mean equals the population mean.
  3. An analyst tests whether a mutual fund outperforms its benchmark. The test produces a p-value of 0.03. If α = 0.05, the correct conclusion is to:
    1. reject the null hypothesis because 0.03 < 0.05.
    2. fail to reject the null hypothesis because 0.03 < 0.05.
    3. reject the null hypothesis because 0.03 > 0.01.
    4. fail to reject the null hypothesis because 0.03 > 0.01.
  4. A Type II error in testing whether an active strategy adds value corresponds to:
    1. concluding the strategy works when it actually does not.
    2. concluding the strategy does not work when it actually does.
    3. using a lower significance level than necessary.
    4. using a larger sample size than necessary.
  5. The power of a statistical test is:
    1. 1 − α and represents the confidence level.
    2. α and represents the probability of a Type I error.
    3. β and represents the probability of a Type II error.
    4. 1 − β and represents the probability of correctly rejecting a false null.

Introduction

Statistical hypothesis tests are central to quantitative methods and to much of investment research. Analysts use them to assess whether a manager has skill, whether a trading rule adds value, or whether a risk model is correctly specified. Correct interpretation, however, depends on understanding:

  • the two common types of decision errors (Type I and Type II);
  • how significance levels (α), confidence levels, and test power relate to those errors;
  • how to interpret p-values in relation to α; and
  • why statistical significance does not automatically imply economic significance.

Hypotheses and Test Statistics

A hypothesis test compares what we see in the sample with what would be expected if a particular statement about the population were true.

Key Term: Hypothesis test
A structured procedure for using sample data to decide whether to reject a stated hypothesis about a population parameter.

In every test we formulate two competing statements:

Key Term: Null hypothesis
A statement about a population parameter assumed true unless sample evidence strongly indicates otherwise; denoted H₀.

Key Term: Alternative hypothesis
A statement that contradicts the null hypothesis and is accepted if the null is rejected; denoted Hₐ.

For example, when evaluating a fund:

  • H₀: The fund’s mean excess return equals 0% (no skill).
  • Hₐ: The fund’s mean excess return is greater than 0% (positive skill).

We then compute a test statistic from the sample, such as a z-statistic or t-statistic, that measures how far the sample estimate lies from the hypothesized value in standard error units.

Key Term: Test statistic
A numerical value calculated from sample data that, together with a decision rule, determines whether to reject the null hypothesis.

The decision rule compares the test statistic to critical values, or equivalently compares the p-value to the chosen significance level. This is where error types, α, and power come in.
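
To make the mechanics concrete, here is a minimal Python sketch that computes a one-sample t-statistic by hand and checks it against scipy. The monthly excess returns are hypothetical, invented purely for illustration.

```python
# A minimal sketch: one-sample t-statistic computed by hand and with scipy.
import numpy as np
from scipy import stats

# Hypothetical monthly excess returns, in percent
excess_returns = np.array([0.8, -0.2, 1.1, 0.4, -0.5, 0.9, 0.3, 0.6])

x_bar = excess_returns.mean()              # sample mean
s = excess_returns.std(ddof=1)             # sample standard deviation (n - 1)
n = len(excess_returns)
t_stat = (x_bar - 0.0) / (s / np.sqrt(n))  # distance from H0 value in SE units

# scipy computes the same statistic plus its two-sided p-value
t_check, p_two_sided = stats.ttest_1samp(excess_returns, popmean=0.0)
print(f"t = {t_stat:.3f} (scipy: {t_check:.3f}), two-sided p = {p_two_sided:.3f}")
```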

Type I and Type II Errors

When you perform a hypothesis test, you decide whether to reject H₀ in favor of Hₐ. Because sample data are random, even a correctly conducted test can lead to a wrong decision. There are two mutually exclusive types of mistakes:

Key Term: Type I error
Rejecting a true null hypothesis; also called a false positive.

Key Term: Type II error
Failing to reject a false null hypothesis; also called a false negative.

In words:

  • Type I error: Saying “there is an effect” when in reality there is none.
  • Type II error: Saying “we see no effect” when in reality there is one.

In an investment context:

  • Type I error:
    • Example: You conclude a trading strategy delivers abnormal positive returns, when in fact its true abnormal return is zero. You might allocate capital to a strategy that only earns the benchmark return, or worse, after costs, underperforms.
  • Type II error:
    • Example: You conclude there is no evidence a manager adds value, when in fact the manager’s true alpha is positive. You might withdraw capital from a genuinely skilled manager.

The exam often frames questions exactly this way: “Which error corresponds to claiming an investment strategy is profitable when it is not?” (Type I) versus “Which error corresponds to missing a genuinely profitable strategy?” (Type II).

Significance Level (Alpha) and Confidence Level

To control Type I error risk, we choose a significance level, denoted α.

Key Term: Significance level (alpha)
The chosen probability of making a Type I error in a hypothesis test; denoted α.

If α = 5%, we accept that, in the long run, 5% of tests of a true H₀ will incorrectly reject it. For example, if no manager in a database truly has skill but we test 100 managers at α = 5%, we expect about 5 of them to appear “significant” just by chance.
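
A quick simulation sketch of this point, with hypothetical return parameters: test 100 managers whose true alpha is zero and count how many clear the 5% bar by chance.

```python
# Simulation sketch: 100 zero-skill managers tested at alpha = 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_managers, n_months = 0.05, 100, 60

false_positives = 0
for _ in range(n_managers):
    # monthly excess returns with a true mean of zero (no skill)
    returns = rng.normal(loc=0.0, scale=0.02, size=n_months)
    _, p_value = stats.ttest_1samp(returns, popmean=0.0)
    if p_value <= alpha:
        false_positives += 1

print(f"{false_positives} of {n_managers} managers appear significant by chance")
# Typically around 5, matching the Type I error rate alpha = 5%.
```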

The complement of α is the confidence level:

Key Term: Confidence level
The probability of not making a Type I error, equal to 1 − α, often expressed as a percentage such as 95% or 99%.

So:

  • α = 0.10 → 90% confidence level
  • α = 0.05 → 95% confidence level
  • α = 0.01 → 99% confidence level

Lower α means we require stronger evidence to reject H₀, which reduces the chance of a false positive but, as we will see, tends to increase the chance of a false negative.

Conceptually, every test has four possible outcomes:

  • Correctly fail to reject a true H₀ (probability 1 − α, the confidence level).
  • Incorrectly reject a true H₀ (Type I error, probability α).
  • Correctly reject a false H₀ (this probability is the power).
  • Incorrectly fail to reject a false H₀ (Type II error, probability β).

Power of a Test and Type II Errors

Key Term: Power of a test
The probability of correctly rejecting a false null hypothesis, equal to 1 − β, where β is the probability of a Type II error.

Holding everything else constant:

  • Decreasing α (for example, from 5% to 1%) reduces the probability of Type I error (false positive) but increases β, the probability of Type II error (false negative), so power falls.
  • Increasing α (for example, from 5% to 10%) has the opposite effect: more willingness to reject H₀ leads to higher power but higher Type I error risk.

The only way to reduce both α and β simultaneously is to improve the information content of the sample, mainly by increasing the sample size. Larger samples reduce the standard error of estimators, make true effects easier to detect, and therefore increase power without raising α.
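
The following simulation sketch illustrates the sample-size effect under assumed parameters (a true monthly alpha of 0.3% against 2% monthly volatility, with α fixed at 5%): estimated power climbs toward 1 as the sample grows.

```python
# Simulation sketch: power of a one-sided test rises with sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, true_alpha, vol, n_trials = 0.05, 0.003, 0.02, 5000

for n in (24, 60, 120, 240):  # months of data
    rejections = 0
    for _ in range(n_trials):
        returns = rng.normal(loc=true_alpha, scale=vol, size=n)
        # one-sided test of H0: mean <= 0 vs Ha: mean > 0
        _, p = stats.ttest_1samp(returns, popmean=0.0, alternative="greater")
        if p <= alpha:
            rejections += 1
    print(f"n = {n:3d} months -> estimated power = {rejections / n_trials:.2f}")
```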

Other factors that increase power (reduce the chance of Type II error) include:

  • larger true effect size (for example, a true alpha of 4% is easier to detect than 0.4%);
  • lower population variability (less noise in returns); and
  • choosing a one-sided test when appropriate (all else equal, it is easier to detect a positive effect if we do not allocate α to the opposite tail).

On the exam, if a question asks “Which change will increase the power of a test?”, answers like “increasing the sample size” or “reducing variance in the data” are correct, while “reducing α from 5% to 1%” will typically reduce power.

Understanding p-Values

Most statistical software reports p-values instead of critical values. You must be able to interpret them correctly.

Key Term: p-value
The probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually obtained.

Key points:

  • The p-value is computed under the assumption that H₀ is true.
  • It measures how incompatible the sample is with H₀: smaller p-values indicate stronger evidence against H₀.
  • It is not the probability that H₀ is true or false.

Decision rule using p-values:

  • If p-value ≤ α: reject H₀. The result is statistically significant at level α.
  • If p-value > α: fail to reject H₀. The result is not statistically significant at level α.

You can also interpret the p-value as:

  • the smallest significance level at which you would still reject H₀.

So if the p-value is 0.03:

  • you would reject H₀ at α = 10% and α = 5%;
  • you would not reject H₀ at α = 1%, as the sketch below confirms.
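
In code, the decision rule is a single comparison repeated at each significance level; a minimal sketch:

```python
# Minimal sketch of the p-value decision rule at common alpha levels.
p_value = 0.03

for alpha in (0.10, 0.05, 0.01):
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decision}")
# alpha = 0.10: reject H0
# alpha = 0.05: reject H0
# alpha = 0.01: fail to reject H0
```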

Statistical Significance

Key Term: Statistical significance
A description of a test result when the null hypothesis is rejected at the chosen significance level.

In CFA exam language, phrases such as “significant at the 5% level” mean that, using α = 0.05, the p-value was ≤ 0.05 and H₀ was rejected.

For common α levels:

  • At α = 0.10 (10%): more willing to reject H₀; higher Type I error risk, higher power.
  • At α = 0.05 (5%): standard choice; balances Type I and Type II errors.
  • At α = 0.01 (1%): very strong evidence required; Type I error risk is low, but Type II risk is higher.

Worked Example 1.1 – Using a p-Value and α

A CFA candidate tests whether a fund’s mean return exceeds the benchmark’s 8% return. The test is:

  • H₀: mean fund return ≤ 8%
  • Hₐ: mean fund return > 8%

The test produces a p-value of 0.03, and the candidate uses α = 0.05.

Answer:
Because the p-value (0.03) is less than α (0.05), the candidate rejects H₀ at the 5% significance level. There is statistically significant evidence that the fund outperformed the benchmark. The probability of incorrectly rejecting a true H₀ in this test (Type I error risk) is 5%, not 3%; 3% is the observed p-value, not the chosen α.
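
A sketch of how such a one-sided test could be run in Python; the annual fund returns below are hypothetical, so the resulting p-value will not match the 0.03 in the example.

```python
# Sketch of the one-sided test in Worked Example 1.1 on invented data.
import numpy as np
from scipy import stats

fund_returns = np.array([10.2, 7.8, 11.5, 9.1, 12.3, 8.7, 10.9, 9.6])  # % per year
benchmark = 8.0

# H0: mean return <= 8%, Ha: mean return > 8%
t_stat, p_value = stats.ttest_1samp(fund_returns, popmean=benchmark,
                                    alternative="greater")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.3f}")
print("reject H0" if p_value <= 0.05 else "fail to reject H0")
```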

Worked Example 1.2 – Failing to Reject the Null

You analyze returns for a stock and test:

  • H₀: the average return is zero
  • Hₐ: the average return is not zero

Your test statistic corresponds to a p-value of 0.15 using α = 0.05.

Answer:
Because the p-value (0.15) is greater than α (0.05), you do not reject H₀. The sample provides insufficient evidence to claim the stock’s average return differs from zero. This does not prove that the true mean is exactly zero; it simply means that, given the data and sample size, any deviation from zero cannot be distinguished from random noise. A Type II error is possible: the true mean might be non-zero, but the test failed to detect it.

Worked Example 1.3 – α and Type I Error Probability

An analyst tests whether a new investment model adds value. She reports: “Statistically significant at the 1% level; p = 0.007.”

What is the probability of incorrectly rejecting the null hypothesis in this test?

Answer:
With α = 0.01, the probability of a Type I error (rejecting a true null) is 1%. Because p = 0.007 < 0.01, the result is significant at the 1% level, and the analyst rejects H₀. There remains a 1% chance that this rejection is a false positive, assuming all test assumptions are satisfied.

Worked Example 1.4 – Linking α and Confidence Intervals

You test H₀: μ = 6% versus Hₐ: μ ≠ 6% for a normally distributed return. A 95% confidence interval for the mean is estimated as 4.5% to 7.5%.

Should you reject H₀ at the 5% significance level?

Answer:
A 95% confidence interval corresponds to α = 5% in a two-sided test. Since the hypothesized value 6% lies inside the interval [4.5%, 7.5%], you fail to reject H₀ at α = 0.05. If the hypothesized value had been outside this range, you would have rejected H₀ at the 5% level.
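
A sketch of this equivalence with invented return data chosen so the 95% confidence interval roughly brackets 6%: the hypothesized mean lies inside the interval exactly when the two-sided p-value exceeds 0.05.

```python
# Sketch: confidence-interval / two-sided-test equivalence on invented data.
import numpy as np
from scipy import stats

returns = np.array([5.1, 7.2, 6.8, 4.9, 6.5, 5.8, 7.4, 5.5, 6.1, 6.9])  # % per year
mu_0 = 6.0

n = len(returns)
x_bar, se = returns.mean(), stats.sem(returns)  # sem uses ddof=1 by default
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)

t_stat, p_value = stats.ttest_1samp(returns, popmean=mu_0)  # two-sided
print(f"95% CI: [{ci_low:.2f}%, {ci_high:.2f}%], p = {p_value:.3f}")
# mu_0 inside the CI  <=>  p > 0.05  <=>  fail to reject H0 at alpha = 5%
```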

Worked Example 1.5 – Type I vs Type II in a Trading Strategy

An analyst backtests a technical trading rule. She tests:

  • H₀: the rule’s mean excess return is zero
  • Hₐ: the rule’s mean excess return is positive

Case A: She uses α = 10%.
Case B: She uses α = 1%.

Assuming the true mean excess return is actually zero, in which case is she more likely to wrongly conclude that the rule works?

Answer:
When the true mean excess return is zero, wrongly concluding the rule works corresponds to a Type I error (rejecting a true H₀). The probability of a Type I error equals α. Therefore, she is more likely to make this error in Case A (α = 10%) than in Case B (α = 1%). Using a very low α (like 1%) makes it harder for spurious backtested rules to appear significant, but it also increases the risk of missing real opportunities when they exist (higher Type II error probability).

Economic vs Statistical Significance (Exam Warning)

Key Term: Economic significance
The extent to which a statistically estimated effect is large enough to matter in practice after considering costs, risk, and other real-world factors.

On the CFA exam, do not assume that statistical significance automatically implies real economic or investment significance.

Large samples reduce standard errors, making even tiny differences statistically significant. For example:

  • You test whether a strategy’s mean excess return differs from zero using thousands of daily observations.
  • You find a statistically significant mean excess return of 0.05% per year (p-value < 0.01).

Statistically, you reject H₀: the mean is not zero. Economically, a 0.05% annual excess return is negligible and likely wiped out by transaction costs, management fees, and taxes.
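
The mechanism is easy to reproduce in a simulation sketch. The parameters below are hypothetical and chosen for a quicker demonstration than the 0.05% example (a very low-volatility market-neutral book earning roughly 1% per year gross): the t-test flags the edge as highly significant even though costs could plausibly consume it.

```python
# Simulation sketch: low noise plus many observations makes a small edge
# look highly significant. All parameters are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_days = 2520          # roughly 10 years of daily data
daily_edge = 0.00004   # true mean excess return, ~1% per year
daily_vol = 0.0005     # a very low-volatility market-neutral book

excess = rng.normal(loc=daily_edge, scale=daily_vol, size=n_days)
t_stat, p_value = stats.ttest_1samp(excess, popmean=0.0)

annualized = excess.mean() * 252 * 100  # mean daily return -> % per year
print(f"t = {t_stat:.2f}, p = {p_value:.5f}, annualized edge = {annualized:.2f}%")
# The p-value is typically far below 0.01, yet a ~1% gross annual edge could
# be consumed by transaction costs on a high-turnover strategy.
```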

Always consider:

  • the magnitude of the effect (effect size);
  • implementation costs and constraints;
  • additional risk taken to earn the return.

The curriculum emphasizes that an analyst must distinguish “statistically significant” from “economically meaningful.”

Multiple Testing and the Data Snooping Problem

When many hypothesis tests are run on the same dataset, the chance of at least one false positive rises.

Key Term: Multiple testing problem
The increased risk of finding apparently significant results purely by chance when many hypothesis tests are run on the same or related data.

Suppose:

  • You test 100 independent strategies, each with H₀: “no abnormal return,” using α = 5%.
  • Even if every H₀ is true (no strategy truly adds value), on average about 5 strategies will appear significant purely by chance.

This is closely related to data snooping: repeatedly mining the same data until something “works.” Such apparently significant results often fail out-of-sample.
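
A back-of-the-envelope sketch of the arithmetic: even though each individual test holds its Type I error rate at 5%, the chance of at least one false positive across 100 independent tests of true nulls is nearly certain.

```python
# Sketch: expected false positives and the family-wise error rate across
# m = 100 independent tests of true nulls at alpha = 5%.
alpha, m = 0.05, 100

expected_false_positives = alpha * m     # about 5 on average
p_at_least_one = 1 - (1 - alpha) ** m    # P(at least one false positive)
print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one false positive) = {p_at_least_one:.3f}")  # ~0.994
```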

Implications for exam-style questions:

  • If a question mentions many models or rules tested on the same data, you should recognize a higher chance of Type I errors.
  • A single small p-value is less convincing when it comes from a large number of tests.
  • Robust evidence typically requires either strong economic rationale or confirmation on new, independent data.

Summary

Understanding Type I and Type II errors, the significance level, confidence level, and p-values is critical for interpreting hypothesis test results and communicating the risk of incorrect conclusions. Proper use of these concepts helps you decide when to reject or not reject a hypothesis and to explain clearly what kinds of errors are possible in performance evaluation, trading strategy assessment, and risk management applications.

Key Point Checklist

This article has covered the following key knowledge points:

  • The structure of a hypothesis test, including null and alternative hypotheses and the role of the test statistic.
  • The difference between Type I and Type II errors, and how each corresponds to different investment decision mistakes.
  • How the significance level (alpha) is chosen and interpreted as the probability of a Type I error; how it relates to the confidence level (1 − α).
  • The definition of test power (1 − β) and how sample size, variability, and true effect size influence the probability of a Type II error.
  • The definition and interpretation of p-values, and how to use them to decide whether to reject or fail to reject the null hypothesis at common α levels (10%, 5%, 1%).
  • The equivalence between critical-value decisions and p-value decisions and the link between hypothesis tests and confidence intervals.
  • Why statistically significant results may not be economically meaningful once transaction costs, fees, taxes, and risk are considered.
  • How multiple testing and data snooping increase the likelihood of false positives and why this matters when evaluating investment strategies.

Key Terms and Concepts

  • Hypothesis test
  • Null hypothesis
  • Alternative hypothesis
  • Test statistic
  • Type I error
  • Type II error
  • Significance level (alpha)
  • Confidence level
  • Power of a test
  • p-value
  • Statistical significance
  • Economic significance
  • Multiple testing problem
