Suppose you are reading about a statistically significant result x that just reaches the p-value threshold α, from a test T+ of the mean of a Normal distribution
H0: µ ≤ 0 against H1: µ > 0
with n iid samples and (for simplicity) known σ. The test “rejects” H0 at this level and infers evidence of a discrepancy in the direction of H1.
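For readers who want to compute along, here is a minimal sketch of T+ in Python; the helper names and the defaults σ = 1 and µ0 = 0 are my own illustrative choices, not anything dictated by the post.

```python
# Minimal sketch of test T+ (illustrative helpers, not from the post itself).
# T+ tests H0: mu <= 0 against H1: mu > 0 for a Normal mean with known sigma.
import numpy as np
from scipy.stats import norm

def p_value(xbar, n, sigma=1.0, mu0=0.0):
    """One-sided p-value of T+: P(sample mean >= xbar), computed under mu = mu0."""
    z = (xbar - mu0) * np.sqrt(n) / sigma
    return 1 - norm.cdf(z)

def power(mu_alt, alpha, n, sigma=1.0, mu0=0.0):
    """POW(mu'): probability that T+ rejects H0 at level alpha when mu = mu_alt."""
    z_crit = norm.ppf(1 - alpha)                  # cutoff for the standardized sample mean
    shift = (mu_alt - mu0) * np.sqrt(n) / sigma   # mean of the standardized statistic under mu_alt
    return 1 - norm.cdf(z_crit - shift)
```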
I have heard some people say:
A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’). See the point* on language in the Notes.
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.
I have heard other people say:
B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’).
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.
Which is correct, from the perspective of the frequentist error statistical philosophy (within which power and associated tests are defined)? A big HINT is below.
*Allow that the test assumptions are adequately met, at least to start with.
I have often said on this blog, and I repeat, that the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption that alternative µ’ is true: POW(µ’). I deliberately write it in this long, drawn-out, correct manner because it is faulty to speak of the power of a test without specifying the alternative against which it is to be computed.
Because fallacious uses of power (power howlers, as I call them) are so common, the concept of severity is deliberately designed not just to avoid them, but to make the basis for the fallacy so clear that no one will slip back into committing them. The claims of “good” and “poor” evidence get explicitly cashed out in terms of the high or low severity accorded to the associated claims.
But here we are, 3.5 years since the publication of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) [SIST], and these fallacies persist. We even hear that (what I claim is) the fallacious interpretation “is now well established”. Worse, the fallacious interpretations are taken as knock-down criticisms of (my notion of) severity (which instructs you as to the right way to interpret results)!
So I’m going to focus some blogposts on power howlers (some earlier ones are linked to below).
For the BIG HINT, I will draw from Excursion 4 Tour II of SIST, “Rejection Fallacies: Who’s Exaggerating What?” (pp. 239-240), quoted in blue [1]:
“How Could a Group of Psychologists be so Wrong? I’ll carry a single tome: Morrison and Henkel’s 1970 classic, The Significance Test Controversy. Some abuses of the proper interpretation of significance tests were deemed so surprising even back then that researchers in psychology conducted studies to try to understand how this could be. Notably, Rosenthal and Gaito (1963) discovered that statistical significance at a given level was often fallaciously taken as evidence of a greater discrepancy from the null the larger the sample size n. In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size.
What is shocking is that these psychologists indicated substantially greater confidence or belief in results associated with the larger sample size for the same p values. According to the theory, especially as this has been amplified by Neyman and Pearson (1933), the probability of rejecting the null hypothesis for any given deviation from null and p values increases as a function of the number of observations. The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population…The question is, how could a group of psychologists be so wrong? (Bakan 1970, p. 241)
(Our convention is for “discrepancy” to refer to the parametric, not the observed, difference [or effect size]. Their use of “deviation” from the null alludes to our “discrepancy”.)
As statistician John Pratt notes, “the more powerful the test, the more a just significant result favors the null hypothesis” (1961, p. 166). Yet we still often hear: “The thesis implicit in the [Neyman-Pearson, NP] approach, [is] that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases” (Howson and Urbach 1993, p. 209). In fact, the thesis implicit in the N-P approach, as Bakan remarks, is the opposite! The fallacy is akin to making mountains out of molehills according to severity (Section 3.2).
Mountains out of Molehills (MM) Fallacy (large n problem): The fallacy of taking a rejection of H0, just at level P, with larger sample size (higher power) as indicative of a greater discrepancy from H0 than with a smaller sample size.
Consider an analogy with two fire alarms: The first goes off with a sensor liable to pick up on burnt toast; the second is so insensitive, it doesn’t kick in until your house is fully ablaze. You’re in another state, but you get a signal when the alarm goes off. Which fire alarm indicates the greater extent of fire? Answer, the second, less sensitive one. When the sample size increases it alters what counts as a single sample. It is like increasing the sensitivity of your fire alarm. It is true that a large enough sample size triggers the alarm with an observed mean that is quite “close” to the null hypothesis. But, if the test rings the alarm (i.e., rejects H0) even for tiny discrepancies from the null value, then the alarm is poor grounds for inferring larger discrepancies. Now this is an analogy, you may poke holes in it. For instance, a test must have a large enough sample to satisfy model assumptions. True, but our interpretive question can’t get started without taking the P-values as legitimate and not spurious.”
A link to the proofs of Excursion 4 Tour II. For another big hint see [2].
Now, high power against alternative µ’ can result from increasing the sample size (as in the above variation of the MM fallacy), or it can result from selecting the value of alternative µ’ to be sufficiently far from µ0.
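To make the sample-size version of the MM fallacy concrete, here is a small numerical sketch with my own illustrative numbers (σ = 1, α = .025, µ’ = 0.2). Fixing a result that just reaches significance, the observed mean needed to reach α shrinks as n grows, POW(µ’) climbs, and the severity for the claim µ > µ’ drops; indeed, for a just-significant result, SEV(µ > µ’) = 1 − POW(µ’).

```python
# Illustration of the large-n (mountains out of molehills) point, with made-up numbers.
# Fix alpha and look at a result that *just* reaches significance, for increasing n.
import numpy as np
from scipy.stats import norm

alpha, sigma, mu_prime = 0.025, 1.0, 0.2     # illustrative choices
z_crit = norm.ppf(1 - alpha)

for n in [25, 100, 400, 1600]:
    xbar_cut = sigma * z_crit / np.sqrt(n)                           # smallest xbar reaching alpha
    pow_mu = 1 - norm.cdf(z_crit - mu_prime * np.sqrt(n) / sigma)    # POW(mu')
    sev = norm.cdf((xbar_cut - mu_prime) * np.sqrt(n) / sigma)       # SEV(mu > mu') at the cutoff
    print(f"n={n:5d}  just-significant xbar={xbar_cut:.3f}  "
          f"POW(mu')={pow_mu:.3f}  SEV(mu > mu')={sev:.3f}")
```

At n = 25 the just-significant mean is about 0.39 and SEV(µ > 0.2) ≈ .83; by n = 1600 the just-significant mean is about 0.05 and SEV(µ > 0.2) is essentially 0. The more sensitive alarm is the one liable to go off for burnt toast.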
The paper discussed in my last post includes a criticism of severity that instantiates the second form of the MM fallacy. I will come back to this, and to some other howlers (from other papers), later on. In this connection, see a question that might arise [3].
Share your constructive remarks in the comments.
Notes
*Point on language. “To detect alternative µ’” means “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data, obviously. The power of the test to detect µ’ just refers to the probability that the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.
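A simulation can make this reading of “detect” vivid (the numbers n = 100, σ = 1, α = .025, µ’ = 0.2 below are my own illustrative choices): generate many datasets from a world where µ = µ’ and record how often the alarm rings; that long-run frequency is POW(µ’).

```python
# Simulating the meaning of "power to detect mu'": the long-run frequency with which
# T+ rings the significance alarm when the data really are generated with mu = mu'.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)
alpha, sigma, n, mu_prime = 0.025, 1.0, 100, 0.2   # illustrative settings
z_crit = norm.ppf(1 - alpha)

reps = 100_000
xbars = rng.normal(mu_prime, sigma / np.sqrt(n), size=reps)   # sample means drawn under mu = mu'
reject_rate = np.mean(xbars * np.sqrt(n) / sigma >= z_crit)
print(reject_rate)   # ~0.52, matching the analytic POW(mu' = 0.2) for these settings
```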
[1] It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are not to point values! (Not even to the point µ = M0.)
[2] Big Hint: Ask yourself: what is the power of test T+ (H0: µ ≤ 0 against H1: µ > 0) against µ = 0, i.e., against the null value µ0 itself?
The answer is α! So, for example, if we set α = .025, then the power of the test at µ = 0 is POW(µ = 0) = α = .025. Because the power against µ = 0 is low, the just statistically significant result is good evidence that µ > 0.
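A quick check of this, under the same illustrative setup as the sketches above: at µ = 0 the standardized sample mean has no shift at all, so the rejection probability is exactly α, whatever the sample size.

```python
# Power of T+ at the null value mu = 0: the standardized mean has no shift there,
# so the probability of ringing the alarm is just alpha, for any n.
from scipy.stats import norm

alpha = 0.025
z_crit = norm.ppf(1 - alpha)
print(1 - norm.cdf(z_crit))   # 0.025 = alpha, regardless of sample size
```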
[3] Does A hold true if we assume that we know (based on previous severe tests) that µ < µ’? I’ll return to this.
OTHER RELEVANT POSTS ON POWER (you’ll find more by searching “power” on this blog)
- 3/4/14 Power, power everywhere–(it) may not be what you think! [illustration]
- 3/12/14 Get empowered to detect power howlers
- 3/17/14 Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- 3/19/14 Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- 12/29/14 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- 01/03/15 No headache power (for Deirdre)
- 02/10/15 What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?