Truth, utility, and null hypotheses

Discussing criticisms of null hypothesis significance testing (NHST) McShane & Gal write (emphasis mine):

More broadly, statisticians have long been critical of the various forms of dichotomization intrinsic to the NHST paradigm such as the dichotomy of the null hypothesis versus the alternative hypothesis and the dichotomization of results into the different categories statistically significant and not statistically significant. … More specifically, the sharp point null hypothesis of $\theta = 0$ used in the overwhelming majority of applications has long been criticized as always false — if not in theory at least in practice (Berkson 1938; Edwards, Lindman, and Savage 1963; Bakan 1966; Tukey 1991; Cohen 1994; Briggs 2016); in particular, even were an effect truly zero, experimental realities dictate that the effect would generally not be exactly zero in any study designed to test it.

I really like this paper (and a similar one from a couple years ago), but this kind of reasoning has become one of my biggest statistical pet peeves. The fact that “experimental realities” generally produce non-zero effects in statistics calculated from samples is one of the primary motivations for the development of NHST in the first place. It is why, for example, the null hypothesis is typically – read always – expressed as a distribution of possible test statistics under the assumption of zero effect. The whole point is to evaluate how consistent an observed statistic is with a zero-effect probability model.

Okay, actually, that’s not true. The point of statistical testing is to evaluate how consistent an observed statistic is with a probability model of interest. And this gets at the more important matter. I agree with McShane & Gal (and, I imagine, at least some of the people they cite) that the standard zero-effect null is probably not true in many cases, particularly in social and behavioral science.

The problem is not that this model is often false. A false model can be useful. (Insert Box’s famous quote about this here.) The problem is that the standard zero-effect model is very often not interesting or useful.

Assuming zero effect makes it (relatively) easy to derive a large number of probability distributions for various test statistics. Also, there are typically an infinite number of alternative, non-zero hypotheses. So, fine, zero-effect null hypotheses provide non-trivial convenience and generality.

But this doesn’t make them scientifically interesting. And if they’re not scientifically interesting, it’s not clear that they’re scientifically useful.

In principle, we could use quantitative models of social, behavioral, and/or health-related phenomena as interesting and useful (though almost certainly still false) “null” models against which to test data or, per my preferences, to estimate quantitative parameters of interest. Of course, it’s (very) hard work to develop such models, and many academic incentives push pretty hard against the kind of slow, thorough work that would be required in order to do so.