This post is a response, of sorts, to the most recent episode of The Black Goat podcast. It is “of sorts” because it’s not just a response, but this post was definitely inspired by the episode. Anyway, the point is that I stole the title of the episode for this post.
First things first – it’s a good episode, well worth a listen. I listened to (most of) it twice, in fact, taking notes the second time, since I wanted to make sure I was remembering things correctly so that I didn’t misrepresent what they (Srivastava, Tullett, and Vazire; henceforth ST&V) talk about in my discussion here.
There are two main parts: a discussion of a listener question about whether to trust open-science advocates’ pre-rep*-crisis work more than that of non-open-science advocates, and a longer discussion of the hosts’ history with p-values and the various discussions about p-values going on in behavioral and medical sciences (and probably elsewhere). My focus here will be on the latter (which is why the second listen was only to most of the episode – from the 26 minute mark, for what it’s worth).
The p-value discussion starts with each host giving a brief overview of his/her “history with p-values” in classes and in the lab. I’ll do the same here.
A Brief Overview of My History With P Values
The first class I took that was directly relevant to statistical analysis had relatively little statistical analysis in it. It was a second language testing class at the University of Hawai`i. It covered some descriptive stats, but the course focused much more on validity than a standard stats class would. I’m sure my memories aren’t totally accurate, but I think of it now as a course on measurement more than anything else. We probably discussed p values at some point, but statistical testing was not the primary focus of the class.
I took some more standard stats classes from a guy in Educational Psychology at UH, too (he sounded exactly like Donald Sutherland, and kinda-sorta looked like him, which was great). These were more standard, but, I think, not completely like a run of the mill psych stats class. This is maybe illustrated by the fact that the textbook for one course was Measurement, Design, and Analysis: An Integrated Approach. I still own and use it, 16 years later. I’m quite sure p values and NHST came up a lot in the classes I took from Professor Sutherland, but they were secondary to how design and analysis are linked.
At Indiana University, where I went after UH to pursue a PhD in Linguistics and Cognitive Science, I took the two stats courses required for Psychology students. The first was taught by John Kruschke. He was always very careful to define p-values correctly, so this got drilled into my head, early and thoroughly. Also, his class was structured exceptionally well, so the relationships between different statistical tests and the general logic of NHST have stuck with me.
I also took a seminar in Bayesian statistics from Kruschke, the first time he taught a course on this topic. He used the chalkboard and didn’t have prepared slides, which was very different from the other classes I had taken from him. I asked him why at one point, and he said that it was a one-off course for him, so it wasn’t worth the time and energy to prepare elaborate materials he wouldn’t use again. Later on, I was Kruschke’s grader for what became a two-semester sequence on Bayesian stats as he wrote the first edition of his textbook on Bayesian stats. Now he’s on the second edition. Predicting the future is hard. Also, problems with NHST are covered in his book and lectures, so p values were at least a small part of my initial training with Bayesian stats, too.
In addition to those courses, I took a lot of mathematical psychology and cognitive modeling courses at IU, some of which touched on various aspects of statistical testing. A fair amount of this training had nothing to do with p values, though there was a lot of probability theory.
After I was done with my PhD coursework, I also took a couple semesters of classes to fulfill the requirements for the then newly-established MS in Applied Statistics (the rest of the requirements for this degree were fulfilled by the earlier stats and math psych classes I had taken). These post-PhD-coursework courses deepened (and filled some gaps in) my knowledge, and I learned how to do statistical consulting effectively (from the soon-to-be president elect of the ASA). Perhaps not surprisingly, p values came up a lot in some of these classes (e.g., the two-semester sequence on linear models), but less so in others (another course on Bayesian stats, and the statistical theory course, which focused a lot more on things like the central limit theorem, consistency, bias, maximum likelihood estimation, and various other mathematical underpinnings of statistics).
In the research I was doing in the lab(s) I worked in (the Mathematical Psychology Lab, though I ran experiments in the Auditory Perception Lab on the other side of campus; I also did some work in the Linguistic Speech Lab), p values played a relatively small role. Early on, I remember using p values in fairly standard ways (e.g., to pick out what was worth talking about; see my first paper, for example). As I learned more about math psych and Bayesian stats, p values became more irrelevant, though they haven’t disappeared from my research completely.
Finally, I got valuable experience with new (to me) statistical tools in my first job after graduating. I participated in workshops on diagnostic measurement and structural equation modeling at the University of Maryland, and the work I was doing was essentially I/O psychology (as confirmed by an actual I/O psychologist). Plenty of p values here, many of them, in retrospect, likely misused, though not misinterpreted.
Okay, so, well, that was maybe less brief than I envisioned, but for the purposes of discussing my significant feelings, I think it’s important to explain where I’m coming from.
ST&V touch on a wide range of topics in their discussion of p values. As mentioned above, the whole thing is worth listening to, but I’m not going to talk about everything they get into.
Their discussion spurred me to think some more about
- how evaluation of statistical analyses is related to evaluation of design and measurement (and how the former is subordinate to the latter);
- how individuals evaluate evidence and how this relates to group-level evaluation of evidence;
- how we think about (or struggle to think about) continuous vs discrete reports, and how these do or do not relate to how we (struggle to) think about probability;
- and how all of this relates (or does not relate) to Bayesian and frequentist statistical philosophies.
Around the 42-minute mark, ST&V start talking about the paper Redefine Statistical Significance (RSS), which Vazire is a co-author of (here is my very negative take on RSS). This leads to a more general discussion about statistical cutoffs and dichotomous vs graded thinking. At around the 53-minute mark, Tullett* brings up differences between evaluation of study designs and evaluation of statistical results, positing that it’s easier to give nuanced evaluation of the former than the latter.
(* I am reasonably confident that it is Tullett talking here and not Vazire. Their voices are different, but I don’t know either voice well enough to be able to map the voices to the names consistently. That is, I would do very well with a discrimination task, but probably substantially less well with a recognition task.)
(1) This is fine as far as it goes – design is, along with measurement, complex and multifaceted. But I want to push this distinction further. It occurred to me a while ago that it is problematic to discuss isolated statistics as “evidence.” I think ST&V mention “statistical evidence” a few times in this discussion, the word “evidence” is used a lot in RSS and the response papers (see links at the top of my RSS post), and I’m pretty sure I’ve seen Bayesians refer to the denominator in Bayes’ rule as “evidence.”
But evidence has to be of something. Without a larger context – a research question, a design, a measurement tool, and a statistical model, at least – a p value is just (tautological) “evidence” of a range of test statistic values being more or less probable. It’s just – per the definition – a statement about the probability of observing a test statistic as extreme as or more extreme than what you’ve observed under the assumption that the null hypothesis is true. Which is to say that, in isolation, a p value doesn’t tell us anything at all. Similarly for Bayes factors, sets of *IC fit statistics, or whatever other statistical summary you like.
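To make the definition concrete, here is a minimal Python sketch of a two-sided p value. It assumes a z test with a known standard deviation, which is my choice for illustration and not anything ST&V discuss; the function names are hypothetical. Note how much has to be fixed (the null mean, the standard deviation, the sample size, the choice of test statistic) before the number means anything at all.

```python
from statistics import NormalDist


def z_statistic(sample_mean: float, null_mean: float, sigma: float, n: int) -> float:
    """Standardized distance of the sample mean from the null mean.

    Assumes a known population standard deviation (sigma), which is an
    illustrative simplification.
    """
    return (sample_mean - null_mean) / (sigma / n ** 0.5)


def two_sided_p_value(z: float) -> float:
    """Probability of a z statistic at least as extreme as the observed one,
    computed under the assumption that the null hypothesis is true."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: observed mean 0.5, null mean 0, sigma 1, n = 16.
z = z_statistic(sample_mean=0.5, null_mean=0.0, sigma=1.0, n=16)
print(z, two_sided_p_value(z))
```

Change the null mean, the design, or the measurement, and the same data give a different p value, which is the sense in which the number on its own is not evidence of anything in particular.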
(2) So, when it comes to evaluation of evidence, an individual must take research questions, design, measurement, and statistics into account. And, as I mentioned in my post on RSS, individuals have now, and always have had, the ability to evaluate evidence as stringently or as laxly as they believe is appropriate for any given situation.
Of course, things get complicated if an individual wants to persuade other individuals about a scientific issue. Shared standards undoubtedly help individuals achieve scientific consensus, but the complications of scientific disagreement and agreement can’t be solved by fixed, arbitrary statistical thresholds. Such thresholds might be a necessary condition of scientific consensus, though I’m skeptical, and they definitely aren’t sufficient.
(3) All that said, when an individual evaluates a scientific report, there can be value in discrete categories. I like the discussion (around the 54-minute mark) of thinking about results in terms of whether you buy them or not and how much you would bet on a replication being successful. I also find the argument for the utility of a three-category fire safety system vs a continuous numerical system compelling.
But it’s worth keeping in mind that, even if a discrete system can be better for certain kinds of decisions, a continuous system can be better for other kinds of decisions. “Buy it or not” might be the best way to evaluate a study for some purposes, but from a meta-analytic mind-set, dichotomizing evidence is just throwing information away.
Tangentially related, I agree wholeheartedly with Vazire* that people are bad at probabilistic thinking. Even with all the statistical training and research experience described above, I often have to exert substantial cognitive effort to be careful with probabilistic thinking. That said, I think it’s worth drawing a distinction between, on the one hand, categorical vs continuous thinking, and, on the other, deterministic vs probabilistic thinking.
Obviously, probability plays a role in all this, and the part of the discussion about intuitions about p value distributions under null and alternative hypotheses (~38 minutes) is both interesting and important. But the relative value of dichotomization (or trichotomization) vs indefinitely graded descriptions of statistical analyses is not (only) about probability. It’s also about (mis)interpretation of basic, non-random arithmetical facts (see, e.g., McShane & Gal, 2016, 2017).
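For anyone who wants to check their own intuitions about p value distributions, here is a rough simulation sketch in Python. Everything specific in it is an assumption of mine for illustration: a z test with known standard deviation 1, n = 30 per study, a true mean of 0.5 under the alternative, and the hypothetical function names. Under the null, p values should be roughly uniform, so about 5% fall below .05; under the alternative, they pile up near zero.

```python
import random
from statistics import NormalDist, fmean


def simulate_p_values(true_mean: float, n: int = 30, sims: int = 5000, seed: int = 1) -> list:
    """Two-sided z-test p values for H0: mean = 0, with known sigma = 1.

    Each simulated study draws n observations from Normal(true_mean, 1).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(sims):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # (mean - 0) / (1 / sqrt(n))
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values


def share_below(p_values: list, cutoff: float = 0.05) -> float:
    """Proportion of p values falling below the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)


null_p = simulate_p_values(true_mean=0.0)  # null hypothesis is true
alt_p = simulate_p_values(true_mean=0.5)   # assumed alternative
print("share below .05 under the null:", share_below(null_p))
print("share below .05 under the alternative:", share_below(alt_p))
```

Looking at the full lists (a histogram of each, say) rather than just the share below a cutoff is the graded counterpart to the dichotomized summary.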
(4) Finally, I agree exuberantly with the point that Vazire* makes late in the podcast that these issues are orthogonal to debates about Bayesian and frequentist philosophies of statistics. Srivastava is absolutely right that categorizing Bayes factors as “anecdotal” or “strong evidence” is just as problematic (or not) as categorizing p values as statistically significant, (suggestive,) or not statistically significant. Or maybe I should say that he should have gone further than saying that this kind of categorization “starts to feel a little threshold-y.” It is 100% threshold-y, and so it has all the problems that fixed, arbitrary thresholds have.
If you insist on using thresholds, you should justify your Bayes factor thresholds just as surely as you should Justify Your Alpha. I’m more and more convinced lately that we should be avoiding thresholds (see the McShane & Gal papers linked above, see also the papers linked in the post on the meta-analytic mind-set), but I agree with the larger point of the Justify Your Alpha paper that research decisions (statistical and non) should be described as fully and as transparently as possible.
One last point about metric vs nominal reports and Bayesian vs frequentist statistics: although the former is logically unrelated to the latter, in my experience, Bayesian statistics involves a lot more modeling (model building, parameter estimation, and data and model visualization) than does frequentist statistical testing. Obviously, you can do frequentist model building, parameter estimation, and visualization, but standard psych statistical training typically focuses on (null hypothesis significance) testing, which is inherently, inextricably tied up with categorical reports (e.g., dichotomization). The point being that, while I don’t think there are any deep philosophical reasons for Bayesians to be more friendly to continuous, graded reports than are frequentists, there are contingent historical reasons for things to play out this way.
Significant Feelings is a very good episode of The Black Goat podcast, but arbitrary thresholds are bad.