Multivariate normal CDF values in Python

I was very happy to realize recently that a subset of Alan Genz’s multivariate normal CDF functions are available in Scipy. I first learned of Dr. Genz’s work when I started using the mnormt R package, which includes a function called sadmvn that gives very precise, and very accurate, multivariate normal CDF values very quickly.

In case you don’t know, this is quite an achievement, since there is not a closed form solution. I’ve spent far too much time reading strange, complicated papers found in the deepest recesses of google (i.e., the third page of search results) that claim to provide fast, accurate approximations to multivariate normal CDFs. As far as I can tell, none of these claims hold any water. None other than Genz’s, anyway.

Okay, so Scipy has two relevant functions, but they’re kind of buried, and it might not be obvious how to use them (at least if you don’t know to look at Genz’s Fortran documentation). So, for the benefit of others (and myself, in case I need a refresher), here’s where they are and how to use them.

First, where. In the Scipy stats library, there is a chunk of compiled Fortran code. I’ve copied it here, just in case it disappears from Scipy someday. Should that come to pass, and should you want this file, just save that ‘plain text’ file, rename it, and you should be good to go.

Otherwise, if you’ve got Scipy, you can just do this:

from scipy.stats import mvn

Now, mvn will have three methods, two of which – mvndst and mvnun – are what we’re looking for here.

The first works like this:

error,value,inform = mvndst(lower,upper,infin,correl,...)

Which is to say that it takes, as arguments, lower and upper limits of integration, ‘infin’ (about which more shortly), and correl (as well as some optional arguments). This is, in turn, to say that it assumes that your multivariate normal distribution is centered at the origin and that you’ve normalized all the variances.
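Since mvndst assumes a standardized distribution, you have to transform a general problem yourself before calling it. Here is a minimal sketch of how that might look (the helper name is mine, and the packed-lower-triangle ordering for correl is my reading of Genz's CORREL documentation, so double-check it against your version):

```python
import numpy as np

def standardize(lower, upper, mean, cov):
    """Convert bounds on a general MVN into the standardized form
    mvndst expects: zero means, unit variances, correlations only."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    sd = np.sqrt(np.diag(cov))
    low = (np.asarray(lower, dtype=float) - mean) / sd
    upp = (np.asarray(upper, dtype=float) - mean) / sd
    corr = cov / np.outer(sd, sd)
    # Genz's CORREL appears to be the strictly lower triangle of the
    # correlation matrix, packed row by row
    correl = corr[np.tril_indices_from(corr, k=-1)]
    return low, upp, correl
```

The standardized low, upp, and correl can then be handed to mvndst along with the infin flags described below.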

This function is straightforward to use, except for, perhaps, the ‘infin’ argument. From Genz’s documentation:

*     INFIN  INTEGER, array of integration limits flags:
*           if INFIN(I) < 0, Ith limits are (-infinity, infinity);
*           if INFIN(I) = 0, Ith limits are (-infinity, UPPER(I)];
*           if INFIN(I) = 1, Ith limits are [LOWER(I), infinity);
*           if INFIN(I) = 2, Ith limits are [LOWER(I), UPPER(I)].

Which is to say that you put a negative number in if you want, on dimension I, to integrate from -Inf to Inf, 0 if you want to integrate from -Inf to your designated upper bound, 1 if you want to integrate from your designated lower bound to Inf, and 2 if you want to use both of your designated bounds.
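In Python terms, you might build the infin array from per-dimension bounds, with None standing in for an infinite limit. A quick sketch (make_infin is my own name, not part of Scipy):

```python
import numpy as np

def make_infin(lower, upper):
    """Map per-dimension bounds (None = infinite) onto Genz's INFIN codes."""
    infin = []
    for lo, up in zip(lower, upper):
        if lo is None and up is None:
            infin.append(-1)   # (-inf, inf)
        elif lo is None:
            infin.append(0)    # (-inf, upper]
        elif up is None:
            infin.append(1)    # [lower, inf)
        else:
            infin.append(2)    # [lower, upper]
    return np.array(infin)

print(make_infin([None, -1.0, 0.0], [2.0, None, 1.0]))  # [0 1 2]
```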

Also from Genz’s documentation:

*     INFORM INTEGER, termination status parameter:
*          if INFORM = 0, normal completion with ERROR < EPS;
*          if INFORM = 1, completion with ERROR > EPS and MAXPTS 
*                         function vaules used; increase MAXPTS to
*                         decrease ERROR;
*          if INFORM = 2, N > 500 or N < 1.

Here, N seems to be the number of dimensions (and is the first argument in Genz’s MVNDST Fortran function, but is not in the similar/corresponding R or Python functions).  In any case, it’s the 0 and 1 that seem most informative, and the MAXPTS variable is one of the optional arguments I mentioned above.

The other function allows for non-zero means and covariance (as opposed to correlation) matrices, but it doesn’t, technically speaking, allow for integration to or from +/-Infinity:

value,inform = mvnun(lower,upper,means,covar,...)

As it happens, and as shouldn’t be too surprising, you can give it large magnitude bounds and get essentially the same answer. As long as you’re sufficiently far away from the mean (meaning as long as you’re more than a few standard deviation units away), the difference between the +/-Inf bound and the finite bound will only show up quite a few decimal places into your answer.

If you’ve got numpy imported as np, you could, for example, do this:

In [54]: low = np.array([-10, -10])

In [55]: upp = np.array([.1, -.2])

In [56]: mu = np.array([-.3, .17])

In [57]: S = np.array([[1.2,.35],[.35,2.1]])

In [58]: p,i = mvn.mvnun(low,upp,mu,S)

In [59]: p
Out[59]: 0.2881578675080012

With more extreme values for low, we get essentially the same answer (with a difference only showing up in the 12th decimal place):

In [60]: low = np.array([-20, -20])

In [61]: p,i = mvn.mvnun(low,upp,mu,S)

In [62]: p
Out[62]: 0.2881578675091007

Still more extreme values don’t change it at all:

In [63]: low = np.array([-100, -100])

In [64]: p,i = mvn.mvnun(low,upp,mu,S)

In [65]: p
Out[65]: 0.2881578675091007
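One caveat: newer Scipy releases have, as far as I can tell, removed the private scipy.stats.mvn module, but the same Genz routine backs scipy.stats.multivariate_normal.cdf, which reproduces the value above (treat the version details as an assumption and check your own install):

```python
import numpy as np
from scipy.stats import multivariate_normal

upp = np.array([0.1, -0.2])
mu = np.array([-0.3, 0.17])
S = np.array([[1.2, 0.35], [0.35, 2.1]])

# cdf integrates from -inf on each dimension, so no large-magnitude
# stand-in for the lower bound is needed here
p = multivariate_normal.cdf(upp, mean=mu, cov=S)
print(p)  # ~0.2881578675
```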

All of this is important to me because I’m working on building a Bayesian GRT (e.g.) model in PyMC, and I’m hoping I’ll be able to use this function to get fast and accurate probabilities, given a set of mean and covariance parameters.

Not (an argument against) the “old evidence” problem

I recently started reading Deborah G. Mayo’s new book Statistical Inference as Severe Testing. So far, so good. It’s thought-provoking, and I wholeheartedly agree with her that it’s important for researchers to understand the philosophical foundations of statistics (and of science in general). I also happen to think that the logic of test severity is important and worth understanding.

Perhaps I will write more about my agreements with her arguments in this book at a later date, but today I am writing about what seems to me to be a pretty obvious error in an early chapter. Or maybe it’s not an error but, rather, a little bit of logical and mathematical sleight of hand.

On pages 50 and 51, Mayo is discussing the likelihood principle (LP). She writes:

… Royall, who obeys the LP, speaks of "the irrelevance of the sample space" once the data are in hand. It’s not so obvious what’s meant. To explain, consider Jay Kadane: "Significance testing violates the Likelihood Principle, which states that, having observed the data, inference must rely only on what happened, and not on what might have happened but did not" (Kadane 2011, p. 439). According to Kadane, the probability statement \Pr(|d(\mathbf{X})|\geq 1.96)=0.05 "is a statement about d(\mathbf{X}) before it is observed. After it is observed, the event {d(\mathbf{X})>1.96} either happened or did not happen and hence has probability either one or zero" (ibid.).

Knowing d(\mathbf{X})=1.96, Kadane is saying there’s no more uncertainty about it. But would he really give it probability 1? That’s generally thought to invite the problem of "known (or old) evidence" made famous by Clark Glymour (1980). If the probability of the data \mathbf{x} is 1, Glymour argues, then \Pr(\mathbf{x}|H) also is 1, but the \Pr(H|\mathbf{x})=\Pr(H)\Pr(\mathbf{x}|H)/\Pr(\mathbf{x})=\Pr(H), so there is no boost in probability given \mathbf{x}. So does that mean known data don’t supply evidence? Surely not. …

From a quick glance at the linked Glymour chapter (specifically, p. 86), it’s not clear to me that this is actually an example of his "old evidence" problem. Glymour seems to be concerned with whether or not information that is known before a theory is developed can count as evidence for that theory under a Bayesian confirmation model of evidence (and, for what it’s worth, while looking for a relevant Glymour link, I found plenty of other people who are concerned with addressing his "old evidence" argument, e.g., van Fraassen, Howson, and whoever wrote this). By way of contrast, Mayo seems to conflate the temporal relationship between putative evidence and theory development, on the one hand, and the presence or absence of randomness in an observed test statistic, on the other.

Mayo also conflates the event {d(\mathbf{X})>1.96} and the likelihood \Pr(\mathbf{x}|H). She steps quickly from the former to d(\mathbf{X})=1.96 to "the probability of the data \mathbf{x}" to the likelihood \Pr(\mathbf{x}|H). (This is what feels like sleight of hand to me.) But even if we accept Kadane’s argument that {d(\mathbf{X})>1.96} has probability zero or one once we’ve observed the data, this doesn’t imply that the likelihood takes the same value. The probability of the event that a test statistic is more extreme than some criterion is not equivalent to the likelihood of the data conditional on some hypothesis, a point that is not lost on Mayo elsewhere, even earlier in this same chapter.

To be clear, I like Mayo’s (and Popper’s, and probably various others’) notion of severity, the basic idea of which is that the result of a test doesn’t provide corroboration of a hypothesis unless it would have (probably) provided refutation if the hypothesis were false. I recently read Theory and Reality, in which similar notions are discussed, and I am convinced that it’s a crucial component of science in general and statistical analysis in particular. I’m still working out my thoughts on how this fits with my (mostly skeptical) attitude towards a Bayesian confirmation model of evidence and the likelihoodist school of statistical inference.

But regardless of where I end up with respect to all this, it does no one any good to (deliberately or accidentally) marshal such clearly flawed arguments.

Affirming the consequent is (still) a fallacy

I find it fascinating when an idea seems to be “in the air,” when I run into repeated instances of a concept for no obvious reason. Not too long ago, in a relatively short span of time, I encountered three cases that illustrate the importance of remembering that affirming the consequent is (still) a logical fallacy.

You can follow that link, of course, but the gist of it is that the following argument is not valid:

If P, then Q

Q

Therefore, P

In words: If P implies Q and you observe Q, you cannot conclude that P holds. Antecedents other than P may also imply the consequent Q.

Of course, P is a sufficient condition for Q, and Q is a necessary condition for P, so both of the following are valid:

modus ponens

If P, then Q

P

Therefore, Q

modus tollens

If P, then Q

not Q

Therefore, not P

The wikipedia article linked above gives some less abstract examples.
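For the programmatically inclined, a brute-force truth table makes the asymmetry concrete (this sketch is mine, not from the linked article):

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def valid(premises, conclusion):
    # An argument form is valid iff no assignment of truth values
    # makes every premise true while the conclusion is false.
    return all(conclusion(p, q)
               for p, q in product([True, False], repeat=2)
               if all(prem(p, q) for prem in premises))

modus_ponens = valid([lambda p, q: implies(p, q), lambda p, q: p],
                     lambda p, q: q)
modus_tollens = valid([lambda p, q: implies(p, q), lambda p, q: not q],
                      lambda p, q: not p)
affirm_consequent = valid([lambda p, q: implies(p, q), lambda p, q: q],
                          lambda p, q: p)

print(modus_ponens, modus_tollens, affirm_consequent)  # True True False
```

The counterexample the loop finds for affirming the consequent is P false, Q true: both premises hold, but the conclusion fails.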

Okay, so one encounter I had with affirming the consequent was in a Twitter argument about p values, another was in a blog post about scientific realism, and the third was in one of the funnier academic papers I’ve ever read.

Argument about p values

In honor of Ronald Fisher’s birthday, Deborah Mayo tweeted “Ban bad science, not statistical significance tests. R.A. Fisher: 17 February 1890–29 July 1962”, along with an image containing this quote from Fisher: “No isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon.”

I responded “Fisher’s position seems to imply that stat[istical] sig[nificance] is irrelevant for any individual result. A p value can provide graded information about a discrepancy, but there’s no reason to categorize as stat[istically] sig[nificant] or not. A reader can decide if a p is small enough for his/her purposes, right?”

To which Mayo responded “To spoze it’s irrelevant is to say any necessary but insufficient piece of info is irrelevant. A genuine effect in sci[ence] differs from an indication worth following up. Evidence of an exper[imental] phen[omenon] for Fisher requires being able to trigger results that rarely fail to reach stat[istical] sig[nificance].”

You can follow the links to see a couple more responses, but this last one is my focus here.

I don’t remember seeing it stated so plainly before that a small p-value is a necessary, but not sufficient, condition of there being a true effect. In case you need convincing, here are a couple shiny apps you can fiddle with to illustrate this fact: one, two.

This seems very obvious to me in retrospect, but I’m not sure I really appreciated the implications of this until recently. Specifically, the logic of modus tollens and modus ponens, and the illogic of affirming the consequent, tell us that even if a true effect (probabilistically) implies a small p value, we can’t conclude (only) from a small p value that a true effect exists. Small p values can occur for reasons other than the presence of a true effect (see, e.g., this paper for some discussion of this).
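To see for yourself that small p values arise even with no true effect, you can simulate data under the null; the nominal 5% of tests come out "significant" anyway (a quick sketch of mine, not one of the linked apps):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps, n = 5000, 20

# every sample is drawn from N(0, 1), so the null is exactly true
pvals = np.array([stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
                  for _ in range(reps)])

print((pvals < 0.05).mean())  # close to 0.05 despite there being no effect
```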

On the other hand, this suggests that we can conclude from the absence of a small p value (i.e., not Q) that there is no true effect (i.e., not P). This is just (probabilistic) modus tollens. This in turn suggests that null hypothesis significance testing (NHST) gets everything exactly backwards (and that maybe McShane and Gal (2015, 2017) should be praising their subjects’ grasp of logic rather than deriding their misunderstanding of statistical inference).

In my argument with Mayo on Twitter, I was focused on the arbitrariness of the criterion used to dichotomize statistical test results and the idea that statistical significance is inherently meta-analytic.

But her comment about the “necessary but insufficient” relationship between small p values and true effects reminded me that it’s important to remember not to affirm the consequent. If NHST can justify rejecting the null (e.g., via severity; see section 1.3 of this pdf), it cannot do so because the presence of a true effect implies a small p value.

Scientific realism

In a very interesting blog post about scientific realism and statistical methods, Daniel Lakens writes:

Scientific realism is the idea that successful scientific theories that have made novel predictions give us a good reason to believe these theories make statements about the world that are at least partially true… [and that scientific realists] think that corroborating novel and risky predictions makes it reasonable to believe that a theory has some ‘truth-likeness’, or verisimilitude. The concept of verisimilitude is based on the intuition that a theory is closer to a true statement when the theory allows us to make more true predictions, and less false predictions.

Putting aside the argument that theories cannot be tested in isolation (i.e., the Duhem-Quine thesis), I think it is pretty reasonable, at least to a first approximation, to argue that a true theory will make accurate predictions about observable phenomena. But the history of science is full of examples of theories that (we have decided) are not true but that nonetheless made at least some accurate predictions.

This is a key component of Laudan’s pessimistic meta-induction, which Lakens summarizes as “If theories that were deemed successful in the past turn out to be false, then we can reasonably expect all our current successful theories to be false as well.”

Lakens doesn’t explain why he disagrees with Laudan here, and it seems to me that he (Lakens) doesn’t fully appreciate that inferring truth or truthlikeness from accurate predictions is an(other) instance of affirming the consequent. As long as false theories can, on occasion, make accurate predictions, accurate predictions cannot license inferences of verisimilitude.

(Since I’m already attacking scientific realism, I may as well point out that it’s never been clear to me how truthlikeness is acceptable to realists. If a theory is only partially true, then it’s also partially untrue. But then which aspects of a theory do we require to be truthlike? And which aspects can we accept as not truthlike? And, even if, per Lakens, “only realism can explain the success of science,” it’s not at all clear that realism can explain how or why the verisimilitude of theories can steadily increase, on the one hand, while theory change can consist of (partially) mutually incompatible theories, on the other. If a later theory comes to replace an earlier theory, and if the later theory rules out some properties or mechanisms of the earlier theory, in what way can the earlier theory be said to have been truthlike? I think these are substantial, perhaps insurmountable, difficulties for the scientific realist.)

Survey validation

Maul (2017; pdf) is one of the funniest academic papers I’ve read. Granted, this is a low bar to clear, and the paper has non-humorous implications, but it made me laugh.

In the first study, Maul adapted a survey designed to measure “growth mindset”, the belief that intelligence is malleable rather than fixed. The adaptation consisted of substituting the nonsense word gavagai for “intelligence” or “intelligence level” in each survey item. In the second study, the survey items were replaced in their entirety with lorem ipsum text (i.e., nonsense). In the third study, the items consisted of nothing at all – just an item number and a six-point Likert scale ranging from “strongly disagree” to “strongly agree” (this is the bit that made me laugh).

In each study, Maul calculated reliability statistics, did exploratory factor analysis (ostensibly to test the dimensionality of the constructs measured by the surveys), and calculated correlations between the survey items and (ostensibly) theoretically-related outcomes. That is, Maul carried out standard procedures for estimating survey reliability and validity.

Maul found acceptable reliability, acceptable single-factor solutions, and in the first two studies, statistically significant correlations with various measures of “Big Five” personality traits. He goes into some detail in discussing the implications of all this, and I think the paper is well worth reading.

But if you’ve made it this far, then you won’t be surprised to find out that my purpose in bringing it up here is to point out that it’s affirming the consequent to infer from acceptable reliability statistics, factor analysis solutions, and/or other-outcome correlations that a survey is valid.

If you have a valid survey, then you should get acceptable reliability and validity statistics. But Maul’s results illustrate very clearly that the converse is not true. A valid survey is not the only way to get such acceptable statistics.


I teach undergraduate and graduate versions of a research methods course, and in an early lecture I emphasize that research is systematic. A key component of this systematicity is basic logic. It’s not always easy, but it is always important, to remember the lessons of basic logic. Key among these lessons is the fact that affirming the consequent is now, and will forever be, a fallacy.

Statistical Significance & Frequentist Probability

Deborah Mayo and Richard Morey recently posted a very interesting criticism of the diagnostic screening model of statistical testing (e.g., this, this). Mayo & Morey argue that this approach to criticism of null hypothesis significance testing (NHST) relies on a misguided hybrid of frequentist and Bayesian reasoning. The paper is worth reading in its entirety, but in this post I will focus narrowly on a non-central point that they make.

At the beginning of the section Some Well-Known Fallacies (p. 6), Mayo & Morey write:

From the start, Fisher warned that to use p-values to legitimately indicate incompatibility (between data and a model), we need more than a single isolated low p-value: we must demonstrate an experimental phenomenon.

They then quote Fisher:

[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

It’s worth noting that in the Fisherian approach to NHST, p values provide a continuous measure of discrepancy from a null model (see also, e.g., the beginning of the second section in this commentary by Andrew Gelman). If a continuous measure of discrepancy is dichotomized (e.g., into the categories "statistically significant" and "not statistically significant"), the criterion for dichotomization is typically arbitrary; common p value thresholds like 0.05 and 0.01 are not generally given any kind of rationale, though there is a case for explicit justification of such thresholds.

(As a brief aside, in the Neyman-Pearson approach to NHST, dichotomization of test results is baked in from the start, and p values are not treated as continuous measures of discrepancy between model and data. A priori specification of α sets a hard, though still typically arbitrary, threshold. Here is a tutorial with detailed discussion of the differences between the Fisher and Neyman-Pearson approaches to NHST.)

The point here being that the Fisher quote above could (and probably should) be revised to say that we can describe a phenomenon as experimentally demonstrable when we know how to conduct an experiment that will rarely fail to produce a large discrepancy.

Clearly, in determining what counts as "large", we will run into some of the same problems that we run into in determining cutoffs for statistical significance. But focusing on discrepancies in the space in which our measurements are taken will force us to focus on "clinical significance" rather than "statistical significance". This will make it much easier to argue for or against any particular criterion (assuming you are okay with criteria for "large" vs "not large") or to take costs and benefits directly into account and use statistical decision theory (if you are not okay with such criteria).

To be as clear as I can be, I’m not in favor of "banning" p values (or the statistics that would allow a reader to calculate an otherwise unreported p value, e.g., means and standard errors). If you are concerned with error control, the information provided by p values is important. But the interested reader can decide for him- or herself how to balance false alarms and misses. There is no need for the researcher reporting a study to declare a result "statistically significant" or not.

I wholeheartedly agree with Lakens et al when they write:

[W]e recommend that the label "statistically significant" should no longer be used. Instead, researchers should provide more meaningful interpretations of the theoretical or practical relevance of their results.

In reading and thinking (and writing, e.g., see here, here, here, and here) about statistical significance lately, I feel like some important implications of frequentist statistics have really clicked for me.

A while back, it occurred to me that, at least under certain interpretations of confidence intervals (CIs), it doesn’t make much sense to actually report CI limits. The coverage of CIs is a property of the procedure of constructing CIs, but any particular set of observed CI limits do not tell you anything either probabilistic or useful about the location of a "true" parameter value (scare quotes because it’s not at all obvious to me that the notion of a true parameter value is of much use).

(Under a test-inversion interpretation of CIs, reporting a particular set of CI limits can be useful, since this indicates the range of parameter values that you cannot reject, given a particular test and statistical significance criterion. But, then again, the test-inversion interpretation is not without its own serious problems.)

Anyway, I bring all this up here at the end just to point out an important parallel between CIs and statistical significance. By the logic of frequentist probability – probabilities just are long-run relative frequencies of subsets of events – both CIs and statistical significance are only meaningful (and are only meaningfully probabilistic) across repeated replications. Given this, I am more and more convinced that individual research reports should focus on estimation (and quantification of uncertainty) rather than testing (and uncertainty laundering).
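A quick simulation (my own sketch) illustrates that point about coverage: any single interval either contains the fixed "true" mean or it doesn't, and only the relative frequency across repeated replications is close to 95%.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, z = 50, 10_000, 1.96
mu, sigma = 5.0, 2.0

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    se = x.std(ddof=1) / np.sqrt(n)
    # this particular interval either captures mu or it doesn't;
    # 0.95 describes the procedure, not this interval
    covered += (x.mean() - z * se <= mu <= x.mean() + z * se)

print(covered / reps)  # close to 0.95
```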

Strangely, and quite possibly incoherently, I think I may have convinced myself that frequentist statistical testing is inherently meta-analytic.

More new notebooks

I’ve been posting Jupyter notebooks for a programming course I’m teaching this semester, and I just posted a second signal detection theory notebook a few minutes ago.

Links to html versions of all of my notebooks are here. The actual notebooks for the programming course are here, and the signal detection theory notebooks are here.

A new notebook

I made a Jupyter notebook about signal detection theory and posted it on Github. You can view it as a webpage by clicking here, or you can view and/or download it on Github here. It is the first of a planned unknown-number of notebooks on signal detection theory. Any and all feedback is welcome.

Significant Feelings

This post is a response, of sorts, to the most recent episode of The Black Goat podcast. It is “of sorts” because it’s not just a response, but this post was definitely inspired by the episode. Anyway, the point here being that I stole the title of the episode for this post.

First things first – it’s a good episode, well worth a listen. I listened to (most of) it twice, in fact, taking notes the second time, since I wanted to make sure I was remembering things correctly so that I didn’t misrepresent what they (Srivastava, Tullett, and Vazire; henceforth ST&V) talk about in my discussion here.

There are two main parts: a discussion of a listener question about whether to trust open-science advocates’ pre-rep*-crisis work more than that of non-open-science advocates, and a longer discussion of the hosts’ history with p-values and the various discussions about p-values going on in behavioral and medical sciences (and probably elsewhere). My focus here will be on the latter (which is why the second listen was only to most of the episode – from the 26 minute mark, for what it’s worth).

The p-value discussion starts with each host giving a brief overview of his/her “history with p-values” in classes and in the lab. I’ll do the same here.

A Brief Overview of My History With P Values

The first class I took that was directly relevant to statistical analysis had relatively little statistical analysis in it. It was a second language testing class at the University of Hawai`i. It covered some descriptive stats, but the course focused much more on validity than a standard stats class would. I’m sure my memories aren’t totally accurate, but I think of it now as a course on measurement more than anything else. We probably discussed p values at some point, but statistical testing was not the primary focus of the class.

I took some more standard stats classes from a guy in Educational Psychology at UH, too (he sounded exactly like Donald Sutherland, and kinda-sorta looked like him, which was great). These were more standard, but, I think, not completely like a run of the mill psych stats class. This is maybe illustrated by the fact that the textbook for one course was Measurement, Design, and Analysis: An Integrated Approach. I still own and use it, 16 years later. I’m quite sure p values and NHST came up a lot in the classes I took from Professor Sutherland, but they were secondary to how design and analysis are linked.

At Indiana University, where I went after UH to pursue a PhD in Linguistics and Cognitive Science, I took the two stats courses required for Psychology students. The first was taught by John Kruschke. He was always very careful to define p-values correctly, so this got drilled into my head, early and thoroughly. Also, his class was structured exceptionally well, so the relationships between different statistical tests and the general logic of NHST have stuck with me.

I also took a seminar in Bayesian statistics from Kruschke, the first time he taught a course on this topic. He used the chalkboard and didn’t have prepared slides, which was very different from the other classes I had taken from him. I asked him why at one point, and he said that it was a one-off course for him, so it wasn’t worth the time and energy to prepare elaborate materials he wouldn’t use again. Later on, I was Kruschke’s grader for what became a two-semester sequence on Bayesian stats as he wrote the first edition of his textbook on Bayesian stats. Now he’s on the second edition. Predicting the future is hard. Also, problems with NHST are covered in his book and lectures, so p values were at least a small part of my initial training with Bayesian stats, too.

In addition to those courses, I took a lot of mathematical psychology and cognitive modeling courses at IU, some of which touched on various aspects of statistical testing. A fair amount of this training had nothing to do with p values, though there was a lot of probability theory.

After I was done with my PhD coursework, I also took a couple semesters of classes to fulfill the requirements for the then newly-established MS in Applied Statistics (the rest of the requirements for this degree were fulfilled by the earlier stats and math psych classes I had taken). These post-PhD-coursework courses deepened (and filled some gaps in) my knowledge, and I learned how to do statistical consulting effectively (from the soon-to-be president elect of the ASA). Perhaps not surprisingly, p values came up a lot in some of these classes (e.g., the two-semester sequence on linear models), but less so in others (another course on Bayesian stats, and the statistical theory course, which focused a lot more on things like the central limit theorem, consistency, bias, maximum likelihood estimation, and various other mathematical underpinnings of statistics).

In the research I was doing in the lab(s) I worked in (the Mathematical Psychology Lab, though I ran experiments in the Auditory Perception Lab on the other side of campus; I also did some work in the Linguistic Speech Lab), p values played a relatively small role. Early on, I remember using p values in fairly standard ways (e.g., to pick out what was worth talking about; see my first paper, for example). As I learned more about math psych and Bayesian stats, p values became more irrelevant, though they haven’t disappeared from my research completely.

Finally, I got valuable experience with new (to me) statistical tools in my first job after graduating. I participated in workshops on diagnostic measurement and structural equation modeling at the University of Maryland, and the work I was doing was essentially I/O psychology (as confirmed by an actual I/O psychologist). Plenty of p values here, many of them, in retrospect, likely misused, though not misinterpreted.

Okay, so, well, that was maybe less brief than I envisioned, but for the purposes of discussing my significant feelings, I think it’s important to explain where I’m coming from.

The Discussion

ST&V touch on a wide range of topics in their discussion of p values. As mentioned above, the whole thing is worth listening to, but I’m not going to talk about everything they get into.

Their discussion spurred me to think some more about

  1. how evaluation of statistical analyses is related to evaluation of design and measurement (and how the former is subordinate to the latter);
  2. how individuals evaluate evidence and how this relates to group-level evaluation of evidence;
  3. how we think about (or struggle to think about) continuous vs discrete reports, and how these do or do not relate to how we (struggle to) think about probability;
  4. and how all of this relates (or does not relate) to Bayesian and frequentist statistical philosophies.

Around the 42-minute mark, ST&V start talking about the paper Redefine Statistical Significance (RSS), which Vazire is a co-author of (here is my very negative take on RSS). This leads to a more general discussion about statistical cutoffs and dichotomous vs graded thinking. At around the 53-minute mark, Tullett* brings up differences between evaluation of study designs and evaluation of statistical results, positing that it’s easier to give nuanced evaluation of the former than the latter.

(* I am reasonably confident that it is Tullett talking here and not Vazire. Their voices are different, but I don’t know either voice well enough to be able to map the voices to the names consistently. That is, I would do very well with a discrimination task, but probably substantially less well with a recognition task.)

(1) This is fine as far as it goes – design is, along with measurement, complex and multifaceted. But I want to push this distinction further. It occurred to me a while ago that it is problematic to discuss isolated statistics as “evidence.” I think ST&V mention “statistical evidence” a few times in this discussion, the word “evidence” is used a lot in RSS and the response papers (see links at the top of my RSS post), and I’m pretty sure I’ve seen Bayesians refer to the denominator in Bayes’ rule as “evidence.”

But evidence has to be of something. Without a larger context – a research question, a design, a measurement tool, and a statistical model, at least – a p value is just (tautological) “evidence” of a range of test statistic values being more or less probable. It’s just – per the definition – a statement about the probability of observing a test statistic as or more extreme than the one you’ve observed, under the assumption that the null hypothesis is true. Which is to say that, in isolation, a p value doesn’t tell us anything at all. Similarly for Bayes factors, sets of *IC fit statistics, or whatever other statistical summary you like.
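To make the tautology concrete, here’s a minimal sketch (my illustration, not anything from the podcast or the papers) of a p value as nothing more than a tail probability under an assumed null model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 30, 100_000

# Null model: two groups of n drawn from identical N(0, 1) populations.
null_diffs = (rng.normal(size=(n_sims, n)).mean(axis=1)
              - rng.normal(size=(n_sims, n)).mean(axis=1))

observed_diff = 0.5  # a hypothetical observed difference in group means

# The p value: the probability, under the null model, of a difference at
# least as extreme as the one observed. By itself it says nothing about
# design, measurement, or the research question.
p = np.mean(np.abs(null_diffs) >= abs(observed_diff))
```

Everything in that last line is conditional on the null model being the right reference point, which is exactly the larger context a bare p value lacks.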

(2) So, when it comes to evaluation of evidence, an individual must take research questions, design, measurement, and statistics into account. And, as I mentioned in my post on RSS, individuals have now, and always have had, the ability to evaluate evidence as stringently or as laxly as they believe is appropriate for any given situation.

Of course, things get complicated if an individual wants to persuade other individuals about a scientific issue. Shared standards undoubtedly help individuals achieve scientific consensus, but the complications of scientific disagreement and agreement can’t be solved by fixed, arbitrary statistical thresholds. Such thresholds might be a necessary condition of scientific consensus, though I’m skeptical, and they definitely aren’t sufficient.

(3) All that said, when an individual evaluates a scientific report, there can be value in discrete categories. I like the discussion (around the 54-minute mark) of thinking about results in terms of whether you buy them or not and how much you would bet on a replication being successful. I also find the argument for the utility of a three-category fire safety system vs a continuous numerical system compelling.

But it’s worth keeping in mind that, even if a discrete system can be better for certain kinds of decisions, a continuous system can be better for other kinds of decisions. “Buy it or not” might be the best way to evaluate a study for some purposes, but from a meta-analytic mind-set, dichotomizing evidence is just throwing information away.

Tangentially related, I agree wholeheartedly with Vazire* that people are bad at probabilistic thinking. Even with all the statistical training and research experience described above, I often have to exert substantial cognitive effort to be careful with probabilistic thinking. That said, I think it’s worth drawing a distinction between, on the one hand, categorical vs continuous thinking, and, on the other, deterministic vs probabilistic thinking.

Obviously, probability plays a role in all this, and the part of the discussion about intuitions about p value distributions under null and alternative hypotheses (~38 minutes) is both interesting and important. But the relative value of dichotomization (or trichotomization) vs indefinitely graded descriptions of statistical analyses is not (only) about probability. It’s also about (mis)interpretation of basic, non-random arithmetical facts (see, e.g., McShane & Gal, 2016, 2017).

(4) Finally, I agree exuberantly with the point that Vazire* makes late in the podcast that these issues are orthogonal to debates about Bayesian and frequentist philosophies of statistics. Srivastava is absolutely right that categorizing Bayes factors as “anecdotal” or “strong evidence” is just as problematic (or not) as categorizing p values as statistically significant, (suggestive,) or not statistically significant. Or maybe I should say that he should have gone further than saying that this kind of categorization “starts to feel a little threshold-y.” It is 100% threshold-y, and so it has all the problems that fixed, arbitrary thresholds have.

If you insist on using thresholds, you should justify your Bayes factor thresholds just as surely as you should Justify Your Alpha. I’m more and more convinced lately that we should be avoiding thresholds (see the McShane & Gal papers linked above, see also the papers linked in the post on the meta-analytic mind-set), but I agree with the larger point of the Justify Your Alpha paper that research decisions (statistical and non) should be described as fully and as transparently as possible.

One last point about metric vs nominal reports and Bayesian vs frequentist statistics: although the former is logically unrelated to the latter, in my experience, Bayesian statistics involves a lot more modeling (model building, parameter estimation, and data and model visualization) than does frequentist statistical testing. Obviously, you can do frequentist model building, parameter estimation, and visualization, but standard psych statistical training typically focuses on (null hypothesis significance) testing, which is inherently, inextricably tied up with categorical reports (e.g., dichotomization). The point being that, while I don’t think there are any deep philosophical reasons for Bayesians to be friendlier to continuous, graded reports than frequentists are, there are contingent historical reasons for things to play out this way.


Significant Feelings is a very good episode of The Black Goat podcast, but arbitrary thresholds are bad.


Unworkable and empty

Benjamin et (many) al recently proposed that the p-value threshold for declaring a (new) result “statistically significant” should be divided by 10, reducing it from 0.05 to 0.005. Lakens et (many) al responded by arguing that “researchers should transparently report and justify all choices they make when designing a study, including the alpha level.” Amrhein and Greenland, on the one hand, and McShane et al, on the other, responded with suggestions that we simply abandon statistical significance entirely (McShane et al pdf; blog post). Trafimow et (many) al also argue against Benjamin et al’s proposal and the scientific utility of p-values in general. A statistician named Crane recently wrote a narrower, more technical criticism of Benjamin et al, arguing that p-hacking (broadly construed) calls the substantive claims of Redefine Statistical Significance (RSS) into question.

The fifth author of RSS (Wagenmakers) and one of his PhD students (Gronau) recently posted an exceptionally disingenuous response to Crane’s paper. It’s exceptionally disingenuous for two reasons. First, Wagenmakers and Gronau just ignore Crane’s argument, defending a component of RSS that Crane isn’t arguing against. Second, the proposal in RSS – to shrink the set of p-values that earn (new) effects the label “statistically significant” from p < 0.05 to p < 0.005 – is explicitly non-Bayesian, even if it relies on some Bayesian reasoning, but most of Wagenmakers and Gronau’s post consists of a fanciful metaphor in which Crane is directly attacking Bayesian statistics. The non-Bayesian nature of RSS is made clear in the fourth paragraph, which begins with “We also restrict our recommendation to studies that conduct null hypothesis significance tests.” To Wagenmakers and Gronau’s credit, they published Crane’s response at the end of the post.

So, why am I chiming in now? To point out that the original RSS proposal is unworkable as stated and, ultimately, essentially free of substantive content. I think Crane makes a pretty compelling case that, even working within the general framework that RSS seems to assume, the proposal won’t do what Benjamin et al claim it will do (e.g., reduce false positive rates and increase reproducibility by factors of two or more). But I don’t even think you need to dig into the technicalities the way Crane does to argue against RSS.

To be clear, I think Benjamin et al are correct to point out that a p-value of just less than 0.05 is “evidentially weak” (as Wagenmakers and Gronau describe it in the Bayesian Spectacles post). Be that as it may, the allegedly substantive proposal to redefine statistical significance is all but meaningless.

Benjamin et al “restrict our recommendation to claims of discovery of new effects,” but they do not even begin to address what would or wouldn’t count as a new effect. Everyone agrees that exact replications are impossible. Even the most faithful replication of a psychology experiment will have, at the very least, a new sample of subjects. And, of course, even if you could get the original subjects to participate again, the experience of having participated once (along with everything else that has happened to them since participating) will have changed them, if only minimally. As it happens, psychology replications tend to differ in all sorts of other ways, too, often being carried out in different locations, with newly developed materials and changes to experimental protocols.

As you change bits and pieces of experiments, eventually you shift from doing a direct replication to doing a conceptual replication (for more on this distinction, see here, among many other places). It seems pretty clear to me that there’s no bright line distinction between direct and conceptual replications.

How does this bear on the RSS proposal? I think you could make a pretty compelling case that conceptual replications should count as “new” effects. I strongly suspect that Benjamin et al would disagree, but I don’t know for sure, because, again, they haven’t laid out any criteria for what should count as new. Without doing so, the proposal cannot be implemented.

But it’s not clear to me that it’s worth doing this (undoubtedly difficult) work. Here’s a sentence from the second paragraph and a longer chunk from the second-to-last paragraph in RSS (emphasis mine):

Results that would currently be called significant but do not meet the new threshold should instead be called suggestive….

For research communities that continue to rely on null hypothesis significance testing, reducing the P value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility. We emphasize that this proposal is about standards of evidence, not standards for policy action nor standards for publication. Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods. This proposal should not be used to reject publications of novel findings with 0.005 < P < 0.05 properly labelled as suggestive evidence.

The proposal is explicitly not about policy action or publication standards. It is all and only about labels applied to statistical test results.

Researchers are now, and have always been, well within their rights and abilities not to consider p \approx 0.05 strong evidence of an effect. Anyone interested in (directly or conceptually) replicating an interesting original finding is free to do so only if the original finding meets their preferred standard of evidence, however stringent. (I note in passing that hardcore frequentists are exceedingly unlikely to be moved by their Bayesian argument for the evidential weakness of p \approx 0.05.)

To the extent that Benjamin et al are arguing that using more stringent standards of evidence is also researchers’ responsibility, I agree. But Benjamin et al are manifestly not just arguing that p \approx 0.05 is weak evidence. They are arguing that reserving the label “statistically significant” for p \leq 0.005 (and the label “suggestive” for 0.005 < p < 0.05) will improve reproducibility and reduce false alarms.

The substance of the proposal, such as it is, is concerned entirely with changing how we use semi-esoteric statistical jargon to label different sets of test statistics.

I agree with Benjamin et al that important research questions addressed with rigorous methods can merit publication. In fact, I would go even further and argue that publication decisions should be based entirely on how interesting the research questions are and how rigorous the methods used to answer the questions are. This is at the heart of the meta-analytic mind-set I discussed in an earlier post.

Truth, utility, and null hypotheses

Discussing criticisms of null hypothesis significance testing (NHST), McShane & Gal write (emphasis mine):

More broadly, statisticians have long been critical of the various forms of dichotomization intrinsic to the NHST paradigm such as the dichotomy of the null hypothesis versus the alternative hypothesis and the dichotomization of results into the different categories statistically significant and not statistically significant. … More specifically, the sharp point null hypothesis of \theta = 0 used in the overwhelming majority of applications has long been criticized as always false — if not in theory at least in practice (Berkson 1938; Edwards, Lindman, and Savage 1963; Bakan 1966; Tukey 1991; Cohen 1994; Briggs 2016); in particular, even were an effect truly zero, experimental realities dictate that the effect would generally not be exactly zero in any study designed to test it.

I really like this paper (and a similar one from a couple years ago), but this kind of reasoning has become one of my biggest statistical pet peeves. The fact that “experimental realities” generally produce non-zero effects in statistics calculated from samples is one of the primary motivations for the development of NHST in the first place. It is why, for example, the null hypothesis is typically – read always – expressed as a distribution of possible test statistics under the assumption of zero effect. The whole point is to evaluate how consistent an observed statistic is with a zero-effect probability model.
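A quick simulation (mine, purely for illustration) makes the point: even when the true effect is exactly zero, sample t statistics form a whole distribution, and that distribution is what the “null hypothesis” actually refers to in practice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims = 20, 50_000

# The true effect is exactly zero: every sample comes from N(0, 1).
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# The simulated statistics are almost never exactly zero; they follow
# the t distribution with n - 1 degrees of freedom, which is precisely
# the "distribution of possible test statistics" NHST compares against.
print(t_stats.std(), stats.t(n - 1).std())  # these should nearly match
```

So “experimental realities produce non-zero sample effects” isn’t an objection NHST overlooks; it’s the premise the reference distribution is built on.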

Okay, actually, that’s not true. The point of statistical testing is to evaluate how consistent an observed statistic is with a probability model of interest. And this gets at the more important matter. I agree with McShane & Gal (and, I imagine, at least some of the people they cite) that the standard zero-effect null is probably not true in many cases, particularly in social and behavioral science.

The problem is not that this model is often false. A false model can be useful. (Insert Box’s famous quote about this here.) The problem is that the standard zero-effect model is very often not interesting or useful.

Assuming zero effect makes it (relatively) easy to derive a large number of probability distributions for various test statistics. Also, there are typically an infinite number of alternative, non-zero hypotheses. So, fine, zero-effect null hypotheses provide non-trivial convenience and generality.

But this doesn’t make them scientifically interesting. And if they’re not scientifically interesting, it’s not clear that they’re scientifically useful.

In principle, we could use quantitative models of social, behavioral, and/or health-related phenomena as interesting and useful (though almost certainly still false) “null” models against which to test data or, per my preferences, to estimate quantitative parameters of interest. Of course, it’s (very) hard work to develop such models, and many academic incentives push pretty hard against the kind of slow, thorough work that would be required in order to do so.

Puzzle Pieces

I don’t remember when or where I first encountered the distinction between (statistical) estimation and testing. I do remember being unsure exactly how to think about it. I also remember feeling like it was probably important.

Of course, in order to carry out many statistical tests, you have to estimate various quantities. For example, in order to do a t-test, you have to estimate group means and some version of pooled variation. Estimation is necessary for this kind of testing.

Also, at least some estimated quantities provide all of the information that statistical tests provide, plus some additional information. For example, a confidence interval around an estimated mean tells you everything that a single-sample t-test would tell you about that mean, plus a little more. For a given \alpha value and sample size, both tell you whether the estimated mean is statistically significantly different from a reference mean \mu_0, but the CI gives you more: it defines the set of all reference means you would fail to reject (i.e., the interval itself) and the complementary set of means you would reject (i.e., everything outside it).
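As a concrete check (a sketch with simulated data, not from any of the papers discussed here), the CI and the test always agree when they’re built from the same estimates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=0.4, scale=1.0, size=50)  # a simulated sample

alpha, mu0 = 0.05, 0.0

# Single-sample t test against the reference mean mu0.
t, p = stats.ttest_1samp(x, popmean=mu0)

# 95% CI for the mean, built from the same estimates (mean and SE).
lo, hi = stats.t.interval(1 - alpha, df=len(x) - 1,
                          loc=x.mean(), scale=stats.sem(x))

# Equivalence: the test rejects at alpha exactly when mu0 is outside the CI.
assert (p < alpha) == (mu0 < lo or mu0 > hi)
```

The assertion at the end holds for any sample, which is the sense in which the interval subsumes the test.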

The point being that estimation and testing are not mutually exclusive.

I have recently read a few papers that have helped me clarify my thinking about this distinction. One is Francis (2012) The Psychology of Replication and Replication in Psychology. Another is Stanley & Spence (2014) Expectations for Replications: Are Yours Realistic? A third is McShane & Gal (2016) Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence.

One of the key insights from Francis’ paper is nicely illustrated in his Figure 1, which shows how smaller effect size estimates are inflated by the file drawer problem (non-publication of results that are not statistically significant) and data peeking (stopping data collection if a test is statistically significant, otherwise collecting more data, rinse, repeat).

One of the key insights from Stanley & Spence is nicely illustrated in their Figure 4, which shows histograms of possible replication correlations for four different true correlations and a published correlation of 0.30, based on simulations of the effects of sampling and measurement error.
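A rough version of that kind of simulation (my sketch, not Stanley & Spence’s code, and covering sampling error only): with a true correlation of 0.30 and n = 80 per replication, replication estimates scatter widely around the truth.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, n, n_reps = 0.30, 80, 50_000

# Draw n_reps replications of n bivariate-normal pairs correlated at rho.
cov = [[1.0, rho], [rho, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=(n_reps, n))
x, y = xy[..., 0], xy[..., 1]

# Pearson correlation within each replication.
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
r = (xc * yc).sum(axis=1) / np.sqrt(
    (xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))

print(np.percentile(r, [2.5, 97.5]))  # a wide interval around 0.30
```

Adding measurement error, as Stanley & Spence do, spreads the replication estimates even further.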

Finally, a key insight from McShane & Gal is nicely illustrated in their Table 1, which shows how the presence or absence of statistical significance can influence the interpretation of simple numbers.

This one requires a bit more explanation, so: Participants were given a summary of a (fake) cancer study in which people in Group A lived, on average, 8.2 months post-diagnosis and people in Group B lived, on average, 7.5 months. Along with this, a p value was reported, either 0.01 or 0.27. Participants were asked to determine (using much less obvious descriptions) if 8.2 > 7.5 (Option A), if 7.5 > 8.2 (Option B), if 8.2 = 7.5 (Option C), or if the relative magnitudes of 8.2 and 7.5 were impossible to determine (Option D). As their Table 1 shows, the reported p value had a very large effect on which option people chose.

Okay, so how do these fit together? And how do they relate to estimation vs testing?

Francis makes a compelling case against using statistical significance as a publication filter. McShane & Gal make a compelling case that dichotomization of evidence via statistical significance leads people to make absurd judgments about summary statistics. Stanley & Spence make a compelling case that it is very easy to fail to appreciate the importance of variation.

Stanley & Spence’s larger argument is that researchers should adopt a meta-analytic mind-set, in which studies, and statistics within studies, are viewed as data points in possible meta-analyses. Individual studies provide limited information, and they are never the last word on a topic. In order to ease synthesis of information across studies, statistics should be reported in detail. Focusing too much on statistical significance produces overly optimistic sets of results (e.g., Francis’ inflated effect sizes) and absurd blind spots (e.g., McShane & Gal’s survey results), and it gives the analyst more certainty than is warranted (as Andrew Gelman has written about many times, e.g., here).

Testing is concerned with statistical significance (or Bayesian analogs). The meta-analytic mind-set is concerned with estimation. And since estimation often subsumes testing (e.g., CIs and null hypothesis tests), a research consumer who really feels the need for a dichotomous conclusion can often glean just that from non-dichotomous reports.