in Python, Statistics

Multivariate normal CDF values in Python

I was very happy to realize recently that a subset of Alan Genz’s multivariate normal CDF functions are available in Scipy. I first learned of Dr. Genz’s work when I started using the mnormt R package, which includes a function called sadmvn that gives very precise, and very accurate, multivariate normal CDF values very quickly.

In case you don’t know, this is quite an achievement, since there is not a closed form solution. I’ve spent far too much time reading strange, complicated papers found in the deepest recesses of google (i.e., the third page of search results) that claim to provide fast, accurate approximations to multivariate normal CDFs. As far as I can tell, none of these claims hold any water. None other than Genz’s, anyway.

Okay, so Scipy has two relevant functions, but they’re kind of buried, and it might not be obvious how to use them (at least if you don’t know to look at Genz’s Fortran documentation). So, for the benefit of others (and myself, in case I need a refresher), here’s where they are and how to use them.

First, where. In the Scipy stats library, there is a chunk of compiled Fortran code called I’ve copied it here, just in case it disappears from Scipy someday. Should that come to pass, and should you want this file, just save that ‘plain text’ file and rename it and you should be good to go.

Otherwise, if you’ve got Scipy, you can just do this:

from scipy.stats import mvn

Now, mvn will have three methods, two of which – mvndst and mvnun – are what we’re looking for here.

The first works like this:

error,value,inform = mvndst(lower,upper,infin,correl,...)

Which is to say that it takes, as arguments, lower and upper limits of integration, ‘infin’ (about which more shortly), and correl (as well as some optional arguments). This is, in turn, to say that it assumes that your multivariate normal distribution is centered at the origin and that you’ve normalized all the variances.

This function is straightforward to use, except for, perhaps, the ‘infin’ argument. From Genz’s documentation:

*     INFIN  INTEGER, array of integration limits flags:
*           if INFIN(I) < 0, Ith limits are (-infinity, infinity);
*           if INFIN(I) = 0, Ith limits are (-infinity, UPPER(I)];
*           if INFIN(I) = 1, Ith limits are [LOWER(I), infinity);
*           if INFIN(I) = 2, Ith limits are [LOWER(I), UPPER(I)].

Which is to say that you put a negative number in if you want, on dimension I, to integrate from -Inf to Inf, 0 if you want to integrate from -Inf to your designated upper bound, 1 if you want to integrate from your designated lower bound to Inf, and 2 if you want to use both of your designated bounds.

Also from Genz’s documentation:

*     INFORM INTEGER, termination status parameter:
*          if INFORM = 0, normal completion with ERROR < EPS;
*          if INFORM = 1, completion with ERROR > EPS and MAXPTS 
*                         function vaules used; increase MAXPTS to
*                         decrease ERROR;
*          if INFORM = 2, N > 500 or N < 1.

Here, N seems to be the number of dimensions (and is the first argument in Genz’s MVNDST Fortran function, but is not in the similar/corresponding R or Python functions).  In any case, it’s the 0 and 1 that seem most informative, and the MAXPTS variable is one of the optional arguments I mentioned above.

The other function allows for non-zero means and covariance (as opposed to correlation) matrices, but it doesn’t, technically speaking, allow for integration to or from +/-Infinity:

value,inform = mvnun(lower,upper,means,covar,...])

As it happens, and as shouldn’t be too surprising, you can give it large magnitude bounds and get essentially the same answer. As long as you’re sufficiently far away from the mean (meaning as long as you’re more than a few standard deviation units away), the difference between the +/-Inf bound and the finite bound will only show up quite a few decimal places into your answer.

If you’ve got numpy imported as np, you could, for example, do this:

In [54]: low = np.array([-10, -10])

In [55]: upp = np.array([.1, -.2])

In [56]: mu = np.array([-.3, .17])

In [57]: S = np.array([[1.2,.35],[.35,2.1]])

In [58]: p,i = mvn.mvnun(low,upp,mu,S)

In [59]: p
Out[59]: 0.2881578675080012

With more extreme values for low, we get essentially the same answer (with a difference only showing up in the 12th decimal place):

In [60]: low = array([-20, -20])

In [61]: p,i = mvncdf(low,upp,mu,S)

In [62]: p
Out[62]: 0.2881578675091007

Still more extreme values doesn’t change it at all:

In [63]: low = array([-100, -100])

In [64]: p,i = mvncdf(low,upp,mu,S)

In [65]: p
Out[65]: 0.2881578675091007

All of this is important to me because I’m working on building a Bayesian GRT (e.g.) model in PyMC, and I’m hoping I’ll be able to use this function to get fast and accurate probabilities, given a set of mean and covariance parameters.