<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Un garçon pas comme les autres (Bayes)</title>
<link>https://dansblog.netlify.app/</link>
<atom:link href="https://dansblog.netlify.app/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog about statistics, I guess.</description>
<image>
<url>https://dansblog.netlify.app/better.JPG</url>
<title>Un garçon pas comme les autres (Bayes)</title>
<link>https://dansblog.netlify.app/</link>
</image>
<generator>quarto-1.4.553</generator>
<lastBuildDate>Wed, 04 Sep 2024 14:00:00 GMT</lastBuildDate>
<item>
  <title>Random C++ Part 2: Sparse partial inverses in Eigen</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2024-09-05-partial-inverse/partial-inverse.html</link>
  <description><![CDATA[ 





<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Acknowledgements
</div>
</div>
<div class="callout-body-container callout-body">
<p>The code in this post is indebted (and in some cases wholly ripped off from) work by the glorious <a href="https://www.maths.ed.ac.uk/~flindgre/">Finn Lindgren</a>, who emailed me some code to do this probably a decade ago. Yes I am behind on my emails.</p>
<p>Finn’s code can be found <a href="https://github.com/inlabru-org/fmesher/blob/devel/src/qtool.h">here</a> as part of the glorious INLAbru project.</p>
</div>
</div>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Code availability
</div>
</div>
<div class="callout-body-container callout-body">
<p>The code from this post can be found in my <a href="https://github.com/dpsimpson/blog/tree/master/posts/2024-09-05-partial-inverse">github repo</a>.</p>
</div>
</div>
<p>The time has come once more to make a blog post truly untethered from context. This time, I’m going to show you how to compute entries of the inverse of a sparse symmetric positive definite matrix that correspond to the non-zero elements of the original matrix. And I am going to once again pull out that rusty spoon that is my C++ skill to do it.</p>
<section id="a-little-bit-of-motivation" class="level2">
<h2 class="anchored" data-anchor-id="a-little-bit-of-motivation">A little bit of motivation</h2>
<p>Computing certain elements of the inverse of a matrix isn’t necessarily the most useless thing possible. It actually comes up quite a lot in statistical applications. For instance, if you are computing the score function while doing maximum likelihood estimation for a multivariate Gaussian, you’re gonna need those values. Or, less specifically, if you happen to have a multivariate Gaussian <img src="https://latex.codecogs.com/png.latex?N(0,%20Q%5E%7B-1%7D)"> parameterized by its precision (inverse covariance) matrix <img src="https://latex.codecogs.com/png.latex?Q">, and you are interested in the variance of each coordinate, you need the diagonal of <img src="https://latex.codecogs.com/png.latex?Q%5E%7B-1%7D">.</p>
<p>A very real problem with computing <img src="https://latex.codecogs.com/png.latex?Q%5E%7B-1%7D"> is that it is, infamously, quite expensive. The only really practical way to do it is to solve <img src="https://latex.codecogs.com/png.latex?n"> linear systems, where <img src="https://latex.codecogs.com/png.latex?n"> is the number of rows/columns in <img src="https://latex.codecogs.com/png.latex?Q">. When <img src="https://latex.codecogs.com/png.latex?n"> is big, this is going to be a bit of a computational disaster!</p>
<p>Thankfully, there is a convenient set of recursions due to Takahashi, Fagan, and Chen<sup>1</sup> that allow us to compute these elements directly and cheaply from the Cholesky factorization of <img src="https://latex.codecogs.com/png.latex?Q">.</p>
<p>In fact, I have <a href="https://dansblog.netlify.app/posts/2022-05-20-to-catch-a-derivative-first-youve-got-to-think-like-a-derivative/to-catch-a-derivative-first-youve-got-to-think-like-a-derivative#primitive-three-the-dreaded-log-determinant">blogged about this before</a>.</p>
<p>Essentially, we need to implement the following pseudocode.</p>
<pre><code>for i = n-1, ..., 0
  for j = n-1, ..., i
    if (L[j,i] not known to be 0)
      Sigma[j,i] = Sigma[i,j] = (I(i==j)/L[i,i]
        - sum_{k=i+1}^{n-1} L[k,i] Sigma[k,j]) / L[i,i]</code></pre>
<p>This is not going to be terribly complicated, but it does require a bit of C++ plumbing and dealing with the internal Eigen representation of the Cholesky factor. It’s always so fun to read documentation!</p>
</section>
<section id="making-this-work-in-c" class="level2">
<h2 class="anchored" data-anchor-id="making-this-work-in-c">Making this work in C++</h2>
<p>One of the things about working with a library like Eigen is that we really want to use the official API for its functions as much as possible. Even when we itch to use the undocumented internal structure, we should desist: the API is, usually, pretty stable and it is considerably less likely that an Eigen update will materially break our code if we hold them to the promises they actually make rather than the ones we wish they made.</p>
<p>It might look like you need three iterators to build our algorithm, but we actually need four. Because the matrix is stored in column-major order, we are going to need a new iterator for every distinct column index. In this case, that is:</p>
<ol type="1">
<li>A reverse iterator going up column <code>i</code> of <code>Sigma</code></li>
<li>A reverse iterator going up column <code>i</code> of <code>L</code></li>
<li>A reverse iterator going up column <code>j</code> of <code>Sigma</code></li>
<li>A reverse iterator going up column <code>i</code> in sync with iterator 3.</li>
</ol>
<p>The C++ code is pretty straightforward after that: you just need to keep your iterators in sync.</p>
<p>One wrinkle that I forgot about the first time I coded this is that there are a few things that I need to be true: firstly, I need the output to be the lower triangle of a symmetric matrix, and secondly I need that matrix to have the same sparsity pattern as <img src="https://latex.codecogs.com/png.latex?Q">. To do this, I wrote an RAII helper class, mainly because if I’m going to manipulate raw pointers I’m gonna want some safety.</p>
<p>This helper class is a <em>functor</em>, meaning that its objects are callable with similar syntax to functions. Is this strictly necessary? Of course not. But mummy I love him.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource cpp number-lines code-with-copy"><code class="sourceCode cpp"><span id="cb2-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;Eigen/SparseCore&gt;</span></span>
<span id="cb2-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;Eigen/SparseCholesky&gt;</span></span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typedef</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;::</span>StorageIndex StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">template</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> MatchPattern <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-7">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">using</span> T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">base_type</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;::</span>type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-8">    StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-9">    StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-10">    T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-11">    StorageIndex <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-12">    StorageIndex <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-13"></span>
<span id="cb2-14">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">public</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb2-15"></span>
<span id="cb2-16">    MatchPattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-17">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/**</span></span>
<span id="cb2-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  MatchPattern(const SpMat&amp; A, const SpMat&amp; pattern)</span></span>
<span id="cb2-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  Constructs functor class designed to construct a sparse matrix with</span></span>
<span id="cb2-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  the same non-zero pattern as `pattern` and the same non-zero values </span></span>
<span id="cb2-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  as `A`.</span></span>
<span id="cb2-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  </span></span>
<span id="cb2-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  This function assumes that the sparsity pattern of `pattern` is a SUBSET</span></span>
<span id="cb2-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  of the sparsity pattern of `A`. Weird things will happen if this does not</span></span>
<span id="cb2-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  hold.</span></span>
<span id="cb2-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     * </span></span>
<span id="cb2-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  Usage:</span></span>
<span id="cb2-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  ```</span></span>
<span id="cb2-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  typedef Eigen::SparseMatrix</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;double&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> SpMatrixd;</span></span>
<span id="cb2-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  SpMatrixd A_pattern = MatchPattern</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;SpMatrixd&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(A, pattern)();</span></span>
<span id="cb2-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  ```</span></span>
<span id="cb2-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    * */</span></span>
<span id="cb2-33">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb2-34">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>nonZeros<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb2-35"></span>
<span id="cb2-36">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb2-37">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>copy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>outerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(),</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>outerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb2-38">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb2-39">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>copy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>innerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(),</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>innerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-40">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb2-41"></span>
<span id="cb2-42"></span>
<span id="cb2-43">        T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> valptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-44">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-45">            <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>InnerIterator Acol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-46">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>InnerIterator pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-47">                pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-48">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>Acol <span class="op" style="color: #5E5E5E;
background-color: null;
&amp;</span>">
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>Acol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">())){</span></span>
<span id="cb2-49">                        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>Acol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-50">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-51">                    valptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Acol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb2-52">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>Acol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-53">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-54">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-55">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-56"></span>
<span id="cb2-57">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Specialization for rank-1 matrices A = bc^T</span></span>
<span id="cb2-58">    MatchPattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb2-59">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Matrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Dynamic<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&amp;</span> b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> </span>
<span id="cb2-60">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Matrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Dynamic<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> </span>
<span id="cb2-61">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> pattern</span>
<span id="cb2-62">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-63">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/**</span></span>
<span id="cb2-64"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  MatchPattern(typename Eigen::Vector</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;T&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp; b, typename Eigen::Vector</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;T&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp; c, const SpMat&amp; pattern)</span></span>
<span id="cb2-65"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  A specialization of the MatchPattern class where the matrix to be matched </span></span>
<span id="cb2-66"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  is a rank one matrix of the form $A = bc^T$.</span></span>
<span id="cb2-67"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     * </span></span>
<span id="cb2-68"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  Usage:</span></span>
<span id="cb2-69"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  ```</span></span>
<span id="cb2-70"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  typedef Eigen::SparseMatrix</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;double&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> SpMatrixd;</span></span>
<span id="cb2-71"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  SpMatrixd A_pattern = MatchPattern</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;SpMatrixd&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(b, c, pattern)();</span></span>
<span id="cb2-72"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     *  ```</span></span>
<span id="cb2-73"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">     * */</span></span>
<span id="cb2-74">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb2-75">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>nonZeros<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb2-76">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb2-77">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>copy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>outerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(),</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>outerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb2-78">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb2-79">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>copy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>innerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(),</span> pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>innerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-80">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb2-81"></span>
<span id="cb2-82">        T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> valptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-83">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-84">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>InnerIterator pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-85">                pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-86">                    *valptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>coeff<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>pattern_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">())</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>coeff<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-87">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-88">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-89">        </span>
<span id="cb2-90">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-91"></span>
<span id="cb2-92">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>MatchPattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-93">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">delete</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-94">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">delete</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-95">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">delete</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-96">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-97"></span>
<span id="cb2-98">    SpMat <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">operator</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-99">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb2-100">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-101">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-102">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-103">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-104">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-105">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span></span>
<span id="cb2-106">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-107">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-108"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span></span></code></pre></div>
<p>The first thing to note here is that I am, fundamentally, quite lazy. As such I have made the convenient assumption that the target sparsity pattern is always a subset of the sparsity pattern of interest. This is true for the application that I have in mind, but you should probably be careful if you’re adapting this code to anything else.</p>
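<p>That subset assumption is cheap to sanity-check before handing a pattern to the class. The sketch below is my own illustration (not part of the class above): it works directly on compressed sparse column (CSC) outer/inner index arrays, the same arrays Eigen exposes via <code>outerIndexPtr()</code> and <code>innerIndexPtr()</code> on a compressed <code>SparseMatrix</code>.</p>

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not in the post): check that sparsity pattern A is a
// subset of sparsity pattern B. Both patterns are given in CSC form:
// outer[j]..outer[j+1] delimits column j's entries in the inner (row index)
// array. Row indices within a compressed column are sorted, so a binary
// search over B's column j suffices for each entry of A's column j.
bool is_pattern_subset(const std::vector<int>& outerA,
                       const std::vector<int>& innerA,
                       const std::vector<int>& outerB,
                       const std::vector<int>& innerB) {
    const int ncols = static_cast<int>(outerA.size()) - 1;
    for (int j = 0; j < ncols; ++j) {
        for (int k = outerA[j]; k < outerA[j + 1]; ++k) {
            if (!std::binary_search(innerB.begin() + outerB[j],
                                    innerB.begin() + outerB[j + 1],
                                    innerA[k])) {
                return false;  // A has an entry where B has a structural zero
            }
        }
    }
    return true;
}
```

<p>A guard like this at the top of the constructor would turn a silent wrong answer into a loud failure, at a cost that is linear-ish in the number of non-zeros.</p>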
<p>The second thing you may have noticed is that there is a second constructor that is not needed here at all. This is really a gift to future me that avoids me having to rewrite this code at some point in the future. Nothing to see here.</p>
<p>With all of this in hand, we can jump over to the code indebted to<sup>2</sup> Finn.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource cpp number-lines code-with-copy"><code class="sourceCode cpp"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">template</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpChol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb3-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpChol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>MatrixType partial_inverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb3-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> SpChol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> llt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> pattern</span>
<span id="cb3-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/**</span></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *  Input:</span></span>
<span id="cb3-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *  - `llt`: a Sparse Cholesky factorization of a matrix `Q`.</span></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *  - `pattern`: a sparse matrix with the target sparsity</span></span>
<span id="cb3-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *  Assumptions:</span></span>
<span id="cb3-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *  - `pattern` has the same sparsity pattern as `Q` or is a subset of that pattern</span></span>
<span id="cb3-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *  Output: </span></span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *  - A sparse matrix with the same sparsity pattern as `pattern` whose non-zero</span></span>
<span id="cb3-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> *    elements correspond to the non-zero elements of $Q^{-1}$.</span></span>
<span id="cb3-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> **/</span></span>
<span id="cb3-16">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typedef</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>ReverseInnerIterator reverse_it<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb3-17">    StorageIndex ncols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> llt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb3-18">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> llt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>matrixL<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb3-19">    SpMat Qinv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">template</span> selfadjointView<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Lower<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;();</span></span>
<span id="cb3-20"></span>
<span id="cb3-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ncols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-22">        reverse_it QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>Qinv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb3-23">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>reverse_it LcolI_slow<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>L<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> LcolI_slow<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>LcolI_slow<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-24">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// inner sum iterators</span></span>
<span id="cb3-25">            reverse_it LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>L<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb3-26">            reverse_it QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>Qinv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> LcolI_slow<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">());</span></span>
<span id="cb3-27">            </span>
<span id="cb3-28">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Initialize Qinv[j,i]</span></span>
<span id="cb3-29">            QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>valueRef<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb3-30"></span>
<span id="cb3-31">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Inner-most sum</span></span>
<span id="cb3-32">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-33">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// First up, sync the iterators</span></span>
<span id="cb3-34">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span> QinvcolJ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">())){</span></span>
<span id="cb3-35">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb3-36">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-37">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>QinvcolJ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()))</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-38">                    QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>valueRef<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-=</span> LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb3-39">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb3-40">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-41">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb3-42">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-43">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// At this point LcolI is the diagonal value</span></span>
<span id="cb3-44">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> LcolI_slow<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">())</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-45">                QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>valueRef<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb3-46">                QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>valueRef<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span>  LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb3-47">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-48">                QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>valueRef<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span>  LcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb3-49">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Set Qinv[i,j] = Qinv[j,i]</span></span>
<span id="cb3-50">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-51">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb3-52">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-53">                QinvcolJ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>valueRef<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb3-54">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-55">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>QinvcolI<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb3-56">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-57">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-58"></span>
<span id="cb3-59">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Undo the permutation</span></span>
<span id="cb3-60">    Qinv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Qinv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>twistedBy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>llt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>permutationP<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">().</span>inverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">());</span></span>
<span id="cb3-61"></span>
<span id="cb3-62">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Return the non-zero elements of Qinv corresponding to the non-zero</span></span>
<span id="cb3-63">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// elements of Q</span></span>
<span id="cb3-64">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> MatchPattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>Qinv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Q<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)();</span></span>
<span id="cb3-65"></span>
<span id="cb3-66"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p>You’ll probably notice that there are far fewer template shenanigans here than in the block matrix code from yesterday. That is because this only needs to work with scalar types and doesn’t need to be part of the <code>math</code> API. If needed, I guess we could always work out what the derivative of the partial inverse is and implement its reverse-mode specialization in Stan, but frankly why<sup>3</sup> bother.</p>
<p>The other thing you may notice is the line</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource cpp number-lines code-with-copy"><code class="sourceCode cpp"><span id="cb4-1">Qinv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Qinv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>twistedBy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>llt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>permutationP<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">().</span>inverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">());</span></span></code></pre></div>
<p>This exists because the Cholesky factorization is not actually performed on <img src="https://latex.codecogs.com/png.latex?Q"> but rather on a permuted matrix <img src="https://latex.codecogs.com/png.latex?PQP%5ET"> for some fill-reducing permutation matrix <img src="https://latex.codecogs.com/png.latex?P">. This line undoes the permutation and puts everything back in its right place.</p>
</section>
<section id="a-quick-test" class="level2">
<h2 class="anchored" data-anchor-id="a-quick-test">A quick test</h2>
<p>Finally, we need to make sure this works. The easiest way to do that is with a simple example: a 25x25 sparse matrix. Everything is hard-coded, because why not.</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource cpp number-lines code-with-copy"><code class="sourceCode cpp"><span id="cb5-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;iostream&gt;</span></span>
<span id="cb5-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">"partial_inverse.hpp"</span></span>
<span id="cb5-3"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">"Eigen/SparseCore"</span></span>
<span id="cb5-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">"Eigen/Dense"</span></span>
<span id="cb5-5"></span>
<span id="cb5-6"></span>
<span id="cb5-7"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> main<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb5-8">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">using</span> SparseMatrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>ColMajor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;;</span></span>
<span id="cb5-9"></span>
<span id="cb5-10">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> Q_inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-11">                    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-12">                    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">19</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-13">                    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">21</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">19</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">23</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">19</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">21</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-14">                    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">21</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">21</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">23</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">23</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">19</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">23</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span></span>
<span id="cb5-15">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> Q_outer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">27</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">37</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">41</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">55</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">68</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">73</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">78</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">83</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-16">                    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">87</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">90</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">94</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">98</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">102</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">105</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span> </span>
<span id="cb5-17">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span> Q_val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-18">                    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-19">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-20">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-21">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-22">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-23">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-24">                    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-25">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span></span>
<span id="cb5-26">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>VectorXd Qinv_true<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">105</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb5-27">    Qinv_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.220593295593296</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.234139471639472</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.220593295593296</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25021645021645</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.251621989121989</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25021645021645</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.234139471639472</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.251621989121989</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.253146853146853</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.251621989121989</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.234139471639472</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25021645021645</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0664335664335664</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.251621989121989</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0652680652680653</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25021645021645</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.220593295593296</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0611402486402486</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.234139471639472</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0602730602730603</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0547785547785548</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.233306970806971</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.051483238983239</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.220593295593296</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb5-28">    </span>
<span id="cb5-29">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> Q_ncol <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb5-30">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> Q_nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">105</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb5-31"></span>
<span id="cb5-32">    SparseMatrix Q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;(</span>Q_ncol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Q_ncol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Q_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Q_outer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Q_inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Q_val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb5-33"></span>
<span id="cb5-34"></span>
<span id="cb5-35">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">auto</span> llt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SimplicialLLT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;(</span>Q<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb5-36"></span>
<span id="cb5-37">    SparseMatrix Qinv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> partial_inverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>llt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> Q<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb5-38">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>VectorXd Qinv_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>VectorXd<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;(</span>Qinv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>valuePtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(),</span> Q_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb5-39"></span>
<span id="cb5-40">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The error in the partial inverse is "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>Qinv_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> Qinv_true<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">).</span>norm<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"!"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb5-41"></span>
<span id="cb5-42"></span>
<span id="cb5-43"></span>
<span id="cb5-44"></span>
<span id="cb5-45"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p>The output is</p>
<pre><code>The error in the partial inverse is 1.25852e-15!</code></pre>
<p>All good here.</p>
<p>That’s the end for this blog post. Hopefully I’ll be back soon-ish with a more interesting post that actually uses all of this stuff.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Takahashi, K., Fagan, J., Chen, M.S., 1973. Formation of a sparse bus impedance matrix and its application to short circuit study. In: Eighth PICA Conference Proceedings. IEEE Power Engineering Society, pp.&nbsp;63–69 (Papers Presented at the 1973 Power Industry Computer Application Conference in Minneapolis, MN).↩︎</p></li>
<li id="fn2"><p>stolen from↩︎</p></li>
<li id="fn3"><p>One reason would be to use gradient descent on the score function for a Gaussian MLE. Another is that this might be useful inside the <code>generated quantities</code> block to compute things like the marginal variances of the model, but, as the great lady said, not today Satan.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2024,
  author = {Simpson, Dan},
  title = {Random {C++} {Part} 2: {Sparse} Partial Inverses in {Eigen}},
  date = {2024-09-05},
  url = {https://dansblog.netlify.app/posts/2024-09-05-partial-inverse/partial-inverse.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2024" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2024. <span>“Random C++ Part 2: Sparse Partial Inverses in
Eigen.”</span> September 5, 2024. <a href="https://dansblog.netlify.app/posts/2024-09-05-partial-inverse/partial-inverse.html">https://dansblog.netlify.app/posts/2024-09-05-partial-inverse/partial-inverse.html</a>.
</div></div></section></div> ]]></description>
  <category>Stan</category>
  <category>Sparse matrices</category>
  <category>Autodiff</category>
  <category>Eigen</category>
  <guid>https://dansblog.netlify.app/posts/2024-09-05-partial-inverse/partial-inverse.html</guid>
  <pubDate>Wed, 04 Sep 2024 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2024-09-05-partial-inverse/reba.JPG" medium="image"/>
</item>
<item>
  <title>Random C++ Part 1: Building a block sparse matrix in Eigen</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2024-09-04-block-matrices/blocks.html</link>
  <description><![CDATA[ 





<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Code availability
</div>
</div>
<div class="callout-body-container callout-body">
<p>The code from this post can be found in my <a href="https://github.com/dpsimpson/blog/tree/master/posts/2024-09-04-block-matrices">github repo</a>.</p>
</div>
</div>
<p>I’ll be honest with y’all. I was writing something else. It was really long and was getting annoying to edit and was probably never going to be finished. So instead of doing that, I am just going to post this. It’s about making a block matrix in a Stan-compatible way. Why?? Because I wanted to be able to do this.</p>
<p>There is no context forthcoming. There are no good jokes. Just building one sparse matrix.</p>
<p>Enjoy</p>
<section id="c-plumbing-building-a-2x2-block-sparse-matrix-from-a-sparse-11-block-and-two-dense-matrices" class="level2">
<h2 class="anchored" data-anchor-id="c-plumbing-building-a-2x2-block-sparse-matrix-from-a-sparse-11-block-and-two-dense-matrices">C++ Plumbing: Building a 2x2 block sparse matrix from a sparse (1,1) block and two dense matrices</h2>
<p>The first thing that we need to do is build a block-sparse matrix. We know that this matrix is symmetric so we only need to store the lower-triangle.</p>
<p>In general, this is not the most difficult task in the world. We have <a href="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices#so-how-do-we-store-a-sparse-matrix">already talked about how we store sparse matrices</a> and, in particular, have had some fun with the Compressed Column Storage (CCS) scheme, which stores sparse matrices column-by-column. In the lingo, we call this <em>column major</em> storage.</p>
<p>When any array of numbers is stored in memory by a program, it is stored as a long vector and when you index into it (using something like <code>A[i,j]</code>) this is just some syntactic sugar for finding the correct value in that long vector.</p>
<p>Some languages, such as Fortran and Matlab, and libraries, such as <a href="https://gitlab.com/libeigen/eigen">Eigen</a>, store arrays in column major order. Others, like C/C++ and Python, use row-major storage. Stan is written in C++ but all of its linear algebra is done using Eigen, so we are going to use column-major storage.</p>
<p>It may seem catastrophically nerdy to be talking about internal storage orders for arrays in different languages, but I promise you this is <em>incredibly</em> important. If you want to write any sort of performant code, your algorithms need to be aligned with the internal storage order. That means that we need to prefer algorithms that run down columns of matrices over ones that run across rows.</p>
<p>This is because computers are clever and when you ask them for, e.g., <code>A[0,0]</code>, the CPU will actually load the first few entries of the 0th <em>column</em><sup>1</sup> of <code>A</code> in anticipation<sup>2</sup> that you will need <code>A[1,0]</code> and its friends next. If you instead next ask for <code>A[0,1]</code>, the CPU has to throw its pre-loaded stuff out, reach out to some potentially distant memory, and try again. When an array has a lot of rows, these cache misses<sup>3</sup> noticeably degrade the performance of a program.</p>
<p>All of that is to say that this is actually not too too hard to implement because we are just interleaving some contiguous chunks of a vector. While the main loop is pretty straightforward, C++ is truly a journey. So it’s gonna be like 100 lines of code.</p>
<p>The structure is</p>
<ol type="1">
<li><p>Allocate 3 arrays to store the outer index (which column?), the inner index (which row?), and the value.</p></li>
<li><p>Iterate through each column of the matrix, only storing the lower triangle.</p></li>
<li><p>Return an <code>Eigen::SparseMatrix&lt;double&gt;</code> built from those arrays.</p></li>
</ol>
<p>There are essentially two challenges in doing this. Firstly, the number of columns and the number of non-zeros are not known at compile time, so we need to allocate dynamic memory on the heap. This is always a risky proposition in C++ as it’s pretty easy to screw up and end up with a memory leak. To get around this, I’m using the RAII (resource acquisition is initialization) pattern, which basically encapsulates all the memory usage inside a functor, whose call method returns a sparse symmetric matrix.</p>
<p>The second challenge is that the Eigen API demands raw pointers. So this is going to have that good old fashioned <code>*ptr++</code> action.</p>
<p>Without further ado, here is the code. I’ll explain some key bits after.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource cpp number-lines code-with-copy"><code class="sourceCode cpp"><span id="cb1-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;stan/math/prim/meta/is_eigen_sparse_base.hpp&gt;</span></span>
<span id="cb1-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;stan/math/prim/meta/is_eigen.hpp&gt;</span></span>
<span id="cb1-3"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;stan/math/prim/meta/is_stan_scalar.hpp&gt;</span></span>
<span id="cb1-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;stan/math/prim/meta/base_type.hpp&gt;</span></span>
<span id="cb1-5"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;stan/math/prim/err/check_size_match.hpp&gt;</span></span>
<span id="cb1-6"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;stan/math/prim/fun/to_ref.hpp&gt;</span></span>
<span id="cb1-7"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;Eigen/SparseCore&gt;</span></span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">namespace</span> stan <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">namespace</span> math <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typedef</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;::</span>StorageIndex StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-13"></span>
<span id="cb1-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The require_ statements are defined in the first #include</span></span>
<span id="cb1-15"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">template</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> EigMat1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> EigMat2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> </span>
<span id="cb1-16"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">require_eigen_sparse_base_t</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">nullptr</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-17"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">require_all_eigen_t</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>EigMat1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> EigMat2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">nullptr</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-18"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">require_all_stan_scalar_t</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">base_type_t</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;,</span></span>
<span id="cb1-19">                          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">base_type_t</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>EigMat1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;,</span></span>
<span id="cb1-20">                          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">base_type_t</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>EigMat2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">nullptr</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>  </span>
<span id="cb1-21"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> Block_sparse_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/* </span></span>
<span id="cb1-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    A RAII functor class because Jesus hates memory leaks</span></span>
<span id="cb1-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Make this encapsulate the whole thing.</span></span>
<span id="cb1-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    You may be asking why I'm using arrays and pointers</span></span>
<span id="cb1-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    like I'm writing in C, and the answer is </span></span>
<span id="cb1-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    "that's the interface to Map". The dream of the </span></span>
<span id="cb1-28"><span class="co" style="color: #5E5E5E;
background-color: null;
    C-90 is alive">
font-style: inherit;">    C-90 is alive and well in the Eigen code base.</span></span>
<span id="cb1-29"></span>
<span id="cb1-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Anyway, `operator ()` returns a sparseMatrixMap</span></span>
<span id="cb1-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    */</span></span>
<span id="cb1-32">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">using</span> T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">base_type</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;::</span>type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-33">   </span>
<span id="cb1-34">    StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-35">    StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-36">    T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-37">    StorageIndex <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-38">    StorageIndex <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-39"></span>
<span id="cb1-40">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">public</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb1-41"></span>
<span id="cb1-42">    Block_sparse_lower<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb1-43">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> top_left<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> </span>
<span id="cb1-44">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> EigMat1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> bottom_left<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> </span>
<span id="cb1-45">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> EigMat2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> bottom_right</span>
<span id="cb1-46">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> </span>
<span id="cb1-47">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-48">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// only eval once</span></span>
<span id="cb1-49">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">auto</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> tl_ref <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> to_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>top_left<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb1-50">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">auto</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> bl_ref <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> to_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>bottom_left<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb1-51">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">auto</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> br_ref <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> to_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>bottom_right<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb1-52"></span>
<span id="cb1-53">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Get sizes.</span></span>
<span id="cb1-54">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// NB tmp_nnz is an upper bound. Will only be correct if `top_left` is lower </span></span>
<span id="cb1-55">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// triangular. We will compute the real value on the fly.</span></span>
<span id="cb1-56">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> StorageIndex ncols_tl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tl_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb1-57">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> StorageIndex ncols_br <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> br_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb1-58">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> StorageIndex tmp_nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tl_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>nonZeros<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> ncols_tl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ncols_br </span>
<span id="cb1-59">                                        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>ncols_br <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ncols_br <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb1-60"></span>
<span id="cb1-61">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// check sizes</span></span>
<span id="cb1-62">        check_size_match<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Block_sparse_lower"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Columns of "</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"top_left "</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> tl_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(),</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Columns of "</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
&quot;Bottom Left&quot;">
font-style: inherit;">"bottom_left"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> bl_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">());</span></span>
<span id="cb1-63">        check_size_match<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Block_sparse_lower"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Rows of "</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
&quot;bottom-left &quot;">
font-style: inherit;">"bottom_left "</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> bl_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>rows<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(),</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Rows of "</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
&quot;Bottom-right&quot;">
font-style: inherit;">"bottom_right"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> br_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>rows<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">());</span></span>
<span id="cb1-64">        </span>
<span id="cb1-65">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Allocate!</span></span>
<span id="cb1-66">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ncols_tl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> ncols_br<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-67"></span>
<span id="cb1-68">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb1-69">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>top_left<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>outerIndexPtr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb1-70">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>tmp_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb1-71">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">new</span> T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>tmp_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="cb1-72">        </span>
<span id="cb1-73">        T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> p_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-74">        StorageIndex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> p_inner <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-75">        StorageIndex out_nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-76">        </span>
<span id="cb1-77">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>StorageIndex j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> ncols_tl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-78">            StorageIndex col_cnt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-79">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> SpMat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>InnerIterator it<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tl_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> it<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>it<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-80">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>it<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">continue</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// lower triangle only</span></span>
<span id="cb1-81">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>p_val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> it<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb1-82">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>p_inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> it<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb1-83">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>out_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-84">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>col_cnt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-85">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-86"></span>
<span id="cb1-87">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>StorageIndex i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> ncols_br<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-88">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>p_val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bl_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>coeff<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb1-89">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>p_inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ncols_tl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-90">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>out_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-91">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>col_cnt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-92">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-93">        </span>
<span id="cb1-94">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_cnt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-95">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-96">        </span>
<span id="cb1-97">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>StorageIndex j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> ncols_br<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-98">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// only need lower triangle</span></span>
<span id="cb1-99">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>StorageIndex i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> ncols_br<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-100">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>p_val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> br_ref<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>coeff<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb1-101">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>p_inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ncols_tl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-102">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++</span>out_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-103">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-104">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>ncols_tl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>ncols_tl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> ncols_br <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-105">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-106">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> out_nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-107">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// constructor</span></span>
<span id="cb1-108"></span>
<span id="cb1-109">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>Block_sparse_lower<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-110">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">delete</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-111">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">delete</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-112">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">delete</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-113">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// destructor</span></span>
<span id="cb1-114"></span>
<span id="cb1-115">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">operator</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-116">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">typename</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>T<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;::</span>Map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb1-117">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> </span>
<span id="cb1-118">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_cols</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-119">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_nnz</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-120">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_outer</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-121">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_inner</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-122">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">m_val</span></span>
<span id="cb1-123">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span>   </span>
<span id="cb1-124">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//operator ()</span></span>
<span id="cb1-125"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Block_sparse_lower</span></span>
<span id="cb1-126"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// namespace math</span></span>
<span id="cb1-127"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// namespace stan</span></span></code></pre></div>
<p>The first thing you probably noticed was all the templates. Templates are a beautiful<sup>4</sup> feature of C++, and pretty much all of that machinery just lets us accept any dense and sparse matrix types from Eigen, as long as they contain scalars (as opposed to autodiff variables). They also let us hack together a pre-C++20 version of <a href="https://en.wikipedia.org/wiki/Concepts_(C%2B%2B)">concepts</a>, which is what all of the <code>require_</code> statements are doing.</p>
<p>Once we are actually in the class, it has three methods. The constructor takes the three matrices: one sparse and two dense. It checks at compile time that they are all column-major and then gets to work. There’s nothing too exciting happening here: some size checking, and then a loop that stacks the relevant pieces of each matrix’s internal index and value vectors on top of one another.</p>
<p>The destructor frees the allocated memory (a core part of the RAII pattern).</p>
<p>Finally, we need to actually get access to this sparse matrix, which I implemented as a call operator. It returns a <code>Map</code> of the three pointers, which we can then wrap in a self-adjoint view (aka it will pretend to be symmetric when doing operations, even though only the lower triangle is filled). A <code>Map</code> is a nice way to tell Eigen’s internal <code>SparseMatrix</code> representation to look at the pieces of memory defined in this class when it needs inner indices, outer indices, or values. This doesn’t create a copy, so it’s memory efficient.</p>
<p>So let’s test it. I’m going to run the following code.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource cpp number-lines code-with-copy"><code class="sourceCode cpp"><span id="cb2-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;iostream&gt;</span></span>
<span id="cb2-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">"sp_block.hpp"</span></span>
<span id="cb2-3"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">"Eigen/SparseCore"</span></span>
<span id="cb2-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">"Eigen/Dense"</span></span>
<span id="cb2-5"></span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> main<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-8">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-----------matrix test---------"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-9">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span> values<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span></span>
<span id="cb2-10">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// nonzero row indices</span></span>
<span id="cb2-11">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> outer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// start index per column + 1 for last col</span></span>
<span id="cb2-12">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;::</span>Map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb2-13">        <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/*rows*/</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/*cols*/</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/*nonzeros*/</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> outer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> values<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-14">    </span>
<span id="cb2-15">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>MatrixXd<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-16"></span>
<span id="cb2-17">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Matrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-18">    B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-19">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-20">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Matrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-21">    C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-22">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-23">    </span>
<span id="cb2-24">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"   -------ans-------"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-25">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> </span>
<span id="cb2-26">        stan<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>math<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Block_sparse_lower<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">decltype</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">decltype</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">decltype</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)&gt;(</span></span>
<span id="cb2-27">            A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>triangularView<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Lower<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;(),</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>triangularView<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Lower<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;())();</span></span>
<span id="cb2-28">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>MatrixXd<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>D<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-29"></span>
<span id="cb2-30">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-----------to_ref test---------"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-31">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> E <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-32">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>MatrixXd<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-33">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"   -------ans-------"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-34">    Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>SparseMatrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">double</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> F <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> </span>
<span id="cb2-35">        stan<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>math<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Block_sparse_lower<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">decltype</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>E<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">decltype</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">decltype</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)();</span></span>
<span id="cb2-36">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>cout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> Eigen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>MatrixXd<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>F<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">std::</span>endl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb2-37"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p>After compiling and running, the output is</p>
<pre><code>-----------matrix test---------
0 0 0 0 0
0 0 0 4 0
0 0 3 0 0
0 2 0 0 0
1 0 0 0 0

1 2 3 4 5
1 2 3 4 5

1 1
1 1

   -------ans-------
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 3 0 0 0 0
0 2 0 0 0 0 0
1 0 0 0 0 0 0
1 2 3 4 5 1 0
1 2 3 4 5 1 1

-----------to_ref test---------
0 0 0 0 0
0 0 0 4 0
0 0 3 0 0
0 2 0 0 0
1 0 0 0 0

   -------ans-------
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 3 0 0 0 0
0 2 0 0 0 0 0
1 0 0 0 0 0 0
1 2 3 4 5 1 0
1 2 3 4 5 1 1
</code></pre>
<p>This is exactly what we expect! Hooray.</p>
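<p>As a sanity check, the same lower-triangle construction can be reproduced from Python with <code>scipy.sparse</code> (an illustration, not the C++ code from this post; the matrices are the ones printed above):</p>

```python
# Sanity check of the block construction: the lower triangle of the symmetric
# 2x2 block matrix [[A, B^T], [B, C]] is [[tril(A), 0], [B, tril(C)]].
import numpy as np
import scipy.sparse as sp

A = sp.csc_matrix(np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 4, 0],
    [0, 0, 3, 0, 0],
    [0, 2, 0, 0, 0],
    [1, 0, 0, 0, 0],
], dtype=float))
B = sp.csc_matrix(np.array([[1, 2, 3, 4, 5],
                            [1, 2, 3, 4, 5]], dtype=float))
C = sp.csc_matrix(np.array([[1, 1],
                            [1, 1]], dtype=float))

# Assemble only the lower-triangular blocks; None stands for a zero block.
F = sp.bmat([[sp.tril(A), None],
             [B,          sp.tril(C)]], format="csc")
print(F.toarray())
```

<p>The printed matrix matches the <code>-------ans-------</code> block above: the strictly upper-triangular entry of <code>A</code> (the 4) is dropped and only the lower triangle of <code>C</code> survives.</p>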
<p>And that’s it. A symmetric 2x2 block sparse matrix in C++. Who knows what I’ll do next.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Or row if it’s row major↩︎</p></li>
<li id="fn2"><p>Let’s anthropomorphize. I don’t want to write a blog about caches.↩︎</p></li>
<li id="fn3"><p>Drag name: Cache Mx↩︎</p></li>
<li id="fn4"><p>Until you’re rooting around a seventy page compiler error that really just means you forgot a typename on the final <code>return</code>.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2024,
  author = {Simpson, Dan},
  title = {Random {C++} {Part} 1: {Building} a Block Sparse Matrix in
    {Eigen}},
  date = {2024-09-04},
  url = {https://dansblog.netlify.app/posts/2024-09-04-block-matrices/blocks.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2024" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2024. <span>“Random C++ Part 1: Building a Block Sparse
Matrix in Eigen.”</span> September 4, 2024. <a href="https://dansblog.netlify.app/posts/2024-09-04-block-matrices/blocks.html">https://dansblog.netlify.app/posts/2024-09-04-block-matrices/blocks.html</a>.
</div></div></section></div> ]]></description>
  <category>Stan</category>
  <category>Sparse matrices</category>
  <category>Autodiff</category>
  <category>Eigen</category>
  <guid>https://dansblog.netlify.app/posts/2024-09-04-block-matrices/blocks.html</guid>
  <pubDate>Tue, 03 Sep 2024 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2024-09-04-block-matrices/ravens.JPEG" medium="image"/>
</item>
<item>
  <title>An unexpected detour into partially symbolic, sparsity-exploiting autodiff; or Lord won’t you buy me a Laplace approximation</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace.html</link>
  <description><![CDATA[ 





<p>I am, once again, in a bit of a mood. And the only thing that will fix my mood is a good martini and a Laplace approximation. And I’m all out of martinis.</p>
<p>To be honest, I started writing this post in February 2023, but then got distracted by visas and jobs and all that jazz. But I felt the desire to finish it, so here we are. I wonder how much I will want to re-write.<sup>1</sup></p>
<p>The post started as a pedagogical introduction to Laplace approximations (for reasons I don’t fully remember), but it rapidly went off the rails. So strap yourself in<sup>2</sup> for a tour through the basics of sparse autodiff and through manipulating the <code>jaxpr</code> intermediate representation to make one very simple logistic regression produce autodiff code that is almost as fast as a hand-written gradient.</p>
<section id="the-laplace-approximation" class="level2">
<h2 class="anchored" data-anchor-id="the-laplace-approximation">The Laplace approximation</h2>
<p>One of the simplest approximations to a distribution is the Laplace approximation. It can be defined as the Gaussian distribution that matches the location and the curvature at the mode of the target distribution. It lives its best life when the density is of the form <img src="https://latex.codecogs.com/png.latex?%0Ap(x)%20%5Cpropto%20%5Cexp(-nf_n(x)),%0A"> where <img src="https://latex.codecogs.com/png.latex?f_n"> is a sequence of functions<sup>3</sup>. Let’s imagine that we want to approximate the normalized density <img src="https://latex.codecogs.com/png.latex?p(x)"> near the mode <img src="https://latex.codecogs.com/png.latex?x%5E*">. We can do this by taking the second-order Taylor expansion of <img src="https://latex.codecogs.com/png.latex?f_n"> around <img src="https://latex.codecogs.com/png.latex?x%20=%20x%5E*">, which (because the gradient vanishes at the mode) is <img src="https://latex.codecogs.com/png.latex?%0Af_n(x)%20=%20f_n(x%5E*)%20+%20%5Cfrac%7B1%7D%7B2%7D(x-x%5E*)%5ETH(x%5E*)(x-x%5E*)%20+%20%5Cmathcal%7BO%7D((x-x%5E*)%5E3),%0A"> where<sup>4</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5BH(x%5E*)%5D_%7Bij%7D%20=%20%5Cfrac%7B%5Cpartial%5E2%20f_n%7D%7B%5Cpartial%20x_i%20%5Cpartial%20x_j%7D%0A"> is the Hessian matrix.</p>
<p>If we replace <img src="https://latex.codecogs.com/png.latex?f_n"> by its quadratic approximation we get <img src="https://latex.codecogs.com/png.latex?%0Ap(x)%20%5Capprox%20%20C%5Cexp%5Cleft(-%5Cfrac%7Bn%7D%7B2%7D(x-x%5E*)%5ETH(x%5E*)(x-x%5E*)%5Cright),%0A"> where <img src="https://latex.codecogs.com/png.latex?C"> is a constant.</p>
<p>After normalizing the approximation to make sure that we get a proper density, we get the Laplace approximation <img src="https://latex.codecogs.com/png.latex?%0Ap(x)%20%5Capprox%20N(x%5E*,%20n%5E%7B-1%7DH(x%5E*)%5E%7B-1%7D).%0A"></p>
<p>The Laplace approximation can be justified rigorously, has well-studied error bounds, and is known to work quite well when <img src="https://latex.codecogs.com/png.latex?p(x)"> is a) unimodal<sup>5</sup> and b) isn’t tooooo non-Gaussian.</p>
<p>In practice, people have found that Laplace approximations do a reasonable<sup>6</sup> job quantifying uncertainty <a href="https://arxiv.org/abs/2106.14806">even in complex neural network models</a> and it is at the heart of any number of classical estimators in statistics.</p>
<p>From an implementation perspective, the Laplace approximation is pretty simple. It’s just a two step process:</p>
<div class="algorithm">
<ol type="1">
<li><p>Find the mode <img src="https://latex.codecogs.com/png.latex?x%5E*%20=%20%5Carg%20%5Cmax_x%20f_n(x)"> using your favorite optimizer</p></li>
<li><p>Compute the Hessian <img src="https://latex.codecogs.com/png.latex?H(x%5E*)">.</p></li>
</ol>
</div>
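<p>The two steps can be sketched in ordinary Python for a one-dimensional example with a closed-form answer (an illustration with <code>scipy</code>, taking <img src="https://latex.codecogs.com/png.latex?n%20=%201">; it is not the JAX implementation that follows). For a Gamma density <img src="https://latex.codecogs.com/png.latex?p(x)%20%5Cpropto%20x%5E%7Ba-1%7De%5E%7B-x%7D">, the Laplace approximation is known to be <img src="https://latex.codecogs.com/png.latex?N(a-1,%20a-1)">:</p>

```python
# Two-step Laplace approximation in 1D: (1) find the mode with an off-the-shelf
# optimizer, (2) compute the curvature of -log p at the mode.
import numpy as np
from scipy.optimize import minimize_scalar

a = 10.0
# Negative log-density of Gamma(shape=a, rate=1), up to an additive constant.
f = lambda x: -((a - 1.0) * np.log(x) - x)

# Step 1: find the mode. Analytically the mode is a - 1 = 9.
res = minimize_scalar(f, bounds=(1e-6, 100.0), method="bounded")
mode = res.x

# Step 2: curvature at the mode, here by a central finite difference.
# Analytically f''(x) = (a - 1) / x^2, so the variance is 1/f''(mode) = a - 1.
h = 1e-4
hess = (f(mode + h) - 2.0 * f(mode) + f(mode - h)) / h**2
variance = 1.0 / hess

print(mode, variance)
```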
<p>In a Bayesian context, we typically take <img src="https://latex.codecogs.com/png.latex?%0Af_n(x)%20=%20%5Cfrac%7B1%7D%7Bn%7D%20%5Csum_%7Bi=1%7D%5En%20%5Clog%20p(y_i%20%5Cmid%20x)%20+%20%5Cfrac%7B1%7D%7Bn%7D%20%5Clog%20p(x),%0A"> which will lead to a Gaussian approximation to the posterior distribution. But this post really isn’t about Bayes. It’s about Laplace approximations.</p>
<section id="computing-the-laplace-approximation-in-jax" class="level3">
<h3 class="anchored" data-anchor-id="computing-the-laplace-approximation-in-jax">Computing the Laplace approximation in JAX</h3>
<p>This is a two-step process and, to be honest, all of the steps are pretty standard. So (hopefully) this will not be too tricky to implement. For simplicity, I’m not going to bother with the dividing and multiplying by <img src="https://latex.codecogs.com/png.latex?n">, although for very large data it could be quite important.</p>
<div id="4cea8f03" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jax.numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jnp</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax.scipy.optimize <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> minimize</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax.scipy.special <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> expit</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jacfwd, grad</span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Array</span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Callable, Tuple, List, Set, Dict</span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> laplace(f: Callable, x0: Array) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Array:</span>
<span id="cb1-9">    nx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x0.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb1-10">    mode, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>details <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> minimize(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>f(x), x0, method <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BFGS"</span>)</span>
<span id="cb1-11">    H <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jacfwd(grad(f))(mode)</span>
<span id="cb1-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> mode, H</span></code></pre></div>
</div>
<p>There’s not really much to note in this code, except that <code>jax.scipy.optimize.minimize</code> finds the minimum of <img src="https://latex.codecogs.com/png.latex?f">, so I had to pass in the negative of the function. The sign flip also propagates to the Hessian, which is computed as the Jacobian of the gradient of f.</p>
<p>Depending on what needs to be done with the Laplace approximation, it might be more appropriate to output the log-density rather than just the mode and the Hessian, but for the moment we will keep this signature.</p>
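<p>For instance, a variant that returns the log-density could look something like this (a sketch in plain <code>numpy</code>/<code>scipy</code> rather than JAX; <code>laplace_logpdf</code> is a hypothetical name, not from this post):</p>

```python
# Wrap the Laplace output (mode, Hessian of -log p at the mode) as a callable
# log-density for the Gaussian approximation N(mode, H^{-1}).
import numpy as np
from scipy.stats import multivariate_normal

def laplace_logpdf(mode, H):
    """Log-density of the Laplace approximation N(mode, H^{-1})."""
    cov = np.linalg.inv(H)
    return lambda x: multivariate_normal.logpdf(x, mean=mode, cov=cov)

# For a Gaussian target the approximation is exact at every point.
mode = np.zeros(2)
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # Hessian of -log target at the mode
log_q = laplace_logpdf(mode, H)
print(log_q(np.array([0.3, -0.2])))
```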
<p>Let’s try it out. First of all, I’m going to generate some random data from a logistic regression model. This is going to use <a href="https://jax.readthedocs.io/en/latest/jax-101/05-random-numbers.html">Jax’s slightly odd random number system where you need to manually update the state of the pseudo-random number generator</a>. This is beautifully repeatable<sup>7</sup> unlike, say, R or standard numpy, where you’ve got to pay <em>a lot</em> of attention to the state of the random number generator to avoid oddities.</p>
<div id="938fe91c" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> random <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jrandom</span>
<span id="cb2-2"></span>
<span id="cb2-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> make_data(key, n: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, p: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Tuple[Array, Array]:</span>
<span id="cb2-4">  key, sub <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.split(key)</span>
<span id="cb2-5">  X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.normal(sub, shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (n,p)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>jnp.sqrt(p)</span>
<span id="cb2-6"></span>
<span id="cb2-7">  key, sub <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.split(key)</span>
<span id="cb2-8">  beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jrandom.normal(sub, shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (p,))</span>
<span id="cb2-9">  key, sub <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.split(key)</span>
<span id="cb2-10">  beta0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.normal(sub)</span>
<span id="cb2-11"></span>
<span id="cb2-12"></span>
<span id="cb2-13">  key, sub <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.split(key)</span>
<span id="cb2-14">  y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.bernoulli(sub, expit(beta0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> beta))</span>
<span id="cb2-15"></span>
<span id="cb2-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (y, X)</span></code></pre></div>
</div>
<p>An interesting side-note here is that I’ve generated the design matrix <img src="https://latex.codecogs.com/png.latex?X"> to have standard Gaussian columns. This is <em>not</em> a benign choice as <img src="https://latex.codecogs.com/png.latex?n"> gets big. With <em>very</em> high probability, the columns of <img src="https://latex.codecogs.com/png.latex?X"> will be almost<sup>8</sup> orthonormal, which means that this is the best possible case for logistic regression. Generally speaking, design matrices from real<sup>9</sup> data have a great deal of co-linearity in them and so algorithms that perform well on random design matrices may perform less well on real data.</p>
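<p>The near-orthogonality claim is easy to check numerically (a <code>numpy</code> illustration, not the JAX code from this post; the Gram matrix is rescaled so that its population value is exactly the identity):</p>

```python
# For a Gaussian design, the rescaled Gram matrix concentrates near the
# identity as n grows: columns are nearly orthogonal with nearly equal norms.
import numpy as np

rng = np.random.default_rng(30127)
n, p = 100_000, 5
X = rng.standard_normal((n, p)) / np.sqrt(p)  # same scaling as make_data

gram = (p / n) * X.T @ X  # E[gram] = I with this scaling
off_diag = gram - np.diag(np.diag(gram))
print(np.abs(np.diag(gram) - 1).max(), np.abs(off_diag).max())
```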
<p>Ok, so let’s fit the model! I’m just going to use <img src="https://latex.codecogs.com/png.latex?N(0,1)"> priors on all of the <img src="https://latex.codecogs.com/png.latex?%5Cbeta">s.</p>
<div id="58eec67f" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> functools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> partial</span>
<span id="cb3-2">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb3-3">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb3-4"></span>
<span id="cb3-5">key <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.PRNGKey(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30127</span>)</span>
<span id="cb3-6">y, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_data(key, n, p)</span>
<span id="cb3-7"></span>
<span id="cb3-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> log_posterior(beta: Array, X: Array, y: Array) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Array:</span>
<span id="cb3-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> beta.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-10"></span>
<span id="cb3-11">    prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> expit(beta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> beta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:])</span>
<span id="cb3-12">    </span>
<span id="cb3-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (</span>
<span id="cb3-14">      jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.log(prob) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb3-15">      (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>y) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.log1p(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>prob)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> </span>
<span id="cb3-16">      <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.dot(beta, beta)</span>
<span id="cb3-17">    )</span>
<span id="cb3-18"></span>
<span id="cb3-19"></span>
<span id="cb3-20">post_mean, H <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> laplace(</span>
<span id="cb3-21">  partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y),</span>
<span id="cb3-22">  x0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>jnp.zeros(X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb3-23">)</span>
<span id="cb3-24"></span>
<span id="cb3-25">post_cov <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.linalg.inv(H)</span></code></pre></div>
</div>
<p>Let’s see how this performs relative to MCMC. To do that, I’m going to build an equivalent PyMC model.</p>
<div id="28ec7812" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pymc <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pm</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">with</span> pm.Model() <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> logistic_reg:</span>
<span id="cb4-6">  beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pm.Normal(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'beta'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,))</span>
<span id="cb4-7">  linpred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> beta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pm.math.dot(np.array(X), beta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:])</span>
<span id="cb4-8">  </span>
<span id="cb4-9">  pm.Bernoulli(</span>
<span id="cb4-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"y"</span>, </span>
<span id="cb4-11">    p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pm.math.invlogit(linpred),</span>
<span id="cb4-12">    observed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(y)</span>
<span id="cb4-13">  )</span>
<span id="cb4-14">  posterior <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pm.sample(</span>
<span id="cb4-15">    tune<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, </span>
<span id="cb4-16">    draws<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, </span>
<span id="cb4-17">    chains<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, </span>
<span id="cb4-18">    cores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb4-19"></span>
<span id="cb4-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># I would like to apologize for the following pandas code.</span></span>
<span id="cb4-21">tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pm.summary(posterior)</span>
<span id="cb4-22">tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp.assign(</span>
<span id="cb4-23">  laplace_mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> post_mean, </span>
<span id="cb4-24">  laplace_sd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(np.diag(post_cov)), </span>
<span id="cb4-25">  Variable <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp.index</span>
<span id="cb4-26">)[[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Variable"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"laplace_mean"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sd"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"laplace_sd"</span>]]</span>
<span id="cb4-27"></span>
<span id="cb4-28"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">with</span> pd.option_context(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'display.precision'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>):</span>
<span id="cb4-29">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(tmp)</span></code></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (4 chains in 1 job)
NUTS: [beta]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.</code></pre>
</div>
<div class="cell-output cell-output-display">

<style>
    /* Turns off some styling */
    progress {
        /* gets rid of default border in Firefox and Opera. */
        border: none;
        /* Needs to be in here for Safari polyfill so background images work as expected. */
        background-size: auto;
    }
    progress:not([value]), progress:not([value])::-webkit-progress-bar {
        background: repeating-linear-gradient(45deg, #7e7e7e, #7e7e7e 10px, #5c5c5c 10px, #5c5c5c 20px);
    }
    .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {
        background: #F44336;
    }
</style>
</div>
<div class="cell-output cell-output-display">

    <div>
      <progress value="2000" class="" max="2000" style="width:300px; height:20px; vertical-align: middle;"></progress>
      100.00% [2000/2000 00:01&lt;00:00 Sampling chain 0, 0 divergences]
    </div>
    
</div>
<div class="cell-output cell-output-display">

    <div>
      <progress value="2000" class="" max="2000" style="width:300px; height:20px; vertical-align: middle;"></progress>
      100.00% [2000/2000 00:01&lt;00:00 Sampling chain 1, 0 divergences]
    </div>
    
</div>
<div class="cell-output cell-output-display">

    <div>
      <progress value="2000" class="" max="2000" style="width:300px; height:20px; vertical-align: middle;"></progress>
      100.00% [2000/2000 00:01&lt;00:00 Sampling chain 2, 0 divergences]
    </div>
    
</div>
<div class="cell-output cell-output-display">

    <div>
      <progress value="2000" class="" max="2000" style="width:300px; height:20px; vertical-align: middle;"></progress>
      100.00% [2000/2000 00:01&lt;00:00 Sampling chain 3, 0 divergences]
    </div>
    
</div>
<div class="cell-output cell-output-stdout">
<pre><code>        Variable   mean  laplace_mean     sd  laplace_sd
beta[0]  beta[0]  0.249         0.234  0.235       0.229
beta[1]  beta[1] -0.964        -0.914  0.435       0.428
beta[2]  beta[2] -1.710        -1.616  0.490       0.470
beta[3]  beta[3] -0.975        -0.926  0.423       0.416
beta[4]  beta[4] -0.739        -0.716  0.470       0.457
beta[5]  beta[5]  0.637         0.609  0.481       0.475</code></pre>
</div>
</div>
<p>Well that’s just dandy! Everything is pretty<sup>10</sup> close. With 1000 observations, the MCMC and Laplace estimates agree to about one decimal place.</p>
</section>
<section id="speeding-up-the-computation" class="level3">
<h3 class="anchored" data-anchor-id="speeding-up-the-computation">Speeding up the computation</h3>
<p>So that is all well and dandy. Let’s see how long it takes. I am interested in big models, so for this demonstration, I’m going to take <img src="https://latex.codecogs.com/png.latex?p%20=%205000">. That said, I’m not enormously interested in seeing how this scales in <img src="https://latex.codecogs.com/png.latex?n"> (linearly), so I’m going to keep that at the fairly unrealistic value of <img src="https://latex.codecogs.com/png.latex?n=1000">.</p>
<div id="4fda5153" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> timeit</span>
<span id="cb7-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> hess_test(key, n, p):</span>
<span id="cb7-3">  y, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_data(key, n , p)</span>
<span id="cb7-4">  inpu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb7-5">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> hess():</span>
<span id="cb7-6">    f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y)</span>
<span id="cb7-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jacfwd(grad(f))(inpu)</span>
<span id="cb7-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> hess</span>
<span id="cb7-9"></span>
<span id="cb7-10">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb7-11">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span></span>
<span id="cb7-12">key, sub <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.split(key)</span>
<span id="cb7-13">hess <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hess_test(sub, n , p)</span>
<span id="cb7-14">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(hess, number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb7-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Autodiff: The average time with p = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(+/-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Autodiff: The average time with p = 5000 is  3.222(+/- 0.379)</code></pre>
</div>
</div>
<p>That doesn’t seem too bad, but the thing is that I know quite a lot about logistic regression. It is, after all, logistic regression. In particular, I know that the Hessian has the form <img src="https://latex.codecogs.com/png.latex?%0AH%20=%20X%5ET%20D(%5Cbeta)%20X,%0A"> where <img src="https://latex.codecogs.com/png.latex?D(%5Cbeta)"> is a <em>diagonal</em> <img src="https://latex.codecogs.com/png.latex?n%20%5Ctimes%20n"> matrix that has a known form.</p>
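<p>Since everything that follows leans on this identity, here is a quick sanity check (my own toy data, not the post’s <code>make_data</code>) that the known diagonal form of the Hessian matches the autodiff one:</p>

```python
# Sketch: for a logistic-regression negative log-likelihood (no prior term),
# the Hessian is X^T D X with D_ii = p_i (1 - p_i). Toy data, illustrative only.
import jax.numpy as jnp
import numpy as np
from jax import grad, jacfwd

X = jnp.array([[1.0, 0.5], [1.0, -1.2], [1.0, 0.3]])
y = jnp.array([1.0, 0.0, 1.0])
beta = jnp.array([0.2, -0.4])

def nll(beta):
    p = 1.0 / (1.0 + jnp.exp(-X @ beta))
    return -jnp.sum(y * jnp.log(p) + (1 - y) * jnp.log1p(-p))

p = 1.0 / (1.0 + jnp.exp(-X @ beta))
H_symbolic = X.T @ (X * (p * (1 - p))[:, None])  # known diagonal form of D
H_autodiff = jacfwd(grad(nll))(beta)             # brute-force autodiff Hessian
assert np.allclose(H_symbolic, H_autodiff, atol=1e-5)
```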
<p>This means that the appropriate comparison is between the speed of the autodiff Hessian and how long it takes to compute <img src="https://latex.codecogs.com/png.latex?X%5ETDX"> for some diagonal matrix <img src="https://latex.codecogs.com/png.latex?D">.</p>
<p>Now you might be worried here that I didn’t explicitly save <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?y">, so the comparison might not be fair. But my friends, I have good news! All of that awkward <code>key, sub = jrandom.split(key)</code> malarkey has the singular advantage that if I pass the same key into <code>make_data</code> that I used for <code>hess_test</code>, I will get <em>the exact same generated data</em>! So let’s do that. For <img src="https://latex.codecogs.com/png.latex?D"> I’m just going to pick a random matrix. This will give a <em>minimum</em> achievable time for computing the Hessian (as it doesn’t do the extra derivatives to compute <img src="https://latex.codecogs.com/png.latex?D"> properly).</p>
<p>If you look at that code and say <em>but Daniel you used the wrong multiplication operator</em>, you can convince yourself that <code>X * d[:, None]</code> gives the same result as <code>jnp.diag(d) @ X</code>. But it will be faster, because it never materializes the <img src="https://latex.codecogs.com/png.latex?n%20%5Ctimes%20n"> diagonal matrix. And it uses such beautiful<sup>11</sup> broadcasting rules.</p>
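<p>If you’d rather check that broadcasting claim directly, a two-line NumPy experiment (toy sizes of my own choosing) does it:</p>

```python
# Row-scaling by broadcasting equals multiplying by an explicit diagonal
# matrix, without ever forming the n x n diagonal. Toy sizes, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
d = rng.normal(size=4)

assert np.allclose(X * d[:, None], np.diag(d) @ X)
```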
<div id="252d7828" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb9-1">y, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_data(key, n , p)</span>
<span id="cb9-2">key, sub <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.split(key)</span>
<span id="cb9-3">d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.normal(sub, shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (n,))</span>
<span id="cb9-4">mm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: X.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> (X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> d[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>])</span>
<span id="cb9-5">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(mm, number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb9-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Symbolic (minimum possible): The average time with p = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(+/-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Symbolic (minimum possible): The average time with p = 5000 is  0.766(+/- 0.014)</code></pre>
</div>
</div>
<p>Oh dear. The symbolic derivative<sup>12</sup> is <em>a lot</em> faster.</p>
<p>Speeding this up is going to take a little work. The first thing we can try is to explicitly factor out the linear transformation. Instead of passing in the function <img src="https://latex.codecogs.com/png.latex?f">, we could pass in <img src="https://latex.codecogs.com/png.latex?g"> such that <img src="https://latex.codecogs.com/png.latex?%0Af(x)%20=%20g(Ax),%0A"> for some matrix <img src="https://latex.codecogs.com/png.latex?A">. In our case <img src="https://latex.codecogs.com/png.latex?g"> would have a diagonal Hessian. Let’s convince ourselves of that with a small example. As well as dropping the intercept, I’ve also dropped the prior term.</p>
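<p>The identity underlying this factorization is the chain rule: if <code>f(x) = g(Ax)</code>, then the Hessian satisfies <code>Hess f(x) = A.T @ Hess g(Ax) @ A</code>. A toy check of my own (an elementwise <code>g</code>, so its Hessian is diagonal):</p>

```python
# Chain-rule check: for f(x) = g(Ax), Hess f(x) = A^T Hess g(Ax) A.
# g acts elementwise before summing, so Hess g is diagonal. Toy example only.
import jax.numpy as jnp
import numpy as np
from jax import grad, jacfwd

A = jnp.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
g = lambda b: jnp.sum(jnp.sin(b))   # elementwise, then summed
f = lambda x: g(A @ x)

x = jnp.array([0.3, -0.7])
H_f = jacfwd(grad(f))(x)            # Hessian of the composed function
H_g = jacfwd(grad(g))(A @ x)        # diagonal Hessian of g
assert np.allclose(H_f, A.T @ H_g @ A, atol=1e-5)
```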
<div id="3a085e1e" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb11-1">g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> prob: jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.log(prob) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>y) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.log1p(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>prob))</span>
<span id="cb11-2">key, sub2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jrandom.split(key)</span>
<span id="cb11-3">y, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_data(sub2, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb11-4">b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb11-5">D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jacfwd(grad(g))(b)</span>
<span id="cb11-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(D, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[[0.7 0.  0.  0.  0. ]
 [0.  3.7 0.  0.  0. ]
 [0.  0.  7.8 0.  0. ]
 [0.  0.  0.  4.9 0. ]
 [0.  0.  0.  0.  0.3]]</code></pre>
</div>
</div>
<p>Wonderfully diagonal!</p>
<div id="9afd020c" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> hess2(g, A, x):</span>
<span id="cb13-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># factor f(x) = g(Ax): Hess f = A^T Hess g(Ax) A</span></span>
<span id="cb13-3">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> x</span>
<span id="cb13-4">  D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jacfwd(grad(g))(b)</span>
<span id="cb13-5">  H <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> (A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.diag(D)[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>])</span>
<span id="cb13-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> H</span>
<span id="cb13-7"></span>
<span id="cb13-8">y, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_data(sub, n, p)</span>
<span id="cb13-9">g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> prob: jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.log(prob) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>y) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.log1p(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>prob))</span>
<span id="cb13-10">x0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(p)</span>
<span id="cb13-11">h2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: hess2(g, X, x0)</span>
<span id="cb13-12">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(h2, number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb13-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Separated Hessian: The average time with p = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(+/-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Separated Hessian: The average time with p = 5000 is  0.975(+/- 0.163)</code></pre>
</div>
</div>
<p>Well that’s definitely better.</p>
<p>Now, we might be able to do even better than that if we notice that if we <em>know</em> that <img src="https://latex.codecogs.com/png.latex?D"> is diagonal, then we don’t need to compute the entire Hessian, we can simply compute the Hessian-vector product <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7Bdiag%7D(H)%20=%20H%201%20%5Cqquad%20%5Ctext%7Biff%20%7DH%5Ctext%7B%20is%20diagonal%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?1"> is the vector of ones. Just as we computed the Hessian by computing the Jacobian of the gradient, it turns out that we can compute a Hessian-vector product by computing a Jacobian-vector product <code>jvp</code> of the gradient. The syntax in JAX is, honestly, a little bit gross here<sup>13</sup>, but if you want to read up about how it works <a href="https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html#hessian-vector-products-using-both-forward-and-reverse-mode">the docs are really nice</a><sup>14</sup>.</p>
<p>This observation is useful because <code>jacfwd</code> computes the Jacobian by computing <img src="https://latex.codecogs.com/png.latex?n"> Jacobian-vector products, so computing a single one instead saves us <em>a lot</em> of work.</p>
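<p>In isolation, the jvp-of-grad trick looks like this (a toy function of my own, not the post’s model):</p>

```python
# A Hessian-vector product via one jvp of the gradient, compared against
# building the full Hessian first. Toy function, illustrative only.
import jax.numpy as jnp
import numpy as np
from jax import grad, jacfwd, jvp

f = lambda x: jnp.sum(x ** 3)       # Hessian is diag(6 x)
x = jnp.array([1.0, 2.0, 3.0])
v = jnp.ones(3)

hvp = jvp(grad(f), (x,), (v,))[1]   # one forward pass over grad(f)
full = jacfwd(grad(f))(x) @ v       # n forward passes, then a matvec
assert np.allclose(hvp, full)
```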
<div id="279128f8" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jvp</span>
<span id="cb15-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> hess3(g, A, x):</span>
<span id="cb15-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># same factorization, but get diag(Hess g) from a single jvp</span></span>
<span id="cb15-4">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> x</span>
<span id="cb15-5">  D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jvp(grad(g), (b,), (jnp.ones(n),))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb15-6">  H <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> (A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> D[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>])</span>
<span id="cb15-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> H</span>
<span id="cb15-8"></span>
<span id="cb15-9">h3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: hess3(g, X, x0)</span>
<span id="cb15-10">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(h3, number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb15-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Compressed Hessian: The average time with p = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(+/-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(times)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Compressed Hessian: The average time with p = 5000 is  0.879(+/- 0.082)</code></pre>
</div>
</div>
<p>This is very nearly as fast as the lower bound for the symbolic Hessian. There must be a way to use this.</p>
</section>
</section>
<section id="can-we-automate-this-parsing-jax-expressions" class="level2">
<h2 class="anchored" data-anchor-id="can-we-automate-this-parsing-jax-expressions">Can we automate this? Parsing JAX expressions</h2>
<p>So that was all lovely and shiny. But the problem is that it was very labor intensive. I had to recognize both that you could write <img src="https://latex.codecogs.com/png.latex?f(x)%20=%20g(Ax)"> <em>and</em> that <img src="https://latex.codecogs.com/png.latex?g"> would have a diagonal Hessian. That is, frankly, hard to do in general.</p>
<p>If I was building a system like <a href="https://bambinos.github.io/bambi/"><code>bambi</code></a> or <a href="https://paul-buerkner.github.io/brms/"><code>brms</code></a> or <a href="https://www.r-inla.org/">INLA</a><sup>15</sup>, where the model classes are relatively constrained, it’s possible to automate both of these steps by analyzing the formula. But all I get is a function. So I need to work out how I can automatically parse the code for <img src="https://latex.codecogs.com/png.latex?f"> to find <img src="https://latex.codecogs.com/png.latex?g"> and <img src="https://latex.codecogs.com/png.latex?A"> (if they exist) and to determine if <img src="https://latex.codecogs.com/png.latex?g"> would have a sparse Hessian.</p>
<p>We can’t do this easily with a standard Python program, but we can do it with JAX because it traces through the code and provides an <em>intermediate representation</em> (IR) of the code. This is, incidentally, the first step that any compiler takes. The beauty of an IR is that it abstracts away all of the specific user choices and provides a clean, logical representation of the program that can then be executed or, in our case, manipulated. These manipulations are, for example, key to how JAX computes gradients, how it JIT-compiles code, and how it does <code>vmap</code> and <code>pmap</code> operations.</p>
<p>But we can do more types of manipulations. In particular, we can take the IR and transform it into another IR that produces the same output in a more efficient way. Anyone who’s familiar with compiled programming languages should know that this happens under the hood. They also probably know that compiler writers are small gods and I’m definitely not going to approach anywhere near that level of complexity in a blog post.</p>
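<p>A quick way to see “IR in, IR out” in action: differentiation itself maps one jaxpr-representable program to another. A small sketch (the function here is a hypothetical example):</p>

```python
import jax
from jax import make_jaxpr

# A hypothetical one-liner: differentiation maps one jaxpr-representable
# program to another jaxpr-representable program.
f = lambda x: x * x * x

print(make_jaxpr(f)(2.0))            # the IR of f
print(make_jaxpr(jax.grad(f))(2.0))  # the IR of a new program computing f'
```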
<p>So what are our tasks? First of all, we need to trace our way through the JAX code. We can do this by using the intermediate representation that JAX uses when transforming functions: the <code>jaxpr</code>s.</p>
<section id="getting-to-know-jaxprs" class="level3">
<h3 class="anchored" data-anchor-id="getting-to-know-jaxprs">Getting to know jaxprs</h3>
<p>A <code>jaxpr</code> is a transformation of the Python code for evaluating a JAX function into a human-readable language that maps typed primitives through the code. We can view it using the <code>jax.make_jaxpr</code> function.</p>
<p>Let’s look at the log-posterior function after partial evaluation to make it a single-input function.</p>
<div id="bd5d6df2" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_jaxpr</span>
<span id="cb17-2"></span>
<span id="cb17-3">lp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> partial(log_posterior, X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>X, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>y)</span>
<span id="cb17-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(make_jaxpr(lp)(jnp.ones(p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>{ lambda a:f32[1000,5000] b:bool[1000]; c:f32[5001]. let
    d:f32[1] = dynamic_slice[slice_sizes=(1,)] c 0
    e:f32[] = squeeze[dimensions=(0,)] d
    f:f32[5000] = dynamic_slice[slice_sizes=(5000,)] c 1
    g:f32[1000] = dot_general[dimension_numbers=(([1], [0]), ([], []))] a f
    h:f32[1000] = add e g
    i:f32[1000] = logistic h
    j:f32[1000] = log i
    k:f32[1000] = convert_element_type[new_dtype=float32 weak_type=False] b
    l:f32[1000] = mul k j
    m:i32[1000] = convert_element_type[new_dtype=int32 weak_type=True] b
    n:i32[1000] = sub 1 m
    o:f32[1000] = neg i
    p:f32[1000] = log1p o
    q:f32[1000] = convert_element_type[new_dtype=float32 weak_type=False] n
    r:f32[1000] = mul q p
    s:f32[1000] = add l r
    t:f32[] = reduce_sum[axes=(0,)] s
    u:f32[] = dot_general[dimension_numbers=(([0], [0]), ([], []))] c c
    v:f32[] = mul 0.5 u
    w:f32[] = sub t v
  in (w,) }</code></pre>
</div>
</div>
<p>This can be a bit tricky to read the first time you see it, but it’s waaaay easier than x86 assembly or the LLVM IR. Basically it says that to compute <code>lp(jnp.ones(p+1))</code> you need to run through this program. The first line gives the inputs (with types and shapes). Then, after the <code>let</code> statement, come the commands that need to be executed in order. A single command looks like</p>
<pre><code>d:f32[1] = dynamic_slice[slice_sizes=(1,)] c 0</code></pre>
<p>This can be read as <em>take a slice of vector <code>c</code> starting at <code>0</code> of shape <code>(1,)</code> and store it in <code>d</code>, which is a one-dimensional 32-bit float array</em>. (The line after turns it into a scalar.)</p>
<p>All of the other lines can be read similarly. A good trick, if you don’t recognize the primitive<sup>16</sup>, is to <a href="https://jax.readthedocs.io/en/latest/jax.lax.html">look it up</a> in the <code>jax.lax</code> sub-module.</p>
<p>Even a cursory read of this suggests that we could probably save a couple of tedious operations by passing in an integer <code>y</code>, rather than a Boolean <code>y</code>, but hey. That really shouldn’t cost much.</p>
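<p>Incidentally, a <code>jaxpr</code> isn’t just for printing: its equations can be walked programmatically, which is what the rest of this post relies on. A small sketch, using a toy stand-in for the log-posterior (the names here are assumptions, not the post’s code):</p>

```python
import jax
import jax.numpy as jnp
from jax import make_jaxpr

# A toy stand-in with the same kinds of operations as the log-posterior:
# a slice, a matrix-vector product, and elementwise non-linearities.
A = jnp.ones((3, 2))

def toy(c):
    # jax.lax.logistic is the sigmoid primitive seen in the jaxpr above
    return jnp.sum(jnp.log(jax.lax.logistic(A @ c[1:])))

jpr = make_jaxpr(toy)(jnp.ones(3))

# Every equation carries its primitive plus its input/output variables,
# so the jaxpr can be walked programmatically as well as printed.
prims = [eqn.primitive.name for eqn in jpr.jaxpr.eqns]
print(prims)
```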
<p>While the <code>jaxpr</code> is lovely, it’s a whole lot easier to reason about if you see it graphically. We can plot the <em>expression graph</em> using<sup>17</sup> the <code>haiku</code><sup>18</sup> package from DeepMind.</p>
<div id="bf3d6463" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> haiku.experimental <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> to_dot</span>
<span id="cb20-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> graphviz</span>
<span id="cb20-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb20-4">f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y)</span>
<span id="cb20-5">dot <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> to_dot(f)(jnp.ones(p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb20-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Strip out an obnoxious autogen title</span></span>
<span id="cb20-7">dot <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.sub(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;&lt;.*&gt;&gt;;"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>, dot, count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, flags<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>re.DOTALL)</span>
<span id="cb20-8">graphviz.Source(dot)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace_files/figure-html/cell-12-output-1.svg" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>To understand this graph, the orange-y boxes represent the input for <code>lp</code>. In this case it’s an array of floating point numbers of length <img src="https://latex.codecogs.com/png.latex?p+1%20=%205001">. The purple boxes are constants that are used in the function. Some of these are signed integers (s32), there’s a matrix (f32[1000, 5000]), and there is even a literal (0.5). The blue box is the output. That leaves the yellow boxes, which have all of the operations, with inward arrows indicating the inputs and outward arrows indicating the outputs.</p>
</section>
<section id="splitting-the-expression-graph-into-linear-and-non-linear-subgraphs" class="level3">
<h3 class="anchored" data-anchor-id="splitting-the-expression-graph-into-linear-and-non-linear-subgraphs">Splitting the expression graph into linear and non-linear subgraphs</h3>
<p>Looking at the graph, we can split it into three sub-graphs. The first sub-graph can be found by tracing an input value through the graph until it hits either a non-linear operation or the end of the graph. The sub-graph is created by making the penultimate node in that sequence an output node. This sub-graph represents a linear transformation.</p>
<div id="a5a99eed" class="cell" data-execution_count="12">
<div class="cell-output cell-output-display" data-execution_count="12">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace_files/figure-html/cell-13-output-1.svg" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Once we have reached the end of the linear portion, we can link the output from this operation to the input of the non-linear sub-graph.</p>
<div id="56993076" class="cell" data-execution_count="13">
<div class="cell-output cell-output-display" data-execution_count="13">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace_files/figure-html/cell-14-output-1.svg" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Finally, we have one more trace of <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> through the graph that is non-linear. We could couple this into the non-linear graph at the cost of having to reason about a bivariate Hessian (which will become complex).</p>
<div id="96b1b043" class="cell" data-execution_count="14">
<div class="cell-output cell-output-display" data-execution_count="14">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace_files/figure-html/cell-15-output-1.svg" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>The two non-linear portions of the graph are merged through a trivial linear combination.</p>
<div id="b473bcd8" class="cell" data-execution_count="15">
<div class="cell-output cell-output-display" data-execution_count="15">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace_files/figure-html/cell-16-output-1.svg" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="step-right-up-to-play-the-game-of-the-year-is-it-linear" class="level3">
<h3 class="anchored" data-anchor-id="step-right-up-to-play-the-game-of-the-year-is-it-linear">Step right up to play the game of the year: Is it linear?</h3>
<p>So we need to trace through these jaxprs and keep a record of which of the sub-graphs they are in (and we do not know how many sub-graphs there will be!). We also need to note if an operation is linear or not. This is not something that is automatically provided. We need to store this information ourselves.</p>
<p>The only way I can think to do this is to make a set of all of the JAX operations that I know to be linear. Many of them are just index or type stuff. Unfortunately, there is a more complex class of operations that are only <em>sometimes</em> linear.</p>
<p>The first example we see of this is</p>
<pre><code>g:f32[1000] = dot_general[
      dimension_numbers=(((1,), (0,)), ((), ()))
      precision=None
      preferred_element_type=None
    ] a f</code></pre>
<p>This line represents the general tensor dot product between <code>a</code> and <code>f</code>. In this case, <code>a</code> is a constant input (the matrix <img src="https://latex.codecogs.com/png.latex?X">) while <code>f</code> is a linear transformation of the input (<code>beta[1:]</code>), so the resulting step is linear. However, there is a second <code>dot_general</code> in the code, which occurs at</p>
<pre><code>u:f32[] = dot_general[
      dimension_numbers=(((0,), (0,)), ((), ()))
      precision=None
      preferred_element_type=None
    ] c c</code></pre>
<p>Here, <code>c</code> is a linear transformation of the input (it’s just <code>beta</code>), but <code>dot(c,c)</code> is a quadratic function. Hence in this case, <code>dot_general</code> is not linear.</p>
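<p>Both situations can be seen side by side by asking, for each argument of the <code>dot_general</code> equation, whether it is one of the constants bound to the <code>jaxpr</code>. A sketch with hypothetical functions:</p>

```python
import jax.numpy as jnp
from jax import make_jaxpr

A = jnp.ones((4, 3))

# Hypothetical stand-ins: the same primitive, dot_general, appears once
# linearly (one operand is a constant) and once quadratically.
linear = make_jaxpr(lambda x: A @ x)(jnp.ones(3))
quadratic = make_jaxpr(lambda x: jnp.dot(x, x))(jnp.ones(3))

def input_dependent_args(jpr):
    # Count how many arguments of the dot_general equation are *not*
    # constants bound to the jaxpr, i.e. depend on the function input.
    eqn = next(e for e in jpr.jaxpr.eqns if e.primitive.name == "dot_general")
    return sum(v not in jpr.jaxpr.constvars for v in eqn.invars)

print(input_dependent_args(linear))     # one operand traces back to x
print(input_dependent_args(quadratic))  # both operands trace back to x
```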
<p>We are going to need to work out how to handle this case. In the folded code is a partial<sup>19</sup> list of the <code>jax.lax</code> primitives that are linear or occasionally linear. All in all there are 69 linear or no-op primitives and 7 sometimes linear primitives.</p>
<div id="de3a709f" class="cell" data-execution_count="16">
<details class="code-fold">
<summary>jax.lax linear and sometimes linear primitives</summary>
<div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb23-1">jax_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb23-2">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'add'</span>,</span>
<span id="cb23-3">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bitcast_convert_type'</span>,</span>
<span id="cb23-4">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'broadcast'</span>,</span>
<span id="cb23-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'broadcast_in_dim'</span>,</span>
<span id="cb23-6">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'broadcast_shapes'</span>,</span>
<span id="cb23-7">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'broadcast_to_rank'</span>,</span>
<span id="cb23-8">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'clz'</span>,</span>
<span id="cb23-9">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'collapse'</span>,</span>
<span id="cb23-10">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'complex'</span>,</span>
<span id="cb23-11">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'concatenate'</span>,</span>
<span id="cb23-12">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conj'</span>,</span>
<span id="cb23-13">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'convert_element_type'</span>,</span>
<span id="cb23-14">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dtype'</span>,</span>
<span id="cb23-15">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dtypes'</span>,</span>
<span id="cb23-16">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dynamic_slice'</span>,</span>
<span id="cb23-17">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'expand_dims'</span>,</span>
<span id="cb23-18">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'full'</span>,</span>
<span id="cb23-19">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'full_like'</span>,</span>
<span id="cb23-20">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'imag'</span>,</span>
<span id="cb23-21">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'neg'</span>,</span>
<span id="cb23-22">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pad'</span>,</span>
<span id="cb23-23">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'padtype_to_pads'</span>,</span>
<span id="cb23-24">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'real'</span>,</span>
<span id="cb23-25">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'reduce'</span>,</span>
<span id="cb23-26">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'reshape'</span>,</span>
<span id="cb23-27">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rev'</span>,</span>
<span id="cb23-28">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rng_bit_generator'</span>,</span>
<span id="cb23-29">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rng_uniform'</span>,</span>
<span id="cb23-30">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'select'</span>,</span>
<span id="cb23-31">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'select_n'</span>,</span>
<span id="cb23-32">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'squeeze'</span>,</span>
<span id="cb23-33">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sub'</span>,</span>
<span id="cb23-34">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'transpose'</span>,</span>
<span id="cb23-35">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'zeros_like_array'</span>,</span>
<span id="cb23-36">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'GatherDimensionNumbers'</span>,</span>
<span id="cb23-37">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'GatherScatterMode'</span>,</span>
<span id="cb23-38">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ScatterDimensionNumbers'</span>,</span>
<span id="cb23-39">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dynamic_index_in_dim'</span>,</span>
<span id="cb23-40">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dynamic_slice'</span>,</span>
<span id="cb23-41">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dynamic_slice_in_dim'</span>,</span>
<span id="cb23-42">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dynamic_update_index_in_dim'</span>,</span>
<span id="cb23-43">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dynamic_update_slice'</span>,</span>
<span id="cb23-44">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dynamic_update_slice_in_dim'</span>,</span>
<span id="cb23-45">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gather'</span>,</span>
<span id="cb23-46">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'index_in_dim'</span>,</span>
<span id="cb23-47">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'index_take'</span>,</span>
<span id="cb23-48">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'reduce_sum'</span>,</span>
<span id="cb23-49">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'scatter'</span>,</span>
<span id="cb23-50">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'scatter_add'</span>,</span>
<span id="cb23-51">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'slice'</span>,</span>
<span id="cb23-52">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'slice_in_dim'</span>,</span>
<span id="cb23-53">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv'</span>,</span>
<span id="cb23-54">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_dimension_numbers'</span>,</span>
<span id="cb23-55">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_general_dilated'</span>,</span>
<span id="cb23-56">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_general_permutations'</span>,</span>
<span id="cb23-57">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_general_shape_tuple'</span>,</span>
<span id="cb23-58">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_shape_tuple'</span>,</span>
<span id="cb23-59">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_transpose'</span>,</span>
<span id="cb23-60">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_transpose_shape_tuple'</span>,</span>
<span id="cb23-61">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv_with_general_padding'</span>,</span>
<span id="cb23-62">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cumsum'</span>,</span>
<span id="cb23-63">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'fft'</span>,</span>
<span id="cb23-64">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all_gather'</span>,</span>
<span id="cb23-65">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all_to_all'</span>,</span>
<span id="cb23-66">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'axis_index'</span>,</span>
<span id="cb23-67">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ppermute'</span>,</span>
<span id="cb23-68">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pshuffle'</span>,</span>
<span id="cb23-69">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'psum'</span>,</span>
<span id="cb23-70">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'psum_scatter'</span>,</span>
<span id="cb23-71">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pswapaxes'</span>,</span>
<span id="cb23-72">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'xeinsum'</span></span>
<span id="cb23-73">}</span>
<span id="cb23-74"></span>
<span id="cb23-75">jax_sometimes_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> { </span>
<span id="cb23-76">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'batch_matmul'</span>,</span>
<span id="cb23-77">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dot'</span>,</span>
<span id="cb23-78">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dot_general'</span>,</span>
<span id="cb23-79">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mul'</span></span>
<span id="cb23-80"> }</span>
<span id="cb23-81">jax_first_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb23-82">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'div'</span></span>
<span id="cb23-83"> }</span>
<span id="cb23-84">jax_last_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb23-85">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'custom_linear_solve'</span>,</span>
<span id="cb23-86">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'triangular_solve'</span>,</span>
<span id="cb23-87">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tridiagonal_solve'</span></span>
<span id="cb23-88"> }</span></code></pre></div>
</details>
</div>
<p>All of the <em>sometimes linear</em> operations are linear as long as only one of their arguments depends on the function inputs. For <code>div</code>, the input-dependent argument must additionally sit in the first position (the numerator), while for the various linear solves it must sit in the last position (the right-hand side).</p>
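<p>Given those sets, classifying a single equation is a small function. This is only a sketch: the set contents below repeat just a few representatives of the folded listing above, and the dependence information (<code>arg_depends</code>) is assumed to come from tracing the graph.</p>

```python
# A sketch of classifying one jaxpr equation. The sets repeat only a few
# representatives of the folded listing; arg_depends[i] records whether
# the i-th argument of the equation depends on the function input.
jax_linear = {"add", "sub", "reduce_sum", "slice", "squeeze",
              "convert_element_type", "broadcast_in_dim"}
jax_sometimes_linear = {"batch_matmul", "dot", "dot_general", "mul"}
jax_first_linear = {"div"}
jax_last_linear = {"custom_linear_solve", "triangular_solve",
                   "tridiagonal_solve"}

def eqn_is_linear(primitive_name, arg_depends):
    if not any(arg_depends):
        # The equation only touches constants, so it contributes a
        # constant: fine to treat as (trivially) linear.
        return True
    if primitive_name in jax_linear:
        return True
    if primitive_name in jax_sometimes_linear:
        # Linear only when exactly one argument depends on the input.
        return sum(arg_depends) == 1
    if primitive_name in jax_first_linear:
        # div is linear only in its first argument (the numerator).
        return arg_depends[0] and not any(arg_depends[1:])
    if primitive_name in jax_last_linear:
        # The solves are linear only in their last argument (the RHS).
        return arg_depends[-1] and not any(arg_depends[:-1])
    return False  # unknown or non-linear primitive fed by the input

print(eqn_is_linear("dot_general", [True, False]))  # like X @ beta
print(eqn_is_linear("dot_general", [True, True]))   # like dot(beta, beta)
```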
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>A more JAX-native way to deal with this is to think of how the <code>transpose</code> operation works. Essentially, it has the same dimension as the function argument, but evaluates to <code>None</code> when the operation isn’t linear in that variable. But I had already done all of this before I got there and at some point truly you’ve gotta stop making your blog post more complicated.</p>
</div>
</div>
</div>
</section>
<section id="tracing-through-the-jaxprs" class="level3">
<h3 class="anchored" data-anchor-id="tracing-through-the-jaxprs">Tracing through the jaxprs</h3>
<p>In order to split our graph into appropriate sub-graphs we need to trace through the <code>jaxpr</code> and keep track of every variable and if it depends on linear or non-linear parts.</p>
<p>For simplicity, consider the following expression graph for computing <code>lambda x, y: 0.5*(x+y)</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2024-05-08-laplace/execution_graph.png" class="img-fluid figure-img"></p>
<figcaption>An expression graph for computing <code>lambda x, y: 0.5*(x+y)</code>. The blue rectangles are input variables, the square is a literal constant, and the green oval is the output node. (Yes I know the haiku colours are different. Sue me.)</figcaption>
</figure>
</div>
<p>This figure corresponds roughly to the jaxpr</p>
<div id="2d3dd5e6" class="cell" data-execution_count="17">
<div class="cell-output cell-output-stdout">
<pre><code>{ lambda ; a:f32[] b:f32[]. let c:f32[] = add a b; d:f32[] = mul 0.5 c in (d,) }</code></pre>
</div>
</div>
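<p>For reference, this jaxpr can be reproduced with a couple of lines (a sketch using the standard <code>make_jaxpr</code> API):</p>

```python
from jax import make_jaxpr

# Reproduce the toy jaxpr for lambda x, y: 0.5 * (x + y).
jpr = make_jaxpr(lambda x, y: 0.5 * (x + y))(1.0, 1.0)
print(jpr)

names = [eqn.primitive.name for eqn in jpr.jaxpr.eqns]
```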
<p>For each node, the graph tells us</p>
<ul>
<li>its unique identifier (internally<sup>20</sup> JAX uses integers)</li>
<li>which equation generated the value</li>
<li>which nodes are its parents in the graph (the input(s) to the equation)</li>
<li>whether or not this node depends on the inputs. This is useful for ignoring non-linearities that just apply to the constants bound to the jaxpr.</li>
</ul>
<p>We can record this information in a dataclass.</p>
<div id="ebcee419" class="cell" data-execution_count="18">
<div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb25-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> dataclasses <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> dc</span>
<span id="cb25-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@dc.dataclass</span></span>
<span id="cb25-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> Node:</span>
<span id="cb25-4">  number: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb25-5">  eqn: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb25-6">  parents: List[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dc.field(default_factory<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>)</span>
<span id="cb25-7">  depends_on_input: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">bool</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span></code></pre></div>
</div>
<p>Now we can build up our graph with all of the side information we need. The format of a <code>jaxpr</code> lists the constant inputs first, followed by the non-constant inputs (which I’m calling the input variables). For simplicity, I am assuming that there is only one input variable.</p>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>You’re going to look at this code and say <em>girl why are you using a dictionary, this is clearly a list</em>. And you would be correct except for one little thing: I can’t guarantee that the <code>count</code> variables begin at <code>0</code>. They usually do. But one time they didn’t. What is <em>probably</em> true is that we could subtract off the first count from <code>constvars</code> or <code>invars</code> and we would have an ordinary list with the <code>count</code> variable corresponding to the input. But I’m not spelunking in the source code to ensure that <code>Literal</code> <code>Var</code>s can’t be reused etc. And anyway, this is not a performance-critical data structure.</p>
<p>I’m also relying heavily on dictionaries remembering key entry order, as the nodes are topologically sorted.</p>
</div>
</div>
</div>
<div id="39e76b59" class="cell" data-execution_count="19">
<div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jax.core <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jcore</span>
<span id="cb26-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_jaxpr</span>
<span id="cb26-3"></span>
<span id="cb26-4">jpr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_jaxpr(lp)(jnp.ones(p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb26-5"></span>
<span id="cb26-6">node_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb26-7">  const.count: Node(</span>
<span id="cb26-8">    number<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>const.count, </span>
<span id="cb26-9">    depends_on_input<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb26-10">  ) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> const <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jpr.jaxpr.constvars</span>
<span id="cb26-11">}</span>
<span id="cb26-12"></span>
<span id="cb26-13">node_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|=</span> {</span>
<span id="cb26-14">  inval.count: Node(number<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>inval.count) </span>
<span id="cb26-15">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> inval <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jpr.jaxpr.invars</span>
<span id="cb26-16">}</span>
<span id="cb26-17"></span>
<span id="cb26-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## For later, we need to know the node numbers that correspond</span></span>
<span id="cb26-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## to the constants and inputs</span></span>
<span id="cb26-20"></span>
<span id="cb26-21">consts_and_inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {node.number <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node_list.values()}</span>
<span id="cb26-22"></span>
<span id="cb26-23">node_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|=</span> {</span>
<span id="cb26-24">  node.count: Node(</span>
<span id="cb26-25">    number<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>node.count,</span>
<span id="cb26-26">    eqn<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>j,</span>
<span id="cb26-27">    parents<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb26-28">      invar.count <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> invar <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> eqn.invars <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(invar, jcore.Literal)</span>
<span id="cb26-29">    ],</span>
<span id="cb26-30">  )</span>
<span id="cb26-31">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j, eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(jpr.jaxpr.eqns)</span>
<span id="cb26-32">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> eqn.outvars</span>
<span id="cb26-33">}</span>
<span id="cb26-34"></span>
<span id="cb26-35"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node_list.values():</span>
<span id="cb26-36">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(node.parents) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb26-37">    node.depends_on_input <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(</span>
<span id="cb26-38">      node_list[i].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents</span>
<span id="cb26-39">    )</span>
<span id="cb26-40"></span>
<span id="cb26-41">node_list</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="19">
<pre><code>{0: Node(number=0, eqn=None, parents=[], depends_on_input=False),
 1: Node(number=1, eqn=None, parents=[], depends_on_input=False),
 2: Node(number=2, eqn=None, parents=[], depends_on_input=True),
 3: Node(number=3, eqn=0, parents=[2], depends_on_input=True),
 4: Node(number=4, eqn=1, parents=[3], depends_on_input=True),
 5: Node(number=5, eqn=2, parents=[2], depends_on_input=True),
 6: Node(number=6, eqn=3, parents=[0, 5], depends_on_input=True),
 7: Node(number=7, eqn=4, parents=[4, 6], depends_on_input=True),
 8: Node(number=8, eqn=5, parents=[7], depends_on_input=True),
 9: Node(number=9, eqn=6, parents=[8], depends_on_input=True),
 10: Node(number=10, eqn=7, parents=[1], depends_on_input=False),
 11: Node(number=11, eqn=8, parents=[10, 9], depends_on_input=True),
 12: Node(number=12, eqn=9, parents=[1], depends_on_input=False),
 13: Node(number=13, eqn=10, parents=[12], depends_on_input=False),
 14: Node(number=14, eqn=11, parents=[8], depends_on_input=True),
 15: Node(number=15, eqn=12, parents=[14], depends_on_input=True),
 16: Node(number=16, eqn=13, parents=[13], depends_on_input=False),
 17: Node(number=17, eqn=14, parents=[16, 15], depends_on_input=True),
 18: Node(number=18, eqn=15, parents=[11, 17], depends_on_input=True),
 19: Node(number=19, eqn=16, parents=[18], depends_on_input=True),
 20: Node(number=20, eqn=17, parents=[2, 2], depends_on_input=True),
 21: Node(number=21, eqn=18, parents=[20], depends_on_input=True),
 22: Node(number=22, eqn=19, parents=[19, 21], depends_on_input=True)}</code></pre>
</div>
</div>
<p>Now let’s identify which equations are linear and which aren’t.</p>
<div id="9cfeb635" class="cell" data-execution_count="20">
<div class="sourceCode cell-code" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb28-1">linear_eqn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(jpr.jaxpr.eqns)</span>
<span id="cb28-2"></span>
<span id="cb28-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node_list.values():</span>
<span id="cb28-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> node.eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb28-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">continue</span></span>
<span id="cb28-6"></span>
<span id="cb28-7">  prim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.jaxpr.eqns[node.eqn].primitive.name</span>
<span id="cb28-8">  </span>
<span id="cb28-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_linear:</span>
<span id="cb28-10">    linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb28-11">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_sometimes_linear:</span>
<span id="cb28-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># this is a check for being called once</span></span>
<span id="cb28-13">    linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb28-14">      <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(</span>
<span id="cb28-15">        node_list[i].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents</span>
<span id="cb28-16">      ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb28-17">    )</span>
<span id="cb28-18">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_first_linear:</span>
<span id="cb28-19">    linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb28-20">      node_list[node.parents[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]].depends_on_input </span>
<span id="cb28-21">      <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(node_list[pa].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> pa <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:])</span>
<span id="cb28-22">    )</span>
<span id="cb28-23">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_last_linear:</span>
<span id="cb28-24">    linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb28-25">      node_list[node.parents[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].depends_on_input </span>
<span id="cb28-26">      <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(node_list[pa].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> pa <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb28-27">    )</span>
<span id="cb28-28">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> node_list[i].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents):</span>
<span id="cb28-29">    linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Constants are linear</span></span></code></pre></div>
</div>
<p>The only messy thing<sup>21</sup> in here is dealing with the sometimes-linear primitives. If I were sure that every JAX primitive was guaranteed to have only two inputs, this could be simplified, but sadly I don’t know that.</p>
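<p>For concreteness, here is the sometimes-linear rule in isolation (a hedged sketch, not the real classifier): a primitive like <code>mul</code> is linear precisely when exactly one of its operands depends on the function input.</p>

```python
# x * const is linear in x; x * x is quadratic; const * const is constant.
def sometimes_linear(parents_depend_on_input):
    return sum(parents_depend_on_input) == 1

assert sometimes_linear([True, False])        # x * c: linear
assert not sometimes_linear([True, True])     # x * x: nonlinear
assert not sometimes_linear([False, False])   # c * c: constant
```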
</section>
<section id="partitioning-the-graph" class="level3">
<h3 class="anchored" data-anchor-id="partitioning-the-graph">Partitioning the graph</h3>
<p>Now it’s time for the fun: partitioning the problem into sub-graphs. To do this, we need to think about what rules we want to encode.</p>
<p>The <em>first rule</em> is that every input for an equation or sub-graph needs to be either a constant, the function input, or the output of some other sub-graph that has already been computed. This means that if we find an equation with an input that doesn’t satisfy these conditions, we need to split the sub-graph that it’s in into two sub-graphs.</p>
<p>The <em>second rule</em> is the only exception to the first rule. A sub-graph can have inputs from non-linear sub-graphs if and only if it contains a sequence of <code>sum</code> or <code>sub</code> terms and finishes with the terminal node. This covers the common case where the function we are taking the Hessian of is a linear combination of independent functions. For instance, <code>log_posterior(beta) = log_likelihood(beta) + log_prior(beta)</code>. In this case we can compute the Hessians for the non-linear sub-expressions separately and then combine them.</p>
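<p>That additivity is easy to verify numerically (a sketch with made-up densities, independent of the machinery in this post): the Hessian of a sum is the sum of the Hessians.</p>

```python
import jax
import jax.numpy as jnp

# Toy stand-ins for a log-likelihood and a log-prior.
def log_likelihood(beta):
    return -jnp.sum(beta**4)

def log_prior(beta):
    return -0.5 * jnp.sum(beta**2)

def log_posterior(beta):
    return log_likelihood(beta) + log_prior(beta)

beta = jnp.array([1.0, 2.0])
H_parts = jax.hessian(log_likelihood)(beta) + jax.hessian(log_prior)(beta)
H_whole = jax.hessian(log_posterior)(beta)
print(jnp.allclose(H_parts, H_whole))  # True
```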
<p>The <em>third rule</em> is that every independent use of the function input is an opportunity to start a new tree. (It may merge with a known tree.)</p>
<p>And that’s it. Should be simple enough to implement.</p>
<p>I’m feeling like running this bad boy backwards, so let’s do that. One of the assumptions we have made is that the function we are tracing has a single output, which is always in the last node and defined in the last equation. So first off, let’s get our terminal combination expressions.</p>
<div id="bf20c90a" class="cell" data-execution_count="21">
<div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb29-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Find the terminal combination expressions</span></span>
<span id="cb29-2">terminal_expressions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sum"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sub"</span>}</span>
<span id="cb29-3">comb_eqns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb29-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jpr.jaxpr.eqns[::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]:</span>
<span id="cb29-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(</span>
<span id="cb29-6">    node_list[a.count].depends_on_input </span>
<span id="cb29-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> eqn.invars </span>
<span id="cb29-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(a, jcore.Literal)</span>
<span id="cb29-9">  )  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> (</span>
<span id="cb29-10">    eqn.primitive.name <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> terminal_expressions</span>
<span id="cb29-11">  ):</span>
<span id="cb29-12">    comb_eqns.append(eqn)</span>
<span id="cb29-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb29-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">break</span></span>
<span id="cb29-15"></span>
<span id="cb29-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(comb_eqns)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[a:f32[] = sub b c]</code></pre>
</div>
</div>
<p>Now for each of the terminal combination expressions, we will trace their parents back until we run out of tree. While we are doing this, we can also keep track of runs of linear operations. We also only want to visit each equation once, so we need to keep track of our visited equations. This is, whether we like it or not, a depth-first search. It’s always a bloody depth-first search, isn’t it.</p>
<p>So what we are going to do is go through each of the combiner nodes, trace the graph down from it, and note the path and its parent. If we run into a portion of the graph we have already traced, we will note that for later. These paths will either be merged or, if the ancestral path from that point is all linear, will be used as a linear sub-graph.</p>
<div id="d1c69405" class="cell" data-execution_count="22">
<div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb31-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> dfs(visited, graph, subgraph, to_check, node):</span>
<span id="cb31-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> visited:</span>
<span id="cb31-3">    to_check.add(node)</span>
<span id="cb31-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb31-5">    visited.add(node)</span>
<span id="cb31-6">    subgraph.add(graph[node].eqn)</span>
<span id="cb31-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> neighbour <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> graph[node].parents:</span>
<span id="cb31-8">      dfs(visited, graph, subgraph, to_check, neighbour)</span>
<span id="cb31-9">  </span>
<span id="cb31-10"></span>
<span id="cb31-11">visited <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> consts_and_inputs</span>
<span id="cb31-12">to_check <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>()</span>
<span id="cb31-13">subgraphs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb31-14"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> ce <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> comb_eqns:</span>
<span id="cb31-15">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> v <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> (a <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> ce.invars <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(a, jcore.Literal)):</span>
<span id="cb31-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> v.count <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> visited:</span>
<span id="cb31-17">      subgraphs.append(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>())</span>
<span id="cb31-18">      dfs(visited, node_list, subgraphs[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], to_check, v.count)</span>
<span id="cb31-19"></span>
<span id="cb31-20">to_check <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> to_check.difference(consts_and_inputs)</span>
<span id="cb31-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Subgraphs: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>subgraphs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb31-22"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Danger nodes: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>to_check<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Subgraphs: [{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}, {17, 18}]
Danger nodes: set()</code></pre>
</div>
</div>
<p>The <code>to_check</code> nodes are only dangerous insofar as we need to make sure that if they are in one of the linear sub-graphs they are terminal nodes of a sub-graph. To that end, let’s make the linear sub-graphs.</p>
<div id="594c17e8" class="cell" data-execution_count="23">
<div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb33-1">linear_subgraph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb33-2">nonlin_subgraph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb33-3">n_eqns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(jpr.jaxpr.eqns)</span>
<span id="cb33-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> subgraph <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraphs:</span>
<span id="cb33-5">  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(subgraph)</span>
<span id="cb33-6">  split <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(</span>
<span id="cb33-7">    (</span>
<span id="cb33-8">      i <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(linear_eqn) </span>
<span id="cb33-9">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraph</span>
<span id="cb33-10">    )</span>
<span id="cb33-11">  )</span>
<span id="cb33-12">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraph <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> to_check):</span>
<span id="cb33-13">    split <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(</span>
<span id="cb33-14">      split, </span>
<span id="cb33-15">      <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(chk <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> to_check <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraph)</span>
<span id="cb33-16">    )</span>
<span id="cb33-17"></span>
<span id="cb33-18">  linear_subgraph.append(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(subgraph.intersection(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(split)))))</span>
<span id="cb33-19">  nonlin_subgraph.append(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(subgraph.intersection(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(split, n_eqns)))))</span>
<span id="cb33-20"></span>
<span id="cb33-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Linear subgraphs: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>linear_subgraph<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb33-22"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Nonlinear subgraphs: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>nonlin_subgraph<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}
{17, 18}
Linear subgraphs: [[0, 1, 2, 3, 4], []]
Nonlinear subgraphs: [[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], [17, 18]]</code></pre>
</div>
</div>
<p>The only interesting thing here is making sure that if there is a linear node in the graph that was visited twice, it is the terminal node of the linear graph. The better thing would be to actually split the linear graph, but I’m getting a little bit sick of this post and I don’t really want to deal with multiple linear sub-graphs. So I shan’t. But hopefully it’s relatively clear how you would do that.</p>
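<p>If you did want multiple linear sub-graphs, one way to do it (a hedged sketch, not wired into the rest of this code) is to split the sorted linear run at every revisited node, so that each node in <code>to_check</code> terminates its own piece.</p>

```python
def split_linear_run(linear_run, revisited):
    """Split a topologically sorted run of linear equations so that
    every revisited node is the terminal node of its own piece."""
    pieces, current = [], []
    for eqn in linear_run:
        current.append(eqn)
        if eqn in revisited:
            # A revisited node must end a sub-graph so its output
            # is available to the other sub-graphs that consume it.
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces

print(split_linear_run([0, 1, 2, 3, 4], {2}))  # [[0, 1, 2], [3, 4]]
```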
<p>In this case it’s pretty clear that we are ok.</p>
<div id="c986a424" class="cell" data-execution_count="24">
<div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb35-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(linear_eqn[node_list[j].eqn] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> to_check)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="24">
<pre><code>False</code></pre>
</div>
</div>
</section>
<section id="putting-it-together" class="level3">
<h3 class="anchored" data-anchor-id="putting-it-together">Putting it together</h3>
<p>Well that’s a nice script that does what I want. Now let’s put it together in a function. I’m going to give it the <em>very</em> unspecific name <code>transform_jaxpr</code> because sometimes you’ve gotta annoy your future self.</p>
<div id="c0f4a607" class="cell" data-execution_count="25">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb37-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> transform_jaxpr(</span>
<span id="cb37-2">  jaxpr: jcore.ClosedJaxpr</span>
<span id="cb37-3">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Tuple[List[Set[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>]], List[Set[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>]], List[jcore.JaxprEqn]]:</span>
<span id="cb37-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(jpr.in_avals) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb37-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(jpr.out_avals) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb37-6"></span>
<span id="cb37-7">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> core <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jcore</span>
<span id="cb37-8"></span>
<span id="cb37-9">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 1. Extract the tree and its relevant behavior</span></span>
<span id="cb37-10">  node_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb37-11">    const.count: Node(</span>
<span id="cb37-12">      number<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>const.count, </span>
<span id="cb37-13">      depends_on_input<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb37-14">    ) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> const <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jpr.jaxpr.constvars</span>
<span id="cb37-15">  }</span>
<span id="cb37-16"></span>
<span id="cb37-17">  node_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|=</span> {</span>
<span id="cb37-18">    inval.count: Node(number<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>inval.count) </span>
<span id="cb37-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> inval <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jpr.jaxpr.invars</span>
<span id="cb37-20">  }</span>
<span id="cb37-21"></span>
<span id="cb37-22">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## For later, we need to know the node numbers that correspond</span></span>
<span id="cb37-23">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## to the constants and inputs</span></span>
<span id="cb37-24"></span>
<span id="cb37-25">  consts_and_inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {node.number <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node_list.values()}</span>
<span id="cb37-26"></span>
<span id="cb37-27">  node_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|=</span> {</span>
<span id="cb37-28">    node.count: Node(</span>
<span id="cb37-29">      number<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>node.count,</span>
<span id="cb37-30">      eqn<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>j,</span>
<span id="cb37-31">      parents<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb37-32">        invar.count <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> invar <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> eqn.invars <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(invar, jcore.Literal)</span>
<span id="cb37-33">      ],</span>
<span id="cb37-34">    )</span>
<span id="cb37-35">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j, eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(jpr.jaxpr.eqns)</span>
<span id="cb37-36">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> eqn.outvars</span>
<span id="cb37-37">  }</span>
<span id="cb37-38"></span>
<span id="cb37-39">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node_list.values():</span>
<span id="cb37-40">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(node.parents) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb37-41">      node.depends_on_input <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(</span>
<span id="cb37-42">        node_list[i].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents</span>
<span id="cb37-43">      )</span>
<span id="cb37-44"></span>
<span id="cb37-45">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 2. Identify which equations are linear_eqn</span></span>
<span id="cb37-46"></span>
<span id="cb37-47">  linear_eqn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(jpr.jaxpr.eqns)</span>
<span id="cb37-48"></span>
<span id="cb37-49">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node_list.values():</span>
<span id="cb37-50">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> node.eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb37-51">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">continue</span></span>
<span id="cb37-52"></span>
<span id="cb37-53">    prim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.jaxpr.eqns[node.eqn].primitive.name</span>
<span id="cb37-54">    </span>
<span id="cb37-55">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_linear:</span>
<span id="cb37-56">      linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb37-57">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_sometimes_linear:</span>
<span id="cb37-58">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># this is a check for being called once</span></span>
<span id="cb37-59">      linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb37-60">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(</span>
<span id="cb37-61">          node_list[i].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents</span>
<span id="cb37-62">        ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb37-63">      )</span>
<span id="cb37-64">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_first_linear:</span>
<span id="cb37-65">      linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb37-66">        node_list[node.parents[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]].depends_on_input </span>
<span id="cb37-67">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(node_list[pa].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> pa <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:])</span>
<span id="cb37-68">      )</span>
<span id="cb37-69">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> prim <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jax_last_linear:</span>
<span id="cb37-70">      linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb37-71">        node_list[node.parents[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].depends_on_input </span>
<span id="cb37-72">        <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(node_list[pa].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> pa <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb37-73">      )</span>
<span id="cb37-74">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> node_list[i].depends_on_input <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> node.parents):</span>
<span id="cb37-75">      linear_eqn[node.eqn] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Constants are linear</span></span>
<span id="cb37-76"></span>
<span id="cb37-77">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##3. Find all the terminal expressions</span></span>
<span id="cb37-78">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Find the terminal combination expressions</span></span>
<span id="cb37-79">  terminal_expressions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sum"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sub"</span>}</span>
<span id="cb37-80">  comb_eqns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb37-81">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jpr.jaxpr.eqns[::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]:</span>
<span id="cb37-82">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(</span>
<span id="cb37-83">      node_list[a.count].depends_on_input </span>
<span id="cb37-84">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> eqn.invars </span>
<span id="cb37-85">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(a, jcore.Literal)</span>
<span id="cb37-86">    )  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> (</span>
<span id="cb37-87">      eqn.primitive.name <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> terminal_expressions</span>
<span id="cb37-88">    ):</span>
<span id="cb37-89">      comb_eqns.append(eqn)</span>
<span id="cb37-90">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb37-91">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">break</span></span>
<span id="cb37-92">  </span>
<span id="cb37-93">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 4. Identify the sub-graphs </span></span>
<span id="cb37-94">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> dfs(visited, graph, subgraph, to_check, node):</span>
<span id="cb37-95">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> visited:</span>
<span id="cb37-96">      to_check.add(node)</span>
<span id="cb37-97">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb37-98">      visited.add(node)</span>
<span id="cb37-99">      subgraph.add(graph[node].eqn)</span>
<span id="cb37-100">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> neighbour <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> graph[node].parents:</span>
<span id="cb37-101">        dfs(visited, graph, subgraph, to_check, neighbour)</span>
<span id="cb37-102">    </span>
<span id="cb37-103"></span>
<span id="cb37-104">  visited <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> consts_and_inputs</span>
<span id="cb37-105">  to_check <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>()</span>
<span id="cb37-106">  subgraphs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb37-107">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> ce <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> comb_eqns:</span>
<span id="cb37-108">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> v <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> (a <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> ce.invars <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(a, jcore.Literal)):</span>
<span id="cb37-109">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> v.count <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> visited:</span>
<span id="cb37-110">        subgraphs.append(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>())</span>
<span id="cb37-111">        dfs(visited, node_list, subgraphs[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], to_check, v.count)</span>
<span id="cb37-112"></span>
<span id="cb37-113">  to_check <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> to_check.difference(consts_and_inputs)</span>
<span id="cb37-114"></span>
<span id="cb37-115">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 5. Find the linear sub-graphs</span></span>
<span id="cb37-116">  linear_subgraph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb37-117">  nonlin_subgraph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb37-118">  n_eqns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(jaxpr.eqns)</span>
<span id="cb37-119">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> subgraph <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraphs:</span>
<span id="cb37-120">    split <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(</span>
<span id="cb37-121">      (</span>
<span id="cb37-122">        i <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(linear_eqn) </span>
<span id="cb37-123">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraph</span>
<span id="cb37-124">      )</span>
<span id="cb37-125">    )</span>
<span id="cb37-126">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraph <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> to_check):</span>
<span id="cb37-127">      split <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(</span>
<span id="cb37-128">        split, </span>
<span id="cb37-129">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(chk <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> to_check <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> chk <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraph)</span>
<span id="cb37-130">      )</span>
<span id="cb37-131"></span>
<span id="cb37-132">    linear_subgraph.append(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(subgraph.intersection(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(split)))))</span>
<span id="cb37-133">    nonlin_subgraph.append(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(subgraph.intersection(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(split, n_eqns)))))</span>
<span id="cb37-134">  </span>
<span id="cb37-135">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (linear_subgraph, nonlin_subgraph, comb_eqns)</span></code></pre></div>
</details>
</div>
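<p>As a self-contained illustration of step 5, here is the split computation run on toy stand-ins for <code>linear_eqn</code> and the sub-graphs that mirror the earlier output (hypothetical data, not a real jaxpr):</p>

```python
# Equations 0-4 are linear, 5-18 are not (mirrors the example above).
linear_eqn = [i < 5 for i in range(19)]
n_eqns = len(linear_eqn)

def split_subgraph(subgraph, to_check=frozenset()):
    """Split a sub-graph at the first nonlinear equation it contains,
    or at a twice-visited node, whichever comes first."""
    split = next(i for i, lin in enumerate(linear_eqn)
                 if not lin and i in subgraph)
    if any(chk in subgraph for chk in to_check):
        split = min(split, min(chk for chk in to_check if chk in subgraph))
    lin = sorted(subgraph & set(range(split)))
    nonlin = sorted(subgraph & set(range(split, n_eqns)))
    return lin, nonlin

# Reproduces the Linear/Nonlinear partition printed earlier:
print(split_subgraph(set(range(17))))
print(split_subgraph({17, 18}))
```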
<p>For one final sense check, let’s compare these outputs to the original jaxpr.</p>
<div id="22f15cd4" class="cell" data-execution_count="26">
<div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb38-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j, lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(linear_subgraph):</span>
<span id="cb38-2">  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Linear: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>j<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb38-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> lin:</span>
<span id="cb38-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(jpr.eqns[i])</span>
<span id="cb38-5"></span>
<span id="cb38-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j, nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(nonlin_subgraph):</span>
<span id="cb38-7">  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Nonlinear: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>j<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb38-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> nlin:</span>
<span id="cb38-9">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(jpr.eqns[i])</span>
<span id="cb38-10"></span>
<span id="cb38-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Combination equations"</span>)</span>
<span id="cb38-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> comb_eqns:</span>
<span id="cb38-13">  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(eqn)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Linear: 0
a:f32[1] = dynamic_slice[slice_sizes=(1,)] b 0
a:f32[] = squeeze[dimensions=(0,)] b
a:f32[5000] = dynamic_slice[slice_sizes=(5000,)] b 1
a:f32[1000] = dot_general[dimension_numbers=(([1], [0]), ([], []))] b c
a:f32[1000] = add b c
Linear: 1
Nonlinear: 0
a:f32[1000] = logistic b
a:f32[1000] = log b
a:f32[1000] = convert_element_type[new_dtype=float32 weak_type=False] b
a:f32[1000] = mul b c
a:i32[1000] = convert_element_type[new_dtype=int32 weak_type=True] b
a:i32[1000] = sub 1 b
a:f32[1000] = neg b
a:f32[1000] = log1p b
a:f32[1000] = convert_element_type[new_dtype=float32 weak_type=False] b
a:f32[1000] = mul b c
a:f32[1000] = add b c
a:f32[] = reduce_sum[axes=(0,)] b
Nonlinear: 1
a:f32[] = dot_general[dimension_numbers=(([0], [0]), ([], []))] b b
a:f32[] = mul 0.5 b
Combination equations
a:f32[] = sub b c</code></pre>
</div>
</div>
<p>Comparing this to the original jaxpr, we see that it contains the same information (the formatting is a bit unfortunate, as the original <code>__repr__</code> keeps track of the links between equations, but what can you do?).</p>
<div id="c85d5f8b" class="cell" data-execution_count="27">
<div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb40-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(jpr)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>{ lambda a:f32[1000,5000] b:bool[1000]; c:f32[5001]. let
    d:f32[1] = dynamic_slice[slice_sizes=(1,)] c 0
    e:f32[] = squeeze[dimensions=(0,)] d
    f:f32[5000] = dynamic_slice[slice_sizes=(5000,)] c 1
    g:f32[1000] = dot_general[dimension_numbers=(([1], [0]), ([], []))] a f
    h:f32[1000] = add e g
    i:f32[1000] = logistic h
    j:f32[1000] = log i
    k:f32[1000] = convert_element_type[new_dtype=float32 weak_type=False] b
    l:f32[1000] = mul k j
    m:i32[1000] = convert_element_type[new_dtype=int32 weak_type=True] b
    n:i32[1000] = sub 1 m
    o:f32[1000] = neg i
    p:f32[1000] = log1p o
    q:f32[1000] = convert_element_type[new_dtype=float32 weak_type=False] n
    r:f32[1000] = mul q p
    s:f32[1000] = add l r
    t:f32[] = reduce_sum[axes=(0,)] s
    u:f32[] = dot_general[dimension_numbers=(([0], [0]), ([], []))] c c
    v:f32[] = mul 0.5 u
    w:f32[] = sub t v
  in (w,) }</code></pre>
</div>
</div>
</section>
<section id="making-sub-functions" class="level3">
<h3 class="anchored" data-anchor-id="making-sub-functions">Making sub-functions</h3>
<p>Now that we have the graph partitioned, let’s make our sub-functions. We do this by manipulating the <code>jaxpr</code> and then <em>closing</em> over the literals.</p>
<p>There are a few ways we can do this. We could build completely new <a href="https://github.com/google/jax/blob/c3f5af7d46b803da346aa7644cbeea3cb73b4c10/jax/_src/core.py#L297"><code>JaxprEqn</code></a> objects from the existing Jaxpr. But honestly, that is just annoying, so instead I’m going to modify the <a href="https://jax.readthedocs.io/en/latest/notebooks/Writing_custom_interpreters_in_Jax.html">basic, but incomplete, parser</a><sup>22</sup>.</p>
<p>The only modification from the standard <code>eval_jaxpr</code> is that we explicitly specify the <code>invars</code> in order to overwrite the standard ones. This relies on the topological ordering of the equations in the jaxpr expression graph.</p>
<div id="20360fac" class="cell" data-execution_count="28">
<div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb42-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Callable</span>
<span id="cb42-2"></span>
<span id="cb42-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> core <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jcore</span>
<span id="cb42-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> lax</span>
<span id="cb42-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax._src.util <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> safe_map</span>
<span id="cb42-6"></span>
<span id="cb42-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> eval_subjaxpr(</span>
<span id="cb42-8">  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args,</span>
<span id="cb42-9">  jaxpr: jcore.Jaxpr, </span>
<span id="cb42-10">  consts: List[jcore.Literal], </span>
<span id="cb42-11">  subgraph: List[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>], </span>
<span id="cb42-12">  invars: List[jcore.Var]</span>
<span id="cb42-13">):</span>
<span id="cb42-14"></span>
<span id="cb42-15"></span>
<span id="cb42-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(invars) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(args)</span>
<span id="cb42-17">  </span>
<span id="cb42-18">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Mapping from variable -&gt; value</span></span>
<span id="cb42-19">  env <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb42-20">  </span>
<span id="cb42-21">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> read(var):</span>
<span id="cb42-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Literals are values baked into the Jaxpr</span></span>
<span id="cb42-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(var) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> jcore.Literal:</span>
<span id="cb42-24">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> var.val</span>
<span id="cb42-25"></span>
<span id="cb42-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> env[var]</span>
<span id="cb42-27"></span>
<span id="cb42-28">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> write(var, val):</span>
<span id="cb42-29">    env[var] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb42-30"></span>
<span id="cb42-31">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We need to bind the input to the sub-function</span></span>
<span id="cb42-32">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to the environment.</span></span>
<span id="cb42-33">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We only need to write the consts that appear</span></span>
<span id="cb42-34">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># in our sub-graph, but that's more bookkeeping</span></span>
<span id="cb42-35">  safe_map(write, invars, args)</span>
<span id="cb42-36">  safe_map(write, jaxpr.constvars, consts)</span>
<span id="cb42-37"></span>
<span id="cb42-38">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Loop through equations and evaluate primitives using `bind`</span></span>
<span id="cb42-39">  outvars <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb42-40">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> subgraph:</span>
<span id="cb42-41">    eqn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jaxpr.eqns[j]</span>
<span id="cb42-42">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Read inputs to equation from environment</span></span>
<span id="cb42-43">    invals <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> safe_map(read, eqn.invars)  </span>
<span id="cb42-44">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># `bind` is how a primitive is called</span></span>
<span id="cb42-45">    outvals <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eqn.primitive.bind(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>invals, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>eqn.params)</span>
<span id="cb42-46">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Primitives may return multiple outputs or not</span></span>
<span id="cb42-47">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> eqn.primitive.multiple_results: </span>
<span id="cb42-48">      outvals <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [outvals]</span>
<span id="cb42-49">    outvars <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [eqn.outvars]</span>
<span id="cb42-50">    safe_map(write, eqn.outvars, outvals) </span>
<span id="cb42-51">  </span>
<span id="cb42-52">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> safe_map(read, outvars[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span></code></pre></div>
</div>
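<p>The environment-based evaluation pattern above is easier to see without any JAX machinery. Here is a minimal pure-Python sketch of the same idea; the <code>Eqn</code> tuple, the <code>PRIMS</code> table, and the variable names are illustrative stand-ins, not JAX objects.</p>
<pre class="sourceCode python"><code># Toy version of eval_subjaxpr: walk a subset of a topologically
# ordered equation list, reading inputs from an environment and
# writing each output back into it.
from collections import namedtuple

Eqn = namedtuple("Eqn", ["prim", "invars", "outvar"])
PRIMS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

# Graph for f(x) = (x + 1) * x, in topological order.
eqns = [Eqn("add", ["x", 1.0], "t"), Eqn("mul", ["t", "x"], "out")]

def eval_subgraph(subgraph, eqns, env):
    # Strings are variables (looked up in env); anything else is a literal.
    read = lambda v: env[v] if isinstance(v, str) else v
    for j in subgraph:
        eqn = eqns[j]
        env[eqn.outvar] = PRIMS[eqn.prim](*map(read, eqn.invars))
    # Like eval_subjaxpr, return the output of the last equation evaluated.
    return env[eqns[subgraph[-1]].outvar]

print(eval_subgraph([0, 1], eqns, {"x": 3.0}))  # 12.0</code></pre>
<p>Evaluating only the sub-graph <code>[0]</code> with the same environment returns the intermediate value <code>4.0</code>, which is exactly how the linear and nonlinear pieces get chained together.</p>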
<p>The final thing we should do is combine our transformation with this evaluation module to convert a function into a sequence of callable sub-functions. I am making <em>liberal</em> use of <code>partial</code> to close over variables that the user should never see (like the sub-graph!). Jesus loves closures and so do I.</p>
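<p>For anyone who hasn’t seen the trick: <code>functools.partial</code> bakes the bookkeeping arguments into a new callable, so the caller only ever supplies the data. A tiny sketch (the names here are made up for illustration):</p>
<pre class="sourceCode python"><code>from functools import partial

def eval_sub(x, *, scale, offset):
    # Stand-in for eval_subjaxpr: positional data, keyword-only bookkeeping.
    return scale * x + offset

# Close over the bookkeeping; the resulting function takes only x,
# just like the entries of lin_funs and nlin_funs.
f = partial(eval_sub, scale=2.0, offset=1.0)
print(f(3.0))  # 7.0</code></pre>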
<!-- There is some cost here. Because I am too lazy to work out the minimal set of 
inputs for each sub-expression, I'm going to just make sure that all of the computed
values are available to every function. This is obviously inefficient, but sometimes
you just need to write blog-quality code.

The other thing that comes out a bit tricky^[And relies _very_ heavily on the topological ordering of the equations!] here is that each returned here 
has a different number of arguments. The first linear function takes `jpr.invars` as 
its input. The second takes those _and_ the output of the first linear function.
For each subsequent function, this list becomes longer. This is partly unavoidable,
but with some more clever bookkeeping it wouldn't be too hard to produce minimal
input sets. But once again: blog code. 

But if you're going to write code that does weird shit like this, the least you can
do is remember to catch it and throw a useful error down the line. -->
<div id="827d3dec" class="cell" data-execution_count="29">
<div class="sourceCode cell-code" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb43-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> decompose(fun: Callable, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Tuple[List[Callable], List[Callable], List[jcore.Var]]:</span>
<span id="cb43-2">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> functools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> partial</span>
<span id="cb43-3">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_jaxpr</span>
<span id="cb43-4"></span>
<span id="cb43-5">  jpr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_jaxpr(fun)(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb43-6">  linear_subgraph, nonlin_subgraph, comb_eqns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transform_jaxpr(jpr)</span>
<span id="cb43-7"></span>
<span id="cb43-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(linear_subgraph) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(nonlin_subgraph)</span>
<span id="cb43-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(jpr.jaxpr.invars) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Functions must only have one input"</span></span>
<span id="cb43-10"></span>
<span id="cb43-11">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> get_invars(sub: List[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> List[jcore.Var]:</span>
<span id="cb43-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># There is an implicit assumption everywhere in this post </span></span>
<span id="cb43-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># that each sub-function only has one non-constant input</span></span>
<span id="cb43-14">    </span>
<span id="cb43-15">    min_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.jaxpr.eqns[sub[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]].outvars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].count</span>
<span id="cb43-16">    literal_ceil <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.jaxpr.invars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].count</span>
<span id="cb43-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> sub:</span>
<span id="cb43-18">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> v <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> jpr.jaxpr.eqns[j].invars:</span>
<span id="cb43-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> (</span>
<span id="cb43-20">          <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(v, jcore.Literal) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span></span>
<span id="cb43-21">          v.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> literal_ceil <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> </span>
<span id="cb43-22">          v.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> min_count</span>
<span id="cb43-23">        ):</span>
<span id="cb43-24">          <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> [v]</span>
<span id="cb43-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">Exception</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Somehow you can't find any invars"</span>)</span>
<span id="cb43-26">    </span>
<span id="cb43-27"></span>
<span id="cb43-28">  lin_funs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb43-29">  nlin_funs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb43-30">  nlin_inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb43-31">  lin_outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb43-32">  nlin_outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb43-33"></span>
<span id="cb43-34">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> linear_subgraph:</span>
<span id="cb43-35">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(lin) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb43-36">      lin_funs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb43-37">      lin_outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [jpr.jaxpr.invars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].count]</span>
<span id="cb43-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> jpr.jaxpr.eqns[lin[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].primitive.multiple_results:</span>
<span id="cb43-39">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">Exception</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"This code doesn't deal with multiple outputs from subgraph </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>lin<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb43-40">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb43-41">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># find </span></span>
<span id="cb43-42">      lin_outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [jpr.jaxpr.eqns[lin[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].outvars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].count]</span>
<span id="cb43-43">      lin_funs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [</span>
<span id="cb43-44">        partial(eval_subjaxpr,</span>
<span id="cb43-45">          jaxpr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.jaxpr, </span>
<span id="cb43-46">          consts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.literals, </span>
<span id="cb43-47">          subgraph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lin, </span>
<span id="cb43-48">          invars <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_invars(lin)</span>
<span id="cb43-49">        )</span>
<span id="cb43-50">      ]</span>
<span id="cb43-51">      </span>
<span id="cb43-52">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> nonlin_subgraph:</span>
<span id="cb43-53">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(nlin) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb43-54">      nlin_funs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb43-55">      nlin_inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb43-56">      nlin_outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb43-57">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> jpr.jaxpr.eqns[nlin[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].primitive.multiple_results:</span>
<span id="cb43-58">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">Exception</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"This code doesn't deal with multiple outputs from subgraph </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>nlin<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb43-59">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb43-60">      invar <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_invars(nlin)[0]</span>
<span id="cb43-61">      nlin_inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [lin_outputs.index(invar.count) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> invar.count <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> lin_outputs <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb43-62">      nlin_outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [jpr.jaxpr.eqns[nlin[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].outvars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].count]</span>
<span id="cb43-63">      nlin_funs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> [</span>
<span id="cb43-64">        partial(eval_subjaxpr,</span>
<span id="cb43-65">          jaxpr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.jaxpr, </span>
<span id="cb43-66">          consts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jpr.literals, </span>
<span id="cb43-67">          subgraph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nlin, </span>
<span id="cb43-68">          invars <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_invars(nlin)</span>
<span id="cb43-69">        )</span>
<span id="cb43-70">      ]</span>
<span id="cb43-71"></span>
<span id="cb43-72">  combine <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(linear_subgraph)</span>
<span id="cb43-74">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> eqn <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> comb_eqns:</span>
<span id="cb43-75">    combine[nlin_outputs.index(eqn.invars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].count)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span>
<span id="cb43-76">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> eqn.primitive.name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sub"</span>:</span>
<span id="cb43-77">      combine[nlin_outputs.index(eqn.invars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].count)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span>
<span id="cb43-78">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb43-79">      combine[nlin_outputs.index(eqn.invars[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].count)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span>
<span id="cb43-80"></span>
<span id="cb43-81"></span>
<span id="cb43-82">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> lin_funs, nlin_funs, nlin_inputs, combine</span></code></pre></div>
</div>
</section>
</section>
<section id="making-the-hessian" class="level2">
<h2 class="anchored" data-anchor-id="making-the-hessian">Making the Hessian</h2>
<p>After <em>all</em> of this work, we can finally make a function that builds a Hessian!</p>
<p>Recall that if <img src="https://latex.codecogs.com/png.latex?f(x)%20=%20g(h(x))">, where <img src="https://latex.codecogs.com/png.latex?h(x)"> is linear and <img src="https://latex.codecogs.com/png.latex?g(x)"> is nonlinear, then the Hessian of <img src="https://latex.codecogs.com/png.latex?f"> is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AH_f(x)%20=%20J_h%5ET%20H_g%20J_h,%0A"> where <img src="https://latex.codecogs.com/png.latex?J_h"> is the (constant) Jacobian of <img src="https://latex.codecogs.com/png.latex?h"> and <img src="https://latex.codecogs.com/png.latex?H_g"> is the Hessian of <img src="https://latex.codecogs.com/png.latex?g"> evaluated at <img src="https://latex.codecogs.com/png.latex?h(x)">. Because <img src="https://latex.codecogs.com/png.latex?h"> is linear, its second derivative vanishes, so the usual extra chain-rule term drops out.</p>
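<p>As a quick sanity check of that identity, here is a small self-contained example (the matrix <code>A</code> and the function <code>g</code> are arbitrary choices, and everything is done with plain Python lists rather than JAX):</p>
<pre class="sourceCode python"><code># h(x) = A x is linear, g(y) = y0**3 + y1**2 is nonlinear,
# so f = g(h(x)) should have Hessian A^T H_g(A x) A.
A = [[1.0, 2.0], [3.0, 4.0]]

x = (1.0, 1.0)
y0 = A[0][0] * x[0] + A[0][1] * x[1]  # first component of h(x)

# H_g(y) is diagonal: [[6*y0, 0], [0, 2]].
Hg = [[6.0 * y0, 0.0], [0.0, 2.0]]

# H_f = A^T Hg A, written elementwise (using that Hg is diagonal here).
Hf = [[sum(A[k][i] * Hg[k][k] * A[k][j] for k in range(2))
       for j in range(2)] for i in range(2)]
print(Hf)  # [[36.0, 60.0], [60.0, 104.0]]</code></pre>
<p>Differentiating <code>f(x0, x1) = (x0 + 2*x1)**3 + (3*x0 + 4*x1)**2</code> by hand gives the same matrix, which is reassuring.</p>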
<div id="d7f21525" class="cell" data-execution_count="30">
<div class="sourceCode cell-code" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb44-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> smarter_hessian(fun: Callable) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Callable:</span>
<span id="cb44-2">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jacfwd</span>
<span id="cb44-3">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> hessian</span>
<span id="cb44-4">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jnp</span>
<span id="cb44-5">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> hess(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args):</span>
<span id="cb44-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(args) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"This only works for functions with one input"</span></span>
<span id="cb44-7">    </span>
<span id="cb44-8">    lin_funs, nlin_funs, nlin_inputs, combine <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> decompose(fun, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb44-9">    n_in <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> args[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb44-10">    part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.zeros((n_in, n_in))</span>
<span id="cb44-11"></span>
<span id="cb44-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> lin, nlin, nimp, comb <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(lin_funs, nlin_funs, nlin_inputs, combine):</span>
<span id="cb44-13">      </span>
<span id="cb44-14">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb44-15">        lin_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lin(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb44-16">        jac <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jacfwd(lin)(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb44-17">      </span>
<span id="cb44-18"></span>
<span id="cb44-19">      h_args <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (lin_val,) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span> args</span>
<span id="cb44-20">      hess <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hessian(nlin)(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>h_args) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb44-21"></span>
<span id="cb44-22">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb44-23">        part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> comb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (jac.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> (hess <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> jac))</span>
<span id="cb44-24">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb44-25">        part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> comb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jac.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> jac</span>
<span id="cb44-26">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb44-27">        part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> comb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> hess</span>
<span id="cb44-28">      </span>
<span id="cb44-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> part</span>
<span id="cb44-30">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> hess</span></code></pre></div>
</div>
<p>After all of that, let’s see if this works!</p>
<div id="431cf1b2" class="cell" data-execution_count="31">
<div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb45-1">mode_jax, H_jax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> laplace(</span>
<span id="cb45-2">  partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y),</span>
<span id="cb45-3">  x0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>jnp.zeros(X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb45-4">)</span>
<span id="cb45-5"></span>
<span id="cb45-6">H_smarter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> smarter_hessian(partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y))(mode_jax)</span>
<span id="cb45-7"></span>
<span id="cb45-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The error is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>jnp<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(H_jax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> H_smarter)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>tolist()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">!"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The error is 2.3684690404479625e-06!</code></pre>
</div>
</div>
<p>In single precision, that is good enough for government work.</p>
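<p>For context, that error is only a handful of units in the last place for single precision, which resolves roughly seven decimal digits:</p>

```python
import jax.numpy as jnp

# float32 machine epsilon (2**-23): the resolution single precision gives us.
# An error of ~2.4e-6 between two O(1) matrices is only a few dozen ulps.
eps = float(jnp.finfo(jnp.float32).eps)
print(eps)  # ≈ 1.19e-07
```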
<section id="but-is-it-faster" class="level3">
<h3 class="anchored" data-anchor-id="but-is-it-faster">But is it faster?</h3>
<p>Now let’s take a look at whether we have actually saved any time.</p>
<div id="7907d5b4" class="cell" data-execution_count="32">
<div class="sourceCode cell-code" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb47-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jax</span>
<span id="cb47-2">times_hess <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: jax.hessian(partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y))(mode_jax), number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb47-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Full Hessian: The average time with p = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(times_hess)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(+/-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(times_hess)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span>
<span id="cb47-4"></span>
<span id="cb47-5">times_smarter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: smarter_hessian(partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y))(mode_jax), number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb47-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Smarter Hessian: The average time with p = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(times_smarter)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(+/-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(times_smarter)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Full Hessian: The average time with p = 5000 is  3.444(+/- 0.201)
Smarter Hessian: The average time with p = 5000 is  3.569(+/- 0.024)</code></pre>
</div>
</div>
<p>Well, that didn’t make much of a difference. If anything, it’s a little bit slower. The overhead likely comes from dispatching each decomposed piece separately; it could probably be compiled away by lowering (jit-compiling) the whole computation at once.</p>
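<p>As a sketch of what “lowering the whole thing” could look like, wrapping the Hessian computation in <code>jax.jit</code> hands the entire graph to XLA in one piece. The <code>log_post</code> here is an illustrative stand-in, not the post’s <code>log_posterior</code>:</p>

```python
import jax
import jax.numpy as jnp

# Illustrative stand-in for the post's log_posterior: a Gaussian log-density
# whose Hessian is simply -I, so the result is easy to check.
def log_post(beta):
    return -0.5 * jnp.sum(beta ** 2)

# jit-compile the full Hessian computation so XLA can fuse it end to end
hess_fn = jax.jit(jax.hessian(log_post))

H = hess_fn(jnp.zeros(3))
H.block_until_ready()  # force execution so any timing measures the compiled kernel
```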
</section>
<section id="but-you-forgot-the-diagonal-trick" class="level3">
<h3 class="anchored" data-anchor-id="but-you-forgot-the-diagonal-trick">But you forgot the diagonal trick</h3>
<p>That said, the decomposition into linear and non-linear parts was <em>not</em> the real source of the savings. If we assume the Hessian of the likelihood is diagonal, then we can indeed do a lot better!</p>
<p>The problem here is that while <code>smarter_hessian</code> worked for any<sup>23</sup> JAX-traceable function, we are now making a structural assumption. In theory, we could go through the JAX primitives and mark all of the ones that would (conditionally) lead to diagonal Hessians, but honestly I kinda want this bit of the post to be done. So I will leave that as an <em>exercise to the interested reader</em>.</p>
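<p>The trick itself is a one-liner: if a function has a diagonal Hessian, multiplying that Hessian by the ones vector returns exactly its diagonal, so a single <code>jvp</code> of the gradient replaces a full Hessian computation. A minimal sketch (the function <code>g</code> is illustrative):</p>

```python
import jax
import jax.numpy as jnp

# A sum of elementwise nonlinearities has a diagonal Hessian.
def g(z):
    return jnp.sum(jnp.log1p(jnp.exp(z)))

z = jnp.linspace(-1.0, 1.0, 4)

# H @ 1 picks out the diagonal when the off-diagonal entries are zero,
# and one forward-over-reverse jvp computes it without ever forming H.
D = jax.jvp(jax.grad(g), (z,), (jnp.ones_like(z),))[1]
```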
<div id="f8675103" class="cell" data-execution_count="33">
<div class="sourceCode cell-code" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb49-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> smart_hessian(fun: Callable) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Callable:</span>
<span id="cb49-2">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jacfwd</span>
<span id="cb49-3">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> hessian</span>
<span id="cb49-4">  <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jnp</span>
<span id="cb49-5">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> hess(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args):</span>
<span id="cb49-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(args) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"This only works for functions with one input"</span></span>
<span id="cb49-7">    </span>
<span id="cb49-8">    lin_funs, nlin_funs, nlin_inputs, combine <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> decompose(fun, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb49-9">    n_in <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> args[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb49-10">    part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.zeros((n_in, n_in))</span>
<span id="cb49-11"></span>
<span id="cb49-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> lin, nlin, nimp, comb <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(lin_funs, nlin_funs, nlin_inputs, combine):</span>
<span id="cb49-13">      </span>
<span id="cb49-14">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb49-15">        lin_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lin(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb49-16">        jac <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jacfwd(lin)(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb49-17">      </span>
<span id="cb49-18"></span>
<span id="cb49-19">      h_args <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (lin_val,) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span> args</span>
<span id="cb49-20">      D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(grad(nlin), h_args, (jnp.ones_like(h_args[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]),))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb49-21"></span>
<span id="cb49-22">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb49-23"></span>
<span id="cb49-24">        part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> comb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (jac.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> (jac <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> D[:,<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]))</span>
<span id="cb49-25">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> lin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb49-26">        part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> comb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jac.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> jac</span>
<span id="cb49-27">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> nlin <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb49-28">        part <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> comb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.diag(D)</span>
<span id="cb49-29">      </span>
<span id="cb49-30">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> part</span>
<span id="cb49-31">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> hess</span>
<span id="cb49-32"></span>
<span id="cb49-33"></span>
<span id="cb49-34"></span>
<span id="cb49-35">H_smart <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> smart_hessian(partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y))(mode_jax)</span>
<span id="cb49-36"></span>
<span id="cb49-37"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The error is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>jnp<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(H_jax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> H_smart)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>tolist()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">!"</span>)</span>
<span id="cb49-38"></span>
<span id="cb49-39">times_smart <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: smart_hessian(partial(log_posterior, X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y))(mode_jax), number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb49-40"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Smart (diagonal-aware) Hessian: The average time with p = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(times_smart)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(+/-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(times_smart)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The error is 2.3684690404479625e-06!
Smart (diagonal-aware) Hessian: The average time with p = 5000 is  2.269(+/- 0.031)</code></pre>
</div>
</div>
<p>That is a proper saving!</p>
</section>
</section>
<section id="some-concluding-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="some-concluding-thoughts">Some concluding thoughts</h2>
<p>Well, this post got out of control. I swear when I sat down I was just going to write a quick post about Laplace approximations. Oops.</p>
<section id="the-power-of-compiler-optimizations" class="level3">
<h3 class="anchored" data-anchor-id="the-power-of-compiler-optimizations">The power of compiler optimizations</h3>
<p>I think what I’ve shown here is that one of the really powerful things about <em>compiled</em> frameworks like JAX is that you can perform a pile of code optimizations that can greatly improve performance.</p>
<p>In the ideal world, this type of optimization should be <em>invisible</em> to the end user. Were I to do this seriously<sup>24</sup>, I would make sure that if the assumptions of the optimized code weren’t met, the behaviour would revert back to the standard <code>jax.hessian</code>.</p>
<p>Recognizing when to perform an optimization is, in reality, the whole art of this type of process. And it’s very hard. For this post, I was able to automatically recognize the linear operation, but I didn’t try to find conditions that ensured the Hessian would be diagonal.</p>
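<p>To make the “recognize the linear operation” step concrete, the detection can be sketched by tracing the function to a jaxpr and looking for the primitives that linear algebra lowers to. This is a toy version of the idea, with illustrative names:</p>

```python
import jax
import jax.numpy as jnp

A = jnp.ones((3, 2))

def fun(x):
    # a linear map followed by a nonlinearity: the f(Ax) pattern
    return jnp.sum(jnp.tanh(A @ x))

# trace the function to a jaxpr and list the primitives it uses
jaxpr = jax.make_jaxpr(fun)(jnp.zeros(2))
prims = [eqn.primitive.name for eqn in jaxpr.jaxpr.eqns]
print(prims)  # the matrix-vector product shows up as 'dot_general'
```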
</section>
<section id="sparsity-detection-and-sparse-autodiff" class="level3">
<h3 class="anchored" data-anchor-id="sparsity-detection-and-sparse-autodiff">Sparsity detection and sparse autodiff</h3>
<p>Would you believe that people have spent a lot of time studying the efficiency gains when you have things like sparse Hessians? There is, in fact, a massive literature on <em>sparse autodiff</em> and it is implemented in several autodiff libraries, including <a href="https://github.com/JuliaDiff/SparseDiffTools.jl">in Julia</a>.</p>
<p>Sparsity exploiting autodiff uses symbolic analysis of the expression tree for a function to identify when certain derivatives are going to be zero. For Hessians, it needs to identify when two variables have at most linear dependencies.</p>
<p>Once you have worked out the sparsity pattern, you need to do something with it. In the logistic case, it is diagonal, but in a lot of cases it will depend on more than one element of the latent representation. That is, the Hessian will be sparse<sup>25</sup>, but it won’t be diagonal.</p>
<p>I guess the question is: <em>can we generalize the observation that if the Hessian is diagonal we only need to compute a single Hessian-vector product</em> to general sparsity structures?</p>
<p>In general, we won’t be able to get away with a single product and will instead need a specially constructed set of <img src="https://latex.codecogs.com/png.latex?k"> <em>probing vectors</em>, where <img src="https://latex.codecogs.com/png.latex?k"> is a number to be determined (that is hopefully <em>much</em> smaller than <img src="https://latex.codecogs.com/png.latex?n">). This set of vectors <img src="https://latex.codecogs.com/png.latex?s_k"> will have the special property that <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bj=1%7D%5Ek%20s_j%20=%201.%0A"> This means that the non-zero elements of each probing vector correspond to a disjoint grouping of the variables.</p>
<p>To do this, we need to construct our set of probing vectors in a very special way. Each <img src="https://latex.codecogs.com/png.latex?s_k"> will be a vector containing zeros and ones. The set of indices with <img src="https://latex.codecogs.com/png.latex?%5Bs_k%5D_j%20=%201"> have color <img src="https://latex.codecogs.com/png.latex?k">. The aim is to associate each index with a unique color in such a way that we can recover the Hessian entries. We can do this with a structurally symmetric orthogonal partition, which is detailed in <a href="http://www.ii.uib.no/~fredrikm/fredrik/papers/sirev2005.pdf">Section 4 of this great review article</a>.</p>
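<p>As a toy illustration of the recovery step (my own sketch, not taken from the review article): for a tridiagonal Hessian, colouring the indices modulo three makes columns of the same colour structurally orthogonal, so three Hessian-vector products determine every non-zero entry:</p>

```python
import jax
import jax.numpy as jnp

# Neighbour-coupling terms give a tridiagonal Hessian.
def f(x):
    return jnp.sum(jnp.cos(x[1:] - x[:-1]))

n = 9
x = jnp.linspace(0.0, 1.0, n)

def hvp(v):
    # forward-over-reverse Hessian-vector product; no dense Hessian is formed
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

# three probing vectors: index i gets colour i mod 3; they sum to the ones vector
probes = [(jnp.arange(n) % 3 == c).astype(x.dtype) for c in range(3)]
cols = [hvp(s) for s in probes]

# recovery: within a colour class at most one column touches each row,
# so H[r, i] can be read straight off the probe with colour i mod 3
H_rec = jnp.zeros((n, n))
for i in range(n):
    for r in range(max(0, i - 1), min(n, i + 2)):
        H_rec = H_rec.at[r, i].set(cols[i % 3][r])
```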
<p>Implementing<sup>26</sup> sparsity-aware autodiff Hessians does require some graph algorithms, and is frankly beyond the scope of my patience here. But it certainly is possible and would give quite general performance gains.</p>
<p>Critically, because it replaces the computation of a <img src="https://latex.codecogs.com/png.latex?p%20%5Ctimes%20p"> dense Hessian matrix with <img src="https://latex.codecogs.com/png.latex?k"> Hessian-vector products, it is extremely well suited to modern GPU acceleration techniques!</p>
</section>
<section id="could-we-do-more" class="level3">
<h3 class="anchored" data-anchor-id="could-we-do-more">Could we do more?</h3>
<p>There are so many many many ways to improve the very simple symbolic reduction of the autodiff beyond the simple “identify <img src="https://latex.codecogs.com/png.latex?f(Ax)">” strategy. For more complex cases, it might be necessary to relax the <em>only one input and only one output</em> assumption.</p>
<p>It also might be possible to chain multiple instances of this, although this would require a more complex Hessian chain rule. Nevertheless, the extra complexity might be balanced by savings from the applicable instances of sparse autodiff.</p>
<p>But probably the thing that <em>actually</em> annoys me in all of this is that we are constantly recomputing the Jacobian for the linear equation, which is fixed. A better implementation would perform symbolic differentiation on linear sub-graphs, which should lead to even more savings.</p>
</section>
<section id="but-is-jax-the-right-framework-for-this" class="level3">
<h3 class="anchored" data-anchor-id="but-is-jax-the-right-framework-for-this">But is JAX the right framework for this?</h3>
<p>All of this was a fair bit of work so I’m tempted to throw myself at the sunk-cost fallacy and just declare it to be good. But there is a problem. Because JAX doesn’t do a symbolic transformation of the program (only a trace through paths associated with specific values), there is no guarantee that the sparsity pattern for <img src="https://latex.codecogs.com/png.latex?H"> remains the same at each step. And there is nothing wrong with that. It’s an expressive, exciting language.</p>
<p>But all of the code transformation to make a sparsity-exploiting Hessian doesn’t come for free. And the idea of having to do it again every time a Hessian is needed is … troubling. If we could guarantee that the sparsity pattern was static, then we could factor all of this complex parsing and coloring code away and just run it once for each problem.</p>
<p>Theoretically, we could do something like hashing on the jaxpr, but I’m not sure how much that would help.</p>
<p>Ideally, we could do this in a library that performs <em>symbolic</em> manipulations and can compile them into an expression graph. JAX is not quite<sup>27</sup> that language. An option for this type of symbolic manipulation would be <a href="https://aesara.readthedocs.io/en/latest/">Aesara</a>. It may even be possible to do it in <a href="https://github.com/stan-dev/stanc3">Stan</a>, but even my wandering mind doesn’t want to work out how to do this in OCaml.</p>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I will never reveal how much. But it was most of it.↩︎</p></li>
<li id="fn2"><p>or on↩︎</p></li>
<li id="fn3"><p>that probably converges. Think of it like <img src="https://latex.codecogs.com/png.latex?f_n(%5Ctheta)%20=%20n%5E%7B-1%7D%20%5Cleft(%5Csum_%7Bi=1%7D%5En%20p(y_i%20%5Cmid%20%5Ctheta)%20+%20p(%5Ctheta)%5Cright)">, where <img src="https://latex.codecogs.com/png.latex?p(y_i%20%5Cmid%20%5Ctheta)"> is the likelihood and <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta)"> is the prior.↩︎</p></li>
<li id="fn4"><p>The first-order term disappears because at the mode <img src="https://latex.codecogs.com/png.latex?x%5E*"> <img src="https://latex.codecogs.com/png.latex?%5Cnabla%20f(x%5E*)=0">↩︎</p></li>
<li id="fn5"><p>or has one dominant mode↩︎</p></li>
<li id="fn6"><p>Something isn’t always better than nothing but sometimes it is↩︎</p></li>
<li id="fn7"><p>You could say reproducible code but I won’t because that word means something pretty specific. I mean, this is not the place for a rant, but it is <em>very</em> difficult to write strictly reproducible code and I am frankly not even going to try to take a bite out of that particular onion.↩︎</p></li>
<li id="fn8"><p>The maths under this is very interesting and surprisingly accessible (in a very advanced sort of way). I guess it depends on what you think of as accessible, but it’s certainly much nicer than entropy and VC-classes. A lovely set of notes that cover everything you’ve ever wanted to know is <a href="https://arxiv.org/abs/1011.3027">here</a>↩︎</p></li>
<li id="fn9"><p>Unless someone’s been doing their design of experiments↩︎</p></li>
<li id="fn10"><p>With 100 observations, we expect our data-driven variation (aka the frequentist version) to be about one decimal place, so the Laplace approximation is accurate within that tolerance. In fact, clever maths types can analyse the error in the Laplace approximation and show that the error is about <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E%7B-1%7D)">, which is asymptotically much smaller than the sampling variability of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E%7B-1/2%7D)">, which suggests that the error introduced by the Laplace approximation isn’t catastrophic. At least with enough data.↩︎</p></li>
<li id="fn11"><p>Be still my beating heart.↩︎</p></li>
<li id="fn12"><p>Ok. You caught me. They’re not technically the same model. The symbolic code doesn’t include an intercept. I just honestly cannot be arsed to do the very minor matrix algebra to add it in. Nor can I be arsed to add a column of ones to <code>X</code>.↩︎</p></li>
<li id="fn13"><p>So many tuples↩︎</p></li>
<li id="fn14"><p>This is in pretty stark contrast to the pytorch docs, which are shit. Be more like JAX.↩︎</p></li>
<li id="fn15"><p>INLA does this. Very explicitly. And a lot of other cool stuff. It doesn’t use autodiff though.↩︎</p></li>
<li id="fn16"><p>For example, there’s no call to <code>logistic</code> in the code, but a quick look at <code>jax.lax.logistic</code> shows that it’s the same thing as <code>expit</code>.↩︎</p></li>
<li id="fn17"><p>This basically <em>just works</em> as long as you’ve got <code>graphviz</code> installed on your system. And once you find the right regex to strip out the <em>terrible</em> auto-generated title.↩︎</p></li>
<li id="fn18"><p>You need to install the dev version, or else it renders a lot of <code>pjit</code>s where the <code>sum</code> and <code>sub</code>s are supposed to be.↩︎</p></li>
<li id="fn19"><p>If I wasn’t sure, I deleted them from the linear list. There were also <code>scatter_mul</code>, <code>reduce_window</code>, and <code>reduce_window_shape_tuple</code>, which are all sometimes linear but frankly I didn’t want to work out the logic.↩︎</p></li>
<li id="fn20"><p>The letters are <code>__repr__</code> magic↩︎</p></li>
<li id="fn21"><p>Lord I hate a big ‘if’/‘elif’ block. Just terrible. I should refactor but this is a weekend blog post not a work thing↩︎</p></li>
<li id="fn22"><p>It is very little extra work to deal with eg JIT’d primitives and that sort of stuff, but for the purpose of this post, let’s keep things as simple as possible.↩︎</p></li>
<li id="fn23"><p>With input/output restrictions↩︎</p></li>
<li id="fn24"><p>I am currently dressed like a sexy clown.↩︎</p></li>
<li id="fn25"><p>Most of the entries will be zero↩︎</p></li>
<li id="fn26"><p>The previous article goes for ease of implementation over speed. A faster and better algorithm, and a <em>very</em> detailed comparison of all of the available options can be found <a href="http://www.ii.uib.no/~fredrikm/fredrik/papers/SISC2007.pdf">here</a>. And I am not implementing that for a fucking blog.↩︎</p></li>
<li id="fn27"><p>And it’s not trying to. Their bread and butter is autodiff and what they’re doing is absolutely natural for that.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2024,
  author = {Simpson, Dan},
  title = {An Unexpected Detour into Partially Symbolic,
    Sparsity-Exploiting Autodiff; or {Lord} Won’t You Buy Me a {Laplace}
    Approximation},
  date = {2024-05-08},
  url = {https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2024" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2024. <span>“An Unexpected Detour into Partially Symbolic,
Sparsity-Exploiting Autodiff; or Lord Won’t You Buy Me a Laplace
Approximation.”</span> May 8, 2024. <a href="https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace.html">https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace.html</a>.
</div></div></section></div> ]]></description>
  <category>JAX</category>
  <category>Laplace approximation</category>
  <category>Sparse matrices</category>
  <category>Autodiff</category>
  <guid>https://dansblog.netlify.app/posts/2024-05-08-laplace/laplace.html</guid>
  <pubDate>Tue, 07 May 2024 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2024-05-08-laplace/hat.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Diffusion models; or Yet another way to sample from an arbitrary distribution</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2023-01-30-diffusion/diffusion.html</link>
  <description><![CDATA[ 





<p>The other day I went to the cinema and watched M3GAN, a true movie masterpiece<sup>1</sup> about the death and carnage that ensues when you simply train your extremely complex ML model and don’t do proper ethics work. And that, of course, made me want to write a little bit about something relatively hip, hop, and happening<sup>2</sup> in the ML/AI space. But, like, I’m not gonna be <em>that</em> on trend<sup>3</sup> because fuck that noise, so I’m gonna talk about diffusion models.</p>
<p>It’s worth noting that I know bugger all about diffusion models. But when they first came out, I had a quick look at how they worked and then promptly forgot about them because, let’s face it, I work on different things. But hey. If that’s not enough<sup>4</sup> knowledge to write a blog post, I don’t know what is.</p>
<p>And here’s the thing. Most of the time when I blog about something I know a lot about it. Sometimes too much. But this is not one of those times. There are <em>plenty</em> of resources on the internet if you want to learn about diffusions models from an expert. Oodles. But where else but here can you read the barely proof-read writing of a man who read a couple of papers yesterday?</p>
<p>And who doesn’t want<sup>5</sup> that?</p>
<section id="a-prelude-measure-transport-for-sampling-from-arbitrary-distributions" class="level2">
<h2 class="anchored" data-anchor-id="a-prelude-measure-transport-for-sampling-from-arbitrary-distributions">A prelude: Measure transport for sampling from arbitrary distributions</h2>
<p>One of the fundamental tasks in computational statistics is to sample from a probability distribution. There are millions of ways of doing this, but the most popular generic method is Markov chain Monte Carlo. But this is not the post about MCMC methods. I’ve already made <a href="https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html">a post about MCMC methods</a>.</p>
<p>Instead, let’s focus on stranger ways to do it. In particular, let’s think about methods that create a mapping <img src="https://latex.codecogs.com/png.latex?T:%20%5Cmathbb%7BR%7D%5Ed%20%5Crightarrow%20%5Cmathbb%7BR%7D%5Ed">, possibly depending on some properties of the target distribution, such that the following procedure constructs a sample <img src="https://latex.codecogs.com/png.latex?x%20%5Csim%20p(x)">:</p>
<ol type="1">
<li>Sample <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20q(u)"> for some known distribution <img src="https://latex.codecogs.com/png.latex?q(u)"></li>
<li>Set <img src="https://latex.codecogs.com/png.latex?x%20=%20T(u)"></li>
</ol>
<p>The general problem of starting with a distribution <img src="https://latex.codecogs.com/png.latex?q(%5Ccdot)"> and mapping it to another distribution <img src="https://latex.codecogs.com/png.latex?p(%5Ccdot)"> is an example of a problem known as <em>measure transport</em>. Transport problems have been studied by mathematicians for yonks. It turns out that there are an infinite number of mappings <img src="https://latex.codecogs.com/png.latex?T"> that will do the job, so it’s up to us to choose a good one.</p>
<p>Probably the most famous<sup>6</sup> transport problem is the <em>optimal transport problem</em>, first studied by Monge and Kantorovich, which tries to find a mapping <img src="https://latex.codecogs.com/png.latex?T"> that minimises <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20q%7D(c(x,%20T(x)))%0A"> subject to the constraint that <img src="https://latex.codecogs.com/png.latex?T(x)%20%5Csim%20p"> whenever <img src="https://latex.codecogs.com/png.latex?x%20%5Csim%20q">, where <img src="https://latex.codecogs.com/png.latex?c(x,y)"> is some sort of cost function. There are canonical choices of cost function, but for the most part we are free to choose something that is convenient.</p>
<p>The measure transport concept underpins the method of <a href="https://arxiv.org/abs/1908.09257">normalising flows</a>, but the presentation that I’m most familiar with is due to <a href="https://arxiv.org/abs/1109.1516">Youssef Marzouk and his collaborators</a> in 2011 and predates the big sexy normalising flow papers by a few years.</p>
<section id="continuous-distributions-in-1d" class="level3">
<h3 class="anchored" data-anchor-id="continuous-distributions-in-1d">Continuous distributions in 1D</h3>
<p>If <img src="https://latex.codecogs.com/png.latex?p"> and <img src="https://latex.codecogs.com/png.latex?q"> are both continuous, univariate distributions, it is pretty easy to construct a transport map. In particular, if <img src="https://latex.codecogs.com/png.latex?F_p"> is the cumulative distribution function of <img src="https://latex.codecogs.com/png.latex?p">, then <img src="https://latex.codecogs.com/png.latex?%0AT(x)%20=%20F_p%5E%7B-1%7D(F_q(x))%0A"> is a transport map. This works because, if <img src="https://latex.codecogs.com/png.latex?x%20%5Csim%20q">, then <img src="https://latex.codecogs.com/png.latex?F_q(x)%20%5Csim%20%5Ctext%7BUnif%7D(0,1)">. From this, we can use everyone’s favourite result that you can sample from a continuous univariate random variable <img src="https://latex.codecogs.com/png.latex?p"> by evaluating the quantile function at a uniform random value.</p>
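<p>As a concrete (and deliberately trivial) instance of this map: take <img src="https://latex.codecogs.com/png.latex?q"> to be uniform, so <img src="https://latex.codecogs.com/png.latex?F_q"> is the identity on <img src="https://latex.codecogs.com/png.latex?%5B0,1%5D">, and <img src="https://latex.codecogs.com/png.latex?p"> to be the unit exponential, whose quantile function we know in closed form:</p>

```python
import numpy as np

# T(x) = F_p^{-1}(F_q(x)) with q = Unif(0, 1) (so F_q is the identity
# on [0, 1]) and p = Exp(1), whose quantile function is -log(1 - u).
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)   # samples from the reference q
x = -np.log1p(-u)               # transported samples, distributed Exp(1)
```

The sample mean should come out near 1 and the median near <img src="https://latex.codecogs.com/png.latex?%5Clog%202">, as an Exp(1) demands.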
<p>There are, of course, two problems with this: it only works in one dimension and we usually don’t know <img src="https://latex.codecogs.com/png.latex?F%5E%7B-1%7D"> explicitly.</p>
<p>The second of these isn’t really a problem if we are willing to do something splendifferously dumb. And I am. Because I’m gay and frivolous<sup>7</sup>.</p>
<p>If I write <img src="https://latex.codecogs.com/png.latex?Q(t)%20=%20F%5E%7B-1%7D(t)"> then I can differentiate this to get <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BdQ%7D%7Bdt%7D%20=%20%5Cfrac%7B1%7D%7Bp(Q)%7D,%5Cqquad%20Q(0)%20=%20-%5Cinfty.%0A"> This is a <em>very</em> non-linear differential equation. We can make it even more non-linear by differentiating again to get <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5E2Q%7D%7Bdt%5E2%7D%20=%20-%5Cfrac%7B1%7D%7Bp(Q)%5E2%7D%20p'(Q)%5Cfrac%7BdQ%7D%7Bdt%7D.%0A"> Noting that <img src="https://latex.codecogs.com/png.latex?Q'%20=%201/p(Q)"> we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5E2%20Q%7D%7Bdt%5E2%7D%20=%20-%5Cfrac%7Bp'(Q)%7D%7Bp(Q)%7D%20%5Cleft(%5Cfrac%7BdQ%7D%7Bdt%7D%5Cright)%5E2.%0A"> This is a rubbish differential equation, but it has the singular advantage that it doesn’t depend<sup>8</sup> on the normalising constant for <img src="https://latex.codecogs.com/png.latex?p">, which can be useful. The downside is that the boundary conditions are infinite on both ends.</p>
<p>Regardless of that particular challenge, we could use this to build a generic algorithm.</p>
<ol type="1">
<li><p>Sample <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20%5Ctext%7BUnif%7D(0,1)"></p></li>
<li><p>Use a numerical differential equation solver to solve the equation with boundary conditions <img src="https://latex.codecogs.com/png.latex?%0AQ(0)%20=%20-M,%20%5Cquad%20Q(1)%20=%20M%0A"> for some sufficiently large number <img src="https://latex.codecogs.com/png.latex?M"> and return <img src="https://latex.codecogs.com/png.latex?x%20=%20Q(u)"></p></li>
</ol>
<p>This will sample from <img src="https://latex.codecogs.com/png.latex?p(x)"> truncated to <img src="https://latex.codecogs.com/png.latex?%5B-M,%20M%5D">.</p>
<p>I was going to write some python code to do this, but honestly it hurts my soul. So I shan’t.</p>
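<p>Well, maybe a tiny sketch. This version dodges the differential equation entirely: it tabulates the unnormalised density on a grid, builds the CDF with the trapezoid rule, and inverts it by interpolation — the same computation the boundary-value problem would do, only less glamorous:</p>

```python
import numpy as np

def sample_truncated(log_p_unnorm, M=8.0, n_grid=4001, n_samples=10_000, seed=None):
    """Sample from a 1D density known up to a constant, truncated to
    [-M, M], by tabulating the CDF on a grid and inverting it."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-M, M, n_grid)
    lp = log_p_unnorm(x)
    p = np.exp(lp - lp.max())  # the normalising constant cancels below
    # cumulative trapezoid rule -> unnormalised CDF; rescale so F(M) = 1
    F = np.concatenate([[0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * np.diff(x))])
    F /= F[-1]
    u = rng.uniform(size=n_samples)
    return np.interp(u, F, x)  # Q(u) = F^{-1}(u) by linear interpolation

# Standard normal, supplied only up to its normalising constant.
samples = sample_truncated(lambda x: -0.5 * x**2, seed=0)
```

With <img src="https://latex.codecogs.com/png.latex?M%20=%208"> the truncation is invisible for a standard normal, so the samples should have mean near 0 and standard deviation near 1.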
</section>
<section id="transport-maps-a-less-terrible-method-that-works-on-general-densities" class="level3">
<h3 class="anchored" data-anchor-id="transport-maps-a-less-terrible-method-that-works-on-general-densities">Transport maps: A less terrible method that works on general densities</h3>
<p>Outside of one dimension, there is (to the best of my knowledge) no direct solution to the transport problem. That means that we need to construct our own. Thankfully, the glorious <a href="https://aeroastro.mit.edu/people/youssef-m-marzouk/">Youssef Marzouk</a> and a bunch of his collaborators have spent some quality time mapping out this idea. A really nice survey of their results can be found <a href="https://arxiv.org/pdf/1602.05023.pdf">in this paper</a>.</p>
<p>Essentially the idea is that we can try to find the most convenient transport map available to us. In particular, it’s useful to minimise the <em>Kullback-Leibler</em> divergence between the transported version of <img src="https://latex.codecogs.com/png.latex?q"> and the target <img src="https://latex.codecogs.com/png.latex?p">. After a little bit<sup>9</sup> of maths, this is equivalent to maximising <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20q%7D%5Cleft(%5Clog%20p(T(x))%20+%20%5Clog%20%5Cdet%20%5Cnabla%20T(x)%5Cright),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cnabla%20T(x)"> is the Jacobian of <img src="https://latex.codecogs.com/png.latex?T">. To finish the specification of the optimisation problem, it’s enough to consider <em>triangular</em> maps<sup>10</sup> <img src="https://latex.codecogs.com/png.latex?%0AT(x)%20=%20%5Cbegin%7Bpmatrix%7D%20T_1(x_1)%20%5C%5C%20T_2(x_1,x_2)%20%5C%5C%20%5Cvdots%20%5C%5C%20T_d(x_1,%20%5Cldots,%20x_d)%20%5Cend%7Bpmatrix%7D%0A"> with the additional constraint that their Jacobians have positive determinants. Using a triangular map has two distinct advantages: it’s parsimonious and it makes the positive determinant constraint <em>much</em> easier to deal with. Triangular maps are also sufficient for the problem (my man Bogachev <a href="https://iopscience.iop.org/article/10.1070/SM2005v196n03ABEH000882/meta">showed it in 2005</a>).</p>
<p>That said, this can be a somewhat tricky optimisation problem. Youssef and his friends have spilt a lot of ink on this topic. And if you’re the sort of person who just fucking loves a weird optimisation problem, I’m sure you’ve got thoughts. With and without the triangular constraint, this can be parameterised as the composition of a sequence of simple functions, in which case you turn three times and scream <em>neural net</em> and a normalising<sup>11</sup> flow appears.</p>
</section>
<section id="what-if-we-only-have-samples-from-the-target-density" class="level3">
<h3 class="anchored" data-anchor-id="what-if-we-only-have-samples-from-the-target-density">What if we only have samples from the target density</h3>
<p>All of that is very lovely. And quite nice in its context. But what happens if you don’t actually have access to the (unnormalised) log density of the target? What if you only have samples?</p>
<p>The good news is that you’re not shit out of luck. But it’s a bit tricky. And once again, that <a href="https://arxiv.org/pdf/1602.05023.pdf">lovely review paper</a> by Youssef and friends will tell us how to do it.</p>
<p>In particular, they noticed that if you swap the direction of the KL divergence, you get the optimisation problem for the inverse mapping <img src="https://latex.codecogs.com/png.latex?S(x)%20=%20T%5E%7B-1%7D(x)"> that aims to maximise <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20p%7D%5Cleft(%5Clog%20q(S(x))%20+%20%5Clog%20%5Cdet%20%5Cnabla%20S(x)%5Cright)%0A"> where <img src="https://latex.codecogs.com/png.latex?S"> is once again a triangular map subject to the monotonicity constraints <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20S_k%7D%7B%5Cpartial%20x_k%7D%20%3E%200.%0A"> Because we have the freedom to choose the reference density <img src="https://latex.codecogs.com/png.latex?q(x)">, we can choose it to be iid standard normals, in which case we get the optimisation problem <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A&amp;%5Cmin_S%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20p%7D%5Cleft%5B%5Csum_%7Bk%20=%201%7D%5Ed%20%5Cfrac%7B1%7D%7B2%7D%5Cleft(S_k(x_1,%20%5Cldots,%20x_k)%5Cright)%5E2%20-%20%20%5Clog%20%5Cfrac%7B%5Cpartial%20S_k%7D%7B%5Cpartial%20x_k%7D%20%5Cright%5D%5C%5C%0A&amp;%5Ctext%7Bs.t.%7D&amp;%20%5C%5C%0A&amp;%5Cquad%20%5Cfrac%7B%5Cpartial%20S_k%7D%7B%5Cpartial%20x_k%7D%20%3E0%20%5C%5C%0A&amp;%5Cquad%20S%20%5Ctext%7B%20is%20triangular%7D,%0A%5Cend%7Balign*%7D"> which is a convex, separable optimisation problem that can be solved<sup>12</sup> using, for instance, a stochastic gradient method. This can be turned into an unconstrained optimisation problem by <a href="https://joss.theoj.org/papers/10.21105/joss.04843">explicitly parameterising the monotonicity constraint</a>.</p>
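<p>To get a feel for the objective, here is a hypothetical one-dimensional version where the map is restricted to be affine, <img src="https://latex.codecogs.com/png.latex?S(x)%20=%20ax%20+%20b"> with <img src="https://latex.codecogs.com/png.latex?a%20=%20e%5Et"> keeping it monotone. The minimiser is the map that standardises the samples:</p>

```python
import numpy as np

def fit_affine_map(xs, steps=2000, lr=0.1):
    """Fit S(x) = a*x + b with a = exp(t) > 0 (so S is monotone) by
    gradient descent on the sample average of 0.5*S(x)^2 - log S'(x),
    i.e. the map-from-samples objective with a N(0, 1) reference."""
    t, b = 0.0, 0.0
    for _ in range(steps):
        a = np.exp(t)
        s = a * xs + b
        grad_t = np.mean(s * xs) * a - 1.0  # d/dt, since log S'(x) = t
        grad_b = np.mean(s)                 # d/db
        t -= lr * grad_t
        b -= lr * grad_b
    return np.exp(t), b

rng = np.random.default_rng(1)
xs = rng.normal(loc=3.0, scale=2.0, size=50_000)
a, b = fit_affine_map(xs)
# The minimiser is a = 1/sd(xs), b = -mean(xs)/sd(xs): the map that
# standardises the samples, pushing them towards N(0, 1).
```

In higher dimensions each component <img src="https://latex.codecogs.com/png.latex?S_k"> gets its own term like this, which is where the separability comes from.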
<p>The monotonicity of <img src="https://latex.codecogs.com/png.latex?S"> makes the resulting nonlinear solve to compute <img src="https://latex.codecogs.com/png.latex?T%20=%20S%5E%7B-1%7D"> relatively straightforward. In fact, if <img src="https://latex.codecogs.com/png.latex?d"> isn’t too big you can solve this sequentially dimension-by-dimension. But, of course, when you’ve got a lot of parameters this is a poor method and it would make more sense<sup>13</sup> to attack it with some sort of gradient descent method. It might even be worth taking the time to learn the inverse function <img src="https://latex.codecogs.com/png.latex?T%20=%20S%5E%7B-1%7D"> so that it can be applied for, essentially, free.</p>
</section>
<section id="so-does-it-work" class="level3">
<h3 class="anchored" data-anchor-id="so-does-it-work">So does it work?</h3>
<p>To some extent, the answer is <em>yes</em>. This is <em>very much</em> normalising flows in its most embryonic form. They work to some extent. And this presentation makes some of the problems fairly obvious:</p>
<ol type="1">
<li><p>There’s no real guarantee that <img src="https://latex.codecogs.com/png.latex?T"> is going to be a nice smooth map, which means that we may have problems moving beyond the training sample.</p></li>
<li><p>The most natural way to organise the computations is sequential, involving sweeps across the <img src="https://latex.codecogs.com/png.latex?d"> parameters. This can be difficult to parallelise efficiently on modern architectures.</p></li>
<li><p>The complexity of the triangular map is going to depend on the order of variables. This is fine if you’re processing something that is inherently sequential, but if you’re working with image data, this can be challenging.</p></li>
</ol>
<p>Of course, there are a <em>pile</em> of ways that these problems can be overcome in whole or in part. I’d point you to the last five years of ML conference papers. You’re welcome.</p>
</section>
</section>
<section id="continuous-normalising-flows-making-the-problem-easier-by-making-it-harder" class="level2">
<h2 class="anchored" data-anchor-id="continuous-normalising-flows-making-the-problem-easier-by-making-it-harder">Continuous normalising flows: Making the problem easier by making it harder</h2>
<p>A really clever idea, which is related to normalising flows, is to ask <em>what if, instead of looking for a single</em><sup>14</sup> <em>map</em> <img src="https://latex.codecogs.com/png.latex?S(x)%20=%20T%5E%7B-1%7D(x)">, <em>we tried to find a sequence of maps</em> <img src="https://latex.codecogs.com/png.latex?S(x,t)"> <em>that smoothly move from the identity map to the transport map</em>.</p>
<p>This seems like it would be a harder problem. And it is. You need to make an infinite number of maps. But the saving grace is that as <img src="https://latex.codecogs.com/png.latex?t"> changes slightly, the map <img src="https://latex.codecogs.com/png.latex?S(%5Ccdot,%20t)"> is also only going to change slightly. This means that we can parameterise the <em>change</em> relatively simply.</p>
<p>To this end, we write <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20S%7D%7B%5Cpartial%20t%7D%20=%20f(S,%20t),%0A"> for some relatively simple function <img src="https://latex.codecogs.com/png.latex?f"> that models the infinitesimal change in the transport map as we move along the path. The hope is that learning the vector field <img src="https://latex.codecogs.com/png.latex?f"> will be <em>easier</em> than learning <img src="https://latex.codecogs.com/png.latex?S"> directly. To finish the specification, we require that <img src="https://latex.codecogs.com/png.latex?%0AS(x,0)%20=%20x.%0A"></p>
<p>The question is <em>can we learn the function <img src="https://latex.codecogs.com/png.latex?f"> from data?</em> If we can, it will be (relatively) easy to evaluate the transport map for any sample by just solving<sup>15</sup> the differential equation.</p>
<p>It turns out that the map <img src="https://latex.codecogs.com/png.latex?S"> is most useful for <em>training</em> the normalising flow, while <img src="https://latex.codecogs.com/png.latex?T"> is useful for generating samples from the trained model. If we were using the methods in the previous section, we would have had to commit to <em>either</em> modelling <img src="https://latex.codecogs.com/png.latex?S"> <em>or</em> <img src="https://latex.codecogs.com/png.latex?T">. One of the real advantages of the continuous formulation is that we can just as easily solve the equation with the <em>terminal condition</em><sup>16</sup> <img src="https://latex.codecogs.com/png.latex?%0AS(x,1)%20=%20u%0A"> and solve the equation backwards in time to calculate <img src="https://latex.codecogs.com/png.latex?T(u)%20=%20S(x,%200)">! The dynamics of both equations are driven by the vector field <img src="https://latex.codecogs.com/png.latex?f">!</p>
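<p>A minimal sketch of that forwards/backwards trick, with a hand-rolled RK4 integrator and a toy linear vector field standing in for a learned <img src="https://latex.codecogs.com/png.latex?f">:</p>

```python
import numpy as np

def flow(x0, f, t0, t1, n_steps=100):
    """Integrate dS/dt = f(S, t) from t0 to t1 with classic RK4.
    Calling it with t1 < t0 runs the same dynamics backwards in time,
    which is how the inverse map comes along for free."""
    h = (t1 - t0) / n_steps
    s, t = np.asarray(x0, dtype=float), t0
    for _ in range(n_steps):
        k1 = f(s, t)
        k2 = f(s + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(s + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(s + h * k3, t + h)
        s = s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return s

vec_field = lambda s, t: -s            # toy stand-in for a learned f
x = np.array([1.0, -2.0, 3.0])
u = flow(x, vec_field, 0.0, 1.0)       # S(x, 1) = x * exp(-1)
x_back = flow(u, vec_field, 1.0, 0.0)  # backwards in time recovers x
```

The forward solve plays the role of <img src="https://latex.codecogs.com/png.latex?S"> and the backward solve the role of <img src="https://latex.codecogs.com/png.latex?T">, both driven by the one vector field.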
<section id="a-very-quick-introduction-to-inverse-problems" class="level3">
<h3 class="anchored" data-anchor-id="a-very-quick-introduction-to-inverse-problems">A very quick introduction to inverse problems</h3>
<p>It turns out that learning the parameters of differential equations (and other physical models) has a long and storied history in applied mathematics under the name of <em>inverse problems</em>. If that sounds like statistics, you’d be right. It’s statistics, except with no interest in measurement or, classically, uncertainty.</p>
<p>The classic inverse problem framing involves a <em>forward map</em> <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BF%7D(f)(t,%20x)"> that takes as its input some parameters (often a function) and returns the full state of a system (often another function). For instance, the forwards map could be the solution of a partial differential equation like <img src="https://latex.codecogs.com/png.latex?%0A%20%20%5Cfrac%7B%5Cpartial%20S%7D%7B%5Cpartial%20t%7D%20=%20f(S,%20t),%20%5Cqquad%20S(0)%20=%20x.%0A"> The thing that you should notice about this is that the forward map is a) possibly expensive to compute, b) not explicitly known, and c) extremely<sup>17</sup> non-linear.</p>
<p>The problem is specified with <img src="https://latex.codecogs.com/png.latex?n"> data points <img src="https://latex.codecogs.com/png.latex?(t_1,%20x_1,%20y_1),%20%5Cldots,%20(t_n,%20x_n,%20y_n)"> and the aim is to find the function <img src="https://latex.codecogs.com/png.latex?f"> that best fits the data. The traditional choice is to minimise the mean-square error <img src="https://latex.codecogs.com/png.latex?%0A%20%20f%20=%20%5Carg%20%5Cmin_f%20%5Csum_%7Bi=1%7D%5En%20%5Cleft(y_i%20-%20%5Cmathcal%7BF%7D(f)(t_i,x_i)%5Cright)%5E2.%0A"></p>
<p>Now every single one of you will know immediately that this question is both vague and ill-posed. There are <em>many</em> functions <img src="https://latex.codecogs.com/png.latex?f"> that will fit the data. This means that we need to enforce<sup>18</sup> some sort of complexity penalty on <img src="https://latex.codecogs.com/png.latex?f">. This leads to the method known as Tikhonov regularisation<sup>19</sup> <img src="https://latex.codecogs.com/png.latex?%0A%20%20f%20=%20%5Carg%20%5Cmin_%7Bf%20%5Cin%20B%7D%20%5Csum_%7Bi=1%7D%5En%20%5Cleft(y_i%20-%20%5Cmathcal%7BF%7D(f)(t_i,x_i)%5Cright)%5E2%20+%20%5Clambda%5C%7Cf%5C%7C_B%5E2,%0A%20%20"> where <img src="https://latex.codecogs.com/png.latex?B"> is some Banach space and <img src="https://latex.codecogs.com/png.latex?%5Clambda%3E0"> is some tuning parameter.</p>
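<p>When the forward map happens to be linear, the Tikhonov problem has a closed form (ordinary ridge regression; the function-space version swaps the Euclidean norm for the <img src="https://latex.codecogs.com/png.latex?B">-norm), which makes the regularisation easy to poke at:</p>

```python
import numpy as np

def tikhonov(X, y, lam):
    """Closed-form Tikhonov (ridge) estimate for a linear forward map
    F(theta) = X @ theta: argmin ||y - X @ theta||^2 + lam * ||theta||^2."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)

# An ill-posed setup: 20 observations, 50 unknowns, so least squares
# has infinitely many minimisers. The penalty picks out a unique,
# stable one.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 50))
y = X @ rng.normal(size=50)
theta_hat = tikhonov(X, y, lam=1e-6)
```

Cranking <img src="https://latex.codecogs.com/png.latex?%5Clambda"> up trades data fit for a smaller-norm solution, which is the whole bargain.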
<p>As you can imagine, there’s a lot of maths under this about when there is a unique minimum, how the reconstruction behaves as <img src="https://latex.codecogs.com/png.latex?n%5Crightarrow%20%5Cinfty"> and <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Crightarrow%200">, and how the choice of <img src="https://latex.codecogs.com/png.latex?B"> affects the estimation of <img src="https://latex.codecogs.com/png.latex?f">. There is also quite a lot of work<sup>20</sup> looking at how to actually solve these sorts of optimisation problems.</p>
<p>Eventually, the field evolved and people started to realise that it’s actually fairly important to quantify the uncertainty in the estimate. This is … tricky under the Tikhonov regularisation framework, which became a big motivation for <em>Bayesian</em> inverse problems.</p>
<p>As with all Bayesianifications, we just need to turn the above into a likelihood and a prior. Easy. Well, the likelihood part, at least, is easy. If we want to line up with Tikhonov regularisation, we can choose a Gaussian likelihood <img src="https://latex.codecogs.com/png.latex?%0Ay_i%20%5Cmid%20f,%20x_i,%20t_i,%20%5Csigma%20%5Csim%20N(%5Cmathcal%7BF%7D(f)(t_i,x_i),%20%5Csigma%5E2).%0A"></p>
<p>This is familiar to statisticians: the forward model essentially works as a non-standard link function in a generalised linear model. There are two big practical differences. The first one is that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BF%7D"> is <em>very</em> non-linear and almost certainly not monotone. The second problem is that evaluations of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BF%7D"> are typically very<sup>21</sup> expensive. For instance, you may need to solve a system of differential equations. This means that any computational method<sup>22</sup> is going to need to minimise the number of likelihood evaluations.</p>
<p>The choice of prior on <img src="https://latex.codecogs.com/png.latex?f"> can, however, be a bit tricky. The problem is that in most traditional inverse problems <img src="https://latex.codecogs.com/png.latex?f"> is a function<sup>23</sup> and so we need to put a carefully specified prior on it. And there is a lot of really interesting work on what this means in a Bayesian setting. This is really the topic for another blogpost, but it’s certainly an area where you need to be aware of the limitations of different high-dimensional priors and how they perform in various contexts. For instance, if the function you are trying to reconstruct is likely to have a lot of sharp boundaries<sup>24</sup> then you need to make sure that your prior can support functions with sharp boundaries. My little soldier bois<sup>25</sup> don’t, so you need to get more<sup>26</sup> creative.</p>
</section>
<section id="the-likelihood-for-a-normalising-flow" class="level3">
<h3 class="anchored" data-anchor-id="the-likelihood-for-a-normalising-flow">The likelihood for a normalising flow</h3>
<p>Our aim now is to cast the normalising flow idea into the inverse problems framework. To do this, we remember that we begin our flow from a sample from <img src="https://latex.codecogs.com/png.latex?p(x)"> and we then deform it until it becomes a sample from <img src="https://latex.codecogs.com/png.latex?q(u)"> at some known time (which I’m going to choose as <img src="https://latex.codecogs.com/png.latex?t=1">). This means that if <img src="https://latex.codecogs.com/png.latex?x_i%20%5Csim%20p">, then <img src="https://latex.codecogs.com/png.latex?%0AS(x_i,%201)%20%5Csim%20q.%0A"></p>
<p>We can now derive a relationship between <img src="https://latex.codecogs.com/png.latex?p"> and <img src="https://latex.codecogs.com/png.latex?q"> using the change of variables formula. In particular, <img src="https://latex.codecogs.com/png.latex?%0Ap(x%20%5Cmid%20f)%20=%20q(S(x,1))%5Cleft%7C%5Cdet%20%5Cleft(%20%5Cfrac%7Bd%20S(x,1)%7D%7Bdx%20%7D%5Cright)%5Cright%7C,%0A"> which means that our log likelihood will be <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p(x%20%5Cmid%20f)%20=%20%5Clog%20q(S(x,1))%20+%20%5Clog%20%5Cleft%7C%5Cdet%20%5Cleft(%20%5Cfrac%7Bd%20S(x,1)%7D%7Bdx%20%7D%5Cright)%5Cright%7C.%0A"></p>
<p>The log-determinant term looks like it might cause some trouble. If <img src="https://latex.codecogs.com/png.latex?S"> is parameterised as a triangular map it can be written explicitly, but there is, of course, another route.</p>
<p>For notational ease, let’s consider <img src="https://latex.codecogs.com/png.latex?z_t%20=%20S(x,%20t)">, for some <img src="https://latex.codecogs.com/png.latex?t%20%3C1">. Then <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p(z_t%20%5Cmid%20f)%20=%20%5Clog%20q(S(x,1))%20+%20%5Clog%20%5Cleft%7C%5Cdet%20%5Cleft(%20%5Cfrac%7Bd%20S(x,t)%7D%7Bdx%20%7D%5Cright)%5Cright%7C.%0A"> We can differentiate this with respect to <img src="https://latex.codecogs.com/png.latex?t"> to get<sup>27</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20%5Clog%20p(z_t%20%5Cmid%20f)%7D%7B%5Cpartial%20t%7D%20=%20%5Coperatorname%7Btr%7D%5Cleft(%5Cfrac%7Bdf%7D%7Bdx%7D(z_t,t)%5Cright),%0A"> where I used one of those <em>magical</em> vector calculus identities to get that trace. Remembering that <img src="https://latex.codecogs.com/png.latex?S(x,0)%20=%20x">, the log-determinant of the Jacobian at zero is zero and so we get the initial condition <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p(z_0%20%5Cmid%20f)%20=%20%5Clog%20q(S(x,1)).%0A"></p>
<p>The likelihood can be evaluated<sup>28</sup> by solving the system of differential equations <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cfrac%7Bd%20z_t%7D%7Bdt%7D%20&amp;=%20f(z_t,%20t)%20%5C%5C%0A%5Cfrac%7Bd%20%5Cell%7D%7Bdt%7D%20&amp;=%5Coperatorname%7Btr%7D%5Cleft(%5Cfrac%7Bdf%7D%7Bdx%7D(z_t,t)%5Cright)%20%5C%5C%0Az_0%20&amp;=%20x%20%5C%5C%0A%5Cell(0)%20&amp;=%200,%0A%5Cend%7Balign*%7D"> and the log likelihood is evaluated as <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p(x%20%5Cmid%20f)%20=%20%5Clog%20q(z_1)%20+%20%5Cell(1).%0A"></p>
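<p>To make this concrete, here’s a minimal numerical sketch of that augmented system. The vector field <code>f(z, t) = -z</code> is a made-up stand-in for a learned flow, chosen precisely because the answer is also available in closed form; everything else follows the equations above.</p>

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import multivariate_normal

d = 2  # dimension of the state

def f(z, t):
    # Toy vector field standing in for a learned flow: f(z, t) = -z.
    return -z

def trace_df_dx(z, t):
    # For f(z, t) = -z the Jacobian is -I, so its trace is -d.
    return -float(d)

def log_lik(x):
    """log p(x | f) = log q(z_1) + ell(1), via the augmented ODE system."""
    def rhs(t, state):
        z = state[:d]
        return np.concatenate([f(z, t), [trace_df_dx(z, t)]])

    sol = solve_ivp(rhs, (0.0, 1.0), np.concatenate([x, [0.0]]),
                    rtol=1e-10, atol=1e-10)
    z1, ell1 = sol.y[:d, -1], sol.y[d, -1]
    base = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))
    return base.logpdf(z1) + ell1  # log q(z_1) + ell(1)

x = np.array([0.3, -1.2])
numeric = log_lik(x)
# For this linear field everything is closed form: z_1 = x e^{-1}, ell(1) = -d.
base = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))
analytic = base.logpdf(x * np.exp(-1.0)) - d
```

For anything more interesting than this toy field you would, of course, swap in the learned vector field and (usually) a stochastic estimator of the trace.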
<p>It turns out that you can take gradients of the log-likelihood efficiently by solving <a href="https://papers.nips.cc/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf">an augmented system of differential equations</a> that’s twice the size of the original. This allows for all kinds of gradient-driven inferential shenanigans.</p>
</section>
<section id="but-oh-that-complexity" class="level3">
<h3 class="anchored" data-anchor-id="but-oh-that-complexity">But oh that complexity</h3>
<p>One big problem with normalising flows as written is that we only have two pieces of information about the entire trajectory <img src="https://latex.codecogs.com/png.latex?z_t">:</p>
<ol type="1">
<li><p>we know that <img src="https://latex.codecogs.com/png.latex?z(1)%20%5Csim%20q">, and</p></li>
<li><p>we know that <img src="https://latex.codecogs.com/png.latex?z(0)%20%5Csim%20p">.</p></li>
</ol>
<p>We know <em>absolutely nothing</em> about <img src="https://latex.codecogs.com/png.latex?z_t"> outside of those boundary conditions. This means that our model for <img src="https://latex.codecogs.com/png.latex?f"> basically gets to freestyle in those areas.</p>
<p>We can avoid this to some extent by choosing appropriate neural network architectures and/or appropriate penalties in the classical case or priors in the Bayesian case. There’s a whole mini-literature on choosing appropriate penalties.</p>
<p>Just to show how complex it is, let me quickly sketch what <a href="https://arxiv.org/abs/2002.02798">Finlay et al.</a> suggest as a way to keep the dynamics as boring as possible in the information desert. They lean into the literature on optimal transport theory to come up with the double penalty <img src="https://latex.codecogs.com/png.latex?%0A%5Cmin_f%20%5Csum_%7Bi=1%7D%5En%20%5Cleft(-%5Clog%20p(x_i)%20+%20%5Clambda_1%20%5Cint_0%5ET%20%5C%7Cf(S(x_i,s),s)%5C%7C_2%5E2%5C,ds%20+%20%5Clambda_2%20%5Cint_0%5ET%5Cleft%5C%7C%5Cfrac%7Bd%20f(S(x_i,s))%7D%7Bds%7D%5Cright%5C%7C_F%5E2%5C,ds%5Cright),%0A"> where the first term minimises the kinetic energy and, essentially, finds the least exciting path from <img src="https://latex.codecogs.com/png.latex?p"> to <img src="https://latex.codecogs.com/png.latex?q">, while the second term ensures that the Jacobian of <img src="https://latex.codecogs.com/png.latex?f"> doesn’t get too big<sup>29</sup>, which means that the mapping doesn’t have many sharp changes. Both of these penalty terms are designed both to aid generalisation and to make sure the differential equation isn’t unnecessarily difficult for an ODE solver.</p>
<p>A slightly odd feature of these penalties is that they are both data dependent. That suggests that a prior would, probably, require an <em>amount</em> of work. This is work that I don’t feel like doing today. Especially because this blog post isn’t about bloody normalising flows.</p>
</section>
</section>
<section id="diffusion-models" class="level2">
<h2 class="anchored" data-anchor-id="diffusion-models">Diffusion models</h2>
<p>Ok, so normalising flows are cool, but there are a couple of places where they could potentially be improved. There is a <em>long</em> literature on diffusion models, but the one I’m mostly stealing from is <a href="https://arxiv.org/abs/2011.13456">this one by Song et al.&nbsp;(2021)</a>.</p>
<p>Firstly, the vector field <img src="https://latex.codecogs.com/png.latex?f"> <em>directly</em> affects how easy the differential equations are to solve. This means that if <img src="https://latex.codecogs.com/png.latex?f"> is too complicated, it can take a long time to both train the model and generate samples from the trained model. To get around this you need to put fairly strict penalties<sup>30</sup> and/or structural assumptions on <img src="https://latex.codecogs.com/png.latex?f">.</p>
<p>Secondly, we only have information<sup>31</sup> at two ends of the flow. The problem would become <em>a lot</em> easier if we could somehow get information about intermediate states. In the inverse problems literature, there’s a concept of <em>value of information</em> that talks about how useful sampling a particular time point can be in terms of reducing model uncertainty. In general this, or other criteria, can be used to design a set of useful sampling times. I don’t particularly feel like working any of this out but one thing I am fairly certain of is that no optimal design would only have information at <img src="https://latex.codecogs.com/png.latex?t=0"> and <img src="https://latex.codecogs.com/png.latex?t=1">!</p>
<p>Diffusion models fix these two aspects of normalising flows at the cost of both a more complex mathematical formulation and some inexactness<sup>32</sup> around the base distribution <img src="https://latex.codecogs.com/png.latex?q"> when generating new samples.</p>
<section id="diffusions-and-stochastic-differential-equations" class="level3">
<h3 class="anchored" data-anchor-id="diffusions-and-stochastic-differential-equations">Diffusions and stochastic differential equations</h3>
<p>Diffusions are to applied mathematicians what gaffer tape is to<sup>33</sup> a roadie. They are ubiquitous and convenient, and they hold down the fort when nothing else works.</p>
<p>There are a number of diffusions that are familiar in statistics and machine learning. The most famous one is probably the Langevin diffusion <img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20%5Cfrac%7B1%7D%7B2%7D%5Cnabla%20%5Clog%20p(x)%20dt%20+%20%5Csigma%20dW_t,%0A"> which is asymptotically distributed according to <img src="https://latex.codecogs.com/png.latex?p">. This forms the basis of a bunch of MCMC methods as well as some faster, less adjusted methods.</p>
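<p>For the curious, here’s what the unadjusted version of those Langevin methods looks like in practice. This is a quick sketch with a standard normal target (so <code>grad_log_p(x) = -x</code>); the step size and chain length are arbitrary illustrative choices, and there is no Metropolis correction, so the stationary distribution is only approximately the target.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(x):
    # Target p = N(0, 1), so grad log p(x) = -x.
    return -x

eps, n = 0.05, 200_000   # step size and chain length, picked for illustration
x = 0.0
samples = np.empty(n)
for i in range(n):
    # One Euler-Maruyama step of the Langevin diffusion (no MH correction).
    x = x + 0.5 * eps * grad_log_p(x) + np.sqrt(eps) * rng.standard_normal()
    samples[i] = x
```

Adding a Metropolis-Hastings accept/reject step to each proposal turns this into MALA, which removes the discretisation bias at the cost of an extra density evaluation.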
<p>But that is not the only diffusion. Today’s friend is the Ornstein-Uhlenbeck (OU) process, which is the Gaussian process satisfying <img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20-%20%5Cfrac%7B1%7D%7B2%7D%20X_t%20%5C,dt%20+%20%5Csigma%20dW_t.%0A"> The OU process can be thought of as a mean-reverting Brownian motion. As such, it has continuous but nowhere differentiable sample paths.</p>
<p>The stationary distribution of <img src="https://latex.codecogs.com/png.latex?X_t"> is <img src="https://latex.codecogs.com/png.latex?X_%5Cinfty%20%5Csim%20N(0,%20%5Csigma%5E2I)">, where <img src="https://latex.codecogs.com/png.latex?I"> is the identity matrix. In fact, if we <em>start</em> the diffusion at stationarity by setting <img src="https://latex.codecogs.com/png.latex?%0AX_0%20%5Csim%20N(0,%20%5Csigma%5E2I),%0A"> then X_t is a <em>stationary</em> Gaussian process with covariance function <img src="https://latex.codecogs.com/png.latex?%0Ac(t,%20t')%20=%20%5Csigma%5E2e%5E%7B-%5Cfrac%7B1%7D%7B2%7D%20%7Ct-t'%7C%7DI.%0A"></p>
<p>More interesting in our context, however, is what happens if we start the diffusion from a fixed point <img src="https://latex.codecogs.com/png.latex?x"> that will eventually be a sample from <img src="https://latex.codecogs.com/png.latex?p(x)">. In that case, we can solve the linear stochastic differential equation exactly to get <img src="https://latex.codecogs.com/png.latex?%0AX_t%20=%20xe%5E%7B-%5Cfrac%7B1%7D%7B2%7Dt%7D%20+%20%5Csigma%20%5Cint_0%5Et%20e%5E%7B%5Cfrac%7B1%7D%7B2%7D(s-t)%7D%5C,dW_s,%0A"> where the integral on the right hand side can be interpreted<sup>34</sup> as a <a href="https://dansblog.netlify.app/posts/2023-01-21-markov/markov.html#white-noise-and-its-associated-things">white noise integral</a> and so <img src="https://latex.codecogs.com/png.latex?%0AX_t%20%5Csim%20N%5Cleft(xe%5E%7B-%5Cfrac%7B1%7D%7B2%7Dt%7D,%20%5Csigma%5E2%5Cint_0%5Et%20e%5E%7Bs-t%7D%5C,ds%5Cright),%0A"> and the variance is <img src="https://latex.codecogs.com/png.latex?%0A%5Csigma%5E2%5Cint_0%5Et%20e%5E%7Bs-t%7D%5C,ds%20=%20%5Csigma%5E2%20e%5E%7B-t%7D%5Cleft(e%5E%7Bt%7D%20-%201%5Cright)%20=%20%5Csigma%5E2(1-e%5E%7B-t%7D).%0A"> From these equations, we see that the mean of the diffusion hurtles exponentially fast towards zero and the variance moves at the same speed towards <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2">.</p>
<p>More importantly, this means that, given a starting point <img src="https://latex.codecogs.com/png.latex?X_0%20=%20x">, we can generate data from any part of the diffusion <img src="https://latex.codecogs.com/png.latex?X_t">! If we want a sequence of observations from the same trajectory, we can generate them sequentially using the fact that an OU process is a Markov<sup>35</sup> process. This means that we are no longer limited to information at just two points along the trajectory.</p>
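<p>Here’s a sketch of that sequential sampling, using the exact conditional distribution derived above. The starting point and the (deliberately irregular) grid of times are arbitrary choices for illustration:</p>

```python
import numpy as np

def ou_step(x, dt, sigma, rng):
    """Exact draw of X_{t+dt} | X_t = x for dX_t = -X_t/2 dt + sigma dW_t."""
    mean = x * np.exp(-dt / 2.0)
    var = sigma**2 * (1.0 - np.exp(-dt))
    return mean + np.sqrt(var) * rng.standard_normal(np.shape(x))

rng = np.random.default_rng(1)
sigma, x0 = 1.0, 2.0
times = [0.0, 0.3, 0.9, 2.5]            # any grid we like, regular or not
x = np.full(100_000, x0, dtype=float)   # many independent trajectories
for t_prev, t_next in zip(times[:-1], times[1:]):
    x = ou_step(x, t_next - t_prev, sigma, rng)

# By the Markov property, the composed steps must reproduce the marginal
# X_t | X_0 = x0 ~ N(x0 e^{-t/2}, sigma^2 (1 - e^{-t})) at the final time.
t = times[-1]
target_mean = x0 * np.exp(-t / 2.0)
target_var = sigma**2 * (1.0 - np.exp(-t))
```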
</section>
<section id="reversing-the-diffusion" class="level3">
<h3 class="anchored" data-anchor-id="reversing-the-diffusion">Reversing the diffusion</h3>
<p>So far, there is nothing to learn here. The OU process has a known drift and variance, so everything is splendid. It’s even easy to simulate from. The challenge pops up when we try to reverse the diffusion, that is, when we try to <em>remove</em> noise from a sample rather than add noise to it.</p>
<p>In some sense, this shouldn’t be too disgusting. A diffusion is a Markov process and, if we run the Markov process back in time, we still get a Markov process. In fact, we are going to get another diffusion process.</p>
<p>The twist is that the new diffusion process is going to be quite a bit more complex than the original one. The problem is that unless <img src="https://latex.codecogs.com/png.latex?X_0"> comes from a Gaussian distribution, this process will be non-Gaussian, and thus somewhat tricky to find the reverse trajectory of.</p>
<p>To see this, consider <img src="https://latex.codecogs.com/png.latex?s%3Et"> and recall that <img src="https://latex.codecogs.com/png.latex?%0Ap(X_0,%20X_t,%20X_s)%20=%20p(X_s%20%5Cmid%20X_t)p(X_t%20%5Cmid%20X_0)p(X_0)%0A"> and <img src="https://latex.codecogs.com/png.latex?%0Ap(X_t,%20X_s)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20p(X_s%20%5Cmid%20X_t)%20p(X_t%20%5Cmid%20X_0)%20p(X_0)%5C,dX_0.%0A"> The first two terms in that integrand are Gaussian densities and thus their product is a bivariate Gaussian density <img src="https://latex.codecogs.com/png.latex?%0AX_t,%20X_s%20%5Cmid%20X_0%20%5Csim%20N%5Cleft(X_0%5Cbegin%7Bpmatrix%7De%5E%7B-%5Cfrac%7Bt%7D%7B2%7D%7D%5C%5Ce%5E%7B-%5Cfrac%7Bs%7D%7B2%7D%7D%5Cend%7Bpmatrix%7D,%20%5Csigma%5E2%20%5Cbegin%7Bpmatrix%7D%201-e%5E%7B-t%7D%20&amp;%20e%5E%7B-%5Cfrac%7Bs-t%7D%7B2%7D%7D%20-%20e%5E%7B-%5Cfrac%7Bs+t%7D%7B2%7D%7D%20%5C%5C%20e%5E%7B-%5Cfrac%7Bs-t%7D%7B2%7D%7D%20-%20e%5E%7B-%5Cfrac%7Bs+t%7D%7B2%7D%7D%20&amp;%201-e%5E%7B-s%7D%5Cend%7Bpmatrix%7D%5Cright).%0A"> Unfortunately, as <img src="https://latex.codecogs.com/png.latex?X_0"> is not Gaussian, the marginal distribution will be non-Gaussian. This means that our reverse time transition density <img src="https://latex.codecogs.com/png.latex?%0Ap(X_t%20%5Cmid%20X_s)%20=%20%5Cfrac%7B%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20p(X_t,X_s%20%5Cmid%20X_0)%20p(X_0)%5C,dX_0%7D%7B%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20p(X_s%20%5Cmid%20X_0)%20p(X_0)%5C,dX_0%7D%0A"> is also going to be <em>very</em> non-Gaussian.</p>
<p>In order to work out a stochastic differential equation that runs backwards in time and generates the same trajectory, we need a little bit of theory on how the unconditional density <img src="https://latex.codecogs.com/png.latex?p(X_t)"> and the transition density <img src="https://latex.codecogs.com/png.latex?p(X_t%20%5Cmid%20X_s)"> evolve in time <img src="https://latex.codecogs.com/png.latex?t"> (here, and everywhere else, <img src="https://latex.codecogs.com/png.latex?s%3Et">). These are related through the Kolmogorov equations.</p>
<p>To introduce these, we need to briefly consider the more general diffusion <img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20f(X_t,%20t)dt%20+%20g(X_t,t)dW_t%0A"> for nice<sup>36</sup> vector/matrix-valued functions <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g">. Kolmogorov showed that the unconditional density <img src="https://latex.codecogs.com/png.latex?p(X_t)%20=%20p(x,t)"> evolves according to the partial differential equation <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20p(x,t)%7D%7B%5Cpartial%20t%7D%20=%20-%20%5Csum_%7Bi=1%7D%5Ed%20%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft(f_i(x,t)p(x,t)%5Cright)%20+%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k%20=%201%7D%5Ed%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%20%5Cpartial%20x_j%7D%5Cleft(%20g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)p(x,t)%5Cright)%0A"> subject to the initial condition <img src="https://latex.codecogs.com/png.latex?%0Ap(x,0)%20=p(x).%0A"> This is known as Kolmogorov’s forward equation or the Fokker-Planck equation.</p>
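<p>As a quick numerical sanity check, the OU transition density from earlier should satisfy this forward equation with <code>f(x, t) = -x/2</code> and <code>g = sigma</code>. The following sketch verifies that at a single point with finite differences; the starting point, evaluation point, time, and step size are all just illustrative choices.</p>

```python
import numpy as np

sigma, x0 = 1.0, 2.0

def m(t):
    return x0 * np.exp(-t / 2.0)            # mean of X_t | X_0 = x0

def v(t):
    return sigma**2 * (1.0 - np.exp(-t))    # variance of X_t | X_0 = x0

def p(x, t):
    # Gaussian transition density of the OU process started at x0.
    return np.exp(-(x - m(t))**2 / (2 * v(t))) / np.sqrt(2 * np.pi * v(t))

x, t, h = 0.7, 1.3, 1e-4
# Left-hand side: time derivative of the density.
lhs = (p(x, t + h) - p(x, t - h)) / (2 * h)
# Right-hand side: -d/dx[f p] + (1/2) d^2/dx^2[g^2 p], with f = -x/2, g = sigma.
flux = lambda y: (-y / 2.0) * p(y, t)
d_flux = (flux(x + h) - flux(x - h)) / (2 * h)
d2_p = (p(x + h, t) - 2 * p(x, t) + p(x - h, t)) / h**2
rhs = -d_flux + 0.5 * sigma**2 * d2_p
```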
<p>The other key result is about the density of <img src="https://latex.codecogs.com/png.latex?X_t"> <em>conditioned on some future value</em> <img src="https://latex.codecogs.com/png.latex?X_s%20=%20u">, <img src="https://latex.codecogs.com/png.latex?s%20%5Cgeq%20t">. We write this density as <img src="https://latex.codecogs.com/png.latex?p(X_s%20=u%5Cmid%20X_t%20=x)%20=%20q(x,t;%20u,s)"> and it satisfies the partial differential equation <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20t%7D%20=%20-%5Csum_%7Bi=1%7D%5Ed%20f_i(x,t)%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20x_i%7D%20-%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%20g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cfrac%7B%5Cpartial%5E2%20q(x,t;u,s)%7D%7B%5Cpartial%20x_i%5Cpartial%20x_j%7D%0A"> subject to the <em>terminal</em> condition <img src="https://latex.codecogs.com/png.latex?%0Aq(x,s;u,s)%20=%20%5Cdelta(x-u).%0A"> This is known as the Kolmogorov backward equation. Great names. Beautiful names.</p>
<p>Let’s consider a differential equation for the joint density <img src="https://latex.codecogs.com/png.latex?%0Ap(X_t%20=%20x,%20X_s=%20y)%20=%20p(x,t,u,s)%20=%20q(x,t;u,s)p(x,t).%0A"> Going ham with the product rule gives <span id="eq-diff1"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cfrac%7B%5Cpartial%20p(x,t,u,s)%7D%7B%5Cpartial%20t%7D%20&amp;=%20p(x,%20t)%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20t%7D%20+%20q(x,t;u,s)%20%5Cfrac%7B%5Cpartial%20p(x,t)%7D%7B%5Cpartial%20t%7D%20%5C%5C%0A&amp;=-%5Csum_%7Bi=1%7D%5Ed%20p(x,t)f_i(x,t)%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20x_i%7D%20-%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bijk%7D%20p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%20%5Cfrac%7B%5Cpartial%5E2%20q(x,t;u,s)%7D%7B%5Cpartial%20x_i%20%5Cpartial%20x_j%7D%20%5C%5C%20&amp;%5Cqquad-%5Csum_%7Bi=1%7D%5Edq(x,t;u,s)%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D(p(x,t)f(x,t))%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Csum_%7Bijk%7Dq(x,t;u,s)%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%20%5Cpartial%20x_j%7D(g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)p(x,t))%20.%0A%5Cend%7Balign*%7D%0A%5Ctag%7B1%7D"></span> The first-order derivatives simplify, using the product rule, to <img src="https://latex.codecogs.com/png.latex?%0A-%5Csum_%7Bi=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D(p(x,t,u,s)f(x,t))%0A"></p>
<p>Staring at this for a moment, we notice that this has the same structure as the first-order term in the forward equation. In that case, the second-order term would be <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A&amp;%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%20x_j%7D%5Bp(x,t,u,s)%20g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5D%20=%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%20x_j%7D%5Bq(x,t;u,s)%20(p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t))%5D%20%5C%5C%0A&amp;%5Cqquad%5Cqquad=%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft%5B%20q(x,t;u,s)%5Cfrac%7B%5Cpartial%20%7D%7B%5Cpartial%20x_j%7D%5Cleft(p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright)%20+%20p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%20%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20x_j%7D%5Cright%5D%0A%5Cend%7Balign*%7D%0A"></p>
<p>If we notice that <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft%5Bp(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%20%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20x_j%7D%5Cright%5D%20=&amp;%20%20p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%20%5Cfrac%7B%5Cpartial%5E2%20q(x,t;u,s)%7D%7B%5Cpartial%20x_i%20%5Cpartial%20x_j%7D%20%5C%5C%0A&amp;%5Cquad+%20%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%20%5Cleft%5Bp(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright%5D%5Cleft%5B%20%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%20%5Cpartial%20x_j%7D%5Cright%5D%0A%5Cend%7Balign*%7D%0A"> and <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft%5B%20q(x,t;u,s)%5Cfrac%7B%5Cpartial%20%7D%7B%5Cpartial%20x_j%7D%5Cleft(p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright)%5Cright%5D%20=&amp;%20%20q(x,t;u,s)%5Cfrac%7B%5Cpartial%5E2%20%7D%7B%5Cpartial%20x_i%20%5Cpartial%20x_j%7D%20p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%20%5C%5C%0A&amp;%5Cquad+%20%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%20%5Cleft%5Bp(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright%5D%5Cleft%5B%20%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%20%5Cpartial%20x_j%7D%5Cright%5D%0A%5Cend%7Balign*%7D%0A"> we can re-write the second-order derivative terms in Equation&nbsp;1 as <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft%5B%20q(x,t;u,s)%5Cfrac%7B%5Cpartial%20%7D%7B%5Cpartial%20x_j%7D%5Cleft(p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright)%20-%20p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%20%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20x_j%7D%5Cright%5D%0A"></p>
<p>This is almost, but not quite, what we want. We are a single minus sign away. Remembering that <img src="https://latex.codecogs.com/png.latex?q(x,t;u,s)%20=%20p(x,t,u,s)/p(x,t)"> we probably don’t want it to turn up in any derivatives<sup>37</sup>. To this end, let’s make the substitution <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft%5B%20p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%20%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20x_j%7D%5Cright%5D%0A=&amp;%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%5Cpartial%20x_j%7D%5Bp(x,t,u,s)%20g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5D%5C%5C%0A&amp;%20-%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft%5B%20q(x,t;u,s)%5Cfrac%7B%5Cpartial%20%7D%7B%5Cpartial%20x_j%7D%5Cleft(p(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright)%20%5Cright%5D.%0A%5Cend%7Balign*%7D%0A"> With this substitution the second order terms are <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bi=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Cleft%5B%20p(x,t,u,s)%20h_i(x,t)%5Cright%5D%20-%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%5Cpartial%20x_j%7D%5Bp(x,t,u,s)%20g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5D,%0A"> where <img src="https://latex.codecogs.com/png.latex?%0Ah_i(x,t)%20=%20%5Cfrac%7B1%7D%7Bp(x,t)%7D%5Csum_%7Bj,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_j%7D%5Cleft%5Bp(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright%5D.%0A"></p>
<p>If we write <img src="https://latex.codecogs.com/png.latex?%0A%5B%5Cbar%7Bf%7D(x,t)%5D_i%20=%20f_i(x,t)%20-%20h_i(x,t)%20=%20f_i(x,t)%20-%20%5Cfrac%7B1%7D%7Bp(x,t)%7D%5Csum_%7Bj,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_j%7D%5Cleft%5Bp(x,t)g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5Cright%5D,%0A"> we get the joint PDE <span id="eq-diff2"><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20p(x,t,u,s)%7D%7B%5Cpartial%20t%7D%20=%20-%5Csum_%7Bi=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Bp(x,t,u,s)%5Cbar%7Bf%7D(x,t)%5D%20-%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%5Cpartial%20x_j%7D%5Bp(x,t,u,s)%20g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5D.%0A%5Ctag%7B2%7D"></span></p>
<p>In order to identify the reverse time diffusion, we are going to find the reverse time backward equation, which, confusingly, is for <img src="https://latex.codecogs.com/png.latex?%0Aq(u,s;%20x,t)%20=%5Cfrac%7Bp(X_t%20=%20x,%20X_s%20=u)%7D%7Bp(X_s%20=u)%7D%20=%5Cfrac%7Bp(x,t,u,s)%7D%7Bp(u,s)%7D.%0A"> As <img src="https://latex.codecogs.com/png.latex?p(u,s)"> is a constant in both <img src="https://latex.codecogs.com/png.latex?x"> and <img src="https://latex.codecogs.com/png.latex?t">, we can divide both sides of Equation&nbsp;2 by it to get <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20q(x,t;u,s)%7D%7B%5Cpartial%20t%7D%20=%20-%5Csum_%7Bi=1%7D%5Ed%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20x_i%7D%5Bq(x,t;u,s)%5Cbar%7Bf%7D(x,t)%5D%20-%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi,j,k=1%7D%5Ed%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20x_i%5Cpartial%20x_j%7D%5Bq(x,t;u,s)%20g_%7Bik%7D(x,t)g_%7Bjk%7D(x,t)%5D.%0A"> where again <img src="https://latex.codecogs.com/png.latex?s%3Et"> and <img src="https://latex.codecogs.com/png.latex?s"> and <img src="https://latex.codecogs.com/png.latex?u"> are known.</p>
<p>This is the forward Kolmogorov equation for the time-reversed<sup>38</sup> diffusion <img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20%5Cbar%7Bf%7D(X_t,%20t)dt%20+%20g(X_t,%20t)d%5Ctilde%7BW%7D_t,%20%5Cqquad%20X_s%20=%20u,%0A"> where <img src="https://latex.codecogs.com/png.latex?d%20%5Ctilde%7BW%7D_t"> is another white noise. <a href="https://core.ac.uk/download/pdf/82826666.pdf">Anderson (1982)</a> shows how to connect the white noise <img src="https://latex.codecogs.com/png.latex?dW_t"> that’s driving the forward dynamics with the white noise that’s driving the reverse dynamics <img src="https://latex.codecogs.com/png.latex?d%5Ctilde%7BW%7D_t">, but that’s overkill for our present situation.</p>
<p>In the context of an OU process, we get the reverse equation <img src="https://latex.codecogs.com/png.latex?%0AdX_t=%20-%5Cleft%5B%5Cfrac%7B1%7D%7B2%7D%20X_t%20+%20%5Csigma%5E2%20%5Cnabla%20%20%5Clog%20p(X_t,%20t)%5Cright%5D%5C,dt%20+%20%5Csigma%5C,%20dW_t,%0A"> where time runs backwards and I’ve used the formula for the logarithmic derivative.</p>
<p>Unlike the forward process, the reverse process is the solution to a <em>non-linear</em> stochastic differential equation. In general, this cannot be solved in closed form and we need to use a numerical SDE solver to generate a sample.</p>
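<p>To see the reverse SDE in action without learning anything, here’s a sketch for the one case where the score is available exactly: a Gaussian starting distribution. The N(1.5, 0.25) “data” distribution, the horizon, and the solver settings are all invented for illustration. A plain Euler–Maruyama scheme run backwards from (approximately) the stationary distribution should recover the starting distribution:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, m0, v0 = 1.0, 1.5, 0.25       # Gaussian "data" distribution N(m0, v0)
T, n_steps, n = 8.0, 800, 50_000

def marginal(t):
    # With a Gaussian start, p(x, t) stays Gaussian with these moments.
    return m0 * np.exp(-t / 2.0), v0 * np.exp(-t) + sigma**2 * (1.0 - np.exp(-t))

def score(x, t):
    mt, vt = marginal(t)
    return -(x - mt) / vt            # exact score of the Gaussian marginal

dt = T / n_steps
x = rng.normal(0.0, sigma, size=n)   # X_T is approximately N(0, sigma^2)
for k in range(n_steps):
    t = T - k * dt
    # Euler-Maruyama step of the reverse SDE, with time running from T to 0:
    # dX = -[X/2 + sigma^2 * score(X, t)] dt + sigma dW, dt < 0.
    x = (x + (0.5 * x + sigma**2 * score(x, t)) * dt
         + sigma * np.sqrt(dt) * rng.standard_normal(n))
```

With a learned score this same loop is exactly how samples get generated; the Gaussian case just lets us check that the samples really do end up distributed like the data.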
<p>It’s worth noting that the OU process is an overly simple cartoon of a diffusion model. In practice, <img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%20%5Csigma_t"> is usually an increasing function of time so the system injects more noise as the diffusion moves along. This changes some of the exact equations slightly, but you can still sample <img src="https://latex.codecogs.com/png.latex?X_t%20%5Cmid%20X_0"> analytically for any <img src="https://latex.codecogs.com/png.latex?t"> (as long as you choose a fairly simple function for <img src="https://latex.codecogs.com/png.latex?%5Csigma_t">). There is a <em>large</em> literature on these choices and, to be honest, I can’t be bothered going through them here. But obviously if you want to implement a diffusion model yourself you should look this stuff up.</p>
</section>
<section id="estimating-the-score" class="level3">
<h3 class="anchored" data-anchor-id="estimating-the-score">Estimating the score</h3>
<p>The reverse dynamics are driven by the score function <img src="https://latex.codecogs.com/png.latex?%0As_t(x)%20=%20%5Cnabla%20%5Clog(p(x,t)).%0A"> Typically, we do not know the density <img src="https://latex.codecogs.com/png.latex?p(x,t)%20=%20p(X_t=%20x%20%5Cmid%20X_0%20=%20x_0)"> and while we could solve the forward equation in order to estimate it, that is wildly inefficient in high dimensions.</p>
<p>If we can assume that for each <img src="https://latex.codecogs.com/png.latex?t">, <img src="https://latex.codecogs.com/png.latex?X_t%20%5Cmid%20X_0=x_0"> is approximately <img src="https://latex.codecogs.com/png.latex?N(%5Cmu_t,%20%5CSigma_t)">, then the resulting reverse diffusion is linear <img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20%5Cleft%5B%5CSigma_t%5E%7B-1%7D%5Cmu_t%20-%5Cleft(%5Cfrac%7B1%7D%7B2%7D%20I%20+%20%5Csigma%5E2%5CSigma_t%5E%7B-1%7D%20%5Cright)X_t%5Cright%5Ddt%20+%20%5Csigma%20dW_t,%20%5Cqquad%20X_T%20=%20u.%0A"> In this case <img src="https://latex.codecogs.com/png.latex?X_t%20%5Cmid%20X_T%20=%20u"> is Gaussian with a mean and covariance that have closed-form solutions in terms of <img src="https://latex.codecogs.com/png.latex?%5CSigma_t"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_t"> (perhaps after some numerical quadrature and matrix exponentials).</p>
<p>Unfortunately, as discussed above, this is not true. A better approximation would be a mixture of Gaussians but, in general, we can use <em>any</em> method to approximate <img src="https://latex.codecogs.com/png.latex?%0As_t(x).%0A"> There are no particular constraints on it, except we expect it to be fairly smooth<sup>39</sup> in both <img src="https://latex.codecogs.com/png.latex?t"> and <img src="https://latex.codecogs.com/png.latex?x">. Hence, we can just learn the score.</p>
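<p>Here’s a small sketch of what “just learn the score” can look like in the simplest possible setting: one-dimensional Gaussian data, a linear score model, and the standard denoising trick of regressing the <em>conditional</em> score on the noised sample (the least-squares minimiser of that regression is the marginal score). All of the numbers are invented for illustration, and the linear model is only exact because the data are Gaussian.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
m0, v0, sigma, t, n = 1.5, 0.25, 1.0, 0.7, 400_000

# Forward OU samples: x_t = x0 e^{-t/2} + sqrt(v_t) eps, v_t = sigma^2(1 - e^{-t}).
x0 = rng.normal(m0, np.sqrt(v0), size=n)
vt = sigma**2 * (1.0 - np.exp(-t))
eps = rng.standard_normal(n)
xt = x0 * np.exp(-t / 2.0) + np.sqrt(vt) * eps

# Denoising score matching: regress the conditional score -eps/sqrt(v_t) on x_t.
# The least-squares minimiser of this regression is the marginal score s_t(x).
target = -eps / np.sqrt(vt)
A = np.column_stack([xt, np.ones(n)])
(a, b), *_ = np.linalg.lstsq(A, target, rcond=None)

# For Gaussian data the marginal is N(m_t, V_t), so s_t(x) = -(x - m_t) / V_t;
# the fitted slope and intercept should recover -1/V_t and m_t/V_t.
Vt = v0 * np.exp(-t) + vt
mt = m0 * np.exp(-t / 2.0)
```

In a real diffusion model the linear regression is replaced by a neural network fit jointly over <code>(x, t)</code>, but the objective is the same.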
<p>As we are going to solve the SDE numerically, we only need to estimate the score at a finite set of locations. In every application that I’ve seen, these are pre-specified; however, it would also be possible to use a basis function expansion to interpolate to arbitrary time points. But, to be honest, I think every single example I’ve seen just uses a regularly spaced grid.</p>
<p>So how do we estimate <img src="https://latex.codecogs.com/png.latex?s_t">? Well, just like every other situation, we need to define a likelihood (or, I guess, an optimisation criterion). One way to think about this would be to note that you’ll never <em>perfectly</em> recover the initial signal. This is because we need to solve a non-linear stochastic differential equation and there will, inherently, be noise in that solution. So instead, assume that we have an initial sample <img src="https://latex.codecogs.com/png.latex?x_0%20%5Csim%20p(X_0)"> and that after solving the backward equation we have an unbiased estimator of <img src="https://latex.codecogs.com/png.latex?x_0"> with standard deviation <img src="https://latex.codecogs.com/png.latex?%5Ctau_N">, where <img src="https://latex.codecogs.com/png.latex?N"> is the number of time steps. We know a lot about how the error of SDE solvers scales with <img src="https://latex.codecogs.com/png.latex?N"> and so we can use that to set an appropriate scale for <img src="https://latex.codecogs.com/png.latex?%5Ctau_N">. For instance, if you’re using the Euler–Maruyama method, then it has strong order <img src="https://latex.codecogs.com/png.latex?1/2"> and <img src="https://latex.codecogs.com/png.latex?%5Ctau_N%20=%20%5Cmathcal%7BO%7D(N%5E%7B-1/2%7D)"> would likely be an appropriate scaling.</p>
<p>This strongly suggests a likelihood that looks like <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BX%7D_0(x_0,%20t)%20%5Cmid%20s_t,%20x_0,%20t%20%5Csim%20N(x_0,%20%5Ctau_N%5E2),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Chat%7BX%7D_0(x_0,t)"> is the estimate of <img src="https://latex.codecogs.com/png.latex?X_0"> you get by running the reverse diffusion conditioned on <img src="https://latex.codecogs.com/png.latex?%5Chat%7BX%7D_t%20=%20X_t(x_0)">, where <img src="https://latex.codecogs.com/png.latex?X_t(x_0)"> is an exact sample at time <img src="https://latex.codecogs.com/png.latex?t"> from the forward diffusion started at <img src="https://latex.codecogs.com/png.latex?X_0%20=%20x_0">.</p>
<p>This is the key to the success of diffusion models: given our training sample <img src="https://latex.codecogs.com/png.latex?%5C%7Bx_0%5E%7B(i)%7D%5C%7D_%7Bi=1%7D%5En">, we generate new data <img src="https://latex.codecogs.com/png.latex?x_t(x_0)"> and we can generate as much of that data as we want. Furthermore, we can choose any set of <img src="https://latex.codecogs.com/png.latex?t">s we want. We can sample a single <img src="https://latex.codecogs.com/png.latex?(t,%20x_0)"> pair multiple times, or we can spread our sampling across a diverse range of pairs.</p>
<p>We can even try to recover an intermediate state <img src="https://latex.codecogs.com/png.latex?%5Chat%7BX%7D_%7Bt_1%7D(x_0,t_2)"> from information about a future state <img src="https://latex.codecogs.com/png.latex?X_%7Bt_2%7D(x_0)">, <img src="https://latex.codecogs.com/png.latex?t_2%20%3Et_1%20%5Cgeq%200">. This gives us quite the opportunity to target our learning to areas of the <img src="https://latex.codecogs.com/png.latex?(t,x)"> space where we have relatively poor estimates of the score function.</p>
<p>Of course, that’s not what people do. They do stochastic gradient descent to minimise <img src="https://latex.codecogs.com/png.latex?%0A%5Cmin_%7Bs_t%7D%5Cmathbb%7BE%7D_%7Bx_0%20%5Csim%20p(X_0),%20t%20%5Csim%20%5Ctext%7BUnif%7D%5B0,1%5D%7D%5Cleft(%5C%7Cx_0%20-%20%5Chat%7BX%7D_0(x_0,t)%5C%7C%5E2%5Cright)%0A"> possibly subject to some penalties on <img src="https://latex.codecogs.com/png.latex?s_t">. In fact, the distribution on <img src="https://latex.codecogs.com/png.latex?t"> is usually a discrete uniform. As with any sufficiently complex task, there is a lot of detailed work on exactly how to best parameterise, solve, and evaluate this optimisation procedure.</p>
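<p>To make that training loop concrete: most implementations minimise the closely related <em>denoising score matching</em> objective, where the regression target comes from the forward noise rather than a full reverse solve. Here is a minimal numpy sketch (all variable names are my own, and the linear score model is a stand-in for a neural net) that fits the score at a single time point for Gaussian data, where the answer is known in closed form.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: p(X_0) = N(2, 0.5^2). Forward OU process dX = -X dt + sqrt(2) dW,
# so X_t | X_0 = x0 ~ N(x0 * exp(-t), 1 - exp(-2t)).
m0, s0 = 2.0, 0.5
x0 = rng.normal(m0, s0, size=50_000)

t = 0.3
alpha = np.exp(-t)
sigma = np.sqrt(1.0 - np.exp(-2.0 * t))
z = rng.normal(size=x0.shape)
xt = alpha * x0 + sigma * z

# Denoising score matching at fixed t: min_s E[(s(x_t) + z / sigma)^2].
# With a linear model s(x) = a*x + b this is just ordinary least squares.
X = np.column_stack([xt, np.ones_like(xt)])
y = -z / sigma
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

# The marginal X_t ~ N(m_t, v_t) has score s_t(x) = -(x - m_t) / v_t,
# so the regression should recover a ~ -1/v_t and b ~ m_t/v_t.
m_t = m0 * alpha
v_t = (s0 * alpha) ** 2 + sigma**2
print(a, -1.0 / v_t)
print(b, m_t / v_t)
```

<p>Because the data here are Gaussian the marginal score is exactly linear, so least squares recovers it; with real data you would replace the linear model with a neural net and the exact solve with stochastic gradient steps over random <img src="https://latex.codecogs.com/png.latex?(t,%20x_0,%20z)"> draws.</p>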
</section>
<section id="generating-samples" class="level3">
<h3 class="anchored" data-anchor-id="generating-samples">Generating samples</h3>
<p>Once the model is trained and we have an estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bs%7D_t"> of the score function, we can generate new samples by first sampling <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20N(0,%20%5Csigma%5E2)"> and running the reverse diffusion starting from <img src="https://latex.codecogs.com/png.latex?X_t%20=%20u"> for some sufficiently large <img src="https://latex.codecogs.com/png.latex?t">. One of the advantages of using a variant of the OU process with a non-constant <img src="https://latex.codecogs.com/png.latex?%5Csigma"> is that we can choose <img src="https://latex.codecogs.com/png.latex?t"> to be smaller. Nevertheless, there will always be a little bit of error introduced by the fact that <img src="https://latex.codecogs.com/png.latex?X_t"> is only <em>approximately</em> <img src="https://latex.codecogs.com/png.latex?N(0,%20%5Csigma%5E2)">. But really, in the context of all of the other errors, this one is pretty small.</p>
<p>Anyway, run the diffusion backwards and if you’ve estimated <img src="https://latex.codecogs.com/png.latex?s_t(x)"> well for the entire trajectory, you will get something that looks a lot like a new sample from <img src="https://latex.codecogs.com/png.latex?p(X_0)">.</p>
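<p>As a sanity check on the whole recipe, here is a numpy sketch (a toy example of my own) that runs Euler–Maruyama on the reverse-time SDE for the OU forward process, using the <em>exact</em> score of a Gaussian <img src="https://latex.codecogs.com/png.latex?p(X_0)"> so that the only errors left are the time discretisation and the approximate Gaussian start.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Target p(X_0) = N(2, 0.5^2). The forward OU process dX = -X dt + sqrt(2) dW
# has an analytic marginal X_t ~ N(m_t, v_t), so the score is known exactly.
m0, s0 = 2.0, 0.5

def score(x, t):
    m_t = m0 * np.exp(-t)
    v_t = s0**2 * np.exp(-2.0 * t) + 1.0 - np.exp(-2.0 * t)
    return -(x - m_t) / v_t

T, n_steps, n_samples = 5.0, 2000, 20_000
dt = T / n_steps

# Start from the (approximate) stationary distribution N(0, 1)...
x = rng.normal(0.0, 1.0, size=n_samples)

# ...and run the reverse-time SDE dX = [X + 2 s_{T - tau}(X)] dtau + sqrt(2) dW
# forward in tau using Euler-Maruyama.
for k in range(n_steps):
    t = T - k * dt
    x = x + (x + 2.0 * score(x, t)) * dt + np.sqrt(2.0 * dt) * rng.normal(size=n_samples)

print(x.mean(), x.std())  # should be near (2.0, 0.5)
```

<p>With an estimated score the same loop applies; the quality of the samples then depends on how well <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bs%7D_t"> tracks the true score along the whole trajectory.</p>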
</section>
</section>
<section id="some-closing-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="some-closing-thoughts">Some closing thoughts</h2>
<p>So there you have it, a very high-level mathematical introduction to diffusion models. Along the way, I accidentally put them in some sort of historical context, which hopefully helped make some things clearer.</p>
<p>Obviously there are <em>a lot</em> of cool things that can happen. The ability to, essentially, design our training trajectories should definitely be utilised. To do that, we would need some measure of uncertainty in the recovery of <img src="https://latex.codecogs.com/png.latex?s_t">. A possible way to do this would be to insert a <a href="https://arxiv.org/abs/1812.03973">probabilistic layer</a> into the neural net architecture. If this isn’t the final layer in the network, it should be possible to clean up any artifacts it introduces with further layers, but the uncertainty estimates from this hidden layer would still be indicative of the uncertainty in the recovery of the scores. Assuming, of course, that this is successful, it would be possible to target the training at improving the uncertainty.</p>
<p>Beyond the possibility of using a non-uniform distribution for <img src="https://latex.codecogs.com/png.latex?t">, these uncertainty estimates might also help indicate the reliability of the generated sample. If the reverse diffusion spends too much time in areas with highly uncertain scores, it is unlikely that the generated data will be a good sample.</p>
<p>I am also somewhat curious about whether or not this type of system could be a reasonable alternative to bootstrap resampling in some contexts. I mean image creation is cool, but it’s not the only time people want to sample from a distribution that we only know empirically.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Maybe my favourite running gag was Ronny Chieng refusing to use the American pronunciation of Megan. ↩︎</p></li>
<li id="fn2"><p>I mean, my last post was recounting literature on the Markov property from the 70s and 80s. My only desire for this blog is for it to be very difficult to guess the topic of the next post.↩︎</p></li>
<li id="fn3"><p>I can’t stress enough that I made that tomato and feta tiktok pasta for dinner. Because that’s exactly how on trend I am.↩︎</p></li>
<li id="fn4"><p>I am very much managing expectations here↩︎</p></li>
<li id="fn5"><p>I cannot stress enough that this post will not help you implement a diffusion model. It might help you understand what is being implemented, but it also might not.↩︎</p></li>
<li id="fn6"><p>Really fucking relative.↩︎</p></li>
<li id="fn7"><p>Find a lesbian and follow her blog. Then you’ll get the good shit. There are tonnes of queer women in statistics. If you don’t know any it’s because they probably hate you.↩︎</p></li>
<li id="fn8"><p>The wokerati among you will notice that the quotient is the derivative of <img src="https://latex.codecogs.com/png.latex?%5Clog%20p(Q)">.↩︎</p></li>
<li id="fn9"><p>Look. I love you all. But I don’t want to introduce measure push-forwards. So if you want the maths read the damn paper.↩︎</p></li>
<li id="fn10"><p>This is the Knothe-Rosenblatt rearrangement of the optimal transport problem if you’re curious. And let’s face it, you’re not curious.↩︎</p></li>
<li id="fn11"><p>The normalising flow literature also has a lot of nice chats about how to model the <img src="https://latex.codecogs.com/png.latex?T_j">s using masked versions of the same neural net.↩︎</p></li>
<li id="fn12"><p>If you don’t have too much data, you could just replace that expectation with its empirical approximation. But when there is a lot of data, that will be expensive and stochastic gradient methods will perform better.↩︎</p></li>
<li id="fn13"><p>And be more likely to appropriately use your computational resources↩︎</p></li>
<li id="fn14"><p>We will see later that it doesn’t matter if we model <img src="https://latex.codecogs.com/png.latex?T"> or <img src="https://latex.codecogs.com/png.latex?S">, but the likelihood calculations come out nicer if we map from <img src="https://latex.codecogs.com/png.latex?p(x)"> to <img src="https://latex.codecogs.com/png.latex?q(u)"> rather than the other way around↩︎</p></li>
<li id="fn15"><p>There is a tonne of excellent software for efficiently solving differential equations!↩︎</p></li>
<li id="fn16"><p>My notation here is a bit awkward. The <img src="https://latex.codecogs.com/png.latex?x"> in <img src="https://latex.codecogs.com/png.latex?S(x,t)"> is keeping track of the <em>initial condition</em>, which in this case we do not know. But hey. Whatever.↩︎</p></li>
<li id="fn17"><p>Potentially even multi-modal↩︎</p></li>
<li id="fn18"><p>Classically this is done with a penalty, but you could also do it with things like early stopping and specific representations of the function. Which is nice because the continuous nomalising flow people use neural nets↩︎</p></li>
<li id="fn19"><p>The square on the norm isn’t always there↩︎</p></li>
<li id="fn20"><p>This was a big-sexy area in optimisation.↩︎</p></li>
<li id="fn21"><p>or at least a lot more expensive than, say, evaluating an exponential!↩︎</p></li>
<li id="fn22"><p>If you’re familiar with scalable ML methods, you might think <em>well we have solved this problem</em>. But I promise that it is not solved. The problem is that there’s no convenient analogue to subsampling the data. You can’t be half pregnant and you can’t half evaluate the forward map. There are, however, a pile of fabulous techniques that do their best to use multiple resolutions to get something that resembles a sensible MCMC scheme.↩︎</p></li>
<li id="fn23"><p>In our context, it’s a vector-valued function↩︎</p></li>
<li id="fn24"><p>Examples abound, but they include image reconstruction, tomographic inversion, and really anything where you’re estimating diffusivity↩︎</p></li>
<li id="fn25"><p>Gaussian processes↩︎</p></li>
<li id="fn26"><p>But not necessarily too creative. Not every transformation of a penalty makes a sensible prior. I’m looking at you <a href="http://www.siltanen-research.net/publ/LassasSiltanen2004.pdf">lasso on increments</a>.↩︎</p></li>
<li id="fn27"><p>Using the “well known” fact that the derivative of the log-determinant is the trace ↩︎</p></li>
<li id="fn28"><p>There are some complexities in practice around computing that trace. A straightforward implementation would require <img src="https://latex.codecogs.com/png.latex?d"> autodiff sweeps, which would make the model totally impractical. There are basically two options: <a href="https://arxiv.org/abs/1912.03579">massively simplify</a> <img src="https://latex.codecogs.com/png.latex?f"> to be something like <img src="https://latex.codecogs.com/png.latex?f(x)%20=%20h(Ax%20+%20b)"> for a smooth function <img src="https://latex.codecogs.com/png.latex?h"> or use a stochastic trace estimator.↩︎</p></li>
<li id="fn29"><p>Measured in the Frobenius norm, of course↩︎</p></li>
<li id="fn30"><p>or priors↩︎</p></li>
<li id="fn31"><p>data + distributional assumptions = information↩︎</p></li>
<li id="fn32"><p><img src="https://latex.codecogs.com/png.latex?q"> will be the asymptotic distribution of the diffusion, but it isn’t achieved at finite time.↩︎</p></li>
<li id="fn33"><p>Arguably, gradient descent is to machine learners what arse crack is to roadies. It’s always present, but with just enough variation to make it interesting.↩︎</p></li>
<li id="fn34"><p>Technically it’s an Ito integral, but because the integrand is deterministic it reduces to a white noise integral↩︎</p></li>
<li id="fn35"><p>The Markov property implies that <img src="https://latex.codecogs.com/png.latex?p(X_%7Bt_1%7D,%20X_%7Bt_2%7D%5Cmid%20X_0%20=%20x)%20=%20p(X_%7Bt_1%7D%5Cmid%20X_0%20=%20x)p(X_%7Bt_2%7D%20%5Cmid%20X_%7Bt_1%7D)">. ↩︎</p></li>
<li id="fn36"><p>Lipschitz and bounded↩︎</p></li>
<li id="fn37"><p>I hate the quotient rule↩︎</p></li>
<li id="fn38"><p>This is why the signs don’t seem to match the forwards equation from before, but you can convince yourself if you do the change of variables <img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%20s%20-%20t">, the new variable <img src="https://latex.codecogs.com/png.latex?%5Ctau"> runs forward in time and <img src="https://latex.codecogs.com/png.latex?%5Cbar%7Bf%7D"> switches signs, which gives the right forwards equations (with different signs on the first and second order terms) in <img src="https://latex.codecogs.com/png.latex?(%5Ctau,x)">.↩︎</p></li>
<li id="fn39"><p>If the <img src="https://latex.codecogs.com/png.latex?p(X_0)"> is very rough, then, for very small <img src="https://latex.codecogs.com/png.latex?t">, <img src="https://latex.codecogs.com/png.latex?p(x,t)"> will also be quite rough but it will quickly become infinitely differentiable. It turns out that mathematicians know quite a lot about parabolic equations!↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2023,
  author = {Simpson, Dan},
  title = {Diffusion Models; or {Yet} Another Way to Sample from an
    Arbitrary Distribution},
  date = {2023-02-09},
  url = {https://dansblog.netlify.app/posts/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2023" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2023. <span>“Diffusion Models; or Yet Another Way to
Sample from an Arbitrary Distribution.”</span> February 9, 2023. <a href="https://dansblog.netlify.app/posts/">https://dansblog.netlify.app/posts/</a>.
</div></div></section></div> ]]></description>
  <category>Diffusion model</category>
  <category>Introductions</category>
  <guid>https://dansblog.netlify.app/posts/2023-01-30-diffusion/diffusion.html</guid>
  <pubDate>Wed, 08 Feb 2023 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2023-01-30-diffusion/megan.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Markovian Gaussian processes: A lot of theory and some practical stuff</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2023-01-21-markov/markov.html</link>
  <description><![CDATA[ 





<p>Gaussian processes are lovely things. I’m a big fan. They are, however, thirsty. They will take your memory, your time, and anything else they can. Basically, the art of fitting Gaussian process models is the fine art of reducing the GP model until it’s simple enough to fit while still being flexible enough to be useful.</p>
<p>There’s a long literature on effective approximations to Gaussian processes that don’t turn out to be computational nightmares. I’m definitely not going to summarise them here, but I’ll point to an <a href="https://dansblog.netlify.app/posts/2021-11-24-getting-into-the-subspace/getting-into-the-subspace.html">earlier (quite technical) post</a> that mentioned some of them. The particular computational approximation that I am most fond of makes use of the Markov property and efficient sparse matrix computations to reduce memory use and make the linear algebra operations significantly faster.</p>
<p>One of the odder challenges with Markov models is that information about how Markov structures work in more than one dimension can be quite difficult to find. So in this post I am going to lay out some of the theory.</p>
<p>A much more practical (and readable) introduction to this topic can be found in this <a href="https://arxiv.org/abs/2111.01084">lovely paper by Finn, David, and Håvard</a>. So don’t feel the burning urge to read this post if you don’t want to. I’m approaching the material from a different viewpoint and, to be very frank with you, I was writing something else and this section just became extremely long so I decided to pull it out into a blog post.</p>
<p>So please enjoy today’s entry in <em>Dan writes about the weird corners of Gaussian processes</em>. I promise that even though this post doesn’t make it seem like this stuff is useful, it really is. If you want to know anything else about this topic, essentially all of the Markov property parts of this post come from Rozanov’s excellent book <a href="https://link.springer.com/book/10.1007/978-1-4613-8190-7">Markov Random Fields</a>.</p>
<section id="gaussian-processes-via-the-covariance-operator" class="level2">
<h2 class="anchored" data-anchor-id="gaussian-processes-via-the-covariance-operator">Gaussian processes via the covariance operator</h2>
<p>By the end of today’s post we will have defined<sup>1</sup> a Markovian process in terms of its reproducing kernel Hilbert space (RKHS), that is, the space of functions that contains the posterior mean<sup>2</sup> when there are Gaussian observations. This space always exists and its inner product is entirely determined by the covariance function of the GP. That said, for a given covariance function, the RKHS can be difficult to find. Furthermore, the problem with basing our modelling on an RKHS is that it is not immediately obvious how we will do the associated computations. This is in contrast to a covariance function approach, where it is quite easy<sup>3</sup> to work out how to convert the model specification into something you can attack with a computer. By the end of this post we will have tackled that.</p>
<p>The extra complexity of the RKHS pays off in modelling flexibility, both in terms of the types of model that can be built and the spaces<sup>4</sup> you can build them on. I am telling you this now because things are about to get a little mathematical.</p>
<p>To motivate the technique, let’s consider the covariance operator <img src="https://latex.codecogs.com/png.latex?%0A%5B%5Cmathcal%7BC%7Df%5D(s)%20=%20%5Cint_T%20c(s,%20s')%20f(s')%20%5C,%20ds',%0A"> where <img src="https://latex.codecogs.com/png.latex?T"> is the domain over which the GP is defined (usually <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed"> but maybe you’re feeling frisky).</p>
<p>To see how this could be useful, we are going to need to think a little bit about how we can simulate a multivariate Gaussian random variable <img src="https://latex.codecogs.com/png.latex?N(0,%20%5CSigma)">. To do this, we first compute the square root<sup>5</sup> <img src="https://latex.codecogs.com/png.latex?L%20=%20%5CSigma%5E%7B1/2%7D"> and sample a vector of iid standard normal variables <img src="https://latex.codecogs.com/png.latex?z%20%5Csim%20N(0,I)">. Then <img src="https://latex.codecogs.com/png.latex?u%20=%20Lz%20%5Csim%20N(0,%20%5CSigma)">. You can verify this by computing the covariance. (It’s ok. I’ll wait.)</p>
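<p>The finite-dimensional recipe is a couple of lines of numpy. A quick sketch (the covariance matrix here is an arbitrary choice of mine); note that any matrix with <img src="https://latex.codecogs.com/png.latex?LL%5ET%20=%20%5CSigma">, for instance the Cholesky factor, works as the square root.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A small SPD covariance matrix (squared-exponential kernel on 4 points).
s = np.linspace(0.0, 1.0, 4)
Sigma = np.exp(-0.5 * (s[:, None] - s[None, :]) ** 2 / 0.3**2)

L = np.linalg.cholesky(Sigma)          # one valid square root: Sigma = L L^T
z = rng.standard_normal((4, 100_000))  # columns of iid N(0, I) vectors
u = L @ z                              # each column is a draw from N(0, Sigma)

# The promised check: the empirical covariance matches Sigma.
print(np.abs(np.cov(u) - Sigma).max())
```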
<p>While the square root of the covariance operator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B1/2%7D"> is a fairly straightforward mathematical object<sup>6</sup>, the analogue of the iid vector of standard normal random variables is a bit more complex.</p>
<section id="white-noise-and-its-associated-things" class="level3">
<h3 class="anchored" data-anchor-id="white-noise-and-its-associated-things">White noise and its associated things</h3>
<p>Thankfully I’ve covered this <a href="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5.html#part-2-an-invitation-to-the-theory-of-stationary-gaussian-processes">in a previous blog</a>. The engineering definition of white noise as a GP <img src="https://latex.codecogs.com/png.latex?w(%5Ccdot)"> such that for every <img src="https://latex.codecogs.com/png.latex?s">, <img src="https://latex.codecogs.com/png.latex?w(s)"> is an iid <img src="https://latex.codecogs.com/png.latex?N(0,1)"> random variable is not good enough for our purposes. Such a process is hauntingly irregular<sup>7</sup> and it’s fairly difficult to actually do anything with it. Instead, we consider white noise as a random function defined on the subsets of our domain. This feels like it’s just needless technicality, but it turns out to actually be very very useful.</p>
<div id="def-white-noise" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 1 (White noise)</strong></span> A (complex) Gaussian white noise is a random measure<sup>8</sup> <img src="https://latex.codecogs.com/png.latex?W(%5Ccdot)"> such that every<sup>9</sup> pair of disjoint<sup>10</sup> sets <img src="https://latex.codecogs.com/png.latex?A,%20B"> satisfies the following properties:</p>
<ol type="1">
<li><img src="https://latex.codecogs.com/png.latex?W(A)%20%5Csim%20N(0,%20%7CA%7C)"></li>
<li>If <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B"> are disjoint then <img src="https://latex.codecogs.com/png.latex?W(A%5Ccup%20B)%20=%20W(A)%20+%20W(B)"></li>
<li>If <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B"> are disjoint then <img src="https://latex.codecogs.com/png.latex?W(A)"> and <img src="https://latex.codecogs.com/png.latex?W(B)"> are uncorrelated<sup>11</sup>, ie <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(W(A)%20%5Coverline%7BW(B)%7D)%20=%200">.</li>
</ol>
</div>
<p>This doesn’t feel like we are helping very much because how on <em>earth</em> am I going to define the product <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B1/2%7D%20W">? Well the answer, you may be shocked to discover, requires a little bit more maths. We need to define an integral, which turns out to not be <em>shockingly</em> difficult to do. The trick is to realise that if I have an indicator function <img src="https://latex.codecogs.com/png.latex?%0A1_A(s)%20=%20%5Cbegin%7Bcases%7D%201,%20%5Cqquad%20&amp;s%20%5Cin%20A%20%5C%5C%200,%20&amp;%20s%20%5Cnot%20%5Cin%20A%20%5Cend%7Bcases%7D%0A"> then<sup>12</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_T%201_A(s)%5C,%20dW(s)%20=%20%5Cint_A%20dW(s)%20=%20W(A)%20%5Csim%20N(0,%20%7CA%7C).%0A"> In that calculation, I just treated <img src="https://latex.codecogs.com/png.latex?W(s)"> like I would any other measure. (If you’re more of a probability type of girl, it’s the same thing as noticing <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(1_A(X))%20=%20%5CPr(X%20%5Cin%20A)">.)</p>
<p>We can extend the above by taking the sum of two indicator functions <img src="https://latex.codecogs.com/png.latex?%0Af(s)%20=%20f_1%201_%7BA_1%7D(s)%20+%20f_2%201_%7BA_2%7D(s),%0A"> where <img src="https://latex.codecogs.com/png.latex?A_1"> and <img src="https://latex.codecogs.com/png.latex?A_2"> are disjoint and <img src="https://latex.codecogs.com/png.latex?f_1"> and <img src="https://latex.codecogs.com/png.latex?f_2"> are any real numbers. By the same reasoning above, and using the linearity of the integral, we get that <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cint_T%20f(s)%20%5C,%20dW(s)%20&amp;=%20f_1%20%5Cint_%7BA_1%7D%20%5C,d%20W(s)%20+%20f_2%20%5Cint_%7BA_2%7D%20%5C,d%20W(s)%20%5C%5C%0A&amp;=%20N(0,%20f_1%5E2%20%7CA_1%7C%20+%20f_2%5E2%20%7CA_2%7C)%20%5C%5C%0A&amp;=%20N%5Cleft(0,%20%5Cint_T%20f(s)%5E2%20%5C,ds%5Cright),%0A%5Cend%7Balign*%7D"> where the last line follows by doing the ordinary<sup>13</sup> integral of <img src="https://latex.codecogs.com/png.latex?f(s)%5E2">.</p>
<p>It turns out that every interesting function can be written as the limit of piecewise constant functions<sup>14</sup> and we can therefore <em>define</em> for any function<sup>15</sup> <img src="https://latex.codecogs.com/png.latex?f%5Cin%20L%5E2(T)"> <img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20f(s)%20%5C,%20dW(s)%20%5Csim%20N%5Cleft(0,%20%5Cint_T%20f(s)%5E2%20%5C,ds%5Cright).%0A"></p>
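<p>This definition is easy to check numerically. A short sketch (grid and sample sizes are my own choices) that builds <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E1%20f(s)%5C,dW(s)"> from independent increments over small disjoint cells, exactly as in the piecewise-constant construction above:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate I = \int_0^1 f(s) dW(s) with f frozen on a fine grid: the
# white-noise increments over disjoint cells are independent N(0, |cell|).
n_cells, n_reps = 1000, 50_000
ds = 1.0 / n_cells
s = (np.arange(n_cells) + 0.5) * ds
f = s  # any L^2 function will do; here \int_0^1 f(s)^2 ds = 1/3

dW = rng.normal(0.0, np.sqrt(ds), size=(n_reps, n_cells))
I = dW @ f  # one realisation of the integral per row

print(I.mean(), I.var())  # theory says I ~ N(0, 1/3)
```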
<p>With this notion in hand, we can finally define the action of an operator on white noise.</p>
<div id="def-operator-on-noise" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 2 (The action of an operator on white noise)</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BA%7D"> be an operator on some Hilbert space of functions <img src="https://latex.codecogs.com/png.latex?H"> with adjoint <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BA%7D%5E*">, then we define <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BA%7DW"> to be the random measure that satisfies, for every <img src="https://latex.codecogs.com/png.latex?f%20%5Cin%20%5Coperatorname%7BDom%7D(%5Cmathcal%7BA%5E*%7D)">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_T%20f(s)%20%5C,%20d%20(%5Cmathcal%7BA%7DW)(s)%20=%20%5Cint_T%20%5Cmathcal%7BA%7D%5E*f(s)%20%5C,%20dW(s).%0A"></p>
</div>
</section>
<section id="the-generalised-gaussian-process-eta-mathcalc12w" class="level3">
<h3 class="anchored" data-anchor-id="the-generalised-gaussian-process-eta-mathcalc12w">The generalised Gaussian process <img src="https://latex.codecogs.com/png.latex?%5Ceta%20=%20%5Cmathcal%7BC%7D%5E%7B1/2%7DW"></h3>
<p>One of those inconvenient things that you may have noticed from above is that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B1/2%7DW"> is <em>not</em> going to be a function. It is going to be a measure or, as it is more commonly known, a <em>generalised Gaussian process</em>. This is the GP analogue of a generalised function and, as such, only gives an actual value when you integrate it against some sufficiently smooth function.</p>
<div id="def-generalised-gp" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 3 (Generalised Gaussian Process)</strong></span> A generalised Gaussian process <img src="https://latex.codecogs.com/png.latex?%5Cxi"> is a random signed measure (or a random generalised function) such that, for any <img src="https://latex.codecogs.com/png.latex?f%20%5Cin%20C%5E%5Cinfty_0(T)">, <img src="https://latex.codecogs.com/png.latex?%5Cint_T%20f(s)%5C,d%5Cxi(s)"> is Gaussian. We will often write <img src="https://latex.codecogs.com/png.latex?%0A%5Cxi(f)%20=%20%5Cint_T%20f(s)%5C,d%5Cxi(s),%0A"> which helps us understand that a generalised GP is indexed by functions.</p>
</div>
<p>In order to separate this out from the ordinary GP <img src="https://latex.codecogs.com/png.latex?u(s)">, we will write it as <img src="https://latex.codecogs.com/png.latex?%0A%5Ceta%20=%20%5Cmathcal%7BC%7D%5E%7B1/2%7DW.%0A"> These two ideas coincide in the special case where <img src="https://latex.codecogs.com/png.latex?%0A%5Ceta%20=%20u(s)%5C,ds,%0A"> which will occur when <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B1/2%7D"> smooths the white noise sufficiently. In all of the cases we really care about today, this happens. But there are plenty of Gaussian processes that can only be considered as generalised GPs<sup>16</sup>.</p>
</section>
<section id="approximating-gps-when-mathcalc-12-is-a-differential-operator" class="level3">
<h3 class="anchored" data-anchor-id="approximating-gps-when-mathcalc-12-is-a-differential-operator">Approximating GPs when <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1/2%7D"> is a differential operator</h3>
<p>This type of construction for <img src="https://latex.codecogs.com/png.latex?%5Ceta"> is used in two different situations: kernel convolution methods directly use the representation, and the SPDE methods of <a href="https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2011.00777.x">Lindgren, Lindström and Rue</a> use it indirectly.</p>
<p>I’m interested in the SPDE method, as it ties into today’s topic. Also because it works really well. This method uses a slightly modified version of the above equation <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BC%7D%5E%7B-1/2%7D%5Ceta%20=%20W,%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1/2%7D"> is the (left) inverse of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B1/2%7D">. I have covered this method <a href="https://dansblog.netlify.app/posts/2021-11-24-getting-into-the-subspace/getting-into-the-subspace.html#example-3-the-spde-method">in a previous post</a>, but to remind you the SPDE method in its simplest form involves three steps:</p>
<ol type="1">
<li><p>Approximate <img src="https://latex.codecogs.com/png.latex?%5Ceta%20=%20%5Csum_%7Bj=1%7D%5En%20u_j%20%5Cpsi_j(s)%5C,ds"> for some set of weights <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20N(0,%20Q%5E%7B-1%7D)"> and a set of deterministic functions <img src="https://latex.codecogs.com/png.latex?%5Cpsi_j"> that we are going to use to approximate the GP</p></li>
<li><p>Approximate<sup>17</sup> the <em>test function</em> <img src="https://latex.codecogs.com/png.latex?f%20=%20%5Csum_%7Bk=1%7D%5En%20f_k%20%5Cpsi_k(s)"> for some set of deterministic weights <img src="https://latex.codecogs.com/png.latex?f_j"></p></li>
<li><p>Plug these approximations into the equation <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1/2%7D%20%5Ceta%20=%20W"> to get the equation <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bk,j=1%7D%5En%20u_j%20f_k%20%5Cint_T%20%5Cpsi_k(s)%20%5Cmathcal%7BC%7D%5E%7B-1/2%7D%20%5Cpsi_j(s)%5C,ds%20%5Csim%20N%5Cleft(0,%20%5Csum_%7Bj,k=1%7D%5En%20f_j%20f_k%20%5Cint_T%20%5Cpsi_j(s)%5Cpsi_k(s)%5C,ds%5Cright)%0A"></p></li>
</ol>
<p>As this has to be true for <em>every</em> vector <img src="https://latex.codecogs.com/png.latex?f">, this is equivalent to the linear system <img src="https://latex.codecogs.com/png.latex?%0AK%20u%20%5Csim%20N(0,%20C),%0A"> where <img src="https://latex.codecogs.com/png.latex?K_%7Bkj%7D%20=%20%20%5Cint_T%20%5Cpsi_k(s)%20%5Cmathcal%7BC%7D%5E%7B-1/2%7D%20%5Cpsi_j(s)%5C,ds"> and <img src="https://latex.codecogs.com/png.latex?C_%7Bkj%7D%20=%20%5Cint_T%20%5Cpsi_k(s)%5Cpsi_j(s)%5C,ds">.</p>
<p>Obviously this method is only going to be useful if it’s possible to compute the elements of <img src="https://latex.codecogs.com/png.latex?K"> and <img src="https://latex.codecogs.com/png.latex?C"> efficiently. In the special case where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1/2%7D"> is a differential operator<sup>18</sup> and the basis functions are chosen to have compact support<sup>19</sup>, these calculations form the basis of the finite element method for solving partial differential equations.</p>
<p>The most important thing, however, is that if <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1/2%7D"> is a differential operator <em>and</em> the basis functions have compact support, the matrix <img src="https://latex.codecogs.com/png.latex?K"> is sparse and the matrix <img src="https://latex.codecogs.com/png.latex?C"> can be made<sup>20</sup> diagonal, which means that <img src="https://latex.codecogs.com/png.latex?%0Au%20%5Csim%20N(0,%20K%5E%7B-1%7D%20C%20K%5E%7B-T%7D)%0A"> has a sparse precision matrix. This can be used to make inference with these GPs very efficient and is the basis for GPs in the <a href="http://r-inla.org">INLA software</a>.</p>
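<p>A minimal sketch of what those matrices look like in one dimension, assuming the operator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1/2%7D%20=%20%5Ckappa%5E2%20-%20%5CDelta"> and piecewise-linear hat functions on a uniform mesh (boundary conditions are ignored and all names are mine):</p>

```python
import numpy as np

# A 1D sketch: take C^{-1/2} = kappa^2 - d^2/ds^2 on [0, 1] with piecewise
# linear "hat" basis functions on a uniform mesh.
n, kappa = 50, 3.0
h = 1.0 / (n - 1)

# Mass matrix M_kj = \int psi_k psi_j and stiffness matrix G_kj =
# \int psi_k' psi_j' are tridiagonal: only neighbouring hats overlap.
m_main = np.full(n, 2.0 * h / 3.0); m_main[[0, -1]] = h / 3.0
m_off = np.full(n - 1, h / 6.0)
M = np.diag(m_main) + np.diag(m_off, 1) + np.diag(m_off, -1)

g_main = np.full(n, 2.0 / h); g_main[[0, -1]] = 1.0 / h
g_off = np.full(n - 1, -1.0 / h)
G = np.diag(g_main) + np.diag(g_off, 1) + np.diag(g_off, -1)

K = kappa**2 * M + G               # K u ~ N(0, C) with C = M
C_lumped = np.diag(M.sum(axis=1))  # "mass lumping" makes C diagonal

# Precision of u ~ N(0, K^{-1} C K^{-T}) is Q = K^T C^{-1} K: banded, so sparse.
Q = K.T @ np.linalg.solve(C_lumped, K)
bandwidth = max(abs(i - j) for i in range(n) for j in range(n)
                if abs(Q[i, j]) > 1e-12)
print(bandwidth)  # 2: each weight only interacts with near neighbours
```

<p>In real implementations <img src="https://latex.codecogs.com/png.latex?K"> and <img src="https://latex.codecogs.com/png.latex?Q"> would be stored as sparse matrices; dense numpy arrays are used here only to keep the sketch short.</p>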
<p>A natural question to ask is <em>when will we end up with a sparse precision matrix</em>? The answer is not quite when <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1/2%7D"> is a differential operator. Although that will lead to a sparse precision matrix (and a Markov process), it is not required. So the purpose of the rest of this post is to quantify all of the cases where a GP has the Markov property and we can make use of the resulting computational savings.</p>
</section>
</section>
<section id="the-markov-property-for-on-abstract-spaces" class="level2">
<h2 class="anchored" data-anchor-id="the-markov-property-for-on-abstract-spaces">The Markov property on abstract spaces</h2>
<p>Part of the reason why I introduced the notion of a generalised Gaussian process is that it is useful in the definition of the Markov process. Intuitively, we know what this definition is going to be: if I split my space into three disjoint sets <img src="https://latex.codecogs.com/png.latex?A">, <img src="https://latex.codecogs.com/png.latex?%5CGamma"> and <img src="https://latex.codecogs.com/png.latex?B"> in such a way that you can’t get from <img src="https://latex.codecogs.com/png.latex?A"> to <img src="https://latex.codecogs.com/png.latex?B"> without passing through <img src="https://latex.codecogs.com/png.latex?%5CGamma">, then the Markov property should say, roughly, that every random variable <img src="https://latex.codecogs.com/png.latex?%5C%7Bx(s):%20s%5Cin%20A%5C%7D"> is conditionally independent of every random variable <img src="https://latex.codecogs.com/png.latex?%5C%7Bx(s):%20s%20%5Cin%20B%5C%7D"> <em>given</em> (or conditional on) knowing the values of the entire set <img src="https://latex.codecogs.com/png.latex?%5C%7Bx(s):%20s%20%5Cin%20%5CGamma%5C%7D">.</p>
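<p>Before the measure theory, here is the finite-dimensional version of that intuition as a numpy sketch (toy numbers mine): for a Gaussian vector with a tridiagonal precision matrix, conditioning on a separating index renders the two sides conditionally independent.</p>

```python
import numpy as np

# A Markov (AR(1)-style) Gaussian vector: tridiagonal precision matrix Q.
n = 7
Q = (np.diag(np.full(n, 2.0))
     + np.diag(np.full(n - 1, -0.9), 1)
     + np.diag(np.full(n - 1, -0.9), -1))

# Split the indices into A | Gamma | B, where Gamma = {3} separates the sides.
A, Gamma, B = [0, 1, 2], [3], [4, 5, 6]

# For a Gaussian, the conditional precision of (x_A, x_B) given x_Gamma is the
# corresponding sub-block of Q, so the conditional covariance is its inverse.
idx = A + B
cond_cov = np.linalg.inv(Q[np.ix_(idx, idx)])

# The A-B cross block vanishes: x_A and x_B are conditionally independent
# given the separator x_Gamma.
cross = cond_cov[:len(A), len(A):]
print(np.abs(cross).max())
```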
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2023-01-21-markov/markov.png" class="img-fluid figure-img"></p>
<figcaption>A graphical illustration of the three sets used in the Markov property.</figcaption>
</figure>
</div>
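<p>Before the measure-theoretic machinery kicks in, the discrete version of this picture is easy to check numerically. The following sketch (a hypothetical Gaussian vector with tridiagonal precision, so a Markov chain) verifies that the variables to the left of a single separating point are conditionally uncorrelated with the variables to the right:</p>

```python
# Sketch: the intuitive Markov property on a finite index set. With a
# tridiagonal precision matrix Q, the separator G = {3} splits A = {0,1,2}
# from B = {4,5,6}. The conditional covariance of (A, B) given G, computed
# by the Schur complement formula, vanishes.
import numpy as np

n = 7
Q = np.diag([2.0] * n) - 0.9 * (np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1))
Sigma = np.linalg.inv(Q)

A, G, B = [0, 1, 2], [3], [4, 5, 6]
cond_cov = Sigma[np.ix_(A, B)] - Sigma[np.ix_(A, G)] @ np.linalg.solve(
    Sigma[np.ix_(G, G)], Sigma[np.ix_(G, B)]
)
print(np.max(np.abs(cond_cov)))  # zero up to floating point
```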
<p>That definition is all well and good for a hand-wavey approach, but unfortunately it doesn’t quite hold up to mathematics. In particular, if we try to make <img src="https://latex.codecogs.com/png.latex?%5CGamma"> a line<sup>21</sup>, we will hit a few problems. So instead let’s do this properly.</p>
<p>All of the material here is covered in Rozanov’s excellent but unimaginatively named book <em>Markov Random Fields</em>.</p>
<p>To set us up, we should consider the types of sets we have. There are three main sets that we are going to be using: the open<sup>22</sup> set <img src="https://latex.codecogs.com/png.latex?S_1%20%5Csubset%20T">, a set <img src="https://latex.codecogs.com/png.latex?%5CGamma%20%5Csupseteq%20%5Cpartial%20S_1"> containing its boundary<sup>23</sup>, and its open complement <img src="https://latex.codecogs.com/png.latex?S_2%20=%20S_1%5EC%20%5Cbackslash%20%5CGamma">. For a 2D example, if <img src="https://latex.codecogs.com/png.latex?T%20=%20%5Cmathbb%7BR%7D%5E2"> and <img src="https://latex.codecogs.com/png.latex?S_1"> is the <em>interior</em> of the unit circle, then <img src="https://latex.codecogs.com/png.latex?%5CGamma"> could be the unit circle, and <img src="https://latex.codecogs.com/png.latex?S_2"> would be the <em>exterior</em> of the unit circle.</p>
<p>One problem with these sets, is that while <img src="https://latex.codecogs.com/png.latex?S_1"> will be a 2D set, <img src="https://latex.codecogs.com/png.latex?%5CGamma"> is only one dimensional (it’s a circle, so it’s a line!). This causes some troubles mathematically, which we need to get around by using the <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> fattening of <img src="https://latex.codecogs.com/png.latex?%5CGamma">, which is the set <img src="https://latex.codecogs.com/png.latex?%0A%5CGamma%5E%5Cepsilon%20=%20%5C%7Bs%20%5Cin%20T%20:%20d(s,%20%5CGamma)%20%3C%20%5Cepsilon%5C%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?d(s,%20%5CGamma)"> is the distance from <img src="https://latex.codecogs.com/png.latex?s"> to the nearest point in <img src="https://latex.codecogs.com/png.latex?%5CGamma">.</p>
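<p>The fattened set is easy to picture in code. A throwaway sketch for the unit-circle example (where the distance to <img src="https://latex.codecogs.com/png.latex?%5CGamma"> has a closed form):</p>

```python
# Sketch: membership in the epsilon-fattening of Gamma = the unit circle.
# The fattening is a 2D annulus of width 2*epsilon, so it has positive
# area even though Gamma itself does not.
import numpy as np

def in_fattening(s, eps):
    # d(s, Gamma) for the unit circle is | ||s|| - 1 |
    return abs(np.linalg.norm(s) - 1.0) < eps

print(in_fattening(np.array([1.05, 0.0]), 0.1))  # inside the annulus
print(in_fattening(np.array([2.0, 0.0]), 0.1))   # well outside it
```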
<p>With all of this in hand we can now give a general definition of the Markov property.</p>
<div id="def-markov" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 4 (The Markov property for a generalised Gaussian process)</strong></span> Consider a zero mean generalised GP<sup>24</sup> <img src="https://latex.codecogs.com/png.latex?%5Cxi">. For any<sup>25</sup> subset <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20T">, we define the collection of random variables<sup>26</sup> <img src="https://latex.codecogs.com/png.latex?%0AH(A)%20=%20%5Coperatorname%7Bspan%7D%5C%7B%5Cxi(f):%20%5Coperatorname%7Bsupp%7D(f)%20%5Csubseteq%20A%5C%7D.%0A"> We will call <img src="https://latex.codecogs.com/png.latex?%5C%7BH(A);%20A%20%5Csubseteq%20T%5C%7D"> the <em>random field</em><sup>27</sup> associated with <img src="https://latex.codecogs.com/png.latex?%5Cxi">.</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> be a system of domains<sup>28</sup> in <img src="https://latex.codecogs.com/png.latex?T">. We say that <img src="https://latex.codecogs.com/png.latex?%5Cxi"> has the Markov<sup>29</sup> property (with respect to <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D">) if, for all <img src="https://latex.codecogs.com/png.latex?S_1%20%5Cin%20%5Cmathcal%7BG%7D"> and for any sufficiently small <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%3E%200">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(xy%20%5Cmid%20H(%5CGamma%5E%5Cepsilon))%20=%200,%20%5Cqquad%20x%20%5Cin%20H(S_1),%20y%20%5Cin%20H(S_2),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5CGamma%20=%20%5Cpartial%20S_1"> and <img src="https://latex.codecogs.com/png.latex?S_2%20=%20S_1%5EC%20%5Cbackslash%20%5CGamma">.</p>
</div>
<section id="rewriting-the-markov-property-i-splitting-spaces" class="level3">
<h3 class="anchored" data-anchor-id="rewriting-the-markov-property-i-splitting-spaces">Rewriting the Markov property I: Splitting spaces</h3>
<p>The Markov property defined above is great and everything, but in order to manipulate it, we need to think carefully about the how the domains <img src="https://latex.codecogs.com/png.latex?S_1">, <img src="https://latex.codecogs.com/png.latex?%5CGamma%5E%5Cepsilon"> and <img src="https://latex.codecogs.com/png.latex?S_2"> can be used to divide up the space <img src="https://latex.codecogs.com/png.latex?H(T)">. To do this, we need to basically localise the Markov property to one set of <img src="https://latex.codecogs.com/png.latex?S_1">, <img src="https://latex.codecogs.com/png.latex?%5CGamma">, <img src="https://latex.codecogs.com/png.latex?S_2">. This concept is called a <em>splitting</em><sup>30</sup> of <img src="https://latex.codecogs.com/png.latex?H(S_1)"> and <img src="https://latex.codecogs.com/png.latex?H(S_2)"> by <img src="https://latex.codecogs.com/png.latex?H(%5CGamma%5E%5Cepsilon)">.</p>
<div id="def-splitting" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 5</strong></span> For some domain <img src="https://latex.codecogs.com/png.latex?S_1"> and <img src="https://latex.codecogs.com/png.latex?%5CGamma%20%5Csupseteq%20%5Cpartial%20S_1">, set <img src="https://latex.codecogs.com/png.latex?S_2%20=%20(S_1%20%5Ccup%20%5CGamma)%5Ec">. The space <img src="https://latex.codecogs.com/png.latex?H(%5CGamma%5E%5Cepsilon)"> splits <img src="https://latex.codecogs.com/png.latex?H(S_1)"> and <img src="https://latex.codecogs.com/png.latex?H(S_2)"> if <img src="https://latex.codecogs.com/png.latex?%0AH(T)%20=%20H(S_1%20%5Cominus%20%5CGamma%5E%5Cepsilon)%20%5Coplus%20H(%5CGamma%5E%5Cepsilon)%20%5Coplus%20H(S_2%20%5Cominus%20%5CGamma%5E%5Cepsilon),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Coplus"> is the sum of orthogonal components<sup>31</sup> and <img src="https://latex.codecogs.com/png.latex?x%5Cin%20H(S%20%5Cominus%20%5CGamma%5E%5Cepsilon)"> if and only if there is some <img src="https://latex.codecogs.com/png.latex?y%20%5Cin%20H(S)"> such that<sup>32</sup> <img src="https://latex.codecogs.com/png.latex?%0Ax%20=%20y%20-%20%5Cmathbb%7BE%7D(y%20%5Cmid%20H(%5CGamma%5E%5Cepsilon)).%0A"></p>
</div>
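<p>In the finite-dimensional Gaussian world, the <img src="https://latex.codecogs.com/png.latex?%5Cominus"> operation is just ordinary residualisation: subtract the linear projection onto the conditioning variables. A quick sketch (using an arbitrary covariance matrix; no Markov structure is needed for this part):</p>

```python
# Sketch: for a Gaussian vector, E(y | H(G)) is the linear projection
# Sigma_{yG} Sigma_{GG}^{-1} x_G, and the residual y - E(y | H(G)) is
# uncorrelated with every variable in H(G) -- the defining property of
# the "ominus" operation above.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Sigma = M @ M.T + np.eye(6)  # a generic covariance matrix

y, G = [0, 1], [2, 3]
beta = np.linalg.solve(Sigma[np.ix_(G, G)], Sigma[np.ix_(G, y)])
# Cov(x_y - beta^T x_G, x_G) = Sigma_{yG} - beta^T Sigma_{GG}
resid_cov = Sigma[np.ix_(y, G)] - beta.T @ Sigma[np.ix_(G, G)]
print(np.max(np.abs(resid_cov)))  # zero up to floating point
```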
<p>This emphasizes that we can split our space into three separate components: inside <img src="https://latex.codecogs.com/png.latex?S_1">, outside <img src="https://latex.codecogs.com/png.latex?S_1">, and on the boundary of <img src="https://latex.codecogs.com/png.latex?S_1">. The ability to do that for any<sup>33</sup> domain is the key part of the Markov<sup>34</sup> property.</p>
<p>A slightly more convenient way to deal with splitting spaces is the case where we have overlapping sets <img src="https://latex.codecogs.com/png.latex?A">, <img src="https://latex.codecogs.com/png.latex?B"> that cover the domain (ie <img src="https://latex.codecogs.com/png.latex?A%20%5Ccup%20B%20=%20T">) and the splitting set is their intersection <img src="https://latex.codecogs.com/png.latex?S%20=%20A%20%5Ccap%20B">. In this case, the splitting equation becomes <img src="https://latex.codecogs.com/png.latex?%0AH(A)%5E%5Cperp%20%5Cperp%20H(B)%5E%5Cperp.%0A"> I shan’t lie: that looks wild. But it makes sense when you take <img src="https://latex.codecogs.com/png.latex?A%20=%20S_1%20%5Ccup%20%5CGamma%5E%5Cepsilon"> and <img src="https://latex.codecogs.com/png.latex?B%20=%20S_2%20%5Ccup%20%5CGamma%5E%5Cepsilon">, in which case <img src="https://latex.codecogs.com/png.latex?H(A)%5E%5Cperp%20=%20H(S_2)"> and <img src="https://latex.codecogs.com/png.latex?H(B)%5E%5Cperp%20=%20H(S_1)">.</p>
<p>The final thing to add before we can get to business is a way to get rid of all of the annoying <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">s. The idea is to take the intersection of all of the <img src="https://latex.codecogs.com/png.latex?H(%5CGamma%5E%5Cepsilon)"> as the splitting space. If we define <img src="https://latex.codecogs.com/png.latex?%0AH_+(%5CGamma)%20=%20%5Cbigcap_%7B%5Cepsilon%3E0%7D%20H(%5CGamma%5E%5Cepsilon)%0A"> we can re-write<sup>35</sup> the splitting equation as <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A&amp;H_+(%5CGamma)%20=%20H_+(S_1%20%5Ccup%20%5CGamma)%20%5Ccap%20H_+(S_2%20%5Ccup%20%5CGamma)%20%5C%5C%0A&amp;%20H_+(S_1%20%5Ccup%20%5CGamma)%5E%5Cperp%20%5Cperp%20H_+(S_2%20%5Ccup%20%5CGamma)%5E%5Cperp.%0A%5Cend%7Balign*%7D"></p>
<p>This gives the following statement of the Markov property.</p>
<div id="def-markov2" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 6</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> be a system of domains<sup>36</sup> in <img src="https://latex.codecogs.com/png.latex?T">. We say that <img src="https://latex.codecogs.com/png.latex?%5Cxi"> has the Markov property (with respect to <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D">) if, for all <img src="https://latex.codecogs.com/png.latex?S_1%20%5Cin%20%5Cmathcal%7BG%7D">, <img src="https://latex.codecogs.com/png.latex?%5CGamma%5Csupseteq%20%5Cpartial%20S_1"> ,<img src="https://latex.codecogs.com/png.latex?S_2%20=%20S_1%5EC%20%5Cbackslash%20%5CGamma">, we have, for some <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%3E%200"> <img src="https://latex.codecogs.com/png.latex?%0AH_+(%5CGamma%5E%5Cepsilon)%20=%20H_+(S_1%20%5Ccup%20%5CGamma%5E%5Cepsilon)%20%5Ccap%20H_+(S_1%20%5Ccup%20%5CGamma%5E%5Cepsilon)%0A"> and <img src="https://latex.codecogs.com/png.latex?%0AH_+(S_1%20%5Ccup%20%5CGamma)%5E%5Cperp%20%5Cperp%20H_+(S_2%20%5Ccup%20%5CGamma)%5E%5Cperp.%0A"></p>
</div>
</section>
<section id="rewriting-the-markov-property-ii-the-dual-random-field-ha" class="level3">
<h3 class="anchored" data-anchor-id="rewriting-the-markov-property-ii-the-dual-random-field-ha">Rewriting the Markov property II: The dual random field <img src="https://latex.codecogs.com/png.latex?H%5E*(A)"></h3>
<p>We are going to fall further down the abstraction rabbit hole in the hope of ending up somewhere useful. In this case, we are going to invent an object that has no reason to exist and we will show that it can be used to compactly restate the Markov property. It will turn out in the next section that this object is actually useful and will lead (finally) to an operational characterisation of a Markovian Gaussian process.</p>
<div id="def-dual-field" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 7 (Dual random field)</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Cxi"> be a generalised Gaussian process with an associated random field <img src="https://latex.codecogs.com/png.latex?H(A)">, <img src="https://latex.codecogs.com/png.latex?A%20%5Csubseteq%20T"> and let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> be a complete system of open domains in <img src="https://latex.codecogs.com/png.latex?T">. The <em>dual</em> to the random field <img src="https://latex.codecogs.com/png.latex?H(A)">, <img src="https://latex.codecogs.com/png.latex?A%20%5Csubseteq%20T"> on the system <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> is the random field <img src="https://latex.codecogs.com/png.latex?H%5E*(A)">, <img src="https://latex.codecogs.com/png.latex?A%20%5Csubseteq%20T"> that satisfies <img src="https://latex.codecogs.com/png.latex?%0AH%5E*(T)%20=%20H(T)%0A"> and <img src="https://latex.codecogs.com/png.latex?%0AH%5E*(A)%20=%20H_+(A%5Ec)%5E%5Cperp,%20%5Cqquad%20A%20%5Cin%20%5Cmathcal%7BG%7D.%0A"></p>
</div>
<p>This definition looks frankly a bit wild, but I promise you, we will use it.</p>
<p>The reason for its structure is that it directly relates to the Markov property. In particular, the existence of a dual field implies that, if we have any <img src="https://latex.codecogs.com/png.latex?S_1%20%5Cin%20%5Cmathcal%7BG%7D">, then <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AH_+(S_1%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D)%20%5Ccap%20H_+(S_2%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D)%20&amp;=%20H%5E*((S_1%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D)%5Ec)%5E%5Cperp%20%5Ccap%20H%5E*((S_2%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D)%5Ec)%5E%5Cperp%20%5C%5C%0A&amp;=%20H%5E*((S_1%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D)%5Ec%20%5Ccup%20(S_2%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D)%5Ec)%5E%5Cperp%20%5C%5C%0A&amp;=%20H_+((S_1%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D)%20%5Ccap%20(S_2%20%5Ccup%20%5Cbar%7B%5CGamma%5E%5Cepsilon%7D))%20%5C%5C%0A&amp;=%20H_+(%5CGamma%5E%5Cepsilon).%0A%5Cend%7Balign*%7D"> That’s the first thing we need to show to demonstrate the Markov property.</p>
<p>The second part is much easier. If we note that <img src="https://latex.codecogs.com/png.latex?(S_1%20%5Ccup%20%5CGamma)%5Ec%20=%20S_2%20%5Cbackslash%20%5CGamma">, it follows that <img src="https://latex.codecogs.com/png.latex?%0AH_+(S_1%20%5Ccup%20%5CGamma)%5E%5Cperp%20=%20H%5E*(S_2%20%5Cbackslash%20%5CGamma).%0A"></p>
<p>This gives us our third (and final) characterisation of the (second-order) Markov property.</p>
<div id="def-markov3" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 8</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> be a system of domains<sup>37</sup> in <img src="https://latex.codecogs.com/png.latex?T">. Assume that the random field <img src="https://latex.codecogs.com/png.latex?H(%5Ccdot)"> has an associated dual random field <img src="https://latex.codecogs.com/png.latex?H%5E*(%5Ccdot)">.</p>
<p>We say that <img src="https://latex.codecogs.com/png.latex?H(A)">, <img src="https://latex.codecogs.com/png.latex?A%20%5Cin%20%5Cmathcal%7BG%7D"> has the Markov property (with respect to<sup>38</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D">) if and only if for all <img src="https://latex.codecogs.com/png.latex?S_1%20%5Cin%20%5Cmathcal%7BG%7D">, <img src="https://latex.codecogs.com/png.latex?%0AH%5E*(S_1%20%5Cbackslash%20%5CGamma)%20%5Cperp%20H%5E*(S_2%20%5Cbackslash%20%5CGamma).%0A"> When this holds, we say that the dual field is <em>orthogonal</em> with respect to <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D">.</p>
</div>
<p>There is probably more to say about dual fields. For instance, the dual of the dual field is the original field. Neat, huh. But really, all we need to do is know that an orthogonal dual field implies the Markov property. Because next we are going to construct a dual field, which will give us an actually useful characterisation of Markovian GPs.</p>
</section>
<section id="building-out-our-toolset-with-the-conjugate-gp" class="level3">
<h3 class="anchored" data-anchor-id="building-out-our-toolset-with-the-conjugate-gp">Building out our toolset with the conjugate GP</h3>
<p>In this section, our job is to construct a dual random field. To do this, we are going to exploit the notion of a <em>conjugate<sup>39</sup> Gaussian process</em>, which is a generalised<sup>40</sup> GP <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> such that<sup>41</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(%5Cxi(f)%5Cxi%5E*(g))%20=%20%5Cint_T%20f(s)g(s)%5C,ds.%0A"> It is going to turn out that <img src="https://latex.codecogs.com/png.latex?H%5E*(%5Ccdot)"> is the random field generated by <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*">. The condition that <img src="https://latex.codecogs.com/png.latex?H(T)%20=%20H%5E*(T)"> can be assumed <em>a fortiori</em>. What we need to show is that the existence of a conjugate Gaussian process implies that, for all <img src="https://latex.codecogs.com/png.latex?S%20%5Cin%20%5Cmathcal%7BG%7D">, <img src="https://latex.codecogs.com/png.latex?H%5E*(S)%20=%20H_+(S%5Ec)%5E%5Cperp">.</p>
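<p>A finite-dimensional stand-in makes the conjugate process less mysterious. If <img src="https://latex.codecogs.com/png.latex?x%20%5Csim%20N(0,%20%5CSigma)">, then the natural conjugate vector is <img src="https://latex.codecogs.com/png.latex?x%5E*%20=%20%5CSigma%5E%7B-1%7Dx">: its cross-covariance with <img src="https://latex.codecogs.com/png.latex?x"> is the identity, which is the discrete analogue of the integral condition above. A sketch:</p>

```python
# Sketch: the finite-dimensional conjugate of x ~ N(0, Sigma) is
# x* = Sigma^{-1} x. Then Cov(x, x*) = Sigma Sigma^{-1} = I (the analogue
# of E(xi(f) xi*(g)) = int f g ds) and Cov(x*) = Sigma^{-1}, the precision.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Sigma = M @ M.T + 5 * np.eye(5)  # a well-conditioned covariance
Q = np.linalg.inv(Sigma)         # covariance of the conjugate vector

print(np.max(np.abs(Sigma @ Q - np.eye(5))))  # Cov(x, x*) = I
print(np.max(np.abs(Q @ Sigma @ Q - Q)))      # Cov(x*) = Sigma^{-1}
```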
<p>We will return to the issue of whether or not <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> actually exists later, but assuming it does, let’s see how its associated random field <img src="https://latex.codecogs.com/png.latex?H%5E*(S)"> relates to <img src="https://latex.codecogs.com/png.latex?H_+(S%5Ec)%5E%5Cperp"> for <img src="https://latex.codecogs.com/png.latex?S%5Cin%20%5Cmathcal%7BG%7D">. While it is not always true that these things are equal, it <em>is</em> always true that <img src="https://latex.codecogs.com/png.latex?%0AH%5E*(S)%20%5Csubseteq%20H_+(S%5Ec)%5E%5Cperp.%0A"> We will consider when equality holds in the next section. But first let’s show the inclusion.</p>
<p>The space <img src="https://latex.codecogs.com/png.latex?H%5E*(S)"> contains all random variables of the form <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*(u)">, where the support of <img src="https://latex.codecogs.com/png.latex?u"> is compact in <img src="https://latex.codecogs.com/png.latex?S">, which means that it is a positive distance from <img src="https://latex.codecogs.com/png.latex?S%5EC">. That means that, for some <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%3E%200">, the support of <img src="https://latex.codecogs.com/png.latex?u"> is outside<sup>42</sup> of <img src="https://latex.codecogs.com/png.latex?(S%5Ec)%5E%5Cepsilon">. So if we fix that <img src="https://latex.codecogs.com/png.latex?u"> and consider any smooth <img src="https://latex.codecogs.com/png.latex?v"> with support in<sup>43</sup> <img src="https://latex.codecogs.com/png.latex?(S%5Ec)%5E%5Cepsilon">, then, from the definition of the conjugate GP, we have<sup>44</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(%5Cxi(v)%5Cxi%5E*(u))%20=%20%5Cint_T%20u(s)%20v(s)%5C,%20ds%20=%200.%0A"> This means that <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*(u)"> is perpendicular to <img src="https://latex.codecogs.com/png.latex?%5Cxi(v)"> and, therefore, <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*(u)%20%5Cin%20H((S%5Ec)%5E%5Cepsilon)%5E%5Cperp">. 
Now, <img src="https://latex.codecogs.com/png.latex?H_+(S%5Ec)"> is defined as the intersection of these spaces, but it turns out that<sup>45</sup> for any spaces <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B">, <img src="https://latex.codecogs.com/png.latex?%0A(A%20%5Ccap%20B)%5E%5Cperp%20=%20A%5E%5Cperp%20%5Ccup%20B%5E%5Cperp.%0A"> This is because <img src="https://latex.codecogs.com/png.latex?A%5Ccap%20B%20%5Csubset%20A"> and so every function that’s orthogonal to functions in <img src="https://latex.codecogs.com/png.latex?A"> is also orthogonal to functions in <img src="https://latex.codecogs.com/png.latex?A%5Ccap%20B">. The same goes for <img src="https://latex.codecogs.com/png.latex?B">. We have shown that <img src="https://latex.codecogs.com/png.latex?%0AH_+(S%5Ec)%5E%5Cperp%20=%20%5Cbigcup_%7B%5Cepsilon%20%3E%200%7D%20H((S%5Ec)%5E%5Cepsilon)%5E%5Cperp%0A"> and every <img src="https://latex.codecogs.com/png.latex?%5Ceta%5E*%20%5Cin%20H%5E*(S)"> is in <img src="https://latex.codecogs.com/png.latex?H((S%5Ec)%5E%5Cepsilon)%5E%5Cperp"> for some <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%3E0">. This gives the inclusion <img src="https://latex.codecogs.com/png.latex?%0AH%5E*(S)%20%5Csubseteq%20H_+(S%5Ec)%5E%5Cperp.%0A"></p>
<p>To give conditions for when it’s an actual equality is a bit more difficult. It, maybe surprisingly, involves thinking carefully about the reproducing kernel Hilbert space of <img src="https://latex.codecogs.com/png.latex?%5Cxi">. We are going to take this journey together in two steps. First we will give a condition on the RKHS that guarantees that <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> exists. Then we will look at when <img src="https://latex.codecogs.com/png.latex?H%5E*(S)%20=%20H_+(S%5Ec)%5E%5Cperp">.</p>
</section>
<section id="when-does-xi-exits-or-a-surprising-time-with-the-reproducing-kernel-hilbert-space" class="level3">
<h3 class="anchored" data-anchor-id="when-does-xi-exits-or-a-surprising-time-with-the-reproducing-kernel-hilbert-space">When does <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> exits? or, A surprising time with the reproducing kernel Hilbert space</h3>
<p>First off, though, we need to make sure that <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> exists. Obviously<sup>46</sup> if it exists then it is unique and <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E%7B**%7D%20=%20%5Cxi">.</p>
<p>But does it exist? The answer turns out to be <em>sometimes</em>. But also <em>usually</em>. To show this, we need to do something that is, frankly, just a little bit fancy. We need to deal with the reproducing kernel Hilbert space<sup>47</sup>. This feels somewhat surprising, but it turns out that it is a fundamental object<sup>48</sup> and intrinsically tied to the space <img src="https://latex.codecogs.com/png.latex?H(T)">.</p>
<p>The reproducing kernel Hilbert space, which we will now<sup>49</sup> call <img src="https://latex.codecogs.com/png.latex?V(T)"> because we are using <img src="https://latex.codecogs.com/png.latex?H"> for something else in this section, is a set of deterministic generalised functions <img src="https://latex.codecogs.com/png.latex?%5Cpsi"> that can be evaluated at <img src="https://latex.codecogs.com/png.latex?C_0%5E%5Cinfty(T)"> functions<sup>50</sup> as <img src="https://latex.codecogs.com/png.latex?%0A%5Cpsi(u)%20=%20%5Cint_T%20u(s)%5C,d%5Cpsi(s),%20%5Cqquad%20u%20%5Cin%20C_0%5E%5Cinfty(T).%0A"> A generalised function <img src="https://latex.codecogs.com/png.latex?%5Cpsi%20%5Cin%20V(T)"> if and only if there is a corresponding random variable <img src="https://latex.codecogs.com/png.latex?%5Ceta%20%5Cin%20H(T)"> that satisfies <img src="https://latex.codecogs.com/png.latex?%0A%5Cpsi(u)%20=%20%5Cmathbb%7BE%7D%5Cleft%5B%5Cxi(u)%20%5Ceta%5Cright%5D,%20%5Cqquad%20u%20%5Cin%20C_0%5E%5Cinfty(T).%0A"> It can be shown<sup>51</sup> that there is a one-to-one correspondence between <img src="https://latex.codecogs.com/png.latex?H(T)"> and <img src="https://latex.codecogs.com/png.latex?V(T)">, in the sense that for every <img src="https://latex.codecogs.com/png.latex?%5Cpsi"> there is a unique <img src="https://latex.codecogs.com/png.latex?%5Ceta%20=%20%5Ceta(%5Cpsi)%20%5Cin%20H(T)">.</p>
<p>We can use this correspondence to endow <img src="https://latex.codecogs.com/png.latex?V(T)"> with an inner product <img src="https://latex.codecogs.com/png.latex?%0A%5Clangle%20%5Cpsi_1,%20%5Cpsi_2%5Crangle_%7BV(T)%7D%20=%20%5Cmathbb%7BE%7D(%5Ceta(%5Cpsi_1)%5Ceta(%5Cpsi_2)).%0A"></p>
<p>So far, so abstract. The point of the conjugate GP is that it gives us an explicit construction of the<sup>52</sup> mapping <img src="https://latex.codecogs.com/png.latex?%5Ceta">. And, importantly for the discussion of existence, if there is a conjugate GP then the RKHS has a particular relationship with <img src="https://latex.codecogs.com/png.latex?C_0%5E%5Cinfty(T)">.</p>
<p>To see this, let’s assume <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> exists. Then, for each <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20C_0%5E%5Cinfty(T)">, the generalised function <img src="https://latex.codecogs.com/png.latex?%0A%5Cpsi_v(u)%20=%20%5Cint_T%20u(s)%20v(s)%5C,ds%0A"> is in <img src="https://latex.codecogs.com/png.latex?V(T)"> because, by the definition of <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*">, we have that <img src="https://latex.codecogs.com/png.latex?%0A%5Cpsi_v(u)%20=%20%5Cmathbb%7BE%7D(%5Cxi(u)%5Cxi%5E*(v))%20=%20%5Cint_T%20u(s)%20v(s)%5C,ds.%0A"> Hence, the embedding is given by <img src="https://latex.codecogs.com/png.latex?%5Ceta(v)%20=%20%5Cxi%5E*(v)">.</p>
<p>Now, if we do a bit of mathematical trickery and equate things that are isomorphic, <img src="https://latex.codecogs.com/png.latex?C_0%5E%5Cinfty(T)%20%5Csubseteq%20V(T)">. On its face, that doesn’t make much sense because on the left we have a space of actual functions and on the right we have a space of generalised functions. To make it work, we associate each smooth function <img src="https://latex.codecogs.com/png.latex?v"> with the generalised function <img src="https://latex.codecogs.com/png.latex?%5Cpsi_v"> defined above.</p>
<p>This makes <img src="https://latex.codecogs.com/png.latex?V(T)"> the closure<sup>53</sup> of <img src="https://latex.codecogs.com/png.latex?C_0%5E%5Cinfty(T)"> under the norm <img src="https://latex.codecogs.com/png.latex?%0A%5C%7Cv%5C%7C%5E2_%7BV(T)%7D%20=%20%5Cmathbb%7BE%7D%5Cleft(%5Cxi%5E*(v)%5E2%5Cright).%0A"> Hence we have shown that if there is a conjugate GP, then <img src="https://latex.codecogs.com/png.latex?%0AC_0%5E%5Cinfty(T)%20%5Csubseteq%20V(T),%20%5Cqquad%20%5Coverline%7BC_0%5E%5Cinfty(T)%7D%20=%20V(T).%0A"> It turns out that if <img src="https://latex.codecogs.com/png.latex?C_0%5E%5Cinfty(T)"> is dense in <img src="https://latex.codecogs.com/png.latex?V(T)">, then there exists a conjugate GP defined through the isomorphism <img src="https://latex.codecogs.com/png.latex?%5Ceta(%5Ccdot)">. This is because <img src="https://latex.codecogs.com/png.latex?H(T)%20=%20%5Ceta(V(T))"> and <img src="https://latex.codecogs.com/png.latex?%5Ceta"> is continuous. Hence if we choose <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*(v)%20=%20%5Ceta(v)"> then <img src="https://latex.codecogs.com/png.latex?H%5E*(T)%20=%20H(T)">.</p>
<p>We have shown the following.</p>
<div id="thm-conjugate-exist" class="theorem">
<p><span class="theorem-title"><strong>Theorem 1</strong></span> A conjugate GP exists if and only if <img src="https://latex.codecogs.com/png.latex?C_0%5E%5Cinfty(T)"> is dense in <img src="https://latex.codecogs.com/png.latex?V(T)">.</p>
</div>
<p>This is our first step towards turning statements about the stochastic process <img src="https://latex.codecogs.com/png.latex?%5Cxi"> into statements about the RKHS. We shall continue along this road.</p>
<p>You might, at this point, be wondering if that condition ever actually holds. The answer is yes. It does fairly often. For instance, if <img src="https://latex.codecogs.com/png.latex?%5Cxi"> is a <a href="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5.html#part-2-an-invitation-to-the-theory-of-stationary-gaussian-processes">stationary GP</a> with spectral density <img src="https://latex.codecogs.com/png.latex?f(%5Comega)">, the conjugate GP exists if and only if there is some <img src="https://latex.codecogs.com/png.latex?k%3E0"> such that <img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20(1%20+%20%7C%5Comega%7C%5E2)%5E%7B-k%7Df(%5Comega)%5E%7B-1%7D%5C,d%5Comega%20%3C%20%5Cinfty.%0A"> This basically says that the theory we are developing doesn’t work for GPs with extremely smooth sample paths (like a GP with the squared exponential covariance function). This is not a restriction that bothers me at all.</p>
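<p>If you want to see the integrability condition in action, here is a throwaway numerical check for two standard 1D spectral densities (the parameterisations below are schematic; constants are dropped):</p>

```python
# Sketch: checking int (1 + w^2)^{-k} / f(w) dw < infinity for two spectral
# densities. Matern: f decays polynomially, so a large enough k works.
# Squared exponential: f decays like exp(-w^2), so the integrand explodes.
import numpy as np
from scipy.integrate import quad

def matern_f(w, nu=1.5, kappa=1.0):
    # Matern spectral density, up to constants
    return (kappa**2 + w**2) ** (-(nu + 0.5))

def sq_exp_f(w, ell=1.0):
    # Squared exponential spectral density, up to constants
    return np.exp(-((ell * w) ** 2) / 2.0)

k = 3  # enough damping: 1/f grows like w^4 for nu = 1.5
val, _ = quad(lambda w: (1 + w**2) ** (-k) / matern_f(w), -np.inf, np.inf)
print(np.isfinite(val))  # the condition holds: a conjugate GP exists

# For the squared exponential, 1/f already dwarfs any polynomial damping:
print((1 + 30.0**2) ** (-k) / sq_exp_f(30.0))  # astronomically large
```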
<p>For non-stationary GPs that aren’t too smooth, this will also hold as long as nothing too bizarre is happening at infinity.</p>
</section>
<section id="but-when-does-hs-h_scperp" class="level3">
<h3 class="anchored" data-anchor-id="but-when-does-hs-h_scperp">But when does <img src="https://latex.codecogs.com/png.latex?H%5E*(S)%20=%20H_+(S%5Ec)%5E%5Cperp">?</h3>
<p>We have shown already<sup>54</sup> that <img src="https://latex.codecogs.com/png.latex?%0AH((S%5Ec)%5E%5Cepsilon)%5E%5Cperp%20=%20%5Cleft%5C%7B%5Cxi%5E*(u):%20u%20%5Cin%20V(T),%5C,%20%5Coperatorname%7Bsupp%7D(u)%20%5Csubseteq%20%5B(S%5Ec)%5E%5Cepsilon%5D%5Ec%5Cright%5C%7D%0A"> (that last bit with all the complements can be read as “the support of <img src="https://latex.codecogs.com/png.latex?u"> is inside <img src="https://latex.codecogs.com/png.latex?S"> and always more than <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> from the boundary”). It follows then that <img src="https://latex.codecogs.com/png.latex?%0AH_+(S%5Ec)%5E%5Cperp%20=%20%5Cbigcup_%7B%5Cepsilon%3E0%7D%5Cleft%5C%7B%5Cxi%5E*(u):%20%20u%20%5Cin%20V(T),%5C,%20%5Coperatorname%7Bsupp%7D(u)%20%5Csubseteq%20%5B(S%5Ec)%5E%5Cepsilon%5D%5Ec%5Cright%5C%7D.%0A"> This is nice because it shows that <img src="https://latex.codecogs.com/png.latex?H_+(S%5Ec)%5E%5Cperp"> is related to the space <img src="https://latex.codecogs.com/png.latex?%0AV(S)%20=%20%5Cbigcup_%7B%5Cepsilon%3E0%7D%5Cleft%5C%7B%20%20u%20%5Cin%20V(T),%5C,%20%5Coperatorname%7Bsupp%7D(u)%20%5Csubseteq%20%5B(S%5Ec)%5E%5Cepsilon%5D%5Ec%5Cright%5C%7D,%0A"> that is, if <img src="https://latex.codecogs.com/png.latex?v%5Cin%20V(T)"> is a function that is the limit of a sequence of functions <img src="https://latex.codecogs.com/png.latex?v_n%20%5Cin%20V(T)"> with <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bsupp%7D(v_n)%20%5Csubseteq%20%5B(S%5Ec)%5E%5Cepsilon%5D%5Ec"> for some <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%3E0">, then <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*(v)%20%5Cin%20H_+(S%5Ec)%5E%5Cperp"> and <em>every</em> such random variable has an associated <img src="https://latex.codecogs.com/png.latex?v">.</p>
<p>So, in the sense<sup>55</sup> of isomorphisms these are equivalent, that is <img src="https://latex.codecogs.com/png.latex?%0AH_+(S%5Ec)%5E%5Cperp%20%5Ccong%20V(S).%0A"></p>
<p>This means that if we can show that <img src="https://latex.codecogs.com/png.latex?H%5E*(S)%20%5Ccong%20V(S)">, then we have two spaces that are isomorphic to the same space <em>and</em> use the same isomorphism <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*">. This would mean that the spaces are equivalent.</p>
<p>This can also be placed in the language of function spaces. Recall that <img src="https://latex.codecogs.com/png.latex?%0AH%5E*(S)%20=%20%5Coverline%5C%7B%5Cxi%5E*(u):%20u%20%5Cin%20C_0%5E%5Cinfty(S)%5C%7D.%0A"> Hence <img src="https://latex.codecogs.com/png.latex?H%5E*(S)"> will be isomorphic to <img src="https://latex.codecogs.com/png.latex?V(S)"> if and only if <img src="https://latex.codecogs.com/png.latex?%0AV(S)%20=%20%5Coverline%7BC_0%5E%5Cinfty(S)%7D,%0A"> that is, if and only if every <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20V(S)"> is the limit of a sequence of smooth functions compactly supported within <img src="https://latex.codecogs.com/png.latex?S">.</p>
<p>This turns out to not <em>always</em> be true, but it’s true in the situations that we most care about. In particular, we get the following theorem, which I am certainly not going to prove.</p>
<div class="{thm-conjugate-dual}">
<p>Assume that the conjugate GP <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> exists. Assume that <em>either</em> of the following holds:</p>
<ol type="1">
<li><p>Multiplication by a function <img src="https://latex.codecogs.com/png.latex?w%20%5Cin%20C_0%5E%5Cinfty"> is bounded in <img src="https://latex.codecogs.com/png.latex?V(T)">, ie <img src="https://latex.codecogs.com/png.latex?%0A%5C%7Cwu%20%5C%7C_%7BV(T)%7D%20%5Cleq%20C(w)%20%5C%7Cu%5C%7C_%7BV(T)%7D,%20%5Cqquad%20u%20%5Cin%20C_0%5E%5Cinfty%20(T).%0A"></p></li>
<li><p>The shift operator is bounded under both the RKHS norm and the covariance<sup>56</sup> norm for small <img src="https://latex.codecogs.com/png.latex?s_0">, ie <img src="https://latex.codecogs.com/png.latex?%0A%5C%7Cu(%5Ccdot%20-%20s_0)%5C%7C%20%5Cleq%20C%20%5C%7Cu%5C%7C,%20%5Cqquad%20u%20%5Cin%20C_0%5E%5Cinfty(T)%0A"> holds in both norms for all <img src="https://latex.codecogs.com/png.latex?s_0%20%5Cleq%20s_%5Cmax">, <img src="https://latex.codecogs.com/png.latex?s_%5Cmax%20%3E0"> sufficiently small.</p></li>
</ol>
<p>Then <img src="https://latex.codecogs.com/png.latex?H%5E*(%5Ccdot)"> is the dual of <img src="https://latex.codecogs.com/png.latex?H(%5Ccdot)"> over the system of sets that are bounded or have bounded complements in <img src="https://latex.codecogs.com/png.latex?T">.</p>
</div>
<p>The second condition is particularly important because it <em>always</em> holds for stationary GPs with <img src="https://latex.codecogs.com/png.latex?C=1"> as their covariance structure is shift invariant. It’s not impossible to come up with examples of generalised GPs that don’t satisfy this condition, but they’re all a bit weird (eg the “derivative” of white noise). So as long as your GP is not too weird, you should be fine.</p>
</section>
<section id="at-long-last-an-rkhs-characterisation-of-the-markov-property" class="level3">
<h3 class="anchored" data-anchor-id="at-long-last-an-rkhs-characterisation-of-the-markov-property">At long last, an RKHS characterisation of the Markov property</h3>
<p>And with that, we are finally here! We have that <img src="https://latex.codecogs.com/png.latex?H%5E*(S)"> is the dual random field to <img src="https://latex.codecogs.com/png.latex?H(S)">, <img src="https://latex.codecogs.com/png.latex?S%5Cin%20G"> <em>and</em> we have a lovely characterisation of <img src="https://latex.codecogs.com/png.latex?H%5E*(S)"> in terms of the RKHS <img src="https://latex.codecogs.com/png.latex?V(S)">. We can combine this with our definition of a Markov property for GPs with a dual random field and get that a GP <img src="https://latex.codecogs.com/png.latex?%5Cxi"> is Markovian if and only if <img src="https://latex.codecogs.com/png.latex?%0AH%5E*(S_1%20%5Cbackslash%20%5CGamma)%20%5Cperp%20H%5E*(S_2%20%5Cbackslash%20%5CGamma).%0A"> We can use the isomorphism to say that if <img src="https://latex.codecogs.com/png.latex?%5Ceta_j%20%5Cin%20H%5E*(S_j%20%5Cbackslash%20%5CGamma)">, <img src="https://latex.codecogs.com/png.latex?j=1,2">, then there is a <img src="https://latex.codecogs.com/png.latex?v_j%20%5Cin%20V(S_j%20%5Cbackslash%20%5CGamma)"> such that <img src="https://latex.codecogs.com/png.latex?%0A%5Ceta_j%20=%20%5Cxi%5E*(v_j).%0A"> Moreover, this isomorphism is unitary (aka it preserves the inner product) and so <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(%5Ceta_1%20%5Ceta_2)%20=%20%5Clangle%20v_1,%20v_2%5Crangle_%7BV(T)%7D.%0A"> Hence, <img src="https://latex.codecogs.com/png.latex?%5Cxi"> has the Markov property if and only if <img src="https://latex.codecogs.com/png.latex?%0A%5Clangle%20v_1,%20v_2%5Crangle_%7BV(T)%7D%20=%200,%20%5Cqquad%20v_j%20%5Cin%20V(S_j%20%5Cbackslash%20%5CGamma),%5C,S_1%20%5Cin%20%5Cmathcal%7BG%7D,%5C,%20S_2%20=%20S_1%5Ec,%5C,%20j=1,2.%0A"></p>
<p>Let’s memorialise this as a theorem.</p>
<div id="thm-markov-rkhs" class="theorem">
<p><span class="theorem-title"><strong>Theorem 2</strong></span> A GP <img src="https://latex.codecogs.com/png.latex?%5Cxi"> with a conjugate GP <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> is Markov if and only if its RKHS is local, ie if <img src="https://latex.codecogs.com/png.latex?v_1"> and <img src="https://latex.codecogs.com/png.latex?v_2"> have disjoint supports, then <img src="https://latex.codecogs.com/png.latex?%0A%5Clangle%20v_1,%20v_2%5Crangle_%7BV(T)%7D%20=%200.%0A"></p>
</div>
<p>This result is <em>particularly</em> nice because it entirely characterises the RKHS inner product of a Markovian GP. The reason for this is a deep result from functional analysis called Peetre’s Theorem, which states, in our context, that locality implies that the inner product has the form <img src="https://latex.codecogs.com/png.latex?%0A%5Clangle%20v_1,%20v_2%5Crangle_%7BV(T)%7D%20=%20%5Csum_%7B%5Cmathbf%7Bk%7D,%20%5Cmathbf%7Bj%7D%7D%20%5Cint_T%20a_%7B%5Cmathbf%7Bk%7D%5Cmathbf%7Bj%7D%7D(s)%5Cfrac%7B%5Cpartial%5E%7B%7C%5Cmathbf%7Bk%7D%7C%7Dv_1%7D%7B%5Cpartial%20s_%5Cmathbf%7Bk%7D%7D%20%5Cfrac%7B%5Cpartial%5E%7B%7C%5Cmathbf%7Bj%7D%7C%7Dv_2%7D%7B%5Cpartial%20s_%5Cmathbf%7Bj%7D%7D%5C,ds,%0A"> where<sup>57</sup> <img src="https://latex.codecogs.com/png.latex?a_%7B%5Cmathbf%7Bk%7D%5Cmathbf%7Bj%7D%7D(s)"> are integrable functions and only a finite number of them are non-zero at any point <img src="https://latex.codecogs.com/png.latex?s">.</p>
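<p>To make Peetre’s form concrete, here is a sketch of the simplest worked instance, taking the standard operator from the footnotes (the example itself is mine, not the post’s):</p>

```latex
% For C^{-1} = \kappa^2 - \Delta, integrating by parts (the v_j are
% compactly supported, so there are no boundary terms) puts the inner
% product in the local Peetre form, with a_{00} = \kappa^2 and a_{jj} = 1:
\langle v_1, v_2 \rangle_{V(T)}
  = \int_T v_1 \left(\kappa^2 - \Delta\right) v_2 \, ds
  = \kappa^2 \int_T v_1 v_2 \, ds
    + \sum_{j=1}^d \int_T
        \frac{\partial v_1}{\partial s_j}
        \frac{\partial v_2}{\partial s_j} \, ds.
```

<p>Every term is an integral of a product of derivatives of <img src="https://latex.codecogs.com/png.latex?v_1"> and <img src="https://latex.codecogs.com/png.latex?v_2">, so it vanishes whenever the two functions have disjoint supports, exactly as the theorem requires.</p>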
<p>This connection between the RKHS and the dual space also gives the following result for stationary GPs.</p>
<div id="thm-stationary-gp" class="theorem">
<p><span class="theorem-title"><strong>Theorem 3</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Cxi"> be a stationary Gaussian process. Then <img src="https://latex.codecogs.com/png.latex?%5Cxi"> has the Markov property if and only if its spectral density is the inverse of a non-negative, symmetric polynomial.</p>
</div>
<p>This follows from the characterisation of the RKHS inner product as <img src="https://latex.codecogs.com/png.latex?%0A%5Clangle%20v_1,%20v_2%5Crangle_%7BV(T)%7D%20=%20%5Cint_T%20%5Chat%7Bv_1%7D(%5Comega)%20%5Chat%7Bv_2%7D(%5Comega)%20f(%5Comega)%5E%7B-1%7D%5C,d%5Comega,%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bv_1%7D"> is the Fourier transform of <img src="https://latex.codecogs.com/png.latex?v_1">, and the fact that a differential operator is transformed into a polynomial in Fourier space.</p>
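<p>For a concrete sanity check (a standard example, not taken from the post): the Whittle field from the Matérn family passes the test, while the squared-exponential covariance fails it:</p>

```latex
% Whittle field on R^2 (Matern with smoothness 1): the spectral density
f(\omega) \propto \left(\kappa^2 + \|\omega\|^2\right)^{-2}
% is the inverse of a non-negative symmetric polynomial, so the GP is
% Markov. By contrast, a squared-exponential covariance has
f(\omega) \propto \exp\!\left(-\tfrac{\ell^2 \|\omega\|^2}{2}\right),
% which is not the reciprocal of any polynomial, so that GP is not Markov.
```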
</section>
<section id="putting-this-all-in-terms-of-eta" class="level3">
<h3 class="anchored" data-anchor-id="putting-this-all-in-terms-of-eta">Putting this all in terms of <img src="https://latex.codecogs.com/png.latex?%5Ceta"></h3>
<p><em>Waaaay</em> back near the top of the post I described a way to write a (generalised) GP in terms of its covariance operator and the white noise process <img src="https://latex.codecogs.com/png.latex?%0A%5Ceta%20=%20%5Cmathcal%7BC%7D%5E%7B1/2%7DW.%0A"> From the discussions above, it follows that the corresponding conjugate GP is given by <img src="https://latex.codecogs.com/png.latex?%0A%5Ceta%5E*%20=%20%5Cmathcal%7BC%7D%5E%7B-1/2%7DW.%0A"> This means that the RKHS inner product is given by <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Clangle%20v_1,%20v_2%20%5Crangle_%7BV(T)%7D%20&amp;=%20%5Cmathbb%7BE%7D(%5Ceta%5E*(v_1)%5Ceta%5E*(v_2))%5C%5C%0A&amp;=%20%5Cmathbb%7BE%7D%5Cleft%5B%5Cint_T%20%5Cmathcal%7BC%7D%5E%7B-1/2%7Dv_1(s)%5C,dW(s)%5Cint_T%20%5Cmathcal%7BC%7D%5E%7B-1/2%7Dv_2(s)%5C,dW(s)%5Cright%5D%20%5C%5C%0A&amp;=%20%5Cint_T%20v_1(s)%5Cmathcal%7BC%7D%5E%7B-1%7Dv_2(s)%5C,ds%0A%5Cend%7Balign*%7D"> From the discussion above, if <img src="https://latex.codecogs.com/png.latex?%5Ceta"> is Markovian, then <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1%7D"> is<sup>58</sup> a differential<sup>59</sup> operator.</p>
</section>
</section>
<section id="using-the-rkhs-to-build-computationally-efficient-approximations-to-markovian-gps" class="level2">
<h2 class="anchored" data-anchor-id="using-the-rkhs-to-build-computationally-efficient-approximations-to-markovian-gps">Using the RKHS to build computationally efficient approximations to Markovian GPs</h2>
<p>To close out this post, let’s look at how we can use the RKHS to build an approximation to a Markovian GP. This is equivalent<sup>60</sup> to the SPDE method that was very briefly sketched above, but it only requires knowledge of the RKHS inner product.</p>
<p>In particular, if we have a set of basis functions <img src="https://latex.codecogs.com/png.latex?%5Cpsi_j">, <img src="https://latex.codecogs.com/png.latex?j=1,%5Cldots,n">, we can define the approximate RKHS <img src="https://latex.codecogs.com/png.latex?V_n(T)"> as the space of all functions <img src="https://latex.codecogs.com/png.latex?%0Af(s)%20=%20%5Csum_%7Bj=1%7D%5En%20f_j%20%5Cpsi_j(s)%0A"> equipped with the inner product <img src="https://latex.codecogs.com/png.latex?%0A%5Clangle%20f,%20g%20%5Crangle_%7BV_n(T)%7D%20=%20f%5ET%20Q%20g,%0A"> where on the left-hand side <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> are functions, while on the right they are the corresponding vectors of weights, and <img src="https://latex.codecogs.com/png.latex?%0AQ_%7Bij%7D%20=%20%5Clangle%20%5Cpsi_i,%20%5Cpsi_j%5Crangle_%7BV(T)%7D.%0A"></p>
<p>For a finite dimensional GP, the matrix that defines the RKHS inner product is<sup>61</sup> the inverse of the covariance matrix. Hence the finite dimensional GP <img src="https://latex.codecogs.com/png.latex?u%5E%7B(n)%7D(%5Ccdot)"> associated with the RKHS <img src="https://latex.codecogs.com/png.latex?V_n(T)"> is the random function <img src="https://latex.codecogs.com/png.latex?%0Au%5E%7B(n)%7D(s)%20=%20%5Csum_%7Bj%20=%201%7D%5En%20u_j%20%5Cpsi_j(s),%0A"> where the weights <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20N(0,%20Q%5E%7B-1%7D)">.</p>
<p>If the GP is Markovian <em>and</em> the basis functions have compact support, then <img src="https://latex.codecogs.com/png.latex?Q"> is a sparse matrix and maybe he’ll love me again.</p>
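<p>As an illustrative sketch of that sparsity (the grid, the value of kappa, and the choice of taking the inverse covariance operator to be the simple elliptic operator from the footnotes are all assumptions of mine for this example), the 1-d tent-function version can be computed directly:</p>

```python
import numpy as np
from scipy import sparse

# Illustrative sketch: tent-function basis on a uniform grid on (0, 1),
# with the RKHS inner product <v1, v2> = \int v1 (kappa^2 - d^2/ds^2) v2 ds.
# The grid size and kappa are arbitrary choices for the example.
n, kappa = 100, 2.0
h = 1.0 / (n + 1)

# Mass matrix M_ij = \int psi_i psi_j ds: tridiagonal, because tent
# functions only overlap their immediate neighbours.
M = sparse.diags(
    [np.full(n - 1, h / 6), np.full(n, 2 * h / 3), np.full(n - 1, h / 6)],
    offsets=[-1, 0, 1], format="csc")

# Stiffness matrix G_ij = \int psi_i' psi_j' ds: also tridiagonal.
G = sparse.diags(
    [np.full(n - 1, -1 / h), np.full(n, 2 / h), np.full(n - 1, -1 / h)],
    offsets=[-1, 0, 1], format="csc")

# Q_ij = <psi_i, psi_j>_{V(T)} is sparse: only 3n - 2 non-zeros.
Q = kappa**2 * M + G

# Sample the weights u ~ N(0, Q^{-1}) via a Cholesky factor Q = L L^T:
# if z ~ N(0, I), then u = L^{-T} z has covariance (L L^T)^{-1} = Q^{-1}.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(Q.toarray())
u = np.linalg.solve(L.T, rng.standard_normal(n))
```

<p>In more than one dimension the same construction goes through with tent functions built on a triangulation, and <code>Q</code> stays sparse as long as the basis functions have compact support.</p>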


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>or redefined if you’ve read <a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness.html">my other post</a>↩︎</p></li>
<li id="fn2"><p>For other observation models it contains the posterior mode↩︎</p></li>
<li id="fn3"><p>Step 1: Open Rasmussen and Williams.↩︎</p></li>
<li id="fn4"><p>For example, the process I’m about to describe is not meaningfully different for a process on a sphere. Whereas if you want to use a covariance function on a sphere you are stuck trying to find a whole new class of positive definite functions. It’s frankly very annoying. Although if you want to build a career out of characterising positive definite functions on increasingly exotic spaces, you probably don’t find it annoying.↩︎</p></li>
<li id="fn5"><p>Or the Cholesky factor if you add a bunch of transposes in the right places, but let’s not kid ourselves this is not a practical discussion of how to do it↩︎</p></li>
<li id="fn6"><p>Albeit a bit advanced. It’s straightforward in the sense that for an infinite-dimensional operator it happens to work a whole like a symmetric positive semi-definite matrix. It is not straightforward in the sense that your three year old could do it. Your three year old can’t do it. But it will keep them quiet in the back seat of the car while you pop into the store for some fags. It’s ok. The window’s down.↩︎</p></li>
<li id="fn7"><p>For any subset <img src="https://latex.codecogs.com/png.latex?B">, <img src="https://latex.codecogs.com/png.latex?%5Csup_%7Bs%5Cin%20B%7D%20w(s)%20=%20%5Cinfty"> <em>and</em> <img src="https://latex.codecogs.com/png.latex?%5Cinf_%7Bs%20%5Cin%20B%7D%20w(s)%20=%20-%5Cinfty">↩︎</p></li>
<li id="fn8"><p>Countably additive set-valued function taking any value in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BC%7D">↩︎</p></li>
<li id="fn9"><p>measurable↩︎</p></li>
<li id="fn10"><p><img src="https://latex.codecogs.com/png.latex?A%20%5Ccap%20B%20=%20%5Cemptyset">↩︎</p></li>
<li id="fn11"><p>If <img src="https://latex.codecogs.com/png.latex?W(A)"> is also Gaussian then this is the same as them being independent↩︎</p></li>
<li id="fn12"><p>Recall that <img src="https://latex.codecogs.com/png.latex?T"> is our whole space. Usually <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">, but it doesn’t matter here.↩︎</p></li>
<li id="fn13"><p>A bit of a let down really.↩︎</p></li>
<li id="fn14"><p>like <img src="https://latex.codecogs.com/png.latex?f(s)"> but with more subsets↩︎</p></li>
<li id="fn15"><p><img src="https://latex.codecogs.com/png.latex?L%5E2(T)"> is the space of functions with the property that <img src="https://latex.codecogs.com/png.latex?%5Cint_T%20f(s)%5E2%5C,ds%20%3C%20%5Cinfty">.↩︎</p></li>
<li id="fn16"><p>eg the Gaussian free field in physics, or the de Wijs process.↩︎</p></li>
<li id="fn17"><p>You can use a separate set of basis functions here, but I’m focusing on simplicity↩︎</p></li>
<li id="fn18"><p>The standard example is <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BC%7D%5E%7B-1/2%7D%20=%20%5Ckappa%5E2%20-%20%5Csum_%7Bj=1%7D%5Ed%20%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial%20s_j%5E2%7D.%0A">↩︎</p></li>
<li id="fn19"><p>In particular piecewise linear tent functions build on a triangulation↩︎</p></li>
<li id="fn20"><p>Read the paper, it’s a further approximation but the error is negligible↩︎</p></li>
<li id="fn21"><p>(<img src="https://latex.codecogs.com/png.latex?d-1">)-dimensional sub-manifold↩︎</p></li>
<li id="fn22"><p>This set does not include its boundary↩︎</p></li>
<li id="fn23"><p>This is defined as the set <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20S_1%20=%20%5Cbar%7BS_1%7D%20%5Cbackslash%20S_1">, where <img src="https://latex.codecogs.com/png.latex?%5Cbar%7BS_1%7D"> is the closure of <img src="https://latex.codecogs.com/png.latex?S_1">. But let’s face it. It’s the fucking boundary. It means what you think it means.↩︎</p></li>
<li id="fn24"><p>I’m using <img src="https://latex.codecogs.com/png.latex?%5Cxi"> here as a <em>generic</em> generalised GP, rather than <img src="https://latex.codecogs.com/png.latex?%5Ceta">, which is built using an ordinary GP. This doesn’t really make much of a difference (the Markov property for one is the same as the other), but it makes me feel better.↩︎</p></li>
<li id="fn25"><p>measurable↩︎</p></li>
<li id="fn26"><p>Here <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bsupp%7D(f)"> is the support of <img src="https://latex.codecogs.com/png.latex?f">, that is the values of <img src="https://latex.codecogs.com/png.latex?s"> such that <img src="https://latex.codecogs.com/png.latex?f(s)%20%5Cneq%200">.↩︎</p></li>
<li id="fn27"><p>This is the terminology of Rozanov. Random Field is also another term for stochastic process. Why only let words mean one thing?↩︎</p></li>
<li id="fn28"><p>non-empty connected open sets↩︎</p></li>
<li id="fn29"><p>Strictly, this is the <em>weak</em> or <em>second-order</em> Markov property↩︎</p></li>
<li id="fn30"><p>If you’re curious, this is basically the same thing as a splitting <img src="https://latex.codecogs.com/png.latex?%5Csigma">-algebra. But, you know, sans the <img src="https://latex.codecogs.com/png.latex?%5Csigma">-algebra bullshit.↩︎</p></li>
<li id="fn31"><p>That is, any <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20H(T)"> can be written as the sum <img src="https://latex.codecogs.com/png.latex?x%20=%20x_1%20+%20x_2%20+%20x_3">, where <img src="https://latex.codecogs.com/png.latex?x_1%20%5Cin%20%20H(S_1%20%5Cominus%20%5CGamma%5E%5Cepsilon)">, <img src="https://latex.codecogs.com/png.latex?x_2%20%5Cin%20H(%5CGamma%5E%5Cepsilon)">, and <img src="https://latex.codecogs.com/png.latex?x_3%20%5Cin%20H(S_2%20%5Cominus%20%5CGamma%5E%5Cepsilon)"> are <em>mutually orthogonal</em> (ie <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(x_1x_2)%20=%20%5Cmathbb%7BE%7D(x_1x_3)%20=%20%5Cmathbb%7BE%7D(x_2x_3)%20=0">!).↩︎</p></li>
<li id="fn32"><p>This is using the idea that the conditional expectation is a projection.↩︎</p></li>
<li id="fn33"><p>Typically any open set, or any open connected set, or any open, bounded set. A subtlety that I don’t really want to dwell on is that it is possible to have a GP that is Markov with respect to one system of domains but not another.↩︎</p></li>
<li id="fn34"><p>The Markov property can be restated in this language as for every system of complementary domains and boundary <img src="https://latex.codecogs.com/png.latex?S_1">, <img src="https://latex.codecogs.com/png.latex?%5CGamma">, <img src="https://latex.codecogs.com/png.latex?S_2">, there exists a small enough <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%3E%200"> such that <img src="https://latex.codecogs.com/png.latex?%5CGamma%5E%5Cepsilon"> splits <img src="https://latex.codecogs.com/png.latex?S_1"> and <img src="https://latex.codecogs.com/png.latex?S_2">↩︎</p></li>
<li id="fn35"><p>Technically we are assuming that for small enough <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> <img src="https://latex.codecogs.com/png.latex?H(%5CGamma%5E%5Cepsilon)%20=%20%5Coperatorname%7Bspan%7D%5Cleft(H(%5CGamma%5E%5Cepsilon%20%5Ccap%20S_1)%20%5Ccup%20H_+(%5CGamma)%20%5Ccup%20H(%5CGamma%5E%5Cepsilon%20%5Ccap%20S_2)%5Cright)">. This is not a particularly onerous assumption.↩︎</p></li>
<li id="fn36"><p>non-empty connected open sets↩︎</p></li>
<li id="fn37"><p>non-empty connected open sets↩︎</p></li>
<li id="fn38"><p>The result works with some subsystem <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG_0%7D">. To prove it for <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> it’s enough to prove it for some subset <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D_0"> that separates points of <img src="https://latex.codecogs.com/png.latex?T">. This is a wildly technical aside and if it makes no sense to you, that’s very much ok. Frankly I’m impressed you’ve hung in this long.↩︎</p></li>
<li id="fn39"><p>Rozanov also calls this the <em>biorthogonal</em> GP. I like conjugate more.↩︎</p></li>
<li id="fn40"><p>Up to this point, it hasn’t been technically necessary for the GP to be generalised. However, here is very much is. It turns out that if realisations of <img src="https://latex.codecogs.com/png.latex?%5Cxi"> are almost surely continuous, then realisations of <img src="https://latex.codecogs.com/png.latex?%5Cxi%5E*"> are almost surely generalised functions.↩︎</p></li>
<li id="fn41"><p>I’m writing this as if all of these GPs are real valued, but for full generality, we should be dealing with complex GPs. Just imagine I put complex conjugates in all the correct places. I can’t stop you.↩︎</p></li>
<li id="fn42"><p>That is, inside <img src="https://latex.codecogs.com/png.latex?S"> and more than <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> from the boundary↩︎</p></li>
<li id="fn43"><p><img src="https://latex.codecogs.com/png.latex?v"> can be non-zero inside <img src="https://latex.codecogs.com/png.latex?S"> but only if it’s less than <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> away from the boundary.↩︎</p></li>
<li id="fn44"><p>It’s zero because the two functions are never non-zero at the same time, so their product is zero.↩︎</p></li>
<li id="fn45"><p>Here, and probably in a lot of other places, we are taking the union of spaces to be the span of their sum. Sorry.↩︎</p></li>
<li id="fn46"><p>Really Daniel. Really. (It’s an isomorphism so if you do enough analysis courses this is obvious. If that’s not clear to you, you should just trust me. Trust issues aren’t sexy. Unless you have cum gutters. In which case, I’ll just spray my isomorphisms on them and you can keep scrolling TikTok.)↩︎</p></li>
<li id="fn47"><p>This example is absolutely why I hate that we’ve settled on RKHS as a name for this object because the thing that we are about to construct does not always have a reproducing kernel property. Cameron-Martin space is less confusing. But hey. Whatever. The RKHS for the rest of this section is not always a Hilbert space with a reproducing kernel. We are just going to have to be ok with that.↩︎</p></li>
<li id="fn48"><p>Nothing about this analysis relies on Gaussianity. So this is a general characterisation of a Markov property for <em>any</em> stochastic process with second moments.↩︎</p></li>
<li id="fn49"><p>In previous blogs, this was denoted <img src="https://latex.codecogs.com/png.latex?H_c(T)"> and truly it was too confusing when I tried to do it here. And by that point I wasn’t going back and re-naming <img src="https://latex.codecogs.com/png.latex?H(T)">.↩︎</p></li>
<li id="fn50"><p><img src="https://latex.codecogs.com/png.latex?C_0%5E%5Cinfty(T)"> is the space of all infinitely differentiable compactly supported functions on <img src="https://latex.codecogs.com/png.latex?T">↩︎</p></li>
<li id="fn51"><p>The trick is to notice that the set of all possible <img src="https://latex.codecogs.com/png.latex?%5Cxi(u)"> is dense in <img src="https://latex.codecogs.com/png.latex?H(T)">.↩︎</p></li>
<li id="fn52"><p>unitary↩︎</p></li>
<li id="fn53"><p>the space containing the limits (in the <img src="https://latex.codecogs.com/png.latex?V(T)">-norm) of all sequences in <img src="https://latex.codecogs.com/png.latex?v_n%20%5Cin%20C_0%5E%5Cinfty(T)">↩︎</p></li>
<li id="fn54"><p>If you take some limits↩︎</p></li>
<li id="fn55"><p>I mean, really. Basically we say that <img src="https://latex.codecogs.com/png.latex?A%20%5Ccong%20B"> if there is an isomorphism between <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B">. Could I be more explicit? Yes. Would that make this unreadable? Also yes.↩︎</p></li>
<li id="fn56"><p><img src="https://latex.codecogs.com/png.latex?%5C%7Cu%5C%7C%5E2%20=%20%5Cmathbb%7BE%7D(%5Cxi(u)%5E2)">.↩︎</p></li>
<li id="fn57"><p><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bj%7D%20=%20(j_1,%20j_2,%20%5Cldots)"> is a multi-index, which can be interpreted as <img src="https://latex.codecogs.com/png.latex?%7C%5Cmathbf%7Bj%7D%7C%20=%20%5Csum_%7B%5Cell%5Cgeq%201%20%7Dj_%5Cell">, and <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%5E%7B%7C%5Cmathbf%7Bj%7D%7C%7Du%7D%7B%5Cpartial%20s_%5Cmathbf%7Bj%7D%7D%20=%20%5Cfrac%7B%5Cpartial%5E%7B%7C%5Cmathbf%7Bj%7D%7C%7Du%7D%7B%5Cpartial%5E%7Bj_1%7Ds_%7B1%7D%5Cpartial%5E%7Bj_2%7Ds_%7B2%7D%5Ccdots%7D.%0A">↩︎</p></li>
<li id="fn58"><p>in every local coordinate system↩︎</p></li>
<li id="fn59"><p>Because <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D%5E%7B-1%7D"> defines an inner product, it’s actually a symmetric elliptic differential operator↩︎</p></li>
<li id="fn60"><p>Technically, you need to choose different basis functions for <img src="https://latex.codecogs.com/png.latex?f">. In particular, you need to choose <img src="https://latex.codecogs.com/png.latex?f%20=%20%5Csum_%7Bj=1%7D%5En%20f_j%20%5Cphi_j"> where <img src="https://latex.codecogs.com/png.latex?%5Cphi_j%20=%20%5Cmathcal%7BC%7D%5E%7B-1/2%7D%20%5Cpsi_j">. This is then called a Petrov-Galerkin approximation and truly we don’t need to think about it at all. Also I am completely eliding issues of smoothness in all of this. It matters, but it doesn’t matter too much. So let’s just assume everything exists.↩︎</p></li>
<li id="fn61"><p>If you don’t believe me you are welcome to read <a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness.html">the monster blog post</a>, where it’s an example.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2023,
  author = {Simpson, Dan},
  title = {Markovian {Gaussian} Processes: {A} Lot of Theory and Some
    Practical Stuff},
  date = {2023-01-21},
  url = {https://dansblog.netlify.app/posts/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2023" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2023. <span>“Markovian Gaussian Processes: A Lot of Theory
and Some Practical Stuff.”</span> January 21, 2023. <a href="https://dansblog.netlify.app/posts/">https://dansblog.netlify.app/posts/</a>.
</div></div></section></div> ]]></description>
  <category>Gaussian processes</category>
  <category>Fundamentals</category>
  <category>Theory</category>
  <category>Deep Dives</category>
  <guid>https://dansblog.netlify.app/posts/2023-01-21-markov/markov.html</guid>
  <pubDate>Fri, 20 Jan 2023 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2023-01-21-markov/gays.png" medium="image" type="image/png" height="165" width="144"/>
</item>
<item>
  <title>Sparse matrices part 7a: Another shot at JAX-ing the Cholesky decomposition</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-11-27-sparse7/sparse7.html</link>
  <description><![CDATA[ 





<p>The time has come once more to resume my journey into sparse matrices. There’s been a bit of a pause, mostly because I realised that I didn’t know how to implement the sparse Cholesky factorisation in a JAX-traceable way. But now the time has come. It is time for me to get on top of JAX’s weird control-flow constructs.</p>
<p>And, along the way, I’m going to re-do the sparse Cholesky factorisation to make it, well, better.</p>
<p>In order to temper expectations, I will tell you that this post does not do the numerical factorisation, only the symbolic one. Why? Well I wrote most of it on a long-haul flight and I didn’t get to the numerical part. And this was long enough. So hold your breaths for Part 7b, which will come as soon as I write it.</p>
<p>You can consider this a <em>much</em> better re-do of <a href="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices.html">Part 2</a>. This is no longer my first python coding exercise in a decade, so hopefully the code is better. And I’m definitely trying a lot harder to think about the limitations of JAX.</p>
<p>Before I start, I should probably say why I’m doing this. JAX is a truly magical thing that will compute gradients and everything else just by clever processing of the Jacobian-vector product code. Unfortunately, this is only possible if the Jacobian-vector product code is JAX traceable and this code is structurally extremely similar<sup>1</sup> to the code for the sparse Cholesky factorisation.</p>
<p>I am doing this in the hope of (eventually getting to) autodiff. But that won’t be this blog post. This blog post is complicated enough.</p>
<section id="control-flow-of-the-damned" class="level2">
<h2 class="anchored" data-anchor-id="control-flow-of-the-damned">Control flow of the damned</h2>
<p>The first and most important rule of programming with JAX is that loops will break your heart. I mean, whatever, I guess they’re fine. But there’s a problem. Imagine the following function</p>
<div class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f(x: jax.Array, n: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> jax.Array:</span>
<span id="cb1-2">  out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.zeros_like(x)</span>
<span id="cb1-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb1-4">    out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x</span>
<span id="cb1-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> out</span></code></pre></div>
</div>
<p>This is, basically, the worst implementation of multiplication by an integer that you can possibly imagine. This code will run fine in Python, but if you try to JIT compile it, JAX is gonna get <em>angry</em>. It will produce the machine code equivalent of</p>
<div class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f_n(x):</span>
<span id="cb2-2">  out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x</span>
<span id="cb2-3">  out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x</span>
<span id="cb2-4">  out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x</span>
<span id="cb2-5">  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#</span> do this n times</span>
<span id="cb2-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> out</span></code></pre></div>
</div>
<p>There are two bad things happening here. First, note that the “compiled” code depends on <code>n</code> and will have to be compiled anew each time <code>n</code> changes. Secondly, the loop has been replaced by <code>n</code> copies of the loop body. This is called <em>loop unrolling</em> and, when used judiciously by a clever compiler, is a great way to speed up code. When done completely for <em>every</em> loop this is a nightmare and the corresponding code will take a geological amount of time to compile.</p>
<p>A similar thing<sup>2</sup> happens when you need to run autodiff on <code>f(x,n)</code>. For each <code>n</code> an expression graph is constructed that contains the unrolled for loop. This suggests that autodiff might also end up being quite slow (or, more problematically, more memory-hungry).</p>
<p>So the first rule of JAX is to avoid for loops. But if you can’t do that, there are three built-in loop structures that play nicely with JIT compilation and sometimes<sup>3</sup> differentiation. These three constructs are</p>
<ol type="1">
<li>A while loop <code>jax.lax.while_loop(cond_func, body_func, init)</code></li>
<li>An accumulator <code>jax.lax.scan(body_func, init, xs)</code></li>
<li>A for loop <code>jax.lax.fori_loop(lower, upper, body_fun, init)</code></li>
</ol>
<p>Of those three, the first and third work mostly as you’d expect, while the second is a bit more hairy. The <code>while_loop</code> function is roughly equivalent to</p>
<div class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> jax_lax_while_loop(cond_func, body_func, init):</span>
<span id="cb3-3">  x  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init</span>
<span id="cb3-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> cond_func(x):</span>
<span id="cb3-5">    x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> body_func(x)</span>
<span id="cb3-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x</span></code></pre></div>
</div>
<p>So basically it’s just a while loop. The thing that’s important is that it compiles down to a single XLA operation<sup>4</sup> instead of some unrolled mess.</p>
<p>One thing that is important to realise is that while loops are only forwards-mode differentiable, which means that it is <em>very</em> expensive<sup>5</sup> to compute gradients. The reason for this is that we simply do not know how long that loop actually is and so it’s impossible to build a fixed-size expression graph.</p>
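<p>To make this concrete, here is a hypothetical <code>while_loop</code> whose trip count depends on the data, which is exactly the situation where a fixed-size unrolled graph is impossible:</p>

```python
import jax
from jax import lax

# Count how many doublings it takes for x to reach at least 100.
def doublings(x):
    def cond_func(state):
        val, count = state
        return val < 100.0

    def body_func(state):
        val, count = state
        return (val * 2.0, count + 1)

    return lax.while_loop(cond_func, body_func, (x, 0))

val, count = jax.jit(doublings)(1.0)  # val = 128.0, count = 7
```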
<p>The <code>jax.lax.scan</code> function is probably the one that people will be least familiar with. That said, it’s also the one that is roughly “how a for loop should work”. The concept that’s important here is a for-loop with <em>carry over</em>. Carry over is information that changes from one step of the loop to the next. This is what separates it from a <code>map</code> statement, which would apply the same function independently to each element of a list.</p>
<p>The scan function looks like</p>
<div class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> jax_lax_scan(body_func, init, xs):</span>
<span id="cb4-2">  len_x0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x0)</span>
<span id="cb4-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> len_x0 <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> xs):</span>
<span id="cb4-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ValueError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"All x must have the same length!!"</span>)</span>
<span id="cb4-5">  carry <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init</span>
<span id="cb4-6">  ys <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb4-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> xs:</span>
<span id="cb4-8">    carry, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> body_func(carry, x)</span>
<span id="cb4-9">    ys.append(y)</span>
<span id="cb4-10">  </span>
<span id="cb4-11">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> carry, np.stack(ys)</span></code></pre></div>
</div>
<p>A critically important limitation of <code>jax.lax.scan</code> is that every <code>x</code> in <code>xs</code> must have the same shape! This means, for example, that</p>
<div class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1">xs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>]</span></code></pre></div>
</div>
<p>is not a valid argument. Like most limitations in JAX, this one exists so that the code can be transformed into efficient compiled code across different processors and accelerators.</p>
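<p>What <em>is</em> valid is anything that stacks into a single array (or a pytree of arrays sharing a leading axis). A minimal sketch, scanning over the rows of a 2-D array:</p>

```python
from jax import lax
import jax.numpy as jnp

# Three scan steps; each x passed to the body is one row of shape (2,).
xs = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

def body_func(carry, x):
    carry = carry + jnp.sum(x)
    return carry, carry  # (new carry, per-step output)

total, partials = lax.scan(body_func, 0.0, xs)
# total = 21.0, partials = [3.0, 10.0, 21.0]
```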
<p>For example, if I wanted to use <code>jax.lax.scan</code> on my example from before I would get</p>
<div class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> lax</span>
<span id="cb6-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jnp</span>
<span id="cb6-3"></span>
<span id="cb6-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f(x, n):</span>
<span id="cb6-5">  init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.zeros_like(x)</span>
<span id="cb6-6">  xs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(x, n)</span>
<span id="cb6-7">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_func(carry, y):</span>
<span id="cb6-8">    val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> carry <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> y</span>
<span id="cb6-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (val, val)</span>
<span id="cb6-10">  </span>
<span id="cb6-11">  final, journey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.scan(body_func, init, xs)</span>
<span id="cb6-12">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (final, journey)</span>
<span id="cb6-13"></span>
<span id="cb6-14">final, journey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span>
<span id="cb6-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(final)</span>
<span id="cb6-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(journey)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>8.4
[1.2       2.4       3.6000001 4.8       6.        7.2       8.4      ]</code></pre>
</div>
</div>
<p>This translation is a bit awkward compared to the for loop but it’s the sort of thing that you get used to.</p>
<p>This function can be differentiated<sup>6</sup> and compiled. To differentiate it, I need a version that returns a scalar, which is easy enough to do with a lambda.</p>
<div class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jit, grad</span>
<span id="cb8-2"></span>
<span id="cb8-3">f2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x, n: f(x,n)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb8-4">f2_grad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(f2, argnums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-5"></span>
<span id="cb8-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f2_grad(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>7.0</code></pre>
</div>
</div>
<p>The <code>argnums</code> option tells JAX that we are only differentiating wrt the first argument.</p>
<p>JIT compilation is a tiny bit more delicate. If we try the natural thing, we are going to get an error.</p>
<div class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb10-1">f_jit_bad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jit(f)</span>
<span id="cb10-2">bad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f_jit_bad(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span></code></pre></div>
<div class="cell-output cell-output-error">
<pre><code>ConcretizationTypeError: Abstract tracer value encountered where concrete value is expected: Traced&lt;ShapedArray(int32[], weak_type=True)&gt;with&lt;DynamicJaxprTrace(level=0/1)&gt;
When jit-compiling jnp.repeat, the total number of repeats must be static. To fix this, either specify a static value for `repeats`, or pass a static value to `total_repeat_length`.
The error occurred while tracing the function f at /var/folders/08/4p5p665j4d966tr7nvr0v24c0000gn/T/ipykernel_24749/3851190413.py:4 for jit. This concrete value was not available in Python because it depends on the value of the argument 'n'.

See https://jax.readthedocs.io/en/latest/errors.html#jax.errors.ConcretizationTypeError</code></pre>
</div>
</div>
<p>In order to compile a function, JAX needs to know how big everything is. And right now it does not know what <code>n</code> is. This shows itself through the <code>ConcretizationTypeError</code>, which basically says that as JAX was looking through your code it found something it can’t manipulate. In this case, it was in the <code>jnp.repeat</code> function.</p>
<p>We can fix this problem by declaring this parameter <code>static</code>.</p>
<div class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb12-1">f_jit <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jit(f, static_argnums<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,))</span>
<span id="cb12-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f_jit(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>8.4</code></pre>
</div>
</div>
<p>A static parameter is a parameter value that is known at compile time. If we define <code>n</code> to be static, then the first time you call <code>f_jit(x, 7)</code> it will compile and then it will reuse the compiled code for any other value of <code>x</code>. If we then call <code>f_jit(x, 9)</code>, the code will <em>compile again</em>.</p>
<p>To see this, we can make use of a JAX oddity: if a function prints something<sup>7</sup>, it is only printed when the function is traced and compiled, and never again. This means that we can’t do <em>debug by print</em>. But on the upside, it’s easy to check when things are compiling.</p>
<div class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f2(x, n):</span>
<span id="cb14-2">  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"compiling: n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb14-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> f(x,n)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb14-4"></span>
<span id="cb14-5">f2_jit <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jit(f2, static_argnums<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,))</span>
<span id="cb14-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f2_jit(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>))</span>
<span id="cb14-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f2_jit(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.8</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>))</span>
<span id="cb14-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f2_jit(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>))</span>
<span id="cb14-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f2_jit(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.8</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>compiling: n = 7
8.4
12.6
compiling: n = 9
10.799999
12.6</code></pre>
</div>
</div>
<p>This is a perfectly ok solution as long as the static parameters don’t change very often. In our context, the static information is going to be the sparsity pattern.</p>
<p>Finally, we can talk about <code>jax.lax.fori_loop</code>, the in-built for loop. This is basically a convenience wrapper for <code>jax.lax.scan</code> (when <code>lower</code> and <code>upper</code> are static) or <code>jax.lax.while_loop</code> (when they are not). The Python pseudocode is</p>
<div class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> jax_lax_fori_loop(lower, upper, body_func, init):</span>
<span id="cb16-2">  out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init</span>
<span id="cb16-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(lower, upper):</span>
<span id="cb16-4">    out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> body_func(i, out)</span>
<span id="cb16-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> out</span></code></pre></div>
</div>
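<p>A minimal sketch of <code>jax.lax.fori_loop</code> in action; note that the body receives the loop index as well as the carried value:</p>

```python
from jax import lax

# Sum of squares 0**2 + 1**2 + ... + (n-1)**2 with a fori_loop.
def sum_squares(n):
    return lax.fori_loop(0, n, lambda i, acc: acc + i * i, 0)

total = sum_squares(5)  # 0 + 1 + 4 + 9 + 16 = 30
```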
<p>To close out this bit where I repeat the docs, there is also a traceable if/else: <code>jax.lax.cond</code> which has the pseudocode</p>
<div class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> jax_lax_cond(pred, true_fun, false_fun, val):</span>
<span id="cb17-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> pred:</span>
<span id="cb17-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> true_fun(val)</span>
<span id="cb17-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb17-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> false_fun(val)</span></code></pre></div>
</div>
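<p>A minimal sketch of <code>jax.lax.cond</code>; because both branches are traced, the result stays jit-able, unlike a Python <code>if</code> on a traced value:</p>

```python
import jax
from jax import lax

# A traceable absolute value: both branches are traced functions.
def abs_val(x):
    return lax.cond(x >= 0, lambda v: v, lambda v: -v, x)

y = jax.jit(abs_val)(-3.0)  # 3.0
```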
</section>
<section id="building-a-jax-traceable-symbolic-sparse-choleksy-factorisation" class="level2">
<h2 class="anchored" data-anchor-id="building-a-jax-traceable-symbolic-sparse-choleksy-factorisation">Building a JAX-traceable symbolic sparse Choleksy factorisation</h2>
<p>In order to build a JAX-traceable sparse Cholesky factorisation <img src="https://latex.codecogs.com/png.latex?A%20=%20LL%5ET">, we are going to need to build up a few moving parts.</p>
<ol type="1">
<li><p>Build the elimination tree of <img src="https://latex.codecogs.com/png.latex?A"> and find the number of non-zeros in each column of <img src="https://latex.codecogs.com/png.latex?L"></p></li>
<li><p>Build the <em>symbolic factorisation</em><sup>8</sup> of <img src="https://latex.codecogs.com/png.latex?L"> (aka the location of the non-zeros of <img src="https://latex.codecogs.com/png.latex?L">)</p></li>
<li><p>Do the actual numerical decomposition.</p></li>
</ol>
<p>In the <a href="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices.html">previous post</a> we did not explicitly form the elimination tree. Instead, I used dynamic memory allocation. This time I’m being more mature.</p>
<section id="building-the-expression-graph" class="level3">
<h3 class="anchored" data-anchor-id="building-the-expression-graph">Building the expression graph</h3>
<p>The elimination tree<sup>9</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_A"> is a (forest of) rooted tree(s) that compactly represent the non-zero pattern of the Cholesky factor <img src="https://latex.codecogs.com/png.latex?L">. In particular, the elimination tree has the property that, for any <img src="https://latex.codecogs.com/png.latex?k%20%3E%20j"> , <img src="https://latex.codecogs.com/png.latex?L_%7Bkj%7D%20%5Cneq%200"> if and only if there is a path from <img src="https://latex.codecogs.com/png.latex?j"> to <img src="https://latex.codecogs.com/png.latex?k"> in the tree. Or, in the language of trees, <img src="https://latex.codecogs.com/png.latex?L_%7Bkj%7D%20%5Cneq%200"> if and only if <img src="https://latex.codecogs.com/png.latex?j"> is a descendant of <img src="https://latex.codecogs.com/png.latex?k"> in the tree <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_A">.</p>
<p>We can describe<sup>10</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_A"> by listing the parent of each node. The parent node of <img src="https://latex.codecogs.com/png.latex?j"> in the tree is the smallest <img src="https://latex.codecogs.com/png.latex?i%20%3E%20j"> with <img src="https://latex.codecogs.com/png.latex?L_%7Bij%7D%20%5Cneq%200">.</p>
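<p>As a sanity check of this definition, you can brute-force the parents from a dense Cholesky factor of a tiny matrix (illustrative only; the matrix below is a made-up positive-definite example, not one from this post):</p>

```python
import numpy as np

# parent[j] is the smallest i > j with L[i, j] != 0.
A = np.array([[4.0, 1.0, 0.0, 1.0],
              [1.0, 4.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [1.0, 0.0, 1.0, 4.0]])
L = np.linalg.cholesky(A)
n = A.shape[0]
parent = [-1] * n
for j in range(n):
    below = [i for i in range(j + 1, n) if abs(L[i, j]) > 1e-12]
    if below:
        parent[j] = min(below)
# parent == [1, 2, 3, -1]; note the fill-in: A[3, 1] == 0 but L[3, 1] != 0.
```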
<p>We can turn this into an algorithm. An efficient version, described in Tim Davis’s book, takes about <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Ctext%7Bnnz%7D(A))"> operations. But I’m going to program up a slower one that takes <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Ctext%7Bnnz%7D(L))"> operations, but has the added benefit<sup>11</sup> of giving me the column counts for free.</p>
<p>To do this, we are going to walk the tree and dynamically add up the column counts as we go.</p>
<p>To start off, let’s do this in standard Python so that we can see what the algorithm looks like. The key concept is that if we write <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_%7Bj-1%7D"> as the elimination tree encoding the structure of<sup>12</sup> <code>L[:j, :j]</code>, then we can ask how this tree connects with node <code>j</code>.</p>
<p>A theorem gives a very simple answer to this.</p>
<div id="thm-tree" class="theorem">
<p><span class="theorem-title"><strong>Theorem 1</strong></span> If <img src="https://latex.codecogs.com/png.latex?j%20%3E%20i">, then <img src="https://latex.codecogs.com/png.latex?A_%7Bj,i%7D%20%5Cneq%200"> implies that <img src="https://latex.codecogs.com/png.latex?i"> is a descendant of <img src="https://latex.codecogs.com/png.latex?j"> in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_A">. In particular, that means that there is a directed path in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_A"> from <img src="https://latex.codecogs.com/png.latex?i"> to <img src="https://latex.codecogs.com/png.latex?j">.</p>
</div>
<p>This tells us how <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_%7Bj-1%7D"> connects with node <img src="https://latex.codecogs.com/png.latex?j">: for each non-zero element <img src="https://latex.codecogs.com/png.latex?i"> of the <img src="https://latex.codecogs.com/png.latex?j">th row of <img src="https://latex.codecogs.com/png.latex?A">, we can walk up the tree <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_%7Bj-1%7D"> from <img src="https://latex.codecogs.com/png.latex?i"> until we reach a node that has no parent in <img src="https://latex.codecogs.com/png.latex?%5C%7B0,%5Cldots,%20j-1%5C%7D">. Because there <em>must</em> be a path from <img src="https://latex.codecogs.com/png.latex?i"> to <img src="https://latex.codecogs.com/png.latex?j"> in <img src="https://latex.codecogs.com/png.latex?T_j">, the parent of this terminal node must be <img src="https://latex.codecogs.com/png.latex?j">.</p>
<p>As with everything Cholesky related, this works because the algorithm proceeds from left to right, which in this case means that the node label associated with <em>any</em> descendant of <img src="https://latex.codecogs.com/png.latex?j"> is always less than <img src="https://latex.codecogs.com/png.latex?j">.</p>
<p>The algorithm is then a fairly run-of-the-mill<sup>13</sup> tree traversal, where we keep track of where we have been so we don’t double count our columns.</p>
<p>Probably the most important thing here is that I am using the <em>full</em> sparse matrix rather than just its lower triangle. This is, basically, convenience. I need access to the left half of the <img src="https://latex.codecogs.com/png.latex?j">th row of <img src="https://latex.codecogs.com/png.latex?A">, which is conveniently the same as the top half of the <img src="https://latex.codecogs.com/png.latex?j">th column. And sometimes you just don’t want to be dicking around with swapping between row- and column-based representations.</p>
<div class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb18-2"></span>
<span id="cb18-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> etree_base(A_indices, A_indptr):</span>
<span id="cb18-4">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb18-5">  parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n</span>
<span id="cb18-6">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n</span>
<span id="cb18-7">  col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n</span>
<span id="cb18-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb18-9">    mark[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> j</span>
<span id="cb18-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> indptr <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(A_indptr[j], A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb18-11">      node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb18-12">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> mark[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> j:</span>
<span id="cb18-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> parent[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb18-14">          parent[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> j</span>
<span id="cb18-15">        mark[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> j</span>
<span id="cb18-16">        col_count[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb18-17">        node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> parent[node]</span>
<span id="cb18-18">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (parent, col_count)</span></code></pre></div>
</div>
<p>To convince ourselves this works, let’s run an example and compare the column counts we get to our previous method.</p>
<div class="cell" data-execution_count="14">
<details class="code-fold">
<summary>Some boilerplate from previous editions.</summary>
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sparse</span>
<span id="cb19-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sp</span>
<span id="cb19-3">    </span>
<span id="cb19-4"></span>
<span id="cb19-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> make_matrix(n):</span>
<span id="cb19-6">  one_d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.diags([[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n, [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)], [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb19-7">  A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse.kronsum(one_d, one_d) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> sparse.eye(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n))</span>
<span id="cb19-8">  A_csc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A.tocsc()</span>
<span id="cb19-9">  A_csc.eliminate_zeros()</span>
<span id="cb19-10">  A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.tril(A_csc, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"csc"</span>)</span>
<span id="cb19-11">  A_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indices</span>
<span id="cb19-12">  A_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indptr</span>
<span id="cb19-13">  A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.data</span>
<span id="cb19-14">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_index, A_indptr, A_x, A_csc)</span>
<span id="cb19-15"></span>
<span id="cb19-16"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _symbolic_factor(A_indices, A_indptr):</span>
<span id="cb19-17">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assumes A_indices and A_indptr index the lower triangle of $A$ ONLY.</span></span>
<span id="cb19-18">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb19-19">  L_sym <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb19-20">  children <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb19-21">  </span>
<span id="cb19-22">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb19-23">    L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[A_indptr[j]:A_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb19-24">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> children[j]:</span>
<span id="cb19-25">      tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[child][L_sym[child] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> j]</span>
<span id="cb19-26">      L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.unique(np.append(L_sym[j], tmp))</span>
<span id="cb19-27">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_sym[j]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb19-28">      p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[j][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb19-29">      children[p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.append(children[p], j)</span>
<span id="cb19-30">        </span>
<span id="cb19-31">  L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb19-32">  L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum([<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> L_sym])</span>
<span id="cb19-33">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate(L_sym)</span>
<span id="cb19-34">  </span>
<span id="cb19-35">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr</span></code></pre></div>
</details>
</div>
<div class="cell" data-execution_count="15">
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A_indices/A_indptr are the lower triangle, A is the entire matrix</span></span>
<span id="cb20-2">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">37</span>)</span>
<span id="cb20-3">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree_base(A.indices, A.indptr)</span>
<span id="cb20-4">L_indices, L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb20-5"></span>
<span id="cb20-6">true_parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices[L_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb20-7">true_parent[np.where(np.diff(L_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb20-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> y <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (x,y) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(parent[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], true_parent)))</span>
<span id="cb20-9"></span>
<span id="cb20-10">true_col_count  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.diff(L_indptr)</span>
<span id="cb20-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(true_col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> col_count))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
</div>
<p>Excellent. Now we just need to convert it to JAX.</p>
<p>Or do we?</p>
<p>To be honest, this is a little pointless. This function is only run once per matrix, so we won’t see much speedup<sup>14</sup> from compilation.</p>
<p>Nevertheless, we might try.</p>
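<p>For readers who haven’t met the <code>lax</code> control-flow combinators before, their semantics can be mimicked in plain Python (this matches the reference semantics in the JAX documentation; the real versions trace the body once and compile it, which is why all loop state must be threaded explicitly through a single value):</p>

```python
# Plain-Python mimics of the lax combinators used below (illustrative;
# real lax.while_loop / lax.fori_loop trace and compile body_fun).
def while_loop(cond_fun, body_fun, val):
    # repeatedly apply body_fun while cond_fun holds, threading the state
    while cond_fun(val):
        val = body_fun(val)
    return val

def fori_loop(lower, upper, body_fun, val):
    # body_fun takes (loop index, carried state) and returns new state
    for i in range(lower, upper):
        val = body_fun(i, val)
    return val

print(while_loop(lambda v: v < 10, lambda v: v + 3, 0))  # 12
print(fori_loop(0, 10, lambda i, s: s + i, 0))           # 45
```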
<div class="cell" data-execution_count="16">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb22-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@jit</span></span>
<span id="cb22-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> etree(A_indices, A_indptr):</span>
<span id="cb22-3"> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># print("(Re-)compiling etree(A_indices, A_indptr)")</span></span>
<span id="cb22-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## innermost while loop</span></span>
<span id="cb22-5">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_while(val):</span>
<span id="cb22-6">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#  print(val)</span></span>
<span id="cb22-7">    j, node, parent, col_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb22-8">    update_parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].at[x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb22-9">    parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.cond(lax.eq(parent[node], <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), update_parent, <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], (parent, node, j))</span>
<span id="cb22-10">    mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[node].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(j)</span>
<span id="cb22-11">    col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> col_count.at[node].add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb22-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (j, parent[node], parent, col_count, mark)</span>
<span id="cb22-13"></span>
<span id="cb22-14">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> cond_while(val):</span>
<span id="cb22-15">    j, node, parent, col_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb22-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> lax.bitwise_and(lax.lt(node, j), lax.ne(mark[node], j))</span>
<span id="cb22-17"></span>
<span id="cb22-18">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Inner for loop</span></span>
<span id="cb22-19">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_inner_for(indptr, val):</span>
<span id="cb22-20">    j, A_indices, A_indptr, parent, col_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb22-21">    node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb22-22">    j, node, parent, col_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.while_loop(cond_while, body_while, (j, node, parent, col_count, mark))</span>
<span id="cb22-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (j, A_indices, A_indptr, parent, col_count, mark)</span>
<span id="cb22-24">  </span>
<span id="cb22-25">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Outer for loop</span></span>
<span id="cb22-26">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_out_for(j, val):</span>
<span id="cb22-27">     A_indices, A_indptr, parent, col_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb22-28">     mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[j].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(j)</span>
<span id="cb22-29">     j, A_indices, A_indptr, parent, col_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(A_indptr[j], A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], body_inner_for, (j, A_indices, A_indptr, parent, col_count, mark))</span>
<span id="cb22-30">     <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_indices, A_indptr, parent, col_count, mark)</span>
<span id="cb22-31"></span>
<span id="cb22-32">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Body of code</span></span>
<span id="cb22-33">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb22-34">  parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb22-35">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb22-36">  col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,  n)</span>
<span id="cb22-37">  init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (A_indices, A_indptr, parent, col_count, mark)</span>
<span id="cb22-38">  A_indices, A_indptr, parent, col_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n, body_out_for, init)</span>
<span id="cb22-39">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> parent, col_count</span></code></pre></div>
</div>
<p>Wow. That is <em>ugly</em>. But let’s see<sup>15</sup> if it works!</p>
<div class="cell" data-execution_count="17">
<div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb23-1">parent_jax, col_count_jax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb23-2"></span>
<span id="cb23-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> y <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (x,y) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(parent_jax[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], true_parent)))</span>
<span id="cb23-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(true_col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> col_count_jax))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>True</code></pre>
</div>
</div>
<p>Success!</p>
<p>I guess we could ask ourselves if we gained any speed.</p>
<p>Here is the pure Python code.</p>
<div class="cell" data-execution_count="18">
<div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> timeit</span>
<span id="cb26-2">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb26-3"></span>
<span id="cb26-4">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: etree_base(A.indices, A.indptr),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb26-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb26-6"></span>
<span id="cb26-7">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb26-8">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: etree_base(A.indices, A.indptr),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb26-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb26-10"></span>
<span id="cb26-11"></span>
<span id="cb26-12">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb26-13">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: etree_base(A.indices, A.indptr),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb26-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 400: [0.0, 0.0, 0.0, 0.0, 0.0]
n = 2500: [0.03, 0.03, 0.03, 0.03, 0.03]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 40000: [0.83, 0.82, 0.82, 0.82, 0.82]</code></pre>
</div>
</div>
<p>And here is our JAX’d and JIT’d code. (One caveat: <code>jit</code> specialises on input shapes, so each new matrix size triggers a fresh compilation, and the first repetition at each size will include that cost.)</p>
<div class="cell" data-execution_count="19">
<div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb29-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb29-2">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: etree(A.indices, A.indptr),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb29-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb29-4"></span>
<span id="cb29-5">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb29-6">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: etree(A.indices, A.indptr),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb29-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb29-8"></span>
<span id="cb29-9"></span>
<span id="cb29-10">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb29-11">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: etree(A.indices, A.indptr),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb29-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb29-13"></span>
<span id="cb29-14">parent, col_count<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb29-15">L_indices, L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb29-16"></span>
<span id="cb29-17">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>)</span>
<span id="cb29-18">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: etree(A.indices, A.indptr),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb29-19"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 400: [0.13, 0.0, 0.0, 0.0, 0.0]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 2500: [0.12, 0.0, 0.0, 0.0, 0.0]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 40000: [0.14, 0.02, 0.02, 0.02, 0.02]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 1000000: [2.24, 2.11, 2.12, 2.12, 2.12]</code></pre>
</div>
</div>
<p>You can see that there is a decent speedup. For the first three examples, the computation time is dominated by the compilation time, but when the matrix has a million unknowns the compilation time is negligible. At this scale it would probably be worth using the fancy algorithm. That said, when your problem is that big it is probably not worth sweating a three-second computation that only happens once!</p>
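<p>As a toy illustration of what these timings are measuring (my own example, nothing to do with JAX internals): with <code>number = 1</code>, <code>timeit.repeat</code> reports each repetition separately, so any one-time setup cost lands entirely in the first measurement. Here a cached pure-Python computation stands in for JIT compilation.</p>

```python
import functools
import timeit

@functools.lru_cache(maxsize=None)
def slow_setup(n):
    # stand-in for compilation: an O(n^2) pure-Python loop that only runs once per n
    return sum(i * j for i in range(n) for j in range(n))

def work(n):
    # fast once the cached "compilation" has happened
    return slow_setup(n) + n

# number=1 keeps repetitions separate: the first entry includes the one-time
# setup cost, the remaining entries do not
times = timeit.repeat(lambda: work(500), number=1, repeat=5)
```

<p>This is exactly the shape of the timings above: a large first entry followed by a stable, much smaller steady state.</p>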
</section>
<section id="the-non-zero-pattern-of-l" class="level3">
<h3 class="anchored" data-anchor-id="the-non-zero-pattern-of-l">The non-zero pattern of <img src="https://latex.codecogs.com/png.latex?L"></h3>
<p>Now that we know how many non-zeros there are, it’s time to populate them. Last time, I used some dynamic memory allocation to make this work, but JAX is certainly not going to allow me to do that. So instead I’m going to have to do the worst thing possible: think.</p>
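<p>The payoff of knowing the column counts in advance is that the whole structure of <code>L</code> can be preallocated in one shot, with no dynamic allocation at all. A minimal numpy sketch (the counts here are made up for illustration):</p>

```python
import numpy as np

# made-up column counts of L (diagonal included), e.g. from the elimination-tree pass
col_count = np.array([3, 2, 2, 1])

# CSC column pointers: column j of L occupies L_indices[L_indptr[j]:L_indptr[j+1]]
L_indptr = np.zeros(len(col_count) + 1, dtype=int)
L_indptr[1:] = np.cumsum(col_count)

# the total number of non-zeros is known up front, so the index array can be
# allocated once, with no resizing -- which is exactly what JIT compilation needs
L_indices = np.zeros(L_indptr[-1], dtype=int)
```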
<p>The way that we went about it last time was, to be honest, a bit arse-backwards. The main reason for this is that I did not have access to the elimination tree. But now that we do, we can actually use it.</p>
<p>The trick is to slightly rearrange<sup>16</sup> the order of operations to get something that is more convenient for working out the structure.</p>
<p>Recall from last time that we used the <em>left-looking</em> Cholesky factorisation, which can be written in the dense case as</p>
<div class="cell" data-execution_count="20">
<div class="sourceCode cell-code" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb34-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> dense_left_cholesky(A):</span>
<span id="cb34-2">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb34-3">  L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros_like(A)</span>
<span id="cb34-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb34-5">    L[j,j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(A[j,j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> np.inner(L[j, :j], L[j, :j]))</span>
<span id="cb34-6">    L[(j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):, j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (A[(j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):, j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> L[(j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):, :j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> L[j, :j].transpose()) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> L[j,j]</span>
<span id="cb34-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L</span></code></pre></div>
</div>
<p>This is not the only way to organise those operations. An alternative is the <em>up-looking</em> Cholesky factorisation, which can be implemented in the dense case as</p>
<div class="cell" data-execution_count="21">
<div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb35-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> dense_up_cholesky(A):</span>
<span id="cb35-2">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb35-3">  L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros_like(A)</span>
<span id="cb35-4">  L[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(A[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb35-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,n):</span>
<span id="cb35-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#if i &gt; 0:</span></span>
<span id="cb35-7">    L[i, :i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (np.linalg.solve(L[:i, :i], A[:i,i])).transpose()</span>
<span id="cb35-8">    L[i, i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(A[i,i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> np.inner(L[i, :i], L[i, :i]))</span>
<span id="cb35-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L</span></code></pre></div>
</div>
<p>This is quite a different-looking beast! It scans row by row rather than column by column. And while the left-looking algorithm is based on matrix-vector multiplies, the up-looking algorithm is based on triangular solves. So maybe we should pause for a moment to check that these two algorithms really compute the same factor!</p>
<div class="cell" data-execution_count="22">
<div class="sourceCode cell-code" id="cb36" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb36-1">A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span>
<span id="cb36-2">A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> A.transpose()</span>
<span id="cb36-3">A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A.transpose() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>np.eye(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span>
<span id="cb36-4"></span>
<span id="cb36-5">L_left <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dense_left_cholesky(A)</span>
<span id="cb36-6">L_up <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dense_up_cholesky(A)</span>
<span id="cb36-7"></span>
<span id="cb36-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>((L_left <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> L_up)[:]))),<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>0.0</code></pre>
</div>
</div>
<p>They are the same!!</p>
<p>The reason for considering the up-looking algorithm is that it gives a slightly nicer description of the non-zeros of row <code>i</code>, which will let us find the location of the non-zeros in the whole matrix. In particular, the non-zeros to the left of the diagonal on row <code>i</code> correspond to the non-zero indices of the solution to the lower triangular linear system<sup>17</sup> <img src="https://latex.codecogs.com/png.latex?%0AL_%7B1:(i-1),1:(i-1)%7D%20x%5E%7B(i)%7D%20=%20A_%7B1:i-1,%20i%7D.%0A"> Because <img src="https://latex.codecogs.com/png.latex?A"> is sparse, this is a system of <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bnnz%7D(A_%7B1:i-1,i%7D)"> linear equations, rather than <img src="https://latex.codecogs.com/png.latex?(i-1)"> equations that we would have in the dense case. That means that the sparsity pattern of <img src="https://latex.codecogs.com/png.latex?x%5E%7B(i)%7D"> will be the union of the sparsity patterns of the columns of <img src="https://latex.codecogs.com/png.latex?L_%7B1:(i-1),1:(i-1)%7D"> that correspond to the non-zero entries of <img src="https://latex.codecogs.com/png.latex?A_%7B1:i-1,%20i%7D">.</p>
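<p>That union can be computed purely symbolically. Here is a small sketch (my own toy example, not the post&rsquo;s code) that finds the non-zero pattern of the solution to a sparse lower-triangular solve by traversing the pattern of <code>L</code>, starting from the non-zeros of the right-hand side:</p>

```python
# Toy sketch: the non-zero pattern of x in L x = b, for a sparse lower-triangular
# L stored in CSC form (indices/indptr). Starting from nnz(b), a traversal of
# the pattern of each reached column gives the pattern of x.
def solve_pattern(Lt_indices, Lt_indptr, b_nnz):
    pattern = set()
    stack = list(b_nnz)
    while stack:
        j = stack.pop()
        if j in pattern:
            continue
        pattern.add(j)
        # every non-zero L[k, j] below the diagonal propagates a non-zero to x[k]
        for p in range(Lt_indptr[j], Lt_indptr[j + 1]):
            k = Lt_indices[p]
            if k != j:
                stack.append(k)
    return sorted(pattern)

# 4x4 lower-triangular pattern: column 0 has rows {0, 2}, column 1 has {1, 3},
# column 2 has {2, 3}, column 3 has {3}
Lt_indices = [0, 2, 1, 3, 2, 3, 3]
Lt_indptr = [0, 2, 4, 6, 7]

print(solve_pattern(Lt_indices, Lt_indptr, [0]))  # [0, 2, 3]
```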
<p>This means two things. Firstly, if <img src="https://latex.codecogs.com/png.latex?A_%7Bji%7D%5Cneq%200">, then <img src="https://latex.codecogs.com/png.latex?x%5E%7B(i)%7D_j%20%5Cneq%200">. Secondly, if <img src="https://latex.codecogs.com/png.latex?x%5E%7B(i)%7D_j%20%5Cneq%200"> <em>and</em> <img src="https://latex.codecogs.com/png.latex?L_%7Bkj%7D%5Cneq%200">, then <img src="https://latex.codecogs.com/png.latex?x%5E%7B(i)%7D_k%20%5Cneq%200">. These two facts give us a way of finding the non-zero set of <img src="https://latex.codecogs.com/png.latex?x%5E%7B(i)%7D"> if we remember just one more fact about the elimination tree: if <img src="https://latex.codecogs.com/png.latex?L_%7Bkj%7D%20%5Cneq%200">, then <img src="https://latex.codecogs.com/png.latex?k"> is an ancestor of <img src="https://latex.codecogs.com/png.latex?j"> in the elimination tree.</p>
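<p>To make the elimination tree fact concrete, here is a toy check (my own example): the parent of <code>j</code> is the smallest row index below the diagonal in column <code>j</code> of <code>L</code>, and every below-diagonal non-zero in that column lies on the path from <code>j</code> to the root.</p>

```python
# Toy check: build elimination-tree parents from the pattern of L (CSC form),
# then verify that every below-diagonal non-zero of column j is an ancestor of j.
def etree_parents_from_L(Lt_indices, Lt_indptr, n):
    parent = [-1] * n
    for j in range(n):
        # parent(j) = smallest k > j with L[k, j] != 0
        below = [k for k in Lt_indices[Lt_indptr[j]:Lt_indptr[j + 1]] if k > j]
        if below:
            parent[j] = min(below)
    return parent

def ancestors(parent, j):
    out = set()
    while parent[j] != -1:
        j = parent[j]
        out.add(j)
    return out

# 4x4 lower-triangular pattern: column 0 has rows {0, 2}, column 1 has {1, 3},
# column 2 has {2, 3}, column 3 has {3}
Lt_indices = [0, 2, 1, 3, 2, 3, 3]
Lt_indptr = [0, 2, 4, 6, 7]
parent = etree_parents_from_L(Lt_indices, Lt_indptr, 4)
print(parent)  # [2, 3, 3, -1]

# every below-diagonal non-zero of column j is an ancestor of j
for j in range(4):
    anc = ancestors(parent, j)
    assert all(k in anc for k in Lt_indices[Lt_indptr[j]:Lt_indptr[j + 1]] if k > j)
```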
<p>This reduces the problem of finding the non-zero elements of <img src="https://latex.codecogs.com/png.latex?x%5E%7B(i)%7D"> to the problem of finding all of the ancestors of <img src="https://latex.codecogs.com/png.latex?%5C%7Bj:%20A_%7Bji%7D%20%5Cneq%200%5C%7D"> in the subtree <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BT%7D_%7Bi-1%7D">. And if there is one thing that people who are ok at programming are <em>excellent</em> at, it is walking up a damn tree.</p>
<p>So let’s do that. Well, I’ve already done it. In fact, that was how I found the column counts in the first place! With this interpretation, the outer loop takes us across the rows. Once I am in row <code>j</code><sup>18</sup>, I find a starting node <code>node</code> (which is a non-zero in <img src="https://latex.codecogs.com/png.latex?A_%7B1:(i-1),i%7D">) and I walk up the tree from that node, checking each time whether I’ve actually seen that node<sup>19</sup> before. If I haven’t seen it before, I add one to the column count of column <code>node</code><sup>20</sup>.</p>
<p>To allocate the non-zero structure, I just need to replace that counter increment with an assignment.</p>
</section>
<section id="attempt-1-lord-thats-slow" class="level3">
<h3 class="anchored" data-anchor-id="attempt-1-lord-thats-slow">Attempt 1: Lord that’s slow</h3>
<p>We will do the pure Python version first.</p>
<div class="cell" data-execution_count="23">
<div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb38-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> symbolic_cholesky_base(A_indices, A_indptr, parent, col_count):</span>
<span id="cb38-2">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb38-3">  col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.repeat(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb38-4">  col_ptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> np.cumsum(col_count) </span>
<span id="cb38-5">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(col_count), dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb38-6">  L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb38-7">  L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb38-8">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n</span>
<span id="cb38-9"></span>
<span id="cb38-10">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb38-11">    mark[i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i</span>
<span id="cb38-12">    L_indices[L_indptr[i]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i</span>
<span id="cb38-13"></span>
<span id="cb38-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> indptr <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(A_indptr[i], A_indptr[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb38-15">      node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb38-16">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> mark[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> i:</span>
<span id="cb38-17">        mark[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i</span>
<span id="cb38-18">        L_indices[col_ptr[node]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i</span>
<span id="cb38-19">        col_ptr[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb38-20">        node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> parent[node]</span>
<span id="cb38-21">  </span>
<span id="cb38-22">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (L_indices, L_indptr)</span></code></pre></div>
</div>
<p>Does it work?</p>
<div class="cell" data-execution_count="24">
<div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb39-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>)</span>
<span id="cb39-2">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree_base(A.indices, A.indptr)</span>
<span id="cb39-3"></span>
<span id="cb39-4">L_indices, L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky_base(A.indices, A.indptr, parent, col_count)</span>
<span id="cb39-5">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb39-6"></span>
<span id="cb39-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span>y <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (x,y) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(L_indices, L_indices_true)))</span>
<span id="cb39-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span>y <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (x,y) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(L_indptr, L_indptr_true)))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
</div>
<p>Fabulosa!</p>
<p>Now let’s do the compiled version.</p>
<div class="cell" data-execution_count="25">
<div class="sourceCode cell-code" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb41-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> functools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> partial</span>
<span id="cb41-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@partial</span>(jit, static_argnums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,))</span>
<span id="cb41-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> symbolic_cholesky(A_indices, A_indptr, L_indptr, parent, nnz):</span>
<span id="cb41-4">  </span>
<span id="cb41-5">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## innermost while loop</span></span>
<span id="cb41-6">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_while(val):</span>
<span id="cb41-7">    i, L_indices, L_indptr, node, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb41-8">    mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[node].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb41-9">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#p = </span></span>
<span id="cb41-10">    L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices.at[col_ptr[node]].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb41-11">    col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> col_ptr.at[node].add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb41-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (i, L_indices, L_indptr, parent[node], parent, col_ptr, mark)</span>
<span id="cb41-13"></span>
<span id="cb41-14">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> cond_while(val):</span>
<span id="cb41-15">    i, L_indices, L_indptr, node, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb41-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> lax.bitwise_and(lax.lt(node, i), lax.ne(mark[node], i))</span>
<span id="cb41-17"></span>
<span id="cb41-18">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Inner for loop</span></span>
<span id="cb41-19">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_inner_for(indptr, val):</span>
<span id="cb41-20">    i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb41-21">    node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb41-22">    i, L_indices, L_indptr, node, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.while_loop(cond_while, body_while, (i, L_indices, L_indptr, node, parent, col_ptr, mark))</span>
<span id="cb41-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark)</span>
<span id="cb41-24">  </span>
<span id="cb41-25">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Outer for loop</span></span>
<span id="cb41-26">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_out_for(i, val):</span>
<span id="cb41-27">     A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb41-28">     mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[i].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb41-29">     L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices.at[L_indptr[i]].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb41-30">     i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(A_indptr[i], A_indptr[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], body_inner_for, (i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark))</span>
<span id="cb41-31">     <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark)</span>
<span id="cb41-32"></span>
<span id="cb41-33">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Body of code</span></span>
<span id="cb41-34">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb41-35">  col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb41-36">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.zeros(nnz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb41-37">  </span>
<span id="cb41-38">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb41-39">  </span>
<span id="cb41-40">  init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark)</span>
<span id="cb41-41">  A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n, body_out_for, init)</span>
<span id="cb41-42">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices</span></code></pre></div>
</div>
<p>Now let’s check that it works.</p>
<div class="cell" data-execution_count="26">
<div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb42-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb42-2">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb42-3">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb42-4">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb42-5"></span>
<span id="cb42-6"></span>
<span id="cb42-7">L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb42-8">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb42-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices_true))</span>
<span id="cb42-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr_true))</span>
<span id="cb42-11"></span>
<span id="cb42-12">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">31</span>)</span>
<span id="cb42-13"></span>
<span id="cb42-14">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb42-15">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb42-16">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb42-17"></span>
<span id="cb42-18"></span>
<span id="cb42-19">L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb42-20">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb42-21"></span>
<span id="cb42-22"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices_true))</span>
<span id="cb42-23"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr_true))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
</div>
<p>Success!</p>
<p>One <em>minor</em> problem. This is slow. as. balls.</p>
<div class="cell" data-execution_count="27">
<div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb45-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb45-2">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree_base(A.indices, A.indptr)</span>
<span id="cb45-3">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: symbolic_cholesky_base(A.indices, A.indptr, parent, col_count),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb45-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb45-5"></span>
<span id="cb45-6">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb45-7">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree_base(A.indices, A.indptr)</span>
<span id="cb45-8">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: symbolic_cholesky_base(A.indices, A.indptr, parent, col_count),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb45-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 2500: [0.05, 0.04, 0.04, 0.04, 0.04]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 40000: [1.97, 2.09, 2.04, 2.03, 1.92]</code></pre>
</div>
</div>
<p>And here is our JAX’d and JIT’d code.</p>
<div class="cell" data-execution_count="28">
<div class="sourceCode cell-code" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb48-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb48-2">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb48-3">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb48-4">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb48-5">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>:symbolic_cholesky(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb48-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb48-7"></span>
<span id="cb48-8">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb48-9">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb48-10">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb48-11">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb48-12">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>:symbolic_cholesky(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb48-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 2500: [0.15]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 40000: [29.19]</code></pre>
</div>
</div>
<p>Oooof. Something is going horribly wrong.</p>
</section>
<section id="why-is-it-so-slow" class="level3">
<h3 class="anchored" data-anchor-id="why-is-it-so-slow">Why is it so slow?</h3>
<p>The first thing to check is whether it’s the compile time. We can do this by explicitly <em>lowering</em> the JIT’d function to its XLA representation and then compiling it.</p>
<div class="cell" data-execution_count="29">
<div class="sourceCode cell-code" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb51-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb51-2">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb51-3">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb51-4">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb51-5">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: jit(partial(symbolic_cholesky, nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))).lower(A.indices, A.indptr, L_indptr, parent).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb51-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Compilation time: n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb51-7"></span>
<span id="cb51-8">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb51-9">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb51-10">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb51-11">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb51-12">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: jit(partial(symbolic_cholesky, nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))).lower(A.indices, A.indptr, L_indptr, parent).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb51-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Compilation time: n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Compilation time: n = 2500: [0.15, 0.15, 0.15, 0.16, 0.15]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>Compilation time: n = 40000: [0.16, 0.15, 0.14, 0.16, 0.15]</code></pre>
</div>
</div>
<p>It is not the compile time.</p>
<p>And that is actually a good thing because that suggests that we aren’t having problems with the compiler unrolling all of our wonderful loops! But that does mean that we have to look a bit deeper into the code. Some smart people would probably be able to look at the <code>jaxpr</code> intermediate representation to diagnose the problem. But I couldn’t see anything there.</p>
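<p>If you do want to poke at the intermediate representation yourself, a minimal sketch (using a toy function, not the symbolic factorisation above) looks like this:</p>

```python
import jax
import jax.numpy as jnp

def toy(x):
    # a trivial indexed update plus an elementwise op,
    # just to give the jaxpr something to show
    return x.at[0].set(1.0) * 2.0

# make_jaxpr traces the function and returns its jaxpr IR
jaxpr = jax.make_jaxpr(toy)(jnp.zeros(3))
print(jaxpr)
```

<p>The printed jaxpr lists every primitive JAX traced, which is sometimes enough to spot an accidental unroll or an unexpected gather/scatter. Here it wasn’t.</p>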
<p>Instead I thought <em>if I were a clever, efficient compiler, what would I have problems with?</em> And the answer is the classic sparse matrix answer: indirect indexing.</p>
<p>The only structural difference between the <code>etree</code> function and the <code>symbolic_cholesky</code> function is this line in the <code>body_while()</code> function:</p>
<div class="cell" data-execution_count="30">
<div class="sourceCode cell-code" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb54-1"> L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices.at[col_ptr[node]].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span></code></pre></div>
</div>
<p>In order to evaluate this code, the compiler has to resolve <em>two levels</em> of indirection. By contrast, the indexing in <code>etree()</code> was always direct. So let’s see what happens if we take the same function and remove that double indirection.</p>
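<p>To make the distinction concrete, here is a tiny sketch (with made-up arrays, nothing to do with the real factorisation) of a direct write versus a doubly-indirect one:</p>

```python
import jax.numpy as jnp

vals = jnp.zeros(4, dtype=int)
ptr = jnp.array([2, 0, 3, 1])  # hypothetical pointer array

node = 2
# direct indexing: the write location is the loop variable itself
direct = vals.at[node].set(7)

# double indirection: the write location must itself be gathered first
indirect = vals.at[ptr[node]].set(7)

print(direct)    # the 7 lands at position 2
print(indirect)  # the 7 lands at position ptr[2] == 3
```

<p>In the indirect case the compiler has to emit a gather to find the destination before it can emit the scatter, and it can assume much less about where the write goes.</p>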
<div class="cell" data-execution_count="31">
<div class="sourceCode cell-code" id="cb55" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb55-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@partial</span>(jit, static_argnums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,))</span>
<span id="cb55-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> test_fun(A_indices, A_indptr, L_indptr, parent, nnz):</span>
<span id="cb55-3">  </span>
<span id="cb55-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## innermost while loop</span></span>
<span id="cb55-5">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_while(val):</span>
<span id="cb55-6">    i, L_indices, L_indptr, node, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb55-7">    mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[node].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb55-8">    L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices.at[node].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb55-9">    col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> col_ptr.at[node].add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb55-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (i, L_indices, L_indptr, parent[node], parent, col_ptr, mark)</span>
<span id="cb55-11"></span>
<span id="cb55-12">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> cond_while(val):</span>
<span id="cb55-13">    i, L_indices, L_indptr, node, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb55-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> lax.bitwise_and(lax.lt(node, i), lax.ne(mark[node], i))</span>
<span id="cb55-15"></span>
<span id="cb55-16">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Inner for loop</span></span>
<span id="cb55-17">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_inner_for(indptr, val):</span>
<span id="cb55-18">    i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb55-19">    node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb55-20">    i, L_indices, L_indptr, node, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.while_loop(cond_while, body_while, (i, L_indices, L_indptr, node, parent, col_ptr, mark))</span>
<span id="cb55-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark)</span>
<span id="cb55-22">  </span>
<span id="cb55-23">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Outer for loop</span></span>
<span id="cb55-24">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_out_for(i, val):</span>
<span id="cb55-25">     A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb55-26">     mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[i].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb55-27">     L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices.at[L_indptr[i]].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb55-28">     i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(A_indptr[i], A_indptr[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], body_inner_for, (i, A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark))</span>
<span id="cb55-29">     <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark)</span>
<span id="cb55-30"></span>
<span id="cb55-31">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Body of code</span></span>
<span id="cb55-32">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb55-33">  col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb55-34">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.zeros(nnz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb55-35">  </span>
<span id="cb55-36">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb55-37">  </span>
<span id="cb55-38">  init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark)</span>
<span id="cb55-39">  A_indices, A_indptr, L_indices, L_indptr, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n, body_out_for, init)</span>
<span id="cb55-40">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices</span>
<span id="cb55-41"></span>
<span id="cb55-42">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb55-43">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb55-44">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb55-45">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb55-46">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>:test_fun(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb55-47"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb55-48"></span>
<span id="cb55-49">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb55-50">parent, col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb55-51">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb55-52">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb55-53">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>:test_fun(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb55-54"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 2500: [0.14]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 40000: [0.17]</code></pre>
</div>
</div>
<p>That isn’t conclusive, but it does indicate that this might<sup>21</sup> be the problem.</p>
<p>And this is a <em>big</em> problem for us! The sparse Cholesky algorithm has similar amounts of indirection. So we need to fix it.</p>
</section>
<section id="attempt-2-after-some-careful-thought-things-stayed-the-same" class="level3">
<h3 class="anchored" data-anchor-id="attempt-2-after-some-careful-thought-things-stayed-the-same">Attempt 2: After some careful thought, things stayed the same</h3>
<p>Now. I want to pretend that I’ve got elegant ideas about this. But I don’t. So let’s just do it. The most obvious thing to do is to use the algorithm to get the non-zero structure of the <em>rows</em> of <img src="https://latex.codecogs.com/png.latex?L">. These are the things that are being indexed by <code>col_ptr[node]</code>, so if we have these explicitly we don’t need multiple indirection. We also don’t need a while loop.</p>
<p>In fact, if we have the non-zero structure of the rows of <img src="https://latex.codecogs.com/png.latex?L">, we can turn that into the non-zero structure of the columns in linear-ish<sup>22</sup> time.</p>
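<p>That row-to-column conversion is essentially a counting sort. A sketch in plain NumPy, on a hypothetical 4-by-4 pattern (not code from this post): count the nonzeros per column, cumsum the counts into column pointers, then make one more linear pass dropping each entry into its column bucket.</p>

```python
import numpy as np

# Hypothetical CSR pattern of a 4x4 lower-triangular L:
# row i holds the column indices of its nonzeros.
row_indptr = np.array([0, 1, 3, 4, 7])
col_of_nnz = np.array([0, 0, 1, 2, 0, 1, 3])

n = len(row_indptr) - 1

# Count the nonzeros in each column, then cumsum into column pointers.
col_count = np.bincount(col_of_nnz, minlength=n)
col_indptr = np.zeros(n + 1, dtype=int)
col_indptr[1:] = np.cumsum(col_count)

# One more pass drops each entry's row index into its column bucket.
row_of_nnz = np.empty_like(col_of_nnz)
next_slot = col_indptr[:-1].copy()
for i in range(n):
    for p in range(row_indptr[i], row_indptr[i + 1]):
        j = col_of_nnz[p]
        row_of_nnz[next_slot[j]] = i
        next_slot[j] += 1

print(col_indptr)   # [0 3 5 6 7]
print(row_of_nnz)   # each column's rows, in increasing order
```

<p>Both passes touch each nonzero a constant number of times, which is where the linear-ish claim comes from.</p>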
<p>All we need to do is make sure that our <code>etree()</code> function is also counting the number of nonzeros in each row.</p>
<div class="cell" data-execution_count="32">
<div class="sourceCode cell-code" id="cb58" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb58-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@jit</span></span>
<span id="cb58-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> etree(A_indices, A_indptr):</span>
<span id="cb58-3"> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># print("(Re-)compiling etree(A_indices, A_indptr)")</span></span>
<span id="cb58-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## innermost while loop</span></span>
<span id="cb58-5">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_while(val):</span>
<span id="cb58-6">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#  print(val)</span></span>
<span id="cb58-7">    j, node, parent, col_count, row_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb58-8">    update_parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].at[x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb58-9">    parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.cond(lax.eq(parent[node], <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), update_parent, <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], (parent, node, j))</span>
<span id="cb58-10">    mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[node].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(j)</span>
<span id="cb58-11">    col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> col_count.at[node].add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb58-12">    row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> row_count.at[j].add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb58-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (j, parent[node], parent, col_count, row_count, mark)</span>
<span id="cb58-14"></span>
<span id="cb58-15">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> cond_while(val):</span>
<span id="cb58-16">    j, node, parent, col_count, row_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb58-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> lax.bitwise_and(lax.lt(node, j), lax.ne(mark[node], j))</span>
<span id="cb58-18"></span>
<span id="cb58-19">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Inner for loop</span></span>
<span id="cb58-20">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_inner_for(indptr, val):</span>
<span id="cb58-21">    j, A_indices, A_indptr, parent, col_count, row_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb58-22">    node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb58-23">    j, node, parent, col_count, row_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.while_loop(cond_while, body_while, (j, node, parent, col_count, row_count, mark))</span>
<span id="cb58-24">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (j, A_indices, A_indptr, parent, col_count, row_count, mark)</span>
<span id="cb58-25">  </span>
<span id="cb58-26">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Outer for loop</span></span>
<span id="cb58-27">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_out_for(j, val):</span>
<span id="cb58-28">     A_indices, A_indptr, parent, col_count, row_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb58-29">     mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[j].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(j)</span>
<span id="cb58-30">     j, A_indices, A_indptr, parent, col_count, row_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(A_indptr[j], A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], body_inner_for, (j, A_indices, A_indptr, parent, col_count, row_count, mark))</span>
<span id="cb58-31">     <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_indices, A_indptr, parent, col_count, row_count, mark)</span>
<span id="cb58-32"></span>
<span id="cb58-33">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Body of code</span></span>
<span id="cb58-34">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb58-35">  parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb58-36">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb58-37">  col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,  n)</span>
<span id="cb58-38">  row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb58-39">  init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (A_indices, A_indptr, parent, col_count, row_count, mark)</span>
<span id="cb58-40">  A_indices, A_indptr, parent, col_count, row_count, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n, body_out_for, init)</span>
<span id="cb58-41">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (parent, col_count, row_count)</span></code></pre></div>
</div>
<p>Let’s check that the code is actually doing what I want.</p>
<div class="cell" data-execution_count="33">
<div class="sourceCode cell-code" id="cb59" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb59-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">57</span>)</span>
<span id="cb59-2">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb59-3">L_indices, L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb59-4"></span>
<span id="cb59-5">true_parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices[L_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb59-6">true_parent[np.where(np.diff(L_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb59-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> y <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (x,y) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(parent[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], true_parent)))</span>
<span id="cb59-8"></span>
<span id="cb59-9">true_col_count  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.diff(L_indptr)</span>
<span id="cb59-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(true_col_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> col_count))</span>
<span id="cb59-11"></span>
<span id="cb59-12">true_row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array([<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(np.where(L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> i)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">57</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)])</span>
<span id="cb59-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(true_row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> row_count))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>True</code></pre>
</div>
</div>
<p>Excellent! With this we can modify our previous function to give us the row-indices of the non-zero pattern instead. Just for further chaos, please note that we are using a CSC representation of <img src="https://latex.codecogs.com/png.latex?A"> to get a CSR representation of <img src="https://latex.codecogs.com/png.latex?L">.</p>
<p>Once again, we will prototype in pure Python and then translate to JAX. The thing to look out for this time is that we <em>know</em> how many non-zeros there are in a row and we know where we need to put them. This suggests that we can compute these things in <code>body_inner_for</code> and then do a vectorised version of our indirect indexing. This should compile down to a single <a href="https://www.tensorflow.org/xla/operation_semantics#scatter">XLA <code>scatter</code> call</a>. This will reduce the overall number of <code>scatter</code> calls from <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bnnz%7D(L)"> to <img src="https://latex.codecogs.com/png.latex?n">. And hopefully this will fix things.</p>
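<p>To see why the vectorised write matters, here is a toy comparison (a sketch; the array sizes and positions are made up). A Python loop of single-element <code>.at[].set()</code> updates lowers to one scatter per element, while handing over the whole index vector at once lowers to a single scatter, and both produce the same array.</p>

```python
import jax.numpy as jnp

positions = [4, 1, 3]  # made-up slots to write into

# One scatter per element: each .at[].set() in the loop becomes
# its own XLA scatter when traced.
x_loop = jnp.zeros(6, dtype=int)
for p in positions:
    x_loop = x_loop.at[p].set(7)

# One scatter for the batch: the whole index vector goes in at once.
x_vec = jnp.zeros(6, dtype=int).at[jnp.array(positions)].set(7)
```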
<div class="cell" data-execution_count="34">
<div class="sourceCode cell-code" id="cb62" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb62-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> symbolic_cholesky2_base(A_indices, A_indptr, L_indptr, row_count, parent, nnz):</span>
<span id="cb62-2"></span>
<span id="cb62-3">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb62-4">  col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb62-5">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb62-6">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n</span>
<span id="cb62-7"></span>
<span id="cb62-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb62-9">    mark[i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i</span>
<span id="cb62-10">    row_ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.repeat(nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, row_count[i])</span>
<span id="cb62-11">    row_ind[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indptr[i]</span>
<span id="cb62-12">    counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb62-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> indptr <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(A_indptr[i], A_indptr[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb62-14">      node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb62-15">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> mark[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> i:</span>
<span id="cb62-16">        mark[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i</span>
<span id="cb62-17">        row_ind[counter] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> col_ptr[node]</span>
<span id="cb62-18">        col_ptr[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb62-19">        node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> parent[node]</span>
<span id="cb62-20">        counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb62-21">    L_indices[row_ind] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i</span>
<span id="cb62-22">  </span>
<span id="cb62-23">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices</span>
<span id="cb62-24"></span>
<span id="cb62-25"></span>
<span id="cb62-26">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>)</span>
<span id="cb62-27">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb62-28">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb62-29">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb62-30"></span>
<span id="cb62-31"></span>
<span id="cb62-32">L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky2_base(A.indices, A.indptr, L_indptr, row_count, parent, L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb62-33">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb62-34"></span>
<span id="cb62-35"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span>y <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (x,y) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(L_indices, L_indices_true)))</span>
<span id="cb62-36"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span>y <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (x,y) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(L_indptr, L_indptr_true)))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
</div>
<p>Excellent. Now let’s JAX this. The JAX-heads among you will notice that we have a subtle<sup>23</sup> problem: in a <code>fori_loop</code>, JAX does not treat <code>i</code> as static, which means that the length of the repeat (<code>row_count[i]</code>) can never be static, and JAX cannot trace an operation whose output shape depends on a traced value.</p>
<p>Shit.</p>
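<p>A minimal reproduction of the problem (a sketch; the array contents are made up). Inside the loop body <code>i</code> is a tracer, so <code>row_count[i]</code> is traced, and <code>jnp.repeat</code> refuses it as a repeat count.</p>

```python
import jax
import jax.numpy as jnp
from jax import lax

row_count = jnp.array([2, 3, 1])  # made-up per-row nonzero counts

def body(i, acc):
    # i is a tracer inside fori_loop, so row_count[i] is traced too,
    # and jnp.repeat needs a concrete (static) repeat count. This
    # raises a ConcretizationTypeError; we catch it via its
    # JAXTypeError base to be safe across versions.
    pad = jnp.repeat(0, row_count[i])
    return acc

try:
    lax.fori_loop(0, 3, body, 0)
    shape_error = False
except jax.errors.JAXTypeError:
    shape_error = True
```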
<p>It is hard to think of a good option here. A few months back Junpeng Lao<sup>24</sup> <a href="https://gist.github.com/junpenglao/f5b48c34dd8ea5029202fb607806ea0f#file-sparse-cholesky-in-jax-ipynb">sent me a script</a> with his attempts at making the Cholesky stuff JAX transformable. And he hit the same problem. I was, in an act of hubris, trying very hard to not end up here. But that was tragically slow. So here we are.</p>
<p>He came up with two methods.</p>
<ol type="1">
<li><p>Pad out <code>row_ind</code> so it’s always long enough. This only costs memory. The maximum size of <code>row_ind</code> is <code>n</code>. Unfortunately, that worst case occurs whenever <img src="https://latex.codecogs.com/png.latex?A"> has a dense row. Sadly, for Bayesian<sup>25</sup> linear mixed models this will happen if we put Gaussian priors on the covariate coefficients<sup>26</sup> and we try to marginalise them out with the other multivariate Gaussian parts. It is possible to write routines that deal with dense rows and columns explicitly, but it’s a pain in the arse.</p></li>
<li><p>Do some terrifying work with <code>lax.scan</code> and dynamic slicing.</p></li>
</ol>
<p>I’m going to try the first of these options.</p>
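<p>Before the full routine, here is the padding trick in miniature (a sketch with made-up sizes; the key move is the out-of-range sentinel plus <code>mode="drop"</code>, which silently discards out-of-bounds indices instead of erroring):</p>

```python
import jax.numpy as jnp

# Hypothetical sizes: a factor with nnz = 10 stored entries, and no
# row of L longer than max_row = 4.
nnz, max_row = 10, 4
L_indices = -jnp.ones(nnz, dtype=int)

# Pretend row i = 5 of L really has two entries, destined for slots
# 3 and 7. Pad row_ind to the fixed length max_row with an
# out-of-range sentinel (nnz + 1) so its shape is always static.
row_ind = jnp.full(max_row, nnz + 1)
row_ind = row_ind.at[:2].set(jnp.array([3, 7]))

# mode="drop" discards the out-of-bounds padding, so the whole row
# is written with one fixed-size scatter.
L_indices = L_indices.at[row_ind].set(5, mode="drop")
```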
<div class="cell" data-execution_count="35">
<div class="sourceCode cell-code" id="cb64" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb64-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@partial</span>(jit, static_argnums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb64-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> symbolic_cholesky2(A_indices, A_indptr, L_indptr, row_count, parent, nnz, max_row):</span>
<span id="cb64-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## innermost while loop</span></span>
<span id="cb64-4">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_while(val):</span>
<span id="cb64-5">    i, counter, row_ind, node, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb64-6">    mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[node].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb64-7">    row_ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> row_ind.at[counter].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(col_ptr[node])</span>
<span id="cb64-8">    col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> col_ptr.at[node].add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb64-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (i, counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, row_ind, parent[node], col_ptr, mark)</span>
<span id="cb64-10"></span>
<span id="cb64-11">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> cond_while(val):</span>
<span id="cb64-12">    i, counter, row_ind, node, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb64-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> lax.bitwise_and(lax.lt(node, i), lax.ne(mark[node], i))</span>
<span id="cb64-14"></span>
<span id="cb64-15">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Inner for loop</span></span>
<span id="cb64-16">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_inner_for(indptr, val):</span>
<span id="cb64-17">    i, counter, row_ind, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb64-18">    node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[indptr]</span>
<span id="cb64-19">    i, counter, row_ind, node, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.while_loop(cond_while, body_while, (i, counter, row_ind, node, col_ptr, mark))</span>
<span id="cb64-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (i, counter, row_ind, parent, col_ptr, mark)</span>
<span id="cb64-21">  </span>
<span id="cb64-22">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Outer for loop</span></span>
<span id="cb64-23">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_out_for(i, val):</span>
<span id="cb64-24">     L_indices, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb64-25">     mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[i].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb64-26">     row_ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(nnz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, max_row)</span>
<span id="cb64-27">     row_ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> row_ind.at[row_count[i]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(L_indptr[i])</span>
<span id="cb64-28">     counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb64-29"></span>
<span id="cb64-30">     i, counter, row_ind, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(A_indptr[i], A_indptr[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], body_inner_for, (i, counter, row_ind, parent, col_ptr, mark))</span>
<span id="cb64-31"></span>
<span id="cb64-32">     L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_indices.at[row_ind].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i, mode <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"drop"</span>)</span>
<span id="cb64-33">     <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (L_indices, parent, col_ptr, mark)</span>
<span id="cb64-34"></span>
<span id="cb64-35">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Body of code</span></span>
<span id="cb64-36">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb64-37"></span>
<span id="cb64-38">  col_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb64-39">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(nnz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb64-40">  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb64-41"></span>
<span id="cb64-42">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Make everything a jnp array. Really should use jaxtyping</span></span>
<span id="cb64-43">  A_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_indices)</span>
<span id="cb64-44">  A_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_indptr)</span>
<span id="cb64-45">  L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(L_indptr)</span>
<span id="cb64-46">  row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(row_count)</span>
<span id="cb64-47">  parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(parent)</span>
<span id="cb64-48"></span>
<span id="cb64-49">  init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (L_indices, parent, col_ptr, mark)</span>
<span id="cb64-50">  L_indices, parent, col_ptr, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n, body_out_for, init)</span>
<span id="cb64-51">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices</span></code></pre></div>
</div>
<p>Ok. Let’s see if that worked.</p>
<div class="cell" data-execution_count="36">
<div class="sourceCode cell-code" id="cb65" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb65-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb65-2">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb65-3">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb65-4">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb65-5"></span>
<span id="cb65-6"></span>
<span id="cb65-7">L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky2(A.indices, A.indptr, L_indptr, row_count, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), max_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(row_count)))</span>
<span id="cb65-8">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb65-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices_true))</span>
<span id="cb65-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr_true))</span>
<span id="cb65-11"></span>
<span id="cb65-12">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">31</span>)</span>
<span id="cb65-13">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb65-14">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb65-15">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb65-16"></span>
<span id="cb65-17"></span>
<span id="cb65-18">L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky2(A.indices, A.indptr, L_indptr, row_count, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), max_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(row_count)))</span>
<span id="cb65-19">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb65-20"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices_true))</span>
<span id="cb65-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr_true))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True</code></pre>
</div>
</div>
<p>Ok. Once more into the breach. Is this any better?</p>
<div class="cell" data-execution_count="37">
<div class="sourceCode cell-code" id="cb68" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb68-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb68-2">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb68-3">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb68-4">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb68-5">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>:symbolic_cholesky2(A.indices, A.indptr, L_indptr, row_count, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), max_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(row_count))),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb68-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb68-7"></span>
<span id="cb68-8">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb68-9">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb68-10">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb68-11">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb68-12">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>:symbolic_cholesky2(A.indices, A.indptr, L_indptr, row_count, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), max_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(row_count))),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb68-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb68-14"></span>
<span id="cb68-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A_indices, A_indptr, A_x, A = make_matrix(300)</span></span>
<span id="cb68-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># parent, col_count, row_count = etree(A.indices, A.indptr)</span></span>
<span id="cb68-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L_indptr = np.zeros(A.shape[0]+1, dtype=int)</span></span>
<span id="cb68-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L_indptr[1:] = np.cumsum(col_count)</span></span>
<span id="cb68-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># times = timeit.repeat(lambda:symbolic_cholesky2(A.indices, A.indptr, L_indptr, row_count, parent, nnz = int(L_indptr[-1]), max_row = int(max(row_count))),number = 1, repeat = 1)</span></span>
<span id="cb68-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># print(f"n = {A.shape[0]}: {[round(t,2) for t in times]}")</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 2500: [0.28]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 40000: [28.31]</code></pre>
</div>
</div>
<p>Fuck.</p>
</section>
<section id="attempt-3-a-desperate-attempt-to-make-this-bloody-work" class="level3">
<h3 class="anchored" data-anchor-id="attempt-3-a-desperate-attempt-to-make-this-bloody-work">Attempt 3: A desperate attempt to make this bloody work</h3>
<p>Right. Let’s try again. What if, instead of doing all those scatters, we, idk, just store two vectors and sort? Because at this point I will try fucking anything. What if we just list out the column index and row index as we find them (aka build the sparse matrix in COO<sup>27</sup> format)? The <code>jax.experimental.sparse</code> module has support for (blocked) COO objects but doesn’t implement this transformation. <code>scipy.sparse</code> has a fast conversion routine, so I’m going to use it. In the interest of being 100% JAX, I tried a version with <code>index[1][jnp.lexsort((index[1], index[0]))]</code>, which does basically the same thing but is a lot slower.</p>
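<p>Before the full implementation, here is a minimal standalone sketch of the trick just described (the tiny matrix and index values are made up for illustration; they are not from the post): build unsorted COO triples, let <code>scipy.sparse</code> sort them into CSC order, and note that a <code>lexsort</code> with column as the primary key and row as the secondary key recovers the same ordering.</p>

```python
import numpy as np
from scipy import sparse

# Hypothetical tiny example: COO entries listed in "discovery" order,
# i.e. not grouped by column. rows[k], cols[k] is the k-th nonzero.
rows = np.array([0, 2, 1, 2, 0])
cols = np.array([0, 1, 0, 2, 2])
n = 3

# scipy's COO -> CSC conversion groups entries by column and sorts
# the row indices within each column for us.
csc = sparse.coo_array(
    (np.ones(len(rows)), (rows, cols)), shape=(n, n)
).tocsc()
print(csc.indices)  # row indices in CSC order
print(csc.indptr)   # column pointers

# The pure-array equivalent of the jnp.lexsort version: lexsort uses
# the LAST key as the primary key, so this sorts by column, then row.
order = np.lexsort((rows, cols))
print(rows[order])  # matches csc.indices
```

<p>The same <code>lexsort</code> call works with <code>jnp</code> arrays; scipy’s compiled conversion routine is just faster.</p>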
<div class="cell" data-execution_count="38">
<div class="sourceCode cell-code" id="cb71" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb71-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> symbolic_cholesky3(A_indices, A_indptr, L_indptr, parent, nnz):</span>
<span id="cb71-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@partial</span>(jit, static_argnums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,))</span>
<span id="cb71-3">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _inner(A_indices_, A_indptr_, L_indptr, parent, nnz):</span>
<span id="cb71-4">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Make everything a jnp array. Really should use jaxtyping</span></span>
<span id="cb71-5">    A_indices_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_indices_)</span>
<span id="cb71-6">    A_indptr_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_indptr_)</span>
<span id="cb71-7">    L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(L_indptr)</span>
<span id="cb71-8">    parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(parent)</span>
<span id="cb71-9"></span>
<span id="cb71-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## innermost while loop</span></span>
<span id="cb71-11">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_while(val):</span>
<span id="cb71-12">      index, i, counter,  node,  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb71-13">      mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[node].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb71-14">      index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].at[counter].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(node) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#column</span></span>
<span id="cb71-15">      index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].at[counter].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># row</span></span>
<span id="cb71-16">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (index, i, counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, parent[node], mark)</span>
<span id="cb71-17"></span>
<span id="cb71-18">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> cond_while(val):</span>
<span id="cb71-19">      index, i, counter,  node,  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb71-20">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> lax.bitwise_and(lax.lt(node, i), lax.ne(mark[node], i))</span>
<span id="cb71-21"></span>
<span id="cb71-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Inner for loop</span></span>
<span id="cb71-23">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_inner_for(indptr, val):</span>
<span id="cb71-24">      index, i, counter, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb71-25">      node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices_[indptr]</span>
<span id="cb71-26">      </span>
<span id="cb71-27">      index, i, counter,  node,  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.while_loop(cond_while, body_while, (index, i, counter,  node,  mark))</span>
<span id="cb71-28">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (index, i, counter,  mark)</span>
<span id="cb71-29">    </span>
<span id="cb71-30">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Outer for loop</span></span>
<span id="cb71-31">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_out_for(i, val):</span>
<span id="cb71-32">      index, counter,  mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val</span>
<span id="cb71-33">      mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mark.at[i].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb71-34">      index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].at[counter].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb71-35">      index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].at[counter].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(i)</span>
<span id="cb71-36">      counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb71-37">      index, i, counter, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(A_indptr_[i], A_indptr_[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], body_inner_for, (index, i, counter,  mark))</span>
<span id="cb71-38"></span>
<span id="cb71-39">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (index, counter,  mark)</span>
<span id="cb71-40"></span>
<span id="cb71-41">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Body of code</span></span>
<span id="cb71-42">    n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr_) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb71-43">    mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.repeat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb71-44"></span>
<span id="cb71-45">    index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [jnp.zeros(nnz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>), jnp.zeros(nnz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)]</span>
<span id="cb71-46">    counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb71-47"></span>
<span id="cb71-48">    init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (index, counter, mark)</span>
<span id="cb71-49">    index, counter, mark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.fori_loop(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n, body_out_for, init)</span>
<span id="cb71-50">    </span>
<span id="cb71-51">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> index</span>
<span id="cb71-52">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb71-53">  index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _inner(A_indices, A_indptr, L_indptr, parent, nnz)</span>
<span id="cb71-54">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## return index[1][jnp.lexsort((index[1], index[0]))]</span></span>
<span id="cb71-55">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse.coo_array((np.ones(nnz), (index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])), shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (n,n)).tocsc().indices</span></code></pre></div>
</div>
<p>First things first, let’s check that this gets the right answer, and then see how fast it is.</p>
<div class="cell" data-execution_count="39">
<div class="sourceCode cell-code" id="cb72" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb72-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span>
<span id="cb72-2">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb72-3">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb72-4">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb72-5">L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky3(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))</span>
<span id="cb72-6">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb72-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices_true))</span>
<span id="cb72-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr_true))</span>
<span id="cb72-9"></span>
<span id="cb72-10">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">31</span>)</span>
<span id="cb72-11">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb72-12">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb72-13">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb72-14">L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> symbolic_cholesky3(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))</span>
<span id="cb72-15">L_indices_true, L_indptr_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb72-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices_true))</span>
<span id="cb72-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr_true))</span>
<span id="cb72-18"></span>
<span id="cb72-19">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb72-20">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb72-21">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb72-22">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb72-23">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: symbolic_cholesky3(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb72-24"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb72-25"></span>
<span id="cb72-26">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb72-27">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb72-28">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb72-29">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb72-30">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: symbolic_cholesky3(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb72-31"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb72-32"></span>
<span id="cb72-33">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>)</span>
<span id="cb72-34">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb72-35">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb72-36">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb72-37">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: symbolic_cholesky3(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb72-38"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb72-39"></span>
<span id="cb72-40">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>)</span>
<span id="cb72-41">parent, col_count, row_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> etree(A.indices, A.indptr)</span>
<span id="cb72-42">L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(A.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb72-43">L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum(col_count)</span>
<span id="cb72-44">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span>: symbolic_cholesky3(A.indices, A.indptr, L_indptr, parent, nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(L_indptr[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])),number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, repeat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb72-45"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>A<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>True
True
True
True</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 2500: [0.13, 0.13, 0.13, 0.14, 0.14]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 40000: [0.19, 0.19, 0.19, 0.19, 0.19]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 90000: [0.43, 0.32, 0.32, 0.33, 0.33]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 1000000: [11.91]</code></pre>
</div>
</div>
<p>You know what? I’ll take it. It’s not perfect; in particular, I would prefer a pure JAX solution. But everything I tried ran hard into the indirect memory access issue. The best I found used <code>jnp.lexsort</code>, but even that showed noticeable performance degradation relative to the scipy solution as <code>nnz</code> increased.</p>
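<p>For reference, the sort-based CSR-to-CSC conversion sketched in footnote 22 looks roughly like this in plain numpy (a sketch only: <code>np.lexsort</code> stands in for <code>jnp.lexsort</code>, and the function name is my own invention):</p>

```python
import numpy as np

def csr_to_csc(indices, indptr, x, n_col):
    """CSR -> CSC by expanding to COO and sorting on (col, row).
    A plain-numpy stand-in for the jnp.lexsort approach."""
    n_row = len(indptr) - 1
    rows = np.repeat(np.arange(n_row), np.diff(indptr))  # CSR -> COO row ids
    order = np.lexsort((rows, indices))  # primary key: column; tie-break: row
    csc_indptr = np.zeros(n_col + 1, dtype=int)
    csc_indptr[1:] = np.cumsum(np.bincount(indices, minlength=n_col))
    return rows[order], csc_indptr, x[order]

# A = [[1, 0, 2], [0, 3, 0]] stored in CSR
csc_indices, csc_indptr, csc_x = csr_to_csc(
    np.array([0, 2, 1]), np.array([0, 2, 3]), np.array([1.0, 2.0, 3.0]), 3
)
```

<p>This is the linear pass, sort, linear pass pattern from the footnote: the sort is the <code>nnz</code>-dependent part that caused the slowdown.</p>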
</section>
</section>
<section id="next-time-on-sparse-matrices-with-dan" class="level1">
<h1>Next time on Sparse Matrices with Dan</h1>
<p>So that’s where I’m going to leave it. I am off my flight and I’ve slept very well and now I’m going to be on holidays for a little while.</p>
<p>The next big thing to do is look at the numerical factorisation. We are going to run headlong into all of the problems we’ve hit today, so that should be fun. The reason I’m splitting it into a separate post<sup>28</sup> is that I want to actually test all of those things out properly.</p>
<p>So next time you can expect</p>
<ol type="1">
<li><p>Classes! Because frankly this code is getting far too messy, especially now that certain things need to be passed as static arguments. The only reason I’ve avoided it up to now is that I think it hides too much of the algorithm in boilerplate. But now the boilerplate is ruining my life and causing far too many dumb typos<sup>29</sup>.</p></li>
<li><p>Type hints! Because for a language where types aren’t explicit, they sure are important. Also because I’m going to class it up I might as well do it properly.</p></li>
<li><p>Some helper routines! I’m going to need a sparse-matrix scatter operation (aka copying <img src="https://latex.codecogs.com/png.latex?A"> into the sparsity pattern of <img src="https://latex.codecogs.com/png.latex?L">)! And I’m certainly going to need some reorderings<sup>30</sup>.</p></li>
<li><p>A battle royale between padded and non-padded methods!</p></li>
</ol>
<p>It should be a fun time!</p>
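<p>As a teaser for point 3, here is roughly what that scatter operation could look like in plain numpy. This is a sketch under my own assumptions (CSC inputs with sorted row indices within each column; the function name is invented), not necessarily the implementation I’ll end up with:</p>

```python
import numpy as np

def scatter_to_pattern(A_indices, A_indptr, A_x, L_indices, L_indptr):
    """Copy the lower triangle of a CSC matrix A into the (bigger) sparsity
    pattern of L, leaving explicit zeros wherever L has fill-in.
    Assumes row indices within each column are sorted."""
    n = len(A_indptr) - 1
    L_x = np.zeros(len(L_indices))
    for j in range(n):
        rows = L_indices[L_indptr[j]:L_indptr[j + 1]]
        for p in range(A_indptr[j], A_indptr[j + 1]):
            i = A_indices[p]
            if i >= j:  # only lower-triangular entries of A land in L
                L_x[L_indptr[j] + np.searchsorted(rows, i)] = A_x[p]
    return L_x

# A = [[4,1,0],[1,3,0],[0,0,2]] in CSC; L's pattern is the full lower triangle
L_x = scatter_to_pattern(
    np.array([0, 1, 0, 1, 2]), np.array([0, 2, 4, 5]),
    np.array([4.0, 1.0, 1.0, 3.0, 2.0]),
    np.array([0, 1, 2, 1, 2, 2]), np.array([0, 3, 5, 6]),
)
```

<p>The positions where <code>L_x</code> stays zero are exactly the fill-in entries the symbolic factorisation discovered.</p>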


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>If you’re wondering about the break between sparse matrix posts, I realised this pretty much immediately and just didn’t want to deal with it!↩︎</p></li>
<li id="fn2"><p>If a person who actually knows how the JAX autodiff works happens across this blog, I’m so sorry.↩︎</p></li>
<li id="fn3"><p>omg you guys. So many details↩︎</p></li>
<li id="fn4"><p>These are referred to as HLOs (Higher-level operations)↩︎</p></li>
<li id="fn5"><p>Instead of doing one pass of reverse-mode, you would need to do <img src="https://latex.codecogs.com/png.latex?d"> passes of forwards mode to get the gradient with respect to a d-dimensional parameter.↩︎</p></li>
<li id="fn6"><p>Unlike <code>jax.lax.while</code>, which is only forwards differentiable, <code>jax.lax.scan</code> is fully differentiable.↩︎</p></li>
<li id="fn7"><p>In general, if the function has state.↩︎</p></li>
<li id="fn8"><p>This is the version of the symbolic factorisation that is most appropriate for us, as we will be doing a lot of Cholesky factorisations with the same sparsity structure. If we rearrange the algorithm to the up-looking Cholesky decomposition, we only need the column counts and this is also called the symbolic factorisation. This is, incidentally, how Eigen’s sparse Cholesky works.↩︎</p></li>
<li id="fn9"><p>Actually it’s a forest↩︎</p></li>
<li id="fn10"><p>Because we are talking about a tree, each child node has at most one parent. If it doesn’t have a parent it’s the root of the tree. I remember a lecturer saying that it should be called “father and son” or “mother and daughter” because every child has 2 parents but only one mother or one father. The 2000s were a wild time.↩︎</p></li>
<li id="fn11"><p>These can also be computed in approximately <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Ctext%7Bnnz%7D(A))"> time, which is much faster. But the algorithm is, frankly, pretty tricky and I’m not in the mood to program it up. This difference would be quite important if I wasn’t storing the full symbolic factorisation and was instead computing it every time, but in my context it is less clear that this is worth the effort.↩︎</p></li>
<li id="fn12"><p>Python notation! This is rows/cols 0 to <code>j-1</code>↩︎</p></li>
<li id="fn13"><p>Python, it turns out, does not have a <code>do while</code> construct because, apparently, everything is empty and life is meaningless.↩︎</p></li>
<li id="fn14"><p>The argument for JIT works by amortizing the compile time over several function evaluations. If I wanted to speed this algorithm up, I’d implement the more complex <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Coperatorname%7Bnnz%7D(A))"> version.↩︎</p></li>
<li id="fn15"><p>Obviously it did not work the first time. A good way to debug JIT’d code is to use the python translations of the control flow literals. Why? Well for one thing there is an annoying tendency for JAX to fail silently when there is an out-of-bounds indexing error. Which happens, just for example, if you replace <code>node = A_indices[indptr]</code> with <code>node = A_indices[A_indptr[indptr]]</code> because you got a text message halfway through the line.↩︎</p></li>
<li id="fn16"><p>We will still use the left-looking algorithm for the numerical computation. The two algorithms are equivalent in exact arithmetic and, in particular, have identical sparsity structures.↩︎</p></li>
<li id="fn17"><p>I’m mixing 1-based indexing in the maths with 0-based in the code because I think we need more chaos in our lives.↩︎</p></li>
<li id="fn18"><p>Yes. I know. I’m swapping the meaning of <img src="https://latex.codecogs.com/png.latex?i"> and <img src="https://latex.codecogs.com/png.latex?j"> but you know that’s because in a symmetric matrix rows and columns are a bit similar. The upper half of column <img src="https://latex.codecogs.com/png.latex?j"> is the left half of row <img src="https://latex.codecogs.com/png.latex?j"> after all.↩︎</p></li>
<li id="fn19"><p>If <code>mark[node]==j</code> then I have already found <code>node</code> and all of its ancestors in my sweep of row <code>j</code>↩︎</p></li>
<li id="fn20"><p>This is because <code>L[j,node] != 0</code> by our logic.↩︎</p></li>
<li id="fn21"><p>I mean, I’m pretty sure it is. I’m writing this post in order, so I don’t know yet. But surely the compiler can’t reason about the possible values of <code>node</code>, which would be the only thing that would speed this up.↩︎</p></li>
<li id="fn22"><p>Convert from CSR to <code>(i, j, val)</code> (called COO, which has a convenient implementation in <code>jax.experimental.sparse</code>) to CSC. This involves a linear pass, a sort, and another linear pass. So it’s <img src="https://latex.codecogs.com/png.latex?n%20%5Clog%20n">-ish. Hire me fancy tech companies. I can count. Just don’t ask me to program quicksort.↩︎</p></li>
<li id="fn23"><p>Replace “subtle” with “fairly obvious once I realised how it’s converted to a <code>lax.scan</code>, but not at all obvious to me originally”.↩︎</p></li>
<li id="fn24"><p>Who demanded a footnote.↩︎</p></li>
<li id="fn25"><p>This also happens with the profile likelihood in non-Bayesian methods.↩︎</p></li>
<li id="fn26"><p>the <img src="https://latex.codecogs.com/png.latex?%5Cbeta">s↩︎</p></li>
<li id="fn27"><p>COO stands for <em>coordinate list</em> and it’s the least space-efficient of our options. It directly stores three length-<code>nnz</code> vectors <code>(row, col, value)</code>. It’s great for specifying matrices and it’s pretty easy to convert from this format to any of the others.↩︎</p></li>
<li id="fn28"><p>other than holiday↩︎</p></li>
<li id="fn29"><p><code>A_index</code> and <code>A.index</code> are different↩︎</p></li>
<li id="fn30"><p>I’m probably going to bind Eigen’s AMD decomposition. I’m certainly not writing it myself.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Sparse Matrices Part 7a: {Another} Shot at {JAX-ing} the
    {Cholesky} Decomposition},
  date = {2022-12-02},
  url = {https://dansblog.netlify.app/posts/2022-11-27-sparse7/sparse7.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Sparse Matrices Part 7a: Another Shot at
JAX-Ing the Cholesky Decomposition.”</span> December 2, 2022. <a href="https://dansblog.netlify.app/posts/2022-11-27-sparse7/sparse7.html">https://dansblog.netlify.app/posts/2022-11-27-sparse7/sparse7.html</a>.
</div></div></section></div> ]]></description>
  <category>JAX</category>
  <category>Sparse matrices</category>
  <category>Autodiff</category>
  <guid>https://dansblog.netlify.app/posts/2022-11-27-sparse7/sparse7.html</guid>
  <pubDate>Thu, 01 Dec 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-11-27-sparse7/south_pac.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>MCMC with the wrong acceptance probability</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html</link>
  <description><![CDATA[ 





<p>Just the other day<sup>1</sup> I was chatting with a friend<sup>2</sup> about MCMC and he asked me a fundamental, but seldom asked, question: <em>What happens if my acceptance probability is a bit off?</em></p>
<p>This question comes up a bunch. In this context, they were switching from double to single precision<sup>3</sup> and were a little worried that some of their operations would be a bit more inexact than they were used to. Would this tank MCMC? Would everything still be fine?</p>
<section id="what-is-markov-chain-monte-carlo" class="level2">
<h2 class="anchored" data-anchor-id="what-is-markov-chain-monte-carlo">What is Markov chain Monte Carlo</h2>
<p>Markov chain Monte Carlo (MCMC) is, usually, guess-and-check for people who want to be fancy.</p>
<p>It is a class of algorithms that allow you to construct a<sup>4</sup> Markov chain that has a given <em>stationary distribution</em><sup>5</sup> <img src="https://latex.codecogs.com/png.latex?%5Cpi">. In Bayesian applications, we usually want to choose <img src="https://latex.codecogs.com/png.latex?%5Cpi%20=%20p(%5Ctheta%20%5Cmid%20y)">, but there are other applications of MCMC.</p>
<p>Most<sup>6</sup> MCMC algorithms live in the Metropolis-Hastings family of algorithms. These methods require only one component: a proposal distribution <img src="https://latex.codecogs.com/png.latex?q(%5Ctheta'%20%5Cmid%20%5Ctheta)">. Given basically any<sup>7</sup> proposal distribution, we can go from our current state <img src="https://latex.codecogs.com/png.latex?%5Ctheta_k"> to the new state <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7Bk+1%7D"> using the following three steps:</p>
<ol type="1">
<li><p>Propose a potential new state <img src="https://latex.codecogs.com/png.latex?%5Ctheta'%20%5Csim%20q(%5Ctheta'%20%5Cmid%20%5Ctheta_k)"></p></li>
<li><p>Sample a Bernoulli random variable <img src="https://latex.codecogs.com/png.latex?r_%7Bk+1%7D"> with <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(r_%7Bk+1%7D%20=%201%20%5Cmid%20%5Ctheta_k)%20=%20%5Calpha_%7Bk+1%7D%20=%20%20%5Cmin%5Cleft%5C%7B1,%20%5Cfrac%7B%5Cpi(%5Ctheta')%7D%7B%5Cpi(%5Ctheta_k)%7D%5Cfrac%7Bq(%5Ctheta_k%20%5Cmid%20%5Ctheta')%7D%7Bq(%5Ctheta'%20%5Cmid%20%5Ctheta_k)%7D%5Cright%5C%7D%0A"></p></li>
<li><p>Set <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7Bk+1%7D"> according to the formula <img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta_%7Bk+1%7D%20=%20%5Cbegin%7Bcases%7D%20%5Ctheta',%20&amp;%20r_%7Bk+1%7D=1%20%5C%5C%20%5Ctheta_k,%20&amp;r_%7Bk+1%7D%20=%200.%5Cend%7Bcases%7D%0A"></p></li>
</ol>
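<p>In code, those three steps look something like the following (a minimal numpy sketch with a symmetric Gaussian random-walk proposal, so the proposal ratio cancels; the function name and defaults are mine):</p>

```python
import numpy as np

def metropolis_hastings(log_pi, theta0, n_steps, step_size=1.0, seed=None):
    """Random-walk Metropolis: the Gaussian proposal is symmetric, so the
    q-ratio in the acceptance probability is 1 and alpha depends only on
    pi(theta') / pi(theta_k), computed on the log scale for stability."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_steps, theta.size))
    for k in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(theta.size)  # step 1
        log_alpha = min(0.0, log_pi(proposal) - log_pi(theta))          # step 2
        if np.log(rng.uniform()) < log_alpha:                           # step 3
            theta = proposal
        samples[k] = theta
    return samples

# Target a standard normal: log pi(t) = -||t||^2 / 2 (up to a constant)
samples = metropolis_hastings(lambda t: -0.5 * float(t @ t), np.zeros(1), 5000, seed=0)
```

<p>Rejected proposals still record the current state, which is what makes the stationary distribution come out right.</p>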
<p>The acceptance probability<sup>8</sup> is chosen<sup>9</sup> to balance<sup>10</sup> out the proposal <img src="https://latex.codecogs.com/png.latex?q(%5Ccdot%20%5Cmid%20%5Ccdot)"> with the target distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p>
<p>You can interpret the two ratios in the acceptance probability separately. The first one prefers proposals from high-density regions over proposals from low-density regions. The second ratio balances this by down-weighting proposed states that were <em>easy</em> to propose from the current location. When the proposal is symmetric, ie <img src="https://latex.codecogs.com/png.latex?q(%5Ctheta'%5Cmid%20%5Ctheta)=%20q(%5Ctheta%20%5Cmid%20%5Ctheta')">, the second ratio is always 1. However, in better algorithms like MALA<sup>11</sup>, the proposal is not symmetric. If we look at the MALA proposal <img src="https://latex.codecogs.com/png.latex?%0Aq(%5Ctheta'%5Cmid%20%5Ctheta)%20%5Csim%20N%5Cleft(%5Ctheta%20+%20%5Cfrac%7B1%7D%7B2%7D%5CSigma%5Cnabla%20%5Clog%20%5Cpi(%5Ctheta),%20%5CSigma%5Cright)%0A"> it’s pretty easy to see that we are biasing our samples towards the mode of the distribution. If we did not have the second ratio in the acceptance probability we would severely under-sample the tails of the distribution.</p>
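<p>Written out, a single MALA step looks something like this (a sketch assuming the simplest case <code>Sigma = eps * I</code>; the function name is mine). Note that the correction ratio now has to be computed explicitly because the proposal is asymmetric:</p>

```python
import numpy as np

def mala_step(theta, log_pi, grad_log_pi, eps, rng):
    """One MALA update with Sigma = eps * I. The proposal mean is nudged
    uphill, so q is asymmetric and its ratio no longer cancels in alpha."""
    def log_q(to, frm):  # log density (up to a constant) of proposing `to` from `frm`
        mean = frm + 0.5 * eps * grad_log_pi(frm)
        return -0.5 * np.sum((to - mean) ** 2) / eps
    prop = (theta + 0.5 * eps * grad_log_pi(theta)
            + np.sqrt(eps) * rng.standard_normal(theta.shape))
    log_alpha = (log_pi(prop) - log_pi(theta)
                 + log_q(theta, prop) - log_q(prop, theta))
    return prop if np.log(rng.uniform()) < min(0.0, log_alpha) else theta

# Sample a standard normal: log pi(t) = -||t||^2/2, gradient = -t
rng = np.random.default_rng(1)
theta = np.zeros(1)
draws = []
for _ in range(4000):
    theta = mala_step(theta, lambda t: -0.5 * float(t @ t), lambda t: -t, 0.5, rng)
    draws.append(theta[0])
```

<p>Dropping the <code>log_q</code> difference here is exactly the tail-under-sampling failure described above.</p>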
</section>
<section id="mcmc-with-approximate-acceptance-probabilities" class="level2">
<h2 class="anchored" data-anchor-id="mcmc-with-approximate-acceptance-probabilities">MCMC with approximate acceptance probabilities</h2>
<p>With this definition in hand, it’s now possible to re-cast the question my friend asked as:</p>
<blockquote class="blockquote">
<p>What happens to my MCMC algorithm if, instead of <img src="https://latex.codecogs.com/png.latex?%5Calpha_%7Bk+1%7D"> I accidentally compute <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Calpha_%7Bk+1%7D"> and use that instead to simulate <img src="https://latex.codecogs.com/png.latex?r_%7Bk+1%7D">?</p>
</blockquote>
<p>So let’s go about answering that!</p>
</section>
<section id="a-bit-of-a-literature-review" class="level2">
<h2 class="anchored" data-anchor-id="a-bit-of-a-literature-review">A bit of a literature review</h2>
<p>Unsurprisingly, this type of question has popped up over and over again in the literature:</p>
<ul>
<li><p>This exact question was asked by Gareth Roberts and Jeff Rosenthal first<sup>12</sup> <a href="http://probability.ca/jeff/ftpdir/sens.pdf">with Peter Schwartz</a> and a second, more<sup>13</sup> <sup>14</sup> realistic, time <a href="http://probability.ca/jeff/ftpdir/gjl.pdf">with Laird Breyer</a>. They found that as long as the chain’s convergence is sufficiently nice<sup>15</sup> then the perturbed chain will converge nicely and have<sup>16</sup> a central limit theorem.</p></li>
<li><p>About 10 years ago, an absolute orgy<sup>17</sup> <sup>18</sup> of research happened around the question <em>What happens if the acceptance probability is random but unbiased?</em> These are called <em>exact approximate</em><sup>19</sup> or <em>pseudo-marginal</em> methods. They have had some success in situations<sup>20</sup> where the likelihood has a <em>parameter dependent</em> normalising constant that can’t be computed exactly, but can be estimated unbiasedly. The problem with this class of methods is that the extra noise tends to make the Markov chain perform pretty badly<sup>21</sup>. This limits their practical use to models where we really can’t do anything else<sup>22</sup>. That said, there is some interesting literature on random sub-sampling of data where it <a href="https://www.jmlr.org/papers/volume18/15-205/15-205.pdf">doesn’t really work</a> and where <a href="https://ses.library.usyd.edu.au/bitstream/handle/2123/16205/BAWP-2017-01.pdf">it does work</a>.</p></li>
<li><p>A third branch of literature is on truly approximate algorithms. These try to understand what happens if you’re just wrong with <img src="https://latex.codecogs.com/png.latex?%5Calpha_%7Bk+1%7D"> and you don’t do anything to correct it. There are a lot of papers on this, and I’m not going to do anything approaching a thorough review. I have work<sup>23</sup> <sup>24</sup> to do. So I will just list two older papers that were influential for me. The first was by <a href="https://arxiv.org/abs/1205.6857">Geoff Nicholls, Colin Fox, and Alexis Muir Watt</a>, which looks at what happens when you don’t correct your pseudo-marginal method correctly. It’s a really neat theory paper that is a great presentation<sup>25</sup> of the concepts. The second paper is by <a href="https://arxiv.org/abs/1205.6857">Pierre Alquier, Nial Friel, Richard Everitt, and Aidan Boland</a>, which looks at general approximate Markov chains. They show empirically that these methods work extremely well relative to pseudo-marginal methods for practical settings. There are also some nice results on perturbations of Markov chains in general, for instance <a href="https://arxiv.org/pdf/1503.04123.pdf">this paper</a> by Daniel Rudolf and Nikolaus Schweizer.</p></li>
</ul>
<section id="trying-to-understand-noisy-markov-chains" class="level3">
<h3 class="anchored" data-anchor-id="trying-to-understand-noisy-markov-chains">Trying to understand noisy Markov chains</h3>
<p>So how do I think about noisy Markov chains? Despite all appearances<sup>26</sup>, I am not really a theory person. So while I know that there’s a massive literature on the stability of Markov chains, it doesn’t really influence how I think about it.</p>
<p>Instead, I think about it in terms of that <a href="https://arxiv.org/abs/1205.6857">Nicholls, Fox, and Muir Watt paper</a>. Or, specifically, a talk I saw Colin give at some point that was really clear.</p>
<p>The important thing to recognise is that <em>it is not important how well you compute</em> <img src="https://latex.codecogs.com/png.latex?%5Calpha_%7Bk+1%7D">. What is important is whether you get the same outcome. Imagine we have two random variables <img src="https://latex.codecogs.com/png.latex?r_%7Bk+1%7D%20%5Csim%20%5Ctext%7BBernoulli%7D(%5Calpha_%7Bk+1%7D)"> and <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20r_%7Bk+1%7D%20%5Csim%20%5Ctext%7BBernoulli%7D(%5Ctilde%20%5Calpha_%7Bk+1%7D)">. If our realisation of <img src="https://latex.codecogs.com/png.latex?r_%7Bk+1%7D"> is the same as our realisation of <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20r_%7Bk+1%7D">, then we get the same <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7Bk+1%7D">. Or, to put it another way, when <img src="https://latex.codecogs.com/png.latex?r_%7Bk+1%7D%20=%20%5Ctilde%20r_%7Bk+1%7D">, no one can tell<sup>27</sup> that it’s an approximate Markov chain.</p>
<p>This means that one way to understand inexact MCMC is to think of the Markov chain <img src="https://latex.codecogs.com/png.latex?%0A(%5Ctilde%7B%5Ctheta%7D_k,%20s_k),%20%5Cqquad%20k=0,%201,%20%5Cldots,%20%5Cinfty,%0A"> where<sup>28</sup> <img src="https://latex.codecogs.com/png.latex?%0As_k%20=%20%5Cbegin%7Bcases%7D%200,%20%5Cquad%20&amp;%20r_%7Bk%7D%20=%20%5Ctilde%20r_k%20%5C%5C%0A1,%20&amp;r_k%20%5Cneq%20%5Ctilde%20r_k%5Cend%7Bcases%7D%0A"> indicates whether or not we made the wrong decision. It’s important to note that while <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Ctheta_k"> is marginally a Markov chain, <img src="https://latex.codecogs.com/png.latex?s_k"> is not. You can actually think of <img src="https://latex.codecogs.com/png.latex?s_k"> as the observation of a hidden Markov model if you want to. I won’t stop you. Nothing will. There is no morality, there is no law. It is The Purge.</p>
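<p>A tiny simulation makes this coupling picture concrete: drive both Bernoulli draws with the same uniform, and the two decisions disagree exactly when that uniform lands between the two acceptance probabilities (the numbers here are arbitrary, just for illustration):</p>

```python
import numpy as np

# Under a common-uniform coupling, Pr(s_k = 1) = |alpha - alpha_tilde|.
rng = np.random.default_rng(0)
alpha, alpha_tilde = 0.7, 0.68  # exact and (slightly wrong) acceptance probs
u = rng.uniform(size=200_000)
r = u < alpha              # exact decision
r_tilde = u < alpha_tilde  # perturbed decision
s = r != r_tilde           # s_k = 1: this is where the chains part ways
disagree_rate = s.mean()   # close to |0.7 - 0.68| = 0.02
```

<p>So a small error in the acceptance probability translates into a small chance, at each step, that the approximate chain diverges from the imaginary exact one.</p>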
<p>Although we can never actually observe <img src="https://latex.codecogs.com/png.latex?s_k">, thinking about it is really useful. In particular, we note that until <img src="https://latex.codecogs.com/png.latex?s_k%20=1"> for the first time, the samples of <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Ctheta_k"> are <em>identical</em> to a correct Metropolis-Hastings algorithm. After this point, the approximate chain and the (imaginary) exact chain will be different. But we can iterate this argument.</p>
<p>To do this, we can define the length <img src="https://latex.codecogs.com/png.latex?N_k"> of the Markov chain that would be the same as the exact MCMC algorithm started at <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7BN_%7Bk-1%7D%7D"> by <img src="https://latex.codecogs.com/png.latex?N_0=0"> and <img src="https://latex.codecogs.com/png.latex?%0AN_k%20=%20%5Cinf_%7Bi%20%3E%20N_%7Bk-1%7D%7D%5C%7Bi%20-%20N_%7Bk-1%7D:%20s_i%20=%201%5C%7D.%0A"></p>
<p>If we run our algorithm for <img src="https://latex.codecogs.com/png.latex?N"> steps, we can then think of the output as being the same as running <img src="https://latex.codecogs.com/png.latex?J%20=%20%5Csum_%7Bk=1%7D%5EN%20s_k"> Markov chains of different lengths. The <img src="https://latex.codecogs.com/png.latex?j">th chain starts at <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7BN_%7Bj-1%7D%7D"> and is length <img src="https://latex.codecogs.com/png.latex?N_%7Bj%7D-1">. It is worth remembering that these chains are not started from independent points. In particular, if <img src="https://latex.codecogs.com/png.latex?N_j"> is small, then the starting position of the <img src="https://latex.codecogs.com/png.latex?j">th and the <img src="https://latex.codecogs.com/png.latex?j+1">th chain will be heavily correlated.</p>
<p>To think about this, we need to understand what happens after <img src="https://latex.codecogs.com/png.latex?N_k"> steps of a Markov chain. We will write <img src="https://latex.codecogs.com/png.latex?%5Ctheta_k%20=%20P%5Ek%20%5Ctheta_0"> to denote <img src="https://latex.codecogs.com/png.latex?k"> steps of the exact algorithm.</p>
<p>The topic of convergence of Markov chains is a complex business, but we are going to assume that our exact Markov chain is<sup>29</sup> <em>geometrically ergodic</em>, which means that <img src="https://latex.codecogs.com/png.latex?%0A%5C%7CP%5Ek%20%5Ctheta_0%20-%20%5Cpi%5C%7C%20%5Cleq%20M(%5Ctheta_0)%20%5Crho%5E%7Bk%7D%0A"> for some function<sup>30</sup> <img src="https://latex.codecogs.com/png.latex?M(%5Ctheta_0)"> and <img src="https://latex.codecogs.com/png.latex?0%20%3C%20%5Crho%20%3C%201">.</p>
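<p>To make the geometric decay concrete, here is a toy illustration (mine, not from any of the papers above; the chain and all numbers are made up). For a finite-state chain the distance to stationarity decays geometrically, with rate given by the second-largest eigenvalue of the transition matrix:</p>

```python
import numpy as np

# Toy two-state chain (made-up numbers) illustrating geometric ergodicity:
# the total variation distance to stationarity decays like rho^k.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # transition matrix
pi = np.array([2 / 3, 1 / 3])         # stationary distribution: pi @ P == pi
rho = 0.7                             # second eigenvalue of P (= trace(P) - 1)

mu = np.array([1.0, 0.0])             # start deterministically in state 0
M = 0.5 * np.abs(mu - pi).sum()       # initial total variation distance
for k in range(1, 30):
    mu = mu @ P                       # marginal distribution after k steps
    tv = 0.5 * np.abs(mu - pi).sum()
    assert tv <= M * rho**k + 1e-12   # ||P^k theta_0 - pi|| <= M rho^k
```

<p>For this two-state example the bound is actually an equality, because the error vector lies along the second eigenvector of the chain. In general you only get the inequality, and <img src="https://latex.codecogs.com/png.latex?M(%5Ctheta_0)"> depends on where you start.</p>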
<p>Geometric ergodicity is a great condition because, among other things, it ensures that sample means from the Markov chain satisfy a central limit theorem. It’s also bloody impossible to prove. But usually indicators like <a href="https://arxiv.org/abs/1903.08008">R-hat</a> do a decent job at suggesting that there might be problems. Also if you are spending a lot of time rejecting proposals in certain parts of the space, there’s a solid chance that you’re not geometrically ergodic.</p>
<p>Now let’s assume that we are interested in computing <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%5Cpi(h(%5Ctheta))"> for some nice<sup>31</sup> function <img src="https://latex.codecogs.com/png.latex?h">. Then the nice thing about Markov chains is that, give or take<sup>32</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%7C%5Cfrac%7B1%7D%7BN_j-N_%7Bj-1%7D%7D%5Csum_%7Bk=N_%7Bj-1%7D%7D%5E%7BN_j-1%7Dh(%5Ctheta_k)%20-%20%5Cmathbb%7BE%7D_%5Cpi(h(%5Ctheta))%5Cright%7C%20%5Cleq%20C%20%5Cfrac%7BM(%5Ctheta_%7BN_%7Bj-1%7D%7D)%7D%7BN_j-N_%7Bj-1%7D%7D%5Cfrac%7B1%20-%20%5Crho%5E%7BN_%7Bj%7D-N_%7Bj-1%7D%7D%7D%7B1-%20%5Crho%7D.%0A"> where <img src="https://latex.codecogs.com/png.latex?C"> might depend on <img src="https://latex.codecogs.com/png.latex?h"> if <img src="https://latex.codecogs.com/png.latex?h"> is unbounded.</p>
<p>This suggests that the error is bounded by, roughly, <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%7C%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bk=1%7D%5E%7BN%7Dh(%5Ctheta_k)%20-%20%5Cmathbb%7BE%7D_%5Cpi(h(%5Ctheta))%5Cright%7C%20%5Cleq%20%5Cfrac%7BC%7D%7BN%7D%20%5Csum_%7Bj%20=%201%7D%5EJ%20M(%5Ctheta_%7BN_%7Bj-1%7D%7D)%5Cfrac%7B1%20-%20%5Crho%5E%7BN_%7Bj%7D-N_%7Bj-1%7D%7D%7D%7B1-%20%5Crho%7D.%0A"></p>
<p>This suggests a few things:</p>
<ul>
<li><p>If <img src="https://latex.codecogs.com/png.latex?J"> is small relative to <img src="https://latex.codecogs.com/png.latex?N">, we are going to get <em>very</em> similar estimates to just running <img src="https://latex.codecogs.com/png.latex?J"> parallel Markov chains and combining them <em>without removing any warm up iterations</em>. In particular, if almost all <img src="https://latex.codecogs.com/png.latex?N_j"> are big, it will be <em>a lot</em> like combining <img src="https://latex.codecogs.com/png.latex?J"> warmed up <em>independent</em> chains.</p></li>
<li><p>Effective sample size and Monte Carlo standard error estimates will potentially be very wrong. This is because instead of computing them based on multiple dependent chains, we are pretending that all of our samples came from a single ergodic Markov chain. Is this a problem? I really don’t know. Again, if the <img src="https://latex.codecogs.com/png.latex?N_j">s are usually large, we will be fine.</p></li>
<li><p>Because <img src="https://latex.codecogs.com/png.latex?M(%5Ctheta)"> can be pretty large when <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is large, we might have some problems. It’s easy to imagine cases where we get stuck out in a tail and we just fire off a lot of events when <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7BN_j%7D"> is really big. This will be a problem. But also, if we are stuck out in a tail, we are rightly fucked anyway and all of the MCMC diagnostics should be screaming at you. We can take heart that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%5Cpi(M(%5Ctheta))"> is usually finite<sup>33</sup> and not, you know, massive.</p></li>
</ul>
</section>
<section id="what-do-the-n_j-look-like" class="level3">
<h3 class="anchored" data-anchor-id="what-do-the-n_j-look-like">What do the <img src="https://latex.codecogs.com/png.latex?N_j"> look like?</h3>
<p>So the takeaway from the last section was that if the random variables <img src="https://latex.codecogs.com/png.latex?N_j"> are usually pretty big, then everything will work ok. Intuitively this makes sense. If the <img src="https://latex.codecogs.com/png.latex?N_j">s were always small, it would be very difficult to ever get close to any sort of stationary distribution.</p>
<p>The paper by <a href="https://arxiv.org/abs/1205.6857">Nicholls, Fox, and Muir Watt</a> talks about potential sizes for <img src="https://latex.codecogs.com/png.latex?N_j">. The general construction that they use is a <em>coupling</em>: a bivariate Markov chain <img src="https://latex.codecogs.com/png.latex?(%5Ctheta_k,%20%5Ctilde%20%5Ctheta_k)"> whose components start from the same position and are updated as follows:</p>
<ol type="1">
<li>Propose <img src="https://latex.codecogs.com/png.latex?%5Ctheta'%20%5Csim%20q(%5Ctheta'%20%5Cmid%20%5Ctilde%20%5Ctheta_%7Bk%7D)"></li>
<li>Generate a uniform random number <img src="https://latex.codecogs.com/png.latex?u_%7Bk+1%7D"></li>
<li>Update <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> as <img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta_%7Bk+1%7D%20=%20%5Cbegin%7Bcases%7D%20%5Ctheta',%20%5Cqquad%20&amp;%20u_%7Bk+1%7D%20%5Cleq%20%5Calpha_%7Bk+1%7D%20%5C%5C%0A%5Ctheta_%7Bk%7D,%20&amp;%20u_%7Bk+1%7D%20%3E%20%5Calpha_%7Bk+1%7D.%5Cend%7Bcases%7D%0A"></li>
<li>Update <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Ctheta"> as <img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%20%5Ctheta_%7Bk+1%7D%20=%20%5Cbegin%7Bcases%7D%20%5Ctheta',%20%5Cqquad%20&amp;%20u_%7Bk+1%7D%20%5Cleq%20%5Ctilde%20%5Calpha_%7Bk+1%7D%20%5C%5C%0A%5Ctilde%20%5Ctheta_%7Bk%7D,%20&amp;%20u_%7Bk+1%7D%20%3E%20%5Ctilde%20%5Calpha_%7Bk+1%7D.%5Cend%7Bcases%7D%0A"></li>
</ol>
<p>This Markov chain is coupled in three ways. The chains start at the same value <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0%20=%20%5Ctilde%20%5Ctheta_0">, the proposed <img src="https://latex.codecogs.com/png.latex?%5Ctheta'"> is the same for both chains, and the randomness<sup>34</sup> used to do the accept/reject step is the same. Together, these things mean that <img src="https://latex.codecogs.com/png.latex?%5Ctheta_k%20=%20%5Ctilde%20%5Ctheta_k"> for all <img src="https://latex.codecogs.com/png.latex?k%20%3C%20N_1">.</p>
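<p>This coupling is easy to simulate. Here is a sketch (my own toy setup, not from the paper): a random walk Metropolis chain targeting a standard normal, where the approximate chain inflates the acceptance probability by a fictitious relative error <code>delta</code>. The two chains share proposals and uniforms, so they agree until the uniform first lands between the two acceptance probabilities:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(theta):
    return -0.5 * theta**2           # toy target: a standard normal

def first_decoupling_time(n_steps=10_000, delta=0.01, sigma=1.0):
    """Couple an exact random walk Metropolis chain with an approximate one
    whose acceptance probability is inflated by a made-up relative error
    delta (a stand-in for, e.g., floating point error). Both chains share
    the proposal and the uniform, so they stay identical until the uniform
    lands between the two acceptance probabilities. Returns N_1."""
    theta = theta_tilde = 0.0
    for k in range(1, n_steps + 1):
        prop = theta_tilde + sigma * rng.normal()  # theta' ~ q(. | theta_tilde)
        u = rng.uniform()                          # shared randomness
        alpha = min(1.0, np.exp(log_pi(prop) - log_pi(theta_tilde)))
        alpha_tilde = min(1.0, (1.0 + delta) * alpha)
        if u <= alpha:
            theta = prop
        if u <= alpha_tilde:
            theta_tilde = prop
        if theta != theta_tilde:                   # s_k = 1: chains decouple
            return k
    return np.inf                                  # never decoupled

times = [first_decoupling_time() for _ in range(200)]
```

<p>In this toy setup a 1% relative error typically keeps the chains together for a few hundred steps; cranking <code>delta</code> up shortens the coupled stretch roughly in proportion.</p>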
<p>For this coupling construction, we can get the exact distribution of the <img src="https://latex.codecogs.com/png.latex?s_k">. To do this, we remember that the two chains will only make different decisions (or uncouple) if <img src="https://latex.codecogs.com/png.latex?u"> lands between the two acceptance probabilities. The probability of this happening is <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5CPr(s_k%20=%201)%20&amp;=%20%5CPr(%20u%20%5Cin%20%5B%5Cmin%5C%7B%20%5Calpha_%7Bk%7D,%20%5Ctilde%20%5Calpha_k%5C%7D,%20%5Cmax%5C%7B%20%5Calpha_%7Bk%7D,%20%5Ctilde%20%5Calpha_k%5C%7D%5D)%20%5C%5C%0A&amp;=%20%7C%5Calpha_k%20-%20%5Ctilde%20%5Calpha_k%7C.%0A%5Cend%7Balign*%7D"></p>
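<p>That probability is easy to sanity-check by Monte Carlo (the acceptance probabilities below are made up):</p>

```python
import numpy as np

# Monte Carlo check that a shared uniform u lands between two acceptance
# probabilities with probability |alpha - alpha_tilde|.
rng = np.random.default_rng(1)
alpha, alpha_tilde = 0.6, 0.55       # made-up acceptance probabilities
u = rng.uniform(size=1_000_000)
p_hat = ((u > min(alpha, alpha_tilde)) & (u <= max(alpha, alpha_tilde))).mean()
print(p_hat)                         # close to |0.6 - 0.55| = 0.05
```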
<p>I guess you could write down the distribution of the <img src="https://latex.codecogs.com/png.latex?N_j"> in terms of this. In particular, you get <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(N_1%20=%20n)%20=%20%7C%5Calpha_n%20-%20%5Ctilde%20%5Calpha_n%7C%5Cprod_%7Bk=1%7D%5E%7Bn-1%7D%20(1-%20%7C%5Calpha_k%20-%20%5Ctilde%20%5Calpha_k%7C)%0A">, but honestly it would be an absolute nightmare.</p>
<p>When people get stuck in probability questions, the natural thing to do is to make the problem so abstract that you can make the answer up. In that spirit, let’s ask a slightly different question: what is the distribution of the <em>maximal</em> decoupling time between the exact and the approximate chain? This is the distribution of the longest possible coupling of the two chains over all<sup>35</sup> possible random sequences <img src="https://latex.codecogs.com/png.latex?(%5Ctheta_k,%20%5Ctilde%20%5Ctheta_k)"> such that the distribution of <img src="https://latex.codecogs.com/png.latex?(%5Ctheta_1,%20%5Ctheta_2,%20%5Cldots)"> is the same as our exact Markov chain and the distribution of <img src="https://latex.codecogs.com/png.latex?(%5Ctilde%5Ctheta_1,%5Ctilde%20%5Ctheta_2,%20%5Cldots)"> is the same as our approximate Markov chain.</p>
<p>This maximal value of <img src="https://latex.codecogs.com/png.latex?N_1"> is called the <a href="https://arxiv.org/abs/1608.01511"><em>maximal agreement coupling time</em></a> or, more whimsically, the <a href="https://arxiv.org/pdf/1702.03917.pdf">MEXIT time</a>. It turns out that getting the distribution of <img src="https://latex.codecogs.com/png.latex?N_1"> is … difficult, but we<sup>36</sup> can construct a random variable <img src="https://latex.codecogs.com/png.latex?%5Ctau"> that is independent of <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Ctheta_k"> such that <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%5Cleq%20N_1"> almost surely and <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Ctau%20=%20t%5Cmid%20%5Ctau%20%5Cgeq%20t)%20=%201%20-%20%5Coperatorname*%7Bess%5C,inf%7D_%7BB,%20%5Ctheta_%7B%3Ct%7D%7D%20%5Cleft%5C%7B%5Cfrac%7BP(%5Ctheta_t%20%5Cin%20B%20%5Cmid%20%5Ctheta_%7B%3Ct%7D)%7D%7B%5Ctilde%20P(%5Ctheta_t%20%5Cin%20B%20%5Cmid%20%5Ctheta_%7B%3Ct%7D)%7D%5Cright%5C%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?P(%5Ctheta_t%20%5Cmid%20%5Ctheta_%7B%3Ct%7D)"> is the transition distribution for the exact Markov<sup>37</sup> chain and <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20P(%5Ctheta_t%20%5Cmid%20%5Ctheta_%7B%3Ct%7D)"> is the transition distribution for the approximate Markov chain.</p>
<p>For a Metropolis-Hastings algorithm, the transition distribution has the form <img src="https://latex.codecogs.com/png.latex?%0AP(B,%20%5Ctheta)=%20%5Cbegin%7Bcases%7D%20%5Calpha(%5Ctheta)Q(B%20%5Cmid%20%5Ctheta),%5Cqquad%20&amp;%20%5Ctheta%20%5Cnot%20%5Cin%20B%20%5C%5C%0A%5Calpha(%5Ctheta)Q(B%5Cmid%20%5Ctheta)%20+%20(1-%5Calpha(%5Ctheta)),%20&amp;%5Ctheta%20%5Cin%20B%0A%5Cend%7Bcases%7D%0A"> where <img src="https://latex.codecogs.com/png.latex?Q(B%5Cmid%20%5Ctheta)"> is the probability associated with the proposal density <img src="https://latex.codecogs.com/png.latex?q(%5Ccdot%20%5Cmid%20%5Ctheta)"> and I have been very explicit about the dependence of the acceptance probability on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. (The <img src="https://latex.codecogs.com/png.latex?(1-%5Calpha(%5Ctheta))"> term takes into account the probability of starting at <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and not accepting the proposed state.)</p>
<p>That definition of <img src="https://latex.codecogs.com/png.latex?%5Ctau"> looks pretty nasty, but it’s not too bad: in particular, the infimum only cares about whether <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7Bt-1%7D%5Cin%20B">. This means that the condition simplifies to <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Ctau%20=%20t%5Cmid%20%5Ctau%20%5Cgeq%20t)%20=%201%20-%20%5Cmin%5Cleft%5C%7B%5Coperatorname*%7Bess%5C,inf%7D_%7BB,%20%5Ctheta_%7Bt-1%7D%7D%20%5Cfrac%7B%5Calpha_t(%5Ctheta_%7Bt-1%7D)%20Q(B%20%5Cmid%20%5Ctheta_%7Bt-1%7D)%7D%7B%5Ctilde%5Calpha_t(%5Ctheta_%7Bt-1%7D)%20Q(B%20%5Cmid%20%5Ctheta_%7Bt-1%7D)%7D,%20%5Coperatorname*%7Bess%5C,inf%7D_%7BB,%20%5Ctheta_%7Bt-1%7D%7D%20%5Cfrac%7B%5Calpha_t(%5Ctheta_%7Bt-1%7D)%20Q(B%20%5Cmid%20%5Ctheta_%7Bt-1%7D)%20+%20(1-%5Calpha_t(%5Ctheta_%7Bt-1%7D))%7D%7B%5Ctilde%5Calpha_t(%5Ctheta_%7Bt-1%7D)%20Q(B%20%5Cmid%20%5Ctheta_%7Bt-1%7D)%20+%20(1-%20%5Ctilde%20%5Calpha_t(%5Ctheta_%7Bt-1%7D))%7D%5Cright%5C%7D.%0A"></p>
<p>This simplifies further if we assume that the proposal distribution <img src="https://latex.codecogs.com/png.latex?Q(%5Ccdot%20%5Cmid%20%5Ctheta_k)"> is absolutely continuous and has a strictly positive density. Then, it truly does not matter what <img src="https://latex.codecogs.com/png.latex?B"> is. For the first term, it just cancels, while the second term is monotone<sup>38</sup> in <img src="https://latex.codecogs.com/png.latex?Q(B%20%5Cmid%20%5Ctheta_%7Bt-1%7D)">, so we can take this term to be either zero or one and get<sup>39</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Ctau%20=%20t%5Cmid%20%5Ctau%20%5Cgeq%20t)%20=%201%20-%20%5Cmin%5Cleft%5C%7B%5Coperatorname*%7Bess%5C,inf%7D_%7B%20%5Ctheta_%7Bt-1%7D%7D%20%5Cfrac%7B%5Calpha_t(%5Ctheta_%7Bt-1%7D)%20%7D%7B%5Ctilde%5Calpha_t(%5Ctheta_%7Bt-1%7D)%7D,%20%5Coperatorname*%7Bess%5C,inf%7D_%7B%5Ctheta_%7Bt-1%7D%7D%20%5Cfrac%7B1-%5Calpha_t(%5Ctheta_%7Bt-1%7D)%7D%7B%201-%20%5Ctilde%20%5Calpha_t(%5Ctheta_%7Bt-1%7D)%7D,1%5Cright%5C%7D.%0A"></p>
<p>This is, as the Greeks would say, not too bad.</p>
<p>If, for instance, we know the relative error <img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%5Calpha(%5Ctheta)%20=%20(1%20+%20%5Cdelta(%5Ctheta))%5Calpha(%5Ctheta),%0A"> then <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Calpha(%5Ctheta)%7D%7B%5Ctilde%20%5Calpha(%5Ctheta)%7D%20=%20%5Cfrac%7B1%7D%7B1%20+%20%5Cdelta(%5Ctheta)%7D,%0A"> and if we know<sup>40</sup> <img src="https://latex.codecogs.com/png.latex?%5Cdelta(%5Ctheta)%20%5Cleq%20%5Cbar%20%5Cdelta">, we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Calpha(%5Ctheta)%7D%7B%5Ctilde%20%5Calpha(%5Ctheta)%7D%20%5Cgeq%20%5Cfrac%7B1%7D%7B1%20+%20%5Cbar%5Cdelta%7D.%0A"> Similarly, if <img src="https://latex.codecogs.com/png.latex?%0A1-%5Ctilde%20%5Calpha(%5Ctheta)%20=%20(1-%5Calpha(%5Ctheta))(1+%5Cepsilon(%5Ctheta)),%0A"> and <img src="https://latex.codecogs.com/png.latex?%5Cepsilon(%5Ctheta)%20%5Cleq%20%5Cbar%20%5Cepsilon">, then we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1-%5Calpha(%5Ctheta)%7D%7B1-%5Ctilde%20%5Calpha(%5Ctheta)%7D%20=%20%5Cfrac%7B1%7D%7B1+%5Cepsilon(%5Ctheta)%7D%20%5Cgeq%20%5Cfrac%7B1%7D%7B1+%5Cbar%5Cepsilon%7D.%0A"></p>
<p>The nice thing is that we can choose our upper bounds so that <img src="https://latex.codecogs.com/png.latex?%5Crho%20=%20(1+%20%5Cbar%20%5Cdelta)%5E%7B-1%7D%20=%20(1+%20%5Cbar%5Cepsilon)%5E%7B-1%7D"> and get the upper bound <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Ctau%20=%20t%5Cmid%20%5Ctau%20%5Cgeq%20t)%20%5Cleq%201%20-%20%5Crho.%0A"> It follows that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Ctau%20=%20t)%20%5Cleq%20%5Crho%5E%7Bt-1%7D(1-%5Crho).%0A"></p>
<p>Now this is a bit nasty. It’s an upper bound on the probability of a lower bound on the maximal decoupling time. Probability, eh.</p>
<p>Probably the most useful thing we can get from this is an upper bound on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(%5Ctau)">, which is<sup>41</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(%5Ctau)%20%5Cleq%20%5Cfrac%7B1%7D%7B1-%5Crho%7D%20=%201%20+%20%5Cbar%20%5Cdelta%5E%7B-1%7D.%0A"></p>
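<p>For concreteness, here is a quick numerical check (mine, with a made-up error bound): if decoupling happens with probability <img src="https://latex.codecogs.com/png.latex?1-%5Crho"> at each step, then <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is Geometric, and its mean is <img src="https://latex.codecogs.com/png.latex?1%20+%20%5Cbar%20%5Cdelta%5E%7B-1%7D">:</p>

```python
import numpy as np

# Check of the Geometric mean: with per-step decoupling probability 1 - rho,
# rho = 1 / (1 + delta_bar), the mean of tau is 1/(1 - rho) = 1 + 1/delta_bar.
rng = np.random.default_rng(2)
delta_bar = 0.05                     # made-up bound on the relative error
rho = 1.0 / (1.0 + delta_bar)
tau = rng.geometric(p=1.0 - rho, size=1_000_000)
print(tau.mean())                    # close to 1 + 1/0.05 = 21
```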
<p>This confirms our intuition that if the relative error is large, we will have, on average, quite small <img src="https://latex.codecogs.com/png.latex?N_j">. It’s not quite enough to show the opposite (small floating point error begets big <img src="https://latex.codecogs.com/png.latex?N_j">), but that’s probably true as well.</p>
<p>And that is where we end this saga. There is definitely more that could be said, but I decided to spend exactly one day writing this post and that time is now over.</p>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Usually this is a lie, but it was actually a thing that happened last week↩︎</p></li>
<li id="fn2"><p>Don’t judge me (or my friends) based on this. I promise we also talk about other shit.↩︎</p></li>
<li id="fn3"><p>Hi GPUs!↩︎</p></li>
<li id="fn4"><p>usually reversible, although a lot of cool but not ready for prime time work is being done on non-reversible chains.↩︎</p></li>
<li id="fn5"><p>A stationary distribution, if it exists, is the distribution that is preserved by the Markov chain. If <img src="https://latex.codecogs.com/png.latex?%5Cpi"> is the stationary distribution and <img src="https://latex.codecogs.com/png.latex?x_1%20%5Csim%20%5Cpi">, then if we construct <img src="https://latex.codecogs.com/png.latex?x_2,%20x_3,%5Cldots"> by running the Markov chain then for every <img src="https://latex.codecogs.com/png.latex?k">, the marginal distribution is <img src="https://latex.codecogs.com/png.latex?x_k%20%5Csim%20%5Cpi">.↩︎</p></li>
<li id="fn6"><p>But critically not all! The dynamic HMC algorithm used in Stan, for instance, is not a Metropolis-Hastings algorithm. Instead of doing an accept/reject step it samples from the proposed trajectory. Betancourt’s <a href="https://arxiv.org/abs/1701.02434">long intro to Hamiltonian Monte Carlo</a> covers this very well.↩︎</p></li>
<li id="fn7"><p>The conditions for this to work are <em>very</em> light. But that’s because the definition of “working” only thinks about what happens after infinitely many steps. To get a practically useful Metropolis-Hastings algorithm, you’ve got to work very hard on choosing your proposal density.↩︎</p></li>
<li id="fn8"><p>sometimes called the Hastings correction↩︎</p></li>
<li id="fn9"><p>This is not the only choice that will work, but in some sense it is the most efficient one.↩︎</p></li>
<li id="fn10"><p>Technically, it is chosen by requiring that the Markov proposal <img src="https://latex.codecogs.com/png.latex?P(%5Ctheta,%5Ctheta')"> satisfies the detailed balance condition <img src="https://latex.codecogs.com/png.latex?%5Cpi(%5Ctheta)%20P(%5Ctheta,%5Ctheta')%20=%20%5Cpi(%5Ctheta')%20P(%5Ctheta',%20%5Ctheta)">, but everything about that equation is beyond the scope of this particular post.↩︎</p></li>
<li id="fn11"><p>Metropolis-adjusted Langevin Algorithm↩︎</p></li>
<li id="fn12"><p>Under the assumption that the total floating point error was bounded by a constant <img src="https://latex.codecogs.com/png.latex?%5Cdelta">↩︎</p></li>
<li id="fn13"><p>This time the assumption was that the rounding error for the acceptance probability at state <img src="https://latex.codecogs.com/png.latex?%5Ctheta_k"> was bounded by <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5C%7C%5Ctheta_k%5C%7C">. This is a lot closer to how floating point arithmetic actually works. The trade off is that it requires a tighter condition on the drift function <img src="https://latex.codecogs.com/png.latex?V">.↩︎</p></li>
<li id="fn14"><p>IEEE floating point arithmetic represents a real number using <img src="https://latex.codecogs.com/png.latex?B"> bits. Typically <img src="https://latex.codecogs.com/png.latex?B%20=%2064"> (double precision) or <img src="https://latex.codecogs.com/png.latex?B%20=%2032"> (single precision). You can read a great intro to this on <a href="https://nhigham.com/2020/05/04/what-is-floating-point-arithmetic/">Nick Higham’s blog</a>. But in general, the <em>best</em> we can do is represent a real number <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> by a floating point number <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Ctheta"> that satisfies <img src="https://latex.codecogs.com/png.latex?%0A%7C%5Ctheta%20-%20%5Ctilde%20%5Ctheta%7C%20%5Cleq%202%5E%7B-N+1%7D%7C%5Ctheta%7C,%0A"> where <img src="https://latex.codecogs.com/png.latex?N=24"> in single precision and <img src="https://latex.codecogs.com/png.latex?N=53"> in double precision (the number of significand bits). Of course, the acceptance probability is a non-linear combination of floating point numbers, so the actual error is going to be more complicated than that. I strongly recommend you read <a href="http://www.maths.manchester.ac.uk/~higham/asna/index.php">Nick Higham’s book</a> on the subject.↩︎</p></li>
<li id="fn15"><p><img src="https://latex.codecogs.com/png.latex?V">-geometrically ergodic with some light conditions on <img src="https://latex.codecogs.com/png.latex?V">↩︎</p></li>
<li id="fn16"><p>Geometric ergodicity implies the existence of a CLT! Which is nice, because all of our intuition about how to use the output from MCMC depends on a CLT.↩︎</p></li>
<li id="fn17"><p>Like all good orgies, this one was mostly populated by men↩︎</p></li>
<li id="fn18"><p>Yes, I know. My (limited) contribution to this literature was some small contributions to a paper <a href="https://www.jstor.org/stable/24780815">led by Anne-Marie Lyne</a>. But if years of compulsory catholicism taught me anything (other than “If you’re drinking with a nun or an aging homosexual, don’t try to keep up”) it’s that something does not have to be literally true to be morally true.↩︎</p></li>
<li id="fn19"><p>We have to slightly redefine the word “exact” to mean “targets the correct stationary distribution” for this name to make sense↩︎</p></li>
<li id="fn20"><p>Random graph models and point processes are two great examples↩︎</p></li>
<li id="fn21"><p>for instance, it gets stuck for long times at single values↩︎</p></li>
<li id="fn22"><p>the aforementioned point process and graph models↩︎</p></li>
<li id="fn23"><p>Playing God of War: Ragnarok↩︎</p></li>
<li id="fn24"><p>The first run of God of War Games were not my cup of tea, but the 2008 game, which is essentially a detailed simulation of what happens when a muscle bear is entrusted with walking an 11 year old up a hill, was really enjoyable. So far this is too.↩︎</p></li>
<li id="fn25"><p>Does it talk about involutions for no fucking reason? Of course it does. Read past that.↩︎</p></li>
<li id="fn26"><p>Yeah, like I have also read my blog. Think of it as being like social media. It is not a representation of me as a whole person. It’s actually biased towards stuff that I have either found or find difficult.↩︎</p></li>
<li id="fn27"><p>A friend of mine has a “No one knows I’m a transexual” t-shirt that she likes to wear to supermarkets.↩︎</p></li>
<li id="fn28"><p>Note that both <img src="https://latex.codecogs.com/png.latex?r_k"> and <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20r_k"> are computed using the <em>same</em> value <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Ctheta_%7Bk-1%7D">.↩︎</p></li>
<li id="fn29"><p>The norm here is usually either the total variation norm or the <img src="https://latex.codecogs.com/png.latex?V">-norm. But truly it’s not important for the hand waving.↩︎</p></li>
<li id="fn30"><p>In most cases <img src="https://latex.codecogs.com/png.latex?M(%5Ctheta)%20%5Crightarrow%20%5Cinfty"> as <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Ctheta%5C%7C%20%5Crightarrow%20%5Cinfty">.↩︎</p></li>
<li id="fn31"><p>Bounded and continuous always works. But everything is probably ok for unbounded functions as long as <img src="https://latex.codecogs.com/png.latex?h(%5Ctheta)"> has a pile of finite moments.↩︎</p></li>
<li id="fn32"><p>This is roughly true. I basically used the geometric ergodicity bound to bound <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bk=N_%7Bj-1%7D%7D%5E%7BN_j-1%7D%20%5Cleft(h(%5Ctheta_k)%20-%20%5Cmathbb%7BE%7D_%5Cpi(h(%5Ctheta))%5Cright)%0A"> and summed it up. There are smarter things to do, but it’s close enough for government work. ↩︎</p></li>
<li id="fn33"><p>Sometimes, if you squint, this term will kinda, sorta start to look like <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%5Cpi(%5Cpi(%5Ctheta)%5E%7B-1/2%7D)">, which isn’t usually toooo big. But also, sometimes it looks totally different. Theory is wild.↩︎</p></li>
<li id="fn34"><p>If you’ve ever wondered how <code>rbinom(1,p)</code> works, there you are.↩︎</p></li>
<li id="fn35"><p>Think of this as the opposite of an adversarial example. We are trying to find the exact chain that is scared to leave the approximate chain behind. Which is either romantic or creepy, depending on finer details.↩︎</p></li>
<li id="fn36"><p>Well not me. <a href="https://arxiv.org/pdf/1608.01511.pdf">Florian Völlering</a> did it in his Theorem 1.4. I most certainly could not have done it.↩︎</p></li>
<li id="fn37"><p>Well the result does not need this to be a Markov chain!↩︎</p></li>
<li id="fn38"><p>it goes up if <img src="https://latex.codecogs.com/png.latex?%5Calpha%3E%5Ctilde%20%5Calpha"> otherwise it goes down↩︎</p></li>
<li id="fn39"><p>The 1 case can basically never happen except in the trivial case where both acceptance probabilities are the same. And if we thought that was going to happen we would’ve done something bloody else↩︎</p></li>
<li id="fn40"><p>The relative error being bounded does not stop the absolute error growing!↩︎</p></li>
<li id="fn41"><p>Look above and recognize the Geometric distribution↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {MCMC with the Wrong Acceptance Probability},
  date = {2022-11-23},
  url = {https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“MCMC with the Wrong Acceptance
Probability.”</span> November 23, 2022. <a href="https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html">https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html</a>.
</div></div></section></div> ]]></description>
  <category>Fundamentals</category>
  <category>MCMC</category>
  <category>Bayes</category>
  <guid>https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html</guid>
  <pubDate>Tue, 22 Nov 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/elvira.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>On that example of Robins and Ritov; or A sleeping dog in harbor is safe, but that’s not what sleeping dogs are for</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/robins-ritov.html</link>
  <description><![CDATA[ 





<section id="sometimes-its-the-parable-of-the-barren-fig-tree.-sometimes-youre-just-pissed-at-a-shrub." class="level1">
<h1>Sometimes it’s the parable of the barren fig tree. Sometimes you’re just pissed at a shrub.</h1>
<p>Paradoxes and counterexamples live in statistics as our morality plays and our ghost stories. They serve as the creepy gas station attendants that populate the roads leading to the curséd woods; existing not to force change on the adventurer, but to signpost potential danger.<sup>1</sup></p>
<p>As a rule, we should also look askance at attempts to resolve these paradoxes and counterexamples. That is not what they are for. They are community resources, objects of our collective culture, monuments to thwarted desire.</p>
<p>But sometimes, driven by the endless thirst for content, it’s worth diving down into a counterexample and resolving it. This quixotic quest is not to somehow patch a hole, but to rather expand the hole until it can comfortably encase our wants, needs, and prayers.</p>
<p>To that end, let’s gather ’round the campfire and attend the tale of The Bayesian and the Ancillary Coin.</p>
<p>This example<sup>2</sup> was introduced by Robins and Ritov, and greatly popularised (and frequently reformulated) by Larry Wasserman<sup>3</sup>. It says<sup>4</sup> this:</p>
<blockquote class="blockquote">
<p>A committed subjective Bayesian (one who cleaves to the likelihood principle tighter than Rose clings to that door) will sometimes get a very wrong answer under some simple, but realistic, forms of randomization. Only a less committed Bayesian will be able to skirt the danger.</p>
</blockquote>
<p>So this is what we’re going to do now. First let’s introduce a version of the problem that does not trigger the counterexample. We then introduce the randomization scheme that leads to the error and talk about exactly how things go wrong. As someone who is particularly skeptical of any claims to purity<sup>5</sup>, the next job is going to be deconstructing this idea of a committed<sup>6</sup> subjective Bayesian. I will, perhaps unsurprisingly, argue that this is the only part of the Robins and Ritov (and Wasserman) conclusions that are somewhat questionable. In fact, a <em>true</em> committed subjective Bayesian<sup>7</sup> can solve the problem. It’s just a matter of looking at it through the correct lens.</p>
<section id="a-counterexample-always-proceedes-from-the-least-interesting-premise" class="level2">
<h2 class="anchored" data-anchor-id="a-counterexample-always-proceedes-from-the-least-interesting-premise">A counterexample always proceeds from the least interesting premise</h2>
<p>This example exists in a number of forms that each add important corners to the problem, but in the interest of simplicity, we will start with a simple situation where no problems occur.</p>
<p>Assume that there is a large, but fixed, finite number <img src="https://latex.codecogs.com/png.latex?J">, and <img src="https://latex.codecogs.com/png.latex?J"> unknown parameters <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">, <img src="https://latex.codecogs.com/png.latex?j=1,%5Cldots,%20J">. The large number <img src="https://latex.codecogs.com/png.latex?J"> can be thought of as the number of strata in a population, while <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> are the means of the corresponding stratum. Now construct an experiment where you draw <img src="https://latex.codecogs.com/png.latex?%0Ay_i%20%5Cmid%20%5Cmu,x_i%20=%20j%20%5Csim%20N(%5Cmu_j,%201).%0A"> To close out the generative model, we assume that the covariates have a known distribution <img src="https://latex.codecogs.com/png.latex?x_i%20%5Csim%20%5Ctext%7BUnif%7D%5C%7B1,%5Cldots,%20J%5C%7D">.</p>
<p>A classical problem in mathematical statistics is to construct a <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bn%7D">-consistent<sup>8</sup> estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%5Cmu_n"> of the vector <img src="https://latex.codecogs.com/png.latex?%5Cmu">. But in the setting of this problem, this is quite difficult. The challenge is that if <img src="https://latex.codecogs.com/png.latex?J"> is a very large number, then we would need a gargantuan<sup>9</sup> number of observations (<img src="https://latex.codecogs.com/png.latex?n%20%5Cgg%20J">) in order to resolve all of the parameters properly.</p>
<p>But there is a saving grace! The <em>population</em><sup>10</sup> average <img src="https://latex.codecogs.com/png.latex?%0A%5Cmu%20=%20%5Cmathbb%7BE%7D(y)%20=%20%5Csum_%7Bj=1%7D%5EJ%20%5Cmu_j%20%5CPr(x%20=%20j)=%20%5Cfrac%7B1%7D%7BJ%7D%5Csum_%7Bj=1%7D%5EJ%20%5Cmu_j%0A"> can be estimated fairly easily. In fact, the sample mean (aka the most obvious estimator) <img src="https://latex.codecogs.com/png.latex?%5Cbar%7By%7D%20=%20n%5E%7B-1%7D%20%5Csum_%7Bi=1%7D%5En%20y_i"> is going to be <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bn%7D">-consistent.</p>
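<p>A quick toy simulation (all the numbers here are made up) shows just how forgiving the population mean is: with a million strata and only ten thousand observations, most <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> are never seen, yet the sample mean still lands within a couple of standard errors of the truth.</p>

```python
import numpy as np

# Toy version of the stratified setup: many more strata than observations
# (n << J), but the sample mean still pins down the population mean.
rng = np.random.default_rng(3)
J = 1_000_000                        # number of strata
mu = rng.normal(size=J)              # the unknown stratum means mu_j
pop_mean = mu.mean()                 # the population average we care about

n = 10_000                           # n << J: most strata are never observed
x = rng.integers(J, size=n)          # covariates, uniform over strata (0-indexed)
y = rng.normal(loc=mu[x], scale=1.0) # y_i | x_i = j ~ N(mu_j, 1)
err = abs(y.mean() - pop_mean)       # error is O(1/sqrt(n)), here ~ sqrt(2/n)
```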
<p>Similarly, if we were to construct a Bayesian estimate of the population mean based on the prior <img src="https://latex.codecogs.com/png.latex?%5Cmu_j%20%5Cmid%20m%20%5Csim%20N(m,%201)"> and <img src="https://latex.codecogs.com/png.latex?m%20%5Csim%20N(0,%5Ctau%5E2)">, then the posterior estimate of the population mean is, for large enough<sup>11</sup> <img src="https://latex.codecogs.com/png.latex?n">, <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20%5Cmu_%7B%5Ctext%7BBayes%7D,n%7D=%20%5Cmathbb%7BE%7D(%5Cmu%20%5Cmid%20y)%20%5Capprox%20%5Cfrac%7B1%7D%7Bn%20+%202/%5Ctau%7D%20%5Csum_%7Bi=1%7D%5En%20y_i.%0A"> This means that the<sup>12</sup> Bayesian resolution of this problem is roughly the same as the classical resolution. This is a nice thing. For very simple problems, these estimators should be fairly similar. It’s only when shit gets complicated that things become subtle.</p>
<p>This scenario, where a model is parameterized by an extremely high dimensional parameter <img src="https://latex.codecogs.com/png.latex?%5Cmu"> but the quantity of inferential interest is a low-dimensional summary of <img src="https://latex.codecogs.com/png.latex?%5Cmu">, is widely and deeply studied under the name of semi-parametric statistics.</p>
<p>Semi-parametric statistics is, unsurprisingly, harder than parametric statistics, but it is also quite a bit more challenging than non-parametric statistics. The reason is that if we want to guarantee a good estimate of a particular finite dimensional summary, it turns out that it’s not enough to generically get a “good” estimate of the high-dimensional parameter. In fact, getting a good estimate of the high-dimensional parameter is often not possible (see the example we just considered).</p>
<p>Instead, understanding semi-parametric models becomes the fine art of understanding what needs to be done well and what we can half arse. A description of this would take us <em>well</em> outside the scope of a mere blog post, but if you want to learn more about the topic, that’s what to google.</p>
</section>
<section id="robins-and-ritov-toss-an-ancillary-coin-and-let-slip-the-dogs-of-war" class="level2">
<h2 class="anchored" data-anchor-id="robins-and-ritov-toss-an-ancillary-coin-and-let-slip-the-dogs-of-war">Robins and Ritov toss an ancillary coin and let slip the dogs of war</h2>
<p>In order to destroy all that is right and good about the previous example, we only need to do one thing: randomize in a nefarious way. Robins and Ritov (actually, Wasserman who proposed the case with a finite <img src="https://latex.codecogs.com/png.latex?J">) add to their experiment <img src="https://latex.codecogs.com/png.latex?J"> biased coins <img src="https://latex.codecogs.com/png.latex?r_j"> with the property that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(r_j%20=%201%20%5Cmid%20X=j)%20=%20%5Cxi_j,%0A"> for some <em>known</em> <img src="https://latex.codecogs.com/png.latex?0%20%3C%20%5Cdelta%20%5Cleq%20%5Cxi_j%20%3C%201-%5Cdelta">, <img src="https://latex.codecogs.com/png.latex?j=1,%5Cldots,%20J"> and some <img src="https://latex.codecogs.com/png.latex?%5Cdelta%3E0">.</p>
<p>They then go through the data and add a column <img src="https://latex.codecogs.com/png.latex?r_i%20%5Csim%20%5Ctext%7BBernoulli%7D(%5Cxi_%7Bx_i%7D)">. The new data is now a three dimensional vector <img src="https://latex.codecogs.com/png.latex?(y_i,%20x_i,%20r_i)">. It’s important to this problem that the <img src="https://latex.codecogs.com/png.latex?%5Cxi_j"> are known and that we have the conditional independence structure <img src="https://latex.codecogs.com/png.latex?y%20%5Cperp%20r%20%5Cmid%20x">.</p>
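<p>A sketch of how you might simulate this augmented data in Python. The storage format ((y, x, r) triples with <code>None</code> standing in for <code>NA</code>) and all the specific numbers are my own illustrative choices, not anything from Robins and Ritov:</p>

```python
import random

# Generate (y, x, r) triples: x_i ~ Unif{1,...,J} (0-indexed here),
# r_i ~ Bernoulli(xi_{x_i}), and y_i recorded only when r_i = 1.
def make_data(n, mu, xi, rng):
    data = []
    J = len(mu)
    for _ in range(n):
        x = rng.randrange(J)
        r = 1 if rng.random() < xi[x] else 0
        y = rng.gauss(mu[x], 1.0) if r == 1 else None  # None plays the role of NA
        data.append((y, x, r))
    return data


rng = random.Random(2)
mu = [0.0, 2.0, -1.0]  # invented stratum means
xi = [0.9, 0.5, 0.1]   # invented (but "known") selection probabilities
data = make_data(1000, mu, xi, rng)
observed = [d for d in data if d[2] == 1]
print(len(observed), "of", len(data), "y values recorded")
```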
<p>Robins, Ritov, and Wasserman all ask the same question: Can we still estimate the population mean if we only observe samples from the <em>conditional</em> distribution <img src="https://latex.codecogs.com/png.latex?(y_i,%20x_i)%20%5Csim%20p(x,y%20%5Cmid%20r=1)">?</p>
<p>The answer turns out to be that there is a perfectly good estimator from classical survey statistics, but that a Bayesian estimator is a bit more challenging to find.</p>
<p>Before we get there, it’s worth noting that unlike the problem in the previous section, this problem is at least a little bit interesting. It’s a cartoon of a very common situation where there is covariate-dependent randomization in a clinical trial. Or, maybe even more cleanly, a cartoon of a simple probability survey.</p>
<p>A critical feature of this problem is that because the <img src="https://latex.codecogs.com/png.latex?%5Cxi_j"> are known and <img src="https://latex.codecogs.com/png.latex?p(x)"> is known, the joint likelihood factors as <img src="https://latex.codecogs.com/png.latex?%0Ap(y,x,r%20%5Cmid%20%5Cmu)%20=%20p(x)p(r%5Cmid%20x)%20p(y%20%5Cmid%20x,%20%5Cmu)%20=%20p(r%20,%20x)%20p(y%20%5Cmid%20x,%20%5Cmu),%0A"> so <img src="https://latex.codecogs.com/png.latex?r"> is ancillary<sup>13</sup> for <img src="https://latex.codecogs.com/png.latex?%5Cmu">.</p>
<p>The simplest classical estimator for <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(y)"> is the Horvitz-Thompson estimator <img src="https://latex.codecogs.com/png.latex?%0A%5Cbar%7By%7D_%5Ctext%7BHT%7D%20=%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5EN%20%5Cfrac%7Br_i%20y_i%7D%7B%5Cxi_%7Bx_i%7D%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?N"> is the total number of draws, including the ones where <img src="https://latex.codecogs.com/png.latex?y_i"> was not recorded. It’s easy to show that this is a <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bn%7D">-consistent estimator. Better yet, it’s <em>uniform</em> over <img src="https://latex.codecogs.com/png.latex?%5Cmu"> in the sense that the convergence of the estimator isn’t affected (to leading order) by the specific <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> values. This uniformity is quite useful as it gives some hope of good finite-data behaviour.</p>
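<p>In code the estimator is a one-liner. This sketch assumes the data is stored as (y, x, r) triples with <code>y = None</code> whenever <code>r = 0</code> (an illustrative layout, nothing canonical), and that the known selection probabilities come along as a list:</p>

```python
# Horvitz-Thompson: (1/N) * sum over all N draws of r_i * y_i / xi_{x_i},
# where the r_i = 0 terms contribute zero.
def horvitz_thompson(data, xi):
    N = len(data)
    return sum(y / xi[x] for (y, x, r) in data if r == 1) / N


# Tiny worked example with two strata and xi = (0.5, 0.25):
data = [(4.0, 0, 1), (None, 0, 0), (2.0, 1, 1), (None, 1, 0)]
print(horvitz_thompson(data, [0.5, 0.25]))  # (4/0.5 + 2/0.25) / 4 = 4.0
```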
<p>So now that we know that the problem <em>can</em> be solved, let’s see if we can solve it in a Bayesian way. Robins and Ritov gave the following result.</p>
<blockquote class="blockquote">
<p>There is no uniformly consistent Bayesian estimator of the parameter <img src="https://latex.codecogs.com/png.latex?%5Cmu"> unless the prior depends on the <img src="https://latex.codecogs.com/png.latex?%5Cxi_j"> values.</p>
</blockquote>
<p>Robins and Ritov argue that a “committed subjective Bayesian” would, by the Likelihood Principle, never allow their prior to depend on the ancillary statistic <img src="https://latex.codecogs.com/png.latex?%5Cxi"> as the Likelihood Principle clearly states that inference should be independent of ancillary information.</p>
<p>There are, of course, ways to construct priors that depend on the sampling probabilities. Wasserman calls this “frequentist chasing”.</p>
<p>So let’s investigate this, by talking about what went wrong, how to fix it, and whether fixing it makes us bad Bayesians.</p>
</section>
</section>
<section id="the-likelihood-principle-and-the-death-of-nuance" class="level1">
<h1>The likelihood principle and the death of nuance</h1>
<p>So what is the likelihood principle and why is it being such a bastard to us poor liddle bayesians?</p>
<p>The likelihood principle says, roughly, that all of the information needed for parameter inference<sup>14</sup> should be contained in the likelihood function.</p>
<p>In particular, if we follow the likelihood principle, then if we have two likelihoods that are scalar multiples of each other, our estimates of the parameters should be the same.</p>
<p>Ok. Sure.</p>
<p>Why on earth do people care about the likelihood principle? I guess it’s because they aren’t happy with the fact that Bayesian methods actually work in practice and instead want to do some extremely boring philosophy-ish stuff to “prove” the superiority and purity of Bayesian methods. And, you know, all power<sup>15</sup> to them. Your kink is not my kink.</p>
<p>In this context, it means that because <img src="https://latex.codecogs.com/png.latex?r"> is ancillary to <img src="https://latex.codecogs.com/png.latex?y"> for estimating <img src="https://latex.codecogs.com/png.latex?%5Cmu"> we should avoid using the <img src="https://latex.codecogs.com/png.latex?r_i">s (and the <img src="https://latex.codecogs.com/png.latex?%5Cxi_j">s) to estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu">. This is in direct opposition to what the Horvitz-Thompson estimator uses.</p>
<p>What happens if we follow this principle? We get a bad estimate.</p>
<p>It’s pretty easy to see that the posterior mean will, eventually, converge to the true value. All that has to happen is that you see enough observations in each category. So if you get enough data, you will eventually get a good estimate.</p>
<p>Unfortunately, when <img src="https://latex.codecogs.com/png.latex?J"> is large, this will potentially take a very very long<sup>16</sup> time.</p>
<p>Let’s go a bit deeper and see why this behaviour is not wrong, <em>per se</em>, it’s just Bayesian.</p>
<p>Bayesian inference produces a posterior distribution, which is conditional on an observed sample. This posterior distribution is an update to the prior that describes how compatible different parameter configurations are with the observed sample.</p>
<p>The thing is, our sample only sees a small sample of the values of <img src="https://latex.codecogs.com/png.latex?x">. This means that we are, essentially, estimating <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_x%20(%5Cmathbb%7BE%7D(y%20%5Cmid%20x)%201_%7Bx%20%5Cin%20A_%7Br%7D%7D%20%5Cmid%20r),%0A"> where <img src="https://latex.codecogs.com/png.latex?A_r"> is the set of observed values of <img src="https://latex.codecogs.com/png.latex?x">, which depends on <img src="https://latex.codecogs.com/png.latex?r">. This target changes as we get more data and see more levels of <img src="https://latex.codecogs.com/png.latex?x"> and eventually coalesces towards the thing we are trying to compute.</p>
<p>But, and this is critical, we <em>cannot</em> say <em>anything</em> about <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> for <img src="https://latex.codecogs.com/png.latex?j%20%5Cnot%20%5Cin%20A_r"> unless we can assume that they are, in some sense, very strongly related. Unfortunately, the whole point of this example is that we are not allowed<sup>17</sup> to assume that!</p>
<p>In this extremely flexible model, it’s possible to have a sequence <img src="https://latex.codecogs.com/png.latex?%5Cxi_j"> that is highly correlated<sup>18</sup> with <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">. If, for instance, <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bexpit%7D(%5Cmu_j)%20=%20%5Cxi_j"> were<sup>19</sup> equally spaced on <img src="https://latex.codecogs.com/png.latex?%5B%5Cdelta,%201-%5Cdelta%5D"> for some small <img src="https://latex.codecogs.com/png.latex?%5Cdelta%3E0">, you would have the situation where you are very likely to see the largest values of <img src="https://latex.codecogs.com/png.latex?y"> and quite unlikely to see any of the smaller values. This would gravely bias your sample mean upwards.</p>
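<p>This is easy to see in simulation. In the sketch below (every number invented), the selection probabilities are exactly expit of the stratum means, and the naive average of the recorded y values overshoots the population mean by a lot:</p>

```python
import math
import random
import statistics


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


# Adversarial setup: xi_j = expit(mu_j), so strata with big means are
# far more likely to have their y values recorded.
rng = random.Random(3)
J = 500
mu = [rng.gauss(0.0, 2.0) for _ in range(J)]
xi = [expit(m) for m in mu]
pop_mean = statistics.mean(mu)

observed = []
for _ in range(20_000):
    j = rng.randrange(J)                      # x_i ~ Unif{1,...,J}
    if rng.random() < xi[j]:                  # r_i ~ Bernoulli(xi_j)
        observed.append(rng.gauss(mu[j], 1.0))

naive = statistics.mean(observed)
print(naive - pop_mean)  # the naive mean sits well above the truth
```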
<p>This construction is similar to the one that Robins and Ritov use to prove that there is always a parameter value where the posterior mean converges<sup>20</sup> to the true mean at a rate no faster than <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D%5Cleft(%5Cfrac%7B(%5Clog%20%5Clog%20n)%5E2%7D%7B%5Clog%20n%7D%5Cright)">, which would require an exponentially large number of samples to do any sort of inference.</p>
<p>A reasonable criticism of this argument is that surely most problems will not have strong correlation between the sampling probabilities and the conditional means. In a follow-up paper, <a href="https://projecteuclid.org/journals/statistical-science/volume-29/issue-4/The-Bayesian-Analysis-of-Complex-High-Dimensional-Models--Can/10.1214/14-STS483.full">Ritov <em>et al.</em></a> argue that it’s not necessarily all that rare. For instance, if they are both realisations of independent GPs<sup>21</sup>, the empirical correlation between the two observed sequences can be far from zero! Less abstractly, it’s pretty easy to imagine something that is more popular with old people (who often answer their phones) than with young people (who don’t typically answer their phones). So this type of adversarial correlation certainly can happen in practice.</p>
</section>
<section id="can-we-save-bayes" class="level1">
<h1>Can we save Bayes?</h1>
<p>No.</p>
<p>Bayes does not need to be saved. She is doing exactly what she set out to do and is living her best life. Do not interfere<sup>22</sup>.</p>
<p>So let’s look at why we don’t need to fix things.</p>
<section id="a-simple-posterior-and-its-post-processing" class="level2">
<h2 class="anchored" data-anchor-id="a-simple-posterior-and-its-post-processing">A simple posterior and its post-processing</h2>
<p>Once again, recall the setting: we are observing the triple<sup>23</sup> <img src="https://latex.codecogs.com/png.latex?%0Az_i%20=%20(x_i,r_i,y_i)%20=%20(x_i,%20r_i,%20%5Ctexttt%7Br%5Bi%5D==1?%20y%5Bi%5D:%20NA%7D).%0A"> In particular, we can process this data to get some quantities:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?N">: The total sample size</li>
<li><img src="https://latex.codecogs.com/png.latex?n=%20%5Csum_%7Bi=1%7D%5EN%20r_i">: The number of observed <img src="https://latex.codecogs.com/png.latex?y"></li>
<li><img src="https://latex.codecogs.com/png.latex?N_j%20=%20%5Csum_%7Bi=1%7D%5EN%201_%7Bx_i%20=%20j%7D">: The total number of times group <img src="https://latex.codecogs.com/png.latex?j"> was sampled</li>
<li><img src="https://latex.codecogs.com/png.latex?n_j%20=%20%5Csum_%7Bi=1%7D%5EN%20r_i1_%7Bx_i%20=%20j%7D">: The number of times an observation from group <img src="https://latex.codecogs.com/png.latex?j"> was recorded.</li>
</ul>
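<p>These summaries are one pass over the data. The sketch below assumes the illustrative (y, x, r) triple layout, with <code>y = None</code> when <code>r = 0</code> and groups indexed from 0; the helper name is mine:</p>

```python
from collections import Counter

# Compute N, n, and the per-group counts N_j and n_j from (y, x, r) triples.
def summarise(data, J):
    N = len(data)                                     # total sample size
    n = sum(r for (_, _, r) in data)                  # number of observed y
    N_counts = Counter(x for (_, x, _) in data)       # times group j was sampled
    n_counts = Counter(x for (_, x, r) in data if r)  # times y was recorded
    return N, n, [N_counts[j] for j in range(J)], [n_counts[j] for j in range(J)]


data = [(1.0, 0, 1), (None, 0, 0), (2.0, 1, 1), (3.0, 1, 1)]
print(summarise(data, 3))  # (4, 3, [2, 2, 0], [1, 2, 0])
```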
<p>Because of the structure of the problem, most observed values of <img src="https://latex.codecogs.com/png.latex?N_j"> and <img src="https://latex.codecogs.com/png.latex?n_j"> will be zero or one.</p>
<p>Nevertheless, we persist.</p>
<p>We now need priors on the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">. There are probably a tonne of options here, but I’m going to go with the simplest one, which is just to make them iid <img src="https://latex.codecogs.com/png.latex?N(0,%20%5Ctau%5E2)"> for some fixed and known value <img src="https://latex.codecogs.com/png.latex?%5Ctau">. We can then fit the resulting model and get the posterior for each <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">. Note that because of the data sparsity, most of the posteriors will just be the same as the prior.</p>
<p>Then we can ask ourselves a much more Bayesian question: What would the average in our sample have been if we had recorded every <img src="https://latex.codecogs.com/png.latex?y_i">? Our best estimate of that quantity is <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bj=1%7D%5EJ%20N_j%20%5Cmu_j%0A"></p>
<p>That’s all well and good. And, again, if I had small enough <img src="https://latex.codecogs.com/png.latex?J"> or large enough <img src="https://latex.codecogs.com/png.latex?N"> that I had a good estimate for all of the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">, this would be a good estimate. Moreover, for finite data this is likely to be a much better estimator than <img src="https://latex.codecogs.com/png.latex?J%5E%7B-1%7D%5Csum_%7Bj=1%7D%5EJ%20%5Cmu_j"> as it at least partially corrects for any potential imbalance in the covariate sampling.</p>
<p>It’s also worth noting here that there is nothing “Bayesian” about this. I am simply taking the knowledge I have from the sample I observed and processing the posterior to compute a quantity that I am interested in.</p>
<p>But, of course, that isn’t actually the quantity that I’m interested in. I’m interested in that quantity averaged over realisations of <img src="https://latex.codecogs.com/png.latex?r">. We can compute this if we can quantify the effect that <img src="https://latex.codecogs.com/png.latex?n_j"> has on <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">.</p>
<p>We can do this pretty easily. Our priors are iid<sup>24</sup>, so this decouples into <img src="https://latex.codecogs.com/png.latex?J"> independent normal-normal models.</p>
<p>For any <img src="https://latex.codecogs.com/png.latex?j">, denote <img src="https://latex.codecogs.com/png.latex?y%5E%7B(j)%7D"> as the subset of <img src="https://latex.codecogs.com/png.latex?y"> that are in category <img src="https://latex.codecogs.com/png.latex?j">. We have that<sup>25</sup> <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(%5Cmu_j%20%5Cmid%20y)%20&amp;%5Cpropto%20%5Cexp%5Cleft(-%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bi=1%7D%5E%7Bn_j%7D(y%5E%7B(j)%7D_i%20-%20%5Cmu_j)%5E2%20-%20%5Cfrac%7B1%7D%7B2%5Ctau%5E2%7D%5Cmu_j%5E2%5Cright)%5C%5C%0A&amp;%5Cpropto%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7D%5Cleft(%5Cfrac%7B1%7D%7B%5Ctau%5E2%7D%20+%20n_j%5Cright)%5Cmu_j%5E2%20+%20%5Cmu_j%5Csum_%7Bi=1%7D%5E%7Bn_j%7Dy_i%5E%7B(j)%7D%5Cright%5D.%0A%5Cend%7Balign*%7D"></p>
<p>If we expand the density for a <img src="https://latex.codecogs.com/png.latex?%5Cmu_j%20%5Cmid%20y%20%5Csim%20N(m,v%5E2)"> we get <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cmu_j%20%5Cmid%20y)%20%5Cpropto%20%5Cexp%5Cleft(-%5Cfrac%7B1%7D%7B2v%5E2%7D%5Cmu_j%5E2%20+%20%5Cfrac%7B1%7D%7Bv%5E2%7Dm%5Cmu_j%5Cright).%0A"> Matching terms in these two expressions we get that <img src="https://latex.codecogs.com/png.latex?%0Av_j%5E%5Ctext%7Bpost%7D%20=%20%5Coperatorname%7BVar%7D(%5Cmu_j%20%5Cmid%20y,%20n_j)%20=%20%20%5Cfrac%7B1%7D%7Bn_j%20+%20%5Ctau%5E%7B-2%7D%7D,%0A"> while the posterior mean is <img src="https://latex.codecogs.com/png.latex?%0Am_j%5E%5Ctext%7Bpost%7D%20=%20%5Cmathbb%7BE%7D(%5Cmu_j%20%5Cmid%20y,%20n_j)%20=%20%5Cfrac%7B1%7D%7Bn_j%20+%20%5Ctau%5E%7B-2%7D%7D%5Csum_%7Bi=1%7D%5E%7Bn_j%7Dy_i%5E%7B(j)%7D,%0A"> where I’ve suppressed the dependence on the sample <img src="https://latex.codecogs.com/png.latex?y"> in the <img src="https://latex.codecogs.com/png.latex?m_j"> and <img src="https://latex.codecogs.com/png.latex?v_j"> notation because, as a true<sup>26</sup> Bayesian, my sample is fixed and known. Hence <img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_j%20%5Cmid%20y%20%5Csim%20N(m_j%5E%7B%5Ctext%7Bpost%7D%7D,%20v_j%5E%7B%5Ctext%7Bpost%7D%7D).%0A"></p>
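<p>The whole update is two lines of code per group (a minimal sketch; the function name is mine and it takes the recorded y values for a single group):</p>

```python
# Conjugate normal-normal update for one group: prior mu_j ~ N(0, tau^2),
# likelihood y_i ~ N(mu_j, 1). Returns (posterior mean, posterior variance).
def posterior(ys, tau):
    prec = len(ys) + tau ** -2  # n_j + tau^{-2}
    return sum(ys) / prec, 1.0 / prec


# With no recorded observations, the posterior is just the prior N(0, tau^2):
print(posterior([], 2.0))  # (0.0, 4.0)
```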
<p>Then I get the following estimator for the mean of the complete sample <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft(%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bj=1%7D%5EJN_j%5Cmu_j%20%5Cmid%20y%20%5Cright)=%20%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bj=1%7D%5EJN_jm_j%5E%5Ctext%7Bpost%7D.%0A"> We can also compute the posterior variance<sup>27</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BVar%7D%5Cleft(%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bj=1%7D%5EJN_j%5Cmu_j%20%5Cmid%20y%20%5Cright)=%5Csum_%7Bj=1%7D%5EJ%5Cfrac%7BN_j%5E2%7D%7BN%5E2%7Dv_j%5E%5Ctext%7Bpost%7D.%0A"> Note that most of the groups won’t have a corresponding observation, so, recalling that <img src="https://latex.codecogs.com/png.latex?A_r"> is the set of <img src="https://latex.codecogs.com/png.latex?j">s that have been updated in the sample, we get <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BVar%7D%5Cleft(%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bj=1%7D%5EJN_j%5Cmu_j%20%5Cmid%20y%20%5Cright)=%5Csum_%7Bj%5Cin%20A_r%7D%5Cfrac%7BN_j%5E2%7D%7BN%5E2%7Dv_j%5E%5Ctext%7Bpost%7D%20+%20%5Ctau%5E2%5Csum_%7Bj%20%5Cnot%20%5Cin%20A_r%7D%5Cfrac%7BN_j%5E2%7D%7BN%5E2%7D,%0A"> where the term that multiplies <img src="https://latex.codecogs.com/png.latex?%5Ctau%5E2"> is less than 1.</p>
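<p>Putting the pieces together, the posterior mean and variance of the complete-sample average are just weighted sums of the per-group posteriors. A sketch (groups with no recorded observations keep their prior values, m = 0 and v = τ²):</p>

```python
# Posterior mean and variance of (1/N) * sum_j N_j * mu_j, given each group's
# posterior mean m_post[j] and posterior variance v_post[j].
def complete_sample_mean(N_j, m_post, v_post):
    N = sum(N_j)
    mean = sum(Nj * m for Nj, m in zip(N_j, m_post)) / N
    var = sum(Nj ** 2 * v for Nj, v in zip(N_j, v_post)) / N ** 2
    return mean, var


# Two groups: one updated by data, one still sitting at its prior (tau = 2).
print(complete_sample_mean([2, 2], [1.0, 0.0], [0.5, 4.0]))  # (0.5, 1.125)
```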
<p>So that’s all well and good, but that isn’t really the thing we were trying to estimate. We are actually interested in estimating the population mean, which we will get if we let <img src="https://latex.codecogs.com/png.latex?N%5Crightarrow%20%5Cinfty">.</p>
<p>So let’s see if we can do this without violating any of the universally agreed upon sacred strictures of Bayes.</p>
</section>
<section id="modelling-the-effect-of-the-ancillary-coin" class="level2">
<h2 class="anchored" data-anchor-id="modelling-the-effect-of-the-ancillary-coin">Modelling the effect of the ancillary coin</h2>
<p>Here’s the thing, though. We have computed our posterior distributions <img src="https://latex.codecogs.com/png.latex?p(%5Cmu_j%20%5Cmid%20y)"> and we can now use them as a generative model<sup>28</sup> for our data. We also have the composition of the complete data set (the <img src="https://latex.codecogs.com/png.latex?N_j">s) and full knowledge about how a new sample of the <img src="https://latex.codecogs.com/png.latex?n_j">s would come into our world.</p>
<p>We can put these things together! And that’s not in any way violating our Bayesian oaths! We are simply using our totally legally obtained posterior distribution to compute things. We are still true committed<sup>29</sup> subjective Bayesians.</p>
<p>So we are going to ask ourselves a simple question. Imagine, for a given <img src="https://latex.codecogs.com/png.latex?N_j">, we have <img src="https://latex.codecogs.com/png.latex?n_j%20%5Csim%20%5Ctext%7BBinom%7D(N_j,%20%5Cxi_j)"> iid samples<sup>30</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%7By%7D%5E%7B(j)%7D_i%20%5Csim%20N(m_j%5E%5Ctext%7Bpost%7D,%20v_j%5E%5Ctext%7Bpost%7D%20+%201).%0A"> What is the posterior mean <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(%5Cmu_j%20%5Cmid%20%5Ctilde%7By%7D%5E%7B(j)%7D,%20N_j)">? In fact, because this is random data drawn from a hypothetical sample, we can (and should<sup>31</sup>) ask questions about its distribution! To be brutally francis with you, I am too lazy to work out the variance of the posterior mean. So I’m just going to look at the mean of the posterior mean.</p>
<p>First things first, we need to look at the (average) posterior for <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> when <img src="https://latex.codecogs.com/png.latex?n_j%20=%20n">. The exact calculation we did before gives us <img src="https://latex.codecogs.com/png.latex?%0Am_j(n)%20=%20%5Cleft(1-%5Cfrac%7B1%7D%7B%5Ctau%5E2n%20+%201%7D%5Cright)%20m_j%5E%5Ctext%7Bpost%7D.%0A"> And, while I said I wasn’t going to focus on the variance, it’s easy enough to write down as <img src="https://latex.codecogs.com/png.latex?%0Av_j(n)%20=%20%5Cfrac%7B1%7D%7Bn%20+%20%5Ctau%5E%7B-2%7D%7D%20+%20%5Cleft(1%20-%20%5Cfrac%7B1%7D%7B%5Ctau%5E2n%20+%201%7D%5Cright)(1%20+%20v%5E%5Ctext%7Bpost%7D_j),%0A"> where the second term takes into account the variance due to the imputation.</p>
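<p>Those two expressions are easy to mistranscribe, so here they are as code (a direct sketch of the formulas above; the names are mine):</p>

```python
# Average posterior mean and variance for mu_j after n hypothetical draws,
# given the observed-data posterior summaries m_post and v_post.
def m_j(n, m_post, tau):
    return (1.0 - 1.0 / (tau ** 2 * n + 1.0)) * m_post


def v_j(n, v_post, tau):
    return 1.0 / (n + tau ** -2) + (1.0 - 1.0 / (tau ** 2 * n + 1.0)) * (1.0 + v_post)


# Sanity checks: no hypothetical data shrinks all the way back to the prior
# mean of 0, while lots of hypothetical data recovers m_post.
print(m_j(0, 3.0, 1.0))        # 0.0
print(m_j(10 ** 6, 3.0, 1.0))  # ~3.0
```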
<p>With this, we can estimate the sample mean for any number <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20N"> and any set of <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20N_j"> that sum to <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20N"> and any set of <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20n_j%20%5Csim%20%5Ctext%7BBinom%7D(%5Ctilde%20N_j,%20%5Cxi_j)"> as <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cfrac%7B1%7D%7B%5Ctilde%20N%7D%5Csum_%7Bj=1%7D%5EJ%20%5Ctilde%20N_j%20m_j(%5Ctilde%20n_j)%20&amp;=%20%5Cfrac%7B1%7D%7B%5Ctilde%20N%7D%5Csum_%7Bj=1%7D%5EJ%20%5Cfrac%7B%5Ctilde%20N_j%7D%7B%5Ctilde%20n_j%7D%20%5Ctilde%20n_j%20m_j(%5Ctilde%20n_j)%20%5C%5C%0A&amp;=%20%5Cfrac%7B1%7D%7B%5Ctilde%20N%7D%5Csum_%7Bj=1%7D%5EJ%20%5Cfrac%7B1%7D%7B%5Cxi_j%7D%20%5Ctilde%20n_j%20m_j%5E%5Ctext%7Bpost%7D%20+%20o(1),%0A%5Cend%7Balign*%7D"> where in the last line I’ve used the fact that the empirical proportion converges to <img src="https://latex.codecogs.com/png.latex?%5Cxi_j"> and the posterior mean converges to <img src="https://latex.codecogs.com/png.latex?m_j%5E%5Ctext%7Bpost%7D">. The little-o<sup>32</sup> error term is as <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20N"> (and hence <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20N_j"> and <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20n_j">) goes to infinity.</p>
<p>To turn this into a practical estimate, we can plug in our values of <img src="https://latex.codecogs.com/png.latex?n_j"> and <img src="https://latex.codecogs.com/png.latex?N"> to get our Bayesian approximation to the population mean <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Chat%20%5Cmu%20&amp;=%20%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bj=1%7D%5EJ%20%5Cfrac%7Bn_j%7D%7B%5Cxi_j%7Dm_j%5E%7B%5Ctext%7Bpost%7D%7D%20%5C%5C%0A&amp;=%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bj%20%5Cin%20A_r%7D%20%5Cfrac%7Bn_j%7D%7B%5Cxi_j%7Dm_j%5E%5Ctext%7Bpost%7D%20%5C%5C%0A&amp;=%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bj=1%7D%5EJ%5Csum_%7Bi=1%7D%5E%7Bn_j%7D%20%5Cfrac%7B1%7D%7B%5Cxi_j%7D%5Cleft(1%20-%20%5Cfrac%7B%5Ctau%5E%7B-2%7D%7D%7Bn_j%7D%5Cright)y_i%5E%7B(j)%7D,%0A%5Cend%7Balign*%7D"> which is (up to the small term in brackets) the Horvitz-Thompson estimator!</p>
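<p>To see the “up to the small term in brackets” claim numerically, here’s a sketch computing both the derived estimate and the Horvitz-Thompson estimate on the same toy data (everything here is invented). With a diffuse prior (large τ) the two nearly coincide:</p>

```python
# Compare (1/N) sum_j (n_j / xi_j) m_post_j with Horvitz-Thompson, where
# group_ys[j] holds the recorded y values for group j.
def estimates(group_ys, xi, N, tau):
    bayes = ht = 0.0
    for ys, x in zip(group_ys, xi):
        n_j = len(ys)
        if n_j:
            m_post = sum(ys) / (n_j + tau ** -2)  # conjugate posterior mean
            bayes += (n_j / x) * m_post
            ht += sum(ys) / x                     # un-shrunk Horvitz-Thompson term
    return bayes / N, ht / N


group_ys = [[1.0, 2.0], [4.0], []]  # recorded y's per group (group 2 unseen)
xi = [0.5, 0.25, 0.1]
b, h = estimates(group_ys, xi, N=10, tau=10.0)
print(b, h)  # the Bayesian estimate is slightly shrunk towards zero
```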
</section>
<section id="is-it-bayesian" class="level2">
<h2 class="anchored" data-anchor-id="is-it-bayesian">Is it Bayesian?</h2>
<p>I stress, again, that there is nothing inherently non-Bayesian about this derivation. Except possibly the question that it is asking. What I did was compute the posterior distribution and then I took it seriously and used it to compute a quantity of interest.</p>
<p>The only oddity is that the quantity of interest (the population mean) has a slightly awkward link to the observed sample. Hence, I estimated something that had a more direct link to the population mean: the sample mean of the completely observed sample under different realisations of the randomisation <img src="https://latex.codecogs.com/png.latex?r_i">.</p>
<p>In order to estimate the sample mean under different realisations of the randomisation, I needed to use the posterior predictive distribution to impute these fictional samples. I then averaged over the imputed samples and sent the sample size to infinity to get an estimator<sup>33</sup>.</p>
<p>Or, to put it differently, I used Bayes to get a posterior estimate for new data <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Ctilde%20y,%20%5Ctilde%20r,%20%5Ctilde%20x)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5EJ%7Dp(%5Ctilde%20y%20%5Cmid%20%5Ctilde%20x,%20%5Cmu)%5C,p(%5Cmu%20%5Cmid%20y)%5C,d%5Cmu%20%5C,%20p(%5Ctilde%20r%20%5Cmid%20%5Ctilde%20x)%20p(%5Ctilde%20x)%0A"> and then used this probabilistic model to estimate <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(%5Ctilde%20y)">. There was no reason to use Bayesian methods to do this. Non-Bayesian questions do not invite Bayesian answers.</p>
<p>Now, would I go to all of this effort in real life? Probably not. And in the applications that I’ve come across, I’ve never had to. I’ve done a bunch of MRP<sup>34</sup>, which is structurally quite similar to this problem except we can reasonably model the dependence structure between the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">s. <a href="https://arxiv.org/abs/1908.06716">This paper</a> I wrote with Alex Gao, Lauren Kennedy, and Andrew Gelman is an example of the type of modelling you can do.</p>
</section>
</section>
<section id="is-it-true-am-i-a-chaser" class="level1">
<h1>Is it true? Am I a chaser?</h1>
<p>Wasserman derides “frequentist chasing” Bayesians, making the point that if they want a frequentist guarantee so badly, why not just do it the easy way.</p>
<p>Now. Laz. Mate.</p>
<p>Let me tell you that a lot of my self esteem has been traditionally gathered from chasers, so I absolutely refuse to be party to the slander.</p>
<p>But more than that, let’s be clear. Bayes is a way to probabilistically describe data. That is not enough in and of itself to be useful. For it to be useful, we need to <em>do something</em> with that posterior distribution.</p>
<p>So really, let’s talk about what a <em>true committed subjective Bayesian</em> does about this. Firstly, I mean really. There is no such thing<sup>35</sup>. But leaving that aside, the closest I can get to a working definition is that a true committed subjective Bayesian is a person who understands that parameters are polite fictions that are used to describe the data. They are not, inherently, linked to any population quantity (for a true committed subjective Bayesian, such a thing does not exist).</p>
<p>The <em>only</em> way to link parameters in a Bayesian model to a population quantity of interest is to use some sort of extra-Bayesian<sup>36</sup> information.</p>
<p>For instance, in the first example (the one without the ancillary coin), I made that link in secret using assumptions about the sample. We all know that those types of assumptions are fraught and the reason that people spend so much time whispering DAG into the ears of their sleeping lovers.</p>
<p>For the ancillary coin example, we used the given information about the sampling mechanism as our extra information to link our posterior distribution to the population quantity of interest. None of this changes the <em>purity</em><sup>37</sup> of the Bayesian analysis. Or makes a non-Bayesian solution preferable. (Although, in this case, a non-Bayesian solution is a fuckload easier to come up with.)</p>
<p>Of course Wasserman (and I presume Robins and Ritov) know all of this. But it’s fun to write it all down.</p>
<p>Moreover, I think that the three lessons here are fairly transferable:</p>
<ol type="1">
<li>If you’re going to go to the trouble of computing a posterior, take it seriously. Use it to do things! You can even put it in as part of a probabilistic model.</li>
<li>If you’re going to make Bayes work for you, think in terms of observables (eg the mean of the complete sample) rather than parameters.</li>
<li>Appeals to purity are a bit of a waste of time.</li>
</ol>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Huge thanks to Sameer Deshpande for great comments!↩︎</p></li>
<li id="fn2"><p>I first came across this in a <a href="https://normaldeviate.wordpress.com/2012/10/11/the-robins-ritov-example-a-post-mortem/">series of posts</a> on Larry Wasserman’s now defunct but quite excellent blog.↩︎</p></li>
<li id="fn3"><p>It’s worth saying that these three people do fabulous statistics of the form that I don’t usually do. But that doesn’t make it less important to understand their contributions. You could say that while I am not a Lazbian, I think it’s important to know the theory.↩︎</p></li>
<li id="fn4"><p>I might have slightly reworded it.↩︎</p></li>
<li id="fn5"><p>Purity is needed in good olive oil and that’s it↩︎</p></li>
<li id="fn6"><p>A committed subjective Bayesian prefers Dutch baby to a Dutch book.↩︎</p></li>
<li id="fn7"><p>A true committed subjective Bayesian doesn’t wear anything under his kilt.↩︎</p></li>
<li id="fn8"><p>That is, an estimator where <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bn%7D(%5Chat%20%5Cmu_n%20-%20%5Cmu)"> is bounded in probability. This, roughly, means that you can find a <img src="https://latex.codecogs.com/png.latex?C"> such that <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Cin%20%5B%20%5Chat%20%5Cmu_n%20-%20C/%5Csqrt%7Bn%7D,%20%5Chat%20%5Cmu_n%20+%20C/%5Csqrt%7Bn%7D%5D"> with high probability.↩︎</p></li>
<li id="fn9"><p>The asymptotics say that we should count our data in multiples of <img src="https://latex.codecogs.com/png.latex?J">, so we’d need <img src="https://latex.codecogs.com/png.latex?n%20%3E%20100J"> to get even one decimal place of accuracy.↩︎</p></li>
<li id="fn10"><p>Remember <img src="https://latex.codecogs.com/png.latex?%5Cmu_j%20=%20%5Cmathbb%7BE%7D(y%20%5Cmid%20x=j)">.↩︎</p></li>
<li id="fn11"><p>Theorem 2 of <a href="https://argmin.lis.tu-berlin.de/papers/07-harmeling-tr.pdf">Harmeling and Toussaint</a>↩︎</p></li>
<li id="fn12"><p>a↩︎</p></li>
<li id="fn13"><p>If you’ve not come across it, <em>ancillary</em> is the term used for parts of the data that don’t influence parameter estimates. It’s the opposite of a sufficient statistic. One way to see that it’s ancillary for <em>any</em> model <img src="https://latex.codecogs.com/png.latex?p(y%5Cmid%20x,%20%5Ctheta)">, is to consider the log of the joint density <img src="https://latex.codecogs.com/png.latex?%0A%5Clog(p(x,y,r%20%5Cmid%20%5Ctheta))%20=%20%5Clog%20p(y%5Cmid%20x,%20%5Ctheta)%20+%20%5Clog%20p(r%20%5Cmid%20x)%20+%20%5Clog%20p(x)%0A">, where the last two terms are constant in <img src="https://latex.codecogs.com/png.latex?%5Ctheta">.↩︎</p></li>
<li id="fn14"><p>You need to be specific here. Obviously this would be false if you were trying to do a statistical prediction. Or if you were trying to make a decision. Those things necessarily depend on extra stuff!↩︎</p></li>
<li id="fn15"><p>This is a lie. Insisting on talking about this shit rather than actually making Bayes useful and using it in new and exciting ways to do things that are hard to do without Bayesian methods is a waste of time. Worse than that, when you start pretending your method of choice is the only possible thing that a sensible and principled person would use, you start to look like a bit of a dickhead. It also turns people off trying these very flexible and useful methods. So yeah. I maybe do care a little bit. ↩︎</p></li>
<li id="fn16"><p>The expected number of samples to see one draw where <img src="https://latex.codecogs.com/png.latex?x_i%20=j"> is <img src="https://latex.codecogs.com/png.latex?J">. The expected number of draws where <img src="https://latex.codecogs.com/png.latex?x_i%20=%20j"> that you need to actually observe the corresponding <img src="https://latex.codecogs.com/png.latex?y_i"> is <img src="https://latex.codecogs.com/png.latex?%5Cxi_j%5E%7B-1%7D">. This suggests it will potentially take <em>a lot</em> of draws to even have effectively one sample from each category, let alone the 20-100 you’d practically need to get some sort of reasonable estimate.↩︎</p></li>
<li id="fn17"><p>Robins and Ritov have always been open that if there is a true parametric model for the <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(Y%20%5Cmid%20x%20=%20j)"> (or if that function is “very smooth” in some technical sense, eg a realisation of a smooth Gaussian process) then the Bayesian estimator that incorporates this information will do perfectly well. ↩︎</p></li>
<li id="fn18"><p>The RR example uses binary data, so there it’s the correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(y%20%5Cmid%20x=j)"> and <img src="https://latex.codecogs.com/png.latex?%5Cxi_j">, but the exact same argument works if <img src="https://latex.codecogs.com/png.latex?%5Cxi_j"> is correlated with something like <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bexpit%7D(%5Cmu_j)">. I went with the Gaussian version because at one point I thought I might end up having to derive posteriors and I’m all about simplicity.↩︎</p></li>
<li id="fn19"><p>expit is the inverse of the logit transform↩︎</p></li>
<li id="fn20"><p>Check <a href="https://cdn1.sph.harvard.edu/wp-content/uploads/sites/343/2013/03/coda.pdf">the paper</a> for the details as the situation is slightly different to the one I’m sketching out here, but there’s no real substantive difference.↩︎</p></li>
<li id="fn21"><p>Of course, if this were true we could use a GP prior for the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">s and we’d probably get a decent estimator anyway.↩︎</p></li>
<li id="fn22"><p>If you want to interfere, there are plenty of ways to build priors that incorporate the <img src="https://latex.codecogs.com/png.latex?%5Cxi_j"> information. The <a href="https://projecteuclid.org/journals/statistical-science/volume-29/issue-4/The-Bayesian-Analysis-of-Complex-High-Dimensional-Models--Can/10.1214/14-STS483.full">Ritov etc paper</a> has nice references to the various things that sprung up from this example. Are these useful beyond simply making sure the posterior mean of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> estimates <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(y)">? Not really. They are priors designed to solve exactly one problem.↩︎</p></li>
<li id="fn23"><p>I’m using the C/C++ ternary operator. In R this would be parsed as <code>ifelse(r[i] == 1, y[i], NA)</code>. ↩︎</p></li>
<li id="fn24"><p>Not exchangeable—there are no shared parameters!↩︎</p></li>
<li id="fn25"><p>Remember that <img src="https://latex.codecogs.com/png.latex?y%20%5Cmid%20x%20=%20j%20%5Csim%20N(%5Cmu_j,%201)">. If we wanted a more flexible variance, we could obviously have one, but it makes no real difference to anything.↩︎</p></li>
<li id="fn26"><p>I promise I’m just rolling my eyes to see if I can see my brain.↩︎</p></li>
<li id="fn27"><p>Remember everything is independent!↩︎</p></li>
<li id="fn28"><p>This is the posterior predictive distribution!↩︎</p></li>
<li id="fn29"><p>A true committed subjective Bayesian knows that DP stands for Dirichlet Process. No matter the context.↩︎</p></li>
<li id="fn30"><p>The variance is <img src="https://latex.codecogs.com/png.latex?v_j%5E%5Ctext%7Bpost%7D%20+%201"> because this is the posterior predictive distribution.↩︎</p></li>
<li id="fn31"><p>Does this seem like a frequentist question? I guess. But really it’s a question we can always ask about the posterior. Should we? Well if you are trying to estimate a population quantity you sort of have to. Because there isn’t really a concept of a population parameter within a Bayesian framework (true committed subjective or otherwise).↩︎</p></li>
<li id="fn32"><p>Remember that this means that the error (which is a random variable) goes to 0 as <img src="https://latex.codecogs.com/png.latex?n%5Crightarrow%20%5Cinfty">. A more careful person could probably work out how fast it would happen.↩︎</p></li>
<li id="fn33"><p>I only computed the mean, so feel free to pretend that I’m minimizing a loss function↩︎</p></li>
<li id="fn34"><p>Multilevel regression with poststratification, a survey modelling technique↩︎</p></li>
<li id="fn35"><p>No true Scotsman etc↩︎</p></li>
<li id="fn36"><p>or meta-Bayesian in the event that we are doing things like building a Bayesian pseudo-model on the space of all considered models that just happens to give every model equal probability because Harold Fucking Jeffreys gave you an erection and you could either process that event like an adult or build a whole personality around it. And you chose the latter.↩︎</p></li>
<li id="fn37"><p>Can you tell that I hate this entire discussion?↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {On That Example of {Robins} and {Ritov;} or {A} Sleeping Dog
    in Harbor Is Safe, but That’s Not What Sleeping Dogs Are For},
  date = {2022-11-15},
  url = {https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/robins-ritov.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“On That Example of Robins and Ritov; or A
Sleeping Dog in Harbor Is Safe, but That’s Not What Sleeping Dogs Are
For.”</span> November 15, 2022. <a href="https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/robins-ritov.html">https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/robins-ritov.html</a>.
</div></div></section></div> ]]></description>
  <category>Fundamentals</category>
  <category>Survey sampling</category>
  <category>MRP</category>
  <category>Bayes</category>
  <guid>https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/robins-ritov.html</guid>
  <pubDate>Mon, 14 Nov 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/misandrists.JPG" medium="image"/>
</item>
<item>
  <title>Priors for the parameters in a Gaussian process</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5.html</link>
  <description><![CDATA[ 





<p>Long time readers will know that I bloody love a Gaussian process (GP). I wrote an <em>extremely detailed</em> post on the <a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness.html">various ways to define Gaussian processes</a>. And I did not do that because I just love inflicting Hilbert spaces on people. In fact, the only reason that I ever went beyond the standard operational definition of GPs that most people live their whole lives using is that I needed to.</p>
<p>Twice.</p>
<p>The first time was when I needed to understand approximation properties of a certain class of GPs. <a href="https://dansblog.netlify.app/posts/2021-11-24-getting-into-the-subspace/getting-into-the-subspace.html">I wrote a post about it</a>. It’s intense<sup>1</sup>.</p>
<p>The second time that I really needed to dive into their arcana and apocrypha<sup>2</sup> was when I foolishly asked the question <em>can we compute Penalised Complexity (PC) priors<sup>3</sup> <sup>4</sup> for Gaussian processes?</em>.</p>
<p>The answer was yes. But it’s a bit tricky.</p>
<p>So today I’m going to walk you through the ideas. There’s no real need to read the GP post before reading the first half of this one<sup>5</sup>, but it would be immensely useful to have at least glanced at the <a href="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4.html">post on PC priors</a>.</p>
<p>This post is <em>very</em> long, but that’s mostly because it tries to be reasonably self-contained. In particular, if you only care about the <a href="https://youtu.be/Z2HwloXqo_U?t=223">fat stuff</a>, you really only need to read the first part. After that there’s a long introduction to the theory of stationary Gaussian processes. All of this stuff is standard, but it’s hard to find collected in one place all of the things that I need to derive the PC prior. The third part actually derives the PC prior using many of the methods from the previous part.</p>
<section id="part-1-how-do-you-put-a-prior-on-parameters-of-a-gaussian-process" class="level2">
<h2 class="anchored" data-anchor-id="part-1-how-do-you-put-a-prior-on-parameters-of-a-gaussian-process">Part 1: How do you put a prior on parameters of a Gaussian process?</h2>
<p>We are in the situation where we have a model that looks something like this<sup>6</sup> <sup>7</sup> <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ay_i%20%5Cmid%20%5Cbeta,%20u,%20%5Ctheta%20&amp;%5Csim%20p(y_i%20%5Cmid%20%5Cbeta,%20u,%20%5Cphi)%20%5C%5C%0Au(%5Ccdot)%20%5Cmid%20%5Ctheta%20&amp;%5Csim%20GP(0,%20c_%5Ctheta)%20%5C%5C%0A%5Cbeta,%20%5Cphi%20&amp;%5Csim%20p(%5Cbeta,%5Cphi),%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?c_%5Ctheta(%5Ccdot,%5Ccdot)"> is a covariance function with parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and we need to specify a joint prior on the GP parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta">.</p>
<p>The simplest case of this would be GP regression, but a key thing here is that, in general, the structure (or functional form) of the priors on <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> probably shouldn’t be too tightly tied to the specific likelihood. Why do I say that? Well the <em>scaling</em> of a GP should depend on information about the likelihood, but it’s less clear that anything else in the prior needs to know about the likelihood.</p>
<p>Now this view is predicated on us wanting to make an informative prior. In some very special cases, people with too much time on their hands have derived reference priors for specific models involving GPs. These priors care <em>deeply</em> about which likelihood you use. In fact, if you use them with a different model<sup>8</sup>, you may not end up with a proper<sup>9</sup> posterior. We will talk about those later.</p>
<p>To start, let’s look at the simplest way to build a PC prior. We will then talk about why this is not a good idea.</p>
<section id="a-first-crack-at-a-pc-prior" class="level3">
<h3 class="anchored" data-anchor-id="a-first-crack-at-a-pc-prior">A first crack at a PC prior</h3>
<p>As always, the best place to start is the simplest possible option. There’s always a hope<sup>10</sup> that we won’t need to pull out the big guns.</p>
<p>So what is the simplest solution? Well it’s to treat a GP as just a specific multivariate Gaussian distribution <img src="https://latex.codecogs.com/png.latex?%0Au%20%5Csim%20GP(0,%20%5Csigma%5E2R(%5Ctheta)),%0A"> where <img src="https://latex.codecogs.com/png.latex?R(%5Ctheta)"> is a correlation matrix.</p>
<p>The nice thing about a multivariate Gaussian is that we have a clean expression for its Kullback-Leibler divergence. Wikipedia tells us that for an <img src="https://latex.codecogs.com/png.latex?n">-dimensional multivariate Gaussian <img src="https://latex.codecogs.com/png.latex?%0A2%5Coperatorname%7BKL%7D(N(0,%20%5CSigma)%20%7C%7C%20N(0,%20%5CSigma_0))%20=%20%5Coperatorname%7Btr%7D%5Cleft(%5CSigma_0%5E%7B-1%7D%5CSigma%5Cright)%20+%20%5Clog%20%5Cdet%20%5CSigma_0%20-%20%5Clog%20%5Cdet%20%5CSigma-%20n.%0A"> To build a PC prior we need to consider a base model. That’s tricky in generality, but as we’ve assumed that the covariance matrix can be decomposed into the variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2"> and a correlation matrix <img src="https://latex.codecogs.com/png.latex?R(%5Ctheta)">, we can at least specify an easy base model for <img src="https://latex.codecogs.com/png.latex?%5Csigma">. As always, the simplest model is one with no GP in it, which corresponds to <img src="https://latex.codecogs.com/png.latex?%5Csigma_%5Ctext%7Bbase%7D%20=%200">. From here, we can follow the usual steps to specify the PC prior <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Csigma)%20=%20%5Clambda%20e%5E%7B-%5Clambda%20%5Csigma%7D,%0A"> where we choose <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20-%5Clog(%5Calpha)/U"> for some upper bound <img src="https://latex.codecogs.com/png.latex?U%3E0"> and some tail probability <img src="https://latex.codecogs.com/png.latex?0%3C%5Calpha%3C1"> so that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Csigma%20%3E%20U)%20=%20%5Calpha.%0A"> The specific choice of <img src="https://latex.codecogs.com/png.latex?U"> will depend on the context. For instance, if it’s logistic regression we probably want something like<sup>11</sup> <img src="https://latex.codecogs.com/png.latex?U=1">. 
If we have a GP on the log-mean of a Poisson distribution, then we probably want <img src="https://latex.codecogs.com/png.latex?U%20%3C%2021.5"> if we want the <em>mean</em> of the Poisson distribution to be less than the maximum integer<sup>12</sup> in R. In most data, you’re gonna want<sup>13</sup> <img src="https://latex.codecogs.com/png.latex?U%5Cll%205">. If the GP is on the mean of a normal distribution, the choice of <img src="https://latex.codecogs.com/png.latex?U"> will depend on the context and scaling of the data.</p>
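<p>The calibration itself is one line of algebra: solving <img src="https://latex.codecogs.com/png.latex?e%5E%7B-%5Clambda%20U%7D%20=%20%5Calpha"> for the exponential rate gives <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20-%5Clog(%5Calpha)/U">. Here’s a minimal sketch in Python (the values <code>U = 1</code> and <code>alpha = 0.05</code> are a hypothetical logistic-regression-style calibration, not anything canonical):</p>

```python
import math

def pc_prior_rate(U: float, alpha: float) -> float:
    """Rate of the exponential PC prior p(sigma) = lam * exp(-lam * sigma),
    chosen so that Pr(sigma > U) = alpha."""
    assert U > 0 and 0 < alpha < 1
    return -math.log(alpha) / U

# Hypothetical calibration: Pr(sigma > 1) = 0.05.
lam = pc_prior_rate(U=1.0, alpha=0.05)

# The exponential tail probability above U recovers alpha by construction.
tail = math.exp(-lam * 1.0)
```

<p>Note that <code>alpha</code> only enters through a logarithm, so the prior is much more sensitive to the choice of <code>U</code> than to the tail probability.</p>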
<p>Without more assumptions about the form of the covariance function, it is impossible to choose a base model for the other parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta">.</p>
<p>That said, there is one special case that’s important: the case where <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20=%20%5Cell"> is a single parameter controlling the intrinsic length scale, that is, the distance at which the correlation between two points <img src="https://latex.codecogs.com/png.latex?%5Cell"> units apart is approximately zero. The larger <img src="https://latex.codecogs.com/png.latex?%5Cell"> is, the more correlated observations of the GP are and, hence, the less wiggly its realisation is. On the other hand, as <img src="https://latex.codecogs.com/png.latex?%5Cell%20%5Crightarrow%200">, observations of the GP often behave like realisations of an iid Gaussian and the GP becomes<sup>14</sup> wilder and wilder.</p>
<p>This suggests that a good base model for the length-scale parameter would be <img src="https://latex.codecogs.com/png.latex?%5Cell_0%20=%20%5Cinfty">. We note that if both the base model and the alternative have the same value of <img src="https://latex.codecogs.com/png.latex?%5Csigma">, then it cancels out in the KL-divergence. Under this assumption, we get that <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Cell%20%5Cmid%20%5Csigma)%20=%20%5Ctext%7B%60%60%7D%5Clim_%7B%5Cell_0%5Crightarrow%20%5Cinfty%7D%5Ctext%7B''%7D%20%5Csqrt%7B%5Coperatorname%7Btr%7D%5Cleft(R(%5Cell_0)%5E%7B-1%7DR(%5Cell)%5Cright)%20%20-%20%5Clog%20%5Cdet%20R(%5Cell)%20+%20%5Clog%20%5Cdet%20R(%5Cell_0)%20-%20n%7D,%0A"> where I’m being a bit cheeky putting that limit in, as we might need to do some singular model jiggery-pokery of the same type we needed to do <a href="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4.html#the-speed-of-a-battered-sav-proximity-to-the-base-model">for the standard deviation</a>. We will formalise this, I promise.</p>
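<p>If you want to poke at this distance numerically, here’s a minimal sketch. An exponential correlation matrix on a hypothetical one-dimensional grid stands in for a generic correlation family, and a large-but-finite length scale stands in for the limit in the base model:</p>

```python
import numpy as np

def kl2(Sigma, Sigma0):
    """2 * KL(N(0, Sigma) || N(0, Sigma0)) for zero-mean Gaussians:
    tr(Sigma0^{-1} Sigma) + log det Sigma0 - log det Sigma - n."""
    n = Sigma.shape[0]
    tr = np.trace(np.linalg.solve(Sigma0, Sigma))
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet = np.linalg.slogdet(Sigma)
    return tr + logdet0 - logdet - n

# Exponential correlation R(ell)_{ij} = exp(-|s_i - s_j| / ell)
# on a hypothetical 1-d grid of 50 points.
s = np.linspace(0.0, 1.0, 50)
D = np.abs(s[:, None] - s[None, :])
R = lambda ell: np.exp(-D / ell)

ell0 = 1e3  # large-but-finite stand-in for the base model ell_0 -> infinity
d = lambda ell: np.sqrt(kl2(R(ell), R(ell0)))
# d(ell) grows as ell shrinks: rougher models are further from the base model.
```

<p>This is only a sketch of the formula above, not a practical prior: as the text goes on to argue, both the grid-dependence and the cost of these linear algebra operations are part of the problem.</p>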
<p>Because the model gets more complex as the length scale decreases, we want our prior to control the smallest value <img src="https://latex.codecogs.com/png.latex?%5Cell"> can take. This suggests we want to choose <img src="https://latex.codecogs.com/png.latex?%5Clambda"> to ensure <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Cell%20%3C%20L)%20=%20%5Calpha.%0A"> How do we choose the lower bound <img src="https://latex.codecogs.com/png.latex?L">? One idea is that our prior should have very little probability of the length scale being smaller than the length-scale of the data. So we can choose <img src="https://latex.codecogs.com/png.latex?L"> to be the smallest distance between observations (if the data is regularly spaced) or as a low quantile of the distribution of distances between nearest neighbours.</p>
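<p>As a sketch of that last recipe (with hypothetical observation locations; in one dimension nearest-neighbour distances are just adjacent gaps in the sorted locations, while in higher dimensions you’d use something like a k-d tree):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
s = np.sort(rng.uniform(0, 10, size=200))  # hypothetical observation locations

# Nearest-neighbour distance for each point: min of the gap to the left
# and the gap to the right (endpoints only have one neighbour).
gaps = np.diff(s)
nn_dist = np.minimum(np.append(gaps, np.inf), np.insert(gaps, 0, np.inf))

# Take L as, say, the 5% quantile of the nearest-neighbour distances;
# the PC prior then puts only probability alpha on length scales below L.
L = np.quantile(nn_dist, 0.05)
```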
<p>All of this will specify a PC prior for a Gaussian process. So let’s now discuss why that prior is a bit shit.</p>
</section>
<section id="whats-bad-about-this" class="level3">
<h3 class="anchored" data-anchor-id="whats-bad-about-this">What’s bad about this?</h3>
<p>The prior on the standard deviation is fine.</p>
<p>The prior on the length scale is more of an issue. There are a couple of bad things about this prior. The first one might seem innocuous at first glance. We decided to treat the GP as a multivariate Gaussian with covariance matrix <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2%20R(%5Ctheta)">. This is not a neutral choice. In order to do it, we need to <em>commit</em> to a certain set of observation locations<sup>15</sup>. Why? The matrix <img src="https://latex.codecogs.com/png.latex?R(%5Ctheta)"> depends entirely on the observation locations and if we use this matrix to define the prior we are tied to those locations.</p>
<p>This means that if we change the amount of data in the model we will need to change the prior. This is going to play havoc<sup>16</sup> with any sort of cross-validation! It’s worth saying that the other two sources of information (the minimum length scale and the upper bound on <img src="https://latex.codecogs.com/png.latex?%5Csigma">) are not nearly as sensitive to small changes in the data. This information is, in some sense, fundamental to the problem at hand and, therefore, much more stable ground to build your prior upon.</p>
<p>There’s another problem, of course: this prior is expensive to compute. The KL divergence involves computing <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Btr%7D(R(%5Cell_0)%5E%7B-1%7DR(%5Cell))"> which costs as much as another log-density evaluation for the Gaussian process (which is to say it’s very expensive).</p>
<p>So this prior is going to be <em>deeply</em> inconvenient if we have varying amounts of data (through cross-validation or sequential data gathering). It’s also going to be wildly more computationally expensive than you expect a one-dimensional prior to be.</p>
<p>All in all, it seems a bit shit.</p>
</section>
<section id="the-matérn-covariance-function" class="level3">
<h3 class="anchored" data-anchor-id="the-matérn-covariance-function">The Matérn covariance function</h3>
<p>It won’t be possible to derive a prior for a general Gaussian process, so we are going to need to make some simplifying assumptions. The assumption that we are going to make is that the covariance comes from the Whittle-Matérn<sup>17</sup> <sup>18</sup> class <img src="https://latex.codecogs.com/png.latex?%0Ac(s,%20s')%20=%20%5Csigma%5E2%20%5Cfrac%7B2%5E%7B1-%5Cnu%7D%7D%7B%5CGamma(%5Cnu)%7D%5Cleft(%5Csqrt%7B8%5Cnu%7D%5Cfrac%7B%5C%7Cs-s'%5C%7C%7D%7B%5Cell%7D%5Cright)%5E%5Cnu%20K_%5Cnu%5Cleft(%5Csqrt%7B8%5Cnu%7D%5Cfrac%7B%5C%7Cs-s'%5C%7C%7D%7B%5Cell%7D%5Cright),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cnu"> is the <em>smoothness</em> parameter, <img src="https://latex.codecogs.com/png.latex?%5Cell"> is the <em>length-scale</em> parameter, <img src="https://latex.codecogs.com/png.latex?%5Csigma"> is the <em>marginal standard deviation</em>, and <img src="https://latex.codecogs.com/png.latex?%0AK_%5Cnu(x)%20=%20%5Cint_0%5E%5Cinfty%20e%5E%7B-x%5Ccosh%20t%7D%5Ccosh(%5Cnu%20t)%5C,dt%0A"> is the modified Bessel<sup>19</sup> function of the second kind.</p>
<p>This class of covariance function is extremely important in practice. It interpolates between two of the most common covariance functions:</p>
<ul>
<li>when <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%201/2">, it corresponds to the exponential covariance function,</li>
<li>when <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%20%5Cinfty">, it corresponds to the squared exponential covariance.</li>
</ul>
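<p>If you want to check the <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%201/2"> case numerically, here’s a minimal sketch of this parameterisation using scipy’s modified Bessel function <code>kv</code>. Because <img src="https://latex.codecogs.com/png.latex?K_%7B1/2%7D(x)%20=%20%5Csqrt%7B%5Cpi/(2x)%7De%5E%7B-x%7D">, the covariance collapses to <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2%20e%5E%7B-2%5C%7Cs-s'%5C%7C/%5Cell%7D"> at <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%201/2">:</p>

```python
import numpy as np
from scipy.special import gamma, kv

def matern(dist, ell, sigma=1.0, nu=0.5):
    """Matern covariance in the sqrt(8 * nu) / ell parameterisation above,
    so ell is roughly the distance at which the correlation becomes small."""
    dist = np.asarray(dist, dtype=float)
    out = np.full(dist.shape, float(sigma) ** 2)  # the limit at dist = 0
    nz = dist > 0
    x = np.sqrt(8 * nu) * dist[nz] / ell
    out[nz] = sigma**2 * (2 ** (1 - nu) / gamma(nu)) * x**nu * kv(nu, x)
    return out

# At nu = 1/2 this is the exponential covariance sigma^2 * exp(-2 * dist / ell).
dists = np.linspace(0.0, 3.0, 7)
c_half = matern(dists, ell=1.0, nu=0.5)
```

<p>The <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B8%5Cnu%7D"> factor is exactly what makes <img src="https://latex.codecogs.com/png.latex?%5Cell"> interpretable as the distance at which the correlation is approximately zero, across different values of <img src="https://latex.codecogs.com/png.latex?%5Cnu">.</p>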
<p>There are years of experience suggesting that Matérn covariance functions with finite <img src="https://latex.codecogs.com/png.latex?%5Cnu"> will often perform better than the squared exponential covariance.</p>
<p>Common practice is to fix<sup>20</sup> the value of <img src="https://latex.codecogs.com/png.latex?%5Cnu">. There are a few reasons for this. One of the most compelling practical reasons is that we can’t easily evaluate its derivative, which rules out most modern optimisation and MCMC algorithms. It’s also <em>very</em> difficult to think about how you would set a prior on it. The techniques in this post will not help, and as far as I’ve ever been able to tell, nothing else will either. Finally, you could expect there to be <em>horrible</em> confounding between <img src="https://latex.codecogs.com/png.latex?%5Cnu">, <img src="https://latex.codecogs.com/png.latex?%5Cell">, and <img src="https://latex.codecogs.com/png.latex?%5Csigma">, which will make inference very hard (both numerically and morally).</p>
<p>It turns out that even with <img src="https://latex.codecogs.com/png.latex?%5Cnu"> fixed, we will run into a few problems. But to understand those, we are going to need to know a bit more about how inferring parameters in a Gaussian processes actually works.</p>
<p>Just for future warning, I will occasionally refer to a GP with a Matérn covariance function as a “Matérn field”<sup>21</sup>.</p>
</section>
<section id="asymptotic-i-barely-know-her" class="level3">
<h3 class="anchored" data-anchor-id="asymptotic-i-barely-know-her">Asymptotic? I barely know her!</h3>
<p>Let’s take a brief detour into classical inference for a moment and ask ourselves <em>when can we recover the parameters of a Gaussian process</em>? For most models we run into in statistics, the answer to that question is <em>when we get enough data</em>. But for Gaussian processes, the story is more complex.</p>
<p>First of all, there is the very real question of what we mean by getting more data. When our observations are iid, this is so easy that when asked how she got more data, Kylie just said she <a href="https://www.youtube.com/watch?v=jDKPvy-ZXC8">“did it again”</a>.</p>
<p>But this is more complex once the data has dependence. For instance, in a multilevel model you could have the number of groups staying fixed while the number of observations in each group goes to infinity, you could have the number of observations in each group staying fixed while the number of groups goes to infinity, or you could have both<sup>22</sup> going to infinity.</p>
<p>For Gaussian processes it also gets quite complicated. Here is a non-exhaustive list of options:</p>
<ul>
<li>You observe the same realisation of the GP at an increasing number of points that eventually cover the <em>whole of</em> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed"> (this is called the <em>increasing domain</em> or <em>outfill</em> regime); or</li>
<li>You observe the same realisation of the GP at an increasing number of points <em>that stay within a fixed domain</em> (this is called the <em>fixed domain</em> or <em>infill</em> regime); or</li>
<li>You observe multiple realisations of the same GP at a finite number of points that stay in the same location (this does not have a name, in space-time it’s sometimes called <em>monitoring data</em>); or</li>
<li>You observe multiple realisations of the same GP at a (possibly different) finite number of points that can be in different locations for different realisations; or</li>
<li>You observe realisations of a process that evolves in space <em>and</em> time (not really a different regime so much as a different problem).</li>
</ul>
<p>One of the truly unsettling things about Gaussian processes is that the ability to estimate the parameters depends on which of these regimes you choose!</p>
<p>Of course, we all know that asymptotic regimes are just polite fantasies that statisticians concoct in order to self-soothe. They are not reflections on reality. They serve approximately the same purpose<sup>23</sup> as watching a chain of Law and Order episodes.</p>
<p>The point of thinking about what happens when we get more data is to use it as a loose approximation of what happens with the data you have. So the real question is <em>which regime is the most realistic for my data</em>?</p>
<p>One way you can approach this question is to ask yourself what you would do if you had the budget to get more data. My work has mostly been in spatial statistics, in which case the answer is <em>usually</em><sup>24</sup> that you would sample more points in the same area. This suggests that fixed-domain asymptotics is a good fit for my needs. I’d expect that in most GP regression cases, we’re not expecting<sup>25</sup> that further observations would be on new parts of the covariate space, which would suggest fixed-domain asymptotics are useful there too.</p>
<p>This, it turns out, is awkward.</p>
</section>
<section id="when-is-a-parameter-not-consistently-estimatable-an-aside-that-will-almost-immediately-become-relevant" class="level3">
<h3 class="anchored" data-anchor-id="when-is-a-parameter-not-consistently-estimatable-an-aside-that-will-almost-immediately-become-relevant">When is a parameter not consistently estimable: an aside that will almost immediately become relevant</h3>
<p>The problem with a GP with the Matérn covariance function on a fixed domain is that it’s not possible<sup>26</sup> to estimate all of its parameters at the same time. This isn’t the case for the other asymptotic regimes, but you’ve got to dance with who you came to the dance with.</p>
<p>To make this more concrete, we need to think about a Gaussian process as a realisation of a function rather than as a vector of observations. Why? Because under fixed-domain asymptotics we are seeing values of the function closer and closer together until we essentially see the entire function on that domain.</p>
<p>Of course, this is why I wrote <a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness.html">a long and technical blog post</a> on understanding Gaussian processes as random functions. But don’t worry. You don’t need to have read that part.</p>
<p>The key thing is that because a GP is a function, we need to think of its probability of being in a set <img src="https://latex.codecogs.com/png.latex?A"> of functions. There will be a set of functions <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bsupp%7D(u)">, which we call the <em>support</em> of <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)">, that is, the smallest set such that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(u(%5Ccdot)%20%5Cin%20%5Coperatorname%7Bsupp%7D(u))%20=%201.%0A"> Every GP has an associated support and, while you probably don’t think much about it, GPs are <em>obsessed</em> with their supports. They love them. They hug them. They share them with their friends. They keep them from their enemies. And they are one of the key things that we need to think about in order to understand why it’s hard to estimate parameters in a Matérn covariance function.</p>
<p>There is a key theorem that is unique<sup>27</sup> to Gaussian processes. It’s usually phrased in terms of <em>Gaussian measures</em>, which are just the probability associated with a GP. For example, if <img src="https://latex.codecogs.com/png.latex?u_1(%5Ccdot)"> is a GP then <img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_1(A)%20=%20%5CPr(u_1(%5Ccdot)%20%5Cin%20A)%0A"> is the corresponding Gaussian measure. We can express the support of <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)"> as the smallest set of functions such that <img src="https://latex.codecogs.com/png.latex?%5Cmu(A)=1">.</p>
<div id="thm-singular-equiv" class="theorem">
<p><span class="theorem-title"><strong>Theorem 1 (Feldman-Hájek theorem)</strong></span> Two Gaussian measures <img src="https://latex.codecogs.com/png.latex?%5Cmu_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_2"> with corresponding GPs <img src="https://latex.codecogs.com/png.latex?u_1(%5Ccdot)"> and <img src="https://latex.codecogs.com/png.latex?u_2(%5Ccdot)"> on a locally convex space<sup>28</sup> either satisfy, for every<sup>29</sup> set <img src="https://latex.codecogs.com/png.latex?A">,<br>
<img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_2(A)%20%3E%200%20%5CRightarrow%20%5Cmu_1(A)%20%3E%200%20%5Ctext%7B%20and%20%7D%20%5Cmu_1(A)%20%3E%200%20%5CRightarrow%20%5Cmu_2(A)%20%3E%200,%0A"> in which case we say that <img src="https://latex.codecogs.com/png.latex?%5Cmu_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_2"> are <em>equivalent</em><sup>30</sup> (confusingly<sup>31</sup> written <img src="https://latex.codecogs.com/png.latex?%5Cmu_1%20%5Cequiv%20%5Cmu_2">) and <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bsupp%7D(u_1)%20=%20%5Coperatorname%7Bsupp%7D(u_2)">, <strong>or</strong> <img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_2(A)%20%3E%200%20%5CRightarrow%20%5Cmu_1(A)%20=%200%20%5Ctext%7B%20and%20%7D%20%5Cmu_1(A)%20%3E%200%20%5CRightarrow%20%5Cmu_2(A)%20=%200,%0A"> in which case we say <img src="https://latex.codecogs.com/png.latex?%5Cmu_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_2"> are <em>singular</em> (written <img src="https://latex.codecogs.com/png.latex?%5Cmu_1%20%5Cperp%20%5Cmu_2">) and <img src="https://latex.codecogs.com/png.latex?u_1(%5Ccdot)"> and <img src="https://latex.codecogs.com/png.latex?u_2(%5Ccdot)"> have disjoint supports.</p>
</div>
<p>Later on in the post, we will see some precise conditions for when two Gaussian measures are equivalent, but for now it’s worth saying that it is a <em>very</em> delicate property. In fact, if <img src="https://latex.codecogs.com/png.latex?u_2(%5Ccdot)%20=%20%5Calpha%20u_1(%5Ccdot)"> for any <img src="https://latex.codecogs.com/png.latex?%7C%5Calpha%7C%5Cneq%201">, then<sup>32</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmu_1%20%5Cperp%20%5Cmu_2">!</p>
<p>This seems like it will cause problems. And it can<sup>33</sup>. But it’s <em>fabulous</em> for inference.</p>
<p>To see this, we can use one of the implications of singularity: <img src="https://latex.codecogs.com/png.latex?%5Cmu_1%20%5Cperp%20%5Cmu_2"> if and only if <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BKL%7D(u_1(%5Ccdot)%20%7C%7C%20u_2(%5Ccdot))%20=%20%5Cinfty,%0A"> where the Kullback-Leibler divergence can be interpreted as the expectation of the log-likelihood ratio of <img src="https://latex.codecogs.com/png.latex?u_1"> vs <img src="https://latex.codecogs.com/png.latex?u_2"> under <img src="https://latex.codecogs.com/png.latex?u_1">. Hence, if <img src="https://latex.codecogs.com/png.latex?u_1(%5Ccdot)"> and <img src="https://latex.codecogs.com/png.latex?u_2(%5Ccdot)"> are singular, we can (on average) choose the correct one using a likelihood ratio test. This means that we will be able to correctly recover the true<sup>34</sup> parameter.</p>
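<p>To make this concrete, take the scaled pair from above, <img src="https://latex.codecogs.com/png.latex?u_2(%5Ccdot)%20=%20%5Calpha%20u_1(%5Ccdot)">. The KL divergence between any <img src="https://latex.codecogs.com/png.latex?n">-point restriction of the two processes has a closed form that grows linearly in <img src="https://latex.codecogs.com/png.latex?n">, so it blows up as we observe the process more densely. A quick sketch (plain Python, mine rather than the post’s; the formula is just the standard Gaussian KL with <img src="https://latex.codecogs.com/png.latex?%5CSigma_2%20=%20%5Calpha%5E2%20%5CSigma_1">):</p>

```python
import math

def kl_scaled_gaussian(alpha: float, n: int) -> float:
    """KL( N(0, Sigma) || N(0, alpha^2 Sigma) ) for any n x n covariance Sigma.

    Substituting Sigma_2 = alpha^2 Sigma_1 into the Gaussian KL formula
    0.5 * (tr(Sigma_2^{-1} Sigma_1) - n + log det Sigma_2 - log det Sigma_1)
    collapses everything to a function of alpha and n alone.
    """
    return 0.5 * n * (1.0 / alpha**2 - 1.0 + 2.0 * math.log(alpha))

# alpha = 1 is the same measure, so the divergence is zero for every n;
# for any other alpha > 0 it increases without bound as n grows, which is
# the finite-dimensional shadow of the singularity above.
print(kl_scaled_gaussian(1.0, 100))
print(kl_scaled_gaussian(2.0, 10))
print(kl_scaled_gaussian(2.0, 1000))
```

<p>Observing the process at more and more points therefore gives a likelihood ratio test that separates the two scalings perfectly in the limit.</p>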
<p>It turns out the opposite is also true.</p>
<div id="thm-strong-neg" class="theorem">
<p><span class="theorem-title"><strong>Theorem 2</strong></span> If <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Ctheta">, <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5CTheta"> is a family of Gaussian measures corresponding to the GPs <img src="https://latex.codecogs.com/png.latex?u_%5Ctheta(%5Ccdot)"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Ctheta%20%5Cequiv%20%5Cmu_%7B%5Ctheta'%7D"> for all values of <img src="https://latex.codecogs.com/png.latex?%5Ctheta,%20%5Ctheta'%20%5Cin%20%5CTheta">, then there is <em>no</em> sequence of estimators <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_n"> such that, for all <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0%20%5Cin%20%5CTheta"> <img src="https://latex.codecogs.com/png.latex?%0A%7B%5CPr%7D_%7B%5Ctheta_0%7D(%5Chat%20%5Ctheta_n%20%5Crightarrow%20%5Ctheta_0)%20=%201,%0A"> where <img src="https://latex.codecogs.com/png.latex?%7B%5CPr%7D_%7B%5Ctheta_0%7D(%5Ccdot)"> is the probability under data drawn with true parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0">. That is, there is no estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_n"> that is (strongly) consistent for all <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5CTheta">.</p>
</div>
<details>
<summary>
Click for a surprise (the proof. shit i spoiled the surprise)
</summary>
<div class="proof">
<p><span class="proof-title"><em>Proof</em>. </span>We are going to do this by contradiction. So assume that there is a sequence such that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr%7B_%7B%5Ctheta_0%7D%7D(%5Chat%20%5Ctheta_n%20%5Crightarrow%20%5Ctheta_0)%20=%201.%0A"> For some <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%3E0">, let <img src="https://latex.codecogs.com/png.latex?A_n%20=%20%5C%7B%5C%7C%5Chat%5Ctheta_n%20-%20%5Ctheta_0%5C%7C%3E%5Cepsilon%5C%7D">. Then we can re-state our almost sure convergence as <img src="https://latex.codecogs.com/png.latex?%0A%5CPr%7B_%7B%5Ctheta_0%7D%7D%5Cleft(%5Climsup_%7Bn%5Crightarrow%20%5Cinfty%7DA_n%5Cright)%20=%200,%0A"> where the limit superior is defined<sup>35</sup> as <img src="https://latex.codecogs.com/png.latex?%0A%5Climsup_%7Bn%5Crightarrow%20%5Cinfty%7DA_n%20=%20%5Cbigcap_%7Bn=1%7D%5E%5Cinfty%20%5Cleft(%5Cbigcup_%7Bm=n%7D%5E%5Cinfty%20A_n%5Cright).%0A"></p>
<p>For any <img src="https://latex.codecogs.com/png.latex?%5Ctheta'%20%5Cneq%20%5Ctheta_0"> with <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Ctheta'%7D%20%5Cequiv%20%5Cmu_%7B%5Ctheta_0%7D">, the definition of equivalent measures tells us that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr%7B_%7B%5Ctheta'%7D%7D%5Cleft(%5Climsup_%7Bn%5Crightarrow%20%5Cinfty%7DA_n%5Cright)%20=%200%0A"> and therefore <img src="https://latex.codecogs.com/png.latex?%0A%5CPr%7B_%7B%5Ctheta'%7D%7D%5Cleft(%5Chat%20%5Ctheta_n%20%5Crightarrow%20%5Ctheta_0%5Cright)%20=%201.%0A"> The problem with this is that the data is generated using <img src="https://latex.codecogs.com/png.latex?u_%7B%5Ctheta'%7D">, but the estimator converges to <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0"> instead of <img src="https://latex.codecogs.com/png.latex?%5Ctheta'">. Hence, the estimator isn’t uniformly (strongly) consistent.</p>
</div>
</details>
<p>This seems bad but, you know, it’s a pretty strong version of convergence. And sometimes our brothers and sisters in Christ who are more theoretically minded like to give themselves a treat and consider weaker forms of convergence. It turns out that that’s a disaster too.</p>
<div id="thm-weak-neg" class="theorem">
<p><span class="theorem-title"><strong>Theorem 3</strong></span> If <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Ctheta">, <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5CTheta"> is a family of Gaussian measures corresponding to the GPs <img src="https://latex.codecogs.com/png.latex?u_%5Ctheta(%5Ccdot)"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Ctheta%20%5Cequiv%20%5Cmu_%7B%5Ctheta'%7D"> for all values of <img src="https://latex.codecogs.com/png.latex?%5Ctheta,%20%5Ctheta'%20%5Cin%20%5CTheta">, then there is <em>no</em> sequence of estimators <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_n"> such that, for all <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0%20%5Cin%20%5CTheta"> and all <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%3E%200"> <img src="https://latex.codecogs.com/png.latex?%0A%5Clim_%7Bn%5Crightarrow%20%5Cinfty%7D%7B%5CPr%7D_%7B%5Ctheta_0%7D(%5C%7C%5Chat%20%5Ctheta_n%20-%20%5Ctheta_0%5C%7C%20%3E%20%5Cepsilon)%20=%200.%0A"> That is there is no estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_n"> that is (weakly) consistent for all <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5CTheta">.</p>
</div>
<p>If you can’t tell the difference between these two theorems that’s ok. You probably weren’t trying to sublimate some childhood trauma and all of your sexual energy into maths just so you didn’t have to deal with the fact that you might be gay and you were pretty sure that wasn’t an option and anyway it’s not like it’s <em>that</em> important. Like whatever, you don’t need physical or emotional intimacy. You’ve got a pile of books on measure theory next to your bed. You are living your best life. Anyway. It makes almost no practical difference. BUT I WILL PROVE IT ANYWAY.</p>
<details>
<summary>
Once more, into the proof.
</summary>
<div class="proof">
<p><span class="proof-title"><em>Proof</em>. </span>This proof is based on a kinda advanced fact, which involves every mathematician’s favourite question: what happens along a sub-sequence?</p>
<div class="callout callout-style-default callout-note no-icon callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Probability Fact!
</div>
</div>
<div class="callout-body-container callout-body">
<p>If <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_n"> converges to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> in probability, then there exists an infinite sub-sequence <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_%7Bn_k%7D">, where <img src="https://latex.codecogs.com/png.latex?n_k%20%5Crightarrow%20%5Cinfty"> as <img src="https://latex.codecogs.com/png.latex?k%20%5Crightarrow%20%5Cinfty">, such that <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_%7Bn_k%7D"> converges to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> with probability one (or almost surely).</p>
</div>
</div>
<p>This basically says that the two modes of convergence are quite similar except convergence in probability is relaxed enough to have some<sup>36</sup> values that aren’t doing so good at the whole converging thing.</p>
<p>With this in hand, let us build a contradiction. Assume that <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_n"> is weakly consistent for all <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5CTheta">. Then, if we generate data under <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Ctheta_0%7D">, then we get that, along a sub-sequence <img src="https://latex.codecogs.com/png.latex?n_k"> <img src="https://latex.codecogs.com/png.latex?%0A%5CPr%7B_%7B%5Ctheta_0%7D%7D(%5Chat%20%5Ctheta_%7Bn_k%7D%20%5Crightarrow%20%5Ctheta_0)%20=1.%0A"></p>
<p>Now, if <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_n"> is weakly consistent for all <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, then so is <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_%7Bn_k%7D">. Then, by our assumption, for every <img src="https://latex.codecogs.com/png.latex?%5Ctheta'%20%5Cin%20%5CTheta"> and every <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%3E0"> <img src="https://latex.codecogs.com/png.latex?%0A%5Clim_%7Bk%20%5Crightarrow%20%5Cinfty%7D%20%5CPr%7B_%7B%5Ctheta'%7D%7D%5Cleft(%5C%7C%5Chat%20%5Ctheta_%7Bn_k%7D%20-%20%5Ctheta'%5C%7C%20%3E%20%5Cepsilon%5Cright)%20=%200.%0A"></p>
<p>Our probability fact tells us that there is a <em>further</em> infinite sub-sub-sequence <img src="https://latex.codecogs.com/png.latex?n_%7Bk_%5Cell%7D"> such that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr%7B_%7B%5Ctheta'%7D%7D%5Cleft(%5Chat%20%5Ctheta_%7Bn_%7Bk_%5Cell%7D%7D%20%5Crightarrow%20%5Ctheta'%5Cright)%20=%201.%0A"> But Theorem&nbsp;2 tells us that <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta_%7Bn_k%7D"> (and hence <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7Bn_%7Bk_l%7D%7D">) satisfies <img src="https://latex.codecogs.com/png.latex?%0A%5CPr%7B_%7B%5Ctheta'%7D%7D%5Cleft(%5Chat%20%5Ctheta_%7Bn_%7Bk_%5Cell%7D%7D%20%5Crightarrow%20%5Ctheta_0%5Cright)%20=%201.%0A"> This is a contradiction unless <img src="https://latex.codecogs.com/png.latex?%5Ctheta'=%20%5Ctheta_0">, which proves the assertion.</p>
</div>
</details>
</section>
<section id="matérn-fields-under-fixed-domain-asymptotics-the-love-that-dares-not-speak-its-name" class="level3">
<h3 class="anchored" data-anchor-id="matérn-fields-under-fixed-domain-asymptotics-the-love-that-dares-not-speak-its-name">Matérn fields under fixed domain asymptotics: the love that dares not speak its name</h3>
<p>All of that lead-up immediately becomes extremely relevant once we learn one thing about Gaussian processes with Matérn covariance functions.</p>
<div id="thm-matern-sing" class="theorem">
<p><span class="theorem-title"><strong>Theorem 4</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cnu,%20%5Csigma,%20%5Cell%7D"> be the Gaussian measure corresponding to the GP with Matérn covariance function with parameters <img src="https://latex.codecogs.com/png.latex?(%5Cnu,%20%5Csigma,%20%5Cell)">, let <img src="https://latex.codecogs.com/png.latex?D"> be any bounded domain in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">, and let <img src="https://latex.codecogs.com/png.latex?d%20%5Cleq%203">. Then, restricted to <img src="https://latex.codecogs.com/png.latex?D">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_%7B%5Cnu,%5Csigma_1,%20%5Cell_1%7D%20%5Cequiv%20%5Cmu_%7B%5Cnu,%20%5Csigma_2,%20%5Cell_2%7D%0A"> if and only if <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Csigma_1%5E2%7D%7B%5Cell_1%5E%7B2%5Cnu%7D%7D%20=%20%5Cfrac%7B%5Csigma_2%5E2%7D%7B%5Cell_2%5E%7B2%5Cnu%7D%7D.%0A"></p>
</div>
<p>I’ll go through the proof of this later, but the techniques require a lot of warm up, so let’s just deal with the consequences for now.</p>
<p>Basically, Theorem&nbsp;4 says that we can’t consistently estimate the range and the marginal standard deviation for a one-, two-, or three-dimensional Gaussian process. <a href="https://www.stat.purdue.edu/~zhanghao/Paper/JASA2004.pdf">Hao Zhang noted this</a>, and also that it remains true<sup>37</sup> when dealing with non-Gaussian data.</p>
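<p>The quantity in Theorem&nbsp;4 that the data <em>can</em> identify, <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2/%5Cell%5E%7B2%5Cnu%7D">, is often called the microergodic parameter. A minimal sketch (plain Python, mine rather than the post’s) of checking the equivalence condition for a few parameter pairs:</p>

```python
def microergodic(sigma: float, ell: float, nu: float) -> float:
    """sigma^2 / ell^(2 nu): per Theorem 4, two Matern measures with the
    same nu on a bounded domain in d <= 3 dimensions are equivalent if and
    only if this quantity matches."""
    return sigma**2 / ell**(2.0 * nu)

nu = 0.5  # exponential covariance, matching the simulations below

a = microergodic(1.0, 0.2, nu)       # the "true" parameters used later
b = microergodic(2.0**0.5, 0.4, nu)  # doubled range, variance scaled to match
c = microergodic(1.0, 0.4, nu)       # doubled range only

# a and b agree (~5), so those two measures are equivalent: no amount of
# data on a fixed domain can tell them apart. c differs, so that measure
# is singular with respect to the other two, and is detectable.
print(a, b, c)
```

<p>This is exactly the ridge the likelihood plots below trace out: with <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%201/2">, sweeping along <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2/%5Cell%20=%20c"> holds the microergodic parameter fixed.</p>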
<p>The good news, I guess, is that in more than four<sup>38</sup> dimensions the measures are always singular.</p>
<p>Now, I don’t give one single solitary shit about the existence of consistent estimators. I am doing Bayesian things and this post is supposed to be about setting prior distributions. But it is important. Let’s take a look at some simulations.</p>
<p>First up, let’s look at what happens in 2D when we directly (ie with no noise) observe a zero-mean GP with exponential covariance function (<img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%201/2">) at points in the unit square. In this case, the log-likelihood is, up to an additive constant, <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p(y%20%5Cmid%20%5Ctheta)%20=%20-%5Cfrac%7B1%7D%7B2%7D%5Clog%20%7C%5CSigma(%5Ctheta)%7C%20-%20%5Cfrac%7B1%7D%7B2%7Dy%5ET%5CSigma(%5Ctheta)%5E%7B-1%7Dy.%0A"></p>
<p>The R code is not pretty but I’m trying to be relatively efficient with my Cholesky factors.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24601</span>)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-3">cov_fun <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(h,sigma, ell) sigma<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>h<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ell)</span>
<span id="cb1-4"></span>
<span id="cb1-5">log_lik <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">function</span>(sigma, ell, y, h) {</span>
<span id="cb1-6">  V <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cov_fun</span>(h, sigma, ell)</span>
<span id="cb1-7">  R <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">chol</span>(V)</span>
<span id="cb1-8">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(R))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">backsolve</span>(R, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">backsolve</span>(R, y, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">transpose =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)))</span>
<span id="cb1-9">}</span></code></pre></div>
</div>
<p>We can now simulate 500 data points on the unit square, compute their distances, and simulate from the GP.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb2-1">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb2-2">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">s1 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">s2 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n), </span>
<span id="cb2-3">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dist_mat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(s1,s2))),</span>
<span id="cb2-4">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> MASS<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mvrnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mu=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,n), </span>
<span id="cb2-5">                      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cov_fun</span>(dist_mat, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>)))</span></code></pre></div>
</div>
<p>With all of this in hand, let’s look at the likelihood surface along<sup>39</sup> the line <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Csigma%5E2%7D%7B%5Cell%7D%20=%20c%0A"> for various values of <img src="https://latex.codecogs.com/png.latex?c">. I’m using some <code>purrr</code> trickery<sup>40</sup> here to deal with the fact that sometimes the Cholesky factorisation will throw an error.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb3-1">m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb3-2">f_direct <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">partial</span>(log_lik, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>y, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">h =</span> dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>dist_mat)</span>
<span id="cb3-3"></span>
<span id="cb3-4">pars <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(c) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ell =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> m),</span>
<span id="cb3-5">                    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(c <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ell), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">c =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(c, m))</span>
<span id="cb3-6"></span>
<span id="cb3-7"> ll <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_df</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,pars) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">contour =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(c), </span>
<span id="cb3-9">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ll =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dbl</span>(sigma, ell, </span>
<span id="cb3-10">                       <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">possibly</span>(f_direct, </span>
<span id="cb3-11">                                <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">otherwise =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>)))</span>
<span id="cb3-12"></span>
<span id="cb3-13"></span>
<span id="cb3-14">ll <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, ll, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> contour)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb3-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_color_brewer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">palette =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Set1"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>We can see the same thing in 2D (albeit at a lower resolution for computational reasons). I’m also not computing a bunch of values that I know will just be massively negative.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb4-1">f_trim <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(sigma, ell) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(sigma<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>ell <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> sigma<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>ell,</span>
<span id="cb4-2">                               <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">f_direct</span>(sigma, ell))</span>
<span id="cb4-3">m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb4-4">surf <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expand_grid</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ell =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> m),</span>
<span id="cb4-5">                    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> m)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ll =</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dbl</span>(sigma, ell, </span>
<span id="cb4-7">                       <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">possibly</span>(f_trim, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">otherwise =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>)))</span>
<span id="cb4-8"></span>
<span id="cb4-9">surf <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(ll <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, sigma, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> ll)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb4-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_raster</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Clearly there is a ridge in the likelihood surface, which suggests that our posterior is going to be driven by the prior along that ridge.</p>
<p>For completeness, let’s run the same experiment again when we have some known observation noise, that is <img src="https://latex.codecogs.com/png.latex?y_i%20%5Csim%20N(u(s_i),%201)">. In this case, the log-likelihood is <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p(y%5Cmid%20%5Csigma,%20%5Cell)%20=%20-%5Cfrac%7B1%7D%7B2%7D%20%5Clog%20%5Cdet(%5CSigma(%5Ctheta)%20+%20I)%20-%20%5Cfrac%7B1%7D%7B2%7Dy%5E%7BT%7D(%5CSigma(%5Ctheta)%20+%20I)%5E%7B-1%7Dy.%0A"></p>
<p>Let us do the exact same thing again!</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb5-1">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb5-2">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">s1 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">s2 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n), </span>
<span id="cb5-3">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dist_mat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(s1,s2))),</span>
<span id="cb5-4">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mu =</span> MASS<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mvrnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mu=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,n), </span>
<span id="cb5-5">                      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cov_fun</span>(dist_mat, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>)),</span>
<span id="cb5-6">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(n, mu, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb5-7"></span>
<span id="cb5-8">log_lik <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">function</span>(sigma, ell, y, h) {</span>
<span id="cb5-9">  V <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cov_fun</span>(h, sigma, ell)</span>
<span id="cb5-10">  R <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">chol</span>(V <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dim</span>(V)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))</span>
<span id="cb5-11">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(R))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">backsolve</span>(R, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">backsolve</span>(R, y, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">transpose =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)))</span>
<span id="cb5-12">}</span>
<span id="cb5-13"></span>
<span id="cb5-14">m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb5-15">f <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">partial</span>(log_lik, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>y, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">h =</span> dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>dist_mat)</span>
<span id="cb5-16"></span>
<span id="cb5-17">pars <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(c) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ell =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> m),</span>
<span id="cb5-18">                    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(c <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ell), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">c =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(c, m))</span>
<span id="cb5-19"></span>
<span id="cb5-20"> ll <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_df</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>),pars) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">contour =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(c), </span>
<span id="cb5-22">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ll =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dbl</span>(sigma, ell, </span>
<span id="cb5-23">                       <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">possibly</span>(f, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">otherwise =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>)))</span>
<span id="cb5-24"></span>
<span id="cb5-25"></span>
<span id="cb5-26">ll <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, ll, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> contour)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb5-27">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show.legend =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-28">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#scale_color_brewer(palette = "Set1") +</span></span>
<span id="cb5-29">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<div class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb6-1">f_trim <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(sigma, ell) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(sigma<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>ell <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> sigma<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>ell,</span>
<span id="cb6-2">                               <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">f</span>(sigma, ell))</span>
<span id="cb6-3">m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb6-4">surf <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expand_grid</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ell =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> m),</span>
<span id="cb6-5">                    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> m)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ll =</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dbl</span>(sigma, ell, </span>
<span id="cb6-7">                       <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">possibly</span>(f_trim, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">otherwise =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>)))</span>
<span id="cb6-8"></span>
<span id="cb6-9">surf <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(ll <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">360</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, sigma, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> ll)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb6-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_raster</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-6-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Once again, we can see that there is going to be a ridge in the likelihood surface! It’s a bit less disastrous this time, but it’s not excellent even with 500 observations (which is a decent number on a unit square). The weird structure of the likelihood is still going to lead to a long, non-elliptical shape in your posterior that your computational engine (and the person interpreting the results) are going to have to come to terms with. In particular, if you only look at the posterior marginal distributions for <img src="https://latex.codecogs.com/png.latex?%5Csigma"> and <img src="https://latex.codecogs.com/png.latex?%5Cell"> you may miss the fact that <img src="https://latex.codecogs.com/png.latex?%5Csigma%20%5Cell%5E%7B%5Cnu%7D"> is quite well estimated by the data even though the marginals for both <img src="https://latex.codecogs.com/png.latex?%5Csigma"> and <img src="https://latex.codecogs.com/png.latex?%5Cell"> are very wide.</p>
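<p>That last claim can be checked directly. Here is a quick sketch (Python rather than the post’s R, and entirely my own construction, so treat it as illustrative): simulate from the exponential-covariance model used above, then compare how much the log-likelihood moves <em>along</em> the ridge where sigma^2 / ell is held fixed (the combination the <code>pars</code> helper above holds constant) versus <em>across</em> it.</p>

```python
# Illustrative sketch (my own, not the post's code): with an exponential
# covariance, sigma^2 / ell is the well-identified combination, so the
# log-likelihood should be nearly flat along sigma^2 = c * ell and steep across it.
import numpy as np

rng = np.random.default_rng(1)
n = 200
s = rng.uniform(size=(n, 2))  # random sites on the unit square
h = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)  # distance matrix

def log_lik(sigma, ell, y):
    # Gaussian log-likelihood (up to an additive constant) via a Cholesky factor
    R = np.linalg.cholesky(sigma**2 * np.exp(-h / ell) + 1e-8 * np.eye(n))
    alpha = np.linalg.solve(R, y)
    return -np.log(np.diag(R)).sum() - 0.5 * alpha @ alpha

# simulate y from sigma = 1, ell = 0.2, matching the experiment above
L = np.linalg.cholesky(np.exp(-h / 0.2) + 1e-8 * np.eye(n))
y = L @ rng.standard_normal(n)

c = 1.0 / 0.2  # the "true" value of sigma^2 / ell
along = [log_lik(np.sqrt(c * ell), ell, y) for ell in np.linspace(0.1, 1.0, 10)]
across = [log_lik(t, 0.2, y) for t in np.linspace(0.5, 2.0, 10)]

print(np.ptp(along), np.ptp(across))  # spread along vs across the ridge
```

<p>If you run this, the spread along the ridge should be far smaller than the spread across it, which is the whole problem: the data pin down sigma^2 / ell without pinning down either parameter on its own.</p>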
<p>This ridge in the likelihood is going to translate somewhat into a ridge in the prior. We will see below that how much of that ridge we see is going to be very dependent on how we specify the prior. The entire purpose of the PC prior is to meaningfully resolve this ridge using sensible prior information.</p>
<p>But before we get to the (improved) PC prior, it’s worthwhile to survey some other priors that have been proposed in the literature.</p>
</section>
<section id="so-the-prior-is-important-then-what-do-other-people-do" class="level3">
<h3 class="anchored" data-anchor-id="so-the-prior-is-important-then-what-do-other-people-do">So the prior is important then! What do other people do?</h3>
<p>That ridge in the likelihood surface does not go away in low dimensions, which essentially means that our inference along that ridge is going to be driven by the prior.</p>
<p>Possibly the worst choice you could make in this situation is trying to make a minimally informative prior. Of course, that’s what <a href="https://www.google.com/search?client=safari&amp;rls=en&amp;q=Objective+Bayesian+Analysis+of+Spatially+Correlated+Data%2C&amp;ie=UTF-8&amp;oe=UTF-8">somebody did when they made a reference prior for the problem</a>. In fact, it was the first paper<sup>41</sup> to look rigorously at prior distributions on the parameters of GPs. It’s just unfortunate that it’s quite shit. It has still been cited quite a lot, and there have been some technical advances in the theory of reference priors since, but if you use it you just find yourself mapping out that damn ridge.</p>
<p>On top of being, structurally, a bad choice, the reference prior has a few other downsides:</p>
<ul>
<li>It is very computationally intensive and quite complex. Not unlike the bad version of the PC prior!</li>
<li>It requires <em>strong</em> assumptions about the likelihood. The first version assumed that there was no observation noise. Later papers allowed there to be observation noise. But only if it’s Gaussian.</li>
<li>It is derived under the asymptotic regime where an infinite sequence of different independent realisations of the GP are observed at the same finite set of points. This is not the most useful regime for GPs.</li>
</ul>
<p>All in all, it’s a bit of a casserole.</p>
<p>From the other end, there’s a very interesting contribution from <a href="https://arxiv.org/pdf/0908.3556.pdf">Aad van der Vaart and Harry van Zanten</a>, who wrote a very lovely theoretical paper looking at which priors on <img src="https://latex.codecogs.com/png.latex?%5Cell"> could result in theoretically optimal contraction rates for the posterior of <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)">. They argued that <img src="https://latex.codecogs.com/png.latex?%5Cell%5E%7B-d%7D"> should have a Gamma distribution. Within the Matérn class, their results are only valid for the squared exponential covariance function.</p>
<p>One of the stranger things that I have never fully understood is that the argument I’m going to make below ends up with a gamma distribution on <img src="https://latex.codecogs.com/png.latex?%5Cell%5E%7B-d/2%7D">, which is somewhat different to van der Vaart and van Zanten. If I were forced to bullshit some justification, I’d probably say something about the Matérn process depending only on the distance between observations making the <img src="https://latex.codecogs.com/png.latex?d">-sphere the natural geometry (the volume of which scales like <img src="https://latex.codecogs.com/png.latex?%5Cell%5E%7B-d/2%7D">) rather than the <img src="https://latex.codecogs.com/png.latex?d">-cube (the volume of which scales like <img src="https://latex.codecogs.com/png.latex?%5Cell%5E%7B-d%7D">). But that would be total bullshit. I simply have no idea. Their proposal comes via the time-honoured tradition of “constant chasing” in some fairly tricky proofs, so I have absolutely no intuition for it.</p>
<p>We also found in other contexts that using the KL divergence rather than its square root tended to perform worse. So I’m kinda happy with our scaling and, really, their paper doesn’t cover the covariance functions I’m considering in this post.</p>
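<p>If it helps to see what the two scalings do to the length scale itself, here is a purely illustrative sketch (the unit shape and rate are arbitrary choices of mine; none of these values come from either paper): draw a gamma variable X and transform it both ways.</p>

```python
# Purely illustrative (arbitrary unit-shape, unit-rate gamma): compare the
# prior on ell implied by ell^{-d} ~ Gamma (van der Vaart & van Zanten)
# with the one implied by ell^{-d/2} ~ Gamma (the scaling argued for here).
import random

random.seed(42)
d = 2
draws = 100_000
x = [random.gammavariate(1.0, 1.0) for _ in range(draws)]  # Exp(1) draws

ell_vdv = sorted(xi ** (-1.0 / d) for xi in x)   # ell such that ell^{-d} ~ Gamma
ell_half = sorted(xi ** (-2.0 / d) for xi in x)  # ell such that ell^{-d/2} ~ Gamma

median_vdv = ell_vdv[draws // 2]
median_half = ell_half[draws // 2]
print(median_vdv, median_half)
```

<p>In two dimensions the second rule is just 1/X, which puts noticeably more mass on large length scales; that heavier right tail is at least consistent with shrinking towards the <img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20%5Cinfty"> base model.</p>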
<p>Neither<sup>42</sup> of these papers considers that ridge in the likelihood surface.</p>
<p>This lack of consideration—as well as the success of PC priors in everything else we tried them on—was a big part of our push to make a useful version of a PC prior for Gaussian processes.</p>
</section>
<section id="rescuing-the-pc-prior-on-ell-or-what-i-recommend-you-do" class="level3">
<h3 class="anchored" data-anchor-id="rescuing-the-pc-prior-on-ell-or-what-i-recommend-you-do">Rescuing the PC prior on <img src="https://latex.codecogs.com/png.latex?%5Cell">; or What I recommend you do</h3>
<p>It has been a long journey, but we are finally where I wanted us to be. So let’s talk about how to fix the PC prior. In particular, I’m going to go through how to derive a prior on the length scale <img src="https://latex.codecogs.com/png.latex?%5Cell"> that has a simple form.</p>
<p>In order to solve this problem, we are going to do three things in the rest of this post:</p>
<ol type="1">
<li>Restrict our attention to stationary<sup>43</sup> GPs.</li>
<li>Restrict our attention to the Matérn class of covariance functions.</li>
<li>Greatly increase our mathematical<sup>44</sup> sophistication.</li>
</ol>
<p>But before we do that, I’m going to walk you through the punchline.</p>
<p>This work was originally done with the magnificent <a href="https://www.ntnu.edu/employees/fuglstad">Geir-Arne Fuglstad</a>, the glorious <a href="https://www.maths.ed.ac.uk/~flindgre/">Finn Lindgren</a>, and the resplendent <a href="https://www.kaust.edu.sa/en/study/faculty/haavard-rue">Håvard Rue</a>. If you want to read the original paper, <a href="https://arxiv.org/abs/1503.00256">the preprint is here</a><sup>45</sup>.</p>
<p>The PC prior is derived using the base model <img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20%5Cinfty">, which might seem like a slightly weird choice. The intuition behind it is that if there is strong dependence between far away points, the realisations of <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)"> cannot be too wiggly. In some contexts, people talk about <img src="https://latex.codecogs.com/png.latex?%5Cell"> as a <em>“smoothness”</em><sup>46</sup> parameter because realisations with large <img src="https://latex.codecogs.com/png.latex?%5Cell"> “look”<sup>47</sup> smoother than realisations with small <img src="https://latex.codecogs.com/png.latex?%5Cell">.</p>
<p>Another way to see the same thing is to note that a Matérn field approaches a<sup>48</sup> smoothing spline prior, in which case <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B-2%7D"> plays the role of the “smoothing parameter” of the spline. In that case, the natural base model of <img src="https://latex.codecogs.com/png.latex?%5Csigma=0"> interacts with the base model of <img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20%5Cinfty"> to shrink towards an increasingly flat surface centred on zero.</p>
<p>We still need to choose a quantity of interest in order to encode some explicit information in the prior. In this case, I’m going to use the idea that for any data set, we only have information up to a certain spatial resolution. In that case, we don’t want to put prior mass on the length scale being less than that resolution. Why? Well, any inference about <img src="https://latex.codecogs.com/png.latex?%5Cell"> at a smaller scale than the data resolution is going to be driven entirely by unverifiable model assumptions. And that feels a bit awkward. This suggests that we choose a minimum<sup>49</sup> length scale <img src="https://latex.codecogs.com/png.latex?L"> and choose the scaling parameter in the PC prior so that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Cell%20%3C%20L)%20%3C%20%5Calpha_%5Cell.%0A"></p>
<p>Under these assumptions, the PC prior for the length scale in a <img src="https://latex.codecogs.com/png.latex?d">-dimensional space is<sup>50</sup> a Fréchet distribution<sup>51</sup> with shape parameter <img src="https://latex.codecogs.com/png.latex?d/2"> and scale parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda_%5Cell%5E%7B2/d%7D">. That is, <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cell)%20=%20%5Cfrac%7Bd%5Clambda_%5Cell%7D%7B2%7D%20%5Cell%5E%7B-(d/2+1)%7De%5E%7B-%5Clambda_%7B%5Cell%7D%5Cell%5E%7B-d/2%7D%7D,%0A"> where we choose <img src="https://latex.codecogs.com/png.latex?%5Clambda_%5Cell%20=%20-%5Clog(%5Calpha_%5Cell)L%5E%7Bd/2%7D"> to ensure that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Cell%20%3C%20L)%20=%20e%5E%7B-%5Clambda%20L%5E%7B-d/2%7D%7D%20%3C%20%5Calpha_%5Cell.%0A"></p>
<p>In two dimensions, this is an inverse gamma prior, which gives rigorous justification to a commonly used prior in spatial statistics.</p>
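<p>Here is a quick sanity check of that calibration (a Python sketch of mine, with arbitrary example values for d, L, and alpha): with lambda_ell = -log(alpha_ell) L^{d/2}, the closed-form Fréchet CDF gives Pr(ell &lt; L) = alpha_ell exactly, and numerically integrating the density above over (0, L] agrees.</p>

```python
# Check the calibration of the PC prior on the length scale described above.
# d, L, and alpha are arbitrary example values.
import math

d, L, alpha = 2, 0.1, 0.05
lam = -math.log(alpha) * L ** (d / 2)  # the scaling parameter lambda_ell

def pc_density(ell):
    # the Frechet density stated in the post
    return (d * lam / 2) * ell ** (-(d / 2 + 1)) * math.exp(-lam * ell ** (-d / 2))

# closed-form Frechet CDF evaluated at L
cdf_at_L = math.exp(-lam * L ** (-d / 2))
print(cdf_at_L)  # equals alpha by construction

# trapezoidal integration of the density over (0, L]
m = 200_000
xs = [L * (i + 1) / m for i in range(m)]
vals = [pc_density(x) for x in xs]
integral = (L / m) * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
print(integral)
```

<p>The integral matches the CDF, so the stated density, CDF, and choice of lambda_ell are mutually consistent.</p>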
</section>
<section id="comparing-it-with-the-reference-prior" class="level3">
<h3 class="anchored" data-anchor-id="comparing-it-with-the-reference-prior">Comparing it with the reference prior</h3>
<p>Ok, so let’s actually see how much of a difference using a weakly informative prior makes relative to using the reference prior.</p>
<p>In the interest of computational speed, I’m going to use the simplest possible model setup, <img src="https://latex.codecogs.com/png.latex?%0Ay%20%5Cmid%20%5Csigma,%5Cell%20%5Csim%20N(0,%20%5Csigma%5E2%20R(%5Cell)),%0A"> and I’m only going to use 25 observations.</p>
<p>In this case, the reference prior<sup>52</sup> is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cell,%20%5Csigma)%20=%20%5Csigma%5E%7B-1%7D%5Cleft(%5Coperatorname%7Btr%7D%5Cleft%5B%5Cleft(%5Cfrac%7B%5Cpartial%20R%7D%7B%5Cpartial%20%5Cell%7DR%5E%7B-1%7D%5Cright)%5E2%5Cright%5D%20-%20%5Cfrac%7B1%7D%7Bn%7D%5Coperatorname%7Btr%7D%5Cleft(%5Cfrac%7B%5Cpartial%20R%7D%7B%5Cpartial%20%5Cell%7DR%5E%7B-1%7D%5Cright)%5E2%5Cright)%5E%7B1/2%7D.%0A"></p>
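<p>To make that formula concrete, here is a small numerical sketch (my own Python, not part of the original analysis) that evaluates it, up to proportionality, for the exponential correlation function used in this post.</p>

```python
# Evaluate the reference-prior factor
#   p(ell, sigma) ∝ sigma^{-1} * sqrt( tr(W^2) - tr(W)^2 / n ),  W = (dR/d ell) R^{-1},
# for the exponential correlation R_ij = exp(-|s_i - s_j| / ell).
import numpy as np

rng = np.random.default_rng(0)
n = 25  # as in the post's experiment
s = rng.uniform(size=(n, 2))
h = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)

def reference_density(ell, sigma):
    R = np.exp(-h / ell)
    dR = (h / ell**2) * R            # elementwise d/d ell of exp(-h / ell)
    W = dR @ np.linalg.inv(R)
    t2 = np.trace(W @ W)
    t1 = np.trace(W)
    return (1.0 / sigma) * np.sqrt(t2 - t1**2 / n)

print(reference_density(0.2, 1.0))
```

<p>Every evaluation needs a matrix inverse and two traces, which is a big part of why the reference prior is so computationally unpleasant compared to a closed-form density.</p>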
<p>Even with this limited setup, it took a lot of work to make Stan sample this posterior. You’ll notice that I did a ridge-aware reparameterisation. I also had to run twice as much warm up as I ordinarily would.</p>
<p>The Stan code is under the fold.</p>
<div class="cell" data-output.var="fake">
<details class="code-fold">
<summary>Show the Stan code!</summary>
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource stan number-lines code-with-copy"><code class="sourceCode stan"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">functions</span> {</span>
<span id="cb7-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span> cov(<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span> s,  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> ell) {</span>
<span id="cb7-3">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,N] R;</span>
<span id="cb7-4">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">row_vector</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] s1, s2;</span>
<span id="cb7-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:N) {</span>
<span id="cb7-6">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (j <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:N){</span>
<span id="cb7-7">        s1 = s[i, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>];</span>
<span id="cb7-8">        s2 = s[j, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>];</span>
<span id="cb7-9">        R[i,j] = exp(-sqrt(dot_self(s1-s2))/ell);</span>
<span id="cb7-10">      }</span>
<span id="cb7-11">    }</span>
<span id="cb7-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> * (R + R');</span>
<span id="cb7-13">  }</span>
<span id="cb7-14">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span> cov_diff(<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span> s,  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> ell) {</span>
<span id="cb7-15">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// dR/d ell, elementwise: |s_i - s_j| * exp(-|s_i - s_j| / ell) / ell^2</span></span>
<span id="cb7-16">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,N] R;</span>
<span id="cb7-17">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">row_vector</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] s1, s2;</span>
<span id="cb7-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:N) {</span>
<span id="cb7-19">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (j <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:N){</span>
<span id="cb7-20">        s1 = s[i, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>];</span>
<span id="cb7-21">        s2 = s[j, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>];</span>
<span id="cb7-22">        R[i,j] =  sqrt(dot_self(s1-s2)) * exp(-sqrt(dot_self(s1-s2))/ell) / ell^<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> ;</span>
<span id="cb7-23">      }</span>
<span id="cb7-24">    }</span>
<span id="cb7-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> * (R + R');</span>
<span id="cb7-26">  }</span>
<span id="cb7-27"></span>
<span id="cb7-28">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> log_prior(<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span> s, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> sigma2, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> ell) {</span>
<span id="cb7-29">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,N] R = cov(N, s,  ell);</span>
<span id="cb7-30">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,N] W = (cov_diff(N, s, ell)) / R;</span>
<span id="cb7-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> * log(trace(W * W) - (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> / (N)) * (trace(W))^<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) - log(sigma2);</span>
<span id="cb7-32">  }</span>
<span id="cb7-33">}</span>
<span id="cb7-34"></span>
<span id="cb7-35"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">data</span> {</span>
<span id="cb7-36">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span>=<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; N;</span>
<span id="cb7-37">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">vector</span>[N] y;</span>
<span id="cb7-38">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] s;</span>
<span id="cb7-39">}</span>
<span id="cb7-40"></span>
<span id="cb7-41"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">parameters</span> {</span>
<span id="cb7-42">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span>=<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; sigma2;</span>
<span id="cb7-43">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span>=<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; ell;</span>
<span id="cb7-44">}</span>
<span id="cb7-45"></span>
<span id="cb7-46"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">model</span> {</span>
<span id="cb7-47">  {</span>
<span id="cb7-48">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,N] R = cov(N, s, ell);</span>
<span id="cb7-49">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">target +=</span> multi_normal_lpdf(y | rep_vector(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, N), sigma2 * R);</span>
<span id="cb7-50">  }</span>
<span id="cb7-51">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">target +=</span> log_prior(N,  s, sigma2, ell);</span>
<span id="cb7-52">}</span>
<span id="cb7-53"></span>
<span id="cb7-54"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">generated quantities</span> {</span>
<span id="cb7-55">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> sigma = sqrt(sigma2);</span>
<span id="cb7-56">}</span></code></pre></div>
</details>
</div>
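<p>If you'd rather see the reference prior without the syntax-highlighting noise, here is a minimal numpy sketch of the same computation (a stand-in for illustration, not the Stan code itself): the exponential covariance, its derivative in <code>ell</code>, and the log reference prior <code>0.5 * log(tr(W^2) - tr(W)^2 / N) - log(sigma2)</code> with <code>W = (dR/d ell) R^{-1}</code>.</p>

```python
import numpy as np

def exp_cov(s, ell):
    # R[i, j] = exp(-||s_i - s_j|| / ell)
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    return np.exp(-d / ell)

def cov_diff(s, ell):
    # dR/d ell, elementwise: ||s_i - s_j|| * exp(-||s_i - s_j|| / ell) / ell**2
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    return d * np.exp(-d / ell) / ell**2

def log_reference_prior(s, sigma2, ell):
    n = s.shape[0]
    R = exp_cov(s, ell)
    # W = (dR/d ell) @ inv(R), via a solve rather than an explicit inverse;
    # R is symmetric, so solve(R, A.T).T == A @ inv(R)
    W = np.linalg.solve(R, cov_diff(s, ell).T).T
    tW = np.trace(W)
    return 0.5 * np.log(np.trace(W @ W) - tW**2 / n) - np.log(sigma2)
```

<p>Note that <code>sigma2</code> enters only through the <code>-log(sigma2)</code> term, exactly as in the Stan <code>log_prior</code> above.</p>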
<p>By comparison, the code for the PC prior is fairly simple.</p>
<div class="cell" data-output.var="fake">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource stan number-lines code-with-copy"><code class="sourceCode stan"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">functions</span> {</span>
<span id="cb8-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span> cov(<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span> s, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> sigma, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> ell) {</span>
<span id="cb8-3">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,N] R;</span>
<span id="cb8-4">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">row_vector</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] s1, s2;</span>
<span id="cb8-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span> sigma2 = sigma * sigma;</span>
<span id="cb8-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:N) {</span>
<span id="cb8-7">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (j <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:N){</span>
<span id="cb8-8">        s1 = s[i, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>];</span>
<span id="cb8-9">        s2 = s[j, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>];</span>
<span id="cb8-10">        R[i,j] = sigma2 * exp(-sqrt(dot_self(s1-s2))/ell);</span>
<span id="cb8-11">      }</span>
<span id="cb8-12">    }</span>
<span id="cb8-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> * (R + R');</span>
<span id="cb8-14">  }</span>
<span id="cb8-15">}</span>
<span id="cb8-16"></span>
<span id="cb8-17"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">data</span> {</span>
<span id="cb8-18">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span>=<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; N;</span>
<span id="cb8-19">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">vector</span>[N] y;</span>
<span id="cb8-20">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] s;</span>
<span id="cb8-21">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span> = <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; lambda_ell;</span>
<span id="cb8-22">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span> = <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; lambda_sigma;</span>
<span id="cb8-23">}</span>
<span id="cb8-24"></span>
<span id="cb8-25"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">parameters</span> {</span>
<span id="cb8-26">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span>=<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; sigma;</span>
<span id="cb8-27">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">real</span>&lt;<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lower</span>=<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>&gt; ell;</span>
<span id="cb8-28">}</span>
<span id="cb8-29"></span>
<span id="cb8-30"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">model</span> {</span>
<span id="cb8-31">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">matrix</span>[N,N] R = cov(N, s, sigma, ell);</span>
<span id="cb8-32">  y ~ multi_normal(rep_vector(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, N), R);</span>
<span id="cb8-33">  sigma ~ exponential(lambda_sigma);</span>
<span id="cb8-34">  ell ~ frechet(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, lambda_ell); <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Only in 2D</span></span>
<span id="cb8-35">}</span>
<span id="cb8-36"></span>
<span id="cb8-37"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// generated quantities {</span></span>
<span id="cb8-38"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//   real check = 0.0; // should be the same as lp__</span></span>
<span id="cb8-39"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//   { // I don't want to print R!</span></span>
<span id="cb8-40"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//     matrix[N,N] R = cov(N, s, sigma, ell);</span></span>
<span id="cb8-41"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//     check -= 0.5* dot_product(y,(R\ y)) + 0.5 * log_determinant(R);</span></span>
<span id="cb8-42"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//     check += log(sigma) - lambda_sigma * sigma;</span></span>
<span id="cb8-43"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//     check += log(ell) - 2.0 * log(ell) - lambda_ell / ell;</span></span>
<span id="cb8-44"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//   }</span></span>
<span id="cb8-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// }</span></span></code></pre></div>
</div>
<p>This is <em>a lot</em> easier than the code for the reference prior.</p>
<p>Let’s compare the results on some simulated data. Here I’m choosing <img src="https://latex.codecogs.com/png.latex?%5Calpha_%5Cell%20=%20%5Calpha_%5Csigma%20=%200.05">, <img src="https://latex.codecogs.com/png.latex?L_%5Cell%20=%200.05">, and <img src="https://latex.codecogs.com/png.latex?U_%5Csigma%20=%205">.</p>
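<p>The rate parameters follow from the tail conditions. Under an Exponential(<code>lambda_sigma</code>) prior, <code>P(sigma &gt; U_sigma) = exp(-lambda_sigma * U_sigma)</code>, so <code>lambda_sigma = -log(alpha_sigma) / U_sigma</code> hits the target tail probability exactly; <code>lambda_ell = -log(alpha_ell) * sqrt(L_ell)</code> is the analogous rate fed to the Fréchet prior on <code>ell</code>. A quick sanity check of the numbers (a sketch, mirroring the <code>stan_dat</code> values below):</p>

```python
import math

alpha_sigma, U_sigma = 0.05, 5.0
lambda_sigma = -math.log(alpha_sigma) / U_sigma   # matches stan_dat below

# Exponential tail probability: P(sigma > U) = exp(-lambda * U)
assert abs(math.exp(-lambda_sigma * U_sigma) - alpha_sigma) < 1e-12

alpha_ell, L_ell = 0.05, 0.05
lambda_ell = -math.log(alpha_ell) * math.sqrt(L_ell)  # matches stan_dat below
```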
<div class="cell">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(cmdstanr)</span>
<span id="cb9-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(posterior)</span>
<span id="cb9-3">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span></span>
<span id="cb9-4"></span>
<span id="cb9-5">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">s1 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">s2 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n), </span>
<span id="cb9-6">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dist_mat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(s1,s2))),</span>
<span id="cb9-7">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> MASS<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mvrnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mu=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,n), </span>
<span id="cb9-8">                                <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cov_fun</span>(dist_mat, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>)))</span>
<span id="cb9-9"></span>
<span id="cb9-10">stan_dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>y,</span>
<span id="cb9-11">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">s =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>s1,dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>s2),</span>
<span id="cb9-12">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">N =</span> n,</span>
<span id="cb9-13">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lambda_ell =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>),</span>
<span id="cb9-14">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lambda_sigma =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb9-15"></span>
<span id="cb9-16">mod_ref <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmdstan_model</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gp_ref_no_mean.stan"</span>)</span>
<span id="cb9-17">mod_pc <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmdstan_model</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gp_pc_no_mean.stan"</span>)</span></code></pre></div>
</div>
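<p>The simulation step is the standard one: draw locations uniformly on the unit square, build the exponential covariance at the true values <code>sigma2 = 1</code> and <code>ell = 0.2</code>, and sample one multivariate normal vector. A numpy version of what <code>cov_fun</code> and <code>MASS::mvrnorm</code> are doing here (assumed equivalents, not the post's own helpers):</p>

```python
import numpy as np

rng = np.random.default_rng(30127)
n = 25
s = rng.uniform(size=(n, 2))                        # locations on the unit square
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
Sigma = 1.0 * np.exp(-d / 0.2)                      # sigma2 = 1, ell = 0.2
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n))   # tiny jitter for stability
y = L @ rng.standard_normal(n)                      # one draw from N(0, Sigma)
```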
<p>First off, let’s look at the parameter estimates from the reference prior.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb10-1">fit_ref <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mod_ref<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> stan_dat, </span>
<span id="cb10-2">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">seed =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30127</span>, </span>
<span id="cb10-3">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">parallel_chains =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, </span>
<span id="cb10-4">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iter_warmup =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>,</span>
<span id="cb10-5">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iter_sampling =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>,</span>
<span id="cb10-6">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) </span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 41.6 seconds.
Chain 2 finished in 43.4 seconds.
Chain 4 finished in 44.8 seconds.
Chain 3 finished in 47.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 44.2 seconds.
Total execution time: 47.2 seconds.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb12-1">fit_ref<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> variable   mean median     sd  mad     q5    q95 rhat ess_bulk ess_tail
   lp__   -30.95 -30.57   1.24 0.89 -33.46 -29.79 1.00     1397      896
   sigma2  32.56   1.28 823.19 0.58   0.69   7.19 1.00      979      562
   ell      9.04   0.26 240.39 0.16   0.11   1.88 1.00      927      542
   sigma    1.67   1.13   5.46 0.27   0.83   2.68 1.00      979      562</code></pre>
</div>
</div>
<p>It also took a bloody long time.</p>
<p>Now let’s check in with the PC prior.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb14-1">fit_pc <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mod_pc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> stan_dat, </span>
<span id="cb14-2">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">seed =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30127</span>, </span>
<span id="cb14-3">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">parallel_chains =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb14-4">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iter_sampling =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>,</span>
<span id="cb14-5">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) </span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 4.9 seconds.
Chain 4 finished in 5.1 seconds.
Chain 3 finished in 5.4 seconds.
Chain 2 finished in 5.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.2 seconds.
Total execution time: 5.6 seconds.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb16-1">fit_pc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> variable   mean median   sd  mad     q5   q95 rhat ess_bulk ess_tail
    lp__  -10.36 -10.05 1.02 0.76 -12.42 -9.36 1.00     2160     3228
    sigma   1.52   1.36 0.60 0.41   0.92  2.72 1.00     1424     1853
    ell     0.67   0.45 0.72 0.27   0.19  1.89 1.00     1338     1694</code></pre>
</div>
</div>
<p>You’ll notice two things there: the sampling was much healthier (higher effective sample sizes, and none of the wild outlying draws that inflated the means under the reference prior) and it was <em>much</em> faster.</p>
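<p>To put a number on that: taking the worst bulk ESS from each summary and the total execution times, the PC prior delivers roughly an order of magnitude more effective draws per second. (A back-of-the-envelope calculation from the printed output above, not a formal benchmark.)</p>

```python
# Figures read off the run summaries above: worst bulk ESS and total wall time
ref_ess, ref_time = 927, 47.2    # reference prior: ell had the lowest bulk ESS
pc_ess, pc_time = 1338, 5.6      # PC prior: likewise ell

ref_eff = ref_ess / ref_time     # effective draws per second, reference prior
pc_eff = pc_ess / pc_time        # effective draws per second, PC prior
speedup = pc_eff / ref_eff       # roughly 12x
```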
<p>Finally, let’s look at some plots, starting with 2D density plots of the two posteriors.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb18-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(cowplot)</span>
<span id="cb18-2">samps_ref <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> fit_ref<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">draws</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">format =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"draws_df"</span>)</span>
<span id="cb18-3">samps_pc <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> fit_pc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">draws</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">format =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"draws_df"</span>)</span>
<span id="cb18-4"></span>
<span id="cb18-5">p1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> samps_ref <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, sigma)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_hex</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>()</span>
<span id="cb18-8">p2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> samps_pc <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, sigma)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_hex</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>()</span>
<span id="cb18-11"></span>
<span id="cb18-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_grid</span>(p1,p2)</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>It would be interesting to look at how different the densities for <img src="https://latex.codecogs.com/png.latex?%5Cell"> are.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb19-1">samps_pc <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb19-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb19-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(samps_ref<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ell), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb19-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlim</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-13-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>As expected, the PC prior (black) pulls the posterior towards the base model (<img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20%5Cinfty">), but what is interesting to me is that the posterior for the reference prior (red) has so much mass near zero. <a href="https://www.youtube.com/watch?v=_U-7L1tmBAo">That’s the one thing we didn’t want to happen</a>.</p>
<p>We can look closer at this by looking at the posterior for <img src="https://latex.codecogs.com/png.latex?%5Ckappa%20=%202%5Cell%5E%7B-1%7D">.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb20-1">p3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> samps_ref <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb20-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">kappa =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ell) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb20-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(kappa, sigma)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb20-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_hex</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb20-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>()</span>
<span id="cb20-6">p4 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> samps_pc <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span>  </span>
<span id="cb20-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">kappa =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ell) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb20-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(kappa, sigma)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb20-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_hex</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb20-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>()</span>
<span id="cb20-11"></span>
<span id="cb20-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_grid</span>(p3, p4)</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-14-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>To be brutally francis with you all, I’m not sure how much I trust that Stan posterior, so I’m going to look at the posterior along the ridge.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb21-1">log_prior <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">function</span>(sigma, ell) {</span>
<span id="cb21-2">  V <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cov_fun</span>(dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>dist_mat, sigma, ell)</span>
<span id="cb21-3">  dV <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> (V <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>dist_mat)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ell<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb21-4">  U <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">solve</span>(V, dV))</span>
<span id="cb21-5">  lprior <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(U <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> U)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(U))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(sigma)</span>
<span id="cb21-6">}</span>
<span id="cb21-7"></span>
<span id="cb21-8">log_posterior <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(sigma, ell) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log_prior</span>(sigma, ell) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">f_direct</span>(sigma, ell)</span>
<span id="cb21-9"></span>
<span id="cb21-10">m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb21-11">pars <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(c) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ell =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> m),</span>
<span id="cb21-12">                    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sigma =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(c <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ell), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">c =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(c, m))</span>
<span id="cb21-13"></span>
<span id="cb21-14">lpost <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_df</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>),pars) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb21-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tau =</span> c, </span>
<span id="cb21-16">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">log_posterior =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dbl</span>(sigma, ell, </span>
<span id="cb21-17">                       <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">possibly</span>(log_posterior, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">otherwise =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>)))</span>
<span id="cb21-18"></span>
<span id="cb21-19"></span>
<span id="cb21-20">lpost <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb21-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(log_posterior <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb21-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, log_posterior, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> tau, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> tau)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb21-23">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb21-24">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#scale_color_brewer(palette = "Set1") +</span></span>
<span id="cb21-25">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() </span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-15-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>We can compare this with the likelihood surface.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb22-1">llik <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_df</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>),pars) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb22-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tau =</span> c, </span>
<span id="cb22-3">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">log_likelihood =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dbl</span>(sigma, ell, </span>
<span id="cb22-4">                       <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">possibly</span>(f_direct, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">otherwise =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>)))</span>
<span id="cb22-5"></span>
<span id="cb22-6">lprior <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_df</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>),pars) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb22-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tau =</span> c, </span>
<span id="cb22-8">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">log_prior =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dbl</span>(sigma, ell, </span>
<span id="cb22-9">                       <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">possibly</span>(log_prior, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">otherwise =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA_real_</span>)))</span>
<span id="cb22-10"></span>
<span id="cb22-11">p1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> llik <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb22-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(log_likelihood <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb22-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, log_likelihood, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> tau, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> tau)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb22-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb22-15">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#scale_color_brewer(palette = "Set1") +</span></span>
<span id="cb22-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() </span>
<span id="cb22-17"></span>
<span id="cb22-18">p2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lprior <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb22-19">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(log_prior <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb22-20">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(ell, log_prior, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> tau, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> tau)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb22-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb22-22">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#scale_color_brewer(palette = "Set1") +</span></span>
<span id="cb22-23">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() </span>
<span id="cb22-24"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_grid</span>(p1, p2)</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>You can see here that the prior is putting <em>a lot</em> of weight at zero relative to the likelihood surface, which is relatively flat.</p>
<p>It’s also important to notice that the ridge isn’t as flat with <img src="https://latex.codecogs.com/png.latex?n=25"> as it is with <img src="https://latex.codecogs.com/png.latex?n=500">. It would be very interesting to repeat this with larger values of <img src="https://latex.codecogs.com/png.latex?n">, but frankly I do not have the time.</p>
</section>
<section id="moving-beyond-the-matérn" class="level3">
<h3 class="anchored" data-anchor-id="moving-beyond-the-matérn">Moving beyond the Matérn</h3>
<p>There is <em>a lot</em> more to say on this topic. But honestly this blog post is already enormous (you are a bit over halfway if you choose to read the technical guff). So I’m just going to summarise some of the things that I think are important here.</p>
<p>Firstly, the rigorous construction of the PC prior only makes sense when <img src="https://latex.codecogs.com/png.latex?d%20%5Cleq%203">. This is a bit annoying, but it is what it is. I would argue that this construction is still fairly reasonable in moderate dimensions. (In high dimensions I think we need more research.)</p>
<p>There are two ways to see that. Firstly, if you look at the derivation of the distance, it involves an infinite sum that only converges when <img src="https://latex.codecogs.com/png.latex?d%20%3C%204">. But mathematically, if we can show<sup>53</sup> that the partial sums can be bounded independently of <img src="https://latex.codecogs.com/png.latex?%5Cell">, then we can just send another thing to infinity when we send the domain size and the base model length scale there.</p>
<p>A different way to see this is to note that the PC prior distance is <img src="https://latex.codecogs.com/png.latex?d(%5Cell)%20=%20%5Cell%5E%7B-d/2%7D">. This is proportional to the inverse square root of the volume of the <img src="https://latex.codecogs.com/png.latex?d">-sphere<sup>54</sup> of radius <img src="https://latex.codecogs.com/png.latex?%5Cell">. This doesn’t seem like a massively useful observation, but just wait.</p>
<p>What if we ask ourselves “what is the average variance of <img src="https://latex.codecogs.com/png.latex?u(s)"> over a ball of radius <img src="https://latex.codecogs.com/png.latex?r">?”. If we write <img src="https://latex.codecogs.com/png.latex?c_%7B%5Cell,%5Csigma%7D(h)"> as the Matérn covariance function, then<sup>55</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BVar%7D%5Cleft(%5Cfrac%7B1%7D%7B%5Coperatorname%7BVol%7D(%5Cmathbb%7BB%7D_d(r))%7D%5Cint_%7B%5Cmathbb%7BB%7D_d(r)%7Du(s)%5C,ds%5Cright)%20=%20%5Cfrac%7B1%7D%7B%5Coperatorname%7BVol%7D(%5Cmathbb%7BB%7D_d(r))%7D%20%5Cint_0%5E%5Cinfty%20%5Ctilde%7Bc%7D_%7B%5Cell,%20%5Csigma%7D(t)%20t%5E%7Bd-1%7D%5C,dt,%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20c_%7B%5Cell,%20%5Csigma%7D(t)%20=%20c_%7B%5Cell,%20%5Csigma%7D(h)"> for all <img src="https://latex.codecogs.com/png.latex?%5C%7Ch%5C%7C%20=%20t">. If we remember that <img src="https://latex.codecogs.com/png.latex?c_%7B%5Cell,%20%5Csigma%7D(s)%20=%20c_%7B1,%20%5Csigma%7D(%5Cell%20s)">, then we can write this as <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B%5Coperatorname%7BVol%7D(%5Cmathbb%7BB%7D_d(r))%7D%20%5Cint_0%5E%5Cinfty%20%5Ctilde%7Bc%7D_%7B1,%20%5Csigma%7D(%5Cell%20t)%20t%5E%7Bd-1%7D%5C,dt%20=%20%5Cfrac%7B%5Cell%5E%7B-d%7D%7D%7B%5Coperatorname%7BVol%7D(%5Cmathbb%7BB%7D_d(r))%7D%20%5Cint_0%5E%5Cinfty%20%5Ctilde%7Bc%7D_%7B1,%20%5Csigma%7D(v)%20v%5E%7Bd-1%7D%5C,dv.%0A"> Hence the PC prior on <img src="https://latex.codecogs.com/png.latex?%5Cell"> is penalising the change in average standard deviation over a ball relative to the unit length scale. With this interpretation, the base model is, once again, zero standard deviation. This reasoning carries over to the length scale parameter in <em>any</em><sup>56</sup> Gaussian process.</p>
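<p>To make this concrete, here is a small R sketch of the prior density this distance induces. This is an illustration under stated assumptions, not package code: it follows the standard PC recipe of an exponential prior with rate <code>lambda</code> on the distance scale, pushed back to <code>ell</code>, and the helper names <code>pc_prior_ell</code> and <code>lambda_for</code> are mine.</p>

```r
# Hypothetical helpers sketching the PC prior on ell induced by putting an
# Exp(lambda) prior on the distance d(ell) = ell^(-d/2).
pc_prior_ell <- function(ell, d, lambda) {
  (d / 2) * lambda * ell^(-d / 2 - 1) * exp(-lambda * ell^(-d / 2))
}

# Calibrate lambda from a tail statement P(ell < ell0) = alpha, using
# P(ell < ell0) = P(d(ell) > ell0^(-d/2)) = exp(-lambda * ell0^(-d/2)).
lambda_for <- function(ell0, alpha, d) -log(alpha) * ell0^(d / 2)

# Sanity checks in d = 2: the density integrates to 1 and hits the calibration.
lam <- lambda_for(ell0 = 0.1, alpha = 0.05, d = 2)
integrate(pc_prior_ell, 0, Inf, d = 2, lambda = lam)$value  # approximately 1
integrate(pc_prior_ell, 0, 0.1, d = 2, lambda = lam)$value  # approximately 0.05
```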
<p>This post only covers the simplest version of Matérn GPs. One simple extension is to construct a non-stationary GP by replacing the Euclidean distance with the distance on a manifold with volume element <img src="https://latex.codecogs.com/png.latex?R(s)%5C,ds">. This might seem like a weird and abstract thing to do, but it’s an intrinsic specification of the popular deformation method due to <a href="https://www.jstor.org/stable/2290458">Sampson and Guttorp</a>. <a href="https://arxiv.org/abs/1503.00256">Our paper</a> covers the prior specification in this case.</p>
<p>The other common case that I’ve not considered here is the extension where there is a different length scale<sup>57</sup> in each dimension. In this case, we could compute a PC prior independently for each dimension (so <img src="https://latex.codecogs.com/png.latex?d=1"> for each prior). To be completely honest with you, I worry a little bit about that choice in high dimensions<sup>58</sup> (products of independent priors being notoriously weird), but I don’t have a better suggestion.</p>
</section>
<section id="whats-in-the-rest-of-the-post" class="level3">
<h3 class="anchored" data-anchor-id="whats-in-the-rest-of-the-post">What’s in the rest of the post?</h3>
<p>So you might have noticed that even though the previous section is a “conclusion” section, there is quite a bit more blog to go. I shan’t lie: this whole thing up to this point is a tl;dr that got wildly out of control.</p>
<p>The rest of the post is the details.</p>
<p>There are two parts. The first part covers enough<sup>59</sup> of the theory of stationary GPs to allow us to understand the second part, which actually derives the PC prior.</p>
<p>It’s going to get a bit hairy and I’m going to assume you’ve at least skimmed through the first 2 definitions in my <a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness.html">previous post defining GPs</a>.</p>
<p>I fully expect that most people will want to stop reading here. But you shouldn’t. Because if I had to suffer, you all have to suffer.</p>
</section>
</section>
<section id="part-2-an-invitation-to-the-theory-of-stationary-gaussian-processes" class="level2">
<h2 class="anchored" data-anchor-id="part-2-an-invitation-to-the-theory-of-stationary-gaussian-processes">Part 2: An invitation to the theory of Stationary Gaussian processes</h2>
<p>Gaussian processes with the Matérn covariance function are an excellent example of stationary<sup>60</sup> Gaussian processes, which are characterised<sup>61</sup> <sup>62</sup> by having covariance functions of the form <img src="https://latex.codecogs.com/png.latex?%0Ac(s,%20s')%20=%20c(s-%20s'),%0A"> where I am abusing notation and using <img src="https://latex.codecogs.com/png.latex?c"> for both the two parameter and one parameter functions. This assumption means that the correlation structure does not depend on where you are in space, only on the separation between points.</p>
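<p>As a minimal illustration of what the one-argument notation buys us, here is the simplest member of the Matérn family (the exponential covariance, <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%201/2">) in R. The name <code>c_exp</code> is a throwaway; the point is only that shifting both inputs by the same amount leaves the covariance unchanged.</p>

```r
# Exponential covariance (Matérn with nu = 1/2): a function of s - s' only.
c_exp <- function(s1, s2, sigma = 1, ell = 0.5) {
  sigma^2 * exp(-abs(s1 - s2) / ell)
}

c_exp(0.1, 0.7)          # depends only on the separation 0.6
c_exp(0.1 + 5, 0.7 + 5)  # identical: stationarity in action
```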
<p>The assumption of stationarity massively simplifies GPs. Firstly, the stationarity assumption greatly reduces the number of parameters you need to describe a GP as we don’t need to worry about location-specific parameters. Secondly, it increases the statistical power of the data. If two subsets of the domain are more than <img src="https://latex.codecogs.com/png.latex?2%5Cell"> apart, they are essentially independent replicates of the GP with the same parameters. This means that if the locations <img src="https://latex.codecogs.com/png.latex?s"> vary across a large enough area (relative to the natural length scale), we get multiple effective replicates<sup>63</sup> from the same realisation of the process.</p>
<p>In practice, stationarity<sup>64</sup> is often a <em>good enough</em> assumption when the mean has been modelled carefully, <a href="https://arxiv.org/abs/1409.0743">especially given the limitations of the data</a>. That said, priors on non-stationary processes can be set using the PC prior methodology by using a stationary process as the base model. The <a href="https://arxiv.org/abs/1503.00256">supplementary material</a> of our paper gives a simple, but useful, example of this.</p>
<section id="stationary-covariance-functions-and-bochners-theorem" class="level3">
<h3 class="anchored" data-anchor-id="stationary-covariance-functions-and-bochners-theorem">Stationary covariance functions and Bochner’s theorem</h3>
<p>The restriction to stationary processes is <em>extremely</em> powerful. It opens us up to using Fourier analysis as a potent tool for understanding GPs. We are going to need this to construct our KL divergence, and so with some trepidation, let’s dive into the Moonee Ponds of spectral representations.</p>
<p>The first thing that we need to do is remember what a <em>Fourier transform</em> is. A Fourier transform of a square integrable function <img src="https://latex.codecogs.com/png.latex?%5Cphi(s)"> is<sup>65</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20%5Cphi(%5Comega)%20=%20%5Cmathcal%7BF%7D(%5Cphi)(%5Comega)%20=%5Cfrac%7B1%7D%7B(2%5Cpi)%5Ed%7D%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20e%5E%7B-i%5Comega%5ETs%7D%5Cphi(s)%20%5C,ds.%0A"></p>
<p>If you have bad memories<sup>66</sup> of desperately trying to compute Fourier integrals in undergrad, I promise you that we are not doing that today. We are simply affirming their right to exist (and my right to look them up in a table).</p>
<p>The reason I care about Fourier<sup>67</sup> transforms is that if I have a non-negative measure<sup>68</sup> <img src="https://latex.codecogs.com/png.latex?%5Cnu">, I can define a function <img src="https://latex.codecogs.com/png.latex?%0Ac(h)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7De%5E%7Bi%5Comega%5ETh%7D%5C,d%5Cnu(%5Comega).%0A"> If measures freak you out, you can—with some loss of generality—assume that there is a function <img src="https://latex.codecogs.com/png.latex?f(%5Comega)%5Cgeq%200"> such that <img src="https://latex.codecogs.com/png.latex?%0Ac(h)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7De%5E%7Bi%5Comega%5ETh%7Df(%5Comega)%5C,d%5Comega.%0A"> We are going to call <img src="https://latex.codecogs.com/png.latex?%5Cnu"> the spectral measure and the corresponding <img src="https://latex.codecogs.com/png.latex?f">, if it exists, is called the spectral density.</p>
<p>I put it to you that, defined this way, <img src="https://latex.codecogs.com/png.latex?c(s,s')%20=%20c(s%20-%20s')"> is a (complex) positive definite function.</p>
<p>Recall<sup>69</sup> that a function is positive definite if, for every <img src="https://latex.codecogs.com/png.latex?k%3E0">, every <img src="https://latex.codecogs.com/png.latex?s_1,%20%5Cldots,%20s_k%20%5Cin%20%5Cmathbb%7BR%7D%5Ed">, and every <img src="https://latex.codecogs.com/png.latex?a_1,%20%5Cldots,%20a_k%20%5Cin%20%5Cmathbb%7BC%7D"> <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bi%20=%201%7D%5Ek%5Csum_%7Bj=1%7D%5Ek%20a_i%5Cbar%7Ba%7D_j%20c(s_i,%20s_j)%20%5Cgeq%200,%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cbar%20a"> is the complex conjugate of <img src="https://latex.codecogs.com/png.latex?a">.</p>
<p>Using our assumption about <img src="https://latex.codecogs.com/png.latex?c(%5Ccdot)"> we can write the left hand side as <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Csum_%7Bi%20=%201%7D%5Ek%5Csum_%7Bj=1%7D%5Ek%20a_i%5Cbar%7Ba%7D_j%20c(s_i,%20s_j)%20&amp;=%20%5Csum_%7Bi%20=%201%7D%5Ek%5Csum_%7Bj=1%7D%5Ek%20a_i%5Cbar%7Ba%7D_j%20c(s_i-%20s_j)%20%5C%5C%0A&amp;=%5Csum_%7Bi%20=%201%7D%5Ek%5Csum_%7Bj=1%7D%5Ek%20a_i%5Cbar%7Ba%7D_j%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20e%5E%7Bi%5Comega%5ET(s_i-s_j)%7D%5C,d%5Cnu(%5Comega)%20%5C%5C%0A&amp;=%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Csum_%7Bi%20=%201%7D%5Ek%5Csum_%7Bj=1%7D%5Ek%20a_i%5Cbar%7Ba%7D_j%20e%5E%7Bi%5Comega%5ET(s_i-s_j)%7D%5C,d%5Cnu(%5Comega)%20%5C%5C%0A&amp;=%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cleft(%5Csum_%7Bi%20=%201%7D%5Ek%20a_i%20e%5E%7Bi%5Comega%5ETs_i%7D%5Cright)%5Cleft(%5Csum_%7Bj%20=%201%7D%5Ek%20%5Cbar%7Ba_j%7D%20e%5E%7B-i%5Comega%5ETs_j%7D%5Cright)%20%5C,d%5Cnu(%5Comega)%5C%5C%0A&amp;=%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cleft(%5Csum_%7Bi%20=%201%7D%5Ek%20a_i%20e%5E%7Bi%5Comega%5ETs_i%7D%5Cright)%5Coverline%7B%5Cleft(%5Csum_%7Bj%20=%201%7D%5Ek%20a_j%20e%5E%7Bi%5Comega%5ETs_j%7D%5Cright)%7D%20%5C,d%5Cnu(%5Comega)%20%5C%5C%0A&amp;=%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cleft%7C%5Csum_%7Bi%20=%201%7D%5Ek%20a_i%20e%5E%7Bi%5Comega%5ETs_i%7D%5Cright%7C%5E2%5C,d%5Cnu(%5Comega)%20%5Cgeq%200,%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?%7Ca%7C%5E2%20=%20a%5Cbar%7Ba%7D">.</p>
<p>We have shown that if <img src="https://latex.codecogs.com/png.latex?c(s,s')%20=%20c(s-s')%20=%20%5Cint%20e%5E%7Bi%5Comega%5ET(s-s')%7D%5C,d%20%5Cnu(%5Comega)"> , then it is a valid covariance function. This is also true, although much harder to prove, in the other direction and the result is known as Bochner’s theorem.</p>
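<p>To make this concrete, here is a small numerical sketch (my own illustration; the spectral density and evaluation points are arbitrary choices): build c(h) from an assumed spectral density and confirm that the resulting Gram matrix is positive semi-definite.</p>

```python
import numpy as np

# Sketch: take f to be the N(0, 1) density (an assumed example spectral
# density) and compute c(h) = int exp(i omega h) f(omega) d omega by a
# Riemann sum. For this f the integral is the N(0, 1) characteristic
# function, so we should recover the squared-exponential exp(-h^2 / 2).
step = 1e-3
omega = np.arange(-12.0, 12.0, step)
f = np.exp(-omega**2 / 2) / np.sqrt(2 * np.pi)

def c(h):
    # f is symmetric about zero, so the imaginary part vanishes.
    return np.sum(np.exp(1j * omega * h) * f).real * step

assert abs(c(0.0) - 1.0) < 1e-6           # c(0) = total spectral mass
assert abs(c(1.0) - np.exp(-0.5)) < 1e-6  # matches exp(-h^2 / 2)

# The Gram matrix C_ij = c(s_i - s_j) is positive semi-definite, exactly
# as the quadratic-form calculation above shows.
rng = np.random.default_rng(1)
s = rng.uniform(0, 5, size=20)
C = np.array([[c(si - sj) for sj in s] for si in s])
assert np.linalg.eigvalsh(C).min() > -1e-8
```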
<div id="thm-bochner" class="theorem">
<p><span class="theorem-title"><strong>Theorem 5 (Bochner’s theorem)</strong></span> A function <img src="https://latex.codecogs.com/png.latex?c(%5Ccdot)"> is positive definite, ie for every <img src="https://latex.codecogs.com/png.latex?k%3E0">, every <img src="https://latex.codecogs.com/png.latex?s_1,%20%5Cldots,%20s_k%20%5Cin%20%5Cmathbb%7BR%7D%5Ed">, and every <img src="https://latex.codecogs.com/png.latex?a_1,%20%5Cldots,%20a_k%20%5Cin%20%5Cmathbb%7BC%7D"> <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bi%20=%201%7D%5Ek%5Csum_%7Bj=1%7D%5Ek%20a_i%5Cbar%7Ba%7D_j%20c(s_i-%20s_j)%20%5Cgeq%200,%0A"> if and only if there is a non-negative finite measure <img src="https://latex.codecogs.com/png.latex?%5Cnu"> such that <img src="https://latex.codecogs.com/png.latex?%0Ac(h)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20e%5E%7Bi%5Comega%5ETh%7D%5C,d%5Cnu(%5Comega).%0A"></p>
</div>
<p>Just as a covariance function<sup>70</sup> is enough to completely specify a zero-mean Gaussian process, a spectral measure is enough to completely specify a zero mean <em>stationary</em> Gaussian process.</p>
<p>Our lives are mathematically much easier when <img src="https://latex.codecogs.com/png.latex?%5Cnu"> represents a density <img src="https://latex.codecogs.com/png.latex?f(%5Comega)"> that satisfies <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cphi(%5Comega)%5C,d%5Cnu(%5Comega)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cphi(%5Comega)f(%5Comega)%5C,d%5Comega.%0A"> This function, when it exists, is precisely the Fourier transform of <img src="https://latex.codecogs.com/png.latex?c(h)">. Unfortunately, this will not exist<sup>71</sup> for all possible positive definite functions. But as we drift further and further down this post, we will begin to assume that we’re only dealing with cases where <img src="https://latex.codecogs.com/png.latex?f"> exists.</p>
<p>The case of particular interest to us is the Matérn covariance function. The parameterisation used above is really lovely, but for mathematical convenience, we are going to set<sup>72</sup> <img src="https://latex.codecogs.com/png.latex?%5Ckappa%20=%20%5Csqrt%7B8%5Cnu%7D%5Cell%5E%7B-1%7D">. With this parameterisation, the Matérn covariance function has<sup>73</sup> Fourier transform <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Af(%5Comega)%20&amp;=%20%5Cfrac%7B%5CGamma(%5Cnu+d/2)%5Ckappa%5E%7B2%5Cnu%7D%5Csigma%5E2%7D%7B%5Cpi%5E%7Bd/2%7D%5CGamma(%5Cnu)%7D%5Cfrac%7B1%7D%7B(%5Ckappa%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%7B%5Cnu+d/2%7D%7D%5C%5C%0A&amp;=%20C_%5Ctext%7BMat%C3%A9rn%7D(%5Cnu,d)%5Ccdot%5Ckappa%5E%7B2%5Cnu%7D%5Csigma%5E2%20%5Cfrac%7B1%7D%7B(%5Ckappa%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%7B%5Cnu+d/2%7D%7D,%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?C_%5Ctext%7BMat%C3%A9rn%7D(%5Cnu,d)"> is defined implicitly above and is a constant (as we are keeping <img src="https://latex.codecogs.com/png.latex?%5Cnu"> fixed).</p>
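<p>As a sanity check on the spectral density (a sketch I added; the values of κ and σ are arbitrary), consider the simplest case ν = 1/2, d = 1: the Matérn covariance reduces to the exponential covariance σ²exp(−κ|h|), whose spectral density in the Fourier convention above is κσ²/(π(κ² + ω²)). It should integrate to the marginal variance σ² and transform back to the covariance.</p>

```python
import numpy as np

# nu = 1/2, d = 1: Matérn covariance c(h) = sigma^2 * exp(-kappa * |h|),
# spectral density f(omega) = kappa * sigma^2 / (pi * (kappa^2 + omega^2)).
# kappa and sigma are arbitrary example values.
kappa, sigma = 2.0, 1.5
step = 0.01
omega = np.arange(-2000.0, 2000.0, step)
f = kappa * sigma**2 / (np.pi * (kappa**2 + omega**2))

# The spectral density integrates to the marginal variance c(0) = sigma^2
# (up to truncation of the heavy 1/omega^2 tails) ...
assert abs(np.sum(f) * step - sigma**2) < 1e-2

# ... and transforming back recovers the exponential covariance at h = 1.
c1 = np.sum(np.exp(1j * omega * 1.0) * f).real * step
assert abs(c1 - sigma**2 * np.exp(-kappa)) < 1e-2
```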
</section>
<section id="spectral-representations-and-the-simplest-of-the-many-many-versions-of-a-stochastic-integral" class="level3">
<h3 class="anchored" data-anchor-id="spectral-representations-and-the-simplest-of-the-many-many-versions-of-a-stochastic-integral">Spectral representations (and the simplest of the many many versions of a stochastic integral)</h3>
<p>To build this spectral representation, we need a tiny bit of machinery. Specifically, we need the concept of a Gaussian <img src="https://latex.codecogs.com/png.latex?%5Cnu">-noise and its corresponding integral.</p>
<div id="def-nu-noise" class="theorem definition">
<p><span class="theorem-title"><strong>Definition 1 (Complex <img src="https://latex.codecogs.com/png.latex?%5Cnu">-noise)</strong></span> A (complex) <img src="https://latex.codecogs.com/png.latex?%5Cnu">-noise<sup>74</sup> is a random measure<sup>75</sup> <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(%5Ccdot)"> such that every<sup>76</sup> disjoint<sup>77</sup> pair of sets <img src="https://latex.codecogs.com/png.latex?A,%20B"> satisfies the following properties:</p>
<ol type="1">
<li><img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(A)"> has mean zero and variance <img src="https://latex.codecogs.com/png.latex?%5Cnu(A)">,</li>
<li>If <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B"> are disjoint then <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(A%5Ccup%20B)%20=%20Z_%5Cnu(A)%20+%20Z_%5Cnu(B)"></li>
<li>If <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B"> are disjoint then <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(A)"> and <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(B)"> are uncorrelated<sup>78</sup>, ie <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(Z_%5Cnu(A)%20%5Coverline%7BZ_%5Cnu(B)%7D)%20=%200">.</li>
</ol>
</div>
<p>This definition might not seem like much, but imagine simple<sup>79</sup> piecewise constant functions <img src="https://latex.codecogs.com/png.latex?%0Af(%5Comega)%20=%20%5Csum_%7Bi=1%7D%5E%7Bn%7D%20f_i%201_%7BA_i%7D(%5Comega),%5Cquad%20g(%5Comega)%20=%20%20%5Csum_%7Bi=1%7D%5E%7Bn%7D%20g_i%201_%7BA_i%7D(%5Comega)%0A"> where <img src="https://latex.codecogs.com/png.latex?f_i,%20g_i%5Cin%20%5Cmathbb%7BC%7D"> and the sets <img src="https://latex.codecogs.com/png.latex?A_i"> are pairwise disjoint and <img src="https://latex.codecogs.com/png.latex?%5Cbigcup_%7Bi=1%7D%5En%20A_i%20%20=%20%5Cmathbb%7BR%7D%5Ed">. Then we can define an integral with respect to the <img src="https://latex.codecogs.com/png.latex?%5Cnu">-noise as <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f(%5Comega)%5C,dZ_%5Cnu(%5Comega)%20=%20%5Csum_%7Bi=1%7D%5En%20f_i%20Z_%5Cnu(A_i),%0A"> which has mean <img src="https://latex.codecogs.com/png.latex?0"> and variance <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft(%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f(%5Comega)%5C,dZ_%5Cnu(%5Comega)%5Cright)%5E2%20=%20%5Csum_%7Bi=1%7D%5En%20f_i%5E2%20%5Cnu(A_i)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7Df(%5Comega)%5E2%5C,d%5Cnu(%5Comega),%0A"> where the first equality comes from noting that <img src="https://latex.codecogs.com/png.latex?%5Cint_%7BA_i%7D%20%5C,dZ_v(%5Comega)"> and <img src="https://latex.codecogs.com/png.latex?%5Cint_%7BA_j%7D%20%5C,%20dZ_v(%5Comega)"> are uncorrelated and the last equality comes from the definition of an integral of a piecewise constant function.</p>
<p>Moreover, we get the covariance <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5Cleft(%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f(%5Comega)%5C,dZ_%5Cnu(%5Comega)%5Coverline%7B%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20g(%5Comega)%5C,dZ_%5Cnu(%5Comega)%7D%5Cright)%20&amp;=%20%5Csum_%7Bi=1%7D%5En%20%5Csum_%7Bj=1%7D%5En%20f_i%20g_j%20%5Cnu(A_i%20%5Ccap%20A_j)%20%5C%5C%0A&amp;=%20%5Csum_%7Bi=1%7D%5En%20f_i%5Coverline%7Bg%7D_i%20%5Cnu(A_i)%20%5C%5C%0A&amp;=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7Df(%5Comega)%5Coverline%7Bg(%5Comega)%7D%5C,d%5Cnu(%5Comega).%0A%5Cend%7Balign*%7D"></p>
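<p>A quick Monte Carlo sketch of that calculation (real-valued case; the partition masses and coefficients are made-up values): simulate the noise on each cell and check the variance identity empirically.</p>

```python
import numpy as np

# Z(A_i) ~ N(0, nu(A_i)) independent on a partition A_1, ..., A_4, and
# int f dZ = sum_i f_i Z(A_i) should have mean 0 and variance
# sum_i f_i^2 nu(A_i). The nu(A_i) and f_i are arbitrary made-up values.
rng = np.random.default_rng(42)
nu = np.array([0.5, 1.0, 2.0, 0.25])
f_vals = np.array([1.0, -2.0, 0.5, 3.0])

Z = rng.normal(scale=np.sqrt(nu), size=(200_000, 4))  # noise on each cell
integral = Z @ f_vals

assert abs(np.mean(integral)) < 0.05
assert abs(np.var(integral) - np.sum(f_vals**2 * nu)) < 0.1
```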
<p>A nice thing is that while these piecewise constant functions are quite simple, we can approximate <em>any</em><sup>80</sup> function arbitrarily well by a simple function. This is the same fact we use to build ourselves ordinary<sup>81</sup> integrals.</p>
<p>In particular, the brave and the bold among you might just say “we can take limits here and <em>define</em>” an integral with respect to the <img src="https://latex.codecogs.com/png.latex?%5Cnu">-noise this way. And, indeed, that works. You get that, for any <img src="https://latex.codecogs.com/png.latex?f%5Cin%20L%5E2(%5Cnu)">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft(%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f(%5Comega)%5C,d%20Z_%5Cnu(%5Comega)%5Cright)%20=%200%0A"> and, for any <img src="https://latex.codecogs.com/png.latex?f,g%20%5Cin%20L%5E2(%5Cnu)">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft(%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f(%5Comega)%5C,d%20Z_%5Cnu(%5Comega)%5Coverline%7B%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20g(%5Comega)%5C,d%20Z_%5Cnu(%5Comega)%7D%5Cright)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f(%5Comega)%5Coverline%7Bg(%5Comega)%7D%5C,d%20%5Cnu(%5Comega).%0A"></p>
<p>If we define <img src="https://latex.codecogs.com/png.latex?%0Au(s)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7De%5E%7Bi%5Comega%5ETs%7D%5C,dZ_%5Cnu(%5Comega),%0A"> then it follows immediately that <img src="https://latex.codecogs.com/png.latex?u(s)"> is mean zero and has covariance function <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(u(s)%5Coverline%7Bu(s')%7D)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7De%5E%7Bi%5Comega%5ET(s%20-%20s')%7D%5C,%20d%5Cnu(%5Comega)%20=%20c(s-s').%0A"> That is, <img src="https://latex.codecogs.com/png.latex?%5Cnu"> is the spectral measure associated with the covariance function.</p>
<p>Combining this with Bochner’s theorem, we have just proved<sup>82</sup> the spectral representation theorem for general<sup>83</sup> (weakly) stationary<sup>84</sup> random fields<sup>85</sup>.</p>
<div id="thm-spectral-rep" class="theorem">
<p><span class="theorem-title"><strong>Theorem 6 (Spectral representation theorem)</strong></span> If <img src="https://latex.codecogs.com/png.latex?%5Cnu"> is a finite, non-negative measure on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed"> and <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu"> is a complex <img src="https://latex.codecogs.com/png.latex?%5Cnu">-noise, then the complex-valued process <img src="https://latex.codecogs.com/png.latex?%0Au(s)%20=%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7De%5E%7Bi%5Comega%5ETs%7D%5C,dZ_%5Cnu(%5Comega)%0A"> has mean zero and covariance <img src="https://latex.codecogs.com/png.latex?%0Ac(s,s')%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7De%5E%7Bi%5Comega%5ET(s-s')%7D%5C,d%5Cnu(%5Comega)%0A"> and is therefore weakly stationary. If <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(A)%20%5Csim%20N(0,%20%5Cnu(A))"> then <img src="https://latex.codecogs.com/png.latex?u(s)"> is a Gaussian process.</p>
<p>Furthermore, every mean-square continuous mean zero stationary Gaussian process with covariance function <img src="https://latex.codecogs.com/png.latex?c(s,s')=%20c(s-s')"> and corresponding spectral measure <img src="https://latex.codecogs.com/png.latex?%5Cnu"> has an associated <img src="https://latex.codecogs.com/png.latex?%5Cnu">-noise <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(%5Ccdot)"> such that <img src="https://latex.codecogs.com/png.latex?%0Au(s)%20=%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7De%5E%7Bi%5Comega%5ETs%7D%5C,dZ_%5Cnu(%5Comega)%0A"> holds in the mean-square sense for all <img src="https://latex.codecogs.com/png.latex?s%20%5Cin%20%5Cmathbb%7BR%7D%5Ed">.</p>
<p><img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(%5Ccdot)"> is called the <em>spectral process</em> <sup>86</sup> associated with <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)">. When it exists, the density of <img src="https://latex.codecogs.com/png.latex?%5Cnu">, denoted by <img src="https://latex.codecogs.com/png.latex?f(%5Comega)">, is called the <em>spectral density</em> or the <em>power spectrum</em>.</p>
</div>
<p>All throughout here I used complex numbers and complex Gaussian processes because, believe it or not, it makes things easier. But you will be pleased to know that <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)"> will be real-valued as long as the spectral density <img src="https://latex.codecogs.com/png.latex?f(%5Comega)"> is symmetric around the origin. And it always is.</p>
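<p>The spectral representation is also a perfectly serviceable recipe for <em>simulating</em> a stationary GP. A minimal sketch (my own, with the N(0, 1) density as the assumed spectral density, so the target covariance is exp(−h²/2)): discretise the positive frequencies, draw independent cosine and sine amplitudes, and check the empirical covariance.</p>

```python
import numpy as np

# Real form of u(s) = int exp(i omega s) dZ(omega): independent cosine and
# sine amplitudes with variance 2 * f(omega) * d_omega, the factor 2 folding
# in the symmetric negative frequencies. Target covariance: exp(-h^2 / 2).
rng = np.random.default_rng(0)
d_omega = 0.05
omega = np.arange(d_omega / 2, 12.0, d_omega)  # positive frequencies only
f = np.exp(-omega**2 / 2) / np.sqrt(2 * np.pi)

n_rep = 20_000
scale = np.sqrt(2 * f * d_omega)
xi = rng.normal(size=(n_rep, omega.size)) * scale
eta = rng.normal(size=(n_rep, omega.size)) * scale

def u(s):
    return xi @ np.cos(omega * s) + eta @ np.sin(omega * s)

u0, u1 = u(0.0), u(1.0)
assert abs(np.mean(u0 * u0) - 1.0) < 0.05            # c(0) = 1
assert abs(np.mean(u0 * u1) - np.exp(-0.5)) < 0.05   # c(1) = exp(-1/2)
```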
</section>
<section id="the-cameron-martin-space-of-a-stationary-gaussian-process" class="level3">
<h3 class="anchored" data-anchor-id="the-cameron-martin-space-of-a-stationary-gaussian-process">The Cameron-Martin<sup>87</sup> space of a stationary Gaussian process</h3>
<p>One particular advantage of stationary processes is that we get a straightforward characterization of the Cameron-Martin space inner product. Recall that the Cameron-Martin space (or reproducing kernel Hilbert space) associated with a Gaussian process is the<sup>88</sup> space of all functions of the form <img src="https://latex.codecogs.com/png.latex?%0Ah(s)%20=%20%5Csum_%7Bk=1%7D%5EK%20c_k%20c(s,%20s_k),%0A"> where <img src="https://latex.codecogs.com/png.latex?K"> is finite, <img src="https://latex.codecogs.com/png.latex?c_k"> are real, and <img src="https://latex.codecogs.com/png.latex?s_k"> are distinct points in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">. This is the space that the posterior mean for GP regression lives in.</p>
<p>The inner product associated with this space can be written in terms of the spectral density <img src="https://latex.codecogs.com/png.latex?f"> as<sup>89</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Clangle%20h,%20h'%5Crangle%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20%5Chat%20h(%5Comega)%20%5Coverline%7B%5Chat%20%7Bh'%7D(%5Comega)%7D%20%5Cfrac%7B1%7D%7Bf(%5Comega)%7D%5C,d%5Comega.%0A"> In particular, for a Matérn Gaussian process, the corresponding squared norm is <img src="https://latex.codecogs.com/png.latex?%0A%5C%7C%20h%5C%7C_%7BH_u%7D%5E2%20=%20%5Cfrac%7B1%7D%7BC_%5Ctext%7BMat%C3%A9rn%7D%5Ckappa%5E%7B2%5Cnu%7D%5Csigma%5E2%7D%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%7C%5Chat%20h(%5Comega)%7C%5E2%20(%5Ckappa%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%7B%5Cnu+d/2%7D%5C,d%5Comega.%0A"> For those of you familiar with function spaces, this is equivalent to the norm on <img src="https://latex.codecogs.com/png.latex?H%5E%7B%5Cnu+d/2%7D(%5Cmathbb%7BR%7D%5Ed)">. One way to interpret this is that the <em>set</em> of functions in the Cameron-Martin space for a Matérn GP only depends on <img src="https://latex.codecogs.com/png.latex?%5Cnu">, while the norm and inner product (and hence the posterior mean and all that stuff) depend on <img src="https://latex.codecogs.com/png.latex?%5Cnu">, <img src="https://latex.codecogs.com/png.latex?%5Ckappa">, and <img src="https://latex.codecogs.com/png.latex?%5Csigma">. This observation is going to be important.</p>
</section>
<section id="another-look-at-equivalence-and-singularity" class="level3">
<h3 class="anchored" data-anchor-id="another-look-at-equivalence-and-singularity">Another look at equivalence and singularity</h3>
<p>It would’ve been a bit of an odd choice to spend all this time talking about spectral representations and never using them. So in this section, I’m going to cover the reason for the season: singularity or absolute continuity of Gaussian measures.</p>
<p>The Feldman-Hájek theorem, as quoted, holds on quite general spaces of functions. However, if we are willing to restrict ourselves to a separable<sup>90</sup> Hilbert<sup>91</sup> space, there is a much more refined version of the theorem that we can use.</p>
<div id="thm-continuity2" class="theorem">
<p><span class="theorem-title"><strong>Theorem 7 (Feldman-Hájek theorem (Taylor’s<sup>92</sup> version))</strong></span> Two Gaussian measures <img src="https://latex.codecogs.com/png.latex?%5Cmu_1"> (mean <img src="https://latex.codecogs.com/png.latex?m_1">, covariance operator<sup>93</sup> <img src="https://latex.codecogs.com/png.latex?C_1">) and <img src="https://latex.codecogs.com/png.latex?%5Cmu_2"> (mean <img src="https://latex.codecogs.com/png.latex?m_2">, covariance operator <img src="https://latex.codecogs.com/png.latex?C_2">) on a <em>separable Hilbert space</em> <img src="https://latex.codecogs.com/png.latex?X"> are equivalent (mutually absolutely continuous) <em>if and only if</em></p>
<ol type="1">
<li><p>The Cameron-Martin spaces associated with <img src="https://latex.codecogs.com/png.latex?%5Cmu_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_2"> are the same (considered as sets of functions; they usually will not have the same inner products),</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?m_1%20-%20m_2"> is in the<sup>94</sup> Cameron-Martin space, and</p></li>
<li><p>The operator <img src="https://latex.codecogs.com/png.latex?T%20=%20C_1%5E%7B-1/2%7DC_2C_1%5E%7B-1/2%7D%20-%20I"> is a Hilbert-Schmidt operator, that is, it has a countable set of eigenvalues <img src="https://latex.codecogs.com/png.latex?%5Cdelta_k"> and corresponding eigenfunctions <img src="https://latex.codecogs.com/png.latex?%5Cphi_k"> that satisfy <img src="https://latex.codecogs.com/png.latex?%5Cdelta_k%20%3E%20-1"> and <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bk=1%7D%5E%7B%5Cinfty%7D%5Cdelta_k%5E2%20%3C%20%5Cinfty.%0A"></p></li>
</ol>
<p>When these three conditions are fulfilled, the Radon-Nikodym derivative is <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cmu_1%7D%7Bd%5Cmu_2%7D%20=%20%5Cexp%5Cleft(-%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bk=1%7D%5E%5Cinfty%20%5Cleft(%5Cfrac%7B%5Cdelta_k%7D%7B1%20+%20%5Cdelta_k%7D%5Ceta_k%5E2%20-%20%5Clog(1+%5Cdelta_k)%5Cright)%5Cright),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Ceta_k"> is a sequence of N(0,1) random variables<sup>95</sup> <sup>96</sup> (under <img src="https://latex.codecogs.com/png.latex?%5Cmu_1">).</p>
<p>Otherwise, the two measures are singular.</p>
</div>
<p>This version of Feldman-Hájek is considerably more useful than its previous incarnation. The first condition basically says that the posterior means from the two priors will have the same smoothness and is rarely a problem. Typically the second condition is fulfilled in practice (for example, we always set the mean to zero).</p>
<p>The third condition is where all of the action is. This is, roughly speaking, a condition that says that <img src="https://latex.codecogs.com/png.latex?C_1"> and <img src="https://latex.codecogs.com/png.latex?C_2"> aren’t toooooo different. To understand this, we need to look a little at what the <img src="https://latex.codecogs.com/png.latex?%5Cdelta_k"> values actually are. It turns out to actually be easier to ask about <img src="https://latex.codecogs.com/png.latex?1+%20%5Cdelta_k">, which are the eigenvalues of <img src="https://latex.codecogs.com/png.latex?C_1%5E%7B-1/2%7DC_2%20C_1%5E%7B-1/2%7D">. In that case, we are trying to find the orthonormal system of functions <img src="https://latex.codecogs.com/png.latex?%5Cphi_k%5Cin%20X"> such that <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AC_1%5E%7B-1/2%7DC_2%20C_1%5E%7B-1/2%7D%5Cphi_k%20&amp;=%20(1+%5Cdelta_k)%20%5Cphi_k%20%5C%5C%0AC_1%5E%7B-1/2%7DC_2%20%5Cpsi_k%20&amp;=%20(1+%5Cdelta_k)%20C_1%5E%7B1/2%7D%5Cpsi_k%20%5C%5C%0AC_2%5Cpsi_k%20&amp;=(1+%5Cdelta_k)%20C_1%5Cpsi_k,%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?%5Cpsi_k%20=%20C_1%5E%7B-1/2%7D%5Cphi_k">.</p>
<p>Hence, we can roughly interpret the <img src="https://latex.codecogs.com/png.latex?%5Cdelta_k"> as the eigenvalues of <img src="https://latex.codecogs.com/png.latex?%0AC_1%5E%7B-1%7DC_2%20-%20I.%0A"> The Hilbert-Schmidt condition is then requiring that <img src="https://latex.codecogs.com/png.latex?C_1%5E%7B-1%7DC_2"> is not infinitely far from the identity mapping.</p>
<p>A particularly nice version of this theorem occurs when <img src="https://latex.codecogs.com/png.latex?C_1"> and <img src="https://latex.codecogs.com/png.latex?C_2"> have the <em>same</em> eigenvectors. This is a fairly restrictive assumption, but we are going to end up using it later, so it’s worth specialising. In that case, assuming <img src="https://latex.codecogs.com/png.latex?C_j"> has eigenvalues <img src="https://latex.codecogs.com/png.latex?%5Clambda_k%5E%7B(j)%7D"> and corresponding <img src="https://latex.codecogs.com/png.latex?L%5E2">-orthogonal eigenfunctions <img src="https://latex.codecogs.com/png.latex?%5Cphi_k(%5Ccdot)">, we can write<sup>97</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5BC_jh%5D(s)%20=%20%5Csum_%7Bk=1%7D%5E%5Cinfty%20%5Clambda_k%5E%7B(j)%7D%20%5Clangle%5Cphi_k,%20h%5Crangle%20%5Cphi_k(s).%0A"> Using the orthogonality of the eigenfunctions, we can show<sup>98</sup> that <img src="https://latex.codecogs.com/png.latex?%0A%5BC_j%5E%7B%5Cbeta%7Dh%5D(s)=%5Csum_%7Bk=1%7D%5E%5Cinfty%20(%5Clambda_k%5E%7B(j)%7D)%5E%5Cbeta%20%5Clangle%5Cphi_k,%20h%5Crangle%20%5Cphi_k(s).%0A"></p>
<p>With a bit of effort, we can see that <img src="https://latex.codecogs.com/png.latex?%0A(C_1%5E%7B-1/2%7DC_2C_1%5E%7B-1/2%7D%20-%20I)h%20=%20%5Csum_%7Bk=1%7D%5E%5Cinfty%20%5Cfrac%7B%5Clambda_k%5E%7B(2)%7D%20-%20%5Clambda_k%5E%7B(1)%7D%7D%7B%5Clambda_k%5E%7B(1)%7D%7D%20%5Clangle%5Cphi_k,%20h%5Crangle%20%5Cphi_k%0A"> and so <img src="https://latex.codecogs.com/png.latex?%0A%5Cdelta_k%20=%20%5Cfrac%7B%5Clambda_k%5E%7B(2)%7D%20-%20%5Clambda_k%5E%7B(1)%7D%7D%7B%5Clambda_k%5E%7B(1)%7D%7D.%0A"> From that, we get<sup>99</sup> the KL divergence <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Coperatorname%7BKL%7D(%5Cmu_1%20%7C%7C%20%5Cmu_2)%20&amp;=%20%5Cmathbb%7BE%7D_%7B%5Cmu_1%7D%5Clog%5Cleft(%5Cfrac%7Bd%5Cmu_1%7D%7Bd%5Cmu_2%7D%5Cright)%20%5C%5C%0A&amp;=-%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bk=1%7D%5E%5Cinfty%20%5Cleft(%5Cfrac%7B%5Cdelta_k%7D%7B1%20+%20%5Cdelta_k%7D%20-%20%5Clog(1+%5Cdelta_k)%5Cright)%20%5C%5C%0A&amp;=%20%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bk=1%7D%5E%5Cinfty%20%5Cleft%5B%5Cfrac%7B%5Clambda_k%5E%7B(1)%7D%7D%7B%5Clambda_k%5E%7B(2)%7D%7D%20-1+%20%5Clog%5Cleft(%5Cfrac%7B%5Clambda_k%5E%7B(2)%7D%7D%7B%5Clambda_k%5E%7B(1)%7D%7D%5Cright)%5Cright%5D.%0A%5Cend%7Balign*%7D"></p>
<p>Possibly unsurprisingly, this is simply the sum of the one-dimensional divergences <img src="https://latex.codecogs.com/png.latex?%0A%5Csum_%7Bk=1%7D%5E%5Cinfty%5Coperatorname%7BKL%7D(N(0,%5Clambda_k%5E%7B(1)%7D)%20%7C%7C%20N(0,%5Clambda_k%5E%7B(2)%7D)).%0A"> It’s fun to convince yourself that <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bk=1%7D%5E%5Cinfty%20%5Cdelta_k%5E2%20%3C%20%5Cinfty"> is sufficient to ensure the sum converges.</p>
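<p>For a finite collection of eigenvalues, it is easy to check this against the usual matrix formula for the KL divergence between mean-zero Gaussians. A sketch with made-up diagonal covariances (which trivially share eigenvectors):</p>

```python
import numpy as np

# Eigenvalue form of KL(N(0, C1) || N(0, C2)) for simultaneously
# diagonalisable covariances vs the standard matrix formula
#   (tr(C2^{-1} C1) - n + log det C2 - log det C1) / 2.
# The eigenvalues are arbitrary example values.
lam1 = np.array([1.0, 0.5, 0.25, 0.125])
lam2 = np.array([0.9, 0.6, 0.25, 0.1])

# Sum of one-dimensional divergences KL(N(0, lam1_k) || N(0, lam2_k)).
kl_eigen = 0.5 * np.sum(lam1 / lam2 - 1 + np.log(lam2 / lam1))

# Matrix form, with diagonal covariance matrices.
C1, C2 = np.diag(lam1), np.diag(lam2)
kl_matrix = 0.5 * (
    np.trace(np.linalg.solve(C2, C1))
    - len(lam1)
    + np.log(np.linalg.det(C2) / np.linalg.det(C1))
)
assert abs(kl_eigen - kl_matrix) < 1e-12
assert kl_eigen >= 0  # KL divergences are non-negative
```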
</section>
<section id="a-convenient-suffient-condition-for-absolute-continuity-which-turns-out-to-be-necessary-for-matérn-fields" class="level3">
<h3 class="anchored" data-anchor-id="a-convenient-suffient-condition-for-absolute-continuity-which-turns-out-to-be-necessary-for-matérn-fields">A convenient sufficient condition for absolute continuity, which turns out to be necessary for Matérn fields</h3>
<p>Ok. So I lied. I suggested that we’d use all of that spectral stuff in the last section. And we didn’t! Because I’m dastardly. But this time I promise we will!</p>
<p>It turns out that even with our fancy version of Feldman-Hájek, it can be difficult<sup>100</sup> to work out whether two Gaussian processes are singular or equivalent. One of the big challenges is that the eigenvalues and eigenfunctions depend on the domain <img src="https://latex.codecogs.com/png.latex?D"> and so we would, in principle, have to check this quite complex condition for every single domain.</p>
<p>Thankfully, there is an easy-to-parse sufficient condition that shows when two GPs are equivalent on <em>every</em> bounded domain. The condition is stated in terms of the spectral densities.</p>
<div id="thm-sufficient" class="theorem">
<p><span class="theorem-title"><strong>Theorem 8 (Sufficient condition for equivalence (Thm 4 of <a href="https://www.google.com/search?client=safari&amp;rls=en&amp;q=on+absolute+continuity+of+measures+with+application+to+homogenous+gaussian+fields&amp;ie=UTF-8&amp;oe=UTF-8">Skorokhod and Yadrenko</a>))</strong></span> Let <img src="https://latex.codecogs.com/png.latex?u_1(%5Ccdot)"> and <img src="https://latex.codecogs.com/png.latex?u_2(%5Ccdot)"> be mean-zero Gaussian processes with spectral densities <img src="https://latex.codecogs.com/png.latex?f_j(%5Comega)">, <img src="https://latex.codecogs.com/png.latex?j=1,2">. Assume that <img src="https://latex.codecogs.com/png.latex?f_1(%5Comega)%5C%7C%5Comega%5C%7C%5E%5Calpha"> is bounded away from zero and infinity for some<sup>101</sup> <img src="https://latex.codecogs.com/png.latex?%5Calpha%3E0"> and <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cleft(%5Cfrac%7Bf_2(%5Comega)%20-%20f_1(%5Comega)%7D%7Bf_1(%5Comega)%7D%5Cright)%5E2%5C,d%5Comega%20%3C%20%5Cinfty.%0A"> Then the joint distributions of <img src="https://latex.codecogs.com/png.latex?%5C%7Bu_1(s):%20s%20%5Cin%20D%5C%7D"> and <img src="https://latex.codecogs.com/png.latex?%5C%7Bu_2(s):%20s%20%5Cin%20D%5C%7D"> are equivalent measures for every bounded region <img src="https://latex.codecogs.com/png.latex?D">.</p>
</div>
<p>The <a href="https://pages.stat.wisc.edu/~wahba/stat860public/pdf1/skorokhod.yadrenko.1973.pdf">proof</a> of this is pretty nifty. Essentially it constructs the operator <img src="https://latex.codecogs.com/png.latex?T+I"> in a sneaky<sup>102</sup> way and then bounds its trace on a rectangle containing <img src="https://latex.codecogs.com/png.latex?D">. That upper bound is finite precisely when the above integral is finite.</p>
<p>Now that we have a relatively simple condition for equivalence, let’s look at Matérn fields. In particular, we will assume <img src="https://latex.codecogs.com/png.latex?u_j(%5Ccdot)">, <img src="https://latex.codecogs.com/png.latex?j=1,2"> are two Matérn GPs with the same smoothness parameter <img src="https://latex.codecogs.com/png.latex?%5Cnu"> and other parameters<sup>103</sup> <img src="https://latex.codecogs.com/png.latex?(%5Ckappa_j,%20%5Csigma_j)">. The integral in the condition above becomes <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cleft(%5Cfrac%7Bf_2(%5Comega)%20-%20f_1(%5Comega)%7D%7Bf_1(%5Comega)%7D%5Cright)%5E2%5C,d%5Comega%20%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cleft(%5Cfrac%7B%5Ckappa_2%5E%7B2%5Cnu%7D%5Csigma_2%5E2(%5Ckappa_2%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%7B-%5Cnu%20-%20d/2%7D%20%7D%7B%5Ckappa_1%5E%7B2%5Cnu%7D%5Csigma_1%5E2(%5Ckappa_1%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%7B-%5Cnu%20-%20d/2%7D%7D-1%5Cright)%5E2%5C,d%5Comega.%0A"> We can save ourselves some trouble by considering two cases separately.</p>
<p><strong>Case 1:</strong> <img src="https://latex.codecogs.com/png.latex?%5Ckappa_1%5E%7B2%5Cnu%7D%5Csigma_1%5E2%20=%20%5Ckappa_2%5E%7B2%5Cnu%7D%5Csigma_2%5E2">.</p>
<p>In this case, we can make the change to spherical coordinates via the substitution <img src="https://latex.codecogs.com/png.latex?r%20=%20%5C%7C%5Comega%5C%7C"> and, again to save my poor fingers, let’s set <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%20%5Cnu%20+%20d/2">. The condition becomes <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_0%5E%5Cinfty%5Cleft%5B%5Cleft(%5Cfrac%7B%5Ckappa_1%5E2%20+%20r%5E2%20%7D%7B%5Ckappa_2%5E2%20+%20r%5E2%7D%5Cright)%5E%7B%5Calpha%7D-1%5Cright%5D%5E2r%5E%7Bd-1%7D%5C,dr%20%3C%20%5Cinfty.%0A"> To check that this integral is finite, first note that, near <img src="https://latex.codecogs.com/png.latex?r=0">, the integrand is<sup>104</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%7Br%5E%7Bd-1%7D%7D)">, so there is no problem there. Near <img src="https://latex.codecogs.com/png.latex?r%20=%20%5Cinfty"> (aka the other place bad stuff can happen), the integrand is <img src="https://latex.codecogs.com/png.latex?%0A%5Calpha%5E2(%5Ckappa_1%5E2%20-%20%5Ckappa_2%5E2)%5E2%20r%5E%7Bd-5%7D%20+%20%5Cmathcal%7BO%7D(r%5E%7Bd-7%7D).%0A"> This is integrable for large <img src="https://latex.codecogs.com/png.latex?r"> whenever<sup>105</sup> <img src="https://latex.codecogs.com/png.latex?d%20%5Cleq%203">. Hence, the two fields are equivalent whenever <img src="https://latex.codecogs.com/png.latex?d%5Cleq%203"> and <img src="https://latex.codecogs.com/png.latex?%5Ckappa_1%5E%7B2%5Cnu%7D%5Csigma_1%5E2%20=%20%5Ckappa_2%5E%7B2%5Cnu%7D%5Csigma_2%5E2">. It is harder, but possible, to show that the fields are singular when <img src="https://latex.codecogs.com/png.latex?d%3E4">. The case with <img src="https://latex.codecogs.com/png.latex?d=4"> is boring and nobody cares.</p>
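<p>One can expand the integrand near r = ∞ and check numerically (with arbitrary parameter values) that the leading term really is α²(κ₁² − κ₂²)² r^(d−5): the ratio of the integrand to this asymptote should tend to one.</p>

```python
import numpy as np

# g(r) = (((k1^2 + r^2) / (k2^2 + r^2))**alpha - 1)**2 * r**(d - 1)
# should behave like alpha^2 * (k1^2 - k2^2)**2 * r**(d - 5) as r -> infinity,
# which is integrable at infinity exactly when d <= 3.
# k1, k2, alpha and d are arbitrary example values.
k1, k2, alpha, d = 1.0, 2.0, 1.5, 2

def integrand(r):
    return (((k1**2 + r**2) / (k2**2 + r**2)) ** alpha - 1) ** 2 * r ** (d - 1)

lead = alpha**2 * (k1**2 - k2**2) ** 2
for r in [100.0, 1000.0]:
    ratio = integrand(r) / (lead * r ** (d - 5))
    assert abs(ratio - 1) < 0.01  # asymptote kicks in quickly
```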
<p><strong>Case 2: </strong> <img src="https://latex.codecogs.com/png.latex?%5Ckappa_1%5E%7B2%5Cnu%7D%5Csigma_1%5E2%20%5Cneq%20%5Ckappa_2%5E%7B2%5Cnu%7D%5Csigma_2%5E2">.</p>
<p>Let’s define <img src="https://latex.codecogs.com/png.latex?%5Csigma_3%20=%20%5Csigma_2(%5Ckappa_2/%5Ckappa_1)%5E%5Cnu">. Then it’s clear that <img src="https://latex.codecogs.com/png.latex?%5Ckappa_1%5E%7B2%5Cnu%7D%5Csigma_3%5E2%20=%20%5Ckappa_2%5E%7B2%5Cnu%7D%5Csigma_2%5E2"> and therefore the Matérn field <img src="https://latex.codecogs.com/png.latex?u_3"> with parameters <img src="https://latex.codecogs.com/png.latex?(%5Ckappa_1,%20%5Csigma_3,%20%5Cnu)"> is equivalent to <img src="https://latex.codecogs.com/png.latex?u_2(%5Ccdot)">.</p>
<p>We will now show that <img src="https://latex.codecogs.com/png.latex?u_1"> and <img src="https://latex.codecogs.com/png.latex?u_3"> are singular, which implies that <img src="https://latex.codecogs.com/png.latex?u_1"> and <img src="https://latex.codecogs.com/png.latex?u_2"> are singular. To do this, we just need to note that, as <img src="https://latex.codecogs.com/png.latex?u_1"> and <img src="https://latex.codecogs.com/png.latex?u_3"> have the <em>same</em> value of <img src="https://latex.codecogs.com/png.latex?%5Ckappa">, <img src="https://latex.codecogs.com/png.latex?%0Au_3(s)%20=%20%5Cfrac%7B%5Csigma_3%7D%7B%5Csigma_1%7Du_1(s).%0A"> We know, from the previous blog post, that <img src="https://latex.codecogs.com/png.latex?u_3"> and <img src="https://latex.codecogs.com/png.latex?u_1"> will be singular unless <img src="https://latex.codecogs.com/png.latex?%5Csigma_1%20=%20%5Csigma_3">, but this only happens when <img src="https://latex.codecogs.com/png.latex?%5Ckappa_1%5E%7B2%5Cnu%7D%5Csigma_1%5E2%20=%20%5Ckappa_2%5E%7B2%5Cnu%7D%5Csigma_2%5E2">, which is not true by assumption.</p>
<p>Hence we have proved the first part of the following Theorem due, in this form, to Zhang<sup>106</sup> (2004) and Anderes<sup>107</sup> (2010).</p>
<div id="thm-matern-equiv" class="theorem">
<p><span class="theorem-title"><strong>Theorem 9 (Thm 2 of <a href="https://www.stat.purdue.edu/~zhanghao/Paper/JASA2004.pdf">Zhang (2004)</a>)</strong></span> Two Gaussian processes on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">, <img src="https://latex.codecogs.com/png.latex?d%5Cleq%203">, with Matérn covariance functions with parameters <img src="https://latex.codecogs.com/png.latex?(%5Cell_j,%20%5Csigma_j,%20%5Cnu)">, <img src="https://latex.codecogs.com/png.latex?j=1,2"> induce equivalent Gaussian measures if and only if <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Csigma_1%5E2%7D%7B%5Cell_1%5E%7B2%5Cnu%7D%7D%20=%20%5Cfrac%7B%5Csigma_2%5E2%7D%7B%5Cell_2%5E%7B2%5Cnu%7D%7D.%0A"> When <img src="https://latex.codecogs.com/png.latex?d%20%3E%204">, the measures are always singular (<a href="https://projecteuclid.org/journals/annals-of-statistics/volume-38/issue-2/On-the-consistent-separation-of-scale-and-variance-for-Gaussian/10.1214/09-AOS725.full">Anderes, 2010</a>).</p>
</div>
</section>
</section>
<section id="part-3-deriving-the-pc-prior" class="level2">
<h2 class="anchored" data-anchor-id="part-3-deriving-the-pc-prior">Part 3: Deriving the PC prior</h2>
<p>With all of that in hand, we are finally (finally!) in a position to show that, in 3 or fewer dimensions, the PC prior distance is <img src="https://latex.codecogs.com/png.latex?d(%5Ckappa)%20=%20%5Ckappa%5E%7Bd/2%7D">. After this, we can put everything together! Hooray!</p>
<section id="approximating-the-kullback-leibler-divergence-for-a-matérn-random-field" class="level3">
<h3 class="anchored" data-anchor-id="approximating-the-kullback-leibler-divergence-for-a-matérn-random-field">Approximating the Kullback-Leibler divergence for a Matérn random field</h3>
<p>Now, you can find a proof of this in the appendix of our JASA paper, but to be honest it’s quite informal. And although you can sneak any old shite into JASA, this is a blog goddammit and a blog has integrity. So let’s do a significantly more rigorous version of the argument.</p>
<p>To do this, we will need to find the KL divergence between <img src="https://latex.codecogs.com/png.latex?u_1">, with parameters <img src="https://latex.codecogs.com/png.latex?(%5Ckappa,%20%5Ctau%20%5Ckappa%5E%7B-%5Cnu%7D,%20%5Cnu)">, and a base model <img src="https://latex.codecogs.com/png.latex?u_0"> with parameters <img src="https://latex.codecogs.com/png.latex?(%5Ckappa_0,%20%5Ctau%20%5Ckappa_0%5E%7B-%5Cnu%7D,%20%5Cnu)">, where <img src="https://latex.codecogs.com/png.latex?%5Ckappa_0"> is some fixed, small number and <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%3E0"> is fixed. We will actually be interested in the behaviour of the KL divergence as <img src="https://latex.codecogs.com/png.latex?%5Ckappa_0"> goes to zero. Why? Because <img src="https://latex.codecogs.com/png.latex?%5Ckappa_0%20=%200"> is our base model.</p>
<p>The specific choice of standard deviation in both models ensures that <img src="https://latex.codecogs.com/png.latex?%5Ckappa%5E%7B2%5Cnu%7D%5Csigma%5E2%20=%20%5Ckappa_0%5E%7B2%5Cnu%7D%5Csigma_0%5E2"> and so the KL divergence is finite.</p>
<p>In order to approximate the KL divergence, we are going to find a basis that simultaneously diagonalises both processes. In the paper, we simply declared that we could do this. And, morally, we can. But, as I said, a blog holds itself to a higher standard than mere morality. Here we strive for meaningless rigour.</p>
<p>To that end, we are going to spend a moment thinking about how this can be done in a way that isn’t intrinsically tied to a given domain <img src="https://latex.codecogs.com/png.latex?D">. There may well be a lot of different ways to do this, but the most obvious one is to notice that if <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)"> is <em>periodic</em> on the cube <img src="https://latex.codecogs.com/png.latex?%5B-L,L%5D%5Ed"> for some <img src="https://latex.codecogs.com/png.latex?L%20%5Cgg%200">, then it can be considered as a GP on a <img src="https://latex.codecogs.com/png.latex?d">-dimensional torus. If <img src="https://latex.codecogs.com/png.latex?L"> is large enough that <img src="https://latex.codecogs.com/png.latex?D%20%5Csubset%20%5B-L,L%5D%5Ed">, then we might be able to focus on our cube and forget all about the specific domain <img src="https://latex.codecogs.com/png.latex?D">.</p>
<p>A nice thing about periodic GPs is that we actually know their Karhunen-Loève<sup>108</sup> representation. In particular, if <img src="https://latex.codecogs.com/png.latex?c_p(%5Ccdot)"> is a stationary covariance function on a torus, then we<sup>109</sup> know that its eigenfunctions are <img src="https://latex.codecogs.com/png.latex?%0A%5Cphi_k(s)%20=%20e%5E%7B-%5Cfrac%7B2%5Cpi%20i%7D%7BL%7D%20k%5ETh%7D,%20%5Cquad%20k%20%5Cin%20%5Cmathbb%7BZ%7D%5Ed%0A"> and its eigenvalues are <img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_k%20=%20%5Cint_%7B%5Cmathbb%7BT%7D%5Ed%7D%20e%5E%7B-%5Cfrac%7B2%5Cpi%20i%7D%7BL%7D%20k%5ETh%7D%20c_p(h)%5C,dh.%0A"> This gives<sup>110</sup> <img src="https://latex.codecogs.com/png.latex?%0Ac_p(h)%20=%20%5Cleft(%5Cfrac%7B2%5Cpi%7D%7BL%7D%5Cright)%5Ed%20%5Csum_%7Bk%20%5Cin%20%5Cmathbb%7BZ%7D%5Ed%7D%5Clambda_k%20%20e%5E%7B-%5Cfrac%7B2%5Cpi%20i%7D%7BL%7D%20k%5ETh%7D.%0A"></p>
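<p>We can poke at this claim in a discrete setting (a sketch of mine; the covariance and grid size are arbitrary choices, not anything from the papers). On a discretised circle, the covariance matrix of a stationary periodic covariance is circulant, so the discrete Fourier modes should be exact eigenvectors, with eigenvalues given by the DFT of a single row.</p>

```python
import cmath
import math

N = 64            # grid points on the (unit) circle
ell = 0.1

def c_per(h):
    # A periodic stationary covariance on [0, 1): squared-exponential in the
    # chordal distance, which is positive definite on the circle.
    return math.exp(-math.sin(math.pi * h)**2 / (2 * ell**2))

row = [c_per(n / N) for n in range(N)]  # first row of the circulant covariance matrix

for k in (0, 1, 5):
    # Fourier mode v_k and its claimed eigenvalue: the DFT of the row.
    v = [cmath.exp(-2j * math.pi * k * n / N) for n in range(N)]
    lam = sum(row[n] * cmath.exp(2j * math.pi * k * n / N) for n in range(N))
    # Apply the covariance matrix C[n, m] = row[(n - m) % N] to v.
    Cv = [sum(row[(n - m) % N] * v[m] for m in range(N)) for n in range(N)]
    err = max(abs(Cv[n] - lam * v[n]) for n in range(N))
    assert err < 1e-9
```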
<p>Now we have some work to do. Firstly, our process is not periodic<sup>111</sup> on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">. That’s a bit of a barrier. Secondly, even if it were, we don’t actually know what <img src="https://latex.codecogs.com/png.latex?%5Clambda_k"> is going to be. This is probably<sup>112</sup> an issue.</p>
<p>So let’s make this sucker periodic. The trick is to note that, at long enough distances, <img src="https://latex.codecogs.com/png.latex?u(s)"> and <img src="https://latex.codecogs.com/png.latex?u(s')"> are almost uncorrelated. In particular, if <img src="https://latex.codecogs.com/png.latex?%5C%7Cs%20-%20s'%5C%7C%20%5Cgg%20%5Cell">, then <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7BCov%7D(u(s),%20u(s'))%20%5Capprox%200">. This means that if we are interested in <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)"> on a fixed domain <img src="https://latex.codecogs.com/png.latex?D">, then we can replace it with a GP <img src="https://latex.codecogs.com/png.latex?u_p(s)"> whose covariance function <img src="https://latex.codecogs.com/png.latex?c_p(%5Ccdot)"> is the periodic extension of <img src="https://latex.codecogs.com/png.latex?c(h)"> from <img src="https://latex.codecogs.com/png.latex?%5B-L,L%5D%5Ed"> to <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed"> (aka we just repeat it!).</p>
<p>This repetition won’t be noticed on <img src="https://latex.codecogs.com/png.latex?D"> as long as <img src="https://latex.codecogs.com/png.latex?L"> is big enough. But we can run into a small<sup>113</sup> problem. This procedure can lead to a covariance function <img src="https://latex.codecogs.com/png.latex?c_p(%5Ccdot)"> that is <em>not</em> positive definite. Big problem. Huge.</p>
<p>It turns out that one way to fix this is to use a smooth cutoff function <img src="https://latex.codecogs.com/png.latex?%5Cdelta(h)"> that is 1 on <img src="https://latex.codecogs.com/png.latex?%5B-L,L%5D%5Ed"> and 0 outside of <img src="https://latex.codecogs.com/png.latex?%5B-%5Cgamma,%5Cgamma%5D%5Ed">, where <img src="https://latex.codecogs.com/png.latex?L%3E0"> is big enough so that <img src="https://latex.codecogs.com/png.latex?D%20%5Csubset%20%5B-L,%20L%5D%5Ed"> and <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3E%20L">. We can then build the periodic extension of a stationary covariance function <img src="https://latex.codecogs.com/png.latex?c(%5Ccdot)"> as <img src="https://latex.codecogs.com/png.latex?%0Ac_p(h)%20=%20%5Csum_%7Bk%20%5Cin%20%5Cmathbb%7BZ%7D%5Ed%7Dc(h%20+%202Lk)%5Cdelta(h%20+%202%20Lk).%0A"> It’s important<sup>114</sup> to note that this is not the same thing as simply repeating the covariance function in a periodic manner. Near the boundaries (but outside of the domain) there will be some reach-around contamination. 
<a href="https://arxiv.org/abs/1603.05559">Bachmayr, Cohen, and Migliorati</a> show that this <em>does not work</em> for general stationary covariance functions, but does work under the additional condition that <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is big enough and there exist some <img src="https://latex.codecogs.com/png.latex?s%20%5Cgeq%20r%20%3E%20d/2"> and <img src="https://latex.codecogs.com/png.latex?0%20%3C%20%5Cunderline%7BC%7D%20%5Cleq%20%5Coverline%7BC%7D%20%3C%20%5Cinfty"> such that <img src="https://latex.codecogs.com/png.latex?%0A%5Cunderline%7BC%7D(1%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%7B-s%7D%20%5Cleq%20f(%5Comega)%5Cleq%20%5Coverline%7BC%7D(1%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%7B-r%7D.%0A"> This condition obviously holds for the Matérn covariance function, and <a href="https://arxiv.org/abs/1905.13522">Bachmayr, Graham, Nguyen, and Scheichl</a><sup>115</sup> showed that taking <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3E%20A(d,%20%5Cnu)%5Cell">, for some explicit function <img src="https://latex.codecogs.com/png.latex?A"> that only depends on <img src="https://latex.codecogs.com/png.latex?d"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu">, is sufficient to make this work.</p>
<p>The nice thing about this procedure is that <img src="https://latex.codecogs.com/png.latex?c_p(s-s')%20=%20c(s-s')"> as long as <img src="https://latex.codecogs.com/png.latex?s,%20s'%20%5Cin%20D">, which means that our inference is going to be <em>identical</em> on our sample as it would be with the non-periodic covariance function! Splendid!</p>
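<p>Here’s a 1-d sketch of that construction (all of the numbers below are arbitrary choices of mine). It builds the tapered periodisation with a smooth cosine cutoff and checks the restriction property on a small domain. It deliberately checks nothing about positive definiteness, which is the genuinely hard part that Bachmayr and friends dealt with.</p>

```python
import math

L, gamma = 2.0, 3.0     # need gamma > L and D inside [-L, L]
ell = 0.3

def c(h):
    # Exponential (Matern nu = 1/2) covariance
    return math.exp(-abs(h) / ell)

def delta(h):
    # Smooth cutoff: 1 on [-L, L], 0 outside [-gamma, gamma], cosine taper between
    a = abs(h)
    if a <= L:
        return 1.0
    if a >= gamma:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * (a - L) / (gamma - L)))

def c_p(h):
    # c_p(h) = sum_k c(h + 2 L k) delta(h + 2 L k); far-away copies are killed
    # by delta, so a small range of k suffices.
    return sum(c(h + 2 * L * k) * delta(h + 2 * L * k) for k in range(-3, 4))

# With D = [0, 0.5] the lags s - s' live in [-0.5, 0.5]. Since 2L - gamma = 1
# exceeds the diameter of D, every k != 0 term is fully cut off and
# c_p(s - s') = c(s - s') exactly, as claimed.
err = max(abs(c_p(i / 100 - 0.5) - c(i / 100 - 0.5)) for i in range(101))
assert err < 1e-12
```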
<p>Now that we have made a valid periodic extension (and hence we know what the eigenfunctions are), we need to work out what the corresponding eigenvalues are.</p>
<p>We know that <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20e%5E%7B-%5Cfrac%7B%5Cpi%20i%7D%7BL%7Dk%5ETh%7Dc(h)%5C,dh%20=%20f%5Cleft(%5Cfrac%7B%5Cpi%7D%7BL%7Dk%5Cright).%0A"> But it is not clear what will happen when we take the Fourier transform of <img src="https://latex.codecogs.com/png.latex?c_p(%5Ccdot)">.</p>
<p>Thankfully, the convolution theorem is here to help us and we know that, if <img src="https://latex.codecogs.com/png.latex?%5Ctheta(s)%20=%201%20-%20%5Cdelta(s)">, then <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20e%5E%7B-%5Cfrac%7B%5Cpi%20i%7D%7BL%7Dk%5ETh%7D(c(h)%20-%20c_p(h))%5C,dh%20=%20(%5Chat%7B%5Ctheta%7D*f)%5Cleft(%5Cfrac%7B%5Cpi%7D%7BL%7Dk%5Cright),%0A"> where <img src="https://latex.codecogs.com/png.latex?*"> is the convolution operator.</p>
<p>In the perfect world, <img src="https://latex.codecogs.com/png.latex?(%5Chat%7B%5Ctheta%7D*f)(%5Comega)"> would be very close to zero, so we can just replace the Fourier transform of <img src="https://latex.codecogs.com/png.latex?c_p"> with the Fourier transform of <img src="https://latex.codecogs.com/png.latex?c">. And thank god we live in a perfect world.</p>
<p>The specifics here are a bit tedious<sup>116</sup>, but you can show that <img src="https://latex.codecogs.com/png.latex?(%5Chat%7B%5Ctheta%7D*f)(%5Comega)%20%5Crightarrow%200"> as <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Crightarrow%20%5Cinfty">. For Matérn fields, Bachmayr et al. performed some heroic calculations to show that the difference is exponentially small as <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Crightarrow%20%5Cinfty"> and that, as long as <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3E%20A(d,%20%5Cnu)%20%5Cell">, everything is positive definite and lovely.</p>
<p>So after a bunch of effort and a bit of a literature dive, we have finally got a simultaneous eigenbasis and we can write our KL divergence as <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Coperatorname%7BKL%7D(u_1%20%7C%7C%20u_0)%20&amp;=%20%5Cfrac%7B1%7D%7B2%7D%20%5Csum_%7B%5Comega%20%5Cin%20%5Cfrac%7B2%5Cpi%7D%7BL%7D%5Cmathbb%7BZ%7D%7D%5Cleft%5B%5Cfrac%7Bf_1(%5Comega)%7D%7Bf_0(%5Comega)%7D%20-%201%20-%20%5Clog%20%5Cleft(%5Cfrac%7Bf_1(%5Comega)%7D%7Bf_0(%5Comega)%7D%5Cright)%5Cright%5D%20%5C%5C%0A&amp;=%20%5Cfrac%7B1%7D%7B2%7D%20%5Csum_%7B%5Comega%20%5Cin%20%5Cfrac%7B2%5Cpi%7D%7BL%7D%5Cmathbb%7BZ%7D%7D%5Cleft%5B%5Cfrac%7B(%5Ckappa_0%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%7B(%5Ckappa%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%20-%201%20-%20%5Clog%20%5Cleft(%5Cfrac%7B(%5Ckappa_0%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%7B(%5Ckappa%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%20%5Cright)%5Cright%5D.%0A%5Cend%7Balign*%7D"> We can write this as <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BKL%7D(u_1%20%7C%7C%20u_0)%20=%5Cfrac%7B1%7D%7B2%7D%20%5Cleft(%5Cfrac%7BL%20%5Ckappa%7D%7B2%5Cpi%7D%5Cright)%5Ed%20%5Csum_%7B%5Comega%20%5Cin%20%5Cfrac%7B2%5Cpi%7D%7BL%7D%5Cmathbb%7BZ%7D%7D%5Cleft(%5Cleft%5B%5Cfrac%7B(%5Ckappa_0%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%7B(%5Ckappa%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%20-%201%20-%20%5Clog%20%5Cleft(%5Cfrac%7B(%5Ckappa_0%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%7B(%5Ckappa%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%20%5Cright)+%5Cmathcal%7BO%7D(e%5E%7B-C%5Cgamma%7D)%5Cright%5D%5Cleft(%5Cfrac%7B2%5Cpi%7D%7BL%20%5Ckappa%7D%5Cright)%5Ed%5Cright)%20,%0A"> for some constant <img src="https://latex.codecogs.com/png.latex?C"> that you can actually work out but I really don’t need to. 
The important thing is that the error is exponentially small in <img src="https://latex.codecogs.com/png.latex?%5Cgamma">, which is very large and spiraling rapidly out towards infinity.</p>
<p>Then, noticing that the sum is just a trapezium rule approximation to a <img src="https://latex.codecogs.com/png.latex?d">-dimensional integral, we get, as <img src="https://latex.codecogs.com/png.latex?%5Ckappa_0%20%5Crightarrow%200"> (and hence <img src="https://latex.codecogs.com/png.latex?L,%20%5Cgamma%5Crightarrow%20%5Cinfty">), <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BKL%7D(u_1%20%7C%7C%20u_0)%20=%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft(%5Cfrac%7BL%20%5Ckappa%7D%7B2%5Cpi%7D%5Cright)%5Ed%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%5Cleft%5B%5Cfrac%7B((%5Ckappa_0/%5Ckappa)%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%7B(1%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%20-%201%20-%20%5Clog%20%5Cleft(%5Cfrac%7B((%5Ckappa_0/%5Ckappa)%5E2%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%7B(1%20+%20%5C%7C%5Comega%5C%7C%5E2)%5E%5Calpha%7D%20%5Cright)%5Cright%5D%20+%20%5Cmathcal%7BO%7D(1).%0A"> The integral converges whenever <img src="https://latex.codecogs.com/png.latex?d%20%5Cleq%203">.</p>
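<p>The trapezium-rule step is easy to check numerically in one dimension (again a sketch, with parameter values I picked out of a hat): the spectral sum over the frequency grid should be close to <em>L</em>/2π times the corresponding integral.</p>

```python
import math

kappa0, kappa, alpha = 0.5, 2.0, 1.0

def g(w):
    # Per-frequency KL contribution: x - 1 - log(x), with x the spectral ratio
    ratio = ((kappa0**2 + w**2) / (kappa**2 + w**2))**alpha
    return ratio - 1.0 - math.log(ratio)

L = 200.0
W = 100.0                               # truncation radius; g decays like w^-4
K = int(W * L / (2 * math.pi))
spectral_sum = sum(g(2 * math.pi * k / L) for k in range(-K, K + 1))

n = 200_000                             # fine Riemann sum for the integral on [-W, W]
h = 2 * W / n
integral = sum(g(-W + i * h) for i in range(n + 1)) * h

approx = (L / (2 * math.pi)) * integral
assert abs(spectral_sum - approx) / approx < 1e-3
```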
<p>This suggests that we can re-scale the distance by absorbing the <img src="https://latex.codecogs.com/png.latex?(L/(2%5Cpi))%5Ed"> into the constant in the PC prior, and get <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Ckappa)%20=%20%5Ckappa%5E%7Bd/2%7D.%0A"></p>
<p>This distance does not depend on the specific domain <img src="https://latex.codecogs.com/png.latex?D"> (or the observation locations), which is an improvement over the PC prior I derived in the introduction. Instead, it only assumes that <img src="https://latex.codecogs.com/png.latex?D"> is bounded, which isn’t really a big restriction in practice.</p>
</section>
<section id="the-pc-prior-for-sigma-ell" class="level3">
<h3 class="anchored" data-anchor-id="the-pc-prior-for-sigma-ell">The PC prior for <img src="https://latex.codecogs.com/png.latex?(%5Csigma,%20%5Cell)"></h3>
<p>With all of this in hand, we can now construct the PC prior. Instead of working directly with <img src="https://latex.codecogs.com/png.latex?(%5Csigma,%20%5Cell)">, we will instead derive the prior for the estimable parameter <img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%20%5Ckappa%5E%5Cnu%20%5Csigma">, and the non-estimable parameter <img src="https://latex.codecogs.com/png.latex?%5Ckappa">.</p>
<p>We know that <img src="https://latex.codecogs.com/png.latex?%5Ctau%5E2"> multiplies the covariance function of <img src="https://latex.codecogs.com/png.latex?u(%5Ccdot)">, so it makes sense to treat <img src="https://latex.codecogs.com/png.latex?%5Ctau"> like a standard deviation parameter. In this case, the PC prior is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Ctau%20%5Cmid%20%5Ckappa)%20=%20%5Clambda_%5Ctau(%5Ckappa)e%5E%7B-%5Clambda_%5Ctau(%5Ckappa)%20%5Ctau%7D.%0A"> The canny among you will have noticed that I have made the rate parameter for <img src="https://latex.codecogs.com/png.latex?%5Ctau"> depend on <img src="https://latex.codecogs.com/png.latex?%5Ckappa">. I have done this because the quantity of interest that we want our prior to control is the marginal standard deviation <img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%20%5Ckappa%5E%7B-%5Cnu%7D%5Ctau">, which is a function of <img src="https://latex.codecogs.com/png.latex?%5Ckappa">. If we want to ensure <img src="https://latex.codecogs.com/png.latex?%5CPr(%5Csigma%20%3E%20U_%5Csigma)%20=%20%5Calpha_%5Csigma">, we need <img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_%5Ctau(%5Ckappa)%20=%20-%5Ckappa%5E%7B-%5Cnu%7D%5Cfrac%7B%5Clog%20%5Calpha_%5Csigma%7D%7BU_%5Csigma%7D.%0A"></p>
<p>We can now derive the PC prior for <img src="https://latex.codecogs.com/png.latex?%5Ckappa">. Combining the distance that we just spent all that effort calculating with an exponential prior on <img src="https://latex.codecogs.com/png.latex?%5Ckappa%5E%7Bd/2%7D"> leads<sup>117</sup> to the prior <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Ckappa)%20=%20%5Cfrac%7Bd%7D%7B2%7D%5Clambda_%5Cell%20%5Ckappa%5E%7Bd/2-1%7De%5E%7B-%5Clambda_%5Cell%20%5Ckappa%5E%7Bd/2%7D%7D.%0A"> Note that in this case, <img src="https://latex.codecogs.com/png.latex?%5Clambda_%5Cell"> does not depend on any other parameters: this is because <img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20%5Csqrt%7B8%5Cnu%7D%5Ckappa%5E%7B-1%7D"> is our identifiable parameter. If we require <img src="https://latex.codecogs.com/png.latex?%5CPr(%5Cell%20%3C%20L_%5Cell)%20=%20%5Calpha_%5Cell">, we get <img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_%5Cell%20=%20-%5Cleft(%5Cfrac%7BL_%5Cell%7D%7B%5Csqrt%7B8%5Cnu%7D%7D%5Cright)%5E%7Bd/2%7D%20%5Clog%20%5Calpha_%5Cell.%0A"></p>
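<p>A quick Monte Carlo check that this calibration does what it says (the parameter values below are arbitrary): put an exponential prior with the stated rate on <code>kappa**(d/2)</code>, map to the length scale, and the tail probability lands on <code>alpha_ell</code>.</p>

```python
import math
import random

random.seed(1)
d, nu = 2, 1.0
L_ell, alpha_ell = 0.1, 0.05

# lambda_ell = (L_ell / sqrt(8 nu))^(d/2) * |log(alpha_ell)|, as in the text
lam = (L_ell / math.sqrt(8 * nu))**(d / 2) * abs(math.log(alpha_ell))

n = 200_000
hits = 0
for _ in range(n):
    e = random.expovariate(lam)         # e = kappa^(d/2) is exponential
    kappa = e**(2 / d)
    ell = math.sqrt(8 * nu) / kappa     # ell = sqrt(8 nu) / kappa
    hits += ell < L_ell

# Pr(ell < L_ell) should be alpha_ell = 0.05, up to Monte Carlo error
assert abs(hits / n - alpha_ell) < 0.01
```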
<p>Hence the joint PC prior on <img src="https://latex.codecogs.com/png.latex?(%5Ckappa,%20%5Ctau)">, which is emphatically <em>not</em> the product of two independent priors, is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Ckappa,%20%5Ctau)%20=%20%5Cfrac%7Bd%7D%7B2U_%5Csigma%7D%5Clog%20(%5Calpha_%5Cell)%5Clog(%5Calpha_%5Csigma)%5Cleft(%5Cfrac%7BL_%5Cell%7D%7B%5Csqrt%7B8%5Cnu%7D%7D%5Cright)%5E%7Bd/2%7D%20%5Ckappa%5E%7Bd/2%20-%20%5Cnu%20-%201%7D%5Cexp%5Cleft%5B-%5Cleft(%5Cfrac%7BL_%5Cell%7D%7B%5Csqrt%7B8%5Cnu%7D%7D%5Cright)%5E%7Bd/2%7D%7C%20%5Clog%20(%5Calpha_%5Cell)%7C%20%5Ckappa%5E%7Bd/2%7D%20-%5Cfrac%7B%7C%5Clog%20%5Calpha_%5Csigma%7C%7D%7BU_%5Csigma%7D%20%5Ctau%5Ckappa%5E%7B-%5Cnu%7D%5Cright%5D.%0A"></p>
<p>Great gowns, beautiful gowns.</p>
<p>Of course, we don’t want the prior on some weird parameterisation (even though we needed that parameterisation to derive it). We want it on the original parameterisation. And here is where some magic happens! When we transform this prior to <img src="https://latex.codecogs.com/png.latex?(%5Cell,%20%5Csigma)">-space it magically<sup>118</sup> becomes the product of two independent priors! In particular, the PC prior that encodes <img src="https://latex.codecogs.com/png.latex?%5CPr(%5Cell%20%3C%20L_%5Cell)%20=%20%5Calpha_%5Cell"> and <img src="https://latex.codecogs.com/png.latex?%5CPr(%5Csigma%20%3E%20U_%5Csigma)%20=%20%5Calpha_%5Csigma"> is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cell,%20%5Csigma)%20=%20%5Cleft%5B%5Cfrac%7Bd%7D%7B2%7D%7C%5Clog(%5Calpha_%5Cell)%7CL_%5Cell%5E%7Bd/2%7D%20%5Cell%5E%7B-d/2-1%7D%5Cexp%5Cleft(-%7C%5Clog(%5Calpha_%5Cell)%7CL_%5Cell%5E%7Bd/2%7D%20%5Cell%5E%7B-d/2%7D%5Cright)%5Cright%5D%20%5Ctimes%20%5Cleft%5B%5Cfrac%7B%7C%5Clog(%5Calpha_%5Csigma)%7C%7D%7BU_%5Csigma%7D%5Cexp%5Cleft(-%5Cfrac%7B%7C%5Clog(%5Calpha_%5Csigma)%7C%7D%7BU_%5Csigma%7D%5Csigma%5Cright)%5Cright%5D.%0A"></p>
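<p>And one last Monte Carlo sanity check on the factorised form (arbitrary parameter values again). The <code>ell</code> marginal has CDF <code>exp(-|log(alpha_ell)| * (L_ell/ell)**(d/2))</code> and the <code>sigma</code> marginal is exponential, so both can be sampled by inverting their CDFs, and the two probability statements should hold up to Monte Carlo error.</p>

```python
import math
import random

random.seed(2)
d = 2
L_ell, alpha_ell = 0.1, 0.05
U_sigma, alpha_sigma = 3.0, 0.05

c_ell = abs(math.log(alpha_ell)) * L_ell**(d / 2)
rate_sigma = abs(math.log(alpha_sigma)) / U_sigma

n = 200_000
ell_hits = sigma_hits = 0
for _ in range(n):
    # Invert F(t) = exp(-c_ell * t^(-d/2)) to sample the length scale
    u = random.random()
    ell = (c_ell / (-math.log(u)))**(2 / d)
    sigma = random.expovariate(rate_sigma)
    ell_hits += ell < L_ell
    sigma_hits += sigma > U_sigma

assert abs(ell_hits / n - alpha_ell) < 0.01      # Pr(ell < L_ell) = alpha_ell
assert abs(sigma_hits / n - alpha_sigma) < 0.01  # Pr(sigma > U_sigma) = alpha_sigma
```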
<p>It. Is. Finished.</p>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>The most common feedback was “I hung in for as long as I could”.↩︎</p></li>
<li id="fn2"><p>If you don’t think we’re gonna get our Maccabees on you’re dreamin’. Hell, I might have to post Enoch-ussy on main.↩︎</p></li>
<li id="fn3"><p><a href="https://projecteuclid.org/journals/statistical-science/volume-32/issue-1/Penalising-Model-Component-Complexity--A-Principled-Practical-Approach-to/10.1214/16-STS576.full">Penalised Complexity priors</a> (or PC priors) are my favourite thing. If you’re unfamiliar with them, I strongly recommend you read the <a href="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4.html">previous post</a> on PC priors to get a good grip on what they are, but essentially they’re a way to construct principled, weakly informative prior distributions. The key tool for PC priors is the Kullback-Leibler divergence between a model with parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and a fixed base model with parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0">. Computing the KL divergence between two GPs is, as we will see, a challenge.↩︎</p></li>
<li id="fn4"><p>Fun fact: when we were starting to work on PC priors we were calling them PCP priors, but then I remembered that one episode of CSI where some cheerleaders took PCP and ate their friend and we all agreed that that wasn’t the vibe we were going for.↩︎</p></li>
<li id="fn5"><p>you might just need to trust me at some points↩︎</p></li>
<li id="fn6"><p>It could be easily more complex with multilevel component, multiple GPs, time series components etc etc. But the simplest example is a GP regression.↩︎</p></li>
<li id="fn7"><p>The GP has mean zero for the same reason we usually centre our covariates: it lets the intercept model the overall mean.↩︎</p></li>
<li id="fn8"><p>Not just the likelihood but also everything else in the model↩︎</p></li>
<li id="fn9"><p>A challenge with reference priors is that they are often improper (aka they don’t integrate to 1). This causes some conceptual difficulties, but there is a whole theory of Bayes that’s mostly fine with this as long as the resulting posterior integrates to one. But this is by no means guaranteed and is typically only checked in very specific cases. Jim Berger, one of the bigger proponents of reference priors, used to bring his wife to conference poster sessions. When she got bored, she would simply find a grad student and ask them if they’d checked if the posterior was proper. Sometimes you need to make your own fun.↩︎</p></li>
<li id="fn10"><p>Hope has no place in statistics.↩︎</p></li>
<li id="fn11"><p>Remember that any number on the logit scale outside of <img src="https://latex.codecogs.com/png.latex?%5B-3,3%5D"> might as well be the same number↩︎</p></li>
<li id="fn12"><p><code>log(.Machine$integer.max) = 21.48756</code>↩︎</p></li>
<li id="fn13"><p><img src="https://latex.codecogs.com/png.latex?e%5E5%20%5Capprox%20148">, so 70% of the prior mass is less than that. 90% of the prior mass is less than <img src="https://latex.codecogs.com/png.latex?e%5E%7B10%7D%20%5Capprox%2022026"> and 99% is less than <img src="https://latex.codecogs.com/png.latex?10%5E%7B13%7D">. This is still a weak prior.↩︎</p></li>
<li id="fn14"><p>Conceptually. The mathematics of what happens as <img src="https://latex.codecogs.com/png.latex?%5Cell%20%5Crightarrow%200"> aren’t really worth focusing on.↩︎</p></li>
<li id="fn15"><p>Or, you know, linear functionals↩︎</p></li>
<li id="fn16"><p>You can find Bayesians who say that they don’t care if cross validation works or not. You can find Bayesians who will say just about anything.↩︎</p></li>
<li id="fn17"><p>There are lots of parameterisations, but they’re all easy to move between. Compared to wikipedia, we use the <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B8%7D"> scaling rather than the <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%7D"> scaling.↩︎</p></li>
<li id="fn18"><p>Everything in this post can be easily generalised to having different length scales on each dimension.↩︎</p></li>
<li id="fn19"><p>If you’ve not run into these before, <img src="https://latex.codecogs.com/png.latex?x%5E%7B%5Cnu%7DK_%5Cnu(x)"> is <a href="https://functions.wolfram.com/Bessel-TypeFunctions/BesselK/06/01/04/01/03/">finite at zero</a> and decreases monotonically in an exponential-ish fashion as <img src="https://latex.codecogs.com/png.latex?x%5Crightarrow%20%5Cinfty">.↩︎</p></li>
<li id="fn20"><p>Possibly trying several values and either selecting the best or stacking all of the models↩︎</p></li>
<li id="fn21"><p>Field because by rights GPs with multidimensional parameter spaces should be called <em>Gaussian Fields</em> but we can’t have nice things so whatever. Live your lives.↩︎</p></li>
<li id="fn22"><p>At which point you need to ask yourself if one goes there faster. It’s chaos.↩︎</p></li>
<li id="fn23"><p>Asymptotics as copaganda.↩︎</p></li>
<li id="fn24"><p>I mean, if you can repeat experiments that’s obviously amazing, but there are lots of situations where that is either not possible or not the greatest use of resources. There’s an interesting sub-field of statistical earth sciences that focuses on working out the value of getting new types of observations in spatial data. This particular variant of the value of information problem throws up some fun corners.↩︎</p></li>
<li id="fn25"><p>or hoping↩︎</p></li>
<li id="fn26"><p>in 3 or fewer dimensions↩︎</p></li>
<li id="fn27"><p>I have not fact checked this↩︎</p></li>
<li id="fn28"><p>Basically everything you care about. Feel free to google the technical definition. But any space with a metric is locally convex. Lots of things that aren’t metric spaces are too.↩︎</p></li>
<li id="fn29"><p>measurable↩︎</p></li>
<li id="fn30"><p>This will seem a bit weird if it’s the first time you’ve seen the concept. In finite dimensions (aka most of statistics) <em>every</em> Gaussian is equivalent to every other Gaussian. In fact, it’s equivalent to every other continuous distribution with non-zero density on the whole of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">. But shit gets weird when you’re dealing with functions and we just need to take a hit of the video head cleaner and breathe until we get used to it.↩︎</p></li>
<li id="fn31"><p>These measures <em>are not the same</em>. They just happen to be non-zero on the same sets.↩︎</p></li>
<li id="fn32"><p>This was proven in the monster GP blog post.↩︎</p></li>
<li id="fn33"><p>eg, computationally where Metropolis-Hastings acceptance probabilities have an annoying tendency to go to zero unless you are extraordinarily careful.↩︎</p></li>
<li id="fn34"><p>if it exists↩︎</p></li>
<li id="fn35"><p>This can be interpreted as the event that <img src="https://latex.codecogs.com/png.latex?%7C%5Chat%5Ctheta_n%20-%20%5Ctheta_0%7C%20%3E%20%5Cepsilon"> occurs infinitely many times for every epsilon. If this event occurs with any probability, it would strongly suggest that the estimator is not bloody converging.↩︎</p></li>
<li id="fn36"><p>or even many↩︎</p></li>
<li id="fn37"><p>Technically, a recent paper in JRSSSB said that if you add an iid Gaussian process you will get identifiability, but that’s maybe not the most realistic asymptotic approximation.↩︎</p></li>
<li id="fn38"><p>The fourth dimension is where mathematicians go to die↩︎</p></li>
<li id="fn39"><p>It’s computationally pretty expensive to plot the whole likelihood surface, so I’m just doing it along lines↩︎</p></li>
<li id="fn40"><p><code>partial</code> freezes a few parameter values, and <code>possibly</code> replaces any calls that return an error with an NA↩︎</p></li>
<li id="fn41"><p>That I could find↩︎</p></li>
<li id="fn42"><p>To be fair to van der Vaart and van Zanten their particular problem doesn’t necessarily have a ridge!↩︎</p></li>
<li id="fn43"><p>Saddle up for some spectral theory.↩︎</p></li>
<li id="fn44"><p>I’m terribly sorry.↩︎</p></li>
<li id="fn45"><p>I’m moderately sure that the preprint is pretty similar to the published version but I am not going to check.↩︎</p></li>
<li id="fn46"><p>Can’t stress enough that this is smoothness in a qualitative sense rather than in the more technical “how differentiable is it?” sense.↩︎</p></li>
<li id="fn47"><p>Truly going wild with the scare quotes. Always a sign of excellent writing.↩︎</p></li>
<li id="fn48"><p>For the usual smoothing spline with the square of the Laplacian, you need <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%202%20-%20d/2">. Other values of <img src="https://latex.codecogs.com/png.latex?%5Cnu"> still give you splines, just with different differentiability assumptions.↩︎</p></li>
<li id="fn49"><p>If your data is uniformly spaced, you can use the minimum. Otherwise, I suggest a low quantile of the distribution of distances. Or just a bit of nous.↩︎</p></li>
<li id="fn50"><p>The second half of this post is devoted to proving this. And it is <em>long</em>.↩︎</p></li>
<li id="fn51"><p>With this parameterisation it’s sometimes known as a Type-II Gumbel distribution. Because why not.↩︎</p></li>
<li id="fn52"><p>And <em>only</em> in this case! The reference prior changes a lot when there is a non-zero mean, when there are other covariates, when there is observation noise, etc etc. It really is quite a wobbly construction.↩︎</p></li>
<li id="fn53"><p>Readers, I have not bothered to show.↩︎</p></li>
<li id="fn54"><p>Part of why I’m reluctant to claim this is a good idea in particularly high dimensions is that volume in high dimensions is frankly a bit gross.↩︎</p></li>
<li id="fn55"><p>I, for one, love a sneaky transformation to spherical coordinates.↩︎</p></li>
<li id="fn56"><p>So why do all the technical shit to derive the PC prior when this option is just sitting there? Fuck you, that’s why.↩︎</p></li>
<li id="fn57"><p>This is sometimes called “automatic relevance determination” because words don’t have meaning anymore. Regardless, it’s a pretty sensible idea when you have a lot of covariates that can be quite different.↩︎</p></li>
<li id="fn58"><p>It is possible that a horseshoe-type prior on <img src="https://latex.codecogs.com/png.latex?%5Clog(%5Cell_j)"> would serve better, but there are going to be some issues as that will shrink the geometric mean of the length scales towards 1.↩︎</p></li>
<li id="fn59"><p>Part of the motivation for writing this was to actually have enough of the GP theory needed to think about these priors in a single place.↩︎</p></li>
<li id="fn60"><p>In fact, it’s isotropic, which is a stricter condition on most spaces. But there’s no real reason to specialise to isotropic processes so we simply won’t.↩︎</p></li>
<li id="fn61"><p>We are assuming that the mean is zero, but absent that assumption, we need to assume that the mean is constant.↩︎</p></li>
<li id="fn62"><p>For non-Gaussian processes, this property is known as <em>second-order</em> stationarity. For GPs this corresponds to strong stationarity, which is a property of the distribution rather than of the covariance function.↩︎</p></li>
<li id="fn63"><p>If you’ve been exposed to the concept of ergodicity of random fields you may be eligible for compensation.↩︎</p></li>
<li id="fn64"><p>Possibly with different length scales in different directions or some other form of anisotropy↩︎</p></li>
<li id="fn65"><p>This normalisation is to make my life easier.↩︎</p></li>
<li id="fn66"><p>Let’s not lie, I just jumped straight to complex numbers. Some of you are having flashbacks.↩︎</p></li>
<li id="fn67"><p>Fourier-Stieltjes↩︎</p></li>
<li id="fn68"><p>countably additive set function. Like a probability measure, but it doesn’t have to total to one↩︎</p></li>
<li id="fn69"><p>and complexify↩︎</p></li>
<li id="fn70"><p>or a Cameron-Martin space↩︎</p></li>
<li id="fn71"><p>That is, this measure bullshit isn’t just me pretending to be smart. It’s necessary.↩︎</p></li>
<li id="fn72"><p>Feeling annoyed by a reparameterisation this late in the blog post? Well tough. I’ve got to type this shit out and if I had to track all of those <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B8%5Cnu%7D">s I would simply curl up and die.↩︎</p></li>
<li id="fn73"><p>In my whole damn life I have never successfully got the constant correct, so maybe check that yourself. But truly it does not matter. All that matters for the purposes of this post is the density as a function of <img src="https://latex.codecogs.com/png.latex?(%5Comega,%20%5Csigma,%5Ckappa)">.↩︎</p></li>
<li id="fn74"><p>This is not restricted to being Gaussian, but for all intents and porpoises it is.↩︎</p></li>
<li id="fn75"><p>Countably additive set function taking values in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BC%7D">↩︎</p></li>
<li id="fn76"><p><img src="https://latex.codecogs.com/png.latex?%5Cnu">-measurable↩︎</p></li>
<li id="fn77"><p><img src="https://latex.codecogs.com/png.latex?A%20%5Ccap%20B%20=%20%5Cemptyset">↩︎</p></li>
<li id="fn78"><p>If <img src="https://latex.codecogs.com/png.latex?Z_%5Cnu(A)"> is also Gaussian then this is the same as them being independent↩︎</p></li>
<li id="fn79"><p>This is the technical term for this type of function because mathematicians weren’t hugged enough as children.↩︎</p></li>
<li id="fn80"><p>for a particular value of “any”↩︎</p></li>
<li id="fn81"><p>for a particular value of “ordinary”↩︎</p></li>
<li id="fn82"><p>Well enough for a statistician anyway. You can look up the details but if you desperately need to formalise it, you build an isomorphism between <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bspan%7D%5C%7Bu(s),%20s%20%5Cin%20%5Cmathbb%7BR%7D%5Ed%5C%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bspan%7D%5C%7Be%5E%7Bi%5Comega%5ETs%7D,%20s%20%5Cin%20%5Cmathbb%7BR%7D%5Ed%5C%7D"> and use that to construct <img src="https://latex.codecogs.com/png.latex?W">. It’s not <em>wildly</em> difficult but it’s also not actually interesting except for mathturbatory reasons.↩︎</p></li>
<li id="fn83"><p>Non-Gaussian!↩︎</p></li>
<li id="fn84"><p>On more spaces, the same construction still works. Just use whatever Fourier transform you have available.↩︎</p></li>
<li id="fn85"><p>or stochastic processes↩︎</p></li>
<li id="fn86"><p>Yes, it’s a stochastic process over some <img src="https://latex.codecogs.com/png.latex?%5Csigma">-algebra of sets in my definition. <em>Sometimes</em> people use <img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%7BZ%7D_%5Cnu(s)%20=%20Z_%5Cnu((-%5Cinfty,%20s_1%5D%5Ctimes%5Ccdots%20%5Ctimes%20(-%5Cinfty,%20s_d%5D)%0A"> as the spectral process and interpret the integrals as Lebesgue-Stieltjes integrals. All power to them! So cute! It makes literally no difference and truly I do not think it makes anything easier. By the time you’re like “you know what, I reckon Stieltjes integrals are the way to go” you’ve left “easier” a few miles back. You’ve still got to come up with an appropriate concept of an integral.↩︎</p></li>
<li id="fn87"><p>Also known as the Reproducing Kernel Hilbert Space even though it doesn’t actually have to be one. This is the space of all means. See the previous GP blog.↩︎</p></li>
<li id="fn88"><p>closure of the↩︎</p></li>
<li id="fn89"><p>In <a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness.html">the previous post</a>, I wrote this in terms of the inverse of the covariance operator. For a stationary operator, the covariance operator is (by the convolution theorem) <img src="https://latex.codecogs.com/png.latex?%0ACh(s)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%7De%5E%7Bi%5Comega%20s%7D%5Chat%7Bh%7D(%5Comega)%20f(%5Comega)%5C,d%5Comega%0A"> and it should be pretty easy to convince yourself that <img src="https://latex.codecogs.com/png.latex?%0AC%5E%7B-1%7Dh(s)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%7De%5E%7Bi%5Comega%20s%7D%5Chat%7Bh%7D(%5Comega)%20%5Cfrac%7B1%7D%7Bf(%5Comega)%7D%5C,d%5Comega.%0A">↩︎</p></li>
<li id="fn90"><p>ie one where we can represent functions using a Fourier series rather than a Fourier transform↩︎</p></li>
<li id="fn91"><p>ie one with an inner product↩︎</p></li>
<li id="fn92"><p>Bogachev’s Gaussian Measures book, Corollary 6.4.11 with some interpretation work to make it slightly more human-readable. I also added the minus sign he missed in the density.↩︎</p></li>
<li id="fn93"><p>Recall that this is the integral operator <img src="https://latex.codecogs.com/png.latex?C_1%20f%20=%20%5Cint_D%20c_1(x,x')f(x')%5C,d%20x'">.↩︎</p></li>
<li id="fn94"><p>Because of condition 1, if it’s in one of them, it’s in the other too!↩︎</p></li>
<li id="fn95"><p>Technically, they are an orthonormal basis in the closure of <img src="https://latex.codecogs.com/png.latex?%5C%7B%5Cell%20-%5Cmu(%5Cell)%20:%20%5Cell%20%5Cin%20X%5E*%20%5C%7D"> under the <img src="https://latex.codecogs.com/png.latex?R_%7Bu_1%7D"> norm, but let’s just be friendly to ourselves and pretend <img src="https://latex.codecogs.com/png.latex?u_j"> have zero mean so these spaces are the same. The theorem is very explicit about what they are. If <img src="https://latex.codecogs.com/png.latex?%5Cphi_k"> are the (<img src="https://latex.codecogs.com/png.latex?X">-orthonormal) eigenfunctions corresponding to <img src="https://latex.codecogs.com/png.latex?%5Cdelta_k">, then <img src="https://latex.codecogs.com/png.latex?%0A%5Ceta_k%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20C_1%5E%7B1/2%7D%5Cphi_k(s)%5C,dW_1(s),%0A"> where <img src="https://latex.codecogs.com/png.latex?W_1(s)"> is the spectral process associated with <img src="https://latex.codecogs.com/png.latex?u_1">. Give or take, this is the same thing I said in the main text.↩︎</p></li>
<li id="fn96"><p>After reading all of that, let me tell you that it simply does not matter even a little bit.↩︎</p></li>
<li id="fn97"><p>Yes - this is Mercer’s theorem again. The only difference is that we are assuming that the eigenfunctions are the same for each <img src="https://latex.codecogs.com/png.latex?j"> so they don’t need an index.↩︎</p></li>
<li id="fn98"><p><img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AC_j%5E%5Cbeta%5BC_j%5E%7B-%5Cbeta%7Dh%5D%20&amp;=%20%5Csum_%7Bm=1%7D%5E%5Cinfty%20(%5Clambda_m%5E%7B(j)%7D)%5E%5Cbeta%20%5Cleft%5Clangle%5Cphi_m,%20%5Csum_%7Bk=1%7D%5E%5Cinfty%20(%5Clambda_k%5E%7B(j)%7D)%5E%7B-%5Cbeta%7D%20%5Clangle%5Cphi_k,%20h%5Crangle%20%5Cphi_k%5Cright%5Crangle%20%5Cphi_m%20%5C%5C%0A&amp;=%20%5Csum_%7Bm=1%7D%5E%5Cinfty%20(%5Clambda_m%5E%7B(j)%7D)%5E%5Cbeta%5Csum_%7Bk=1%7D%5E%5Cinfty%20(%5Clambda_k%5E%7B(j)%7D)%5E%7B-%5Cbeta%7D%20%5Clangle%5Cphi_k,%20h%5Crangle%20%5Cleft%5Clangle%5Cphi_m,%20%20%20%5Cphi_k%5Cright%5Crangle%20%5Cphi_m%20%5C%5C%0A&amp;=%5Csum_%7Bm=1%7D%5E%5Cinfty%20(%5Clambda_m%5E%7B(j)%7D)%5E%5Cbeta%20(%5Clambda_m%5E%7B(j)%7D)%5E%7B-%5Cbeta%7D%20%5Clangle%5Cphi_m,%20h%5Crangle%20%5Cphi_m%20%5C%5C%0A&amp;=%20h%0A%5Cend%7Balign*%7D">↩︎</p></li>
<li id="fn99"><p>You simply cannot make me care enough to prove that we can swap summation and expectation. Of course we bloody can. Also <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7B%5Cmu_1%7D%20%5Ceta_k%5E2%20=%201">.↩︎</p></li>
<li id="fn100"><p>But not impossible. <a href="https://arxiv.org/abs/2005.08904">Kristin Kirchner and David Bolin</a> have done some very nice work on this recently.↩︎</p></li>
<li id="fn101"><p>This is a stronger condition than the one in the paper, but it’s a) readily verifiable and b) domain independent.↩︎</p></li>
<li id="fn102"><p>This is legitimately quite hard to parse. You’ve got to back-transform their orthogonal basis <img src="https://latex.codecogs.com/png.latex?g_k"> to an orthogonal basis on <img src="https://latex.codecogs.com/png.latex?L%5E2(D)">, which is where those inverse square roots come from!↩︎</p></li>
<li id="fn103"><p>Remember <img src="https://latex.codecogs.com/png.latex?%5Ckappa%20=%20%5Csqrt%7B8%5Cnu%7D%5Cell%5E%7B-1%7D"> because Daddy hates typing.↩︎</p></li>
<li id="fn104"><p>Through the magical power of WolframAlpha or, you know, my own ability to do simple Taylor expansions.↩︎</p></li>
<li id="fn105"><p><img src="https://latex.codecogs.com/png.latex?d-5%3C-1">↩︎</p></li>
<li id="fn106"><p><img src="https://latex.codecogs.com/png.latex?d%5Cleq%203">↩︎</p></li>
<li id="fn107"><p><img src="https://latex.codecogs.com/png.latex?d%3E4">↩︎</p></li>
<li id="fn108"><p>The other KL. The spicy, secret KL. KL after dark. What Loève but a second-hand Karhunen?↩︎</p></li>
<li id="fn109"><p>This is particularly bold use of the inclusive voice here. You may or may not know. Nevertheless it is true.↩︎</p></li>
<li id="fn110"><p>Specifically, this kinda funky set of normalisation choices that statisticians love to make gives↩︎</p></li>
<li id="fn111"><p>If you think a bit about it, a periodic function on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed"> can be thought of as a process on a torus by joining the appropriate edges together!↩︎</p></li>
<li id="fn112"><p>We will see that this is not an issue, but you better bloody believe that our JASA paper just breezed the fuck past these considerations. Proof by citations that didn’t actually say what we needed them to say but were close enough for government work. Again, this is one of those situations where the thing we are doing is obviously valid, but the specifics (which are unimportant for our situation because we are going to send <img src="https://latex.codecogs.com/png.latex?%5Ckappa_0%5Crightarrow%200"> and <img src="https://latex.codecogs.com/png.latex?L%20%5Crightarrow%20%5Cinfty"> in a way that’s <em>much</em> faster than <img src="https://latex.codecogs.com/png.latex?%5Ckappa_0%5E%7B-1%7D">) are tedious and, I cannot stress this enough, completely unimportant in this context. But it’s a fucking blog and a blog has a type of fucking integrity that the Journal of the American Fucking Statistical Association does not even almost claim to have. I’ve had some red wine.↩︎</p></li>
<li id="fn113"><p>big↩︎</p></li>
<li id="fn114"><p>I cannot stress enough that we’re not bloody implementing this scheme, so it’s not even slightly important. Scan on, McDuff.↩︎</p></li>
<li id="fn115"><p>Fun fact. I worked in the same department as authors 2 and 4 for a while and they are both very lovely.↩︎</p></li>
<li id="fn116"><p>Check out either of the Bachmayr <em>et al.</em> papers if you’re interested.↩︎</p></li>
<li id="fn117"><p>Thanks Mr Jacobian!↩︎</p></li>
<li id="fn118"><p>I feel like I’ve typed enough, if you want to see the Jacobian read the appendices of the paper.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Priors for the Parameters in a {Gaussian} Process},
  date = {2022-09-27},
  url = {https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Priors for the Parameters in a Gaussian
Process.”</span> September 27, 2022. <a href="https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5.html">https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5.html</a>.
</div></div></section></div> ]]></description>
  <category>Prior distributions</category>
  <category>Gaussian Processes</category>
  <category>PC priors</category>
  <guid>https://dansblog.netlify.app/posts/2022-09-07-priors5/priors5.html</guid>
  <pubDate>Mon, 26 Sep 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-09-07-priors5/chair.JPG" medium="image"/>
</item>
<item>
  <title>A first look at multilevel regression; or Everybody’s got something to hide except me and my macaques</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey.html</link>
  <description><![CDATA[ 





<p><a href="https://www.elizablissmoreau.com">Eliza</a> knows a little something about monkeys. This will become relevant in a moment.</p>
<p>In about 2016, <a href="https://www.cell.com/current-biology/fulltext/S0960-9822(16)30460-2">Almeling <em>et al.</em></a> published a paper that suggested aged Barbary macaques maintained interest in members of their own species while losing interest in novel non-social stimuli (eg toys or puzzles with food inside).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/fx1.jpeg" class="img-fluid figure-img"></p>
<figcaption>I’d never come across the concept of a Graphical Abstract before, but here is the one for this paper. Graphic design is my passion. <a href="https://www.cell.com/current-biology/fulltext/S0960-9822(16)30460-2">Source</a></figcaption>
</figure>
</div>
<p>This is where Eliza—who knows a little something about monkeys—comes into frame: this did not gel with her experiences at all.</p>
<p>So Eliza (and <a href="https://scholar.google.com.au/citations?hl=en&amp;user=yg9U_okAAAAJ">Mark</a><sup>1</sup> <sup>2</sup>, who also knows a little something about monkeys) decided to look into it.</p>
<section id="what-are-the-stakes-according-to-the-papers-not-according-to-me-who-knows-exactly-nothing-about-this-type-of-work" class="level2">
<h2 class="anchored" data-anchor-id="what-are-the-stakes-according-to-the-papers-not-according-to-me-who-knows-exactly-nothing-about-this-type-of-work">What are the stakes? (According to the papers, not according to me, who knows exactly nothing<sup>3</sup> about this type of work)</h2>
<p>A big motivation for studying macaques and other non-human primates is that they’re good models of humans. This means that if there was solid evidence of macaques becoming less interested in novel stimuli as they age (while maintaining interest in other monkeys), this could suggest an evolutionary reason for this (commonly observed) behaviour in humans.</p>
<p>So if this result is true, it could help us understand the psychology of humans as they age (and in particular, the learned vs evolved trade-off they are making).</p>
</section>
<section id="so-what-did-eliza-and-mark-do" class="level2">
<h2 class="anchored" data-anchor-id="so-what-did-eliza-and-mark-do">So what did Eliza and Mark do?</h2>
<p>There are a few things you can do when confronted with a result that contradicts your experience: you can complain about it on the Internet, you can mobilize a direct replication effort, or you can conduct your own experiments. Eliza and Mark opted for the third option, designing a <em>conceptual replication</em>.</p>
<p>Direct replications tell you more about the specific experiment that was conducted, but not necessarily more about the phenomenon under investigation. In a study involving aged monkeys<sup>4</sup>, it’s difficult to imagine how a direct replication could take place.</p>
<p>On the other hand, a conceptual replication has a lot more flexibility. It allows you to probe the question in a more targeted manner, appropriate for incremental science. In this case, Eliza and Mark opted to study only the claim that the monkeys lose interest in novel stimuli as they age (<a href="https://royalsocietypublishing.org/doi/full/10.1098/rsos.182237">paper here</a>). They did not look into the social claim. They also used a slightly different species of macaque (<em>M. mulatta</em> rather than <em>M. sylvanus</em>). This is reasonable insofar as the goal is to understand macaques as a model for human behaviour.</p>
</section>
<section id="what-does-the-data-look-like" class="level2">
<h2 class="anchored" data-anchor-id="what-does-the-data-look-like">What does the data look like?</h2>
<p>The experiment used 243<sup>5</sup> monkeys aged between 4 and 30 and gave them a novel puzzle task (opening a fancy tube with food in it). The puzzle was fitted with an activity tracker. Each monkey had two tries at the puzzle, one on each of two days, with access for around<sup>6</sup> 20 minutes each time.</p>
<p>In order to match the original study’s analysis, Eliza and Mark divided the first two minutes into 15-second intervals and counted the number of intervals where the monkey interacted with the puzzle. They also measured the same thing over 20 minutes in order to see if there was a difference between short-term curiosity and more sustained exploration.</p>
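<p>As a minimal sketch of this binning (the activity counts below are invented for illustration, not taken from the real data), an interval counts as “active” if the tracker recorded any movement at all:</p>

```r
# Hypothetical activity-tracker counts, one value per 15-second interval.
# The first two minutes give 8 such intervals.
activity <- c(0, 12, 5, 0, 0, 3, 0, 9)

# An interval is "active" if any movement was recorded in it.
active_bins <- sum(activity > 0)
active_bins  # 4 of the 8 intervals were active
```

This is the same <code>sum(Activity &gt; 0)</code> trick that shows up in the data-wrangling pipeline below.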
<p>For each monkey, we have the following information:</p>
<ul>
<li>Monkey ID</li>
<li>Age (4-30)</li>
<li>Day (one or two)</li>
<li>Number of active intervals in the first two minutes (0-8)</li>
<li>Number of active intervals in the first twenty minutes (0-80)</li>
</ul>
<p>The data and their analysis are freely<sup>7</sup> available <a href="https://datadryad.org/stash/dataset/doi:10.5061/dryad.1bj133v">here</a>.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-2">acti_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"activity_data.csv"</span>) </span>
<span id="cb1-3">activity_2mins <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> acti_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(obs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(subj_id, Day) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">total=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(Activity), </span>
<span id="cb1-6">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">active_bins =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(Activity <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), </span>
<span id="cb1-7">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(age)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">monkey =</span> subj_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">day =</span> Day) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>()</span>
<span id="cb1-10"></span>
<span id="cb1-11">activity_20minms80 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> acti_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(obs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">81</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(subj_id, Day) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">total=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(Activity), </span>
<span id="cb1-14">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">active_bins =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(Activity <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), </span>
<span id="cb1-15">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(age)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">monkey =</span> subj_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">day =</span> Day) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>()</span>
<span id="cb1-18"></span>
<span id="cb1-19"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glimpse</span>(activity_20minms80)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 485
Columns: 5
$ monkey      &lt;dbl&gt; 0, 0, 88, 88, 636, 636, 760, 760, 1257, 1257, 1607, 1607, …
$ day         &lt;dbl&gt; 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2…
$ total       &lt;dbl&gt; 9881, 6356, 15833, 4988, 572, 308, 1097, 2916, 4884, 2366,…
$ active_bins &lt;int&gt; 42, 34, 43, 19, 10, 4, 12, 23, 50, 33, 9, 11, 13, 7, 30, 3…
$ age         &lt;dbl&gt; 29, 29, 29, 29, 28, 28, 30, 30, 27, 27, 27, 27, 27, 27, 26…</code></pre>
</div>
</div>
</section>
<section id="ok-mary-how-are-we-going-to-analyze-this-data" class="level2">
<h2 class="anchored" data-anchor-id="ok-mary-how-are-we-going-to-analyze-this-data">Ok Mary, how are we going to analyse this data?</h2>
<p>Eliza and Mark’s monkey data is an example of a fairly common type of experimental data, where the same subject is measured multiple times. It is useful to break the covariates down into three types: <em>grouping variables</em>, <em>group-level covariates</em>, and <em>individual-level covariates</em>.</p>
<p><em>Grouping variables</em> indicate what <em>group</em> each observation is in. We will see a lot of different ways of defining groups as we go on, but a core idea is that observations within a group should be conceptually more similar to each other than observations in different groups. For Eliza and Mark, their grouping variable is <code>monkey</code>. This encodes the idea that different monkeys might have very different levels of curiosity, but the same monkey across two different days would probably have fairly similar levels of curiosity.</p>
<p><em>Group-level covariates</em> are covariates that describe a feature of the <em>group</em> rather than the observation. In this example, <code>age</code> is a group-level covariate, because the monkeys are the same age at each observation.</p>
<p><em>Individual-level covariates</em> are covariates that describe a feature that is specific to an observation. (The nomenclature here can be a bit confusing: the “individual” refers to individual observations, not to individual monkeys. All good naming conventions go to shit eventually.) The individual-level covariate is experiment day. This can be a bit harder to see than the other designations, but it’s a little clearer if you think of it as an indicator of whether this is the first time the monkey has seen the task or the second time. Viewed this way, it is very clearly a measurement of a property of an observation rather than of a group.</p>
<p>Eliza and Mark’s monkey data is an example of a fairly general type of experimental data where subjects (our groups) are given the same task under different experimental conditions (described through individual-level covariates). As we will see, it’s not uncommon to have much more complex group definitions (that involve several grouping covariates) and larger sets of both group-level and individual-level covariates.</p>
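<p>To make the three covariate types concrete, here is a minimal sketch with an invented toy data frame (the data and the <code>levels_within_group</code> helper are illustrations, not part of Eliza and Mark’s analysis). A group-level covariate never varies within a group; an individual-level covariate does:</p>

```r
# Invented toy data in the same shape as the monkey data:
# `monkey` is the grouping variable, `age` is group-level,
# and `day` is individual-level.
toy <- data.frame(
  monkey      = rep(c("A", "B", "C"), each = 2),
  age         = rep(c(12, 7, 25), each = 2),  # constant within each monkey
  day         = rep(c(1, 2), times = 3),      # varies within each monkey
  active_bins = c(5, 3, 7, 6, 2, 1)
)

# Number of distinct values a covariate takes within each group
levels_within_group <- function(x, g) tapply(x, g, function(v) length(unique(v)))

levels_within_group(toy$age, toy$monkey)  # all 1: age is group-level
levels_within_group(toy$day, toy$monkey)  # all 2: day is individual-level
```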
<p>So how do we fit a model to this data?</p>
<section id="there-are-just-too-many-monkeys-or-why-cant-we-just-analyse-this-with-regression" class="level3">
<h3 class="anchored" data-anchor-id="there-are-just-too-many-monkeys-or-why-cant-we-just-analyse-this-with-regression">There are just too many monkeys; or Why can’t we just analyse this with regression?</h3>
<p>The temptation with this sort of data is to fit a linear regression to it as a first model. This treats the grouping, group-level, and individual-level covariates all in the same way. Let’s suck it and see.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(broom)</span>
<span id="cb3-2">fit_lm <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(active_bins <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> age<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(day) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(monkey), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> activity_2mins)</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tidy</span>(fit_lm) </span></code></pre></div>
<div class="cell-output-display">
<div data-pagedtable="false">
  <script data-pagedtable-source="" type="application/json">
{"columns":[{"label":["term"],"name":[1],"type":["chr"],"align":["left"]},{"label":["estimate"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["std.error"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["statistic"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["p.value"],"name":[5],"type":["dbl"],"align":["right"]}],"data":[{"1":"(Intercept)","2":"-6.688440e+00","3":"5.38400465","4":"-1.242280e+00","5":"0.215345864"},{"1":"age","2":"4.234598e-01","3":"0.21287645","4":"1.989228e+00","5":"0.047811367"},{"1":"factor(day)2","2":"1.879672e-03","3":"0.39658478","4":"4.739646e-03","5":"0.996222261"},{"1":"factor(monkey)88","2":"1.000000e+00","3":"1.70008898","4":"5.882045e-01","5":"0.556948129"},{"1":"factor(monkey)636","2":"-3.062500e+00","3":"1.60442379","4":"-1.908785e+00","5":"0.057482707"},{"1":"factor(monkey)760","2":"-2.937500e+00","3":"1.81569582","4":"-1.617837e+00","5":"0.107011195"},{"1":"factor(monkey)1257","2":"3.750000e-01","3":"1.53243949","4":"2.447079e-01","5":"0.806891726"},{"1":"factor(monkey)1607","2":"3.750000e-01","3":"1.53243949","4":"2.447079e-01","5":"0.806891726"},{"1":"factor(monkey)1632","2":"-2.125000e+00","3":"1.53243949","4":"-1.386678e+00","5":"0.166826659"},{"1":"factor(monkey)1860","2":"8.125000e-01","3":"1.48757785","4":"5.461899e-01","5":"0.585442774"},{"1":"factor(monkey)1869","2":"2.812500e+00","3":"1.48757785","4":"1.890657e+00","5":"0.059874728"},{"1":"factor(monkey)2191","2":"1.312500e+00","3":"1.48757785","4":"8.823068e-01","5":"0.378493776"},{"1":"factor(monkey)2637","2":"-2.500000e-01","3":"1.47232024","4":"-1.698000e-01","5":"0.865310455"},{"1":"factor(monkey)2747","2":"3.750000e+00","3":"1.47232024","4":"2.547000e+00","5":"0.011490788"},{"1":"factor(monkey)2833","2":"2.500000e-01","3":"1.47232024","4":"1.698000e-01","5":"0.865310455"},{"1":"factor(monkey)2912","2":"2.250000e+00","3":"1.47232024","4":"1.528200e+00","5":"0.127779712"},{"1":"factor(monkey)3536","2":"-3.125000e-01","3":"1.48757785","4":"-2
.100730e-01","5":"0.833788896"},{"1":"factor(monkey)3545","2":"-8.125000e-01","3":"1.48757785","4":"-5.461899e-01","5":"0.585442774"},{"1":"factor(monkey)3696","2":"-3.812500e+00","3":"1.48757785","4":"-2.562891e+00","5":"0.010991577"},{"1":"factor(monkey)4009","2":"1.125000e+00","3":"1.53243949","4":"7.341236e-01","5":"0.463590006"},{"1":"factor(monkey)4392","2":"1.250000e-01","3":"1.53243949","4":"8.156929e-02","5":"0.935057207"},{"1":"factor(monkey)4624","2":"-4.375000e-01","3":"1.60442379","4":"-2.726836e-01","5":"0.785330940"},{"1":"factor(monkey)4686","2":"1.062500e+00","3":"1.60442379","4":"6.622315e-01","5":"0.508458252"},{"1":"factor(monkey)4776","2":"3.562500e+00","3":"1.60442379","4":"2.220423e+00","5":"0.027324613"},{"1":"factor(monkey)4795","2":"1.562500e+00","3":"1.60442379","4":"9.738699e-01","5":"0.331101681"},{"1":"factor(monkey)4886","2":"5.625000e-01","3":"1.60442379","4":"3.505932e-01","5":"0.726201115"},{"1":"factor(monkey)4964","2":"3.062500e+00","3":"1.60442379","4":"1.908785e+00","5":"0.057482707"},{"1":"factor(monkey)5252","2":"4.500000e+00","3":"1.70008898","4":"2.646920e+00","5":"0.008660496"},{"1":"factor(monkey)5332","2":"-5.000000e-01","3":"1.70008898","4":"-2.941023e-01","5":"0.768933965"},{"1":"factor(monkey)5388","2":"3.000000e+00","3":"1.70008898","4":"1.764614e+00","5":"0.078900697"},{"1":"factor(monkey)5413","2":"5.500000e+00","3":"1.70008898","4":"3.235125e+00","5":"0.001386994"},{"1":"factor(monkey)5453","2":"2.000000e+00","3":"1.70008898","4":"1.176409e+00","5":"0.240596983"},{"1":"factor(monkey)5498","2":"-5.000000e-01","3":"1.70008898","4":"-2.941023e-01","5":"0.768933965"},{"1":"factor(monkey)5604","2":"-2.680485e-14","3":"1.70008898","4":"-1.576674e-14","5":"1.000000000"},{"1":"factor(monkey)5607","2":"5.000000e-01","3":"1.70008898","4":"2.941023e-01","5":"0.768933965"},{"1":"factor(monkey)5646","2":"-1.747008e-14","3":"1.70008898","4":"-1.027598e-14","5":"1.000000000"},{"1":"factor(monkey)5774","2":"2.000000e+00","3":"1.70
008898","4":"1.176409e+00","5":"0.240596983"},{"1":"factor(monkey)5895","2":"2.937500e+00","3":"1.81569582","4":"1.617837e+00","5":"0.107011195"},{"1":"factor(monkey)5967","2":"-6.250000e-02","3":"1.81569582","4":"-3.442207e-02","5":"0.972569200"},{"1":"factor(monkey)5990","2":"1.937500e+00","3":"1.81569582","4":"1.067084e+00","5":"0.287006110"},{"1":"factor(monkey)6008","2":"-6.250000e-02","3":"1.81569582","4":"-3.442207e-02","5":"0.972569200"},{"1":"factor(monkey)6098","2":"1.937500e+00","3":"1.81569582","4":"1.067084e+00","5":"0.287006110"},{"1":"factor(monkey)6159","2":"3.437500e+00","3":"1.81569582","4":"1.893214e+00","5":"0.059532471"},{"1":"factor(monkey)6258","2":"2.937500e+00","3":"1.81569582","4":"1.617837e+00","5":"0.107011195"},{"1":"factor(monkey)6264","2":"2.937500e+00","3":"1.81569582","4":"1.617837e+00","5":"0.107011195"},{"1":"factor(monkey)6287","2":"4.375000e-01","3":"1.81569582","4":"2.409545e-01","5":"0.809796132"},{"1":"factor(monkey)6505","2":"2.375000e+00","3":"1.94769661","4":"1.219389e+00","5":"0.223893579"},{"1":"factor(monkey)6512","2":"3.875000e+00","3":"1.94769661","4":"1.989530e+00","5":"0.047777867"},{"1":"factor(monkey)6516","2":"3.375000e+00","3":"1.94769661","4":"1.732816e+00","5":"0.084412906"},{"1":"factor(monkey)6807","2":"2.375000e+00","3":"1.94769661","4":"1.219389e+00","5":"0.223893579"},{"1":"factor(monkey)6877","2":"1.375000e+00","3":"1.94769661","4":"7.059621e-01","5":"0.480896409"},{"1":"factor(monkey)6930","2":"8.750000e-01","3":"1.94769661","4":"4.492486e-01","5":"0.653657739"},{"1":"factor(monkey)7261","2":"8.125000e-01","3":"2.09299182","4":"3.882003e-01","5":"0.698211947"},{"1":"factor(monkey)7289","2":"5.812500e+00","3":"2.09299182","4":"2.777125e+00","5":"0.005917223"},{"1":"factor(monkey)7307","2":"5.312500e+00","3":"2.09299182","4":"2.538233e+00","5":"0.011774807"},{"1":"factor(monkey)7321","2":"3.312500e+00","3":"2.09299182","4":"1.582663e+00","5":"0.114815239"},{"1":"factor(monkey)7333","2":"4.312500e+00","3":"
2.09299182","4":"2.060448e+00","5":"0.040433840"},{"1":"factor(monkey)7451","2":"2.312500e+00","3":"2.09299182","4":"1.104878e+00","5":"0.270319009"},{"1":"factor(monkey)7588","2":"-6.250000e-02","3":"1.81569582","4":"-3.442207e-02","5":"0.972569200"},{"1":"factor(monkey)7598","2":"3.750000e-01","3":"1.94769661","4":"1.925351e-01","5":"0.847485879"},{"1":"factor(monkey)7600","2":"-1.625000e+00","3":"1.94769661","4":"-8.343189e-01","5":"0.404930992"},{"1":"factor(monkey)7623","2":"2.812500e+00","3":"2.09299182","4":"1.343770e+00","5":"0.180291735"},{"1":"factor(monkey)7707","2":"1.312500e+00","3":"2.09299182","4":"6.270928e-01","5":"0.531194536"},{"1":"factor(monkey)7721","2":"1.812500e+00","3":"2.09299182","4":"8.659852e-01","5":"0.387363127"},{"1":"factor(monkey)7828","2":"1.812500e+00","3":"2.09299182","4":"8.659852e-01","5":"0.387363127"},{"1":"factor(monkey)7942","2":"4.375000e+00","3":"1.94769661","4":"2.246243e+00","5":"0.025598621"},{"1":"factor(monkey)7992","2":"5.250000e+00","3":"2.24900632","4":"2.334364e+00","5":"0.020402519"},{"1":"factor(monkey)8053","2":"4.750000e+00","3":"2.24900632","4":"2.112044e+00","5":"0.035716332"},{"1":"factor(monkey)8094","2":"2.750000e+00","3":"2.24900632","4":"1.222762e+00","5":"0.222618881"},{"1":"factor(monkey)8103","2":"2.250000e+00","3":"2.24900632","4":"1.000442e+00","5":"0.318104340"},{"1":"factor(monkey)8179","2":"4.250000e+00","3":"2.24900632","4":"1.889723e+00","5":"0.060000176"},{"1":"factor(monkey)8183","2":"-2.500000e-01","3":"2.24900632","4":"-1.111602e-01","5":"0.911582214"},{"1":"factor(monkey)8338","2":"1.750000e+00","3":"2.24900632","4":"7.781214e-01","5":"0.437263858"},{"1":"factor(monkey)8340","2":"4.750000e+00","3":"2.24900632","4":"2.112044e+00","5":"0.035716332"},{"1":"factor(monkey)8343","2":"4.250000e+00","3":"2.24900632","4":"1.889723e+00","5":"0.060000176"},{"1":"factor(monkey)8371","2":"5.750000e+00","3":"2.24900632","4":"2.556685e+00","5":"0.011184196"},{"1":"factor(monkey)8544","2":"2.750000e+00"
,"3":"2.24900632","4":"1.222762e+00","5":"0.222618881"},{"1":"factor(monkey)8611","2":"3.250000e+00","3":"2.24900632","4":"1.445083e+00","5":"0.149739064"},{"1":"factor(monkey)8729","2":"3.312500e+00","3":"2.09299182","4":"1.582663e+00","5":"0.114815239"},{"1":"factor(monkey)8834","2":"2.187500e+00","3":"2.41366237","4":"9.062991e-01","5":"0.365686559"},{"1":"factor(monkey)8873","2":"6.187500e+00","3":"2.41366237","4":"2.563532e+00","5":"0.010971865"},{"1":"factor(monkey)8956","2":"4.687500e+00","3":"2.41366237","4":"1.942069e+00","5":"0.053298767"},{"1":"factor(monkey)8963","2":"4.187500e+00","3":"2.41366237","4":"1.734915e+00","5":"0.084039563"},{"1":"factor(monkey)9009","2":"2.687500e+00","3":"2.41366237","4":"1.113453e+00","5":"0.266627767"},{"1":"factor(monkey)9014","2":"3.187500e+00","3":"2.41366237","4":"1.320607e+00","5":"0.187890272"},{"1":"factor(monkey)9023","2":"2.187500e+00","3":"2.41366237","4":"9.062991e-01","5":"0.365686559"},{"1":"factor(monkey)9117","2":"1.187500e+00","3":"2.41366237","4":"4.919909e-01","5":"0.623175457"},{"1":"factor(monkey)9344","2":"4.687500e+00","3":"2.41366237","4":"1.942069e+00","5":"0.053298767"},{"1":"factor(monkey)9355","2":"6.187500e+00","3":"2.41366237","4":"2.563532e+00","5":"0.010971865"},{"1":"factor(monkey)9411","2":"3.187500e+00","3":"2.41366237","4":"1.320607e+00","5":"0.187890272"},{"1":"factor(monkey)9416","2":"6.187500e+00","3":"2.41366237","4":"2.563532e+00","5":"0.010971865"},{"1":"factor(monkey)9542","2":"8.125000e-01","3":"1.48757785","4":"5.461899e-01","5":"0.585442774"},{"1":"factor(monkey)9598","2":"7.625000e+00","3":"2.58530938","4":"2.949357e+00","5":"0.003499076"},{"1":"factor(monkey)9609","2":"2.125000e+00","3":"2.58530938","4":"8.219519e-01","5":"0.411920078"},{"1":"factor(monkey)9615","2":"2.625000e+00","3":"2.58530938","4":"1.015352e+00","5":"0.310960381"},{"1":"factor(monkey)9662","2":"4.625000e+00","3":"2.58530938","4":"1.788954e+00","5":"0.074883259"},{"1":"factor(monkey)9680","2":"6.250000e-01"
,"3":"2.58530938","4":"2.417506e-01","5":"0.809179879"},{"1":"factor(monkey)9683","2":"5.125000e+00","3":"2.58530938","4":"1.982355e+00","5":"0.048580117"},{"1":"factor(monkey)9771","2":"6.250000e-01","3":"2.58530938","4":"2.417506e-01","5":"0.809179879"},{"1":"factor(monkey)9847","2":"3.125000e+00","3":"2.58530938","4":"1.208753e+00","5":"0.227947352"},{"1":"factor(monkey)9926","2":"6.625000e+00","3":"2.58530938","4":"2.562556e+00","5":"0.011001901"},{"1":"factor(monkey)9940","2":"3.625000e+00","3":"2.58530938","4":"1.402153e+00","5":"0.162161553"},{"1":"factor(monkey)9986","2":"3.625000e+00","3":"2.58530938","4":"1.402153e+00","5":"0.162161553"},{"1":"factor(monkey)10069","2":"-6.250000e-02","3":"1.81569582","4":"-3.442207e-02","5":"0.972569200"},{"1":"factor(monkey)10084","2":"2.437500e+00","3":"1.81569582","4":"1.342461e+00","5":"0.180715137"},{"1":"factor(monkey)10290","2":"4.625000e+00","3":"2.58530938","4":"1.788954e+00","5":"0.074883259"},{"1":"factor(monkey)10399","2":"-3.750000e-01","3":"1.53243949","4":"-2.447079e-01","5":"0.806891726"},{"1":"factor(monkey)10646","2":"6.062500e+00","3":"2.76264459","4":"2.194455e+00","5":"0.029161550"},{"1":"factor(monkey)10790","2":"6.562500e+00","3":"2.76264459","4":"2.375441e+00","5":"0.018314388"},{"1":"factor(monkey)10826","2":"3.062500e+00","3":"2.76264459","4":"1.108539e+00","5":"0.268738626"},{"1":"factor(monkey)10850","2":"3.562500e+00","3":"2.76264459","4":"1.289525e+00","5":"0.198456854"},{"1":"factor(monkey)10866","2":"2.562500e+00","3":"2.76264459","4":"9.275533e-01","5":"0.354571211"},{"1":"factor(monkey)11252","2":"8.562500e+00","3":"2.76264459","4":"3.099385e+00","5":"0.002170583"},{"1":"factor(monkey)11440","2":"5.500000e+00","3":"2.94464048","4":"1.867800e+00","5":"0.063008689"},{"1":"factor(monkey)11446","2":"8.000000e+00","3":"2.94464048","4":"2.716800e+00","5":"0.007071614"},{"1":"factor(monkey)11475","2":"3.000000e+00","3":"2.94464048","4":"1.018800e+00","5":"0.309323781"},{"1":"factor(monkey)11483",
"2":"5.500000e+00","3":"2.94464048","4":"1.867800e+00","5":"0.063008689"},{"1":"factor(monkey)11484","2":"6.500000e+00","3":"2.94464048","4":"2.207400e+00","5":"0.028232894"},{"1":"factor(monkey)11542","2":"3.500000e+00","3":"2.94464048","4":"1.188600e+00","5":"0.235771839"},{"1":"factor(monkey)11717","2":"4.000000e+00","3":"2.94464048","4":"1.358400e+00","5":"0.175612222"},{"1":"factor(monkey)11754","2":"4.000000e+00","3":"2.94464048","4":"1.358400e+00","5":"0.175612222"},{"1":"factor(monkey)11781","2":"6.000000e+00","3":"2.94464048","4":"2.037600e+00","5":"0.042686931"},{"1":"factor(monkey)11799","2":"7.000000e+00","3":"2.94464048","4":"2.377200e+00","5":"0.018229337"},{"1":"factor(monkey)11895","2":"2.500000e+00","3":"2.94464048","4":"8.490001e-01","5":"0.396727282"},{"1":"factor(monkey)11916","2":"5.000000e+00","3":"2.94464048","4":"1.698000e+00","5":"0.090803840"},{"1":"factor(monkey)12017","2":"5.500000e+00","3":"2.94464048","4":"1.867800e+00","5":"0.063008689"},{"1":"factor(monkey)12164","2":"5.437500e+00","3":"3.13048431","4":"1.736952e+00","5":"0.083678713"},{"1":"factor(monkey)12298","2":"8.937500e+00","3":"3.13048431","4":"2.854990e+00","5":"0.004680388"},{"1":"factor(monkey)12355","2":"7.937500e+00","3":"3.13048431","4":"2.535550e+00","5":"0.011862943"},{"1":"factor(monkey)12368","2":"3.437500e+00","3":"3.13048431","4":"1.098073e+00","5":"0.273273063"},{"1":"factor(monkey)12381","2":"4.937500e+00","3":"3.13048431","4":"1.577232e+00","5":"0.116059306"},{"1":"factor(monkey)12505","2":"6.437500e+00","3":"3.13048431","4":"2.056391e+00","5":"0.040826302"},{"1":"factor(monkey)12520","2":"4.937500e+00","3":"3.13048431","4":"1.577232e+00","5":"0.116059306"},{"1":"factor(monkey)12532","2":"6.437500e+00","3":"3.13048431","4":"2.056391e+00","5":"0.040826302"},{"1":"factor(monkey)12630","2":"5.437500e+00","3":"3.13048431","4":"1.736952e+00","5":"0.083678713"},{"1":"factor(monkey)12631","2":"7.437500e+00","3":"3.13048431","4":"2.375830e+00","5":"0.018295539"},{"1":"f
actor(monkey)12749","2":"6.937500e+00","3":"3.13048431","4":"2.216111e+00","5":"0.027622537"},{"1":"factor(monkey)12906","2":"2.437500e+00","3":"3.13048431","4":"7.786335e-01","5":"0.436962634"},{"1":"factor(monkey)12947","2":"5.375000e+00","3":"3.31952984","4":"1.619205e+00","5":"0.106716412"},{"1":"factor(monkey)12958","2":"7.375000e+00","3":"3.31952984","4":"2.221700e+00","5":"0.027236943"},{"1":"factor(monkey)13121","2":"4.875000e+00","3":"3.31952984","4":"1.468581e+00","5":"0.143255789"},{"1":"factor(monkey)13129","2":"5.375000e+00","3":"3.31952984","4":"1.619205e+00","5":"0.106716412"},{"1":"factor(monkey)13131","2":"4.875000e+00","3":"3.31952984","4":"1.468581e+00","5":"0.143255789"},{"1":"factor(monkey)13260","2":"5.875000e+00","3":"3.31952984","4":"1.769829e+00","5":"0.078025353"},{"1":"factor(monkey)13279","2":"5.375000e+00","3":"3.31952984","4":"1.619205e+00","5":"0.106716412"},{"1":"factor(monkey)13312","2":"4.375000e+00","3":"3.31952984","4":"1.317958e+00","5":"0.188774375"},{"1":"factor(monkey)13442","2":"3.375000e+00","3":"3.31952984","4":"1.016710e+00","5":"0.310315126"},{"1":"factor(monkey)13473","2":"6.375000e+00","3":"3.31952984","4":"1.920453e+00","5":"0.055985798"},{"1":"factor(monkey)13578","2":"4.875000e+00","3":"3.31952984","4":"1.468581e+00","5":"0.143255789"},{"1":"factor(monkey)13590","2":"6.875000e+00","3":"3.31952984","4":"2.071076e+00","5":"0.039420756"},{"1":"factor(monkey)13790","2":"5.812500e+00","3":"3.51125999","4":"1.655389e+00","5":"0.099152667"},{"1":"factor(monkey)13825","2":"4.812500e+00","3":"3.51125999","4":"1.370591e+00","5":"0.171783107"},{"1":"factor(monkey)13883","2":"6.312500e+00","3":"3.51125999","4":"1.797788e+00","5":"0.073467506"},{"1":"factor(monkey)13922","2":"7.812500e+00","3":"3.51125999","4":"2.224985e+00","5":"0.027012536"},{"1":"factor(monkey)14043","2":"7.312500e+00","3":"3.51125999","4":"2.082586e+00","5":"0.038348277"},{"1":"factor(monkey)14066","2":"4.812500e+00","3":"3.51125999","4":"1.370591e+00","5":"0
.171783107"},{"1":"factor(monkey)14077","2":"6.812500e+00","3":"3.51125999","4":"1.940187e+00","5":"0.053528408"},{"1":"factor(monkey)14137","2":"4.312500e+00","3":"3.51125999","4":"1.228192e+00","5":"0.220578143"},{"1":"factor(monkey)14165","2":"6.312500e+00","3":"3.51125999","4":"1.797788e+00","5":"0.073467506"},{"1":"factor(monkey)14177","2":"4.812500e+00","3":"3.51125999","4":"1.370591e+00","5":"0.171783107"},{"1":"factor(monkey)14307","2":"6.812500e+00","3":"3.51125999","4":"1.940187e+00","5":"0.053528408"},{"1":"factor(monkey)14323","2":"8.312500e+00","3":"3.51125999","4":"2.367384e+00","5":"0.018708484"},{"1":"factor(monkey)14351","2":"7.312500e+00","3":"3.51125999","4":"2.082586e+00","5":"0.038348277"},{"1":"factor(monkey)14651","2":"8.250000e+00","3":"3.70525802","4":"2.226566e+00","5":"0.026905107"},{"1":"factor(monkey)14666","2":"5.250000e+00","3":"3.70525802","4":"1.416905e+00","5":"0.157807347"},{"1":"factor(monkey)14699","2":"8.750000e+00","3":"3.70525802","4":"2.361509e+00","5":"0.019000512"},{"1":"factor(monkey)14823","2":"1.025000e+01","3":"3.70525802","4":"2.766339e+00","5":"0.006110166"},{"1":"factor(monkey)14826","2":"3.750000e+00","3":"3.70525802","4":"1.012075e+00","5":"0.312521306"},{"1":"factor(monkey)14902","2":"7.250000e+00","3":"3.70525802","4":"1.956679e+00","5":"0.051544843"},{"1":"factor(monkey)14919","2":"7.250000e+00","3":"3.70525802","4":"1.956679e+00","5":"0.051544843"},{"1":"factor(monkey)14985","2":"7.750000e+00","3":"3.70525802","4":"2.091622e+00","5":"0.037523810"},{"1":"factor(monkey)15041","2":"7.250000e+00","3":"3.70525802","4":"1.956679e+00","5":"0.051544843"},{"1":"factor(monkey)15088","2":"4.750000e+00","3":"3.70525802","4":"1.281962e+00","5":"0.201092964"},{"1":"factor(monkey)15120","2":"4.750000e+00","3":"3.70525802","4":"1.281962e+00","5":"0.201092964"},{"1":"factor(monkey)15218","2":"5.750000e+00","3":"3.70525802","4":"1.551849e+00","5":"0.122016071"},{"1":"factor(monkey)15530","2":"7.687500e+00","3":"3.90118562","4":"
1.970555e+00","5":"0.049924231"},{"1":"factor(monkey)15642","2":"7.187500e+00","3":"3.90118562","4":"1.842389e+00","5":"0.066651841"},{"1":"factor(monkey)15732","2":"6.687500e+00","3":"3.90118562","4":"1.714222e+00","5":"0.087778891"},{"1":"factor(monkey)15820","2":"5.687500e+00","3":"3.90118562","4":"1.457890e+00","5":"0.146178124"},{"1":"factor(monkey)15909","2":"4.687500e+00","3":"3.90118562","4":"1.201558e+00","5":"0.230719242"},{"1":"factor(monkey)15926","2":"7.187500e+00","3":"3.90118562","4":"1.842389e+00","5":"0.066651841"},{"1":"factor(monkey)16002","2":"9.187500e+00","3":"3.90118562","4":"2.355053e+00","5":"0.019326031"},{"1":"factor(monkey)16090","2":"7.687500e+00","3":"3.90118562","4":"1.970555e+00","5":"0.049924231"},{"1":"factor(monkey)16097","2":"5.187500e+00","3":"3.90118562","4":"1.329724e+00","5":"0.184871653"},{"1":"factor(monkey)16169","2":"9.187500e+00","3":"3.90118562","4":"2.355053e+00","5":"0.019326031"},{"1":"factor(monkey)16236","2":"5.187500e+00","3":"3.90118562","4":"1.329724e+00","5":"0.184871653"},{"1":"factor(monkey)16258","2":"6.187500e+00","3":"3.90118562","4":"1.586056e+00","5":"0.114043194"},{"1":"factor(monkey)16447","2":"1.162500e+01","3":"4.09876609","4":"2.836219e+00","5":"0.004954837"},{"1":"factor(monkey)16553","2":"7.625000e+00","3":"4.09876609","4":"1.860316e+00","5":"0.064064047"},{"1":"factor(monkey)16556","2":"8.625000e+00","3":"4.09876609","4":"2.104292e+00","5":"0.036393453"},{"1":"factor(monkey)16598","2":"5.625000e+00","3":"4.09876609","4":"1.372364e+00","5":"0.171231281"},{"1":"factor(monkey)16670","2":"5.625000e+00","3":"4.09876609","4":"1.372364e+00","5":"0.171231281"},{"1":"factor(monkey)16736","2":"9.125000e+00","3":"4.09876609","4":"2.226280e+00","5":"0.026924503"},{"1":"factor(monkey)16807","2":"5.125000e+00","3":"4.09876609","4":"1.250376e+00","5":"0.212379846"},{"1":"factor(monkey)16972","2":"9.125000e+00","3":"4.09876609","4":"2.226280e+00","5":"0.026924503"},{"1":"factor(monkey)17037","2":"8.125000e+00","3
":"4.09876609","4":"1.982304e+00","5":"0.048585829"},{"1":"factor(monkey)17047","2":"1.012500e+01","3":"4.09876609","4":"2.470256e+00","5":"0.014197944"},{"1":"factor(monkey)17069","2":"7.625000e+00","3":"4.09876609","4":"1.860316e+00","5":"0.064064047"},{"1":"factor(monkey)17103","2":"7.125000e+00","3":"4.09876609","4":"1.738328e+00","5":"0.083435526"},{"1":"factor(monkey)17285","2":"7.625000e+00","3":"4.09876609","4":"1.860316e+00","5":"0.064064047"},{"1":"factor(monkey)17516","2":"7.062500e+00","3":"4.29777147","4":"1.643294e+00","5":"0.101631615"},{"1":"factor(monkey)17718","2":"5.562500e+00","3":"4.29777147","4":"1.294275e+00","5":"0.196814278"},{"1":"factor(monkey)17738","2":"7.562500e+00","3":"4.29777147","4":"1.759633e+00","5":"0.079744124"},{"1":"factor(monkey)17794","2":"5.062500e+00","3":"4.29777147","4":"1.177936e+00","5":"0.239988783"},{"1":"factor(monkey)17825","2":"7.062500e+00","3":"4.29777147","4":"1.643294e+00","5":"0.101631615"},{"1":"factor(monkey)17927","2":"5.062500e+00","3":"4.29777147","4":"1.177936e+00","5":"0.239988783"},{"1":"factor(monkey)18038","2":"7.562500e+00","3":"4.29777147","4":"1.759633e+00","5":"0.079744124"},{"1":"factor(monkey)18067","2":"5.562500e+00","3":"4.29777147","4":"1.294275e+00","5":"0.196814278"},{"1":"factor(monkey)18069","2":"1.056250e+01","3":"4.29777147","4":"2.457669e+00","5":"0.014692320"},{"1":"factor(monkey)18106","2":"8.562500e+00","3":"4.29777147","4":"1.992312e+00","5":"0.047469830"},{"1":"factor(monkey)18153","2":"9.062500e+00","3":"4.29777147","4":"2.108651e+00","5":"0.036011350"},{"1":"factor(monkey)18230","2":"9.000000e+00","3":"4.49801264","4":"2.000884e+00","5":"0.046531245"},{"1":"factor(monkey)18232","2":"1.000000e+01","3":"4.49801264","4":"2.223204e+00","5":"0.027133995"},{"1":"factor(monkey)18364","2":"1.000000e+01","3":"4.49801264","4":"2.223204e+00","5":"0.027133995"},{"1":"factor(monkey)18407","2":"1.150000e+01","3":"4.49801264","4":"2.556685e+00","5":"0.011184196"},{"1":"factor(monkey)18450","
2":"9.000000e+00","3":"4.49801264","4":"2.000884e+00","5":"0.046531245"},{"1":"factor(monkey)18489","2":"5.571141e+00","3":"4.65807995","4":"1.196017e+00","5":"0.232870369"},{"1":"factor(monkey)18520","2":"1.100000e+01","3":"4.49801264","4":"2.445524e+00","5":"0.015183749"},{"1":"factor(monkey)18569","2":"8.500000e+00","3":"4.49801264","4":"1.889723e+00","5":"0.060000176"},{"1":"factor(monkey)18652","2":"1.100000e+01","3":"4.49801264","4":"2.445524e+00","5":"0.015183749"},{"1":"factor(monkey)18653","2":"9.000000e+00","3":"4.49801264","4":"2.000884e+00","5":"0.046531245"},{"1":"factor(monkey)18873","2":"9.000000e+00","3":"4.49801264","4":"2.000884e+00","5":"0.046531245"},{"1":"factor(monkey)18947","2":"1.100000e+01","3":"4.49801264","4":"2.445524e+00","5":"0.015183749"},{"1":"factor(monkey)19178","2":"5.937500e+00","3":"4.69933163","4":"1.263478e+00","5":"0.207643559"},{"1":"factor(monkey)19220","2":"9.937500e+00","3":"4.69933163","4":"2.114662e+00","5":"0.035490064"},{"1":"factor(monkey)19239","2":"8.937500e+00","3":"4.69933163","4":"1.901866e+00","5":"0.058386063"},{"1":"factor(monkey)22020","2":"3.312500e+00","3":"2.09299182","4":"1.582663e+00","5":"0.114815239"},{"1":"factor(monkey)22021","2":"2.812500e+00","3":"2.09299182","4":"1.343770e+00","5":"0.180291735"},{"1":"factor(monkey)22023","2":"6.250000e-02","3":"1.60442379","4":"3.895480e-02","5":"0.968958814"},{"1":"factor(monkey)22024","2":"7.500000e-01","3":"2.24900632","4":"3.334806e-01","5":"0.739062688"},{"1":"factor(monkey)22025","2":"-1.437500e+00","3":"1.60442379","4":"-8.959603e-01","5":"0.371171733"},{"1":"factor(monkey)22047","2":"1.875000e+00","3":"1.94769661","4":"9.626756e-01","5":"0.336679273"},{"1":"factor(monkey)22048","2":"3.750000e-01","3":"1.94769661","4":"1.925351e-01","5":"0.847485879"},{"1":"factor(monkey)22049","2":"2.875000e+00","3":"1.94769661","4":"1.476103e+00","5":"0.141227161"},{"1":"factor(monkey)22050","2":"-1.000000e+00","3":"1.70008898","4":"-5.882045e-01","5":"0.556948129"},{"1"
:"factor(monkey)22052","2":"1.562500e+00","3":"1.60442379","4":"9.738699e-01","5":"0.331101681"},{"1":"factor(monkey)22053","2":"2.625000e+00","3":"1.53243949","4":"1.712955e+00","5":"0.088012235"},{"1":"factor(monkey)22054","2":"1.875000e+00","3":"1.94769661","4":"9.626756e-01","5":"0.336679273"},{"1":"factor(monkey)22055","2":"3.625000e+00","3":"1.53243949","4":"2.365509e+00","5":"0.018801226"},{"1":"factor(monkey)22056","2":"2.937500e+00","3":"1.81569582","4":"1.617837e+00","5":"0.107011195"},{"1":"factor(monkey)22057","2":"2.000000e+00","3":"1.70008898","4":"1.176409e+00","5":"0.240596983"},{"1":"factor(monkey)22058","2":"2.500000e-01","3":"1.47232024","4":"1.698000e-01","5":"0.865310455"},{"1":"factor(monkey)22060","2":"-1.250000e-01","3":"1.94769661","4":"-6.417837e-02","5":"0.948881619"},{"1":"factor(monkey)22062","2":"2.500000e+00","3":"1.70008898","4":"1.470511e+00","5":"0.142733141"},{"1":"factor(monkey)22064","2":"NA","3":"NA","4":"NA","5":"NA"},{"1":"age:factor(day)2","2":"2.808043e-02","3":"0.02493246","4":"1.126260e+00","5":"0.261180467"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}}
  </script>
</div>
</div>
</div>
<p>So the first thing you will notice is that that is <em>a lot</em> of regression coefficients! There are 243 monkeys and 2 days, but only 485 observations. This isn’t enough data to reliably estimate all of these parameters. (Look at the standard errors for the monkey-related coefficients. They are huge!)</p>
<p>So what are we to do?</p>
<p>The problem is the monkeys. If we use <code>monkey</code> as a factor variable, we only have (at most) two observations of each factor level. That is simply not enough observations per monkey to estimate a different intercept for each one!</p>
<p>This type of model is often described as having <em>no pooling</em>, which indicates that there is no explicit dependence between the intercepts for each group (<code>monkey</code>). (There is some dependence between groups due to the group-level covariate <code>age</code>.)</p>
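<p>For reference, a no-pooling model like the one summarised in the table above can be produced by adding <code>monkey</code> as a factor to the regression formula. This is a sketch: the exact call isn’t shown here, and <code>fit_lm_no_pool</code> is a made-up name; it assumes the <code>activity_2mins</code> data frame from earlier in the post.</p>

```r
# Sketch of the no-pooling fit whose coefficient table appears above.
# Assumes the activity_2mins data frame from earlier in the post;
# fit_lm_no_pool is a name chosen here, not one from the post.
fit_lm_no_pool <- lm(
  active_bins ~ age * factor(day) + factor(monkey),
  data = activity_2mins
)
broom::tidy(fit_lm_no_pool)  # one intercept shift per monkey
```

<p>With 243 monkeys this produces a couple of hundred coefficients, which is exactly the problem described above.</p>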
</section>
<section id="if-we-ignore-the-monkeys-will-they-go-away-or-another-attempt-at-regression" class="level3">
<h3 class="anchored" data-anchor-id="if-we-ignore-the-monkeys-will-they-go-away-or-another-attempt-at-regression">If we ignore the monkeys, will they go away? or Another attempt at regression</h3>
<p>Our first attempt at a regression model didn’t work particularly well, but that doesn’t mean we should give up<sup>8</sup>. A second option is to assume that there is, fundamentally, no difference between monkeys. If all monkeys of the same age have similar amounts of interest in new puzzles, this would be a reasonable assumption. The best-case scenario is that not accounting for differences between individual monkeys still leads to approximately normal residuals, albeit probably with a larger residual variance.</p>
<p>This type of modelling assumption is called <em>complete pooling</em> as it pools the information between groups by treating them all as the same.</p>
<p>Let’s see what happens in this case!</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb4-1">fit_lm_pool <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(active_bins <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> age<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(day), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> activity_2mins)</span>
<span id="cb4-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(fit_lm_pool)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = active_bins ~ age * factor(day), data = activity_2mins)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5249 -1.5532  0.1415  1.6731  4.1884 

Coefficients:
                 Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)      3.789718   0.344466  11.002   &lt;2e-16 ***
age              0.003126   0.021696   0.144    0.885    
factor(day)2     0.056112   0.488818   0.115    0.909    
age:factor(day)2 0.025170   0.030759   0.818    0.414    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.103 on 481 degrees of freedom
Multiple R-squared:  0.01365,   Adjusted R-squared:  0.0075 
F-statistic: 2.219 on 3 and 481 DF,  p-value: 0.0851</code></pre>
</div>
</div>
<p>On the up side, the regression runs and doesn’t have too many parameters!</p>
<p>The brave and the bold might even try to interpret the coefficients and say something like <em>there doesn’t seem to be a strong effect of age</em>. But there’s real danger in trying to interpret regression coefficients in the presence of a potential confounder (in this case, the monkey ID). And it’s particularly bad form to do this without ever looking at any sort of regression diagnostics. Linear regression is not a magic eight ball.</p>
<p>Let’s look at the diagnostic plots.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(broom)</span>
<span id="cb6-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">augment</span>(fit_lm_pool) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> .fitted, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> active_bins <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> .fitted)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lm"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_classic</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">augment</span>(fit_lm_pool) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sample =</span> .std.resid)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb7-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_qq</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_abline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">slope =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">intercept =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb7-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_classic</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-4-2.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>There are certainly some patterns in those residuals (and some suggestion that the errors need a heavier tail for this model to make sense).</p>
</section>
</section>
<section id="what-is-between-no-pooling-and-complete-pooling-multilevel-models-thats-what" class="level2">
<h2 class="anchored" data-anchor-id="what-is-between-no-pooling-and-complete-pooling-multilevel-models-thats-what">What is between no pooling and complete pooling? Multilevel models, that’s what</h2>
<p>We are in a Goldilocks situation: no pooling results in a model that has too many independent parameters for the amount of data that we’ve got, while complete pooling has too few parameters to correctly account for the differences between the monkeys. So what is our perfectly tempered porridge<sup>9</sup>?</p>
<p>The answer is to assume that each monkey has its own intercept, but that its intercept can only be <em>so far</em> from the overall intercept (the one we would’ve gotten from complete pooling). There are a bunch of ways to realize this concept, but the classical method is to use a normal distribution.</p>
<p>In particular, if the <img src="https://latex.codecogs.com/png.latex?j">th monkey has observations <img src="https://latex.codecogs.com/png.latex?y_%7Bij%7D">, <img src="https://latex.codecogs.com/png.latex?i=1,2">, then we can write our model as <img src="https://latex.codecogs.com/png.latex?%0Ay_%7Bij%7D%20%20%5Csim%20N(%5Cmu_j%20+%20%5Cbeta_%5Ctext%7Bage%7D%5C,%20%5Ctext%7Bage%7D_j%20+%20%5Cbeta_%5Ctext%7Bday%7D%5C,%20%5Ctext%7Bday%7D_%7Bij%7D%20+%20%5Cbeta_%5Ctext%7Bage,day%7D%5C,%20%5Ctext%7B%5Bage*day%5D%7D_%7Bij%7D,%20%5Csigma%5E2).%0A"></p>
<p>The effects of age and day and the data standard deviation (<img src="https://latex.codecogs.com/png.latex?%5Csigma">) are just like they’d be in an ordinary linear regression model. Our modification comes in how we treat the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">.</p>
<p>In a classical linear regression model, we would fit the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">s independently, perhaps with some weakly informative prior distribution. But we’ve already discussed that that won’t work.</p>
<p>Instead we will make the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> <em>exchangeable</em> rather than independent. Exchangeability relaxes the independence assumption to instead encode that we have no idea which of the intercepts will do what. That is, if we switch around the labels of our intercepts the prior should not change. There is a long and storied history of exchangeable models in statistics, but the short version that is more than sufficient for our purposes is that they usually<sup>10</sup> take the form <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cmu_j%20%5Cmid%20%5Ctau%20%5Cstackrel%7B%5Ctext%7Biid%7D%7D%7B%5Csim%7D%20&amp;p(%5Cmu_j%20%5Cmid%20%5Ctau),%20%5Cqquad%20j%20=%201,%5Cldots,%20J%20%5C%5C%0A%5Ctau%20%5Csim%20&amp;%20p(%5Ctau).%0A%5Cend%7Balign*%7D"></p>
<p>In a regression context, we typically assume that <img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_j%20%5Cmid%20%5Ctau%20%5Csim%20N(%5Cmu,%20%5Ctau%5E2)%0A"> for some <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and <img src="https://latex.codecogs.com/png.latex?%5Ctau"> that will need their own priors.</p>
<p>We can explore this difference mathematically. The regression model, which assumes independence of the <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">, uses <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cmu_1,%20%5Cldots,%20%5Cmu_J)%20=%20%5Cprod_%7Bj=1%7D%5EJ%20N(%5Cmu,%20%5Ctau_%5Ctext%7Bfixed%7D%5E2)%0A"> as the joint prior on <img src="https://latex.codecogs.com/png.latex?%5Cmu_1,%5Cldots,%5Cmu_J">. On the other hand, the exchangeable model, which forms the basis of multilevel models, assumes the joint prior <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cmu_1,%20%5Cldots,%20%5Cmu_J)%20=%20%5Cint_0%5E%5Cinfty%20%5Cleft(%5Cprod_%7Bj=1%7D%5EJ%20N(%5Cmu,%20%5Ctau%5E2)%5Cright)p(%5Ctau)%5C,d%5Ctau,%0A"> for some prior <img src="https://latex.codecogs.com/png.latex?p(%5Ctau)"> on <img src="https://latex.codecogs.com/png.latex?%5Ctau">.</p>
<p>This might not seem like much of a change, but it can be quite profound. In both cases, the prior is saying that each <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> is, with high probability, at most <img src="https://latex.codecogs.com/png.latex?3%5Ctau"> away from the overall mean <img src="https://latex.codecogs.com/png.latex?%5Cmu">. The difference is that the classical least squares formulation uses a fixed value of <img src="https://latex.codecogs.com/png.latex?%5Ctau"> that needs to be specified by the modeller, while the exchangeable model lets <img src="https://latex.codecogs.com/png.latex?%5Ctau"> adapt to the data.</p>
<p>This data adaptation is really nifty! It means that if the groups have similar means, they can borrow information from the other groups (via the narrowing of <img src="https://latex.codecogs.com/png.latex?%5Ctau">) in order to improve their precision over an unpooled estimate. On the other hand, if there is a meaningful difference between the groups<sup>11</sup>, this model can still represent that, unlike the unpooled model.</p>
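<p>To make the borrowing-of-information idea concrete, here is a small illustration in base R (my own sketch, not part of the monkey analysis). When <img src="https://latex.codecogs.com/png.latex?%5Csigma"> and <img src="https://latex.codecogs.com/png.latex?%5Ctau"> are known, the conditional posterior mean of a group intercept is a precision-weighted average of that group’s sample mean and the overall mean, so a small <img src="https://latex.codecogs.com/png.latex?%5Ctau"> pulls the groups together while a large one leaves them roughly unpooled.</p>

```r
# Illustration only: partial pooling with known sigma and tau.
# The conditional posterior mean of mu_j is a precision-weighted average
#   (n_j/sigma^2 * ybar_j + 1/tau^2 * mu) / (n_j/sigma^2 + 1/tau^2).
partial_pool <- function(ybar_j, n_j, mu, sigma, tau) {
  w <- (n_j / sigma^2) / (n_j / sigma^2 + 1 / tau^2)
  w * ybar_j + (1 - w) * mu
}

# With two observations per group (like two days per monkey):
partial_pool(ybar_j = 1, n_j = 2, mu = 0, sigma = 1, tau = 0.1)  # shrunk hard towards 0
partial_pool(ybar_j = 1, n_j = 2, mu = 0, sigma = 1, tau = 10)   # barely shrunk at all
```

In the full Bayesian model <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is of course not fixed; the posterior averages over it, so the amount of shrinkage itself is learnt from the data.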
<p>In our context, however, we need a tiny bit more. We have a <em>group-level covariate</em> (specifically <code>age</code>) that we think is going to affect the group mean. So the model we want is <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ay_%7Bij%7D%20%20%5Cmid%20%5Cmu_j,%5Cbeta,%20%5Csigma%20&amp;%5Csim%20N(%5Cmu_j%20+%20%5Cbeta_%5Ctext%7Bday%7D%5C,%20%5Ctext%7Bday%7D_%7Bij%7D%20+%20%5Cbeta_%5Ctext%7Bage,day%7D%5C,%20%5Ctext%7B%5Bage*day%5D%7D_%7Bij%7D%20,%20%5Csigma%5E2)%20%5C%5C%0A%5Cmu_j%5Cmid%20%5Ctau,%20%5Cmu,%5Cbeta%20&amp;%5Csim%20N(%5Cmu%20+%20%20%5Cbeta_%5Ctext%7Bage%7D%5C,%20%5Ctext%7Bage%7D_j,%20%5Ctau%5E2)%20%5C%5C%0A%5Cmu%20&amp;%5Csim%20p(%5Cmu)%5C%5C%0A%5Cbeta%20&amp;%5Csim%20p(%5Cbeta)%5C%5C%0A%5Ctau%20&amp;%20%5Csim%20p(%5Ctau)%20%5C%5C%0A%5Csigma%20&amp;%5Csim%20p(%5Csigma).%0A%5Cend%7Balign*%7D"></p>
<p>In order to fully specify the model we need to set the four prior distributions.</p>
<p>This is an example of a <em>multilevel</em><sup>12</sup> <em>model</em>. The name comes from the data having multiple levels (in this case two: the observation level and the group level). Both levels have an appropriate model for their mean.</p>
<p>This mathematical representation does a good job in separating out the two different levels. However, there are a lot of other ways of writing multilevel models. An important example is the extended formula notation created<sup>13</sup> by R’s <code>lme4</code> package. In their notation, we would write this model as</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb8-1">formula <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> age_centred<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>day <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> monkey)</span></code></pre></div>
</div>
<p>The first bit of this formula is the same as the formula used in linear regression. The interesting bit is the <code>(1 | monkey)</code>. This is the way to tell R that the intercept (aka <code>1</code> in formula notation) is going to be grouped by <code>monkey</code> and we are going to put an exchangeable normal prior on it. For more complex models there are more complex variations on this theme, but for the moment we won’t go any further.</p>
</section>
<section id="reasoning-out-some-prior-distributions" class="level2">
<h2 class="anchored" data-anchor-id="reasoning-out-some-prior-distributions">Reasoning out some prior distributions</h2>
<p>We need to set priors. The canny amongst you may have noticed that I did not set priors in the previous two examples. There are two reasons for this: firstly I didn’t feel like it, and secondly none but the most terrible prior distributions would have meaningfully changed the conclusions. This is, it turns out, one of the great truths when it comes to prior distributions: <em>they do not matter until they do</em><sup>14</sup>.</p>
<p>In particular, if you have a parameter that <em>directly</em> sees the data (eg it’s in the likelihood) and there is nothing weird going on<sup>15</sup>, then the prior distribution will usually not do much as any prior will be quickly overwhelmed by the data.</p>
<p>The problem is that we have one parameter in our model (<img src="https://latex.codecogs.com/png.latex?%5Ctau">) that does not directly see the data. Instead of directly telling us about an observation, it tells us about how different the <em>groups</em> of observations are. There is usually less information in the data about this type of parameter and, consequently, the prior distribution will be more important. This is especially true when you have more than one grouping variable, or when a variable only has a small number of groups.</p>
<p>So let’s pay some proper attention to the priors.</p>
<p>To begin with, let’s set priors on <img src="https://latex.codecogs.com/png.latex?%5Cmu">, <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, and <img src="https://latex.codecogs.com/png.latex?%5Csigma"> (aka the data-level parameters). This is a <em>considerably</em> easier task if the data is scaled. Otherwise, you need to encode information about the usual scale<sup>16</sup> of the data into your priors. Sometimes this is a sensible and easy thing to do, but usually it’s easier to simply scale the data. (A lot of software will simply scale your data for you, but it is <em>always</em> better to do it yourself!)</p>
<p>So let’s scale our data. We have two variables that need scaling: <code>age</code> (aka the covariate that isn’t categorical) and <code>active_bins</code> (aka the response). For age, we are going to want to measure it as either <em>years from the youngest monkey</em> or <em>years from the average monkey</em>. I think, in this situation, the first version could make a lot of sense, but we are going with the second. This allows us to interpret <img src="https://latex.codecogs.com/png.latex?%5Cmu"> as the overall mean. Otherwise, <img src="https://latex.codecogs.com/png.latex?%5Cmu"> would tell us about the overall average activity of 4 year old monkeys, and we would use <img src="https://latex.codecogs.com/png.latex?%5Cbeta(%5Ctext%7Bage%7D_j%20-%204)"> to estimate how much the activity changes, on average keeping all other aspects constant, as the monkey ages.</p>
<p>On the other hand, we have no sensible baseline for activity, so deviation from the average seems like a sensible scaling. I also don’t know, <em>a priori</em>, how variable activity is going to be, so I might want to scale<sup>17</sup> it by its standard deviation. In this case, I’m not going to do that because we have a sensible fixed<sup>18</sup> upper limit (8), which I can scale by.</p>
<p>One important thing here is that if we scale the data by data-dependent quantities (like the minimum, the mean, or the standard deviation) we <em>must</em> keep track of this information. This is because <em>any</em> future data we try to predict with this model will need to be transformed <em>the same way using the same</em><sup>19</sup> <em>numbers</em>! This has particular implications when you are doing things like test/training set validation or cross validation: in the first case, the test set needs to be scaled in the same way the training set was; in the second case, each cross validation training set needs to be scaled independently and that scaling needs to be used on the corresponding left-out data<sup>20</sup>.
<div class="cell">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb9-1">age_centre <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(activity_2mins<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>age)</span>
<span id="cb9-2">age_scale <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diff</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">range</span>(activity_2mins<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>age))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb9-3">active_bins_centre <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span></span>
<span id="cb9-4"></span>
<span id="cb9-5">activity_2mins_scaled <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> activity_2mins <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">monkey =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(monkey),</span>
<span id="cb9-7">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">day =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(day),</span>
<span id="cb9-8">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age_centred =</span> (age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> age_centre)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>age_scale,</span>
<span id="cb9-9">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">active_bins_scaled =</span> (active_bins <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> active_bins_centre)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb9-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glimpse</span>(activity_2mins_scaled)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 485
Columns: 7
$ monkey             &lt;fct&gt; 0, 0, 88, 88, 636, 636, 760, 760, 1257, 1257, 1607,…
$ day                &lt;fct&gt; 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, …
$ total              &lt;dbl&gt; 495, 1003, 2642, 524, 199, 282, 363, 445, 96, 495, …
$ active_bins        &lt;int&gt; 6, 6, 8, 6, 2, 3, 3, 4, 3, 8, 6, 5, 3, 3, 6, 5, 8, …
$ age                &lt;dbl&gt; 29, 29, 29, 29, 28, 28, 30, 30, 27, 27, 27, 27, 27,…
$ age_centred        &lt;dbl&gt; 1.1054718, 1.1054718, 1.1054718, 1.1054718, 1.02854…
$ active_bins_scaled &lt;dbl&gt; 0.50, 0.50, 1.00, 0.50, -0.50, -0.25, -0.25, 0.00, …</code></pre>
</div>
</div>
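<p>To make the keep-your-scaling-constants point concrete, here is a short sketch of how stored constants would be applied to future observations rather than recomputed from them. (The numeric centre and scale values and the new monkeys below are stand-ins I made up for illustration, not the ones computed from the real data.)</p>

```r
# Hypothetical example: reuse the training-data scaling constants on new data.
# These values are stand-ins for the ones computed from the training set.
age_centre <- 27.9
age_scale  <- 1.5
active_bins_centre <- 4   # fixed, from the known 0-8 range of active_bins

new_monkeys <- data.frame(age = c(26, 31), active_bins = c(2, 7))

# Transform the new data with the *stored* constants, never recomputed ones.
new_monkeys$age_centred        <- (new_monkeys$age - age_centre) / age_scale
new_monkeys$active_bins_scaled <- (new_monkeys$active_bins - active_bins_centre) / 4
new_monkeys
```

Recomputing the centre and scale on the new data would silently change what the model’s coefficients mean, which is exactly the failure mode the paragraph above warns about.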
<p>With our scaling completed, we can now start thinking about prior distributions. The trick with priors is to make them wide enough to cover all plausible values of a parameter without making them so wide that they put a whole bunch of weight on essentially silly values.</p>
<p>We know, for instance, that our unscaled activity will go between 0 and 8. That means that it’s unlikely for the mean of the scaled process to be much bigger than 3 or 4. These considerations, along with the fact that we have centred the data so the mean should be closer to zero, suggest that a <img src="https://latex.codecogs.com/png.latex?N(0,1)"> prior should be appropriate for <img src="https://latex.codecogs.com/png.latex?%5Cmu">.</p>
<p>As we normalised our age data relative to the average age, we should think more carefully about the scaling of <img src="https://latex.codecogs.com/png.latex?%5Cbeta">. Macaques live for 20-30<sup>21</sup> years, so we need to think about, for instance, an ordinarily aged macaque that would be 15 years older than the baseline. Thanks to our scaling, the largest change that we can have is around 1, which strongly suggests that if <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> were much larger than <img src="https://latex.codecogs.com/png.latex?1%2F8"> we would be in unreasonable territory. So let’s put a <img src="https://latex.codecogs.com/png.latex?N(0,0.2%5E2)"> prior<sup>22</sup> on <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%5Ctext%7Bage%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%5Ctext%7Bage,day%7D">. For <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%5Ctext%7Bday%7D"> we can use a <img src="https://latex.codecogs.com/png.latex?N(0,1)"> prior.</p>
<p>Similarly, the scaling of <code>active_bins</code> suggests that a <img src="https://latex.codecogs.com/png.latex?N(0,1)"> prior would be sufficient for the data-level standard deviation <img src="https://latex.codecogs.com/png.latex?%5Csigma">.</p>
<p>That just leaves us with our choice of prior for the standard deviation of the intercept<sup>23</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">, <img src="https://latex.codecogs.com/png.latex?%5Ctau">. Thankfully, we considered this case in detail <a href="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4.html">in the previous blog post</a>. There I argued that a sensible prior for <img src="https://latex.codecogs.com/png.latex?%5Ctau"> would be an exponential prior. To be quite honest with you, a half-normal or a half-t also would be fine. But I’m going to stick to my guns. For the scaling, again, it would be a touch surprising (given the scaling of the data) if the group means were more than 3 apart, so choosing <img src="https://latex.codecogs.com/png.latex?%5Clambda=1"> in the exponential distribution should give a relatively weak prior without being so wide that we are putting prior mass on a bunch of values that we would never actually want to put prior mass on.</p>
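<p>As a quick sanity check of that exponential choice (this is just my illustration of the prior’s tail behaviour in base R, not code from the post), roughly 95% of the prior mass for <img src="https://latex.codecogs.com/png.latex?%5Ctau"> sits below 3, which matches the “group means within about 3 of each other” reasoning above.</p>

```r
# Tail behaviour of the exponential(1) prior on tau:
# P(tau <= 3) = 1 - exp(-3), so values above 3 are a priori rare.
pexp(3, rate = 1)       # prior mass below 3: about 0.95
qexp(0.95, rate = 1)    # 95th percentile: about 3
```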
<p>We can then fit the model with <code>brms</code>. In this case, I’m using the <code>cmdstanr</code> back end, because it’s fast and I like it.</p>
<p>To specify the model, we use the <code>lme4</code>-style formula notation discussed above.</p>
<p>To set the priors, we will use <code>brms</code>. Now, if you are Paul you might be able to remember how to set priors in <code>brms</code> without having to look it up, but I am sadly not Paul<sup>24</sup>, so every time I need to set priors in <code>brms</code> I write the formula and use the convenient <code>get_prior</code> function.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(cmdstanr)</span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(brms)</span>
<span id="cb11-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_prior</span>(formula, activity_2mins_scaled)</span></code></pre></div>
<div class="cell-output-display">
<div data-pagedtable="false">
  <script data-pagedtable-source="" type="application/json">
{"columns":[{"label":["prior"],"name":[1],"type":["chr"],"align":["left"]},{"label":["class"],"name":[2],"type":["chr"],"align":["left"]},{"label":["coef"],"name":[3],"type":["chr"],"align":["left"]},{"label":["group"],"name":[4],"type":["chr"],"align":["left"]},{"label":["resp"],"name":[5],"type":["chr"],"align":["left"]},{"label":["dpar"],"name":[6],"type":["chr"],"align":["left"]},{"label":["nlpar"],"name":[7],"type":["chr"],"align":["left"]},{"label":["lb"],"name":[8],"type":["chr"],"align":["left"]},{"label":["ub"],"name":[9],"type":["chr"],"align":["left"]},{"label":["source"],"name":[10],"type":["chr"],"align":["left"]}],"data":[{"1":"","2":"b","3":"","4":"","5":"","6":"","7":"","8":"","9":"","10":"default"},{"1":"","2":"b","3":"age_centred","4":"","5":"","6":"","7":"","8":"","9":"","10":"default"},{"1":"","2":"b","3":"age_centred:day2","4":"","5":"","6":"","7":"","8":"","9":"","10":"default"},{"1":"","2":"b","3":"day2","4":"","5":"","6":"","7":"","8":"","9":"","10":"default"},{"1":"student_t(3, 0, 2.5)","2":"Intercept","3":"","4":"","5":"","6":"","7":"","8":"","9":"","10":"default"},{"1":"student_t(3, 0, 2.5)","2":"sd","3":"","4":"","5":"","6":"","7":"","8":"0","9":"","10":"default"},{"1":"","2":"sd","3":"","4":"monkey","5":"","6":"","7":"","8":"","9":"","10":"default"},{"1":"","2":"sd","3":"Intercept","4":"monkey","5":"","6":"","7":"","8":"","9":"","10":"default"},{"1":"student_t(3, 0, 2.5)","2":"sigma","3":"","4":"","5":"","6":"","7":"","8":"0","9":"","10":"default"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}}
  </script>
</div>
</div>
</div>
<p>From this, we can see that the default prior on <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> is an improper flat prior, while the default prior on the intercept is a Student-t with 3 degrees of freedom, centred at zero, with scale 2.5. The same prior (restricted to positive numbers) is put on all of the standard deviation parameters. These default prior distributions are, to be honest, probably fine in this context<sup>25</sup>, but it is good practice to always set your prior.</p>
<p>We do this as follows. (Note that <code>brms</code> uses Stan, which parameterises the normal distribution by its mean and <em>standard deviation</em>!)</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb12-1">priors <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">coef =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age_centred"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb12-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">coef =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age_centred:day2"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb12-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">coef =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day2"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb12-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">class =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sigma"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb12-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exponential</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">class =</span> sd) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># tau</span></span>
<span id="cb12-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">class =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Intercept"</span>)</span>
<span id="cb12-7">priors</span></code></pre></div>
<div class="cell-output-display">
<div data-pagedtable="false">
  <script data-pagedtable-source="" type="application/json">
{"columns":[{"label":["prior"],"name":[1],"type":["chr"],"align":["left"]},{"label":["class"],"name":[2],"type":["chr"],"align":["left"]},{"label":["coef"],"name":[3],"type":["chr"],"align":["left"]},{"label":["group"],"name":[4],"type":["chr"],"align":["left"]},{"label":["resp"],"name":[5],"type":["chr"],"align":["left"]},{"label":["dpar"],"name":[6],"type":["chr"],"align":["left"]},{"label":["nlpar"],"name":[7],"type":["chr"],"align":["left"]},{"label":["lb"],"name":[8],"type":["chr"],"align":["left"]},{"label":["ub"],"name":[9],"type":["chr"],"align":["left"]},{"label":["source"],"name":[10],"type":["chr"],"align":["left"]}],"data":[{"1":"normal(0, 0.2)","2":"b","3":"age_centred","4":"","5":"","6":"","7":"","8":"NA","9":"NA","10":"user"},{"1":"normal(0, 0.2)","2":"b","3":"age_centred:day2","4":"","5":"","6":"","7":"","8":"NA","9":"NA","10":"user"},{"1":"normal(0, 1)","2":"b","3":"day2","4":"","5":"","6":"","7":"","8":"NA","9":"NA","10":"user"},{"1":"normal(0, 1)","2":"sigma","3":"","4":"","5":"","6":"","7":"","8":"NA","9":"NA","10":"user"},{"1":"exponential(1)","2":"sd","3":"","4":"","5":"","6":"","7":"","8":"NA","9":"NA","10":"user"},{"1":"normal(0, 1)","2":"Intercept","3":"","4":"","5":"","6":"","7":"","8":"NA","9":"NA","10":"user"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}}
  </script>
</div>
</div>
</div>
</section>
<section id="pre-experiment-prophylaxis" class="level2">
<h2 class="anchored" data-anchor-id="pre-experiment-prophylaxis">Pre-experiment prophylaxis</h2>
<p>So we have specified some priors using the power of <em>our thoughts</em>. But we should probably check to see if they are broadly sensible. A great thing about Bayesian modelling is that we are explicitly specifying our <em>a priori</em> (or pre-data) assumptions about the data generating process. That means that we can do a fast validation of our priors by simulating from them and checking that they’re not too wild.</p>
<p>There are lots of ways to do this, but the easiest<sup>26</sup> is to use the <code>sample_prior = "only"</code> option in the <code>brm()</code> function.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb13-1">prior_draws <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">brm</span>(formula, </span>
<span id="cb13-2">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> activity_2mins_scaled,</span>
<span id="cb13-3">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prior =</span> priors,</span>
<span id="cb13-4">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sample_prior =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"only"</span>,</span>
<span id="cb13-5">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">backend =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cmdstanr"</span>,</span>
<span id="cb13-6">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cores =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb13-7">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>Start sampling</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 0.7 seconds.
Chain 2 finished in 0.7 seconds.
Chain 3 finished in 0.7 seconds.
Chain 4 finished in 0.7 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.7 seconds.
Total execution time: 0.9 seconds.</code></pre>
</div>
</div>
<p>Now that we have samples from the prior distribution, we can assemble them to work out what our prior would, pre-data, predict for the number of active bins for a single monkey (in this case, a monkey<sup>27</sup> that is 10 years older than the baseline).</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb16-1">pred_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age_centred =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">day =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">monkey =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"88"</span>) </span>
<span id="cb16-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pred =</span> brms<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">posterior_predict</span>(prior_draws, </span>
<span id="cb16-3">                                      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> pred_data )) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb16-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(pred)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb16-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">after_stat</span>(density)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightgrey"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb16-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb16-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb16-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlim</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb16-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>The vertical lines are (approximately) the minimum and maximum of the data. This<sup>28</sup> suggests that the implied prior predictive distribution is definitely wider than our observed data, but not by several orders of magnitude. This is a good situation to be in: the priors leave enough room for our specification to be wrong while still ruling out truly wild values of the parameters (and of the implied predictive distribution). One could even go so far as to say that the prior is weakly informative.</p>
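<p>The logic of this kind of prior predictive check can be sketched outside of <code>brms</code> as well. The following Python sketch uses <em>hypothetical</em> prior scales (not the exact priors above) for a random-intercept Gaussian model and checks where the bulk of the prior predictive falls relative to the data range of roughly <code>[-1, 1]</code>.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4000  # number of prior draws

# Hypothetical weakly informative priors (illustrative, not the post's exact priors):
beta0 = rng.normal(0.0, 1.0, S)          # intercept
beta_age = rng.normal(0.0, 0.2, S)       # slope for centred age
tau = np.abs(rng.normal(0.0, 1.0, S))    # between-monkey sd (half-normal)
sigma = np.abs(rng.normal(0.0, 1.0, S))  # residual sd

# Prior predictive draws for one new monkey, 10 years above the baseline age
u = rng.normal(0.0, tau)                 # monkey-level intercept, one per draw
mu = beta0 + beta_age * 10 + u
y_rep = rng.normal(mu, sigma)

# Compare the bulk of the prior predictive to the data range (about [-1, 1])
lo, hi = np.quantile(y_rep, [0.01, 0.99])
print(lo, hi)  # wider than the data, but not absurdly so
```

<p>If this interval spanned thousands of units, the priors would be far too vague; if it barely covered the data, they would be too restrictive.</p>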
<p>Let’s compare this to the default priors on the standard deviation parameters. (The default priors on the regression parameters are improper so we can’t simulate from them. So I replaced the improper prior with a much narrower <img src="https://latex.codecogs.com/png.latex?N(0,10%5E2)"> prior. If you make the prior on the <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> wider the prior predictive distribution also gets wider.)</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb17-1">priors_default <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">class =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b"</span>)</span>
<span id="cb17-2">prior_draws_default <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">brm</span>(formula, </span>
<span id="cb17-3">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> activity_2mins_scaled,</span>
<span id="cb17-4">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prior =</span> priors_default,</span>
<span id="cb17-5">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sample_prior =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"only"</span>,</span>
<span id="cb17-6">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">backend =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cmdstanr"</span>,</span>
<span id="cb17-7">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cores =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb17-8">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 0.6 seconds.
Chain 2 finished in 0.6 seconds.
Chain 3 finished in 0.6 seconds.
Chain 4 finished in 0.6 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.6 seconds.
Total execution time: 0.8 seconds.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb19-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pred =</span> brms<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">posterior_predict</span>(prior_draws_default, </span>
<span id="cb19-2">                                      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> pred_data )) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb19-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(pred)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb19-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">after_stat</span>(density)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightgrey"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb19-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb19-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb19-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>This is considerably wider.</p>
</section>
<section id="fitting-the-data-or-do-my-monkeys-get-less-interesting-as-they-age" class="level2">
<h2 class="anchored" data-anchor-id="fitting-the-data-or-do-my-monkeys-get-less-interesting-as-they-age">Fitting the data; or do my monkeys get less interesting as they age</h2>
<p>With all of that in hand, we can now fit the data. Hooray. This is done with the same command (minus the <code>sample_prior</code> bit).</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb20-1">posterior_draws <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">brm</span>(formula, </span>
<span id="cb20-2">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> activity_2mins_scaled,</span>
<span id="cb20-3">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prior =</span> priors,</span>
<span id="cb20-4">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">backend =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cmdstanr"</span>,</span>
<span id="cb20-5">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cores =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb20-6">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>Start sampling</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 1.7 seconds.
Chain 3 finished in 1.8 seconds.
Chain 2 finished in 1.8 seconds.
Chain 4 finished in 1.8 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.8 seconds.
Total execution time: 2.0 seconds.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb23-1">posterior_draws</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: active_bins_scaled ~ age_centred * day + (1 | monkey) 
   Data: activity_2mins_scaled (Number of observations: 485) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~monkey (Number of levels: 243) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.31      0.03     0.25     0.37 1.00     1070     1766

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           -0.04      0.03    -0.11     0.02 1.00     4222     3171
age_centred          0.02      0.07    -0.11     0.14 1.00     3671     3150
day2                 0.10      0.04     0.03     0.18 1.00     8022     2911
age_centred:day2     0.07      0.07    -0.08     0.22 1.00     6170     2584

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.43      0.02     0.39     0.47 1.00     1613     2430

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).</code></pre>
</div>
</div>
<p>There doesn’t seem to be much of an effect of age in this data.</p>
<p>If you’re curious, this matches well<sup>29</sup> with the output of <code>lme4</code>, which is a nice sense check for simple models. Generally speaking, if they’re the same then they’re both fine. If they are different<sup>30</sup>, then you’ve got to look deeper.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb25-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(lme4)</span>
<span id="cb25-2">fit_lme4 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lmer</span>(formula, activity_2mins_scaled)</span>
<span id="cb25-3">fit_lme4</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Linear mixed model fit by REML ['lmerMod']
Formula: active_bins_scaled ~ age_centred * day + (1 | monkey)
   Data: activity_2mins_scaled
REML criterion at convergence: 734.9096
Random effects:
 Groups   Name        Std.Dev.
 monkey   (Intercept) 0.3091  
 Residual             0.4253  
Number of obs: 485, groups:  monkey, 243
Fixed Effects:
     (Intercept)       age_centred              day2  age_centred:day2  
        -0.04114           0.01016           0.10507           0.08507  </code></pre>
</div>
</div>
<p>We can also compare the fit using leave-one-out cross validation. This is similar to AIC, but more directly interpretable. It is the average of <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p_%5Ctext%7Bposterior%20predictive%7D(y_%7Bij%7D%20%5Cmid%20y_%7B-ij%7D)%20=%20%5Clog%20%5Cleft(%5Cint_%5Ctheta%20p(y_%7Bij%7D%20%5Cmid%20%5Ctheta)p(%5Ctheta%20%5Cmid%20y_%7B-ij%7D)%5C,%20d%5Ctheta%5Cright),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is a vector of all of the parameters in the model. The notation <img src="https://latex.codecogs.com/png.latex?y_%7B-ij%7D"> is the data <em>without</em> the <img src="https://latex.codecogs.com/png.latex?ij">th observation. This average is sometimes called the <em>expected log predictive density</em> or elpd.</p>
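<p>To make the elpd concrete, here is a hedged Python sketch (not the author's code, and not what <code>loo</code> actually does internally). It estimates the in-sample pointwise log predictive density from a hypothetical <code>S × n</code> matrix <code>loglik</code> of log-likelihood values evaluated at posterior draws; the leave-one-out version reported by <code>loo</code> additionally corrects for reusing each observation.</p>

```python
import numpy as np

def lppd_from_loglik(loglik):
    """Monte Carlo estimate of the pointwise log predictive density.

    loglik: (S, n) array with loglik[s, i] = log p(y_i | theta_s),
    where theta_s are posterior draws.
    Returns sum_i log( (1/S) * sum_s p(y_i | theta_s) ),
    using the log-sum-exp trick for numerical stability.
    """
    max_ll = loglik.max(axis=0)
    lppd_i = max_ll + np.log(np.exp(loglik - max_ll).mean(axis=0))
    return lppd_i.sum()

# Toy check: if every draw gives the same log density for one observation,
# the estimate is just that log density.
loglik = np.full((1000, 1), -1.3)
print(lppd_from_loglik(loglik))  # ≈ -1.3
```
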
<p>To compare it with the two linear regression models, I need to fit them in <code>brms</code>. I will use a <img src="https://latex.codecogs.com/png.latex?N(0,1)"> prior for the monkey intercepts and the same priors as the previous model for the other parameters.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb27-1">priors_lm <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">class =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb27-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">coef =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age_centred"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb27-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">coef =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age_centred:day2"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb27-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">coef =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day2"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb27-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">class =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Intercept"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb27-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prior</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">normal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">class =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sigma"</span>)</span>
<span id="cb27-7"></span>
<span id="cb27-8">posterior_nopool <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">brm</span>(</span>
<span id="cb27-9">  active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> age_centred <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> day <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> monkey, </span>
<span id="cb27-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> activity_2mins_scaled,</span>
<span id="cb27-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prior =</span> priors_lm,</span>
<span id="cb27-12">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">backend =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cmdstanr"</span>,</span>
<span id="cb27-13">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cores =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb27-14">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 4.5 seconds.
Chain 3 finished in 4.5 seconds.
Chain 2 finished in 4.5 seconds.
Chain 4 finished in 4.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 4.5 seconds.
Total execution time: 4.7 seconds.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb29-1">posterior_pool <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">brm</span>(</span>
<span id="cb29-2">  active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> age_centred <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> day, </span>
<span id="cb29-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> activity_2mins_scaled,</span>
<span id="cb29-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prior =</span> priors_lm,</span>
<span id="cb29-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">backend =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cmdstanr"</span>,</span>
<span id="cb29-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cores =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb29-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 finished in 0.1 seconds.
Chain 4 finished in 0.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.3 seconds.</code></pre>
</div>
</div>
<p>We can now use the <code>loo_compare</code> function to compare the models. By default, the best model is listed first and the other models are listed below it, along with their difference in elpd. To do this, we first need to tell <code>brms</code> to compute the <code>loo</code> criterion using the <code>add_criterion</code> function.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb31-1">posterior_draws <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_criterion</span>(posterior_draws, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loo"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>Warning: Found 2 observations with a pareto_k &gt; 0.7 in model 'posterior_draws'.
It is recommended to set 'moment_match = TRUE' in order to perform moment
matching for problematic observations.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb33-1">posterior_nopool <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_criterion</span>(posterior_nopool, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loo"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>Warning: Found 63 observations with a pareto_k &gt; 0.7 in model
'posterior_nopool'. It is recommended to set 'moment_match = TRUE' in order to
perform moment matching for problematic observations.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb35-1">posterior_pool <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_criterion</span>(posterior_pool, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loo"</span>)</span>
<span id="cb35-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">loo_compare</span>(posterior_draws, posterior_nopool, posterior_pool)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>                 elpd_diff se_diff
posterior_draws    0.0       0.0  
posterior_pool   -29.0       7.4  
posterior_nopool -53.3       9.0  </code></pre>
</div>
</div>
<p>There are some warnings there suggesting that a few of the pointwise estimates should be recomputed with a slower, more reliable method (<code>moment_match = TRUE</code>), but for the purposes of today I’m not going to do that and I shall declare that the multilevel model performs <em>far better</em> than the other two models.</p>
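<p>For the curious: the <code>pareto_k</code> warnings arise because <code>loo</code> approximates leave-one-out predictions by importance sampling, with ratios proportional to <code>1 / p(y_i | theta_s)</code>; when those ratios have a heavy right tail (flagged by a fitted generalized Pareto shape <code>k &gt; 0.7</code>), the estimate is unreliable. The plain (unsmoothed) importance-sampling version can be sketched as follows; this is illustrative only, and again assumes a hypothetical <code>S × n</code> log-likelihood matrix.</p>

```python
import numpy as np

def is_loo_elpd(loglik):
    """Plain importance-sampling LOO (no Pareto smoothing).

    loglik: (S, n) with loglik[s, i] = log p(y_i | theta_s), evaluated
    at draws from the full-data posterior. The importance ratio for
    leaving out y_i is r_s = 1 / p(y_i | theta_s), which gives
    elpd_loo_i = log(S) - logsumexp_s(-loglik[s, i]).
    """
    S = loglik.shape[0]
    neg = -loglik
    m = neg.max(axis=0)
    lse = m + np.log(np.exp(neg - m).sum(axis=0))  # logsumexp over draws
    return (np.log(S) - lse).sum()

# Toy check: identical draws give back the common log density.
print(is_loo_elpd(np.full((100, 1), -1.3)))  # ≈ -1.3
```

<p>PSIS (what <code>loo</code> actually uses) stabilises the largest ratios by replacing them with quantiles of a fitted generalized Pareto distribution, which is why the fix it recommends for high-<code>k</code> observations is moment matching or exact refitting rather than more draws.</p>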
</section>
<section id="post-experiment-prophylaxis" class="level2">
<h2 class="anchored" data-anchor-id="post-experiment-prophylaxis">Post-experiment prophylaxis</h2>
<p>Of course, we would be fools to assume that, just because we fit a model and compared it to some other models, the model is a good representation of the data. To check that, we need to look at some posterior predictive checks.</p>
<p>The easiest thing to look at is the predictions themselves.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb37-1">fitted <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> activity_2mins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb37-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">posterior_predict</span>(posterior_draws,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ndraws =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb37-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">207</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"draw"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fitted"</span>)</span>
<span id="cb37-4"></span>
<span id="cb37-5">day_labs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Day 1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Day 2"</span>)</span>
<span id="cb37-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(day_labs) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2"</span>)</span>
<span id="cb37-7"></span>
<span id="cb37-8">violin_plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> fitted <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb37-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>( <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>age, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>fitted <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> active_bins_centre, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> age)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb37-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_violin</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightgrey"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb37-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> active_bins), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb37-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>day, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labeller =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labeller</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">day =</span> day_labs)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb37-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() </span>
<span id="cb37-14">violin_plot</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>That appears to be a reasonably good fit, although it’s possible that the prediction intervals are a bit wide. Later we will also look at the plot of the posterior residuals against the fitted values, where the fitted values are the mean of the posterior predictive distribution.</p>
<p>Next, let’s check for evidence of non-linearity in <code>age</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb38-1">plot_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> activity_2mins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb38-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fitted_mean =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colMeans</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">posterior_epred</span>(posterior_draws,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ndraws =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)))</span>
<span id="cb38-3"></span>
<span id="cb38-4">age_plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> plot_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb38-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> age, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb38-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb38-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span>
<span id="cb38-8">age_plot</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>There doesn’t seem to be any obvious evidence of non-linearity in the residuals, which suggests the linear model for age was sufficient.</p>
<p>We can also check the distributional assumption<sup>31</sup> that the residuals <img src="https://latex.codecogs.com/png.latex?%0Ar_%7Bij%7D%20=%20y_%7Bij%7D%20-%20%5Cmu_j%0A"> have a Gaussian distribution. Here we are using the posterior mean to define our residuals, and a qq-plot to see how we’re doing with normality.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb39-1">distribution_plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> plot_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sample =</span> (active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb39-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_qq</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb39-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_abline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">slope =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">intercept =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb39-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_classic</span>()</span>
<span id="cb39-5">distribution_plot</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>That’s not too bad. A bit of a deviation from normality in the tails but nothing that would make me weep. It could well be an artifact of how I defined and normalised the residuals.</p>
<p>We can also look at the so-called k-hat plot of the Pareto-k diagnostics from PSIS-LOO, which can be useful for finding high-leverage observations in general models.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb40-1">loo_posterior <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">LOO</span>(posterior_draws) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#warnings suppressed</span></span>
<span id="cb40-2">loo_posterior</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Computed from 4000 by 485 log-likelihood matrix

         Estimate   SE
elpd_loo   -349.8 12.4
p_loo       117.8  5.2
looic       699.7 24.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     418   86.2%   902       
 (0.5, 0.7]   (ok)        65   13.4%   443       
   (0.7, 1]   (bad)        1    0.2%   272       
   (1, Inf)   (very bad)   1    0.2%   59        
See help('pareto-k-diagnostic') for details.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb42-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(loo_posterior)</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-19-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>This suggests that observations 393 and 394 are potentially high leverage and we should check them more carefully. I won’t be doing that today.</p>
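<p>If you did want to chase them down, the <code>loo</code> package has a helper that pulls out the flagged indices directly. A sketch:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Indices of observations with pareto_k above the threshold,
# then pull the corresponding rows of the data for inspection.
high_k &lt;- loo::pareto_k_ids(loo_posterior, threshold = 0.7)
activity_2mins_scaled[high_k, ]</code></pre></div>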
<p>Finally, let’s look at the residuals vs the fitted values. This is a commonly used diagnostic plot in linear regression and it can be very useful for visually detecting non-linear patterns and heteroskedasticity in the residuals. So let’s make the plot<sup>32</sup>.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb43-1">problem_plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> plot_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb43-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> fitted_mean, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb43-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb43-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lm"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb43-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>day) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb43-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb43-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlim</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb43-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylim</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb43-9">problem_plot</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Hmmmm. That’s not <em>excellent</em>. The stripes are related to the 8 distinct values the response can take, but there is definitely a trend in the residuals. In particular, we are under-predicting small values and over-predicting large values. <em>There is something here and we will look into it</em>!</p>
</section>
<section id="understanding-diagnostic-plots-from-multilevel-models" class="level2">
<h2 class="anchored" data-anchor-id="understanding-diagnostic-plots-from-multilevel-models">Understanding diagnostic plots from multilevel models</h2>
<p>The thing is, multilevel models are notorious for having patterns that are essentially a product of the data design and not of any type of statistical misspecification. In a really great paper that you should all read, <a href="https://arxiv.org/pdf/1502.06988.pdf">Adam Loy, Heike Hofmann, and Di Cook</a> talk extensively about the challenges with interpreting diagnostic plots for linear mixed effects models<sup>33</sup>.</p>
<p>I’m not going to fully follow their recommendations, mostly because I’m too lazy<sup>34</sup> to write a for loop, but I am going to appropriate the guts of their idea.</p>
<p>They note that strange patterns can occur in diagnostic plots <em>even for correctly specified models</em>. Moreover, we simply do not know what these patterns will be. It’s too complex a function of the design, the structure, the data, and the potential misspecification. That sounds bad, but they note that <em>we don’t need to know what pattern to expect</em>. Why not? Because we can simulate it!</p>
<p>So this is the idea: Let’s simulate some fake<sup>35</sup> data from a correctly specified model that otherwise matches our data. We can then compare the diagnostic plots from the fake data with diagnostic plots from the real data and see if the patterns are meaningfully different.</p>
<p>In order to do this, we should have a method to construct <em>multiple</em> fake data sets. Why? Well, a plot is nothing but another test statistic, and we <em>must</em> take its sampling variability into account.</p>
<p>(That said, do what I say, not what I do. This is a blog. I’m not going to code well enough to make this clean and straightforward, so I’m just going to do one.)</p>
<p>There is an entire theory of <a href="https://royalsocietypublishing.org/doi/10.1098/rsta.2009.0120"><em>visual inference</em></a> built around these lineups of diagnostic plots, where one plot uses the real data and the rest use realisations of the null data. It is really quite interesting and <em>well</em> beyond the scope of this post. But if you want to know more, read the <a href="https://arxiv.org/pdf/1502.06988.pdf">Loy, Hofmann, and Cook</a> paper!</p>
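<p>If I were less lazy, the loop would look roughly like this. This is only a sketch: <code>simulate_fake()</code> is a hypothetical helper wrapping the simulation code in the next section.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Sketch (not run): fit the model to 19 null data sets and collect the
# residuals-vs-fitted plots; the real plot gets slipped in at a random
# position in the lineup.
null_plots &lt;- purrr::map(1:19, function(i) {
  data_null &lt;- simulate_fake()  # hypothetical helper
  fit_null &lt;- update(posterior_draws, newdata = data_null, refresh = 0)
  data_null |&gt;
    mutate(fitted_mean = colMeans(posterior_epred(fit_null))) |&gt;
    ggplot(aes(x = fitted_mean, y = active_bins_scaled - fitted_mean)) +
    geom_point() + theme_bw()
})</code></pre></div>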
<section id="making-new-data" class="level3">
<h3 class="anchored" data-anchor-id="making-new-data">Making new data</h3>
<p>The first thing that we need to do is to work out how to simulate fake data from a correctly specified model with the same structure. Following the Loy et al. paper, I’m going to do a simple parametric bootstrap, where I take the posterior medians of the fitted distribution and simulate data from them.</p>
<p>That said, there are a bunch of other options. Specifically, we have a whole bag of samples from our posterior distribution and it would be possible to use that to select values of<sup>36</sup> <img src="https://latex.codecogs.com/png.latex?(%5Cmu,%20%5Cbeta,%20%5Ctau,%20%5Csigma)"> for our simulation.</p>
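<p>A sketch of that alternative: grab the posterior draws and pick a row at random to use as the simulation parameters. (The exact column names depend on the formula; check <code>names(draws)</code>.)</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Not run: use a random posterior draw instead of the posterior medians.
draws &lt;- posterior::as_draws_df(posterior_draws)
one_draw &lt;- draws[sample(nrow(draws), 1), ]
# Columns like b_Intercept, sigma, and sd_monkey__Intercept (brms naming
# conventions; verify with names(draws)) would then replace the
# hard-coded values in the next chunk.</code></pre></div>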
<p>So let’s make some fake data and fit the model to it!</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb44-1">monkey_effect <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">monkey =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unique</span>(activity_2mins_scaled<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>monkey), </span>
<span id="cb44-2">                        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">monkey_effect =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">243</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.31</span>))</span>
<span id="cb44-3">data_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> activity_2mins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb44-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(monkey_effect, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"monkey"</span>)  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb44-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">active_bins_scaled =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(age_centred),</span>
<span id="cb44-6">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.04</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> age_centred <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb44-7">              monkey_effect <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(day <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2"</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.085</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>age_centred, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>), </span>
<span id="cb44-8">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.43</span>))</span>
<span id="cb44-9">                                              </span>
<span id="cb44-10">posterior_draws_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">brm</span>(formula, </span>
<span id="cb44-11">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> data_fake,</span>
<span id="cb44-12">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prior =</span> priors,</span>
<span id="cb44-13">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">backend =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cmdstanr"</span>,</span>
<span id="cb44-14">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cores =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb44-15">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">refresh =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Running MCMC with 4 parallel chains...

Chain 1 finished in 1.6 seconds.
Chain 2 finished in 1.6 seconds.
Chain 3 finished in 1.6 seconds.
Chain 4 finished in 1.6 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.6 seconds.
Total execution time: 1.8 seconds.</code></pre>
</div>
</div>
</section>
<section id="the-good-plots" class="level3">
<h3 class="anchored" data-anchor-id="the-good-plots">The good plots</h3>
<p>First up, let’s look at the violin plot.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb46-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(cowplot)</span>
<span id="cb46-2">fitted_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> data_fake <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb46-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">posterior_predict</span>(posterior_draws_fake,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ndraws =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb46-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">207</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"draw"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fitted"</span>)</span>
<span id="cb46-5"></span>
<span id="cb46-6">day_labs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Day 1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Day 2"</span>)</span>
<span id="cb46-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(day_labs) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2"</span>)</span>
<span id="cb46-8"></span>
<span id="cb46-9">violin_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> fitted_fake <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb46-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>( <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>age, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>fitted <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> active_bins_centre, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> age)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb46-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_violin</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightgrey"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb46-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> active_bins), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb46-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>day, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labeller =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labeller</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">day =</span> day_labs)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb46-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() </span>
<span id="cb46-15">  </span>
<span id="cb46-16"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_grid</span>(violin_plot, violin_fake, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Real"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fake"</span>))</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>That’s very similar to our data plot.</p>
<p>Next up, we will look at the residuals ordered by age.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb47-1">plot_data_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> data_fake <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb47-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fitted_mean =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colMeans</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">posterior_epred</span>(posterior_draws_fake,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ndraws =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)))</span>
<span id="cb47-3"></span>
<span id="cb47-4">age_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> plot_data_fake <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb47-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> age, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb47-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb47-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span>
<span id="cb47-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_grid</span>(age_plot, age_fake, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Real"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fake"</span>))</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Fabulous!</p>
<p>Now let’s check the distributional assumption on the residuals!</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb48-1">distribution_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> plot_data_fake <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb48-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sample =</span> (active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb48-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_qq</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb48-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_abline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">slope =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">intercept =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb48-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_classic</span>()</span>
<span id="cb48-6"></span>
<span id="cb48-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_grid</span>(distribution_plot, distribution_fake, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Real"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fake"</span>))</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-24-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Excellent!</p>
<p>Finally, we can look at the k-hat plot. Because I’m lazy, I’m not going to put them side by side. You can scroll.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb49-1">loo_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">LOO</span>(posterior_draws_fake)</span></code></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>Warning: Found 4 observations with a pareto_k &gt; 0.7 in model
'posterior_draws_fake'. It is recommended to set 'moment_match = TRUE' in order
to perform moment matching for problematic observations.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb51-1">loo_fake</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Computed from 4000 by 485 log-likelihood matrix

         Estimate   SE
elpd_loo   -372.1 14.9
p_loo       115.4  6.1
looic       744.2 29.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     422   87.0%   579       
 (0.5, 0.7]   (ok)        59   12.2%   220       
   (0.7, 1]   (bad)        4    0.8%   118       
   (1, Inf)   (very bad)   0    0.0%   &lt;NA&gt;      
See help('pareto-k-diagnostic') for details.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb53" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb53-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(loo_fake)</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>And look: we get some extreme values. (Depending on the run, we get more or fewer.) This suggests that while it would be useful to look at the data points flagged by the k-hat statistic, the extreme values may just be sampling variation.</p>
<p>All of this suggests our model assumptions are not being grossly violated. All except for that residual vs fitted values plot…</p>
</section>
<section id="the-haunted-residual-vs-fitted-plot" class="level3">
<h3 class="anchored" data-anchor-id="the-haunted-residual-vs-fitted-plot">The haunted residual vs fitted plot</h3>
<p>Now let’s look at our residual vs fitted plot.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb54-1">problem_fake <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> plot_data_fake <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb54-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> fitted_mean, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> active_bins_scaled <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> fitted_mean)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb54-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb54-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lm"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb54-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>day) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb54-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb54-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlim</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb54-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylim</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb54-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_grid</span>(problem_plot, problem_fake, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Real"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fake"</span>))</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>And what do you know! They look the same. (Well, minus the discretisation artefacts.)</p>
</section>
<section id="so-what-the-hell-is-going-on" class="level3">
<h3 class="anchored" data-anchor-id="so-what-the-hell-is-going-on">So what the hell is going on?</h3>
<p>Great question! It turns out that this is one of those cases where our intuition from linear models <em>does not</em> transfer over to multilevel models.</p>
<p>We can actually reason this out by thinking about a model where we have no covariates.</p>
<p>If we have no pooling then the observations for every monkey are, essentially, averaged to get our estimate of <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">. If we repeat this, we will find that our <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> are basically<sup>37</sup> unbiased and the corresponding residual <img src="https://latex.codecogs.com/png.latex?%0Ar_%7Bij%7D%20=%20y_%7Bij%7D%20-%20%5Cmu_j%0A"> will have mean zero.</p>
<p>But that’s not what happens when we have partial pooling. When we have partial pooling we are <em>combining</em> our naive average<sup>38</sup> <img src="https://latex.codecogs.com/png.latex?%5Cbar%20y_j"> with the global average <img src="https://latex.codecogs.com/png.latex?%5Cmu"> in a way that accounts for the size of group <img src="https://latex.codecogs.com/png.latex?j"> relative to other groups as well as the within-group variability relative to the between-group variability.</p>
<details>
<summary>
Expand for maths. Just a little
</summary>
There is, in fact, a formula for it. Just in case you’re a formula sort of person. The posterior estimate for a Gaussian multilevel model with an intercept but no covariates is <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B1%20+%5Cfrac%7B%5Csigma%5E2/n%7D%7B%5Ctau%5E2%7D%7D%5Cleft(%5Cbar%7By%7D_j%20+%20%5Cfrac%7B%5Csigma%5E2/n%7D%7B%5Ctau%5E2%7D%20%5Cmu%5Cright).%0A"> When <img src="https://latex.codecogs.com/png.latex?%5Csigma/%5Csqrt%7Bn%7D"> is small, which happens when the sampling standard deviation of <img src="https://latex.codecogs.com/png.latex?%5Cbar%20y_j"> is small relative to the between group variation <img src="https://latex.codecogs.com/png.latex?%5Ctau">, this is almost equal to <img src="https://latex.codecogs.com/png.latex?%5Cbar%7By%7D_j"> and there is almost no pooling. On the other hand, when <img src="https://latex.codecogs.com/png.latex?%5Csigma/%5Csqrt%7Bn%7D"> is large relative to <img src="https://latex.codecogs.com/png.latex?%5Ctau">, then the estimate of <img src="https://latex.codecogs.com/png.latex?%5Cmu_j"> will be very close to the overall mean <img src="https://latex.codecogs.com/png.latex?%5Cmu">.
</details>
<p>The short version is that there is some magical number <img src="https://latex.codecogs.com/png.latex?%5Calpha">, which depends on <img src="https://latex.codecogs.com/png.latex?%5Ctau">, <img src="https://latex.codecogs.com/png.latex?%5Csigma">, and <img src="https://latex.codecogs.com/png.latex?n_j"> such that <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20%5Cmu_j%20=%20%5Calpha%20%5Cbar%7By%7D_j%20+%20(1-%5Calpha)%20%5Cmu.%0A"> Because of this, the residuals <img src="https://latex.codecogs.com/png.latex?%0Ar_%7Bij%7D%20=%20y_%7Bij%7D%20-%20%5Calpha%20%5Cbar%7By%7D_j%20-%20(1-%5Calpha)%5Cmu%0A"> are suddenly <em>not</em> going to have mean zero.</p>
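<p>To make that concrete, here is a tiny standalone numerical check of the shrinkage formula. (It is in Python rather than the R used elsewhere in this post, just to keep it self-contained, and every number in it is made up purely for illustration.) For one group that sits above the global mean, the no-pooling residuals average to exactly zero, while the partially pooled residuals have mean <img src="https://latex.codecogs.com/png.latex?(1-%5Calpha)(%5Cbar%7By%7D_j%20-%20%5Cmu)%20%3E%200">.</p>

```python
import numpy as np

# Hypothetical numbers, chosen only for illustration:
# global mean mu, between-group sd tau, within-group sd sigma, group size n
mu, tau, sigma, n = 0.0, 1.0, 2.0, 5

# Five observations from one group that sits well above the global mean
y = np.array([2.0, 3.1, 1.7, 2.8, 2.4])
ybar = y.mean()  # 2.4

# Shrinkage weight: alpha = tau^2 / (tau^2 + sigma^2 / n)
alpha = tau**2 / (tau**2 + sigma**2 / n)
mu_hat = alpha * ybar + (1 - alpha) * mu  # partially pooled estimate

# No pooling: residuals around the raw group mean average to zero by construction
print(np.isclose((y - ybar).mean(), 0.0))  # True

# Partial pooling: the estimate is dragged towards mu, so the residuals
# have mean (1 - alpha) * (ybar - mu), which is positive for this group
print(np.isclose((y - mu_hat).mean(), (1 - alpha) * (ybar - mu)))  # True
print((y - mu_hat).mean() > 0)  # True
```

<p>The same arithmetic goes through for any group: the further <img src="https://latex.codecogs.com/png.latex?%5Cbar%7By%7D_j"> is from <img src="https://latex.codecogs.com/png.latex?%5Cmu">, the further the residual mean is from zero.</p>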
<p>In fact, if we think about it a bit more, we will realise that the model will drag extreme groups to the centre, which accounts for the positive slope in the residuals vs the fitted values.</p>
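<p>A quick simulation makes the slope visible. (Again a Python sketch with made-up values of <img src="https://latex.codecogs.com/png.latex?%5Ctau">, <img src="https://latex.codecogs.com/png.latex?%5Csigma">, and the group sizes; it applies the shrinkage formula directly rather than fitting the full Bayesian model.) Groups with large fitted values have been shrunk down, so their residuals are positive, and vice versa, which produces a positive regression slope of residuals on fitted values.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values: tiny groups (n = 2), echoing the monkey data
mu, tau, sigma, n, J = 0.0, 1.0, 2.0, 2, 500

alpha = tau**2 / (tau**2 + sigma**2 / n)  # shrinkage weight, here 1/3

mu_j = rng.normal(mu, tau, size=J)                 # true group means
y = rng.normal(mu_j[:, None], sigma, size=(J, n))  # n observations per group

ybar = y.mean(axis=1)
fitted = alpha * ybar + (1 - alpha) * mu           # partially pooled fits

resid = (y - fitted[:, None]).ravel()
fitted_rep = np.repeat(fitted, n)  # one fitted value per observation

# Least-squares slope of residuals on fitted values; for this setup the
# theoretical value is (1 - alpha) / alpha = 2, so it is clearly positive
slope = np.polyfit(fitted_rep, resid, 1)[0]
print(slope > 0)  # True: extreme groups are dragged towards the centre
```

<p>Note the slope is larger when <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is small, which is exactly the heavy-shrinkage regime of small groups and large within-group noise.</p>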
<p>The slope in this example is quite extreme because the groups are very small (only one or two individuals). But it is a general phenomenon and it’s discussed extensively in Chapter 7 of <a href="http://www.biostat.umn.edu/~hodges/RPLMBook/RPLMBookpage.htm">Jim Hodges’ excellent book</a>. His suggestion is that there isn’t really a good, general way to remove the trend. But that doesn’t mean the plot is useless. It is still able to pinpoint outliers and heteroskedasticity. You’ve just got to tilt your head.</p>
<p>But for the purposes of today we can notice that there don’t seem to be any extreme outliers so everything is probably ok.</p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>So what have we done? Well, we’ve gone through the process of fitting and scrutinising a simple Bayesian multilevel model. We’ve talked about some of the challenges associated with graphical diagnostics for structured data. And we’ve all<sup>39</sup> learnt something about the residual-vs-fitted plot for a multilevel model.</p>
<p>Most importantly, we’ve all learnt the value of using fake data simulated from the posterior model to help us understand our diagnostics.</p>
<p>There is more to the scientific story here. It turns out that while there is no effect over 2 minutes, there is <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.200316">a slight effect over 20 minutes</a>. So the conceptual replication failed, but the study still found some interesting things.</p>
<p>Of course, I’ve ignored one big elephant in the room: the data were discrete. In the end, our distributional diagnostics didn’t throw up any massive red flags, but nevertheless it could be an interesting exercise to see what happens if we use a more problem-adapted likelihood.</p>
<p>Last, and certainly not least, I barely scratched the surface<sup>40</sup> of the <a href="https://arxiv.org/pdf/1502.06988.pdf">Loy, Hoffman, and Cook</a> paper. Anyone who is interested in fitting Gaussian multilevel models should definitely give it a read.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Mark insisted that I link to his google scholar rather than his website. He’s cute that way.↩︎</p></li>
<li id="fn2"><p>Mark wants me to tell you that he’s not vain he’s just moving. Sure Jan.↩︎</p></li>
<li id="fn3"><p>I know that marmosets suffer from lesbian bed death, but I’m told that a marmoset is not a macaque, which in turn is not a macaw. Ecology is fascinating.↩︎</p></li>
<li id="fn4"><p>A real problem in the world is that there aren’t enough monkeys for animal research at the best of times. Once you need aged monkeys, it’s an even smaller population. Non-human primate research is <em>hard</em>.↩︎</p></li>
<li id="fn5"><p>Actually 244, but one of them turned out to be blind. Animal research is a journey.↩︎</p></li>
<li id="fn6"><p>It turns out that some of the monkeys didn’t want to give up the puzzle after 20 minutes. One held out for 72 minutes before the data collection ended. Cheeky monkeys.↩︎</p></li>
<li id="fn7"><p>Did Mark make me do unspeakable, degrading, borderline immoral things to get the data? No.&nbsp;It’s open source. Truly the first time I’ve been disappointed that something was open source.↩︎</p></li>
<li id="fn8"><p>If statisticians abandoned linear regression we would have nothing left. We would be desiccated husks propping up the bar at 3am talking about how we used to do loads of lines in the 80s.↩︎</p></li>
<li id="fn9"><p>Our perfect amount of pool? I don’t know how metaphors work↩︎</p></li>
<li id="fn10"><p>They <em>always</em> take this form if there is a countable collection of exchangeable random variables. For a finite set there are a few more options. But no one talks about those.↩︎</p></li>
<li id="fn11"><p>monkeys↩︎</p></li>
<li id="fn12"><p>Also known as a mixed effects or a linear mixed effects model.↩︎</p></li>
<li id="fn13"><p>There are <em>many</em> other ways to represent Gaussian multilevel models. My former colleague Emi Tanaka and Francis Hui wrote a <a href="https://arxiv.org/abs/1911.08628">great paper</a> on this topic.↩︎</p></li>
<li id="fn14"><p>Some particularly bold and foolish people take this to mean that priors aren’t important. They usually get their arse handed to them the moment they try to fit an even mildly complex model.↩︎</p></li>
<li id="fn15"><p>A non-exhaustive set of weird things: categorical regressors with a rare category, tail parameters, mixture models↩︎</p></li>
<li id="fn16"><p>There are situations where this is not true. For instance if you have a log or logit link function you can put reasonable bounds on your coefficients regardless of the scaling of your data. That said, the computational procedures <em>always</em> appreciate a bit of scaling. If there’s one thing that computers hate more than big numbers it’s small numbers.↩︎</p></li>
<li id="fn17"><p>Of course, we know that the there are only 8 fifteen second intervals in two minutes, so we could use this information to make a data-independent scaling. To be brutally francis with you, that’s what you should probably do in this situation, but I’m trying to be pedagogical so let’s at least think about scaling it by the standard deviation.↩︎</p></li>
<li id="fn18"><p>Fixed scaling is always easier than data-dependent scaling↩︎</p></li>
<li id="fn19"><p>A real trick for young players is scaling new data by the mean and standard deviation of the new data rather than the old data. That’s a very subtle bug that can be <em>very</em> hard to squash.↩︎</p></li>
<li id="fn20"><p>The <code>tidymodels</code> package in R is a great example of an ecosystem that does this properly. <a href="https://www.tmwr.org">Max and Julia’s book</a> on using <code>tidymodels</code> is very excellent and well worth a read.↩︎</p></li>
<li id="fn21"><p>Of all of the things in this post, this has been the most aggressively fact checked one↩︎</p></li>
<li id="fn22"><p>In prior width and on grindr, you should always expect that he’s rounding up.↩︎</p></li>
<li id="fn23"><p>In some places, we would call this a random effect.↩︎</p></li>
<li id="fn24"><p>He is very lovely. Many people would prefer that I was him.↩︎</p></li>
<li id="fn25"><p>It’s possible that the prior on <img src="https://latex.codecogs.com/png.latex?%5Ctau"> might be too wide. If we were doing a logistic regression, these priors would definitely be too wide. And if we had a lot of different random terms (eg if we had lots of different species or lots of different labs) then they would also probably be too wide. But they are better than not having priors.↩︎</p></li>
<li id="fn26"><p>Not the most computationally efficient, but the easiest. Also because it’s the same code we will later use to fit the model, we are evaluating the priors that are actually used and not the ones that we think we’re using.↩︎</p></li>
<li id="fn27"><p> It’s number 88, but because our prior is exchangeable it does not matter which monkey we do this for!↩︎</p></li>
<li id="fn28"><p>I also checked different values of <code>age</code> as well as looking at the posterior mean (via <code>posterior_epred</code>) and the conclusions stay the same.↩︎</p></li>
<li id="fn29"><p>The numbers will never be exactly equal, but they are of similar orders of magnitude.↩︎</p></li>
<li id="fn30"><p>Or if you get some sort of error or warning from <code>lme4</code>↩︎</p></li>
<li id="fn31"><p>So there’s a wrinkle here. Technically, all of the residuals have different variances, which is annoying. You typically studentise them using the leverage scores, but this is a touch trickier for multilevel models. Chapter 7 of <a href="https://www.google.com/search?client=safari&amp;rls=en&amp;q=richly+parametrized+linear+models&amp;ie=UTF-8&amp;oe=UTF-8">Jim Hodges’s excellent book</a> contains a really good discussion.↩︎</p></li>
<li id="fn32"><p>Once again, we are not studentizing the residuals. I’m sorry.↩︎</p></li>
<li id="fn33"><p>Another name for a multilevel model with a Gaussian response↩︎</p></li>
<li id="fn34"><p>Also because all of my data plots are gonna be stripey as hell, and that kinda destroys the point of visual inference.↩︎</p></li>
<li id="fn35"><p>They call it <em>null data</em>.↩︎</p></li>
<li id="fn36"><p>Note that I am <em>not</em> using values of <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">! I will simulate those from the normal distribution to ensure correct model specification. For the same reason, I am not using a residual bootstrap. The aim here is not to assess uncertainty so much as it is to generate data from a correctly specified model.↩︎</p></li>
<li id="fn37"><p>This is a bit more complex when you’re Bayesian, but the intuition still holds. The difference is that now it is asymptotic↩︎</p></li>
<li id="fn38"><p>This is the average of all observations in group j. <img src="https://latex.codecogs.com/png.latex?%0A%5Cbar%20y_j%20=%20%5Cfrac%7B1%7D%7Bn_j%7D%20%5Csum_%7Bi=1%7D%5E%7Bn_j%7D%20y_%7Bij%7D.%0A">↩︎</p></li>
<li id="fn39"><p>I mean, some of us knew this. Personally, I only remembered after I saw it and swore a bit.↩︎</p></li>
<li id="fn40"><p>In particular, they have an interesting discussion on assessing the distributional assumption for <img src="https://latex.codecogs.com/png.latex?%5Cmu_j">.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {A First Look at Multilevel Regression; or {Everybody’s} Got
    Something to Hide Except Me and My Macaques},
  date = {2022-09-06},
  url = {https://dansblog.netlify.app/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“A First Look at Multilevel Regression; or
Everybody’s Got Something to Hide Except Me and My Macaques.”</span>
September 6, 2022. <a href="https://dansblog.netlify.app/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey.html">https://dansblog.netlify.app/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey.html</a>.
</div></div></section></div> ]]></description>
  <category>Multilevel models</category>
  <category>Visual diagnostics</category>
  <category>Prior distributions</category>
  <category>fundamentals</category>
  <guid>https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey.html</guid>
  <pubDate>Mon, 05 Sep 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/mark.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Priors part 4: Specifying priors that appropriately penalise complexity</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4.html</link>
  <description><![CDATA[ 





<p>At some point in the distant past, I wrote three posts about prior distributions. The <a href="https://dansblog.netlify.app/posts/2021-10-14-priors1/priors1.html">first</a> was very basic, because why not. The <a href="https://dansblog.netlify.app/posts/2021-10-14-priors2/priors2.html">second</a> one talked about conjugate priors. The <a href="https://dansblog.netlify.app/posts/2021-10-15-priors3/priors3.html">third</a> one talked about so-called objective priors.</p>
<p>I am suddenly<sup>1</sup> of a mood to write some more on this<sup>2</sup> topic.</p>
<p>The thing is, so far I’ve only really talked about methods for setting prior distributions that I don’t particularly care for. Fuck that. Let’s talk about things I like. There is enough negative energy<sup>3</sup> in the world.</p>
<p>So let’s talk about priors. But the good stuff. The aim is to give my answer to the question “how should you set a prior distribution?”.</p>
<section id="bro-do-you-even-know-what-a-parameter-is" class="level2">
<h2 class="anchored" data-anchor-id="bro-do-you-even-know-what-a-parameter-is">Bro do you even know what a parameter is?</h2>
<p>You don’t. No one does. They’re not real.</p>
<p>Parameters are polite fictions that we use to get through the day. They’re our weapons of mass destruction. They’re the magazines we only bought for the articles. They are our girlfriends who live in Canada<sup>4</sup>.</p>
<p>One way we can see this is to ask ourselves a simple<sup>5</sup> question: how many parameters are there in this model <img src="https://latex.codecogs.com/png.latex?%0Ay_i%20%5Csim%20%5Ctext%7BNegative-Binomial%7D(%5Cmu,%20%5Calpha),%20%5Cqquad%20i%20=%201,%5Cldots,%20n%5Ctext%7B?%7D%0A"></p>
<p>The answer<sup>6</sup> <sup>7</sup> would be two.</p>
<p>But let me ask a different question. How many parameters are there in this model<sup>8</sup> <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ay_i%5Cmid%20u_i%20&amp;%5Csim%20%5Ctext%7BPoisson%7D(%5Cmu%20u_i)%20%5C%5C%0Au_i%20&amp;%5Csim%20%5Ctext%7BGamma%7D(%5Calpha%5E%7B-1%7D,%20%5Calpha%5E%7B-1%7D),%5Cqquad%20i=1,%5Cldots,%20n%5Ctext%7B?%7D%0A%5Cend%7Balign*%7D"></p>
<p>One answer to this question would be <img src="https://latex.codecogs.com/png.latex?n+2">. In this interpretation of the question everything in the model that isn’t directly observed is a parameter.</p>
<p>But there is another view.</p>
<p>Mathematically, these two models are equivalent. That is, if you marginalise<sup>9</sup> out the <img src="https://latex.codecogs.com/png.latex?u_i"> you get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5CPr(y=k)%20&amp;=%5Cfrac%7B%5Cmu%5Ek%5Calpha%5E%7B-1/%5Calpha%7D%7D%7B%5CGamma(%5Calpha%5E%7B-1%7D)%5CGamma(k+1)%7D%20%5Cint_0%5E%5Cinfty%20u%5Ek%20e%5E%7B-%5Cmu%20u%7D%20u%5E%7B1/%5Calpha-1%7De%5E%7B-u/%5Calpha%7D%5C,du%20%5C%5C%0A&amp;=%20%5Cfrac%7B%5Cmu%5Ek%5Calpha%5E%7B-1/%5Calpha%7D%7D%7B%5CGamma(%5Calpha%5E%7B-1%7D)%5CGamma(k+1)%7D%5Cint_0%5E%5Cinfty%20u%5E%7Bk%20+%201/%5Calpha-1%7De%5E%7B-(%5Cmu%20+%20%5Calpha%5E%7B-1%7D)u%7D%5C,du%20%5C%5C%0A&amp;=%20%5Cfrac%7B%5Cmu%5Ek%5Calpha%5E%7B-1/%5Calpha%7D%7D%7B%5CGamma(%5Calpha%5E%7B-1%7D)%5CGamma(k+1)%7D%5Cint_0%5E%5Cinfty%20%5Cleft(%5Cfrac%7Bt%7D%7B%5Cmu+%5Calpha%5E%7B-1%7D%7D%5Cright)%5E%7Bk%20+%201/%5Calpha-1%7De%5E%7B-t%7D%5Cfrac%7B1%7D%7B%5Cmu%20+%20%5Calpha%5E%7B-1%7D%7D%5C,dt%20%5C%5C%0A&amp;=%5Cfrac%7B%5CGamma(k%20+%20%5Calpha%5E%7B-1%7D)%7D%7B%5CGamma(%5Calpha%5E%7B-1%7D)%5CGamma(k+1)%7D%20%5Cleft(%5Cfrac%7B%5Cmu%7D%7B%5Cmu%20+%20%5Calpha%5E%7B-1%7D%7D%5Cright)%5Ek%20%5Cleft(%5Cfrac%7B%5Calpha%5E%7B-1%7D%7D%7B%5Cmu%20+%20%5Calpha%5E%7B-1%7D%7D%5Cright)%5E%7B1/%5Calpha%7D%20.%0A%5Cend%7Balign*%7D"> This is <em>exactly</em> the negative binomial distribution with mean <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and variance <img src="https://latex.codecogs.com/png.latex?%5Cmu(1%20+%20%5Calpha%20%5Cmu)">.</p>
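<p>If you don’t feel like checking that integral by hand, a quick simulation makes the same point. Here’s a little sketch of my own (not from the original post): draw from the Poisson–Gamma mixture and check that the sample mean and variance match the negative binomial’s mean and variance.</p>

```python
# A simulation sanity check (my own sketch): draw from the Poisson-Gamma
# mixture and compare the sample mean and variance with the negative
# binomial values mu and mu * (1 + alpha * mu).
import numpy as np

rng = np.random.default_rng(2022)
mu, alpha, n = 5.0, 0.5, 200_000

# Gamma(1/alpha, 1/alpha) in shape-rate form is shape = 1/alpha, scale = alpha:
# it has mean 1 and variance alpha.
u = rng.gamma(shape=1 / alpha, scale=alpha, size=n)
y = rng.poisson(mu * u)

print(y.mean())  # close to mu = 5
print(y.var())   # close to mu * (1 + alpha * mu) = 17.5
```

<p>With 200,000 draws the sample variance lands comfortably near 17.5; a plain Poisson with the same mean would have variance 5, so the overdispersion is doing real work here.</p>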
<p>So maybe there are two parameters.</p>
<p>Does it make a difference? Sometimes. For instance, if you were following ordinary practice in Bayesian machine learning, you would (approximately) marginalise out <img src="https://latex.codecogs.com/png.latex?(%5Cmu,%20%5Calpha)"> in the first model, but in the second model you’d probably treat them as tuning hyper-parameters<sup>10</sup> and optimise<sup>11</sup> over them.</p>
<p>Moreover, in the second model we can ask <em>what other priors could we put on the</em> <img src="https://latex.codecogs.com/png.latex?u_i"><em>?</em>. There is no equivalent question for the first model. This could be useful, for instance, if we believe that the overdispersion may differ among population groups. It is considerably easier to extend the random effects formulation into a multilevel model.</p>
<p>Ok. So it doesn’t really matter too much. When you’re breaking your model into <em>things that we need to set priors for</em> and <em>things where the priors are a structural part of the model</em>, it really depends on what you’re going to do with the model.</p>
</section>
<section id="a-hello-boys-into-a-party-date-on-flexibility" class="level2">
<h2 class="anchored" data-anchor-id="a-hello-boys-into-a-party-date-on-flexibility">A hello boys into a party date: on flexibility</h2>
<p>There are a lot of ways to set prior distributions. I’ve covered some in previous posts and there are certainly more. But today I’m going to focus on one constructive method that I’m particularly fond of: <a href="https://projecteuclid.org/journals/statistical-science/volume-32/issue-1/Penalising-Model-Component-Complexity--A-Principled-Practical-Approach-to/10.1214/16-STS576.full">penalised complexity priors</a>.</p>
<p>These priors fall out from a certain way of seeing parameters. The idea is that some parameters in a model function as <em>flexibility parameters</em>. These naturally have a base value, which corresponds to the simplest model that they index. I’ll refer to the distribution you get when the parameter takes its base value as the <em>base model</em>.</p>
<div id="exm-neg-binom" class="theorem example">
<p><span class="theorem-title"><strong>Example 1 (Overdispersion of a negative binomial)</strong></span> The negative binomial distribution has two parameters: a mean <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and an overdispersion parameter <img src="https://latex.codecogs.com/png.latex?%5Calpha"> so the variance is <img src="https://latex.codecogs.com/png.latex?%5Cmu(1%20+%20%5Calpha%20%5Cmu)">. The mean parameter is <em>not</em> a flexibility parameter. Conceptually, changing the mean<sup>12</sup> does not make a distribution more or less complex, it simply shuttles it around.</p>
<p>On the other hand, the overdispersion parameter <img src="https://latex.codecogs.com/png.latex?%5Calpha"> <em>is</em> a flexibility parameter. Its special value is <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=0">, which corresponds to a Poisson distribution, the base model for the negative binomial distribution.</p>
</div>
<div id="exm-student-t" class="theorem example">
<p><span class="theorem-title"><strong>Example 2 (Student-t degrees of freedom)</strong></span> The three-parameter student-t distribution has density (parameterised by its standard deviation assuming <img src="https://latex.codecogs.com/png.latex?%5Cnu%20%3E%202">!) <img src="https://latex.codecogs.com/png.latex?%0Ap(y%20%5Cmid%20%5Cmu,%20%5Csigma,%20%5Cnu)%20=%20%5Cfrac%7B%5CGamma%5Cleft(%5Cfrac%7B%5Cnu%20+%201%7D%7B2%7D%5Cright)%7D%7B%5Csigma%5Cnu%20%5Csqrt%7B%5Cfrac%7B%5Cpi%7D%7B%5Cnu-2%7D%7D%20%5CGamma%5Cleft(%5Cfrac%7B%5Cnu%7D%7B2%7D%5Cright)%7D%5Cleft(1%20+%20%5Cfrac%7B%5Cfrac%7B%5Cnu-2%7D%7B%5Cnu%7D%5Cleft(%5Cfrac%7By%20-%20%5Cmu%7D%7B%5Csigma%7D%5Cright)%5E2%7D%7B%5Cnu%7D%5Cright)%5E%7B-%5Cfrac%7B%5Cnu+1%7D%7B2%7D%7D,%20%5Cqquad%20%5Cnu%20%3E%202.%0A"> This has mean <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2">. The slightly strange parameterisation and the restriction to <img src="https://latex.codecogs.com/png.latex?%5Cnu%3E2"> are useful because it lets us specify a prior on the <em>variance</em> itself and not some parameter that is the variance divided by some function<sup>13</sup> of <img src="https://latex.codecogs.com/png.latex?%5Cnu">.</p>
<p>The natural base model here is <img src="https://latex.codecogs.com/png.latex?N(%5Cmu,%20%5Csigma%5E2)">, which corresponds to <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%20%5Cinfty">.</p>
</div>
<div id="exm-gaussian" class="theorem example">
<p><span class="theorem-title"><strong>Example 3 (Variance of a Gaussian random effect)</strong></span> A Gaussian distribution has two parameters: a mean <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and a standard deviation <img src="https://latex.codecogs.com/png.latex?%5Ctau">. Once again, <img src="https://latex.codecogs.com/png.latex?%5Cmu"> is not a flexibility parameter, but in some circumstances <img src="https://latex.codecogs.com/png.latex?%5Ctau"> can be.</p>
<p>To see this, imagine that we have a simple random intercept model <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ay_%7Bij%7D%20%5Cmid%20u_j%20&amp;%5Csim%20N(u_j,%20%5Csigma%5E2),%5Cqquad%20i=1,%5Cldots,n,%20j%20=1,%5Cldots,J%20%5C%5C%0Au_j%20&amp;%5Csim%20N(%5Cmu,%20%5Ctau).%0A%5Cend%7Balign*%7D"> In this case, we don’t really view <img src="https://latex.codecogs.com/png.latex?%5Csigma"> as a flexibility parameter, but <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is. Why the distinction? Well let’s think about what happens at special value <img src="https://latex.codecogs.com/png.latex?0">.</p>
<p>When <img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%200"> we are saying that there is no variability in the data if we know the corresponding <img src="https://latex.codecogs.com/png.latex?u_j">. This is, frankly, quite weird and it’s not necessarily a base model we would believe<sup>14</sup> in.</p>
<p>On the other hand, if <img src="https://latex.codecogs.com/png.latex?%5Ctau%20=0">, then we are saying that all of the groups have the same mean. This is a useful and interesting base model that could absolutely happen in most data. So we say that while <img src="https://latex.codecogs.com/png.latex?%5Csigma"> isn’t necessarily a flexibility parameter in the model, <img src="https://latex.codecogs.com/png.latex?%5Ctau"> definitely is.</p>
<p>In this case the base model is the degenerate distribution<sup>15</sup> where the mean of each group is equal to <img src="https://latex.codecogs.com/png.latex?%5Cmu">.</p>
</div>
<p>The last example shows that the idea of a flexibility parameter is deeply contextual. Once again, we run into the idea that Statistical Arianism<sup>16</sup> is bad. <em>Parameters and their prior distributions can only be fully understood if you know their context within the entire model.</em></p>
</section>
<section id="sure-youre-flexible-but-lets-not-over-do-the-dutch-wink" class="level2">
<h2 class="anchored" data-anchor-id="sure-youre-flexible-but-lets-not-over-do-the-dutch-wink">Sure you’re flexible, but let’s not over-do the Dutch wink</h2>
<p>Now that we have the concept of a flexibility parameter, let’s think about how we should use it. In particular, we should ask exactly what we want our prior to do. In <a href="https://projecteuclid.org/journals/statistical-science/volume-32/issue-1/Penalising-Model-Component-Complexity--A-Principled-Practical-Approach-to/10.1214/16-STS576.full">the paper</a> we listed 8 things that we want the prior to do:</p>
<ol type="1">
<li>The prior should contain information<sup>17</sup> <sup>18</sup> <sup>19</sup></li>
<li>The prior should be aware of model structure</li>
<li>If we move our model to a new application, it should be clear how we can change the information contained in our prior. We can do this by <em>explicitly</em> including specific information in the prior.</li>
<li>The prior should limit<sup>20</sup> the flexibility of an overparameterised model</li>
<li>Restrictions of the prior to identifiable sub-manifolds<sup>21</sup> of the parameter space should be sensible.</li>
<li>The prior should be specified to control what a parameter <em>does</em> in the context<sup>22</sup> of the model (rather than its numerical value)</li>
<li>The prior should be computationally<sup>23</sup> feasible</li>
<li>The prior should perform well<sup>24</sup>.</li>
</ol>
<p>These desiderata are <em>aspirational</em> and I in no way claim that we successfully satisfied them. But we tried. And we came up with a pretty useful proposal.</p>
<p>The idea is simple: if our model has a flexibility parameter we should put a prior on it that <em>penalises the complexity</em> of the model. That is, we want most of the prior mass to be near<sup>25</sup> the base value.</p>
<p>In practice, we try to do this by penalising the complexity of each <em>component</em> of a model. For instance, consider the following model for a flexible regression: <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ay_i%20%5Cmid%20f,%20u_i%20&amp;%5Csim%20N(u_i%20+f(z_i),%20%5Csigma%5E2)%20%5C%5C%0Af%20&amp;%5Csim%20%5Ctext%7BSmoothing-spline%7D(%5Clambda)%5C%5C%0Au_i%20&amp;%5Csim%20N(%20%5Cmu%20+%20x_i%5ET%5Cbeta%20,%20%5Ctau%5E2).%0A%5Cend%7Balign*%7D"> The exact definition<sup>26</sup> of a smoothing spline that we are using is not wildly important, but it is specified<sup>27</sup> by a smoothing parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda">, and when <img src="https://latex.codecogs.com/png.latex?%5Clambda=%5Cinfty"> we get our base model (a function that is equal to zero everywhere). This model has two components (<img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?u">) and they each have one smoothing parameter (<img src="https://latex.codecogs.com/png.latex?%5Clambda">, with base model at <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cinfty">, and <img src="https://latex.codecogs.com/png.latex?%5Ctau">, with base model at <img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%200">).</p>
<p>The nice thing about splitting a model up into components and building priors for each component is that we can build generic priors for each component that can potentially be tuned to make them appropriate for the global model. Is this a perfect way to realise our second aim? No.&nbsp;But it’s an ok place to start<sup>28</sup>.</p>
</section>
<section id="the-speed-of-a-battered-sav-proximity-to-the-base-model" class="level2">
<h2 class="anchored" data-anchor-id="the-speed-of-a-battered-sav-proximity-to-the-base-model">The speed of a battered sav: proximity to the base model</h2>
<p>Ok. So you’re Brad Pitt. Wait. No.</p>
<p>Ok. So we need to build a prior that penalises complexity by putting most of its prior mass near the base model. In order to do this we need to first specify what we mean by <em>near</em>.</p>
<p>There are <em>a lot</em> of things that we could mean. The easiest choice would be to just use the natural distance from the base model in the parameter space. But this isn’t necessarily a good idea. Firstly, it falls flat when the base model is at infinity. But more importantly, it violates our 6th aim by ignoring the context of the parameter and just setting a prior on its numerical value.</p>
<p>So instead we are going to parameterise distance by asking ourselves a simple question: for a component with flexibility parameter <img src="https://latex.codecogs.com/png.latex?%5Cxi">, how much more complex would our model component be if we used the value <img src="https://latex.codecogs.com/png.latex?%5Cxi"> instead of the base value <img src="https://latex.codecogs.com/png.latex?%5Cxi_%5Ctext%7Bbase%7D">?</p>
<p>We can measure this complexity using the Kullback-Leibler divergence (or KL divergence if you’re nasty) <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BKL%7D(f%20%7C%7C%20g)%20=%20%5Cint_%5CTheta%20f(t)%20%5Clog%5Cleft(%5Cfrac%7Bf(t)%7D%7Bg(t)%7D%5Cright)%5C,dt.%0A"> This is a quantity from information theory that directly measures how much information would be lost<sup>29</sup> if we replaced the more complex model <img src="https://latex.codecogs.com/png.latex?f"> with the simpler model <img src="https://latex.codecogs.com/png.latex?g">. The more information that would be lost, the more complex <img src="https://latex.codecogs.com/png.latex?f"> is relative to <img src="https://latex.codecogs.com/png.latex?g">.</p>
<p>While the Kullback-Leibler divergence looks a bit intimidating the first time you see it, it’s got a lot of nice properties:</p>
<ul>
<li><p>It’s always non-negative.</p></li>
<li><p>It doesn’t depend on how you parameterise the distribution. If you do a smooth, invertible change of variables to both distributions, the KL divergence remains unchanged.</p></li>
<li><p>It’s related to the information matrix and the Fisher distance. In particular, let <img src="https://latex.codecogs.com/png.latex?f(%5Ctheta%20%5Cmid%20%5Cxi)"> be a family of distributions parameterised by <img src="https://latex.codecogs.com/png.latex?%5Cxi">. Then, near <img src="https://latex.codecogs.com/png.latex?%5Cxi_0">, <img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BKL%7D(f(%5Ccdot%20%5Cmid%20%5Cxi_0%20+%5Cdelta)%20%20%7C%7C%20f(%5Ccdot%20%5Cmid%20%5Cxi_0))%20=%20%5Cfrac%7B%5Cdelta%5E2%7D%7B2%7D%20I(%5Cxi_0)%20+%20o(%5Cdelta%5E2),%0A"> where <img src="https://latex.codecogs.com/png.latex?I(%5Cxi)%20=%20%5Cmathbb%7BE%7D%5Cleft%5B%5Cleft(%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%5Cxi%7D%5Clog%20f(y%20%5Cmid%20%5Cxi)%5Cright)%5E2%5Cright%5D"> is the Fisher information. The quantity on the right hand side is the square of a distance from the base model.</p></li>
<li><p>It can be related to the total variation distance<sup>30</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5C%7Cf%20-%20g%5C%7C_%5Ctext%7BTV%7D%20%5Cleq%20%5Csqrt%7B%5Cfrac%7B1%7D%7B2%7D%20%5Coperatorname%7BKL%7D(f%20%7C%7C%20g)%7D.%0A"></p></li>
</ul>
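<p>Two of these properties are easy to check numerically. Here’s a little sketch of my own (not from the original paper): the KL divergence between two Gaussians separated by a small mean shift <code>delta</code> is exactly <code>delta**2 / 2</code> (matching the Fisher-information expansion, since a unit-variance Gaussian mean has Fisher information 1), and pushing both densities through the smooth invertible map <code>x -&gt; exp(x)</code> (giving lognormals) leaves the divergence unchanged.</p>

```python
# Numerical check (my own sketch) of two KL properties: the quadratic
# Fisher-information approximation and reparameterisation invariance.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def kl(f, g, lo, hi):
    """KL(f || g) = int f log(f/g), guarding tail underflow to zero."""
    def integrand(x):
        fx, gx = f(x), g(x)
        return 0.0 if fx == 0.0 or gx == 0.0 else fx * np.log(fx / gx)
    return quad(integrand, lo, hi)[0]

delta = 0.1  # a small mean shift away from the N(0, 1) base model

# KL(N(delta, 1) || N(0, 1)) is exactly delta^2 / 2.
kl_norm = kl(stats.norm(delta, 1).pdf, stats.norm(0, 1).pdf, -np.inf, np.inf)

# Push both densities through x -> exp(x): the divergence between the
# resulting lognormals is the same.
kl_lognorm = kl(stats.lognorm(s=1, scale=np.exp(delta)).pdf,
                stats.lognorm(s=1, scale=1).pdf, 0, np.inf)

print(kl_norm, kl_lognorm, delta**2 / 2)
```

<p>All three printed numbers agree, which is a nice concrete view of the invariance property.</p>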
<p>But it also has some less charming properties:</p>
<ul>
<li>The KL divergence is <em>not</em> a distance!</li>
<li>The KL divergence is <em>not</em> symmetric, that is <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7BKL%7D(f%20%7C%7C%20g)%20%5Cneq%20%5Coperatorname%7BKL%7D(g%20%7C%7C%20f)"></li>
</ul>
<p>The first of these properties is irrelevant to us. The second is interesting. I’d argue that it is an advantage. We can think of an analogy: if your base model is a point at the bottom of a valley, there is a big practical difference between how much effort it takes to get from the base model to another model that is on top of a hill compared to the amount of effort it takes to go in the other direction. This type of asymmetry is relevant to us: it’s easier for data to tell a simple model that it should be more complex than it is to tell a complex model to be simpler. We want our prior information to somewhat even this out, so we put less prior mass on models that are more complex and more on models that are simpler.</p>
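<p>You can see the asymmetry directly with a small numerical sketch of my own: take a Gamma(2, 1) density (the more complex model) and an Exponential(1) density (the simpler one) and compute the divergence in both directions.</p>

```python
# The asymmetry in action (my own sketch): the KL divergence between a
# Gamma(2, 1) and an Exponential(1) depends on which way round you take it.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def kl(f, g):
    # integrate f log(f/g) over (0, inf), guarding tail underflow to zero
    def integrand(x):
        fx, gx = f(x), g(x)
        return 0.0 if fx == 0.0 or gx == 0.0 else fx * np.log(fx / gx)
    return quad(integrand, 0, np.inf)[0]

f = stats.gamma(a=2).pdf  # the more complex density
g = stats.expon().pdf     # the simpler density

# Analytically these are 1 - gamma and gamma (the Euler-Mascheroni
# constant), so the two directions genuinely differ.
print(kl(f, g), kl(g, f))
```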
<p>There is one more little annoyance: if you look at the two distance measures that the KL divergence is related to, you’ll notice that in both cases, the KL divergence is related to the <em>square</em> of the distance and not the distance itself.</p>
<p>If we use the KL divergence itself as a distance proxy, it will increase too sharply<sup>31</sup> and we may end up over-penalising. To that end, we use the following “distance” measure <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Cxi)%20=%20%5Csqrt%7B2%20%5Coperatorname%7BKL%7D(f(%5Ccdot%20%5Cmid%20%5Cxi)%20%7C%7C%20f(%5Ccdot%20%5Cmid%20%5Cxi_0))%7D.%0A"> If you’re wondering about that 2, it doesn’t really matter but it makes a couple of things ever so slightly cleaner down the road.</p>
<p>Ok. Let’s compute some of these distances!</p>
<div id="exm-neg-binom2" class="theorem example">
<p><span class="theorem-title"><strong>Example 4 (Overdispersion of a negative binomial (continued))</strong></span> The negative binomial distribution is discrete so <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Bmultline%7D%0A%5Cfrac%7B1%7D%7B2%7Dd%5E2(%5Calpha)%20=%20%5Csum_%7Bk=1%7D%5E%5Cinfty%20%5Cfrac%7B%5CGamma(k%20+%20%5Calpha%5E%7B-1%7D)%7D%7B%5CGamma(%5Calpha%5E%7B-1%7D)%5CGamma(k+1)%7D%20%20%5Cleft(%5Cfrac%7B%5Cmu%7D%7B%5Cmu%20+%20%5Calpha%5E%7B-1%7D%7D%5Cright)%5Ek%20%5Cleft(%5Cfrac%7B%5Calpha%5E%7B-1%7D%7D%7B%5Cmu%20+%20%5Calpha%5E%7B-1%7D%7D%5Cright)%5E%7B1/%5Calpha%7D%20%5C%5C%0A%5Ctimes%20%5Cleft%5B%5Clog%20%5CGamma(k%20%20+%5Calpha%5E%7B-1%7D)%20-%20%5Clog%20%5CGamma(%5Calpha%5E%7B-1%7D)%20%20-%20k%20%5Clog(%5Cmu%20+%20%5Calpha%5E%7B-1%7D)%5Cright.%20%5C%5C%20%5Cleft.%20+%20%5Calpha%5E%7B-1%7D%5Clog%20%5Cleft(%5Calpha%5E%7B-1%7D(%5Cmu%20+%20%5Calpha%5E%7B-2%7D)%5Cright)%20%20+%20%5Cmu%20%5Cright%5D.%0A%5Cend%7Bmultline%7D"> This has two problems: I can’t work out what it is and it might<sup>32</sup> end up depending on <img src="https://latex.codecogs.com/png.latex?%5Cmu">.</p>
<p>Thankfully we can use our alternative representation of the negative binomial to note that <img src="https://latex.codecogs.com/png.latex?u_i%20%5Csim%20%5Ctext%7BGamma%7D(%5Calpha%5E%7B-1%7D,%20%5Calpha%5E%7B-1%7D)"> and so we could just as well consider <img src="https://latex.codecogs.com/png.latex?u_i"> the model component that we want to penalise the complexity of. In this case we need the KL divergence<sup>33</sup> <a href="https://en.wikipedia.org/wiki/Gamma_distribution#Kullback–Leibler_divergence">between Gamma distributions</a> <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Coperatorname%7BKL%7D(%5Ctext%7BGamma%7D(a%5E%7B-1%7D,a%5E%7B-1%7D)%20%7C%7C%20%5Ctext%7BGamma%7D(b%5E%7B-1%7D,b%5E%7B-1%7D))%20=&amp;%20(a%5E%7B-1%7D-b%5E%7B-1%7D)%20%5Cpsi(a%5E%7B-1%7D)%20%5C%5C%20&amp;%5Cquad-%20%5Clog%5CGamma(a%5E%7B-1%7D)%20+%20%5Clog%5CGamma(b%5E%7B-1%7D)%20%5C%5C%20&amp;%5Cquad%0A+%20b%5E%7B-1%7D(%5Clog%20a%5E%7B-1%7D%20-%20%5Clog%20b%5E%7B-1%7D)%5C%5C%20&amp;%5Cquad%20+%20b%5E%7B-1%7D-a%5E%7B-1%7D,%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?%5Cpsi(a)"> is the <a href="https://en.wikipedia.org/wiki/Digamma_function">digamma function</a>.</p>
<p>As <img src="https://latex.codecogs.com/png.latex?b%5Crightarrow%200">, the KL divergence becomes<sup>34</sup> <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A&amp;b%5E%7B-1%7D%20%20(%5Clog(a%5E%7B-1%7D)%20-%20%5Cpsi(a%5E%7B-1%7D))%20+%20%5Clog%5CGamma(b%5E%7B-1%7D)%20-%20b%5E%7B-1%7D%5Clog%20b%5E%7B-1%7D%20+%20b%5E%7B-1%7D%20%20+%20o(b%5E%7B-1%7D)%5C%5C%0A=&amp;%20b%5E%7B-1%7D%20(%5Clog(a%5E%7B-1%7D)%20-%20%5Cpsi(a%5E%7B-1%7D))%20+%20b%5E%7B-1%7D%20%5Clog%20b%5E%7B-1%7D%20-%20b%5E%7B-1%7D%20-%20b%5E%7B-1%7D%5Clog%20b%5E%7B-1%7D%20+%20b%5E%7B-1%7D%20%5C%5C%0A=%20&amp;b%5E%7B-1%7D%20%20(%5Clog(a%5E%7B-1%7D)%20-%20%5Cpsi(a%5E%7B-1%7D))%20+%20o(b%5E%7B-1%7D).%0A%5Cend%7Balign*%7D"></p>
<p>Now, you will notice that as <img src="https://latex.codecogs.com/png.latex?b%5Crightarrow%200"> the KL divergence heads off to infinity. This happens a lot when the base model is much simpler than the flexible model. Thankfully, we will see later that we can ignore the factor of <img src="https://latex.codecogs.com/png.latex?b%5E%7B-1%7D"> and get a PC prior that’s valid against the base model <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BGamma%7D(b%5E%7B-1%7D,%20b%5E%7B-1%7D)"> for <em>all</em> sufficiently small <img src="https://latex.codecogs.com/png.latex?b%3E0">. This is not legally the same thing as having one for <img src="https://latex.codecogs.com/png.latex?b=0">, but it is morally the same.</p>
<p>With this, we get <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Calpha)%20=%20%5Csqrt%7B2%5Clog(%5Calpha%5E%7B-1%7D)%20-%202%5Cpsi(%5Calpha%5E%7B-1%7D)%20%7D.%0A"></p>
<p>If the digamma function is a bit too hardcore for you, the <a href="https://functions.wolfram.com/GammaBetaErf/PolyGamma/06/02/">approximation</a> <img src="https://latex.codecogs.com/png.latex?%0A%5Cpsi(%5Calpha%5E%7B-1%7D)%20=%20%5Clog(%5Calpha%5E%7B-1%7D)%20-%20%5Cfrac%7B%5Calpha%7D%7B2%7D%20+%20%5Cmathcal%7BO%7D(%5Calpha%5E2)%0A"> gives the approximate distance <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Calpha)%20%5Capprox%20%5Csqrt%7B%5Calpha%7D.%0A"> That is, the distance we are using is approximately the <em>standard deviation</em> of <img src="https://latex.codecogs.com/png.latex?u_i">.</p>
<p>Let’s see if this approximation<sup>35</sup> is any good.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>),</span>
<span id="cb1-3">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">exact =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>alpha) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">digamma</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>alpha)),</span>
<span id="cb1-4">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">approx =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(alpha)</span>
<span id="cb1-5">       ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> alpha, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> exact)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb1-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> approx), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>It’s ok but it’s not perfect.</p>
</div>
<div id="exm-student-t-2" class="theorem example">
<p><span class="theorem-title"><strong>Example 5 (Student-t degrees of freedom (Continued))</strong></span> In our original paper, we computed the distance for the degrees of freedom numerically. However, <a href="https://arxiv.org/pdf/1811.08042.pdf">Yongqiang Tang</a> derived an analytic expression for it. <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Cnu)%20=%20%5Csqrt%7B1%20+%20%20%5Clog%5Cleft(%5Cfrac%7B2%5CGamma((%5Cnu+1)/2)%5E2%7D%7B(%5Cnu-2)%5CGamma(%5Cnu/2)%5E2%7D%5Cright)%20-%20(%5Cnu%20+%201)(%5Cpsi((%5Cnu+1)/2)%20-%20%5Cpsi(%5Cnu/2))%7D.%0A"></p>
<p>If we note that <img src="https://latex.codecogs.com/png.latex?%0A%5Clog(%5CGamma(z))%20=%20%5Cleft(z-%20%5Cfrac%7B1%7D%7B2%7D%5Cright)%5Clog%20z%20-%20z%20+%20%5Cfrac%7B1%7D%7B2%7D%5Clog(2%5Cpi)%20%20+%20%5Cfrac%7B1%7D%7B12z%7D%20+%20%5Cmathcal%7BO%7D(z%5E%7B-3%7D),%0A"> we can use this (together with the asymptotic expansion of the digamma function used above) to get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ad(%5Cnu)%5E2%20%5Capprox&amp;%20%7B%7D%201%20+%20%5Clog%20%5Cleft(%5Cfrac%7B2%7D%7B%5Cnu-2%7D%5Cright)%20%5C%5C%0A&amp;%5Cquad%20%7B%7D%20+%202%5Cleft(%5Cfrac%7B%5Cnu%7D%7B2%7D%5Clog%20%5Cfrac%7B%5Cnu+1%7D%7B2%7D%20-%20%5Cfrac%7B%5Cnu+1%7D%7B2%7D%20+%20%5Cfrac%7B1%7D%7B2%7D%5Clog(2%5Cpi)%20%20+%20%5Cfrac%7B1%7D%7B6(%5Cnu+1)%7D%5Cright)%20%5C%5C%0A&amp;%5Cquad%20-2%5Cleft(%5Cfrac%7B%5Cnu-1%7D%7B2%7D%5Clog%20%5Cfrac%7B%5Cnu%7D%7B2%7D%20-%20%5Cfrac%7B%5Cnu%7D%7B2%7D%20+%20%5Cfrac%7B1%7D%7B2%7D%5Clog(2%5Cpi)%20%20+%20%5Cfrac%7B1%7D%7B6%5Cnu%7D%5Cright)%20%5C%5C%0A&amp;%5Cquad%20%7B%7D%20-%20(%5Cnu%20+%201)(%5Clog((%5Cnu+1)/2)%20-%20%5Cfrac%7B1%7D%7B%5Cnu+1%7D-%20%5Clog(%5Cnu/2)%20+%20%5Cfrac%7B1%7D%7B%5Cnu%7D)%20%5C%5C%0A=&amp;%20%5Clog%20%5Cleft(%5Cfrac%7B%5Cnu%5E2%7D%7B(%5Cnu+1)(%5Cnu-2)%7D%5Cright)%20%20%20-%20%5Cfrac%7B%5Cnu%20+2%7D%7B3%5Cnu(%5Cnu+1)%7D.%0A%5Cend%7Balign*%7D"></p>
<p>Let’s check this approximation numerically.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nu =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>),</span>
<span id="cb2-2">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">exact =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(nu<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-2</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb2-3">                      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lgamma</span>((nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lgamma</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> </span>
<span id="cb2-4">                      (nu <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">digamma</span>((nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span></span>
<span id="cb2-5">                                   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">digamma</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))),</span>
<span id="cb2-6">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">approx =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>((nu<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)))</span>
<span id="cb2-7">       ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> nu, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> exact)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb2-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> approx), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4_files/figure-html/unnamed-chunk-2-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Once again, this is not a terrible approximation, but it’s also not an excellent one.</p>
</div>
<div id="exm-gaussian-2" class="theorem example">
<p><span class="theorem-title"><strong>Example 6 (Variance of a Gaussian random effect (Continued))</strong></span> The distance calculation for the standard deviation of a Gaussian random effect has a very similar structure to the negative binomial case. We note, via Wikipedia, that <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Coperatorname%7BKL%7D(N(%5Cmu,%20%5Ctau%5E2)%20%7C%7C%20N(%5Cmu,%20%5Cepsilon%5E2))%20&amp;=%20%5Clog%20%5Cfrac%7B%5Cepsilon%7D%7B%5Ctau%7D%20+%20%5Cfrac%7B%5Ctau%5E2%7D%7B2%5Cepsilon%5E2%7D%20-%20%5Cfrac%7B1%7D%7B2%7D%20%20%5C%5C%0A&amp;=%20%5Cfrac%7B%5Ctau%5E2%7D%7B2%5Cepsilon%5E2%7D%5Cleft(1%20+%20%5Cfrac%7B2%5Cepsilon%5E2%7D%7B%5Ctau%5E2%7D%5Clog%20%5Cfrac%7B%5Cepsilon%7D%7B%5Ctau%7D%20-%20%5Cfrac%7B%5Cepsilon%5E2%7D%7B%5Ctau%5E2%7D%5Cright).%0A%5Cend%7Balign*%7D"></p>
<p>This implies that <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Ctau)%20=%20%5Cepsilon%5E%7B-1%7D%5Ctau%20+%20o(%5Cepsilon%5E%7B-1%7D).%0A"> We shall see later that the scaling on the <img src="https://latex.codecogs.com/png.latex?%5Ctau"> doesn’t matter, so for all intents and purposes <img src="https://latex.codecogs.com/png.latex?%0Ad(%5Ctau)%20=%20%5Ctau.%0A"></p>
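<p>As a quick numerical sanity check of this limit (a sketch in Python rather than R; the function names are mine), we can verify that scaling the distance by the base-model standard deviation recovers the flexibility parameter, i.e. that <code>eps * d(tau)</code> tends to <code>tau</code> as <code>eps</code> shrinks:</p>

```python
import math

# KL(N(mu, tau^2) || N(mu, eps^2)) for equal means.
def kl_gauss(tau, eps):
    return math.log(eps / tau) + tau**2 / (2 * eps**2) - 0.5

# The PC-prior distance is the square root of twice the KL divergence.
def distance(tau, eps):
    return math.sqrt(2 * kl_gauss(tau, eps))

tau = 1.7
for eps in [1e-2, 1e-4, 1e-6]:
    # eps * d(tau) should approach tau as eps -> 0
    print(eps, eps * distance(tau, eps))
```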
</div>
</section>
<section id="spinning-off-the-flute-into-a-flat-bag-turning-a-distance-into-a-prior" class="level2">
<h2 class="anchored" data-anchor-id="spinning-off-the-flute-into-a-flat-bag-turning-a-distance-into-a-prior">Spinning off the flute into a flat bag: Turning a distance into a prior</h2>
<p>So now that we have a distance measure, we need to turn it into a prior. There are lots of ways we can do this. Essentially any prior we put on the distance <img src="https://latex.codecogs.com/png.latex?d(%5Cxi)"> can be transformed into a prior on the flexibility parameter <img src="https://latex.codecogs.com/png.latex?%5Cxi">. We do this through the change of variables formula <img src="https://latex.codecogs.com/png.latex?%0Ap_%5Cxi(%5Cxi)%20=%20p_d(d(%5Cxi))%5Cleft%7C%5Cfrac%7Bd%7D%7Bd%5Cxi%7D%20d(%5Cxi)%5Cright%7C,%0A"> where <img src="https://latex.codecogs.com/png.latex?p_d(%5Ccdot)"> is the prior density for the distance parameterisation.</p>
<p>But which prior should we use on the distance? A good default choice is a prior that penalises at a constant rate. That is, we want <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bp_d(d%20+%20%5Cdelta)%7D%7Bp_d(d)%7D%20=%20r%5E%7B%5Cdelta%7D%0A"> for some <img src="https://latex.codecogs.com/png.latex?0%3Cr%3C1">. This condition says that the rate at which the density decreases does not change as we move through the parameter space. This is extremely useful because any other (monotone) distribution is going to have a point at which the bulk changes to the tail. As we are putting our prior on <img src="https://latex.codecogs.com/png.latex?d">, we won’t necessarily be able to reason about this point.</p>
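<p>To spell out why constant-rate penalisation pins down a single family (a short derivation sketch):</p>

```latex
\frac{p_d(d+\delta)}{p_d(d)} = r^{\delta} \ \text{ for all } d, \delta > 0
\;\Longrightarrow\;
\log p_d(d) = \log p_d(0) + d \log r
\;\Longrightarrow\;
p_d(d) = \lambda e^{-\lambda d}, \qquad \lambda = -\log r > 0,
```

where the first implication says that the log-density is affine in the distance, and the last step normalises the density over <img src="https://latex.codecogs.com/png.latex?(0,%20%5Cinfty)">.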
<p>Constant-rate penalisation implies that the prior on the distance scale is an exponential distribution and, hence, we get our generic PC prior for a flexibility parameter <img src="https://latex.codecogs.com/png.latex?%5Cxi"> <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cxi)%20=%20%5Clambda%20e%5E%7B-%5Clambda%20d(%5Cxi)%7D%5Cleft%7C%5Cfrac%7Bd%7D%7Bd%5Cxi%7D%20d(%5Cxi)%5Cright%7C.%0A"></p>
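<p>Equivalently, we can sample from this prior by drawing the distance from an exponential and inverting the distance map. A small Python sketch (helper names are mine) for the illustrative distance <img src="https://latex.codecogs.com/png.latex?d(%5Cxi)%20=%20%5Csqrt%7B%5Cxi%7D">, which is the approximate negative-binomial distance from earlier, checks both the density and the sampling route:</p>

```python
import math
import random

lam = 1.5

# Generic PC prior for the illustrative distance d(xi) = sqrt(xi):
# p(xi) = lam * exp(-lam * d(xi)) * |d'(xi)| = lam/(2 sqrt(xi)) * exp(-lam sqrt(xi))
def pc_density(xi):
    return lam / (2 * math.sqrt(xi)) * math.exp(-lam * math.sqrt(xi))

# The density should integrate to 1 (midpoint rule on (0, 60);
# the tail beyond 60 is negligible for lam = 1.5).
n = 200_000
h = 60 / n
total = sum(pc_density((i + 0.5) * h) * h for i in range(n))
print(total)  # close to 1

# Sampling route: draw D ~ Exp(lam) on the distance scale, set xi = D^2.
random.seed(1)
draws = [(-math.log(1 - random.random()) / lam) ** 2 for _ in range(200_000)]
frac = sum(x < 1.0 for x in draws) / len(draws)
# P(xi < 1) = P(D < 1) = 1 - exp(-lam)
print(frac, 1 - math.exp(-lam))
```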
<div id="exm-neg-bin-3" class="theorem example">
<p><span class="theorem-title"><strong>Example 7 (Overdispersion of a negative binomial (continued))</strong></span> The exact PC prior for the overdispersion parameter in the negative binomial distribution is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Calpha)%20=%20%5Cfrac%7B%5Clambda%7D%7B%5Calpha%5E%7B2%7D%7D%5Cfrac%7B%5Cleft%7C%5Cpsi'%5Cleft(%5Calpha%5E%7B-1%7D%5Cright)-%5Calpha%5Cright%7C%7D%7B%20%5Csqrt%7B2%20%5Clog%20(%5Calpha%5E%7B-1%7D)%20-%202%20%5Cpsi(%5Calpha%5E%7B-1%7D)%7D%7D%20%5Cexp%20%5Cleft%5B%20-%5Clambda%20%5Csqrt%7B2%20%5Clog%20(%5Calpha%5E%7B-1%7D)%20-%202%20%5Cpsi(%5Calpha%5E%7B-1%7D)%7D%5Cright%5D,%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cpsi'(%5Ccdot)"> is the derivative of the digamma function.</p>
<p>On the other hand, if we use the approximate distance we get <img src="https://latex.codecogs.com/png.latex?%0Ap_%5Ctext%7Bapprox%7D(%5Calpha)%20=%20%5Cfrac%7B%5Clambda%7D%7B2%5Csqrt%7B%5Calpha%7D%7D%20e%5E%7B-%5Clambda%20%5Csqrt%7B%5Calpha%7D%7D.%0A"></p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb3-1">lambda <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-2">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>),</span>
<span id="cb3-3">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">exact =</span> lambda <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> alpha<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">trigamma</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>alpha) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> alpha)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span>
<span id="cb3-4">         <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>alpha) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span></span>
<span id="cb3-5">                <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">digamma</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>alpha))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb3-6">         <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>lambda<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>alpha) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> </span>
<span id="cb3-7">                            <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">digamma</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>alpha))),</span>
<span id="cb3-8">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">approx =</span> lambda<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(alpha))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>lambda<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(alpha))</span>
<span id="cb3-9">       ) </span>
<span id="cb3-10">dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> alpha, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> exact)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb3-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> approx), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb4-1">dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> alpha, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> exact <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> approx)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4_files/figure-html/unnamed-chunk-3-2.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>That’s a pretty good agreement!</p>
</div>
<div id="exm-student-t-3" class="theorem example">
<p><span class="theorem-title"><strong>Example 8 (Student-t degrees of freedom (Continued))</strong></span> An interesting feature of the PC prior (and any prior where the density on the distance scale takes its maximum at the base model) is that the implied prior on <img src="https://latex.codecogs.com/png.latex?%5Cnu"> has no finite moments. In fact, if your prior on <img src="https://latex.codecogs.com/png.latex?%5Cnu"> has finite moments, the density on the distance scale is zero at zero!</p>
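<p>A heuristic for the missing moments, using the approximate distance from the previous example (a sketch, not a proof):</p>

```latex
% For large \nu the approximate squared distance behaves like
d(\nu)^2 = \log\frac{\nu^2}{(\nu+1)(\nu-2)} - \frac{\nu+2}{3\nu(\nu+1)}
         \approx \frac{1}{\nu} - \frac{1}{3\nu} = \frac{2}{3\nu},
% so d(\nu) \asymp \nu^{-1/2} and |d'(\nu)| \asymp \nu^{-3/2}.
% Near the base model e^{-\lambda d(\nu)} \to 1, so the PC prior has the
% polynomial tail
p(\nu) \asymp \nu^{-3/2},
% and \int^{\infty} \nu \, \nu^{-3/2} \, d\nu = \infty: not even the mean exists.
```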
<p>The exact PC prior for the degrees of freedom in a Student-t distribution is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cnu)%20=%20%5Clambda%20%5Cfrac%7B%5Cfrac%7B1%7D%7B%5Cnu-2%7D%20+%20%5Cfrac%7B%5Cnu+1%7D%7B2%7D%5Cleft%5B%5Cpsi'%5Cleft(%5Cfrac%7B%5Cnu+1%7D%7B2%7D%5Cright)-%5Cpsi'%5Cleft(%5Cfrac%7B%5Cnu%7D%7B2%7D%5Cright)%5Cright%5D%7D%7B4d(%5Cnu)%7De%5E%7B-%5Clambda%20d(%5Cnu)%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?d(%5Cnu)"> is given above.</p>
<p>The approximate PC prior is <img src="https://latex.codecogs.com/png.latex?%0Ap_%5Ctext%7Bapprox%7D(%5Cnu)%20=%20%5Clambda%5Cfrac%7B%5Cnu(%5Cnu+2)(2%5Cnu+9)%20+%204%7D%7B3%5Cnu%5E2(%5Cnu+1)%5E2(%5Cnu-2)%7D%20%5Cleft(%5Cfrac%7B%5Cnu%5E2%7D%7B(%5Cnu+1)(%5Cnu-2)%7D%5Cright)%5E%5Clambda%20e%5E%7B%20%20%20-%20%5Clambda%5Cfrac%7B%5Cnu%20+2%7D%7B3%5Cnu(%5Cnu+1)%7D%7D.%0A"> Let’s look at the difference.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb5-1">dist_ex <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(nu) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(nu<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-2</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb5-2">                      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lgamma</span>((nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lgamma</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> </span>
<span id="cb5-3">                      (nu <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">digamma</span>((nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span></span>
<span id="cb5-4">                                   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">digamma</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)))</span>
<span id="cb5-5">dist_ap <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> \(nu) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>((nu<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)))</span>
<span id="cb5-6"></span>
<span id="cb5-7">lambda <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb5-8">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nu =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>),</span>
<span id="cb5-9">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">exact =</span> lambda <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(nu<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">trigamma</span>((nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">trigamma</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dist_ex</span>(nu)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>lambda<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dist_ex</span>(nu)),</span>
<span id="cb5-10">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">approx =</span> lambda <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>nu <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(nu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(nu<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-2</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>lambda<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dist_ap</span>(nu))</span>
<span id="cb5-11">       ) </span>
<span id="cb5-12">dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> nu, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> exact)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb5-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> approx), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb6-1">dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> nu, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> exact <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> approx)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4_files/figure-html/unnamed-chunk-4-2.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>The approximate prior isn’t so good for <img src="https://latex.codecogs.com/png.latex?%5Cnu"> near 2. In the original paper, the distance was tabulated for <img src="https://latex.codecogs.com/png.latex?%5Cnu%20%3C%209"> and a different high-precision asymptotic expansion was given for <img src="https://latex.codecogs.com/png.latex?%5Cnu%3E9">.</p>
<p>In the <a href="https://projecteuclid.org/journals/statistical-science/volume-32/issue-1/Penalising-Model-Component-Complexity--A-Principled-Practical-Approach-to/10.1214/16-STS576.full">original paper</a>, we also plotted some common priors for the degrees of freedom on the distance scale to show just how informative flat-ish priors on <img src="https://latex.codecogs.com/png.latex?%5Cnu"> can be! Note that the wider the uniform prior on <img src="https://latex.codecogs.com/png.latex?%5Cnu"> is, the more informative it is on the distance scale.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-08-29-priors4/student_t.png" class="img-fluid figure-img"></p>
<figcaption>(Left) Exponential priors on <img src="https://latex.codecogs.com/png.latex?%5Cnu"> shown on the distance scale, from right to left the mean of the prior increases (5, 10, 20). (Right) <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BUniform%7D%5B2,%20M%5D"> priors on <img src="https://latex.codecogs.com/png.latex?%5Cnu"> shown on the distance scale. From left to right <img src="https://latex.codecogs.com/png.latex?M"> increases (20, 50, 100).</figcaption>
</figure>
</div>
</div>
<div id="exm-gaussian-3" class="theorem example">
<p><span class="theorem-title"><strong>Example 9 (Variance of a Gaussian random effect (Continued))</strong></span> This is the easy one because the distance is equal to the standard deviation! The PC prior for the standard deviation of a Gaussian distribution is an exponential prior <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Csigma)%20=%20%5Clambda%20e%5E%7B-%5Clambda%20%5Csigma%7D.%0A"> More generally, if <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20N(0,%20%5Csigma%5E2%20R)"> is a multivariate normal distribution, then the PC prior for <img src="https://latex.codecogs.com/png.latex?%5Csigma"> is still <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Csigma)%20=%20%5Clambda%20e%5E%7B-%5Clambda%20%5Csigma%7D.%0A"> The corresponding prior on <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2"> is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Csigma%5E2)%20=%20%5Cfrac%7B%5Clambda%7D%7B2%5Csqrt%7B%5Csigma%5E2%7D%7De%5E%7B-%5Clambda%5Csqrt%7B%5Csigma%5E2%7D%7D.%0A"> Sometimes, for instance if you’re converting a model from BUGS or you’re looking at the smoothing parameter of a smoothing spline, you might specify your normal distribution in terms of the precision, which is the inverse of the variance. If <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20N(0,%20%5Cgamma%5E%7B-1%7DQ%5E%7B-1%7D)">, then the corresponding PC prior (using the change of variables <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%20%5Csigma%5E%7B-2%7D">) is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cgamma)%20=%20%5Cfrac%7B%5Clambda%7D%7B2%7D%5Cgamma%5E%7B-3/2%7D%20e%5E%7B-%5Clambda%20%5Cgamma%5E%7B-1/2%7D%7D.%0A"></p>
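<p>The change of variables to the precision is easy to check by simulation. Here is a minimal sketch (in Python rather than the R used above; the rate and threshold are arbitrary illustrative values): if the standard deviation has an exponential prior with rate lambda, the precision gamma = sigma^(-2) should have tail probability P(gamma &gt; g) = 1 - exp(-lambda / sqrt(g)).</p>

```python
import numpy as np

# Sample sigma from its exponential PC prior, transform to the precision,
# and compare the empirical tail probability to the analytic one implied
# by the change of variables gamma = sigma^(-2).
rng = np.random.default_rng(42)
lam = 2.0                                            # illustrative rate
sigma = rng.exponential(scale=1 / lam, size=200_000) # sigma ~ Exp(rate = lam)
gamma = sigma ** -2                                  # precision

g = 4.0                                              # illustrative threshold
empirical = (gamma > g).mean()
exact = 1 - np.exp(-lam / np.sqrt(g))                # = P(sigma < 1/sqrt(g))
```

<p>The two tail probabilities agree to Monte Carlo error, which is a cheap way to catch sign or Jacobian mistakes in this sort of derivation.</p>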
<p>This case was explored extensively in the context of structured additive regression models (think GAMs but moreso) by <a href="https://projecteuclid.org/journals/bayesian-analysis/volume-11/issue-4/Scale-Dependent-Priors-for-Variance-Parameters-in-Structured-Additive-Distributional/10.1214/15-BA983.full">Klein and Kneib</a>, who found that the choice of exponential prior on the distance scale gave more consistent performance than either a half-normal or a half-Cauchy distribution.</p>
</div>
</section>
<section id="closing-the-door-how-to-choose-lambda" class="level2">
<h2 class="anchored" data-anchor-id="closing-the-door-how-to-choose-lambda">Closing the door: How to choose <img src="https://latex.codecogs.com/png.latex?%5Clambda"></h2>
<p>The big unanswered question is: how do we choose <img src="https://latex.codecogs.com/png.latex?%5Clambda">? The scaling of a prior distribution is <em>vital</em> to its success, so this is an important question.</p>
<p>And I will just say this: work it out your damn self.</p>
<p>The thing about prior distributions that shamelessly include information is that, at some point, you need to include<sup>36</sup> some information. And there is no way for anyone other than the data analyst to know what information to include.</p>
<p>But I can outline a general procedure.</p>
<p>Imagine that for your flexibility parameter <img src="https://latex.codecogs.com/png.latex?%5Cxi"> you have some interpretable transformation of it <img src="https://latex.codecogs.com/png.latex?Q(%5Cxi)">. For instance if <img src="https://latex.codecogs.com/png.latex?%5Cxi%20=%20%5Csigma%5E2">, then a good choice for <img src="https://latex.codecogs.com/png.latex?Q(%5Ccdot)"> would be <img src="https://latex.codecogs.com/png.latex?Q(%5Csigma%5E2)=%5Csigma">. This is because standard deviations are on the same scale as the observations<sup>37</sup>, and we have intuition about what happens one standard deviation from the mean.</p>
<p>Problem-specific information can then help us set a natural scale for <img src="https://latex.codecogs.com/png.latex?Q(%5Cxi)">. We do this by choosing <img src="https://latex.codecogs.com/png.latex?%5Clambda"> so that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(Q(%5Cxi)%20%3E%20U)%20=%20%5Calpha%0A"> for some <img src="https://latex.codecogs.com/png.latex?U">, which we would consider large<sup>38</sup> for our problem, and some <img src="https://latex.codecogs.com/png.latex?0%3C%5Calpha%3C1">.</p>
<p>From the properties of the exponential distribution, we can satisfy this by choosing <img src="https://latex.codecogs.com/png.latex?%0A%5Clambda%20=%20-%20%5Cfrac%7B%5Clog(%5Calpha)%7D%7Bd%5E%7B-1%7D(Q%5E%7B-1%7D(U))%7D.%0A"> This can be found numerically if it needs to be.</p>
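<p>A minimal sketch of this calibration (in Python for illustration, assuming the case where <code>d^{-1}(Q^{-1}(U)) = U</code>, as it is for a standard deviation; <code>U</code> and <code>alpha</code> are made-up values):</p>

```python
import numpy as np

# Choose lambda so that Pr(sigma > U) = alpha under the exponential
# PC prior, then verify the calibration by simulation.
U, alpha = 3.0, 0.05
lam = -np.log(alpha) / U                             # closed form

rng = np.random.default_rng(1)
sigma = rng.exponential(scale=1 / lam, size=500_000)
tail = (sigma > U).mean()                            # should be close to alpha
```

<p>When <code>d^{-1}(Q^{-1}(U))</code> has no closed form, the same equation can be handed to a one-dimensional root finder instead.</p>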
<p>The simplest case is the standard deviation of the normal distribution, because in this case <img src="https://latex.codecogs.com/png.latex?Q(%5Csigma)%20=%20%5Csigma"> and <img src="https://latex.codecogs.com/png.latex?d%5E%7B-1%7D(Q%5E%7B-1%7D(U))%20=%20U">. In general, if <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20N(0,%20%5Csigma%5E2%20R)"> and <img src="https://latex.codecogs.com/png.latex?R"> is not a correlation matrix, you should take into account the diagonal of <img src="https://latex.codecogs.com/png.latex?R"> when choosing <img src="https://latex.codecogs.com/png.latex?Q">. For instance, choosing <img src="https://latex.codecogs.com/png.latex?Q"> to be the geometric mean<sup>39</sup> of the marginal variances of the <img src="https://latex.codecogs.com/png.latex?u_i"> is a good idea!</p>
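<p>Concretely, the geometric-mean summary looks like this (a Python sketch with a made-up 3&#215;3 matrix <code>R</code>, chosen only so the arithmetic is easy to eyeball):</p>

```python
import numpy as np

# When R is not a correlation matrix, summarise its diagonal (the marginal
# variances) with a geometric mean -- averaging on the natural scale -- to
# get an interpretable quantity Q on which to set the scale U.
R = np.array([[4.0, 0.5, 0.00],
              [0.5, 1.0, 0.20],
              [0.0, 0.2, 0.25]])
geo_mean_var = np.exp(np.mean(np.log(np.diag(R))))  # geometric mean of 4, 1, 0.25
typical_sd = np.sqrt(geo_mean_var)                   # back on the sigma scale
```

<p>With these values the marginal variances multiply to 1, so the geometric mean is exactly 1 even though the arithmetic mean would be pulled up by the large first entry.</p>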
<p>When a model has more than one component, or a component has more than one flexibility parameter, it can be the case that <img src="https://latex.codecogs.com/png.latex?Q(%5Ccdot)"> depends on multiple parameters. For instance, if I hadn’t reparameterised the Student-t distribution to have variance independent of <img src="https://latex.codecogs.com/png.latex?%5Cnu">, a PC prior on <img src="https://latex.codecogs.com/png.latex?%5Csigma"> would have a quantity of interest that depends on <img src="https://latex.codecogs.com/png.latex?%5Cnu">. We will also see this if I ever get around to writing about priors for Gaussian processes.</p>
</section>
<section id="the-dream-pc-priors-in-practice" class="level2">
<h2 class="anchored" data-anchor-id="the-dream-pc-priors-in-practice">The Dream: PC priors in practice</h2>
<p>Thus we can put together a PC prior as the unique prior that satisfies the following four principles:</p>
<ol type="1">
<li><p>Occam’s razor: We have a base model that represents simplicity and we prefer our base model.</p></li>
<li><p>Measuring complexity: We define the prior using the square root of the KL divergence between the base model and the more flexible model. The square root ensures that the divergence is on a similar scale to a distance, but we maintain the asymmetry of the divergence as a feature (not a bug).</p></li>
<li><p>Constant penalisation: We use an exponential prior on the distance scale to ensure that our prior mass decreases evenly as we move farther away from the base model.</p></li>
<li><p>User-defined scaling: We need the user to specify a quantity of interest <img src="https://latex.codecogs.com/png.latex?Q(%5Cxi)"> and a scale <img src="https://latex.codecogs.com/png.latex?U">. We choose the scaling of the prior so that <img src="https://latex.codecogs.com/png.latex?%5CPr(Q(%5Cxi)%20%3E%20U)%20=%20%5Calpha">. This ensures that when we move to a new context, we are able to modify the prior by using the relevant information about <img src="https://latex.codecogs.com/png.latex?Q(%5Cxi)">.</p></li>
</ol>
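<p>The second principle is easy to see in numbers. A small Python sketch (the two standard deviations are arbitrary illustrative values): for zero-mean Gaussians the KL divergence has the closed form log(s_base/s_flex) + s_flex^2/(2 s_base^2) - 1/2, and the asymmetry survives the square root.</p>

```python
import numpy as np

# Distance from the base model: d = sqrt(2 * KL(flexible || base)),
# computed both ways round to show that the asymmetry is preserved.
s_base, s_flex = 1.0, 2.0
kl = np.log(s_base / s_flex) + s_flex**2 / (2 * s_base**2) - 0.5
d = np.sqrt(2 * kl)

kl_rev = np.log(s_flex / s_base) + s_base**2 / (2 * s_flex**2) - 0.5
d_rev = np.sqrt(2 * kl_rev)   # different from d: over-flexibility costs more
```

<p>The flexible-relative-to-base distance is larger than its reverse, which is exactly the direction of penalisation we want.</p>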
<p>These four principles define a PC prior. I think the value of laying them out explicitly is that users and critics can clearly and cleanly identify if these principles are relevant to their problem and, if they are, they can implement them. Furthermore, if you need to modify the principles (say by choosing a different distance measure), there is a clear way to do that.</p>
<p>I’ve come to the end of my energy for this blog post, so I’m going to try to wrap it up. I will write more on the topic later, but for now there are a couple of things I want to say.</p>
<p>These priors can seem quite complex, but I assure you that they are a) useful, b) used, and c) not too terrible in practice. Why? Well, fundamentally because you usually don’t have to derive them yourself. Moreover, a lot of that complexity is the price we pay for dealing with densities. We think this is worth it, and the lesson that the parameterisation you are given may not be the correct parameterisation to use when specifying your prior is an important one!</p>
<p>The <a href="https://projecteuclid.org/journals/statistical-science/volume-32/issue-1/Penalising-Model-Component-Complexity--A-Principled-Practical-Approach-to/10.1214/16-STS576.full">original paper</a> contains a bunch of other examples. The paper was discussed and we wrote a <a href="https://projecteuclid.org/journals/statistical-science/volume-32/issue-1/You-Just-Keep-on-Pushing-My-Love-over-the-Borderline/10.1214/17-STS576REJ.full">rejoinder</a><sup>40</sup>, which contains an out-of-date list of other PC priors people have derived. If you are interested in some other people’s views of this idea, a good place to start is <a href="https://projecteuclid.org/journals/statistical-science/volume-32/issue-1">the discussion of the original paper</a>.</p>
<p>There are also PC priors for <a href="https://arxiv.org/abs/1503.00256">Gaussian Processes</a>, <a href="https://arxiv.org/abs/1601.01180">disease mapping models</a>, <a href="https://arxiv.org/abs/1608.08941">AR(p) processes</a>, <a href="https://arxiv.org/abs/1902.00242">variance parameters in multilevel models</a>, and many more applications.</p>
<p>PC priors are all over the <a href="https://r-inla.org">INLA</a> software package and its documentation contains a bunch more examples.</p>
<p>Try them out. They’ll make you happy.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I’ve not turned on my computer for six weeks and tbh I finished 3 games and I’m caught up on TV and the weather is shite.↩︎</p></li>
<li id="fn2"><p>“But what about sparse matrices?!” exactly 3 people ask. I’ll get back to them. But this is what I’m feeling today.↩︎</p></li>
<li id="fn3"><p>I am told my Mercury is in Libra and truly I am not living that with those posts. Maybe Mercury was in Gatorade when I wrote them. So if we can’t be balanced at least let’s like things.↩︎</p></li>
<li id="fn4"><p>Our weapons of ass destruction that lives in Canada?↩︎</p></li>
<li id="fn5"><p>Negative binomial parameterised by mean and overdispersion so that its mean is <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and the variance is <img src="https://latex.codecogs.com/png.latex?%5Cmu(1+%5Calpha%20%5Cmu)"> because we are not flipping fucking coins here↩︎</p></li>
<li id="fn6"><p>Hello and welcome to Statistics for Stupid Children. My name is Daniel and I will be your host today.↩︎</p></li>
<li id="fn7"><p>If we didn’t have stupid children we’d never get dumb adults and then who would fuck me? You? You don’t have that sort of time. You’ve got a mortgage to service and interest rates are going up. You’ve got your Warhammer collection and it is simply not going to paint itself. You’ve been meaning to learn how to cook Thai food. You simply do not have the time. (I’m on SSRIs so it’s never clear what will come first: the inevitable decay and death of you and your children and your children’s children; the interest, eventual disinterest, and inevitable death of the family archivist from the far future who digs up your name from the digital graveyard; the death of the final person who will ever think of you, thereby removing you from the mortal realm entirely; the death of the universe; or me. Fucking me is a real time commitment.)↩︎</p></li>
<li id="fn8"><p>Gamma is parameterised by shape and rate, so <img src="https://latex.codecogs.com/png.latex?u_i"> has mean 1 and variance <img src="https://latex.codecogs.com/png.latex?%5Calpha">.↩︎</p></li>
<li id="fn9"><p>integrate↩︎</p></li>
<li id="fn10"><p>Sometimes, people still refer to these as <em>hyperparameters</em> and put priors on them, which would clarify things, but like everything in statistics there’s no real agreed upon usage. Because why would anyone want that?↩︎</p></li>
<li id="fn11"><p>somehow↩︎</p></li>
<li id="fn12"><p>location parameter↩︎</p></li>
<li id="fn13"><p>This is critical: we <em>do not know</em> <img src="https://latex.codecogs.com/png.latex?%5Cnu"> so the only way we can put a sensible prior on the scaling parameter is if we disentangle the role of these two parameters!↩︎</p></li>
<li id="fn14"><p>In fact, if my model estimated the data-level variance to be nearly zero I would assume I’ve fucked something up elsewhere and my model is either over-fitting or I have a redundancy in my model (like if <img src="https://latex.codecogs.com/png.latex?J%20=%20n">).↩︎</p></li>
<li id="fn15"><p>There are some mathematical peculiarities that we will run into later when the base model is singular. But they’re not too bad.↩︎</p></li>
<li id="fn16"><p>The Arianist heresy is that God, Jesus, and the Holy Spirit are three separate beings rather than consubstantial. It’s the reason for that bit of the Nicene. The statistical version most commonly occurs when you consider your model for your data conditional on the parameters (your likelihood) and your model for the parameters (your prior) as separate objects. This can lead to really dumb priors and bad inferences.↩︎</p></li>
<li id="fn17"><p>Complaining that a prior is adding information is like someone complaining to you that his boyfriend has stopped fucking him and you subsequently discovering that this is because his boyfriend died a few weeks ago. Like I’m sorry Jonathan, I know even the sight of a traffic cone sets your bussy a-quiverin’, but there really are bigger concerns and I’m gonna need you to focus.↩︎</p></li>
<li id="fn18"><p>In this story, the bigger concerns are things like misspecification, incorrect assumptions, data problems etc etc, the traffic cone is an unbiased estimator, Jonathan is our stand in for a generic data analyst, and Jonathan’s bussy is said data scientist’s bussy.↩︎</p></li>
<li id="fn19"><p>Yes, I know that there are problems with giving my generic data analyst a male name. Did I carefully think through the gender and power dynamics in my bussy simile? I think the answer to that is obvious.↩︎</p></li>
<li id="fn20"><p>We use priors for the same reason that other people use penalties: we don’t want to go into a weird corner of our model space <em>unless</em> our data explicitly drags us there↩︎</p></li>
<li id="fn21"><p>This is a bit technical. When a model is over-parameterised, it’s not always possible to recover all of the parameters. So we ideally want to make sure that if there are a bunch of asymptotically equivalent parameters, our prior operates sensibly on that set. An example of this will come in a future post where I’ll talk about priors for the parameters of a Gaussian process.↩︎</p></li>
<li id="fn22"><p>That Arianism thing creeping in again!↩︎</p></li>
<li id="fn23"><p>There are examples of theoretically motivated priors where it’s wildly expensive to compute their densities. We will see one in a later post about GPs.↩︎</p></li>
<li id="fn24"><p>Sure, Jan.&nbsp;Of course we want that. But we believed that it was important to include this in a list of desiderata because we <em>never</em> want to say “our prior has motivation X and therefore it is good”. It is not enough to be pure, you actually have to work.↩︎</p></li>
<li id="fn25"><p>What do I mean by near? Read on McDuff.↩︎</p></li>
<li id="fn26"><p>Think of it as a P-spline if you must. The important thing is that the weights of the basis functions are jointly normal with mean zero and precision matrix <img src="https://latex.codecogs.com/png.latex?%5Clambda%20Q">.↩︎</p></li>
<li id="fn27"><p>Given the knots, which are fixed↩︎</p></li>
<li id="fn28"><p>I might talk about <a href="https://arxiv.org/abs/2105.09712">more advanced solutions</a> at some point.↩︎</p></li>
<li id="fn29"><p>Strictly, how many bits we would need.↩︎</p></li>
<li id="fn30"><p>The largest absolute difference between the probability that an event <img src="https://latex.codecogs.com/png.latex?A"> happens under <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g">.↩︎</p></li>
<li id="fn31"><p>When performing the battered sav, it’s important to not speed up too quickly lest you over-batter.↩︎</p></li>
<li id="fn32"><p>It also might not. I don’t care to work it out.↩︎</p></li>
<li id="fn33"><p>The “easy” way to get this is to use the fact that the Gamma is in the exponential family and use the general formula for KL divergences in exponential families. The easier way is to look it up on Wikipedia↩︎</p></li>
<li id="fn34"><p>Using <a href="https://functions.wolfram.com/GammaBetaErf/LogGamma/06/03/">asymptotic expansions</a> for the log of a Gamma function at infinity↩︎</p></li>
<li id="fn35"><p>I’ll be dead before I declare that something is an approximation without bloody checking how good it is.↩︎</p></li>
<li id="fn36"><p>We have already included information that <img src="https://latex.codecogs.com/png.latex?%5Cxi"> is a flexibility parameter with base model <img src="https://latex.codecogs.com/png.latex?%5Cxi_%5Ctext%7Bbase%7D">, but that is model-specific information. Now we move on to <em>problem</em> specific information.↩︎</p></li>
<li id="fn37"><p>They have the same units.↩︎</p></li>
<li id="fn38"><p>The same thing happens if we want a particular quantity not to be too small: just swap the signs.↩︎</p></li>
<li id="fn39"><p>Always average on the natural scale. For non-negative parameters geometric means make a lot more sense than arithmetic means!↩︎</p></li>
<li id="fn40"><p>Homosexually titled <em>You just keep on pushing my love over the borderline: a rejoinder</em>. I’m still not sure how I got away with that.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Priors Part 4: {Specifying} Priors That Appropriately
    Penalise Complexity},
  date = {2022-09-03},
  url = {https://dansblog.netlify.app/2022-08-29-priors4/2022-08-29-priors4.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Priors Part 4: Specifying Priors That
Appropriately Penalise Complexity.”</span> September 3, 2022. <a href="https://dansblog.netlify.app/2022-08-29-priors4/2022-08-29-priors4.html">https://dansblog.netlify.app/2022-08-29-priors4/2022-08-29-priors4.html</a>.
</div></div></section></div> ]]></description>
  <category>Prior distributions</category>
  <category>fundamentals</category>
  <category>PC priors</category>
  <guid>https://dansblog.netlify.app/posts/2022-08-29-priors4/priors4.html</guid>
  <pubDate>Fri, 02 Sep 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-08-29-priors4/tina.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Tail stabilization of importance sampling estimators: A bit of theory</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-06-03-that-psis-proof/that-psis-proof.html</link>
  <description><![CDATA[ 





<p>Imagine you have a target probability distribution <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta)"> and you want to estimate the expectation <img src="https://latex.codecogs.com/png.latex?I_h%20=%20%5Cint%20h(%5Ctheta)%20p(%5Ctheta)%5C,d%5Ctheta">. That’s lovely and everything, but if it was easy none of us would have jobs. High-dimensional quadrature is a pain in the arse.</p>
<p>A very simple way to get a decent estimate of <img src="https://latex.codecogs.com/png.latex?I_h"> is to use <em>importance sampling</em>, that is, taking draws <img src="https://latex.codecogs.com/png.latex?%5Ctheta_s">, <img src="https://latex.codecogs.com/png.latex?s%20=%201,%5Cldots,%20S"> from some proposal distribution <img src="https://latex.codecogs.com/png.latex?%5Ctheta_s%20%5Csim%20g(%5Ctheta)">. Then, noting that <img src="https://latex.codecogs.com/png.latex?%0AI_h%20=%20%5Cint%20h(%5Ctheta)%20p%20(%5Ctheta)%5C,d%5Ctheta%20=%20%5Cint%20h(%5Ctheta)%20%5Cunderbrace%7B%5Cfrac%7Bp(%5Ctheta)%7D%7Bg(%5Ctheta)%7D%7D_%7Br(%5Ctheta)%7Dg(%5Ctheta)%5C,d%5Ctheta,%0A"> we can use Monte Carlo to estimate the second integral. This leads to the importance sampling estimator <img src="https://latex.codecogs.com/png.latex?%0AI_h%5ES%20=%20%5Cfrac%7B1%7D%7BS%7D%5Csum_%7Bs=1%7D%5ES%20h(%5Ctheta_s)%20r(%5Ctheta_s).%0A"></p>
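<p>Mechanically, that estimator is just a weighted Monte Carlo average. Here is a minimal Python sketch (the target, proposal, and choice of h are toy assumptions of mine, not from the post, whose own code is in R):</p>

```python
# A toy sketch of plain importance sampling (my example, not from the post):
# estimate E_p[h(theta)] for target p = N(0, 1) with h(theta) = theta^2
# (true value 1), taking draws from the wider proposal g = N(0, 2^2).
import math
import random

random.seed(1)


def p_density(x):
    # target density: standard normal
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def g_density(x):
    # proposal density: normal with standard deviation 2
    return math.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * math.sqrt(2.0 * math.pi))


def importance_sampling(h, S=100_000):
    total = 0.0
    for _ in range(S):
        theta = random.gauss(0.0, 2.0)           # theta_s ~ g
        r = p_density(theta) / g_density(theta)  # importance ratio r(theta_s)
        total += h(theta) * r
    return total / S


estimate = importance_sampling(lambda t: t * t)
```

<p>Because the proposal here is wider than the target, the ratios are bounded and the estimator behaves; the whole point of the post is what happens when they are not.</p>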
<p>This all seems marvellous, but there is a problem. Even though <img src="https://latex.codecogs.com/png.latex?h"> is probably a very pleasant function and <img src="https://latex.codecogs.com/png.latex?g"> is a nice friendly distribution, <img src="https://latex.codecogs.com/png.latex?r(%5Ctheta)"> can be an absolute beast. Why? Well it’s<sup>1</sup> the ratio of two densities and there is no guarantee that the ratio of two nice functions is itself a nice function. In particular, if the bulk of the distributions <img src="https://latex.codecogs.com/png.latex?p"> and <img src="https://latex.codecogs.com/png.latex?g"> are in different places, you’ll end up with the situation where for most draws <img src="https://latex.codecogs.com/png.latex?r(%5Ctheta_s)"> is very small<sup>2</sup> and a few will be HUGE<sup>3</sup>.</p>
<p>This will lead to an extremely unstable estimator.</p>
<p>It is pretty well known that the raw importance sampler <img src="https://latex.codecogs.com/png.latex?I_h%5ES"> will behave nicely (that is, it will be unbiased with finite variance) precisely when the distribution of <img src="https://latex.codecogs.com/png.latex?r_s%20=%20r(%5Ctheta_s)"> has finite variance.</p>
<p>Elementary treatments stop there, but they miss two very big problems. The most obvious one is that it’s basically impossible to check if the variance of <img src="https://latex.codecogs.com/png.latex?r_s"> is finite. A second, much larger but much more subtle problem is that the variance can be finite but <em>massive</em>. This is probably the most common case in high dimensions. MacKay has an excellent example where the importance ratios are <em>bounded</em>, but that bound is so large that it is, for all intents and purposes, infinite.</p>
<p>All of which is to say that importance sampling doesn’t work unless you work on it.</p>
<section id="truncated-importance-sampling" class="level2">
<h2 class="anchored" data-anchor-id="truncated-importance-sampling">Truncated importance sampling</h2>
<p>If the problem is the fucking ratios then by gum we will fix the fucking ratios. Or so the saying goes.</p>
<p>The trick turns out to be modifying the largest ratios enough that we stabilise the variance, but not so much as to overly bias the estimate.</p>
<p>The first version of this was <a href="https://www.jstor.org/stable/27594308?seq=1">truncated importance sampling</a> (TIS), which selects a threshold <img src="https://latex.codecogs.com/png.latex?T"> and estimates the expectation as <img src="https://latex.codecogs.com/png.latex?%0AI_%5Ctext%7BTIS%7D%5ES%20=%20%5Cfrac%7B1%7D%7BS%7D%5Csum_%7Bs=%201%7D%5ES%20h(%5Ctheta_s)%20%5Cmin%5C%7Br(%5Ctheta_s),%20T%5C%7D.%0A"> It’s pretty obvious that <img src="https://latex.codecogs.com/png.latex?I%5ES_%5Ctext%7BTIS%7D"> has finite variance for any fixed <img src="https://latex.codecogs.com/png.latex?T">, but we should be pretty worried about the bias. Unsurprisingly, there is going to be a trade-off between the variance and the bias. So let’s explore that.</p>
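<p>In code, TIS is a one-line change to the raw estimator: cap each ratio at the threshold before averaging. Another toy Python sketch (again, the densities and the threshold sequence are my own assumptions, not the post’s):</p>

```python
# A toy sketch of TIS (my own densities, not the post's): truncate each
# importance ratio at T before averaging. Taking T = sqrt(S) satisfies the
# conditions derived below: T_S -> infinity and T_S = o(S).
import math
import random

random.seed(2)


def normal_density(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def tis(h, S=50_000):
    # target p = N(0, 1.5^2), proposal g = N(0, 1): the ratios are unbounded
    T = math.sqrt(S)  # truncation level T_S
    total = 0.0
    for _ in range(S):
        theta = random.gauss(0.0, 1.0)
        r = normal_density(theta, 1.5) / normal_density(theta, 1.0)
        total += h(theta) * min(r, T)  # modify only the largest ratios
    return total / S


estimate = tis(lambda t: t * t)  # true value E_p[theta^2] = 2.25
```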
<section id="the-bias-of-tis" class="level3">
<h3 class="anchored" data-anchor-id="the-bias-of-tis">The bias of TIS</h3>
<p>To get an expression for the bias, first let us write <img src="https://latex.codecogs.com/png.latex?r_s%20=%20r(%5Ctheta_s)"> and <img src="https://latex.codecogs.com/png.latex?h_s%20=%20h(%5Ctheta_s)"> for <img src="https://latex.codecogs.com/png.latex?%5Ctheta_s%20%5Csim%20g">. Occasionally we will talk about the joint distribution of <img src="https://latex.codecogs.com/png.latex?(r_s,h_s)%20%5Csim%20(R,H)">. Sometimes we will also need to use the indicator variables <img src="https://latex.codecogs.com/png.latex?z_s%20=%201_%7Br_s%20%5Cleq%20T%7D">.</p>
<p>Then, we can write<sup>4</sup> <img src="https://latex.codecogs.com/png.latex?%0AI%20=%20%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%20%5Cleq%20T)%20%5CPr(R%20%5Cleq%20T)%20+%20%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%20%3E%20T)%20%5CPr(R%20%3E%20T).%0A"></p>
<p>How does this relate to TIS? Well. Let <img src="https://latex.codecogs.com/png.latex?M%20=%20%5Csum_%7Bs=1%7D%5ES%20(1-z_s)"> be the random variable denoting the number of times <img src="https://latex.codecogs.com/png.latex?r_s%20%3E%20T">. Then, <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D(I_%5Ctext%7BTIS%7D%5ES)%20&amp;=%20%5Cmathbb%7BE%7D%5Cleft(%20%5Cfrac%7B1%7D%7BS%7D%5Csum_%7Bs=1%7D%5ESz_sh_sr_s%5Cright)%20%20+%20%5Cmathbb%7BE%7D%5Cleft(%20%5Cfrac%7BT%7D%7BS%7D%5Csum_%7Bs=1%7D%5ES(1-z_s)h_s%5Cright)%20%5C%5C%0A&amp;=%5Cmathbb%7BE%7D_M%5Cleft%5B%5Cfrac%7BS-M%7D%7BS%7D%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%20%5Cleq%20T)%20+%20%5Cfrac%7BMT%7D%7BS%7D%5Cmathbb%7BE%7D(H%20%5Cmid%20R%20%3E%20T)%5Cright%5D%20%5C%5C%0A&amp;=%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%20%5Cleq%20T)%20%5CPr(R%20%5Cleq%20T)%20+%20T%5Cmathbb%7BE%7D(H%20%5Cmid%20R%20%3E%20T)%20%5CPr(R%20%3E%20T).%0A%5Cend%7Balign*%7D"></p>
<p>Hence the bias in TIS is <img src="https://latex.codecogs.com/png.latex?%0AI%20-%20%5Cmathbb%7BE%7D(I_%5Ctext%7BTIS%7D%5ES)%20=%20%5Cmathbb%7BE%7D(H(R-T)%20%5Cmid%20R%20%3E%20T)%20%5CPr(R%20%3E%20T).%0A"></p>
<p>To be honest, this doesn’t look phenomenally interesting for fixed <img src="https://latex.codecogs.com/png.latex?T">, however if we let <img src="https://latex.codecogs.com/png.latex?T%20=%20T_S"> depend on the sample size then as long as <img src="https://latex.codecogs.com/png.latex?T_S%20%5Crightarrow%20%5Cinfty"> we get vanishing bias.</p>
<p>We can get more specific if we make an assumption about the tail of the importance ratios. In particular, we will assume that<sup>5</sup> <img src="https://latex.codecogs.com/png.latex?1-R(r)%20=%20%5CPr(R%20%3E%20r)%20=%20cr%5E%7B-1/k%7D(1+o(1))"> for some<sup>6</sup> <img src="https://latex.codecogs.com/png.latex?k%3C1">.</p>
<p>While it seems like this will only be useful for estimating <img src="https://latex.codecogs.com/png.latex?%5CPr(R%3ET)">, it turns out that under some mild<sup>7</sup> technical conditions, the conditional excess distribution function<sup>8</sup> <img src="https://latex.codecogs.com/png.latex?%0AR_T(y)%20=%20%5CPr(R%20-%20T%20%5Cleq%20y%20%5Cmid%20R%20%3E%20T)%20=%20%5Cfrac%7BR(T%20+%20y)%20-%20R(T)%7D%7B1-R(T)%7D,%0A"> is well approximated by a Generalised Pareto Distribution as <img src="https://latex.codecogs.com/png.latex?T%5Crightarrow%20%5Cinfty">. Or, in maths, as <img src="https://latex.codecogs.com/png.latex?T%5Crightarrow%20%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?%0AR_T(y)%20%5Crightarrow%20%5Cbegin%7Bcases%7D%201-%20%5Cleft(1%20+%20%5Cfrac%7Bky%7D%7B%5Csigma%7D%5Cright)%5E%7B-1/k%7D,%20%5Cquad%20&amp;%20k%20%5Cneq%200%20%5C%5C%0A1-%20%5Cmathrm%7Be%7D%5E%7B-y/%5Csigma%7D,%20%5Cquad%20&amp;k%20=%200,%0A%5Cend%7Bcases%7D%0A"> for some <img src="https://latex.codecogs.com/png.latex?%5Csigma%20%3E%200"> and <img src="https://latex.codecogs.com/png.latex?k%20%5Cin%20%5Cmathbb%7BR%7D">. The shape<sup>9</sup> parameter <img src="https://latex.codecogs.com/png.latex?k"> is very important for us, as it tells us how many moments the distribution has. In particular, if a distribution <img src="https://latex.codecogs.com/png.latex?X"> has shape parameter <img src="https://latex.codecogs.com/png.latex?k">, then <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%7CX%7C%5E%5Calpha%20%3C%20%5Cinfty,%20%5Cquad%20%5Cforall%20%5Calpha%20%3C%20%5Cfrac%7B1%7D%7Bk%7D.%0A"> We will focus exclusively on the case where <img src="https://latex.codecogs.com/png.latex?k%20%3C%201">. When <img src="https://latex.codecogs.com/png.latex?k%20%3C%201/2">, the distribution has finite variance.</p>
<p>If <img src="https://latex.codecogs.com/png.latex?1-%20R(r)%20=%20cr%5E%7B-1/k%7D(1+%20%20o(1))">, then the conditional exceedence function is <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AR_T(y)%20&amp;=%20%20%5Cfrac%7BcT%5E%7B-1/k%7D(1+%20%20o(1))%20-%20c(T+y)%5E%7B-1/k%7D(1+%20%20o(1))%7D%7BcT%5E%7B-1/k%7D(1+%20%20o(1)))%7D%20%5C%5C%0A&amp;=%20%5Cleft%5B1%20-%20%5Cleft(1%20+%20%5Cfrac%7By%7D%7BT%7D%5Cright)%5E%7B-1/k%7D%5Cright%5D(1%20+%20o(1)),%0A%5Cend%7Balign*%7D"> which suggests that as <img src="https://latex.codecogs.com/png.latex?T%5Crightarrow%20%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?R_T"> converges to a generalised Pareto distribution with shape parameter <img src="https://latex.codecogs.com/png.latex?k"> and scale parameter <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(T)">.</p>
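<p>As a quick sanity check (my own construction, not in the post): when the tail is <em>exactly</em> Pareto, so that the survival function is r to the power -1/k with c = 1, the conditional excess distribution is exactly generalised Pareto with shape k and scale sigma = kT, with no o(1) slop. A small Python check:</p>

```python
# Check (my construction, not the post's code): for an exact Pareto tail
# Pr(R > r) = r^(-1/k), the conditional excess cdf R_T equals the
# generalised Pareto cdf with shape k and scale sigma = k * T exactly.
def pareto_tail(r, k):
    # survival function Pr(R > r) = r^(-1/k), valid for r >= 1
    return r ** (-1.0 / k)


def conditional_excess_cdf(y, T, k):
    # R_T(y) = (R(T + y) - R(T)) / (1 - R(T)), with R the cdf of the ratios
    return (pareto_tail(T, k) - pareto_tail(T + y, k)) / pareto_tail(T, k)


def gpd_cdf(y, k, sigma):
    # generalised Pareto cdf for shape k != 0
    return 1.0 - (1.0 + k * y / sigma) ** (-1.0 / k)


k, T = 0.7, 50.0
max_gap = max(
    abs(conditional_excess_cdf(y, T, k) - gpd_cdf(y, k, k * T))
    for y in [0.0, 0.5, 1.0, 10.0, 100.0, 1000.0]
)
```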
<p>All of this work lets us approximate the distribution of <img src="https://latex.codecogs.com/png.latex?(R-T%20%5Cmid%20R%3ET%20)"> and use the formula for the mean of a generalised Pareto distribution. This gives us the estimate <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(R-%20T%20%5Cmid%20R%3ET)%20%5Capprox%20%5Cfrac%7BkT%7D%7B1-k%7D,%0A"> which estimates the bias when <img src="https://latex.codecogs.com/png.latex?h(%5Ctheta)"> is constant<sup>10</sup> as <img src="https://latex.codecogs.com/png.latex?%0AI%20-%20%5Cmathbb%7BE%7D(I_%5Ctext%7BTIS%7D%5ES)%20%5Capprox%20%5Cmathcal%7BO%7D%5Cleft(T%5E%7B1-1/k%7D%5Cright).%0A"></p>
<p>For what it’s worth, Ionides got the same result more directly in the TIS paper, but he wasn’t trying to do what I’m trying to do.</p>
</section>
<section id="the-variance-in-tis" class="level3">
<h3 class="anchored" data-anchor-id="the-variance-in-tis">The variance in TIS</h3>
<p>The variance is a little bit more annoying. We want it to go to zero.</p>
<p>As before, we condition on <img src="https://latex.codecogs.com/png.latex?z_s"> (or, equivalently, <img src="https://latex.codecogs.com/png.latex?M">) and then use the law of total variance. We know from the bias calculation that <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(I_%5Ctext%7BTIS%7D%5ES%20%5Cmid%20M)%20=%5Cfrac%7BS-M%7D%7BS%7D%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%20%5Cleq%20T)%20+%20%5Cfrac%7BTM%7D%7BS%7D%5Cmathbb%7BE%7D(H%20%5Cmid%20R%3ET).%0A"></p>
<p>A similarly quick calculation tells us that <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BV%7D(I_%5Ctext%7BTIS%7D%5ES%20%5Cmid%20M)%20=%20%5Cfrac%7BS-M%7D%7BS%5E2%7D%5Cmathbb%7BV%7D(HR%20%5Cmid%20R%20%5Cleq%20T)%20+%5Cfrac%7BMT%5E2%7D%7BS%5E2%7D%5Cmathbb%7BV%7D(H%20%5Cmid%20R%3ET).%0A"> To close it out, we recall that <img src="https://latex.codecogs.com/png.latex?M"> is the sum of Bernoulli random variables so <img src="https://latex.codecogs.com/png.latex?%0AM%20%5Csim%20%5Ctext%7BBinomial%7D(S,%20%5CPr(R%20%3E%20T)).%0A"></p>
<p>With this, we can get an expression for the unconditional variance. To simplify the expression, let’s write <img src="https://latex.codecogs.com/png.latex?p_T%20=%20%5CPr(R%20%3E%20T)">. Then, <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BV%7D(I_%5Ctext%7BTIS%7D%5ES)%20&amp;=%5Cmathbb%7BE%7D_M%5Cmathbb%7BV%7D(I_%5Ctext%7BTIS%7D%5ES%20%5Cmid%20M)%20+%20%5Cmathbb%7BV%7D_M%5Cmathbb%7BE%7D(I_%5Ctext%7BTIS%7D%5ES%20%5Cmid%20M)%20%5C%5C%0A&amp;=%20S%5E%7B-1%7D(1-p_T)%5Cmathbb%7BV%7D(HR%20%5Cmid%20R%20%5Cleq%20T)%20+S%5E%7B-1%7DT%5E2p_T%5Cmathbb%7BV%7D(H%20%5Cmid%20R%3ET)%5C%5C%0A&amp;%5Cquad%20+%20S%5E%7B-1%7Dp_T(1-p_T)%5Cleft%5BT%5Cmathbb%7BE%7D(H%20%5Cmid%20R%3ET)%20-%20%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%20%5Cleq%20T)%5Cright%5D%5E2.%0A%5Cend%7Balign*%7D"></p>
<p>There are three terms in the variance. The first is clearly harmless: it goes to zero no matter how we choose <img src="https://latex.codecogs.com/png.latex?T_S">. Our problem terms are the second and the third, which both contain a factor of <img src="https://latex.codecogs.com/png.latex?T%5E2">. It turns out that choosing <img src="https://latex.codecogs.com/png.latex?T_S%20=%20o(S)"> is enough to tame them. To see this, we note that <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0ATp_T%5Cmathbb%7BV%7D(H%5Cmid%20R%3ET)%20&amp;%5Cleq%20Tp_T%5Cmathbb%7BE%7D(H%5E2%20%5Cmid%20R%3ET)%5C%5C%0A&amp;%5Cleq%20p_T%5Cmathbb%7BE%7D(H%5E2%20R%5Cmid%20R%3ET)%20%5C%5C%0A&amp;%5Cleq%20%5Cmathbb%7BE%7D(H%5E2%20R)%5C%5C%0A&amp;=%20%5Cint%20h(%5Ctheta)%5E2%20p(%5Ctheta)%5C,d%5Ctheta%20%3C%20%5Cinfty,%0A%5Cend%7Balign*%7D"> where the second inequality uses the fact that <img src="https://latex.codecogs.com/png.latex?R%3ET"> and the third comes from the law of total expectation. This bounds the second term by a constant multiple of <img src="https://latex.codecogs.com/png.latex?T/S">, which vanishes when <img src="https://latex.codecogs.com/png.latex?T_S%20=%20o(S)">. Expanding the square in the third term, Jensen’s inequality gives <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(H%20%5Cmid%20R%3ET)%5E2%20%5Cleq%20%5Cmathbb%7BE%7D(H%5E2%20%5Cmid%20R%3ET)">, so the same bound handles it too.</p>
<p>So the TIS estimator has vanishing bias and variance as long as the truncation <img src="https://latex.codecogs.com/png.latex?T_S%20%5Crightarrow%20%5Cinfty"> and <img src="https://latex.codecogs.com/png.latex?T_S%20=%20o(S)">. Once again, this is in the TIS paper, where it is proved in a much more compact way.</p>
</section>
<section id="asymptotic-properties" class="level3">
<h3 class="anchored" data-anchor-id="asymptotic-properties">Asymptotic properties</h3>
<p>It can also be useful to have an understanding of how wild the fluctuations <img src="https://latex.codecogs.com/png.latex?I%20-%20I_%5Ctext%7BTIS%7D%5ES"> are. For traditional importance sampling, we know that if <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(R%5E2)"> is finite, then the fluctuations are, asymptotically, normally distributed with mean zero. Non-asymptotic results were given by <a href="https://arxiv.org/abs/1511.01437">Chatterjee and Diaconis</a>; these hold even when the estimator has infinite variance.</p>
<p>For TIS, it’s pretty obvious that for fixed <img src="https://latex.codecogs.com/png.latex?T"> and <img src="https://latex.codecogs.com/png.latex?h%20%5Cgeq%200">, <img src="https://latex.codecogs.com/png.latex?I_%5Ctext%7BTIS%7D%5ES"> will be asymptotically normal (it is, after all, the sum of bounded random variables). For growing sequences <img src="https://latex.codecogs.com/png.latex?T_S"> it’s a tiny bit more involved: it is now a triangular array<sup>11</sup> rather than a sequence of random variables. But in the end very classical results tell us that for bounded<sup>12</sup> <img src="https://latex.codecogs.com/png.latex?h">, the fluctuations of the TIS estimator are asymptotically normal.</p>
<p>It’s worth saying that when <img src="https://latex.codecogs.com/png.latex?h(%5Ctheta)"> is unbounded, it <em>might</em> be necessary to truncate the product <img src="https://latex.codecogs.com/png.latex?h_ir_i"> rather than just <img src="https://latex.codecogs.com/png.latex?r_i">. This is especially relevant if <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(H%20%5Cmid%20R=r)"> grows rapidly with <img src="https://latex.codecogs.com/png.latex?r">. Personally, I can’t think of a case where this happens: <img src="https://latex.codecogs.com/png.latex?r(%5Ctheta)"> usually grows (super-)exponentially in <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> while <img src="https://latex.codecogs.com/png.latex?h(%5Ctheta)"> usually grows polynomially, which implies <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(H%20%5Cmid%20R=r)"> grows (poly-)logarithmically.</p>
<p>The other important edge case is that when <img src="https://latex.codecogs.com/png.latex?h(%5Ctheta)"> can be both positive and negative, it might be necessary to truncate <img src="https://latex.codecogs.com/png.latex?h_ir_i"> both above <em>and</em> below.</p>
</section>
</section>
<section id="winsorised-importance-sampling" class="level2">
<h2 class="anchored" data-anchor-id="winsorised-importance-sampling">Winsorised importance sampling</h2>
<p>TIS has lovely theoretical properties, but it’s a bit challenging to use in practice. The problem is, there’s really no practical guidance on how to choose the truncation sequence.</p>
<p>So let’s do this differently. What if, instead of specifying a threshold directly, we decided that the largest <img src="https://latex.codecogs.com/png.latex?M"> values are potentially problematic and should be modified? Recall that for TIS, the number of samples that exceeded the threshold, <img src="https://latex.codecogs.com/png.latex?M">, was random while the threshold was fixed. This is the opposite situation: the number of exceedances is fixed but the threshold is random.</p>
<p>The threshold is now the <img src="https://latex.codecogs.com/png.latex?M">th largest value of <img src="https://latex.codecogs.com/png.latex?r_s">. We denote this using order statistics notation: we re-order the sample so that <img src="https://latex.codecogs.com/png.latex?%0Ar_%7B1:S%7D%20%5Cleq%20r_%7B2:S%7D%5Cleq%20%5Cldots%20r_%7BS:S%7D.%0A"> With this notation, the threshold is <img src="https://latex.codecogs.com/png.latex?T%20=%20r_%7BS-M+1:S%7D"> and the Winsorized importance sampler (WIS) is <img src="https://latex.codecogs.com/png.latex?%0AI%5ES_%5Ctext%7BWIS%7D%20=%20%5Cfrac%7B1%7D%7BS%7D%5Csum_%7Bs%20=%201%7D%5E%7BS-M%7D%20h_%7Bs:S%7Dr_%7Bs:S%7D%20+%20%5Cfrac%7Br_%7BS-M+1:S%7D%7D%7BS%7D%5Csum_%7Bs=S-M+1%7D%5ES%20h_%7Bs:S%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?(r_%7Bs:S%7D,%20h_%7Bs:S%7D)"> are the <img src="https://latex.codecogs.com/png.latex?(r_s,%20h_s)"> pairs <em>ordered</em> so that <img src="https://latex.codecogs.com/png.latex?r_%7B1:S%7D%20%5Cleq%20r_%7B2:S%7D%5Cleq%20%5Ccdots%20%5Cleq%20r_%7BS:S%7D">. Note that <img src="https://latex.codecogs.com/png.latex?h_%7Bs:S%7D"> are not necessarily in increasing order: they are known as <em>concomitants</em> of <img src="https://latex.codecogs.com/png.latex?r_%7Bs:S%7D">, which is just a fancy way to say that they’re along for the ride. It’s <em>very</em> important that we reorder the <img src="https://latex.codecogs.com/png.latex?h_s"> when we reorder the <img src="https://latex.codecogs.com/png.latex?r_s">, otherwise we won’t preserve the joint distribution and we’ll end up with absolute rubbish.</p>
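<p>The concomitant bookkeeping is easy to get wrong, so here is a literal Python transcription of the formula above (mine, not the post’s; the post’s own experiments are in R):</p>

```python
# Sketch of the Winsorised importance sampler: sort the (r, h) pairs by r
# (so the h values travel along as concomitants), replace the M largest
# ratios by the threshold r_{S-M+1:S}, and average.
def wis(h_vals, r_vals, M):
    pairs = sorted(zip(r_vals, h_vals))            # ascending in r
    S = len(pairs)
    T = pairs[S - M][0]                            # threshold r_{S-M+1:S}
    bulk = sum(h * r for r, h in pairs[:S - M])    # untouched terms
    tail = T * sum(h for _, h in pairs[S - M:])    # M largest, weight T each
    return (bulk + tail) / S


# With constant h = 1 and ratios 1..5, M = 2: the threshold is 4 and the
# estimate is (1 + 2 + 3 + 4 + 4) / 5 = 2.8.
```

<p>Note that sorting the pairs, rather than the two vectors separately, is exactly what keeps the joint distribution of the concomitants intact.</p>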
<p>We can already see that this is both much nicer and much wilder than the TIS distribution. It is <em>convenient</em> that <img src="https://latex.codecogs.com/png.latex?M"> is no longer random! But what the hell are we going to do about those order statistics? Well, the answer is very much the same thing as before: condition on them and hope for the best.</p>
<p>Conditioned on the event<sup>13</sup> <img src="https://latex.codecogs.com/png.latex?%5C%7Br_%7BS-M+1:S%7D%20=%20T%5C%7D">, we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft(I_%5Ctext%7BWIS%7D%5ES%20%5Cmid%20r_%7BS-M+1:S%7D%20=%20T%5Cright)%20=%20%5Cleft(1%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright)%5Cmathbb%7BE%7D(RH%20%5Cmid%20R%20%3C%20T)%20+%20%5Cfrac%7BMT%7D%7BS%7D%20%5Cmathbb%7BE%7D(H%20%5Cmid%20R%20%5Cgeq%20T).%0A"> From this, we get that the bias, conditional on <img src="https://latex.codecogs.com/png.latex?r_%7BS-M+1:S%7D%20=%20T"> is <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Bmultline*%7D%0A%5Cleft%7CI%20-%20%5Cmathbb%7BE%7D%5Cleft(I_%5Ctext%7BWIS%7D%5ES%20%5Cmid%20r_%7BS-M+1:S%7D%20=%20T%5Cright)%5Cright%7C%20=%5Cleft%7C%5Cleft%5B%5CPr(R%20%3C%20T)%20-%20%5Cleft(1%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright)%5Cright%5D%5Cmathbb%7BE%7D(RH%20%5Cmid%20R%20%3C%20T)%20%5Cright.%5C%5C%0A%5Cleft.+%20%5Cleft%5B%5CPr(R%20%5Cgeq%20T)%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright%5D%20%5Cmathbb%7BE%7D(H(R%20-%20T)%20%5Cmid%20R%20%5Cgeq%20T)%5Cright%7C.%0A%5Cend%7Bmultline*%7D"></p>
<p>You should immediately notice that we are in quite a different situation from TIS, where only the tail contributed to the bias. By fixing <img src="https://latex.codecogs.com/png.latex?M"> and randomising the threshold, we have bias contributions from both the bulk (due, essentially, to a weighting error) and from the tail (due to both the weighting error and the truncation). This is going to require us to be a bit creative.</p>
<p>We could probably do something more subtle and clever here, but that is not my way. Instead, let’s use the triangle inequality to say <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%7C%5Cmathbb%7BE%7D(RH%20%5Cmid%20R%20%3C%20T)%5Cright%7C%20%5Cleq%20%5Cfrac%7B%5Cmathbb%7BE%7D(R%20%7CH%7C%201(R%3CT))%7D%7B%5CPr(R%20%3CT)%7D%20%5Cleq%20%5Cfrac%7B%5C%7Ch%5C%7C_%7BL%5E1(p)%7D%7D%7B%5CPr(R%20%20%3CT)%7D%0A"> and so the first term in the bias can be bounded if we can bound the relative error <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft%7C1%20-%20%5Cfrac%7B1-%20M/S%7D%7B%5CPr(R%20%3C%20r_%7BS-M+1:S%7D)%7D%5Cright%7C.%0A"></p>
<p>Now the more sensible among you will say <em><a href="https://www.youtube.com/watch?v=R-HryG35A2E">Daniel, No!</a> That’s a ratio! That’s going to be hard to bound</em>. And, of course, you are right. But here’s the thing: if <img src="https://latex.codecogs.com/png.latex?M"> is small relative to <img src="https://latex.codecogs.com/png.latex?S">, it is <em>tremendously</em> unlikely that <img src="https://latex.codecogs.com/png.latex?r_%7BS-M+1:S%7D"> is anywhere near zero. This is intuitively true, but also mathematically true.</p>
<p>To attack this expectation, we are going to look at a slightly different quantity that has the good grace of being non-negative.</p>
<div id="lem-lem1" class="theorem lemma">
<p><span class="theorem-title"><strong>Lemma 1</strong></span> Let <img src="https://latex.codecogs.com/png.latex?X_s">, <img src="https://latex.codecogs.com/png.latex?s=%201,%20%5Cldots%20S"> be an iid sample from <img src="https://latex.codecogs.com/png.latex?F_X">, let <img src="https://latex.codecogs.com/png.latex?1%5Cleq%20k%5Cleq%20S"> be an integer, and let <img src="https://latex.codecogs.com/png.latex?p%20%3E%200">. Then <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bp%7D%7BF_X(x_%7Bk:S%7D)%7D%20-p%20%5Cstackrel%7Bd%7D%7B=%7D%20%5Cfrac%7Bp(S-k+1)%7D%7Bk%7D%20%5Cmathcal%7BF%7D,%0A"> and <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1-p%7D%7B1-%20F_X(x_%7Bk:S%7D)%7D%20-%20(1-p)%20%5Cstackrel%7Bd%7D%7B=%7D%20%5Cfrac%7Bk(1-p)%7D%7BS-k+1%7D%5Cmathcal%7BF%7D%5E%7B-1%7D%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BF%7D"> is an F-distributed random variable with parameters <img src="https://latex.codecogs.com/png.latex?(2(S-k+1),%202k)">.</p>
</div>
<div class="proof">
<p><span class="proof-title"><em>Proof</em>. </span>For any <img src="https://latex.codecogs.com/png.latex?t%5Cgeq%200">, <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5CPr%5Cleft(%5Cfrac%7Bp%7D%7BF_X(x_%7Bk:S%7D)%7D%20-%20p%20%5Cleq%20t%5Cright)%20&amp;=%5CPr%5Cleft(p%20-%20pF_X(x_%7Bk:S%7D)%20%5Cleq%20tF_X(x_%7Bk:S%7D)%5Cright)%20%5C%5C%0A&amp;=%20%5CPr%5Cleft(p%20%20%5Cleq%20(t+p)F_X(x_%7Bk:S%7D)%5Cright)%20%5C%5C%0A&amp;=%5CPr%5Cleft(F_X(x_%7Bk:S%7D)%20%5Cgeq%20%5Cfrac%7Bp%7D%7Bp+t%7D%5Cright)%5C%5C%0A&amp;=%20%5CPr%5Cleft(x_%7Bk:S%7D%20%5Cgeq%20F_X%5E%7B-1%7D%5Cleft(%5Cfrac%7Bp%7D%7Bp+t%7D%5Cright)%5Cright)%5C%5C%0A&amp;=%201-%20I_%7B%5Cfrac%7Bp%7D%7Bp+t%7D%7D(k,%20S-k+1)%20%5C%5C%0A&amp;=%20I_%7B%5Cfrac%7Bt%7D%7Bp+t%7D%7D(S-k+1,%20k),%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?I_p(a,b)"> is the incomplete Beta function.</p>
<p>You could, quite reasonably, ask where the hell that incomplete Beta function came from. And if I had thought to look this up, I would say that it came from Equation 2.1.5 in David and Nagaraja’s book on order statistics. Unfortunately, I did not look this up. I derived it, which is honestly not very difficult. The trick is to basically note that the event <img src="https://latex.codecogs.com/png.latex?%5C%7Bx_%7Bk:S%7D%20%5Cleq%20%5Ctau%5C%7D"> is the same as the event that at least <img src="https://latex.codecogs.com/png.latex?k"> of the samples <img src="https://latex.codecogs.com/png.latex?x_s"> are less than or equal to <img src="https://latex.codecogs.com/png.latex?%5Ctau">. Because the <img src="https://latex.codecogs.com/png.latex?x_s"> are independent, this is the probability of observing at least <img src="https://latex.codecogs.com/png.latex?k"> heads from a coin with the probability of a head <img src="https://latex.codecogs.com/png.latex?%5CPr(x%20%5Cleq%20%5Ctau)%20=%20F_X(%5Ctau)">. If you look this up on Wikipedia<sup>14</sup> you see<sup>15</sup> that it is <img src="https://latex.codecogs.com/png.latex?I_%7BF_X(%5Ctau)%7D(k,S-k+1)">. The rest just comes from noting that <img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%20F_X%5E%7B-1%7D(p/(p+t))"> and using the symmetry <img src="https://latex.codecogs.com/png.latex?1-I_p(a,b)%20=%20I_%7B1-p%7D(b,a)">.</p>
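<p>The binomial/incomplete-Beta identity behind this is easy to check numerically. A Python sketch (my own check, not the post’s code; the regularised incomplete Beta function is computed here by crude midpoint integration because the standard library doesn’t provide it):</p>

```python
# Check the identity Pr(Z >= k) = I_p(k, S - k + 1) for Z ~ Binomial(S, p),
# where I_p(a, b) is the regularised incomplete Beta function.
import math


def binomial_tail(S, p, k):
    # Pr(Z >= k), computed exactly from the binomial pmf
    return sum(
        math.comb(S, j) * p**j * (1.0 - p) ** (S - j) for j in range(k, S + 1)
    )


def reg_inc_beta(p, a, b, n=100_000):
    # I_p(a, b) via midpoint-rule integration of the Beta(a, b) density on [0, p]
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    dx = p / n
    acc = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        acc += x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)
    return norm * acc * dx


S, p, k = 20, 0.3, 7
gap = abs(binomial_tail(S, p, k) - reg_inc_beta(p, k, S - k + 1))
```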
<p>To finish this off, we note that <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(%5Cmathcal%7BF%7D%20%5Cleq%20x)%20=%20I_%7B%5Cfrac%7B(S-k+1)x%7D%7B(S-k+1)x+%20k%7D%7D(S-k+1,k).%0A"> From this, we see that <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5CPr%5Cleft(%5Cfrac%7Bp%7D%7BF_X(x_%7Bk:S%7D)%7D%20-%20p%20%5Cleq%20t%5Cright)%20&amp;=%5CPr%5Cleft(%5Cmathcal%7BF%7D%20%5Cleq%20%5Cfrac%7Bk%7D%7Bp(S-k+1)%7Dt%5Cright)%20%5C%5C%0A&amp;=%20%5CPr%5Cleft(%5Cfrac%7Bp(S-k+1)%7D%7Bk%7D%5Cmathcal%7BF%7D%20%5Cleq%20t%5Cright).%0A%5Cend%7Balign*%7D"></p>
<p>The second result follows in the same way, noting that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BF%7D%5E%7B-1%7D"> is also F-distributed, with parameters <img src="https://latex.codecogs.com/png.latex?(2k,%202(S-k+1))">.</p>
<p><em>The proof has ended</em></p>
</div>
<p>Now, obviously, in this house we do not trust mathematics. Which is to say that I made a stupid mistake the first time I did this and forgot that when <img src="https://latex.codecogs.com/png.latex?Z"> is binomial, <img src="https://latex.codecogs.com/png.latex?%5CPr(Z%20%5Cgeq%20k)%20=%201%20-%20%5CPr(Z%20%5Cleq%20k-1)"> and had a persistent off-by-one error in my derivation. But we test out our results so we don’t end up doing the dumb thing.</p>
<p>So let’s do that. For this example, we will use generalised Pareto-distributed <img src="https://latex.codecogs.com/png.latex?X">.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-2">xi <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span></span>
<span id="cb1-3">s <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb1-4">u <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span></span>
<span id="cb1-5"></span>
<span id="cb1-6">samp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">function</span>(S, k, p, </span>
<span id="cb1-7">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Q =</span> \(x) u <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> s<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>x)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>xi)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>xi, </span>
<span id="cb1-8">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">F =</span> \(x) <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> xi<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> u)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>s)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>xi)) {</span>
<span id="cb1-9">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use theory to draw x_{k:S}</span></span>
<span id="cb1-10">  xk <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Q</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbeta</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, k, S <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb1-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> p <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">F</span>(xk), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">F</span>(xk)))</span>
<span id="cb1-12">}</span>
<span id="cb1-13"></span>
<span id="cb1-14">S <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb1-15">M <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb1-16">k <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> S <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> M <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-17">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>S</span>
<span id="cb1-18">N <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100000</span></span>
<span id="cb1-19"></span>
<span id="cb1-20">fs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rf</span>(N, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (S <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> k )</span>
<span id="cb1-21"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">theoretical =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> p <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> fs <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (S <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>k,</span>
<span id="cb1-22">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_dbl</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>N, \(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">samp</span>(S, k, p)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-23">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_ecdf</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> xks), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb1-24">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_ecdf</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> theoretical), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-25">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expression</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">frac</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>S , <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">R</span>(r[S<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>S]))))</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-06-03-that-psis-proof/that-psis-proof_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">theoretical =</span> p <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> k<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(fs <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (S <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)),</span>
<span id="cb2-2">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_dbl</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>N, \(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">samp</span>(S, k, p)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_ecdf</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> xks), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb2-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_ecdf</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> theoretical), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expression</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">frac</span>(M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>S , <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">R</span>(r[S<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>S]))))</span></code></pre></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-06-03-that-psis-proof/that-psis-proof_files/figure-html/unnamed-chunk-1-2.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Fabulous. It then follows that <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%7C1%20-%20%5Cfrac%7B1-M/S%7D%7BR(r_%7BS-M+1%7D)%7D%20%5Cright%7C%20%5Cstackrel%7Bd%7D=%20%5Cleft%7C%5Cfrac%7BM%7D%7BS%7D%20-%20%20%5Cfrac%7BM(S-M)%7D%7BS(S-M-1)%7D%5Cmathcal%7BF%7D%5Cright%7C%20%5Cleq%20%5Cfrac%7BM%7D%7BS%7D%20+%20%20%5Cfrac%7BM(S-M)%7D%7BS(S-M-1)%7D%20%5Cmathcal%7BF%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BF%7D"> has an F-distribution with <img src="https://latex.codecogs.com/png.latex?(M,%20S-M+1)"> degrees of freedom. As <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(%5Cmathcal%7BF%7D)%20=%201%20+%201/(S-M-1)">, it follows that this term goes to zero as long as <img src="https://latex.codecogs.com/png.latex?M%20=%20o(S)">. This shows that the first term in the bias goes to zero.</p>
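As a quick numerical sanity check on this bound, here is a short Monte Carlo sketch. It is in Python rather than R; the degrees of freedom follow the `rf` call in the code above, and `S` and `M` are the same illustrative values. The exact parametrisation does not change the order of magnitude.

```python
import random

# Illustrative values matching the R code earlier in the post.
S, M, N = 1000, 50, 50_000
random.seed(1)

def rand_f(d1, d2):
    # An F(d1, d2) variate built from two chi-squared (i.e. gamma) draws.
    x1 = random.gammavariate(d1 / 2, 2.0)
    x2 = random.gammavariate(d2 / 2, 2.0)
    return (x1 / d1) / (x2 / d2)

c = M * (S - M) / (S * (S - M - 1))
draws = [abs(M / S - c * rand_f(2 * M, 2 * (S - M + 1))) for _ in range(N)]
mean_abs = sum(draws) / N

# The bound says this expectation is at most roughly 2*M/S; in practice it
# is far smaller, consistent with the bound being very sloppy.
assert mean_abs < 2 * M / S
```

In this regime the simulated mean is an order of magnitude below the bound, which matches the observation that the rate is far from tight.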
<p>It’s worth noting here that we’ve also calculated that the bias is <em>at most</em> <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(M/S)">; however, this rate is extremely sloppy. That upper bound we just computed is <em>unlikely</em> to be tight. A better person than me would probably check, but honestly I just don’t give a shit.<sup>16</sup></p>
<p>The second term in the bias is <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5B%5CPr(R%20%5Cgeq%20T)%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright%5D%20%5Cmathbb%7BE%7D(H(R%20-%20T)%20%5Cmid%20R%20%5Cgeq%20T).%0A"> As before, we can write this as <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft(1%20-%20%5Cfrac%7BM/S%7D%7B1-R(T)%7D%5Cright)%7C%5Cmathbb%7BE%7D(H(R%20-%20T)%201_%7BR%20%5Cgeq%20T%7D)%7C%20%5Cleq%20%5Cleft%7C1%20-%20%5Cfrac%7BM/S%7D%7B1-R(T)%7D%5Cright%7C%5C%7Ch%5C%7C_%7BL%5E1(p)%7D.%0A"> By our lemma, we know that the distribution of the term in the absolute value when <img src="https://latex.codecogs.com/png.latex?T%20=%20r_%7BS-M+1%7D"> is the same as <img src="https://latex.codecogs.com/png.latex?%0A1-%5Cfrac%7BM%7D%7BS%7D%20-%5Cleft(1%20-%20%5Cfrac%7BM%7D%7BS%7D%20+%20%5Cfrac%7B1%7D%7BS%7D%5Cright)%5Cmathcal%7BF%7D%20=%20(%5Cmu_F-%5Cmathcal%7BF%7D)%20%20+%5Cfrac%7BM%7D%7BS%7D(%5Cmathcal%7BF%7D-%5Cmu_F)%20-%20%5Cfrac%7B1%7D%7BS%7D%5Cmathcal%7BF%7D%20+%20%20%5Cfrac%7B1%7D%7BM-1%7D%5Cleft(%5Cfrac%7BM%7D%7BS%7D%20-%201%5Cright),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BF%7D%20%5Csim%20%5Ctext%7BF%7D_%7B2(S-M+1),%202M%7D">, which has mean <img src="https://latex.codecogs.com/png.latex?%5Cmu_F%20=%201+(M-1)%5E%7B-1%7D"> and variance <img src="https://latex.codecogs.com/png.latex?%0A%5Csigma%5E2_F%20=%20%5Cfrac%7BM%5E2S%7D%7B(S-M+1)(M-1)%5E2(M-2)%7D%20=%20%5Cfrac%7B1%7D%7BM%7D(1%20+%20%5Cmathcal%7BO%7D(M%5E%7B-1%7D%20+%20MS%5E%7B-1%7D).%0A"> From Jensen’s inequality, we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(%7C%5Cmathcal%7BF%7D%20-%20%5Cmu_F%7C)%20%5Cleq%20%5Csigma_F%20=%20M%5E%7B-1/2%7D(1%20+%20o(1)).%0A"> It follows that <img 
src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft%7C1%20-%20%5Cfrac%7BM/S%7D%7B1-R(r_%7BS-M+1:S%7D)%7D%5Cright%7C%20%5Cleq%20M%5E%7B-1/2%7D(1+o(1))M%5E%7B1/2%7DS%5E%7B-1%7D(1%20+%20o(1))%20+%20S%5E%7B-1%7D(1+%20o(1))%20+%20(M-1)%5E%7B-1%7D(1+o(1)),%0A"> and so we get vanishing bias as long as <img src="https://latex.codecogs.com/png.latex?M%5Crightarrow%20%5Cinfty"> and <img src="https://latex.codecogs.com/png.latex?M/S%20%5Crightarrow%200">.</p>
<p>Once again, I make no claims of tightness<sup>17</sup>. Just because it’s a bit sloppy at this point doesn’t mean the job isn’t done.</p>
<div id="thm-thm1" class="theorem">
<p><span class="theorem-title"><strong>Theorem 1</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Ctheta_s">, <img src="https://latex.codecogs.com/png.latex?s%20=%201,%5Cldots,%20S"> be an iid sample from <img src="https://latex.codecogs.com/png.latex?G"> and let <img src="https://latex.codecogs.com/png.latex?r_s%20=%20r(%5Ctheta_s)%20%5Csim%20R">. Assume that</p>
<ol type="1">
<li><p><img src="https://latex.codecogs.com/png.latex?R"> is absolutely continuous</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?M%20%20%5Crightarrow%20%5Cinfty"> and <img src="https://latex.codecogs.com/png.latex?S%5E%7B-1%7DM%20%5Crightarrow%200"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?h%20%5Cin%20L%5E1(p)"></p></li>
</ol>
<p>Then Winsorized importance sampling converges in <img src="https://latex.codecogs.com/png.latex?L%5E1"> and is asymptotically unbiased.</p>
</div>
<p>Ok so that’s nice. But you’ll notice that I did not mention our piss-poor rate. That’s because there is absolutely no way in hell that the bias is <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(M%5E%7B-1/2%7D)">! That rate is an artefact of a <em>very</em> sloppy bound on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%7C1-%5Cmathcal%7BF%7D%7C">.</p>
<p>Unfortunately, Mathematica couldn’t help me out. Its asymptotic abilities shit the bed at the sight of <img src="https://latex.codecogs.com/png.latex?%7B%7D_2F_1(a,b;c;z))">, which is everywhere in the exact expression (which I’ve put below in the fold).</p>
<details>
<summary>
Mathematica expression for <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%7C1-%5Cmathcal%7BF%7D%7C">.
</summary>
<pre><code>-(((M/(1 + S))^(-(1/2) - S/2)*Gamma[(1 + S)/2]*
     (6*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) - 
        5*M*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) + 
        M^2*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) + 
        8*S*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) - 
        6*M*S*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) + 
        M^2*S*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) + 
        2*S^2*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) - 
        M*S^2*(M/(1 + S))^(1/2 + M/2 + S/2)*((1 + S)/(1 - M + S))^(M/2 + S/2) - 
         6*Sqrt[-(M/(-1 + M - S))]*Sqrt[(-1 - S)/(-1 + M - S)]*
        (M/(1 - M + S))^(M/2 + S/2)*Hypergeometric2F1[1, (1/2)*(-1 + M - S), 
                                                      M/2, M/(-1 + M - S)] + 8*M*Sqrt[-(M/(-1 + M - S))]*
        Sqrt[(-1 - S)/(-1 + M - S)]*(M/(1 - M + S))^(M/2 + S/2)*
        Hypergeometric2F1[1, (1/2)*(-1 + M - S), M/2, M/(-1 + M - S)] - 
        2*M^2*Sqrt[-(M/(-1 + M - S))]*Sqrt[(-1 - S)/(-1 + M - S)]*
        (M/(1 - M + S))^(M/2 + S/2)*Hypergeometric2F1[1, (1/2)*(-1 + M - S), 
                                                      M/2, M/(-1 + M - S)] - 8*Sqrt[-(M/(-1 + M - S))]*
        Sqrt[(-1 - S)/(-1 + M - S)]*S*(M/(1 - M + S))^(M/2 + S/2)*
        Hypergeometric2F1[1, (1/2)*(-1 + M - S), M/2, M/(-1 + M - S)] + 
        4*M*Sqrt[-(M/(-1 + M - S))]*Sqrt[(-1 - S)/(-1 + M - S)]*S*
        (M/(1 - M + S))^(M/2 + S/2)*Hypergeometric2F1[1, (1/2)*(-1 + M - S), 
                                                      M/2, M/(-1 + M - S)] - 2*Sqrt[-(M/(-1 + M - S))]*
        Sqrt[(-1 - S)/(-1 + M - S)]*S^2*(M/(1 - M + S))^(M/2 + S/2)*
        Hypergeometric2F1[1, (1/2)*(-1 + M - S), M/2, M/(-1 + M - S)] + 
        6*M*(M/(1 + S))^(M/2)*((1 + S)/(1 - M + S))^(M/2 + S/2)*
        Hypergeometric2F1[(1 + S)/2, (1/2)*(1 - M + S), (1/2)*(3 - M + S), 
                          (-1 + M - S)/M] - 5*M^2*(M/(1 + S))^(M/2)*((1 + S)/(1 - M + S))^
        (M/2 + S/2)*Hypergeometric2F1[(1 + S)/2, (1/2)*(1 - M + S), 
                                      (1/2)*(3 - M + S), (-1 + M - S)/M] + M^3*(M/(1 + S))^(M/2)*
        ((1 + S)/(1 - M + S))^(M/2 + S/2)*Hypergeometric2F1[(1 + S)/2, 
                                                            (1/2)*(1 - M + S), (1/2)*(3 - M + S), (-1 + M - S)/M] + 
        2*M*S*(M/(1 + S))^(M/2)*((1 + S)/(1 - M + S))^(M/2 + S/2)*
        Hypergeometric2F1[(1 + S)/2, (1/2)*(1 - M + S), (1/2)*(3 - M + S), 
                          (-1 + M - S)/M] - M^2*S*(M/(1 + S))^(M/2)*((1 + S)/(1 - M + S))^
        (M/2 + S/2)*Hypergeometric2F1[(1 + S)/2, (1/2)*(1 - M + S), 
                                      (1/2)*(3 - M + S), (-1 + M - S)/M] - 2*M*(M/(1 + S))^(M/2)*
        ((1 + S)/(1 - M + S))^(M/2 + S/2)*Hypergeometric2F1[(1 + S)/2, 
                                                            (1/2)*(3 - M + S), (1/2)*(5 - M + S), (-1 + M - S)/M] + 
        3*M^2*(M/(1 + S))^(M/2)*((1 + S)/(1 - M + S))^(M/2 + S/2)*
        Hypergeometric2F1[(1 + S)/2, (1/2)*(3 - M + S), (1/2)*(5 - M + S), 
                          (-1 + M - S)/M] - M^3*(M/(1 + S))^(M/2)*((1 + S)/(1 - M + S))^
        (M/2 + S/2)*Hypergeometric2F1[(1 + S)/2, (1/2)*(3 - M + S), 
                                      (1/2)*(5 - M + S), (-1 + M - S)/M] - 2*M*S*(M/(1 + S))^(M/2)*
        ((1 + S)/(1 - M + S))^(M/2 + S/2)*Hypergeometric2F1[(1 + S)/2, 
                                                            (1/2)*(3 - M + S), (1/2)*(5 - M + S), (-1 + M - S)/M] + 
        M^2*S*(M/(1 + S))^(M/2)*((1 + S)/(1 - M + S))^(M/2 + S/2)*
        Hypergeometric2F1[(1 + S)/2, (1/2)*(3 - M + S), (1/2)*(5 - M + S), 
                          (-1 + M - S)/M]))/(((1 + S)/(1 - M + S))^S*
                                               (2*(-2 + M)*M*Sqrt[(-1 - S)/(-1 + M - S)]*Gamma[M/2]*
                                                  Gamma[(1/2)*(5 - M + S)])))</code></pre>
</details>
<p>But do not fear: we can recover, at the cost of an assumption about the tails of <img src="https://latex.codecogs.com/png.latex?R">. (We’re also going to assume that <img src="https://latex.codecogs.com/png.latex?h"> is bounded because it makes things ever so slightly easier, although unbounded <img src="https://latex.codecogs.com/png.latex?h"> is ok<sup>18</sup> as long as it doesn’t grow too quickly relative to <img src="https://latex.codecogs.com/png.latex?r">.)</p>
<p>We are going to make the assumption that <img src="https://latex.codecogs.com/png.latex?R%20-%20T%20%5Cmid%20R%5Cgeq%20T"> is in the domain of attraction of a generalized Pareto distribution with shape parameter <img src="https://latex.codecogs.com/png.latex?k">. A sufficient condition, due to von Mises, is that <img src="https://latex.codecogs.com/png.latex?%0A%5Clim_%7Br%5Crightarrow%20%5Cinfty%7D%20%5Cfrac%7Br%20R'(r)%7D%7B1-R(r)%7D%20=%20%5Cfrac%7B1%7D%7Bk%7D.%0A"></p>
<p>This seems like a weird condition, but it’s basically just a regularity condition at infinity. For example if <img src="https://latex.codecogs.com/png.latex?1-R(r)"> is regularly varying at infinity<sup>19</sup> and <img src="https://latex.codecogs.com/png.latex?R'(r)"> is, eventually, monotone<sup>20</sup> decreasing, then this condition holds.</p>
<p>The von Mises condition is very natural for us, as <a href="https://projecteuclid.org/journals/annals-of-probability/volume-21/issue-3/Von-Mises-Conditions-Revisited/10.1214/aop/1176989120.full">Falk and Marohn (1993)</a> show that the relative error we get when approximating the tail of <img src="https://latex.codecogs.com/png.latex?R"> by a generalised Pareto density is the same as the relative error in the von Mises condition. That is, if <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BrR'(r)%7D%7B1-R(r)%7D%20=%20%5Cfrac%7B1%7D%7Bk%7D(1%20+%20%5Cmathcal%7BO%7D(r%5E%7B-%5Calpha%7D))%0A"> then <img src="https://latex.codecogs.com/png.latex?%0AR'(r)%20=%20c%20w(cr%20-%20d)(1%20+%20%5Cmathcal%7BO%7D(r%5E%7B-%5Calpha%7D)),%0A"> where <img src="https://latex.codecogs.com/png.latex?c,d"> are constants and <img src="https://latex.codecogs.com/png.latex?w"> is the density of a generalised Pareto distribution.</p>
<p>Anyway, under those two assumptions, we can swap out the density of <img src="https://latex.codecogs.com/png.latex?(R-T)%5Cmid%20R%3ET"> with its asymptotic approximation and get that, conditional on <img src="https://latex.codecogs.com/png.latex?T=%20%20r_%7BS-M+1:S%7D">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(H(R-T)%20%5Cmid%20R%3ET)%20=%20(k-1)%5E%7B-1%7DT.%0A"></p>
<p>Hence, the second term in the bias goes to zero if <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft(r_%7BS-M+1:S%7D%5Cleft(1%20-%20R(r_%7Bs-M+1:S%7D)%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright)%5Cright)%0A"> goes to zero.</p>
<p>Now this is not particularly pleasant, but it helps to recognise that even if a distribution doesn’t have finite moments, its order statistics (away from the extremes) always do. This means that we can use Cauchy-Schwarz to get <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%7C%5Cmathbb%7BE%7D%5Cleft(r_%7BS-M+1:S%7D%5Cleft(1%20-%20R(r_%7Bs-M+1:S%7D)%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright)%5Cright)%5Cright%7C%20%5Cleq%5Cmathbb%7BE%7D%5Cleft(r_%7BS-M+1:S%7D%5E2%5Cright)%5E%7B1/2%7D%5Cmathbb%7BE%7D%5Cleft%5B%5Cleft(1%20-%20R(r_%7Bs-M+1:S%7D)%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright)%5E2%5Cright%5D%5E%7B1/2%7D.%0A"></p>
<p>Arguably, the most alarming term is the first one, but that can<sup>21</sup> be tamed. To do this, we lean into a result from <a href="https://projecteuclid.org/proceedings/berkeley-symposium-on-mathematical-statistics-and-probability/Proceedings-of-the-Fifth-Berkeley-Symposium-on-Mathematical-Statistics-and/Chapter/Some-contributions-to-the-theory-of-order-statistics/bsmsp/1200513012">Bickel (1967)</a>: if you examine the proof, translate some obscurely-stated conditions, and fix a typo<sup>22</sup>, you get that <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(r_%7Bk:M%7D%5E2)%20%5Cleq%20C%20k%5Cbegin%7Bpmatrix%7D%20S%20%5C%5C%20k%5Cend%7Bpmatrix%7D%20%5Cint_0%5E1%20t%5E%7Bk-2-1%7D(1-t)%5E%7BS-k-2%7D%5C,dt.%0A"> You might worry that this is going to grow too quickly. But it doesn’t. Noting that <img src="https://latex.codecogs.com/png.latex?B(n,m)%20=%20%5CGamma(n)%5CGamma(m)/%5CGamma(n+m)">, we can rewrite the upper bound in terms of the Beta function to get <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(r_%7Bk:M%7D%5E2)%20%5Cleq%20C%20%5Cfrac%7B%5CGamma(S+1)%7D%7B%5CGamma(S-3)%7D%20%5Cfrac%7B%5CGamma(k-2)%7D%7B%5CGamma(k+1)%7D%5Cfrac%7B%5CGamma(S-k-1)%7D%7B%5CGamma(S-k+1)%7D.%0A"></p>
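The Beta-function rewrite can be spot-checked numerically. A sketch with small, hypothetical values of S and k, chosen so the integral is easy to evaluate by quadrature (unlike the asymptotic regime we actually care about):

```python
import math

S, k = 30, 10  # small hypothetical values for the check

# Left side: midpoint-rule quadrature of the integral in the bound,
# with integrand t^(k-3) * (1-t)^(S-k-2).
n = 200_000
integral = sum(
    ((i + 0.5) / n) ** (k - 3) * (1 - (i + 0.5) / n) ** (S - k - 2)
    for i in range(n)
) / n

# Right side: B(k-2, S-k-1) = Gamma(k-2) Gamma(S-k-1) / Gamma(S-3),
# computed stably on the log scale.
beta = math.exp(math.lgamma(k - 2) + math.lgamma(S - k - 1) - math.lgamma(S - 3))

assert abs(integral - beta) / beta < 1e-3
```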
<p>To show that this doesn’t grow too quickly, we use the identity <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5CGamma(x%20+%20a)%7D%7B%5CGamma(x%20+%20b)%7D%20%5Cpropto%20x%5E%7Ba-b%7D(1%20+%20%5Cmathcal%7BO%7D(x%5E%7B-1%7D)).%0A"> From this, it follows that <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(r_%7Bk:M%7D%5E2)%20%5Cleq%20C%20S%5E4k%5E%7B-3%7D(S-k)%5E%7B-2%7D(1+%20%5Cmathcal%7BO%7D(S%5E%7B-1%7D))(1+%20%5Cmathcal%7BO%7D(k%5E%7B-1%7D))(1+%20%5Cmathcal%7BO%7D((S+k)%5E%7B-1%7D)).%0A"> In this case, we are interested in <img src="https://latex.codecogs.com/png.latex?k%20=%20S-M+1">, so <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(r_%7Bk:M%7D%5E2)%20%5Cleq%20C%20S%5E4S%5E%7B-3%7DM%5E%7B-2%7D(1%20-%20M/S%20+%201/S)%5E%7B-3%7D(1%20-%201/M)%5E%7B-2%7D(1+%20%5Cmathcal%7BO%7D(S%5E%7B-1%7D))(1+%20%5Cmathcal%7BO%7D(S%5E%7B-1%7D))(1+%20%5Cmathcal%7BO%7D(M%5E%7B-1%7D)).%0A"></p>
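The gamma-ratio identity itself is easy to verify numerically; a and b below are arbitrary hypothetical exponents, and the 1 + O(1/x) correction visibly shrinks as x grows:

```python
import math

a, b = 2.5, 0.5  # arbitrary exponents for the check

def gamma_ratio_over_power(x):
    # Gamma(x+a)/Gamma(x+b) divided by x^(a-b); tends to 1 as x -> infinity.
    return math.exp(math.lgamma(x + a) - math.lgamma(x + b) - (a - b) * math.log(x))

# The deviation from 1 is O(1/x).
assert abs(gamma_ratio_over_power(10.0) - 1.0) < 0.5
assert abs(gamma_ratio_over_power(1e4) - 1.0) < 1e-3
assert abs(gamma_ratio_over_power(1e4) - 1.0) < abs(gamma_ratio_over_power(10.0) - 1.0)
```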
<p>Hence we get that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(r_%7Bk:M%7D%5E2)%20=%20%5Cmathcal%7BO%7D(SM%5E%7B-2%7D)">. This is increasing<sup>23</sup> in <img src="https://latex.codecogs.com/png.latex?S">, but we will see that it is not going up too fast.</p>
<p>For the second half of this shindig, we are going to attack <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft%5B%5Cleft(1%20-%20R(r_%7Bs-M+1:S%7D)%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright)%5E2%5Cright%5D%20=%20%5Cmathbb%7BE%7D%5Cleft%5B%5Cleft(1%20-%20R(r_%7Bs-M+1:S%7D)%5Cright)%5E2%20-%202%5Cleft(1%20-%20R(r_%7Bs-M+1:S%7D)%5Cright)%5Cfrac%7BM%7D%7BS%7D%20+%5Cleft(%5Cfrac%7BM%7D%7BS%7D%5Cright)%5E2%5Cright%5D.%0A"> A standard result<sup>24</sup> from extreme value theory is that <img src="https://latex.codecogs.com/png.latex?R(r_%7Bk:S%7D)"> has the same distribution as the <img src="https://latex.codecogs.com/png.latex?k">th order statistic from a sample of <img src="https://latex.codecogs.com/png.latex?S"> iid <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BUniform%7D(%5B0,1%5D)"> random variables. Hence<sup>25</sup>, <img src="https://latex.codecogs.com/png.latex?%0AR(r_%7BS-M+1:S%7D)%20%5Csim%20%5Ctext%7BBeta%7D(S-M+1,%20M).%0A"> It follows<sup>26</sup> that <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(1-%20R(r_%7BS-M+1:S%7D))%20=%20%5Cfrac%7BM%7D%7BS+1%7D%20=%20%5Cfrac%7BM%7D%7BS%7D%5Cfrac%7B1%7D%7B1+S%5E%7B-1%7D%7D%0A"> and <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D((1-%20R(r_%7BS-M+1:S%7D))%5E2)%20=%20%5Cfrac%7BM(M+1)%7D%7B(S+1)(S+2)%7D%20=%20%5Cfrac%7BM%5E2%7D%7BS%5E2%7D%5Cleft(%5Cfrac%7B1%20+%20M%5E%7B-1%7D%7D%7B1%20+%203S%5E%7B-1%7D%20+%202S%5E%7B-2%7D%7D%5Cright).%0A"> Adding these together and doing some asymptotic expansions, we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Cleft%5B%5Cleft(1%20-%20R(r_%7Bs-M+1:S%7D)%20-%20%5Cfrac%7BM%7D%7BS%7D%5Cright)%5E2%5Cright%5D%20=%20%5Cfrac%7BM%5E2%7D%7BS%5E2%7D%20+%20%5Cmathcal%7BO%7D%5Cleft(%5Cfrac%7BM%7D%7BS%5E2%7D%5Cright),%0A"> which goes to zero<sup>27</sup> like <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(S%5E%7B-1%7D)"> if <img 
src="https://latex.codecogs.com/png.latex?M%20=%20%5Cmathcal%7BO%7D(S%5E%7B1/2%7D)">.</p>
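<p>None of this is deep, so it is easy to sanity-check by simulation. The snippet below is my own code (not from any package): since each R(r_s) is uniform, R(r_{S-M+1:S}) is a uniform order statistic and its Beta distribution pins down E(1 - R(r_{S-M+1:S})) = M/(S+1).</p>

```python
import numpy as np

# Sanity check (my code, not from the post): since each R(r_s) is Uniform(0, 1),
# R(r_{S-M+1:S}) is the (S - M + 1)th uniform order statistic, which is
# Beta(S - M + 1, M). In particular, E(1 - R(r_{S-M+1:S})) = M / (S + 1).
rng = np.random.default_rng(0)
S, M, reps = 1000, 30, 5000

u = rng.uniform(size=(reps, S))
order_stat = np.sort(u, axis=1)[:, S - M]   # the (S - M + 1)th order statistic
empirical = np.mean(1.0 - order_stat)
theoretical = M / (S + 1)
# empirical and theoretical should agree to about three decimal places
```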
<p>Multiplying these rates together, we get that the second term in the bias is bounded above by <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5B%5Cleft(%5Cfrac%7BS%7D%7BM%5E2%7D%20(1%20+%20%5Cmathcal%7BO%7D(M%5E%7B-1%7D%20+%20MS%5E%7B-1%7D))%5Cright)%5Cleft(%5Cfrac%7BM%5E2%7D%7BS%5E2%7D%20(1%20+%20%5Cmathcal%7BO%7D(M%5E%7B-1%7D%20+%20MS%5E%7B-1%7D)%5Cright)%5Cright%5D%5E%7B1/2%7D%20=%20S%5E%7B-1/2%7D(1%20+%20o(1)).%0A"></p>
<p>Putting all of this together we have proved the following Corollary.</p>
<div id="cor-cor1" class="theorem corollary">
<p><span class="theorem-title"><strong>Corollary 1</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Ctheta_s">, <img src="https://latex.codecogs.com/png.latex?s%20=%201,%5Cldots,%20S"> be an iid sample from <img src="https://latex.codecogs.com/png.latex?G"> and let <img src="https://latex.codecogs.com/png.latex?r_s%20=%20r(%5Ctheta_s)%20%5Csim%20R">. Assume that</p>
<ol type="1">
<li><p><img src="https://latex.codecogs.com/png.latex?R"> is absolutely continuous and satisfies the von Mises condition<sup>28</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BrR'(r)%7D%7B1-R(r)%7D%20=%20%5Cfrac%7B1%7D%7Bk%7D(1%20+%5Cmathcal%7BO%7D(r%5E%7B-1%7D)).%0A"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?M%20%20=%20o(S)"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?h"> is bounded<sup>29</sup></p></li>
</ol>
<p>Then Winsorized importance sampling converges in <img src="https://latex.codecogs.com/png.latex?L%5E1"> at a rate of at most <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(MS%5E%7B-1%7D%20+%20S%5E%7B-1/2%7D)">, which is balanced when <img src="https://latex.codecogs.com/png.latex?M%20=%20%5Cmathcal%7BO%7D(S%5E%7B1/2%7D)">. Hence, WIS is<sup>30</sup> <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bn%7D">-consistent.</p>
</div>
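<p>To make the object of the corollary concrete, here is a minimal numpy sketch (my code; the names are mine) of the Winsorized importance sampling estimator, which clips every importance ratio at the (S-M+1)th order statistic. The toy target/proposal pair is the exponential example from the footnotes, for which plain importance sampling has infinite variance.</p>

```python
import numpy as np

# A minimal sketch (my code, not from any package) of Winsorized importance
# sampling: clip every importance ratio at the (S - M + 1)th order statistic,
#   I_WIS = (1/S) * sum_s h(theta_s) * min(r(theta_s), r_{S-M+1:S}).
def winsorized_is(h_vals, r_vals, M):
    S = len(r_vals)
    threshold = np.sort(r_vals)[S - M]          # r_{S-M+1:S}
    return np.mean(h_vals * np.minimum(r_vals, threshold))

# Toy example: target p = Exp(1), proposal g = Exp with rate lam > 1, so the
# ratio r(theta) = exp((lam - 1) * theta) / lam is heavy-tailed (k = 1 - 1/lam)
# and plain importance sampling has infinite variance.
rng = np.random.default_rng(1)
lam, S = 2.0, 100_000
theta = rng.exponential(scale=1.0 / lam, size=S)
r = np.exp((lam - 1.0) * theta) / lam
h = np.cos(theta)                               # a bounded test function
estimate = winsorized_is(h, r, M=int(np.sqrt(S)))
# The target value is E_p[cos(theta)] = 1/2.
```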
<section id="variance-of-winsorized-importance-sampling" class="level3">
<h3 class="anchored" data-anchor-id="variance-of-winsorized-importance-sampling">Variance of Winsorized Importance Sampling</h3>
<p>Right, that was a bit of a journey, but let’s keep going to the variance.</p>
<p>It turns out that following the route I thought I was going to follow does not end well. That lovely set of tricks breaking up the variance into two conditional terms turns out to be very very unnecessary. Which is good, because I thoroughly failed to make the argument work.</p>
<p>If you’re curious, the problem is that the random variable <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BMr_%7BS-M+1:S%7D%7D%7BS%7D%20%5Cmathbb%7BE%7D(H%20%5Cmid%20R%20%5Cgeq%20r_%7BS-M+1:S%7D)%20=%20%5Cfrac%7BMr_%7BS-M+1:S%7D%7D%7BS(1-R(r_%7BS-M+1:S%7D))%7D%20%5Cmathbb%7BE%7D(H%201_%7BR%20%5Cgeq%20r_%7BS-M+1:S%7D%7D)%0A"> is an absolute <em>bastard</em> to bound. The problem is that <img src="https://latex.codecogs.com/png.latex?1-%20R(%7Br_%7BS-M+1:S%7D%7D)%20%5Capprox%20M/S"> and so the usual trick of bounding that truncated expectation by <img src="https://latex.codecogs.com/png.latex?%5C%7Ch%5C%7C"> or some such thing will prove that the variance is <em>finite</em> but not that it goes to zero. There is a solid chance that the Cauchy-Schwarz inequality <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BMr_%7BS-M+1:S%7D%5E%7B1/2%7D%7D%7BS(1-R(r_%7BS-M+1:S%7D))%7D%20%5Cmathbb%7BE%7D(r_%7BS-M+1:S%7D%5E%7B1/2%7DH%201_%7BR%20%5Cgeq%20r_%7BS-M+1:S%7D%7D)%20%5Cleq%5Cfrac%7BMr_%7BS-M+1:S%7D%5E%7B1/2%7D%7D%7BS(1-R(r_%7BS-M+1:S%7D))%7DR(r_%7BS-M+1:S%7D)%5C%7Ch%5C%7C_%7BL%5E2(p)%7D%0A"> would work. But truly that is just bloody messy<sup>31</sup>.</p>
<p>So let’s do it the easy way, shall we? Fundamentally, we will use <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BV%7D%5Cleft(I_%5Ctext%7BWIS%7D%5ES%5Cright)%20%5Cleq%20%5Cmathbb%7BE%7D%5Cleft(%5BI_%5Ctext%7BWIS%7D%5ES%5D%5E2%5Cright).%0A"> Note that we can write <img src="https://latex.codecogs.com/png.latex?I_%5Ctext%7BWIS%7D%5ES"> compactly as <img src="https://latex.codecogs.com/png.latex?%0AI_%5Ctext%7BWIS%7D%5ES%20=%20%5Cfrac%7B1%7D%7BS%7D%5Csum_%7Bs=1%7D%5ES%20h(%5Ctheta_s)%5Cmin%5C%7Br(%5Ctheta_s),%20r_%7BS-M+1:S%7D%5C%7D.%0A"> Hence, <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5Cleft(%5BI_%5Ctext%7BWIS%7D%5ES%5D%5E2%5Cright)%20&amp;=%20%5Cmathbb%7BE%7D_%7BT%5Csim%20r_%7BS-M+1:S%7D%7D%5Cleft%5B%5Cmathbb%7BE%7D%5Cleft(%5BI_%5Ctext%7BWIS%7D%5ES%5D%5E2%20%5Cmid%20r_%7BS-M+1:S%7D%20=%20T%5Cright)%5Cright%5D%5C%5C%0A&amp;=%5Cfrac%7B1%7D%7BS%5E2%7D%5Cmathbb%7BE%7D_%7BT%5Csim%20r_%7BS-M+1:S%7D%7D%5Cleft%5B%5Cmathbb%7BE%7D%5Cleft(H%5E2%20%5Cmin%5C%7BR%5E2,T%5E2%5C%7D%20%5Cmid%20r_%7BS-M+1:S%7D%20=%20T%5Cright)%5Cright%5D%5C%5C%0A&amp;%5Cleq%5Cfrac%7B1%7D%7BS%5E2%7D%5Cmathbb%7BE%7D_%7BT%5Csim%20r_%7BS-M+1:S%7D%7D%5Cleft%5B%5Cmathbb%7BE%7D%5Cleft(RTH%5E2%20%5Cmid%20r_%7BS-M+1:S%7D%20=%20T%5Cright)%5Cright%5D%20%5C%5C%0A&amp;%5Cleq%5Cfrac%7B1%7D%7BS%5E2%7D%5Cmathbb%7BE%7D_%7BT%5Csim%20r_%7BS-M+1:S%7D%7D%5Cleft%5BT%5C%7Ch%5C%7C_%7BL%5E2(p)%7D%5E2%5Cright%5D%0A%5Cend%7Balign*%7D"></p>
<p>This goes to zero as long as <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(r_%7BS-M+1:S%7D)%20=%20o(S%5E2)">.</p>
<p><a href="https://projecteuclid.org/proceedings/berkeley-symposium-on-mathematical-statistics-and-probability/Proceedings-of-the-Fifth-Berkeley-Symposium-on-Mathematical-Statistics-and/Chapter/Some-contributions-to-the-theory-of-order-statistics/bsmsp/1200513012">Bickel (1967)</a> shows that, noting that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(R)%20%3C%20%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(r_%7BS-M+1:S%7D)%20%5Cleq%20C%20(S-M+1)%5Cfrac%7B%5CGamma(S+1)%5CGamma(S-M+1-1)%5CGamma(M)%7D%7B%5CGamma(S-M+1+1)%5CGamma(M+1)%5CGamma(S-1)%7D%20=%20%5Cfrac%7BS%7D%7BM%7D(1%20+%20o(1)),%0A"> and so the variance goes to zero.</p>
<p>The previous argument shows that the variance is <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(M%5E%7B-1%7DS%5E%7B-1%7D)">. We can refine that if we assume the von Mises condition holds. In that case we know that <img src="https://latex.codecogs.com/png.latex?R(r)%20=%201-%20cr%5E%7B-1/k%7D%20+%20o(1)"> as <img src="https://latex.codecogs.com/png.latex?r%5Crightarrow%20%5Cinfty"> and therefore <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AR%5Cleft(R%5E%7B-1%7D%5Cleft(1-%5Cfrac%7BM%7D%7BS+1%7D%5Cright)%5Cright)%20&amp;=%201-%5Cfrac%7BM%7D%7BS+1%7D%5C%5C%0A1%20-%20cR%5E%7B-1%7D%5Cleft(1-%5Cfrac%7BM%7D%7BS+1%7D%5Cright)%5E%7B-1/k%7D(1+o(1))%20&amp;=%201-%20%5Cfrac%7BM%7D%7BS+1%7D%20%5C%5C%0AR%5E%7B-1%7D%5Cleft(1-%5Cfrac%7BM%7D%7BS+1%7D%5Cright)%20&amp;=%20c%5E%7B-k%7D%5Cleft(%5Cfrac%7BM%7D%7BS+1%7D%5Cright)%5E%7B-k%7D(1%20+%20o(1)).%0A%5Cend%7Balign*%7D"> Bickel (1967) shows that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(r_%7BS-M+1:S%7D)%20=%20R%5E%7B-1%7D(1-M/(S+1))%20+%20o(1)"> so combining this with the previous result gives a variance of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D((M/S)%5E%7B2-k%7D)">. If we take <img src="https://latex.codecogs.com/png.latex?M%20=%5Cmathcal%7BO%7D(S%5E%7B1/2%7D)">, this gives <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(S%5E%7Bk/2-1%7D)">, which is smaller than the previous bound for <img src="https://latex.codecogs.com/png.latex?k%3C1">. Either way, the variance goes to zero.</p>
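<p>The threshold asymptotic in the middle of that argument is also easy to eyeball numerically. A quick check (my code) for the pure Pareto tail R(r) = 1 - r^(-1/k), for which R^(-1)(1 - M/(S+1)) = (M/(S+1))^(-k):</p>

```python
import numpy as np

# Numerical check (my code) of the threshold asymptotic: for the Pareto tail
# R(r) = 1 - r**(-1/k) we have R^{-1}(q) = (1 - q)**(-k), so the expectation
# E(r_{S-M+1:S}) should be close to (M / (S + 1))**(-k).
rng = np.random.default_rng(4)
k, S, M, reps = 0.5, 2000, 50, 2000

r = rng.uniform(size=(reps, S)) ** (-k)        # inverse-CDF draws from R
thresholds = np.sort(r, axis=1)[:, S - M]      # r_{S-M+1:S} in each replicate
empirical = thresholds.mean()
theoretical = (M / (S + 1)) ** (-k)
# the two agree to within a few percent
```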
<p>The argument that we used here is a modification of the argument in the TIS paper. This led to a great deal of panic: did I just make my life extremely difficult? Could I have modified the TIS proof to show the bias goes to zero? To be honest, someone might be able to, but I can’t.</p>
<p>So anyway, we’ve proved the following theorem.</p>
<div id="thm-thm2" class="theorem">
<p><span class="theorem-title"><strong>Theorem 2</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Ctheta_s">, <img src="https://latex.codecogs.com/png.latex?s%20=%201,%5Cldots,%20S"> be an iid sample from <img src="https://latex.codecogs.com/png.latex?G"> and let <img src="https://latex.codecogs.com/png.latex?r_s%20=%20r(%5Ctheta_s)%20%5Csim%20R">. Assume that</p>
<ol type="1">
<li><p><img src="https://latex.codecogs.com/png.latex?R"> is absolutely continuous</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?M%20%5Crightarrow%20%5Cinfty"> and <img src="https://latex.codecogs.com/png.latex?MS%5E%7B-1%7D%20%5Crightarrow%200"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?h%20%5Cin%20L%5E2(p)">.</p></li>
</ol>
<p>Then the variance of Winsorized importance sampling is at most <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(M%5E%7B-1%7DS%5E%7B-1%7D)">.</p>
</div>
</section>
</section>
<section id="pareto-smoothed-importance-sampling" class="level2">
<h2 class="anchored" data-anchor-id="pareto-smoothed-importance-sampling">Pareto-smoothed importance sampling</h2>
<p>Pareto-smoothed importance sampling (or PSIS) takes the observation that the tails are approximately Pareto distributed and uses it to add some bias correction to the mix. Essentially, it works by noting that <img src="https://latex.codecogs.com/png.latex?%0A(1-R(r_%7BS-M+1:S%7D))%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%3Er_%7BS-M+1:S%7D)%20%5Capprox%20%5Cfrac%7B1%7D%7BS%7D%5Csum_%7Bm=1%7D%5EM%20w_m%20h_%7BS-M+m:S%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?w_m"> is the median<sup>32</sup> of the <img src="https://latex.codecogs.com/png.latex?m">th order statistic in an iid sample of <img src="https://latex.codecogs.com/png.latex?M"> Generalised Pareto random variables with tail parameters fitted to the distribution.</p>
<p>This is a … funky … quadrature rule. To see that, we can write <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D(HR%20%5Cmid%20R%3ET)%20=%20%5Cmathbb%7BE%7D(R%20%5Cmathbb%7BE%7D(H%20%5Cmid%20R)).%0A"> PSIS then approximates the conditional distribution of <img src="https://latex.codecogs.com/png.latex?R"> given <img src="https://latex.codecogs.com/png.latex?R%20%3E%20T"> by <img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%7BR%7D_%5Ctext%7BPSIS%7D(r)%20=%20%5Cfrac%7B1%7D%7BM%7D%5Csum_%7Bm=1%7D%5EM%201(%20w_m%3Cr)%0A"> and the conditional probability by <img src="https://latex.codecogs.com/png.latex?%0A%5CPr(H%20%3C%20h%5Cmid%20R%20=%20w_m)%20%5Capprox%201(h_%7BS-M+m:S%7D%3C%20h).%0A"></p>
<p>Empirically, this is a very good choice (with the mild caveat that you need to truncate the largest expected order statistic by the observed maximum in order to avoid some variability issues). I would love to have a good analysis of why that is so, but honestly I do not.</p>
<p>But, to return to the issue of this blog post: the convergence and vanishing variance still hold. To see this, we note that <img src="https://latex.codecogs.com/png.latex?%0Aw_m%20=%20r_%7BS-M+1:S%7D%20+%20k%5E%7B-1%7D%5Csigma%5Cleft%5B%5Cleft(1-%5Cfrac%7Bm-1/2%7D%7BM%7D%5Cright)%5E%7B-k%7D%20-1%5Cright%5D.%0A"> So we are just re-weighting our tail <img src="https://latex.codecogs.com/png.latex?H"> samples by <img src="https://latex.codecogs.com/png.latex?%0A1%20+%20%5Cfrac%7B%5Csigma%7D%7Bkr_%7BS-M+1:S%7D%7D%5Cleft%5B%5Cleft(1-%5Cfrac%7Bm-1/2%7D%7BM%7D%5Cright)%5E%7B-k%7D%20-1%5Cright%5D.%0A"></p>
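<p>For concreteness, here is one way the smoothing step could look in numpy. This is my own sketch under the assumption that k and sigma are already known (as in the theorem below); a real implementation fits them to the top-M ratios.</p>

```python
import numpy as np

# A sketch (my code) of the smoothing step: overwrite the M largest importance
# ratios with the expected GPD order statistics
#   w_m = u + (sigma / k) * ((1 - (m - 1/2) / M)**(-k) - 1),
# where u = r_{S-M+1:S}. Here k and sigma are taken as known; a real
# implementation fits them to the tail sample.
def pareto_smooth(r, M, k, sigma):
    S = len(r)
    order = np.argsort(r)
    u = r[order[S - M]]                # the winsorization threshold r_{S-M+1:S}
    m = np.arange(1, M + 1)
    w = u + (sigma / k) * ((1.0 - (m - 0.5) / M) ** (-k) - 1.0)
    smoothed = r.copy()
    smoothed[order[S - M:]] = w        # replace the top M ratios, in order
    # truncate at the observed maximum to tame the largest smoothed weight
    return np.minimum(smoothed, r.max())

rng = np.random.default_rng(3)
r = rng.pareto(2.0, size=1000) + 1.0   # Pareto(2) ratios, i.e. tail index k = 0.5
smoothed = pareto_smooth(r, M=100, k=0.5, sigma=1.0)
```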
<p>Recalling that when <img src="https://latex.codecogs.com/png.latex?R(r)%20=%201-%20cr%5E%7B-1/k%7D(1+%20o(1))">, we had <img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%20%5Cmathcal%7BO%7D(r_%7BS-M+1:S%7D)">, this term is at most <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(1%20+%20M%5E%7B-k%7D)">. This will not trouble either of our convergence proofs.</p>
<p>This leads to the following modification of our previous results.</p>
<div id="thm-thm3" class="theorem">
<p><span class="theorem-title"><strong>Theorem 3</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Ctheta_s">, <img src="https://latex.codecogs.com/png.latex?s%20=%201,%5Cldots,%20S"> be an iid sample from <img src="https://latex.codecogs.com/png.latex?G"> and let <img src="https://latex.codecogs.com/png.latex?r_s%20=%20r(%5Ctheta_s)%20%5Csim%20R">. Assume that</p>
<ol type="1">
<li><p><img src="https://latex.codecogs.com/png.latex?R"> is absolutely continuous.</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?M%20%20=%20%5Cmathcal%7BO%7D(S%5E%7B1/2%7D)"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?h%20%5Cin%20L%5E2(p)"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?k"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma"> are known with <img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%20%5Cmathcal%7BO%7D(r_%7BS-M+1:S%7D)">.</p></li>
</ol>
<p>Then Pareto smoothed importance sampling converges in <img src="https://latex.codecogs.com/png.latex?L%5E1"> and its variance goes to zero; in particular, it is consistent and asymptotically unbiased.</p>
</div>
<div id="cor-cor2" class="theorem corollary">
<p><span class="theorem-title"><strong>Corollary 2</strong></span> Assume further that</p>
<ol type="1">
<li><p>R satisfies the von Mises condition<sup>33</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BrR'(r)%7D%7B1-R(r)%7D%20=%20%5Cfrac%7B1%7D%7Bk%7D(1%20+%5Cmathcal%7BO%7D(r%5E%7B-1%7D)).%0A"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?h"> is bounded<sup>34</sup>.</p></li>
</ol>
<p>Then the <img src="https://latex.codecogs.com/png.latex?L%5E1"> convergence occurs at a rate of at most <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(S%5E%7B-1/2%7D)">. Furthermore, the variance of the PSIS estimator goes to zero at least as fast as <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(S%5E%7Bk/2-1%7D)">.</p>
</div>
<p>Hence, under these additional conditions PSIS is<sup>35</sup> <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bn%7D">-consistent.</p>
</section>
<section id="final-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="final-thoughts">Final thoughts</h2>
<p>So that’s what truncation and winsorization do to importance sampling estimates. I haven’t touched on the fairly important topic of asymptotic normality. Essentially, <a href="https://www.sciencedirect.com/science/article/pii/0304414988900312">Griffin (1988)</a>, in a fairly complex<sup>36</sup> paper, suggests that if you winsorize the product <img src="https://latex.codecogs.com/png.latex?(h(%5Ctheta_s)r(%5Ctheta_s))"> <em>and</em> winsorize it at both ends, the von Mises condition<sup>37</sup> implies that the WIS estimator is asymptotically normal.</p>
<p>Why is this important? Well, the same proof shows that doubly winsorized importance sampling (dWIS) applied to the vector-valued function <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20h(%5Ctheta)%20=%20(h(%5Ctheta),1)"> will also be asymptotically normal, which implies, via the delta method, that the <em>self-normalized</em> dWIS estimator <img src="https://latex.codecogs.com/png.latex?%0AI%5ES_%5Ctext%7BSN-IS%7D%20=%20%5Cfrac%7B%5Csum_%7Bs=1%7D%5ES%5Cmax%5C%7B%5Cmin%5C%7Bh(%5Ctheta_s)%20r(%5Ctheta_s),T_%7BS-M+1:S%7D%5C%7D,%20T_%7BM:S%7D%5C%7D%7D%7B%5Csum_%7Bs=1%7D%5ES%5Cmax%5C%7B%5Cmin%5C%7Br(%5Ctheta_s),T_%7BS-M+1:S%7D%5C%7D,T_%7BM:S%7D%5C%7D%7D%0A"> is consistent, where <img src="https://latex.codecogs.com/png.latex?T_%7Bm:S%7D"> is the <img src="https://latex.codecogs.com/png.latex?m">th order statistic of <img src="https://latex.codecogs.com/png.latex?%5Cmax%5C%7Bh(%5Ctheta_s)r(%5Ctheta_s),%20r(%5Ctheta_s)%5C%7D">.</p>
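<p>Since that estimator is a bit of a mouthful, here is a naive numpy transcription (my code, and nothing more than a literal reading of the display above):</p>

```python
import numpy as np

# A direct (and deliberately naive) transcription of the self-normalized doubly
# winsorized estimator: both h*r and r are clipped to [T_{M:S}, T_{S-M+1:S}],
# where the T's are order statistics of max(h*r, r). My sketch, not library code.
def sn_dwis(h_vals, r_vals, M):
    S = len(r_vals)
    T = np.sort(np.maximum(h_vals * r_vals, r_vals))
    lo, hi = T[M - 1], T[S - M]            # T_{M:S} and T_{S-M+1:S}
    num = np.clip(h_vals * r_vals, lo, hi).sum()
    den = np.clip(r_vals, lo, hi).sum()
    return num / den

# Sanity check: when h is identically 1 the numerator and denominator coincide,
# so the estimator is exactly 1 whatever the weights are.
rng = np.random.default_rng(5)
r = rng.pareto(2.0, size=100) + 1.0
print(sn_dwis(np.ones(100), r, M=5))  # 1.0
```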
<p>It is very very likely that this can be shown (perhaps under some assumptions) for something closer to the version of PSIS we use in practice. But that is an open question.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>proportional to↩︎</p></li>
<li id="fn2"><p>because <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta_s)"> is very small↩︎</p></li>
<li id="fn3"><p>because <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta_s)"> is a reasonable size, but <img src="https://latex.codecogs.com/png.latex?g(%5Ctheta_s)"> is tiny.↩︎</p></li>
<li id="fn4"><p>I have surreptitiously dropped the <img src="https://latex.codecogs.com/png.latex?h"> subscript because I am gay and sneaky.↩︎</p></li>
<li id="fn5"><p>That it’s parameterised by <img src="https://latex.codecogs.com/png.latex?1/k"> is an artefact of history.↩︎</p></li>
<li id="fn6"><p>We need <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(R)"> to be finite, so we need <img src="https://latex.codecogs.com/png.latex?k%3C1">.↩︎</p></li>
<li id="fn7"><p>very fucking complex↩︎</p></li>
<li id="fn8"><p>I have used that old trick of using the same letter for the CDF as the random variable when I have a lot of random variables. ↩︎</p></li>
<li id="fn9"><p>aka the tail index↩︎</p></li>
<li id="fn10"><p>This is a relevant case. But if you think a little bit about it, our problem happens when <img src="https://latex.codecogs.com/png.latex?r(%5Ctheta)"> grows <em>much</em> faster than <img src="https://latex.codecogs.com/png.latex?h(%5Ctheta)">. For example if <img src="https://latex.codecogs.com/png.latex?P%20=%20%5Coperatorname%7BExp%7D(1)"> and <img src="https://latex.codecogs.com/png.latex?G%20=%20%5Coperatorname%7BExp%7D(1/%5Clambda)"> for <img src="https://latex.codecogs.com/png.latex?%5Clambda%3E1">, then <img src="https://latex.codecogs.com/png.latex?k%20=%201-1/%5Clambda">, <img src="https://latex.codecogs.com/png.latex?r(%5Ctheta)%20=%20%5Cexp((%5Clambda-1)%5Ctheta)"> and if <img src="https://latex.codecogs.com/png.latex?%7Ch(%5Ctheta)%7C%20%3C%20%7C%5Ctheta%7C%5E%5Calpha">, then <img src="https://latex.codecogs.com/png.latex?%7Ch(%5Ctheta)%7C%20%5Cleq%20C%20%5Clog(r)%5E%5Calpha">, which is a slowly growing function.↩︎</p></li>
<li id="fn11"><p>Because the truncation depends on <img src="https://latex.codecogs.com/png.latex?S">, moving from the <img src="https://latex.codecogs.com/png.latex?S">th partial sum to the <img src="https://latex.codecogs.com/png.latex?S+1">th partial sum changes the distribution of <img src="https://latex.codecogs.com/png.latex?z_ih_ir_i">. This is exactly why the dead Russians gifted us with triangular arrays.↩︎</p></li>
<li id="fn12"><p>Also practical unbounded <img src="https://latex.codecogs.com/png.latex?h">, but it’s just easier for bounded <img src="https://latex.codecogs.com/png.latex?h">↩︎</p></li>
<li id="fn13"><p>Shut up. I know. Don’t care.↩︎</p></li>
<li id="fn14"><p>or, hell, even in a book↩︎</p></li>
<li id="fn15"><p>Straight up, though, I spent 2 days dicking around with tail bounds on sums of Bernoulli random variables for some bloody reason before I just looked at the damn formula.↩︎</p></li>
<li id="fn16"><p>Ok. I checked. And yeah. Same technique as below using Jensen in its <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(%7CX-%5Cmathbb%7BE%7D(X)%7C)%5E2%20%5Cleq%20%5Cmathbb%7BV%7D(X)">. If you put that together you get something that goes to zero like <img src="https://latex.codecogs.com/png.latex?M%5E%7B1/2%7DS%5E%7B-1%7D">, which is <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(S%5E%7B-3/4%7D)"> for our usual choice of <img src="https://latex.codecogs.com/png.latex?M">. Which confirms the suspicion that the first term in the bias goes to zero <em>much</em> faster than the second (remembering, of course, that Jensen’s inequality is notoriously loose!).↩︎</p></li>
<li id="fn17"><p>It’s Pride month↩︎</p></li>
<li id="fn18"><p>The result holds exactly if <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(H%20%5Cmid%20R=r)%20=%20%5Cmathcal%7BO%7D(%5Clog%5Ek(r))"> and with a <img src="https://latex.codecogs.com/png.latex?k"> turning up somewhere if it’s <img src="https://latex.codecogs.com/png.latex?o(r%5E%7B1/k%20-%201%7D)">.↩︎</p></li>
<li id="fn19"><p><img src="https://latex.codecogs.com/png.latex?1-R(r)%20%5Csim%20c%20r%5E%7B(-1/k)%7D%5Cmathcal%7BL(r)%7D"> for a slowly varying function (eg a power of a logarithm) <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D(r)">.↩︎</p></li>
<li id="fn20"><p>A property that implies this is that <img src="https://latex.codecogs.com/png.latex?1-R(r)"> is differentiable and <em>convex at infinity</em>, which is to say that there is some finite <img src="https://latex.codecogs.com/png.latex?r_0"> such that <img src="https://latex.codecogs.com/png.latex?R'(r)"> exists for all <img src="https://latex.codecogs.com/png.latex?r%20%5Cgeq%20r_0"> and <img src="https://latex.codecogs.com/png.latex?1-R(r)"> is a monotone function on <img src="https://latex.codecogs.com/png.latex?%5Br_0,%20%5Cinfty)">.↩︎</p></li>
<li id="fn21"><p>There’s a condition here that <img src="https://latex.codecogs.com/png.latex?S"> has to be large enough, but it’s enough if <img src="https://latex.codecogs.com/png.latex?(S-M+1)%20%3E%202">.↩︎</p></li>
<li id="fn22"><p>The first <img src="https://latex.codecogs.com/png.latex?k"> in the equation below is missing in the paper. If you miss this, you suddenly get the expected value converging to zero, which would be <em>very</em> surprising. Always sense-check the proofs, people. Even if a famous person did it in the 60s.↩︎</p></li>
<li id="fn23"><p>We need to take <img src="https://latex.codecogs.com/png.latex?M%20=%20%5Cmathcal%7BO%7D(S%5E%7B1/2%7D)"> to be able to estimate the tail index <img src="https://latex.codecogs.com/png.latex?k"> from a sample, which gives an upper bound by a constant.↩︎</p></li>
<li id="fn24"><p>Note that if <img src="https://latex.codecogs.com/png.latex?U%20%5Csim%20%5Ctext%7BUnif%7D(0,1)">, then <img src="https://latex.codecogs.com/png.latex?R%5E%7B-1%7D(U)%20%5Csim%20R">. Because this is monotone, it doesn’t change ordering of the sample↩︎</p></li>
<li id="fn25"><p>This is, incidentally, how Bickel got the upper bound on the moments. He combined this with an upper bound on the quantile function.↩︎</p></li>
<li id="fn26"><p>Save the cheerleader, save the world. Except it’s one minus a beta is still beta but with the parameters reversed.↩︎</p></li>
<li id="fn27"><p>As long as <img src="https://latex.codecogs.com/png.latex?M%20=%20o(S)">↩︎</p></li>
<li id="fn28"><p>The rate here is probably not optimal, but it will guarantee that the error in the Pareto approximation doesn’t swamp the other terms.↩︎</p></li>
<li id="fn29"><p>Or <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(h(%5Ctheta)%20%5Cmid%20r(%5Ctheta)%20=%20r)"> doesn’t grow too quickly, with some modification of the rates in the unlikely case that it grows polynomially.↩︎</p></li>
<li id="fn30"><p>almost, there’s an epsilon gap but I don’t give a shit↩︎</p></li>
<li id="fn31"><p>And girl do not get me started on messy. I ended up going down a route where I used the <a href="https://www.sciencedirect.com/science/article/pii/0167715288900077">inequality</a> <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BV%7D(g(U))%20%5Cleq%20%5Cmathbb%7BE%7D(U)%5Cint_0%5E1%5Cleft%5BF_U(u)%20-%20%5Cfrac%7B%5Cmathbb%7BE%7D(U1_%7BU%5Cleq%20u%7D)%7D%7B%5Cmathbb%7BE%7D(U)%7D%5Cright%5D%5Bg'(u)%5D%5E2%5C,du%0A"> which holds for any <img src="https://latex.codecogs.com/png.latex?U"> supported on <img src="https://latex.codecogs.com/png.latex?%5B0,1%5D"> with differentiable density. And let me tell you. If you dick around with enough beta distributions you can get something. Is it what you want? Fucking no. It is <em>a lot</em> of work, including having to differentiate the conditional expectation, and it gives you sweet bugger all.↩︎</p></li>
<li id="fn32"><p>Or, the expected within <img src="https://latex.codecogs.com/png.latex?o(S%5E%7B-1/2%7D)">↩︎</p></li>
<li id="fn33"><p>The rate here is probably not optimal, but it will guarantee that the error in the Pareto approximation doesn’t swamp the other terms.↩︎</p></li>
<li id="fn34"><p>Or <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(h(%5Ctheta)%20%5Cmid%20r(%5Ctheta)%20=%20r)"> doesn’t grow too quickly, with some modification of the rates in the unlikely case that it grows polynomially.↩︎</p></li>
<li id="fn35"><p>almost, there’s an epsilon gap but I don’t give a shit↩︎</p></li>
<li id="fn36"><p>I mean, the tools are elementary. It’s just a lot of detailed estimates and Berry-Esseen as far as the eye can see.↩︎</p></li>
<li id="fn37"><p>and more general things↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Tail Stabilization of Importance Sampling Estimators: {A} Bit
    of Theory},
  date = {2022-06-15},
  url = {https://dansblog.netlify.app/2022-06-03-that-psis-proof},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Tail Stabilization of Importance Sampling
Estimators: A Bit of Theory.”</span> June 15, 2022. <a href="https://dansblog.netlify.app/2022-06-03-that-psis-proof">https://dansblog.netlify.app/2022-06-03-that-psis-proof</a>.
</div></div></section></div> ]]></description>
  <category>Importance sampling</category>
  <category>Computation</category>
  <category>Truncated importance sampling</category>
  <category>Winsorized importance sampling</category>
  <category>Pareto smoothed importance sampling</category>
  <category>PSIS</category>
  <guid>https://dansblog.netlify.app/posts/2022-06-03-that-psis-proof/that-psis-proof.html</guid>
  <pubDate>Tue, 14 Jun 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-06-03-that-psis-proof/judy.JPG" medium="image"/>
</item>
<item>
  <title>Sparse matrices 6: To catch a derivative, first you’ve got to think like a derivative</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-05-20-to-catch-a-derivative-first-youve-got-to-think-like-a-derivative/to-catch-a-derivative-first-youve-got-to-think-like-a-derivative.html</link>
  <description><![CDATA[ 





<p>Welcome to part six!!! of our ongoing series on making sparse linear algebra differentiable in JAX with the eventual hope to be able to do some <a href="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/">cool statistical shit</a>. We are <em>nowhere near done</em>.</p>
<p><a href="https://dansblog.netlify.app/posts/2022-05-14-sparse4-some-primatives/">Last time</a>, we looked at making JAX primitives. We built four of them. Today we are going to implement the corresponding differentiation rules! For three<sup>1</sup> of them.</p>
<p>So strap yourselves in. This is gonna be detailed.</p>
<p>If you’re interested in the code<sup>2</sup>, the git repo for this post is linked at the bottom and in there you will find a folder with the python code in a python file.</p>
<section id="she-is-beauty-and-she-is-grace.-she-is-queen-of-50-states.-she-is-elegance-and-taste.-she-is-miss-autodiff" class="level2">
<h2 class="anchored" data-anchor-id="she-is-beauty-and-she-is-grace.-she-is-queen-of-50-states.-she-is-elegance-and-taste.-she-is-miss-autodiff">She is beauty and she is grace. She is queen of 50 states. She is elegance and taste. She is miss autodiff</h2>
<p>Derivatives are computed in JAX through the glory and power of automatic differentiation. If you came to this blog hoping for a great description of how autodiff works, I am terribly sorry but I absolutely do not have time for that. Might I suggest google? Or maybe flick through <a href="https://arxiv.org/abs/1811.05031">this survey by Charles Margossian</a>.</p>
<p>The most important thing to remember about algorithmic differentiation is that it is <em>not</em> symbolic differentiation. That is, it does not create the functional form of the derivative of the function and compute that. Instead, it is a system for cleverly composing derivatives in each bit of the program to compute the <em>value</em> of the derivative of the function.</p>
<p>But for that to work, we need to implement those clever little mini-derivatives. In particular, every function <img src="https://latex.codecogs.com/png.latex?f(%5Ccdot):%20%5Cmathbb%7BR%7D%5En%20%5Crightarrow%20%5Cmathbb%7BR%7D%5Em"> needs to have a function to compute the corresponding Jacobian-vector product <img src="https://latex.codecogs.com/png.latex?%0A(%5Ctheta,%20v)%20%5Crightarrow%20J(%5Ctheta)%20v,%0A"> where the <img src="https://latex.codecogs.com/png.latex?m%20%5Ctimes%20n"> matrix <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)"> has entries <img src="https://latex.codecogs.com/png.latex?%0AJ(%5Ctheta)_%7Bij%7D%20=%20%5Cfrac%7B%5Cpartial%20f_i%7D%7B%5Cpartial%20%5Ctheta_j%7D.%0A"></p>
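<p>If that definition feels abstract, a small finite-difference check makes it concrete. This is my own toy example, not JAX code: build the Jacobian of a tiny function by hand and check that J(θ)v matches a directional difference quotient.</p>

```python
import numpy as np

# Toy illustration (my code, not from the post): a Jacobian-vector product for
# f(theta) = (theta[0]**2, theta[0]*theta[1], sin(theta[1])), checked against
# the directional finite difference (f(theta + eps*v) - f(theta)) / eps.
def f(theta):
    return np.array([theta[0] ** 2, theta[0] * theta[1], np.sin(theta[1])])

def f_jvp(theta, v):
    # Rows are partial f_i / partial theta_j, so J is (m x n) = (3 x 2).
    J = np.array([
        [2 * theta[0], 0.0],
        [theta[1], theta[0]],
        [0.0, np.cos(theta[1])],
    ])
    return J @ v

theta = np.array([1.5, -0.3])
v = np.array([0.2, 0.7])
eps = 1e-6
fd = (f(theta + eps * v) - f(theta)) / eps
print(np.allclose(f_jvp(theta, v), fd, atol=1e-4))  # True
```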
<p>Ok. So let’s get onto this. We are going to derive and implement some Jacobian-vector products. And all of the assorted accoutrement. And by crikey. We are going to do it all in a JAX-traceable way.</p>
</section>
<section id="jvp-number-one-the-linear-solve." class="level2">
<h2 class="anchored" data-anchor-id="jvp-number-one-the-linear-solve.">JVP number one: The linear solve.</h2>
<p>The first of the derivatives that we need to work out is the derivative of a linear solve <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7Db">. Now, intrepid readers, the obvious thing to do is look the damn derivative up. You get exactly no hero points for computing it yourself.</p>
<p>But I’m not you, I’m a dickhead.</p>
<p>So I’m going to derive it. I could pretend there are reasons<sup>3</sup>, but that would just be lying. I’m doing it because I can.</p>
<p>Beyond the obvious fun of working out a matrix derivative from first principles, this is fun because we have <em>two</em> arguments instead of just one. Double the fun.</p>
<p>And we really should make sure the function is differentiable with respect to every reasonable argument. Why? Because if you write code other people might use, you don’t get to control how they use it (or what they will email you about). So it’s always good practice to limit surprises (like a function not being differentiable wrt some argument) to cases<sup>4</sup> where it is absolutely necessary. This reduces the emails.</p>
<p>To that end, let’s take an arbitrary SPD matrix <img src="https://latex.codecogs.com/png.latex?A"> with a <em>fixed</em> sparsity pattern. Let’s take another symmetric matrix <img src="https://latex.codecogs.com/png.latex?%5CDelta"> with <em>the same sparsity pattern</em> and assume that <img src="https://latex.codecogs.com/png.latex?%5CDelta"> is small enough<sup>5</sup> that <img src="https://latex.codecogs.com/png.latex?A%20+%20%5CDelta"> is still symmetric positive definite. We also need a vector <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> with a small <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Cdelta%5C%7C">.</p>
<p>Now let’s get algebraing. <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Af(A%20+%20%5CDelta,%20b%20+%20%5Cdelta)%20&amp;=%20(A+%5CDelta)%5E%7B-1%7D(b%20+%20%5Cdelta)%20%5C%5C%0A&amp;=%20(I%20+%20A%5E%7B-1%7D%5CDelta)%5E%7B-1%7DA%5E%7B-1%7D(b%20+%20%5Cdelta)%20%5C%5C%0A&amp;=%20(I%20-%20A%5E%7B-1%7D%5CDelta%20+%20o(%5C%7C%5CDelta%5C%7C))A%5E%7B-1%7D(b%20+%20%5Cdelta)%20%5C%5C%0A&amp;=%20A%5E%7B-1%7Db%20+%20A%5E%7B-1%7D(%5Cdelta%20-%20%5CDelta%20A%5E%7B-1%7Db%20)%20+%20o(%5C%7C%5CDelta%5C%7C%20+%20%5C%7C%5Cdelta%5C%7C)%0A%5Cend%7Balign*%7D"></p>
<p>Easy<sup>6</sup> as.</p>
<p>We’ve actually calculated the derivative now, but it’s a little more work to recognise it.</p>
<p>To do that, we need to remember the practical definition of the Jacobian of a function <img src="https://latex.codecogs.com/png.latex?f(x)"> that takes an <img src="https://latex.codecogs.com/png.latex?n">-dimensional input and produces an <img src="https://latex.codecogs.com/png.latex?m">-dimensional output. It is the <img src="https://latex.codecogs.com/png.latex?m%20%5Ctimes%20n"> matrix <img src="https://latex.codecogs.com/png.latex?J_f(x)"> such that <img src="https://latex.codecogs.com/png.latex?%0Af(x%20+%20%5Cdelta)%20%20=%20f(x)%20+%20J_f(x)%5Cdelta%20+%20o(%5C%7C%5Cdelta%5C%7C).%0A"></p>
<p>The formulas further simplify if we write <img src="https://latex.codecogs.com/png.latex?c%20=%20A%5E%7B-1%7Db">. Then, if we want the Jacobian-vector product for the first argument, it is <img src="https://latex.codecogs.com/png.latex?%0A-A%5E%7B-1%7D%5CDelta%20c,%0A"> while the Jacobian-vector product for the second argument is <img src="https://latex.codecogs.com/png.latex?%0AA%5E%7B-1%7D%5Cdelta.%0A"></p>
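<p>As a quick sanity check (this is a dense, illustrative sketch, not the post’s sparse machinery), the two Jacobian-vector products can be verified against finite differences:</p>

```python
# Dense finite-difference check of the two JVP formulas above.
# All names here are illustrative; the post itself works with sparse matrices.
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # an SPD matrix
b = rng.standard_normal(n)
c = np.linalg.solve(A, b)        # c = A^{-1} b

S = rng.standard_normal((n, n))
Delta = (S + S.T) / 2            # symmetric perturbation of A
delta = rng.standard_normal(n)   # perturbation of b

eps = 1e-6
# First argument: finite difference vs the predicted -A^{-1} (Delta c)
fd_A = (np.linalg.solve(A + eps * Delta, b) - c) / eps
assert np.allclose(fd_A, -np.linalg.solve(A, Delta @ c), atol=1e-4)

# Second argument: finite difference vs the predicted A^{-1} delta
fd_b = (np.linalg.solve(A, b + eps * delta) - c) / eps
assert np.allclose(fd_b, np.linalg.solve(A, delta), atol=1e-4)
```

The second check is exact up to floating point because the solve is linear in <code>b</code>; the first has an <img src="https://latex.codecogs.com/png.latex?O(%5Cepsilon)"> truncation error.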
<p>The only wrinkle in doing this is that we need to remember that we are only storing the lower triangle of <img src="https://latex.codecogs.com/png.latex?A">. Because we need to represent <img src="https://latex.codecogs.com/png.latex?%5CDelta"> the same way, it is represented as a vector <code>Delta_x</code> that contains only the lower triangle of <img src="https://latex.codecogs.com/png.latex?%5CDelta">. So we need to form the <em>whole</em> matrix before we do the matrix-vector product <img src="https://latex.codecogs.com/png.latex?%5CDelta%20c">!</p>
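<p>To make that concrete, here is a small helper (the helper and its names are my own, purely for illustration) that expands lower-triangular CSC storage of a symmetric matrix into the full matrix before the product:</p>

```python
import numpy as np
from scipy import sparse

def full_from_lower(indices, indptr, x, n):
    """Illustrative helper: expand lower-triangular CSC storage of a
    symmetric matrix into the full matrix as L + L^T - diag(L), so the
    upper triangle is filled in and the diagonal isn't double-counted."""
    L = sparse.csc_array((x, indices, indptr), shape=(n, n))
    diag = sparse.csc_array((L.diagonal(), (np.arange(n), np.arange(n))),
                            shape=(n, n))
    return L + L.T - diag

# The symmetric matrix [[2, 1], [1, 3]] stored as its lower triangle
L = sparse.csc_array(np.array([[2.0, 0.0], [1.0, 3.0]]))
Delta = full_from_lower(L.indices, L.indptr, L.data, 2)
c = np.array([1.0, 1.0])
# The product sees the whole matrix, upper triangle included
assert np.allclose(Delta @ c, [3.0, 4.0])
```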
<p>But otherwise, the implementation is going to be pretty straightforward. The Jacobian-vector product costs one additional linear solve (beyond the one needed to compute the value <img src="https://latex.codecogs.com/png.latex?c%20=%20A%5E%7B-1%7Db">).</p>
<p>In the language of JAX (and autodiff in general), we refer to <img src="https://latex.codecogs.com/png.latex?%5CDelta"> and <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> as <em>tangent vectors</em>. In search of a moderately coherent naming convention, we are going to refer to the tangent associated with the variable <code>x</code> as <code>xt</code>.</p>
<p>So let’s implement this. Remember: it needs<sup>7</sup> to be JAX traceable.</p>
</section>
<section id="primitive-two-the-triangular-solve" class="level2">
<h2 class="anchored" data-anchor-id="primitive-two-the-triangular-solve">Primitive two: The triangular solve</h2>
<p>For some sense of continuity, we are going to keep the naming of the primitives from the last blog post, but we are <em>not</em> going to attack them in the same order. Why not? Because we work in order of complexity.</p>
<p>So first off we are going to do the triangular solve. As I have yet to package up the code (I promise, that will happen next<sup>8</sup>), I’m just putting it here under the fold.</p>
<details>
<summary>
The primal implementation
</summary>
<div class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sparse</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jnp</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> core</span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax._src <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> abstract_arrays</span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> core</span>
<span id="cb1-7"></span>
<span id="cb1-8">sparse_triangular_solve_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_triangular_solve"</span>)</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve(L_indices, L_indptr, L_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">bool</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb1-11">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""A JAX traceable sparse  triangular solve"""</span></span>
<span id="cb1-12">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve_p.bind(L_indices, L_indptr, L_x, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transpose)</span>
<span id="cb1-13"></span>
<span id="cb1-14"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_triangular_solve_p.def_impl</span></span>
<span id="cb1-15"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve_impl(L_indices, L_indptr, L_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb1-16">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""The implementation of the sparse triangular solve. This is not JAX traceable."""</span></span>
<span id="cb1-17">  L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.csc_array((L_x, L_indices, L_indptr)) </span>
<span id="cb1-18">  </span>
<span id="cb1-19">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> L.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb1-20">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> L.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb1-21">  </span>
<span id="cb1-22">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> transpose:</span>
<span id="cb1-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse.linalg.spsolve_triangular(L.T, b, lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb1-24">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb1-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse.linalg.spsolve_triangular(L.tocsr(), b, lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-26"></span>
<span id="cb1-27"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_triangular_solve_p.def_abstract_eval</span></span>
<span id="cb1-28"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve_abstract_eval(L_indices, L_indptr, L_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb1-29">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> L_indices.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_x.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb1-30">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-31">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> abstract_arrays.ShapedArray(b.shape, b.dtype)</span></code></pre></div>
</div>
</details>
<section id="the-jacobian-vector-product" class="level3">
<h3 class="anchored" data-anchor-id="the-jacobian-vector-product">The Jacobian-vector product</h3>
<div class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax._src <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ad_util</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax.interpreters <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ad</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> lax</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax.experimental <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sparse <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jsparse</span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve_value_and_jvp(arg_values, arg_tangent, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose):</span>
<span id="cb2-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  A jax-traceable jacobian-vector product. In order to make it traceable, </span></span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  we use the experimental sparse CSC matrix in JAX.</span></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  </span></span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  Input:</span></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    arg_values:   A tuple of (L_indices, L_indptr, L_x, b) that describe</span></span>
<span id="cb2-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">                  the triangular matrix L and the rhs vector b</span></span>
<span id="cb2-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    arg_tangent:  A tuple of tangent values (same length as arg_values).</span></span>
<span id="cb2-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">                  The first two values are nonsense - we don't differentiate</span></span>
<span id="cb2-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">                  wrt integers!</span></span>
<span id="cb2-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    transpose:    (boolean) If true, solve L^Tx = b. Otherwise solve Lx = b.</span></span>
<span id="cb2-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  Output:         A tuple containing the maybe_transpose(L)^{-1}b and the corresponding</span></span>
<span id="cb2-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">                  Jacobian-vector product.</span></span>
<span id="cb2-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  """</span></span>
<span id="cb2-21">  L_indices, L_indptr, L_x, b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> arg_values</span>
<span id="cb2-22">  _, _, L_xt, bt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> arg_tangent</span>
<span id="cb2-23">  value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, b, transpose<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>transpose)</span>
<span id="cb2-24">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(bt) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> ad.Zero <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(L_xt) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> ad.Zero:</span>
<span id="cb2-25">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># I legit do not think this ever happens. But I'm honestly not sure.</span></span>
<span id="cb2-26">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I have arrived!"</span>)</span>
<span id="cb2-27">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> value, lax.zeros_like_array(value) </span>
<span id="cb2-28">  </span>
<span id="cb2-29">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(L_xt) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> ad.Zero:</span>
<span id="cb2-30">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L is variable</span></span>
<span id="cb2-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> transpose:</span>
<span id="cb2-32">      Delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jsparse.CSC((L_xt, L_indices, L_indptr), shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])).transpose()</span>
<span id="cb2-33">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb2-34">      Delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jsparse.CSC((L_xt, L_indices, L_indptr), shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span>
<span id="cb2-35"></span>
<span id="cb2-36">    jvp_Lx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, Delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> value, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transpose) </span>
<span id="cb2-37">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb2-38">    jvp_Lx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.zeros_like_array(value) </span>
<span id="cb2-39"></span>
<span id="cb2-40">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(bt) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> ad.Zero:</span>
<span id="cb2-41">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># b is variable</span></span>
<span id="cb2-42">    jvp_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, bt, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transpose)</span>
<span id="cb2-43">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb2-44">    jvp_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.zeros_like_array(value)</span>
<span id="cb2-45"></span>
<span id="cb2-46">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> value, jvp_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp_Lx</span>
<span id="cb2-47"></span>
<span id="cb2-48">ad.primitive_jvps[sparse_triangular_solve_p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve_value_and_jvp</span></code></pre></div>
</div>
<p>Before we see if this works, let’s first talk about the structure of the function I just wrote. Generally speaking, we want a function that takes in the primals and tangents as tuples and then returns the value and the<sup>9</sup> Jacobian-vector product.</p>
<p>The main thing you will notice in the code is that there is <em>a lot</em> of checking for <code>ad.Zero</code>. This is a special type defined in JAX that is, essentially, telling the autodiff system that we are not differentiating wrt that variable. This is different to a tangent that just happens to be numerically equal to zero. Any code for a Jacobian-vector product needs to handle this special value.</p>
<p>As we have two arguments, there are three interesting cases:</p>
<ol type="1">
<li><p>Both <code>L_xt</code> and <code>bt</code> are <code>ad.Zero</code>: This means the function is a constant and the derivative is zero. I am fairly certain that we do not need to manually handle this case, but because I don’t know and I do not like surprises, it’s in there.</p></li>
<li><p><code>L_xt</code> is <em>not</em> <code>ad.Zero</code>: This means that we need to differentiate wrt the matrix. In this case we need to compute <img src="https://latex.codecogs.com/png.latex?%5CDelta%20c"> or <img src="https://latex.codecogs.com/png.latex?%5CDelta%5ET%20c">, depending on the <code>transpose</code> argument. In order to do this, I used the <code>jax.experimental.sparse.CSC</code> class, which has some very limited sparse matrix support (basically matrix-vector products). This is <em>extremely</em> convenient because it means I don’t need to write the matrix-vector product myself!</p></li>
<li><p><code>bt</code> is <em>not</em> <code>ad.Zero</code>: This means that we need to differentiate wrt the rhs vector. This part of the formula is pretty straightforward: just an application of the primal.</p></li>
</ol>
<p>In the case that either <code>L_xt</code> or <code>bt</code> is <code>ad.Zero</code>, we simply set the corresponding contribution to the JVP to zero.</p>
<p>It’s worth saying that you can bypass all of this <code>ad.Zero</code> logic by writing separate functions for the JVP contribution from each input and then using<sup>10</sup> <code>ad.defjvp2()</code> to <a href="https://github.com/google/jax/blob/41417d70c03b6089c93a42325111a0d8348c2fa3/jax/_src/lax/linalg.py#L791">chain them together</a>. This is what the <code>lax.linalg.triangular_solve()</code> implementation does.</p>
<p>So why didn’t I do this? I avoided this because in the other primitives I have to implement, there are expensive computations (like Cholesky factorisations) that I want to share between the primal and the various tangent calculations. The <code>ad.defjvp</code> frameworks don’t allow for that. So I decided not to demonstrate/learn two separate patterns.</p>
</section>
<section id="transposition" class="level3">
<h3 class="anchored" data-anchor-id="transposition">Transposition</h3>
<p>Now I’ve never actively wanted a Jacobian-vector product in my whole life. I’m sorry. I want a gradient. Gimme a gradient. I am the Veruca Salt of gradients.</p>
<p>In many autodiff systems, if you want<sup>11</sup> a gradient, you need to implement vector-Jacobian products<sup>12</sup> explicitly.</p>
<p>One of the odder little innovations in JAX is that instead of forcing you to implement this as well<sup>13</sup>, you only need to implement half of it.</p>
<p>You see, some clever analysis that, as far as I can tell<sup>14</sup>, is detailed in <a href="https://arxiv.org/abs/2204.10923">this paper</a> shows that you only need to form explicit vector-Jacobian products for the structurally linear arguments of the function.</p>
<p>In JAX (and maybe elsewhere), this is known as a <em>transposition rule</em>. The combination of a transposition rule and a JAX-traceable Jacobian-vector product is enough for JAX to compute all of the directional derivatives and gradients we could ever hope for.</p>
<p>As far as I understand, it is all about functions that are <em>structurally linear</em> in some arguments. For instance, if <img src="https://latex.codecogs.com/png.latex?A(x)"> is a matrix-valued function and <img src="https://latex.codecogs.com/png.latex?x"> and <img src="https://latex.codecogs.com/png.latex?y"> are vectors, then the function <img src="https://latex.codecogs.com/png.latex?%0Af(x,%20y)%20=%20A(x)y%20+%20g(x)%0A"> is structurally linear in <img src="https://latex.codecogs.com/png.latex?y"> in the sense that for every fixed value of <img src="https://latex.codecogs.com/png.latex?x">, the function <img src="https://latex.codecogs.com/png.latex?%0Af_x(y)%20=%20A(x)%20y%20+%20g(x)%0A"> is linear in <img src="https://latex.codecogs.com/png.latex?y">. The resulting transposition rule is then</p>
<div class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f_transpose(x, y):</span>
<span id="cb3-2">  Ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A(x)</span>
<span id="cb3-3">  gx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> g(x)</span>
<span id="cb3-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, Ax.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gx)</span></code></pre></div>
</div>
<p>The first element of the return is <code>None</code> because <img src="https://latex.codecogs.com/png.latex?f(x,y)"> is not<sup>15</sup> structurally linear in <img src="https://latex.codecogs.com/png.latex?x"> so there is nothing to transpose. The second element simply takes the matrix in the linear function and transposes it.</p>
<p>If you know anything about autodiff, you’ll think “this doesn’t <em>feel</em> like enough” and it’s not. JAX deals with the non-linear part of <img src="https://latex.codecogs.com/png.latex?f(x,y)"> by tracing the evaluation tree for its Jacobian-vector product and … manipulating<sup>16</sup> it.</p>
<p>We already built the abstract evaluation function last time around, so the tracing part can be done. All we need is the transposition rule.</p>
<p>The linear solve <img src="https://latex.codecogs.com/png.latex?f(A,%20b)%20=%20A%5E%7B-1%7Db"> is non-linear in the first argument but linear in the second argument. So we only need to implement <img src="https://latex.codecogs.com/png.latex?%0AJ%5ET_b(A,b)w%20=%20A%5E%7B-T%7Dw,%0A"> where the subscript <img src="https://latex.codecogs.com/png.latex?b"> indicates we’re only computing the Jacobian wrt <img src="https://latex.codecogs.com/png.latex?b">.</p>
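<p>Before looking at the code, it might help to see the adjoint identity this rule encodes, checked numerically with small dense matrices (illustrative only):</p>

```python
# The b-transpose rule encodes the adjoint identity
#   <w, A^{-1} delta> = <A^{-T} w, delta>   for all delta,
# i.e. pairing the output cotangent w with the (linear-in-b) JVP is the
# same as pairing delta with a solve against the transposed matrix.
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)
delta = rng.standard_normal(n)   # input tangent for b
w = rng.standard_normal(n)       # output cotangent

lhs = w @ np.linalg.solve(A, delta)
rhs = np.linalg.solve(A.T, w) @ delta
assert np.allclose(lhs, rhs)
```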
<p>Initially, I struggled to work out what needed to be implemented here. The thing that clarified the process for me was looking at JAX’s <a href="https://github.com/google/jax/blob/41417d70c03b6089c93a42325111a0d8348c2fa3/jax/_src/lax/linalg.py#L747">internal implementation</a> of the Jacobian-vector product for a dense matrix. From there, I understood what this had to look like for a vector-valued function and this is the result.</p>
<div class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve_transpose_rule(cotangent, L_indices, L_indptr, L_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose):</span>
<span id="cb4-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  Transposition rule for the triangular solve. </span></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  Translated from here https://github.com/google/jax/blob/41417d70c03b6089c93a42325111a0d8348c2fa3/jax/_src/lax/linalg.py#L747.</span></span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  Inputs:</span></span>
<span id="cb4-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    cotangent: Output cotangent (aka adjoint). (produced by JAX)</span></span>
<span id="cb4-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    L_indices, L_indptr, L_x: Representation of sparse matrix. L_x should be concrete</span></span>
<span id="cb4-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    b: The right hand side. Must be a jax.interpreters.ad.UndefinedPrimal</span></span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    transpose: (boolean) True: solve $L^Tx = b$. False: Solve $Lx = b$.</span></span>
<span id="cb4-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  Output:</span></span>
<span id="cb4-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    A 4-tuple with the adjoints (None, None, None, b_adjoint)</span></span>
<span id="cb4-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  """</span></span>
<span id="cb4-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> ad.is_undefined_primal(L_x) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> ad.is_undefined_primal(b)</span>
<span id="cb4-14">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(cotangent) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> ad_util.Zero:</span>
<span id="cb4-15">    cot_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ad_util.Zero(b.aval)</span>
<span id="cb4-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb4-17">    cot_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, cotangent, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> transpose)</span>
<span id="cb4-18">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, cot_b</span>
<span id="cb4-19"></span>
<span id="cb4-20">ad.primitive_transposes[sparse_triangular_solve_p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve_transpose_rule</span></code></pre></div>
</div>
<p>If this doesn’t make a lot of sense to you, that’s because it’s confusing.</p>
<p>One way to think of it is in terms of the more ordinary notation. Mike Giles has <a href="https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf">a classic paper</a> that covers these results for basic linear algebra. The idea is to imagine that, as part of your larger program, you need to compute <img src="https://latex.codecogs.com/png.latex?c%20=%20A%5E%7B-1%7Db">.</p>
<p>Forward-mode autodiff computes the <em>sensitivity</em> of <img src="https://latex.codecogs.com/png.latex?c">, usually denoted <img src="https://latex.codecogs.com/png.latex?%5Cdot%20c">, from the sensitivities <img src="https://latex.codecogs.com/png.latex?%5Cdot%20A"> and <img src="https://latex.codecogs.com/png.latex?%5Cdot%20b">, which have already been computed. The formula in Giles is <img src="https://latex.codecogs.com/png.latex?%0A%5Cdot%20c%20=%20A%5E%7B-1%7D(%5Cdot%20b%20-%20%5Cdot%20A%20c).%0A"> The canny reader will recognise this as exactly<sup>17</sup> the formula for the Jacobian-vector product.</p>
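<p>If you want to see that formula in action, here is a small self-contained numpy sketch (mine, not from Giles or from the implementation above, and using a dense matrix rather than the sparse solver) that checks it against a forward-difference approximation.</p>
<pre class="sourceCode python"><code>import numpy as np

# Illustrative check of Giles' forward-mode formula for c = A^{-1} b:
#   c_dot = A^{-1} (b_dot - A_dot c)
rng = np.random.default_rng(12345)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)  # comfortably invertible
b = rng.standard_normal(n)
A_dot = rng.standard_normal((n, n))              # sensitivity of A
b_dot = rng.standard_normal(n)                   # sensitivity of b

c = np.linalg.solve(A, b)
c_dot = np.linalg.solve(A, b_dot - A_dot @ c)    # Giles' formula

# Forward-difference approximation along the (A_dot, b_dot) direction
eps = 1e-6
c_dot_fd = (np.linalg.solve(A + eps * A_dot, b + eps * b_dot) - c) / eps
print(np.max(np.abs(c_dot - c_dot_fd)))          # tiny</code></pre>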
<p>So what does reverse-mode autodiff do? Well it moves through the program in the other direction. So instead of starting with the sensitivities <img src="https://latex.codecogs.com/png.latex?%5Cdot%20A"> and <img src="https://latex.codecogs.com/png.latex?%5Cdot%20b"> already computed, we instead start with the<sup>18</sup> <em>adjoint sensitivity</em> <img src="https://latex.codecogs.com/png.latex?%5Cbar%20c">. Our aim is to compute <img src="https://latex.codecogs.com/png.latex?%5Cbar%20A"> and <img src="https://latex.codecogs.com/png.latex?%5Cbar%20b"> from <img src="https://latex.codecogs.com/png.latex?%5Cbar%20c">.</p>
<p>The details of how to do this are<sup>19</sup> <em>beyond the scope</em>, but without tooooooo much effort you can show that <img src="https://latex.codecogs.com/png.latex?%0A%5Cbar%20b%20=%20A%5E%7B-T%7D%20%5Cbar%20c,%0A"> which you should recognise as the equation that was just implemented.</p>
<p>The thing that we <em>do not</em> have to implement in JAX is the other adjoint that, for dense matrices<sup>20</sup>, is <img src="https://latex.codecogs.com/png.latex?%0A%5Cbar%7BA%7D%20=%20-%5Cbar%7Bb%7Dc%5ET.%0A"> Through the healing power of … something? (Truly, I do not know.) JAX can work that bit out itself. Woo.</p>
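<p>If you would rather not take my word for either adjoint formula, you can sanity-check both without any autodiff at all by finite-differencing the scalar <code>c_bar @ c</code>. This dense numpy sketch is mine (illustrative names, not part of the implementation).</p>
<pre class="sourceCode python"><code>import numpy as np

# Check the reverse-mode adjoints for c = A^{-1} b:
#   b_bar = A^{-T} c_bar    and    A_bar = -b_bar c^T
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
c_bar = rng.standard_normal(n)                 # incoming adjoint of c

c = np.linalg.solve(A, b)
b_bar = np.linalg.solve(A.T, c_bar)            # claimed adjoint of b
A_bar = -np.outer(b_bar, c)                    # claimed adjoint of A

# Finite differences of phi(A, b) = c_bar . (A^{-1} b)
eps = 1e-6
phi = c_bar @ c
b_bar_fd = np.array([(c_bar @ np.linalg.solve(A, b + eps * e) - phi) / eps
                     for e in np.eye(n)])
E = np.zeros((n, n))
E[1, 2] = 1.0                                  # perturb a single entry of A
A_bar_fd = (c_bar @ np.linalg.solve(A + eps * E, b) - phi) / eps
print(np.max(np.abs(b_bar - b_bar_fd)), abs(A_bar[1, 2] - A_bar_fd))</code></pre>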
</section>
<section id="testing-the-numerical-implementation-of-the-jacobian-vector-product" class="level3">
<h3 class="anchored" data-anchor-id="testing-the-numerical-implementation-of-the-jacobian-vector-product">Testing the numerical implementation of the Jacobian-vector product</h3>
<p>So let’s see if this works. I’m not going to lie: I’m flying by the seat of my pants here. I’m not super familiar with the JAX internals, so I have written a lot of test cases. You may wish to skip this part. But rest assured that almost every single one of these cases was useful to me in working out how this thing actually works!</p>
<div class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> make_matrix(n):</span>
<span id="cb5-2">    one_d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.diags([[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n, [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)], [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb5-3">    A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse.kronsum(one_d, one_d) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> sparse.eye(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n)).tocsc()</span>
<span id="cb5-4">    A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.tril(A, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"csc"</span>)</span>
<span id="cb5-5">    A_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indices</span>
<span id="cb5-6">    A_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indptr</span>
<span id="cb5-7">    A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.data</span>
<span id="cb5-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_index, A_indptr, A_x, A)</span>
<span id="cb5-9"></span>
<span id="cb5-10">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span></code></pre></div>
</div>
<p>This is the same test case as the last blog post. We will just use the lower triangle of <img src="https://latex.codecogs.com/png.latex?A"> as the test matrix.</p>
<p>First things first, let’s check out the numerical implementation of the function. We will do that by comparing the implemented Jacobian-vector product with the <em>definition</em> of the Jacobian-vector product (aka the forward<sup>21</sup> difference approximation).</p>
<p>There are lots of things that we could do here to turn these into <em>actual</em> tests. For instance, the test suite inside JAX has a lot of nice convenience functions for checking implementations of derivatives. But I went with homespun because that was how I was feeling.</p>
<p>You’ll also notice that I’m using random numbers here, which is fine for a blog. Not so fine for a test that you don’t want to be potentially<sup>22</sup> flaky.</p>
<p>The choice of <code>eps = 1e-4</code> is roughly<sup>23</sup> because it’s the square root of the single-precision machine epsilon<sup>24</sup>. A very rough back-of-the-envelope calculation for the forward difference approximation to the derivative shows that the square root of the machine epsilon is about the size you want your perturbation to be.</p>
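<p>For the record, the relevant square roots are easy to check. (This snippet is mine; the exact constant doesn’t matter much, just its order of magnitude.)</p>
<pre class="sourceCode python"><code>import numpy as np

# Forward differences have O(eps) truncation error plus O(machine_eps / eps)
# roundoff error; balancing the two gives eps around sqrt(machine_eps).
print(np.sqrt(np.finfo(np.float32).eps))   # roughly 3.45e-04, hence eps = 1e-4
print(np.sqrt(np.finfo(np.float64).eps))   # roughly 1.49e-08</code></pre>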
<div class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1">b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.standard_normal(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb6-2"></span>
<span id="cb6-3">bt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.standard_normal(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb6-4">bt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span> np.linalg.norm(bt)</span>
<span id="cb6-5"></span>
<span id="cb6-6">A_xt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.standard_normal(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_x))</span>
<span id="cb6-7">A_xt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span> np.linalg.norm(A_xt)</span>
<span id="cb6-8"></span>
<span id="cb6-9">arg_values <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (A_indices, A_indptr, A_x, b )</span>
<span id="cb6-10"></span>
<span id="cb6-11">arg_tangent_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, A_xt, ad.Zero(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(b)))</span>
<span id="cb6-12">arg_tangent_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, ad.Zero(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(A_xt)), bt)</span>
<span id="cb6-13">arg_tangent_Ab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, A_xt, bt)</span>
<span id="cb6-14"></span>
<span id="cb6-15">p, t_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve_value_and_jvp(arg_values, arg_tangent_A, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb6-16">_, t_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve_value_and_jvp(arg_values, arg_tangent_b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb6-17">_, t_Ab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve_value_and_jvp(arg_values, arg_tangent_Ab, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb6-18">pT, t_AT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve_value_and_jvp(arg_values, arg_tangent_A, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-19">_, t_bT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve_value_and_jvp(arg_values, arg_tangent_b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-20"></span>
<span id="cb6-21">eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-4</span></span>
<span id="cb6-22">tt_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse_triangular_solve(A_indices, A_indptr, A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_xt, b) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> p) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>eps</span>
<span id="cb6-23">tt_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse_triangular_solve(A_indices, A_indptr, A_x, b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> bt) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> p) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> eps</span>
<span id="cb6-24">tt_Ab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse_triangular_solve(A_indices, A_indptr, A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_xt, b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> bt) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> p) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> eps</span>
<span id="cb6-25">tt_AT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse_triangular_solve(A_indices, A_indptr, A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_xt, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pT) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> eps</span>
<span id="cb6-26">tt_bT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse_triangular_solve(A_indices, A_indptr, A_x, b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> bt, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pT) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> eps</span>
<span id="cb6-27"></span>
<span id="cb6-28"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb6-29"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Transpose = False:</span></span>
<span id="cb6-30"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Error A varying: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(t_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> tt_A)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb6-31"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Error b varying: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(t_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> tt_b)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb6-32"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Error A and b varying: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(t_Ab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> tt_Ab)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb6-33"></span>
<span id="cb6-34"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Transpose = True:</span></span>
<span id="cb6-35"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Error A varying: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(t_AT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> tt_AT)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb6-36"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Error b varying: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(t_bT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> tt_bT)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb6-37"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Transpose = False:
  Error A varying:  1.08e-07
  Error b varying:  0.00e+00
  Error A and b varying:  4.19e-07

Transpose = True:
  Error A varying:  1.15e-07
  Error b varying:  0.00e+00
</code></pre>
</div>
</div>
<p>Brilliant! Everything correct within single precision!</p>
</section>
<section id="checking-on-the-plumbing" class="level3">
<h3 class="anchored" data-anchor-id="checking-on-the-plumbing">Checking on the plumbing</h3>
<p>Making the numerical implementation work is only half the battle. We also have to make it work <em>in the context of JAX</em>.</p>
<p>Now I would be lying if I pretended this process went smoothly. But the first time is for experience. It’s mostly a matter of just reading the documentation carefully and going through similar examples that have already been implemented.</p>
<p>And testing. I learnt how this was supposed to work by testing it.</p>
<p>(For full disclosure, I also wrote a big block f-string in the <code>sparse_triangular_solve()</code> function at one point that told me the types, shapes, and what <code>transpose</code> was, which was how I worked out that my code was breaking because I forgot the first two <code>None</code> outputs in the transposition rule. When in doubt, print shit.)</p>
<p>As you will see from my testing code, I was not going for elegance. I was running the damn permutations. If you’re looking for elegance, look elsewhere.</p>
<div class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> jvp, grad</span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jsp</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f(theta):</span>
<span id="cb8-5">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_x)</span>
<span id="cb8-6">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[A_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>]].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb8-7">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[A_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>]].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb8-8">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb8-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve(A_indices, A_indptr, Ax_theta, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb8-10"></span>
<span id="cb8-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f_jax(theta):</span>
<span id="cb8-12">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(sparse.tril(A).todense())</span>
<span id="cb8-13">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb8-14">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb8-15">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb8-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jsp.linalg.solve_triangular(Ax_theta, b, lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, trans <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"T"</span>)</span>
<span id="cb8-17"></span>
<span id="cb8-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> g(theta):</span>
<span id="cb8-19">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_x)</span>
<span id="cb8-20">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb8-21">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb8-22">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb8-23">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve(A_indices, A_indptr, Ax_theta, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb8-24"></span>
<span id="cb8-25"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> g_jax(theta):</span>
<span id="cb8-26">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(sparse.tril(A).todense())</span>
<span id="cb8-27">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb8-28">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb8-29">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb8-30">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jsp.linalg.solve_triangular(Ax_theta, b, lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, trans <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"T"</span>)</span>
<span id="cb8-31"></span>
<span id="cb8-32"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> h(theta):</span>
<span id="cb8-33">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_x)</span>
<span id="cb8-34">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[A_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>]].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]) </span>
<span id="cb8-35">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb8-36">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb8-37">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve(A_indices, A_indptr, Ax_theta, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb8-38"></span>
<span id="cb8-39"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> h_jax(theta):</span>
<span id="cb8-40">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(sparse.tril(A).todense())</span>
<span id="cb8-41">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb8-42">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb8-43">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb8-44">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jsp.linalg.solve_triangular(Ax_theta, b, lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, trans <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"N"</span>)</span>
<span id="cb8-45"></span>
<span id="cb8-46"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> no_diff(theta):</span>
<span id="cb8-47">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve(A_indices, A_indptr, A_x, jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>), transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb8-48"></span>
<span id="cb8-49"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> no_diff_jax(theta):</span>
<span id="cb8-50">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jsp.linalg.solve_triangular(jnp.array(sparse.tril(A).todense()), jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>), lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, trans <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"N"</span>)</span>
<span id="cb8-51"></span>
<span id="cb8-52">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb8-53">primal1, jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-54">primal2, jvp2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f_jax, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-55">grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(f(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb8-56">grad2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(f_jax(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb8-57"></span>
<span id="cb8-58">primal3, jvp3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(g, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-59">primal4, jvp4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(g_jax, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-60">grad3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(g(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb8-61">grad4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(g_jax(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))  </span>
<span id="cb8-62"></span>
<span id="cb8-63">primal5, jvp5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(h, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-64">primal6, jvp6 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(h_jax, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-65">grad5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(h(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb8-66">grad6 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(h_jax(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb8-67"></span>
<span id="cb8-68">primal7, jvp7 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(no_diff, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-69">primal8, jvp8 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(no_diff_jax, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb8-70">grad7 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(no_diff(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb8-71">grad8 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(no_diff_jax(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb8-72"></span>
<span id="cb8-73"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb8-74"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable L:</span></span>
<span id="cb8-75"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-76"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-77"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-78"></span>
<span id="cb8-79"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable b:</span></span>
<span id="cb8-80"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal4)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-81"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp4)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-82"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad4)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span></span>
<span id="cb8-83"></span>
<span id="cb8-84"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable L and b:</span></span>
<span id="cb8-85"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal6)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-86"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp6)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-87"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad6)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb8-88"></span>
<span id="cb8-89"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">No diff:</span></span>
<span id="cb8-90"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal7 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal8)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-91"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp7 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp8)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-92"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad7 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad8)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-93"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Variable L:
  Primal difference:  1.98e-07
  JVP difference:  2.58e-12
  Gradient difference:  0.00e+00

Variable b:
  Primal difference:  7.94e-06
  JVP difference:  1.83e-08
  Gradient difference:  3.29e-10 

Variable L and b:
  Primal difference:  2.08e-06
  JVP difference:  1.08e-08
  Gradient difference:  2.33e-10

No diff:
  Primal difference: 2.2101993124579167e-07
  JVP difference: 0.0
  Gradient difference: 0.0
</code></pre>
</div>
</div>
<p>Stunning!</p>
</section>
</section>
<section id="primitive-one-the-general-a-1b" class="level2">
<h2 class="anchored" data-anchor-id="primitive-one-the-general-a-1b">Primitive one: The general <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7Db"></h2>
<p>Ok. So this is a very similar problem to the one that we just solved. But, as fate would have it, the solution is going to look quite different. Why? Because we need to compute a Cholesky factorisation.</p>
<p>First things first, though, we are going to need a JAX-traceable way to compute a Cholesky factor. This means that we need<sup>25</sup> to tell our <code>sparse_solve</code> function how many non-zeros the sparse Cholesky factor will have. Why? Well, it has to do with how the function is used.</p>
<p>When <code>sparse_cholesky()</code> is called with concrete inputs<sup>26</sup>, it can quite happily work out the sparsity structure of <img src="https://latex.codecogs.com/png.latex?L">. But when JAX is preparing to transform the code, e.g.&nbsp;when it’s building a gradient, it calls <code>sparse_cholesky()</code> with abstract arguments that carry only the shape information of the inputs. This is <em>not</em> enough to compute the sparsity structure. We <em>need</em> the <code>indices</code> and <code>indptr</code> arrays.</p>
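<p>To see the concrete-vs-abstract distinction in isolation, here is a minimal, self-contained sketch (not from this post’s code; the function name is made up) of a function whose output size depends on an input <em>value</em>. Called directly it works, but under <code>jax.jit</code> the function only receives an abstract tracer, so the value-dependent size cannot be computed:</p>

```python
import jax
import jax.numpy as jnp

def value_dependent_size(x):
    # int() needs a concrete value; an abstract tracer cannot supply one.
    return jnp.zeros(int(x.sum()))

# Concrete call: x.sum() is an actual number, so this works.
assert value_dependent_size(jnp.array([1.0, 2.0])).shape == (3,)

# Traced call: jit hands the function an abstract tracer, so int() fails.
try:
    jax.jit(value_dependent_size)(jnp.array([1.0, 2.0]))
    raised = False
except jax.errors.JAXTypeError:
    raised = True
assert raised
```

This is exactly why <code>L_nse</code> has to be passed in explicitly: the number of non-zeros in the factor is a value-dependent quantity that an abstract trace cannot recover.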
<p>This means that we need <code>sparse_cholesky()</code> to throw an error if <code>L_nse</code> isn’t passed. This wasn’t implemented well last time, so here it is done properly.</p>
<p>(If you’re wondering about that <code>None</code> argument, it is the identity transform. So if <code>A_indices</code> is a concrete value, <code>ind = A_indices</code>. Otherwise an error is raised.)</p>
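<p>As a rough illustration (a hypothetical stand-in, not JAX’s actual implementation), <code>concrete_or_error(transform, val, msg)</code> behaves something like the following: concrete values pass through <code>transform</code> (with <code>None</code> meaning the identity), while abstract tracers trigger the error message:</p>

```python
class FakeTracer:
    """Hypothetical stand-in for a JAX abstract tracer."""
    pass

def concrete_or_error_sketch(transform, val, err_msg):
    # Sketch of jax.core.concrete_or_error's behaviour: abstract values
    # raise; concrete values pass through `transform` (None = identity).
    if isinstance(val, FakeTracer):
        raise ValueError(err_msg)
    return val if transform is None else transform(val)

# Concrete value with the identity (None) transform: returned as-is.
assert concrete_or_error_sketch(None, [0, 2, 5], "need L_nse") == [0, 2, 5]

# Abstract value: the error string is what the caller sees.
try:
    concrete_or_error_sketch(None, FakeTracer(), "need L_nse")
    raised = False
except ValueError:
    raised = True
assert raised
```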
<div class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb10-1">sparse_cholesky_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_cholesky"</span>)</span>
<span id="cb10-2"></span>
<span id="cb10-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_cholesky(A_indices, A_indptr, A_x, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb10-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""A JAX traceable sparse cholesky decomposition"""</span></span>
<span id="cb10-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> L_nse <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb10-6">    err_string <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You need to pass a value to L_nse when doing fancy sparse_cholesky."</span></span>
<span id="cb10-7">    ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.concrete_or_error(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, A_indices, err_string)</span>
<span id="cb10-8">    ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.concrete_or_error(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, A_indptr, err_string)</span>
<span id="cb10-9">    L_ind, _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(ind, ptr)</span>
<span id="cb10-10">    L_nse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_ind)</span>
<span id="cb10-11">  </span>
<span id="cb10-12">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_cholesky_p.bind(A_indices, A_indptr, A_x, L_nse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_nse)</span></code></pre></div>
</div>
<details>
<summary>
The rest of the Cholesky code
</summary>
<div class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_cholesky_p.def_impl</span></span>
<span id="cb11-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_cholesky_impl(A_indices, A_indptr, A_x, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse):</span>
<span id="cb11-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""The implementation of the sparse Cholesky. This is not JAX traceable."""</span></span>
<span id="cb11-4">  </span>
<span id="cb11-5">  L_indices, L_indptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb11-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> L_nse <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb11-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indices) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_nse</span>
<span id="cb11-8">    </span>
<span id="cb11-9">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _structured_copy(A_indices, A_indptr, A_x, L_indices, L_indptr)</span>
<span id="cb11-10">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _sparse_cholesky_impl(L_indices, L_indptr, L_x)</span>
<span id="cb11-11">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr, L_x</span>
<span id="cb11-12"></span>
<span id="cb11-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _symbolic_factor(A_indices, A_indptr):</span>
<span id="cb11-14">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assumes A_indices and A_indptr index the lower triangle of $A$ ONLY.</span></span>
<span id="cb11-15">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb11-16">  L_sym <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb11-17">  children <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb11-18">  </span>
<span id="cb11-19">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb11-20">    L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[A_indptr[j]:A_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb11-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> children[j]:</span>
<span id="cb11-22">      tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[child][L_sym[child] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> j]</span>
<span id="cb11-23">      L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.unique(np.append(L_sym[j], tmp))</span>
<span id="cb11-24">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_sym[j]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb11-25">      p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[j][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb11-26">      children[p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.append(children[p], j)</span>
<span id="cb11-27">        </span>
<span id="cb11-28">  L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb11-29">  L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum([<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> L_sym])</span>
<span id="cb11-30">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate(L_sym)</span>
<span id="cb11-31">  </span>
<span id="cb11-32">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr</span>
<span id="cb11-33"></span>
<span id="cb11-34"></span>
<span id="cb11-35"></span>
<span id="cb11-36"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _structured_copy(A_indices, A_indptr, A_x, L_indices, L_indptr):</span>
<span id="cb11-37">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb11-38">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indices))</span>
<span id="cb11-39">  </span>
<span id="cb11-40">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n):</span>
<span id="cb11-41">    copy_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(L_indices[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],</span>
<span id="cb11-42">                                  A_indices[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb11-43">    L_x[L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> copy_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_x[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb11-44">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span>
<span id="cb11-45"></span>
<span id="cb11-46"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _sparse_cholesky_impl(L_indices, L_indptr, L_x):</span>
<span id="cb11-47">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb11-48">  descendant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [[] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n)]</span>
<span id="cb11-49">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n):</span>
<span id="cb11-50">    tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb11-51">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> bebe <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> descendant[j]:</span>
<span id="cb11-52">      k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bebe[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb11-53">      Ljk<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x[bebe[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb11-54">      pad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(                                                       <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-55">          L_indices[L_indptr[k]:L_indptr[k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices[L_indptr[j]])[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb11-56">      update_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(                                        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-57">                    L_indices[L_indptr[j]:L_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],                     <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-58">                    L_indices[(L_indptr[k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pad):L_indptr[k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb11-59">      tmp[update_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp[update_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>                                     <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-60">                        Ljk <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> L_x[(L_indptr[k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pad):L_indptr[k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb11-61">            </span>
<span id="cb11-62">    diag <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(tmp[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb11-63">    L_x[L_indptr[j]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> diag</span>
<span id="cb11-64">    L_x[(L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> diag</span>
<span id="cb11-65">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb11-66">      descendant[L_indices[idx]].append((j, idx))</span>
<span id="cb11-67">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span>
<span id="cb11-68"></span>
<span id="cb11-69"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_cholesky_p.def_abstract_eval</span></span>
<span id="cb11-70"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_cholesky_abstract_eval(A_indices, A_indptr, A_x, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse):</span>
<span id="cb11-71">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> core.ShapedArray((L_nse,), A_indices.dtype),                   <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-72">         core.ShapedArray(A_indptr.shape, A_indptr.dtype),             <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-73">         core.ShapedArray((L_nse,), A_x.dtype)</span></code></pre></div>
</div>
</details>
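<p>As a quick aside on why the symbolic phase has to track children at all: the factor can have nonzeros where <img src="https://latex.codecogs.com/png.latex?A"> has none (fill-in). A tiny dense check, with a matrix invented purely for illustration:</p>

```python
import numpy as np

# "Arrow" matrix: A[1, 2] == A[2, 1] == 0, but eliminating column 0
# couples rows 1 and 2, so the factor gains an entry A never had.
A = np.array([[4.0, 1.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
L = np.linalg.cholesky(A)
# L[2, 1] is nonzero even though A[2, 1] == 0: that is fill-in,
# and it is exactly what the children/L_sym bookkeeping above predicts.
```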
<section id="why-do-we-need-a-new-pattern-for-this-very-very-similar-problem" class="level3">
<h3 class="anchored" data-anchor-id="why-do-we-need-a-new-pattern-for-this-very-very-similar-problem">Why do we need a new pattern for this very very similar problem?</h3>
<p>Ok. So now on to the details. If we try to repeat our previous pattern, it would look like this.</p>
<div class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_solve_value_and_jvp(arg_values, arg_tangents, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse):</span>
<span id="cb12-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">""" </span></span>
<span id="cb12-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  JAX-traceable Jacobian-vector product implementation for sparse_solve.</span></span>
<span id="cb12-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  """</span></span>
<span id="cb12-5">  </span>
<span id="cb12-6">  A_indices, A_indptr, A_x, b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> arg_values</span>
<span id="cb12-7">  _, _, A_xt, bt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> arg_tangents</span>
<span id="cb12-8"></span>
<span id="cb12-9">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Needed for shared computation</span></span>
<span id="cb12-10">  L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky(A_indices, A_indptr, A_x)</span>
<span id="cb12-11"></span>
<span id="cb12-12">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make the primal</span></span>
<span id="cb12-13">  primal_out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb12-14">  primal_out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, primal_out, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb12-15"></span>
<span id="cb12-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(A_xt) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> ad.Zero:</span>
<span id="cb12-17">    Delta_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jsparse.CSC((A_xt, A_indices, A_indptr), shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span>
<span id="cb12-18">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We need to do Delta @ primal_out, but we only have the lower triangle</span></span>
<span id="cb12-19">    rhs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Delta_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> primal_out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Delta_lower.transpose() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> primal_out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> A_xt[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> primal_out</span>
<span id="cb12-20">    jvp_Ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, rhs)</span>
<span id="cb12-21">    jvp_Ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, jvp_Ax, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb12-22">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb12-23">    jvp_Ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.zeros_like_array(primal_out)</span>
<span id="cb12-24"></span>
<span id="cb12-25">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(bt) <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> ad.Zero:</span>
<span id="cb12-26">    jvp_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, bt)</span>
<span id="cb12-27">    jvp_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, jvp_b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb12-28">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb12-29">    jvp_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lax.zeros_like_array(primal_out)</span>
<span id="cb12-30"></span>
<span id="cb12-31">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> primal_out, jvp_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp_Ax</span></code></pre></div>
</div>
<p>That’s all well and good. Nothing weird there.</p>
<p>The problem comes when you need to implement the transposition rule. Remembering that <img src="https://latex.codecogs.com/png.latex?%5Cbar%20b%20=%20A%5E%7B-T%7D%5Cbar%20c%20=%20A%5E%7B-1%7D%5Cbar%20c">, you might see the issue: we are going to need the Cholesky factorisation. <em>But we have no way to pass</em> <img src="https://latex.codecogs.com/png.latex?L"> <em>to the transpose function</em>.</p>
<p>This means that we would need to compute <em>two</em> Cholesky factorisations per gradient instead of one. As the Cholesky factorisation is our slowest operation, we do not want to do extra ones! We want to compute the Cholesky triangle once and pass it around like a party bottom<sup>27</sup>. We do not want each of our functions to have to make a deep and meaningful connection with the damn matrix<sup>28</sup>.</p>
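<p>For intuition, here is the dense analogue of what we want, sketched with scipy (the matrix is made up for illustration): factor once, then reuse the factor for every subsequent solve, including the transposed solves the gradients need.</p>

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)  # a made-up SPD matrix

factor = cho_factor(A)  # the expensive step, done exactly once

# Both the primal solve and any later cotangent solve reuse `factor`;
# because A is symmetric, the "transposed" solve is the same solve.
x = cho_solve(factor, np.ones(5))
x_bar = cho_solve(factor, x)
```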
</section>
<section id="a-different-solution" class="level3">
<h3 class="anchored" data-anchor-id="a-different-solution">A different solution</h3>
<p>So how do we pass around our Cholesky triangle? Well, I do love a good class so my first thought was “fuck it. I’ll make a class and I’ll pass it that way”. But the developers of JAX had a <em>much</em> better idea.</p>
<p>Their idea was to abstract the idea of a linear solve and its gradients. They do this through <code>lax.custom_linear_solve</code>. This is a function that takes all of the bits that you would need to compute <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7Db"> and all of its derivatives. In particular it takes<sup>29</sup>:</p>
<ul>
<li><code>matvec</code>: A function <code>matvec(x)</code> that computes <img src="https://latex.codecogs.com/png.latex?Ax">. This might seem a bit weird, but abstracting<sup>30</sup> a matrix to a linear mapping is the most common atrocity committed by mathematicians. So we might as well just suck it up.</li>
<li><code>b</code>: The right hand side vector<sup>31</sup></li>
<li><code>solve</code>: A function that takes the <code>matvec</code> and a vector so that<sup>32</sup> <code>solve(matvec, matvec(x)) == x</code></li>
<li><code>symmetric</code>: A boolean indicating if <img src="https://latex.codecogs.com/png.latex?A"> is symmetric.</li>
</ul>
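<p>A minimal dense use of this API might look like the following sketch (the system is invented for illustration):</p>

```python
import jax.numpy as jnp
from jax import lax
from jax.scipy.linalg import cho_factor, cho_solve

# A made-up 2x2 symmetric positive definite system.
A = jnp.array([[4.0, 1.0],
               [1.0, 3.0]])
b = jnp.array([1.0, 2.0])

factor = cho_factor(A)  # factor once; the solve closure reuses it

x = lax.custom_linear_solve(
    lambda v: A @ v,                              # matvec: v -> A v
    b,
    solve=lambda _, rhs: cho_solve(factor, rhs),  # reuses the factor
    symmetric=True)
```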
<p>The idea (happily copped from the implementation of <code>jax.scipy.linalg.solve</code>) is to wrap our Cholesky decomposition in the solve function, through the never-ending miracle of partial evaluation.</p>
<div class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> functools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> partial</span>
<span id="cb13-2"></span>
<span id="cb13-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_solve(A_indices, A_indptr, A_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb13-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb13-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  A JAX-traceable sparse solve. For the moment, only for vector b</span></span>
<span id="cb13-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  """</span></span>
<span id="cb13-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> A_indptr.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> b.ndim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-9">  </span>
<span id="cb13-10">  L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky(</span>
<span id="cb13-11">    lax.stop_gradient(A_indices), </span>
<span id="cb13-12">    lax.stop_gradient(A_indptr), </span>
<span id="cb13-13">    lax.stop_gradient(A_x), L_nse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_nse)</span>
<span id="cb13-14">  </span>
<span id="cb13-15">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> chol_solve(L_indices, L_indptr, L_x, b):</span>
<span id="cb13-16">    out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb13-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve(L_indices, L_indptr, L_x, out, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb13-18">  </span>
<span id="cb13-19">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> matmult(A_indices, A_indptr, A_x, b):</span>
<span id="cb13-20">    A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jsparse.CSC((A_x, A_indices, A_indptr), shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span>
<span id="cb13-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> A_lower.transpose() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> A_x[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> b</span>
<span id="cb13-22"></span>
<span id="cb13-23">  solver <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> partial(</span>
<span id="cb13-24">    lax.custom_linear_solve,</span>
<span id="cb13-25">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: matmult(A_indices, A_indptr, A_x, x),</span>
<span id="cb13-26">    solve <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> _, x: chol_solve(L_indices, L_indptr, L_x, x),</span>
<span id="cb13-27">    symmetric <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb13-28"></span>
<span id="cb13-29">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> solver(b)</span></code></pre></div>
</div>
<p>There are three things of note in that implementation.</p>
<ol type="1">
<li><p>The calls to <code>lax.stop_gradient()</code>: These tell JAX not to bother computing the gradient of these terms. The relevant parts of the derivatives are computed explicitly by <code>lax.custom_linear_solve</code> in terms of <code>matmult</code> and <code>solve</code>, neither of which needs the explicit derivative of the Cholesky factorisation.</p></li>
<li><p>That definition of <code>matmult()</code><sup>33</sup>: Look. I don’t know what to tell you. Neither addition nor indexing is implemented for <code>jsparse.CSC</code> objects. So we did it the semi-manual way. (I am thankful that matrix-vector multiplication is, at least, available.)</p></li>
<li><p>The definition of <code>solver()</code>: Partial evaluation is a wonderful wonderful thing. <code>functools.partial()</code> transforms <code>lax.custom_linear_solve()</code> from a function that takes 3 arguments (and some keywords), into a function <code>solver()</code> that takes one<sup>34</sup> argument<sup>35</sup> (<code>b</code>, the only positional argument of <code>lax.custom_linear_solve()</code> that isn’t specified).</p></li>
</ol>
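<p>As a minimal illustration of that last point (using a hypothetical three-argument function, nothing JAX-specific), <code>functools.partial</code> pins down some arguments and hands back a function of whatever is left over:</p>

```python
from functools import partial

# A hypothetical three-argument function, standing in for
# lax.custom_linear_solve with its matvec / b / solve arguments.
def affine(a, b, x):
    return a * x + b

# Pin down the first two positional arguments; what remains is a
# one-argument function, just like solver(b) above.
f = partial(affine, 2.0, 1.0)
print(f(3.0))  # 2.0 * 3.0 + 1.0 = 7.0
```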
</section>
<section id="does-it-work" class="level3">
<h3 class="anchored" data-anchor-id="does-it-work">Does it work?</h3>
<div class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f(theta):</span>
<span id="cb14-2">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_x)</span>
<span id="cb14-3">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb14-4">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb14-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_solve(A_indices, A_indptr, Ax_theta, b)</span>
<span id="cb14-6"></span>
<span id="cb14-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f_jax(theta):</span>
<span id="cb14-8">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A.todense())</span>
<span id="cb14-9">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>),np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb14-10">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb14-11">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jsp.linalg.solve(Ax_theta, b)</span>
<span id="cb14-12"></span>
<span id="cb14-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> g(theta):</span>
<span id="cb14-14">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_x)</span>
<span id="cb14-15">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb14-16">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb14-17">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb14-18">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_solve(A_indices, A_indptr, Ax_theta, b)</span>
<span id="cb14-19"></span>
<span id="cb14-20"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> g_jax(theta):</span>
<span id="cb14-21">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A.todense())</span>
<span id="cb14-22">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb14-23">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb14-24">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb14-25">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jsp.linalg.solve(Ax_theta, b)</span>
<span id="cb14-26"></span>
<span id="cb14-27"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> h(theta):</span>
<span id="cb14-28">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A_x)</span>
<span id="cb14-29">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb14-30">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb14-31">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb14-32">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_solve(A_indices, A_indptr, Ax_theta, b)</span>
<span id="cb14-33"></span>
<span id="cb14-34"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> h_jax(theta):</span>
<span id="cb14-35">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(A.todense())</span>
<span id="cb14-36">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>),np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb14-37">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb14-38">  b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b.at[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">51</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb14-39">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jsp.linalg.solve(Ax_theta, b)</span>
<span id="cb14-40"></span>
<span id="cb14-41">primal1, jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb14-42">primal2, jvp2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f_jax, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb14-43">grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(f(x)))(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]))</span>
<span id="cb14-44">grad2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(f_jax(x)))(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]))</span>
<span id="cb14-45"></span>
<span id="cb14-46"></span>
<span id="cb14-47">primal3, jvp3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(g, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb14-48">primal4, jvp4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(g_jax, (jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb14-49">grad3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(g(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb14-50">grad4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(g_jax(x)))(jnp.array([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">142.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb14-51"></span>
<span id="cb14-52">primal5, jvp5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(h, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb14-53">primal6, jvp6 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(h_jax, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb14-54">grad5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(f(x)))(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb14-55">grad6 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: jnp.mean(f_jax(x)))(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">342.</span>]))</span>
<span id="cb14-56"></span>
<span id="cb14-57"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb14-58"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Check the plumbing!</span></span>
<span id="cb14-59"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable A:</span></span>
<span id="cb14-60"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-61"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-62"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-63"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  </span></span>
<span id="cb14-64"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable b:</span></span>
<span id="cb14-65"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal4)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-66"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp4)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-67"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad4)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span></span>
<span id="cb14-68"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb14-69"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable A and b:</span></span>
<span id="cb14-70"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal6)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-71"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp6)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-72"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad6)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb14-73"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  """</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Check the plumbing!
Variable A:
  Primal difference:  1.98e-07
  JVP difference:  1.43e-07
  Gradient difference:  0.00e+00
  
Variable b:
  Primal difference:  4.56e-06
  JVP difference:  6.52e-08
  Gradient difference:  9.31e-10 
    
Variable A and b:
  Primal difference:  8.10e-06
  JVP difference:  1.83e-06
  Gradient difference:  1.82e-12
  </code></pre>
</div>
</div>
<p>Yes.</p>
</section>
<section id="why-is-this-better-than-just-differentiating-through-the-cholesky-factorisation" class="level3">
<h3 class="anchored" data-anchor-id="why-is-this-better-than-just-differentiating-through-the-cholesky-factorisation">Why is this better than just differentiating through the Cholesky factorisation?</h3>
<p>The other option for making this work would’ve been to implement the Cholesky factorisation as a primitive (~which we are about to do!~ which we will do another day) and then write the sparse solver directly as a pure JAX function.</p>
<div class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_solve_direct(A_indices, A_indptr, A_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb16-2">  L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky(A_indices, A_indptr, A_x)</span>
<span id="cb16-3">  out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(L_indices, L_indptr, L_x, b)</span>
<span id="cb16-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve(L_indices, L_indptr, L_x, out, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div>
</div>
<p>This function is JAX-traceable<sup>36</sup> and, therefore, we could compute the gradient of it directly. It turns out that this is going to be a bad idea.</p>
<p>Why? Because the derivative of <code>sparse_cholesky</code>, which we would have to chain together with the derivatives from the solver, is pretty complicated. Basically, this means that we’d have to do a lot more work<sup>37</sup> than we do if we just implement the symbolic formula for the derivatives.</p>
</section>
</section>
<section id="primitive-three-the-dreaded-log-determinant" class="level2">
<h2 class="anchored" data-anchor-id="primitive-three-the-dreaded-log-determinant">Primitive three: The dreaded log determinant</h2>
<p>Ok, so now we get to the good one. The log-determinant of <img src="https://latex.codecogs.com/png.latex?A">. The first thing that we need to do is wrench out a derivative. This is not as easy as it was for the linear solve. So what follows is a modification for sparse matrices from Appendix A of <a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">Boyd’s convex optimisation book</a>.</p>
<p>It’s pretty easy to convince yourself that <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Clog(%7CA%20+%20%5CDelta%7C)%20&amp;=%20%5Clog%5Cleft(%20%5Cleft%7CA%5E%7B1/2%7D(I%20+%20A%5E%7B-1/2%7D%5CDelta%20A%5E%7B-1/2%7D)A%5E%7B1/2%7D%5Cright%7C%5Cright)%20%5C%5C%0A&amp;=%20%5Clog(%7CA%7C)%20+%20%5Clog%5Cleft(%20%5Cleft%7CI%20+%20A%5E%7B-1/2%7D%5CDelta%20A%5E%7B-1/2%7D%5Cright%7C%5Cright).%0A%5Cend%7Balign*%7D"></p>
<p>It is harder to convince yourself how this could possibly be a useful fact.</p>
<p>If we write <img src="https://latex.codecogs.com/png.latex?%5Clambda_i">, <img src="https://latex.codecogs.com/png.latex?i%20=%201,%20%5Cldots,%20n"> as the eigenvalues of <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1/2%7D%5CDelta%20A%5E%7B-1/2%7D">, then we have <img src="https://latex.codecogs.com/png.latex?%0A%5Clog(%7CA%20+%20%5CDelta%20%7C)%20=%20%5Clog(%7CA%7C)%20+%20%5Csum_%7Bi=1%7D%5En%20%5Clog(%201%20+%20%5Clambda_i).%0A"> Remembering that <img src="https://latex.codecogs.com/png.latex?%5CDelta"> is very small, it follows that <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1/2%7D%5CDelta%20A%5E%7B-1/2%7D"> will <em>also</em> be small. That translates to the eigenvalues of <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1/2%7D%5CDelta%20A%5E%7B-1/2%7D"> all being small. Therefore, we can use the approximation <img src="https://latex.codecogs.com/png.latex?%5Clog(1%20+%20%5Clambda_i)%20%20=%20%5Clambda_i%20%20+%20%5Cmathcal%7BO%7D(%5Clambda_i%5E2)">.</p>
<p>This means that<sup>38</sup> <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Clog(%7CA%20+%20%5CDelta%20%7C)%20&amp;=%20%5Clog(%7CA%7C)%20+%20%5Csum_%7Bi=1%7D%5En%20%20%5Clambda_i%20+%20%5Cmathcal%7BO%7D%5Cleft(%5C%7C%5CDelta%5C%7C%5E2%5Cright)%20%5C%5C%0A&amp;=%5Clog(%7CA%7C)%20+%20%5Coperatorname%7Btr%7D%5Cleft(A%5E%7B-1/2%7D%20%5CDelta%20A%5E%7B-1/2%7D%20%5Cright)%20+%20%5Cmathcal%7BO%7D%5Cleft(%5C%7C%5CDelta%5C%7C%5E2%5Cright)%20%5C%5C%0A&amp;=%20%5Clog(%7CA%7C)%20+%20%5Coperatorname%7Btr%7D%5Cleft(A%5E%7B-1%7D%20%5CDelta%20%5Cright)%20+%20%5Cmathcal%7BO%7D%5Cleft(%5C%7C%5CDelta%5C%7C%5E2%5Cright),%0A%5Cend%7Balign*%7D"> which follows from the cyclic property of the trace.</p>
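<p>That first-order expansion is easy to sanity-check numerically. A quick sketch (plain NumPy on a small random SPD matrix, not code from this post):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite
D = rng.normal(size=(n, n))
Delta = 1e-6 * (D + D.T)           # small symmetric perturbation

_, logdet_A = np.linalg.slogdet(A)
_, logdet_AD = np.linalg.slogdet(A + Delta)

# log|A + Delta| ~= log|A| + tr(A^{-1} Delta), with O(||Delta||^2) error
first_order = logdet_A + np.trace(np.linalg.solve(A, Delta))
print(abs(logdet_AD - first_order))  # tiny: the error is second order in Delta
```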
<p>If we recall the formula from the last section defining the Jacobian-vector product, in our context <img src="https://latex.codecogs.com/png.latex?m%20=%201">, <img src="https://latex.codecogs.com/png.latex?x"> is the vector of non-zero entries of the lower triangle of <img src="https://latex.codecogs.com/png.latex?A"> stacked by column, and <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is the vector of non-zero entries of the lower triangle of <img src="https://latex.codecogs.com/png.latex?%5CDelta">. That means the Jacobian-vector product is <img src="https://latex.codecogs.com/png.latex?%0AJ(x)%5Cdelta%20=%20%5Coperatorname%7Btr%7D%5Cleft(A%5E%7B-1%7D%20%5CDelta%20%5Cright)%20=%20%5Csum_%7Bi=1%7D%5En%5Csum_%7Bj=1%7D%5En%5BA%5E%7B-1%7D%5D_%7Bij%7D%20%5CDelta_%7Bij%7D.%0A"></p>
<p>Remembering that <img src="https://latex.codecogs.com/png.latex?%5CDelta"> is sparse with the same sparsity pattern as <img src="https://latex.codecogs.com/png.latex?A">, we see that the Jacobian-vector product requires us to know the values of <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D"> that correspond to non-zero elements of <img src="https://latex.codecogs.com/png.latex?A">. That’s good news, because we will see that these entries are relatively cheap and easy to compute, whereas the full inverse is dense and very expensive to compute.</p>
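<p>For a tiny matrix you can see exactly which entries are needed with a dense reference computation (illustrative only, and with a made-up 3x3 matrix; the whole point here is to eventually get these entries <em>without</em> forming the dense inverse):</p>

```python
import numpy as np

# A small sparse-ish SPD matrix, dense here purely for illustration.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 2.0]])

Ainv = np.linalg.inv(A)
rows, cols = np.nonzero(A)       # sparsity pattern of A
partial_inv = Ainv[rows, cols]   # the only entries of A^{-1} the JVP needs

for r, c, v in zip(rows, cols, partial_inv):
    print(f"Ainv[{r},{c}] = {v:.4f}")
```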
<p>But before we get to that, I need to point out a trap for young players<sup>39</sup>. Lest your implementations go down faster than me when someone asks politely.</p>
<p>The problem comes from how we store our matrix. A mathematician would suggest that it’s our representation. A physicist<sup>40</sup> would shit on about being coordinate free with such passion that he<sup>41</sup> will keep going even after you quietly leave the room.</p>
<p>The problem is that we only store the non-zero entries of the lower-triangular part of <img src="https://latex.codecogs.com/png.latex?A">. This means that <em>we need to be careful</em>, when we compute the Jacobian-vector product, to properly compute the matrix-vector product.</p>
<p>Let <code>A_indices</code> and <code>A_indptr</code> define the sparsity structure of <img src="https://latex.codecogs.com/png.latex?A"> (and <img src="https://latex.codecogs.com/png.latex?%5CDelta">). If <img src="https://latex.codecogs.com/png.latex?A_x"> is our input and <img src="https://latex.codecogs.com/png.latex?v"> is our vector, then we need to do the following steps to compute the Jacobian-vector product:</p>
<ol type="1">
<li>Compute <code>Ainv_x</code> (aka the non-zero elements of <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D"> that correspond to the sparsity pattern of <img src="https://latex.codecogs.com/png.latex?A">)</li>
<li>Compute the matrix vector product as</li>
</ol>
<div class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb17-1">jvp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(Ainv_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> v) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(Ainv_x[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> v[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]])</span></code></pre></div>
</div>
<p>Why does it look like that? Well, we need to add the contribution from the upper triangle as well as the lower triangle. One way to do that is to double the sum and then subtract off the diagonal terms that we’ve counted twice.</p>
<p>(I’m making a pretty big assumption here, which is fine in our context, that <img src="https://latex.codecogs.com/png.latex?A"> has a non-zero diagonal. If that doesn’t hold, you just change the indexing in the second term to pull out the diagonal terms.)</p>
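<p>The double-then-subtract-the-diagonal trick is easy to check on a small dense example (a toy sketch of my own, using dense matrices and <code>tril_indices</code> rather than the CSC vectors above):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
S = rng.normal(size=(n, n)); S = S + S.T          # stands in for A^{-1}
Delta = rng.normal(size=(n, n)); Delta = Delta + Delta.T

# The full double sum from the trace formula.
full = np.sum(S * Delta)

# Lower-triangle-only storage: double the sum, then subtract the
# diagonal products that were counted twice.
rows, cols = np.tril_indices(n)
lower = (2.0 * np.sum(S[rows, cols] * Delta[rows, cols])
         - np.sum(np.diag(S) * np.diag(Delta)))

err = abs(full - lower)
```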
<p>Using similar reasoning, we can compute the Jacobian as <img src="https://latex.codecogs.com/png.latex?%0A%5BJ_f(x)%5D_%7Bi1%7D%20=%20%5Cbegin%7Bcases%7D%0A%5Coperatorname%7Bpartial-inverse%7D(x)_i,%20%5Cqquad%20&amp;%20x_i%20%20%5Ctext%7B%20is%20a%20diagonal%20element%20of%20%7DA%20%5C%5C%0A2%5Coperatorname%7Bpartial-inverse%7D(x)_i,%20%5Cqquad%20&amp;%20%5Ctext%7Botherwise%7D,%0A%5Cend%7Bcases%7D%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bpartial-inverse%7D(x)"> is the vector that stacks the columns of the elements of <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D"> that correspond to the non-zero elements of <img src="https://latex.codecogs.com/png.latex?A">. (Yikes!)</p>
<section id="computing-the-partial-inverse" class="level3">
<h3 class="anchored" data-anchor-id="computing-the-partial-inverse">Computing the partial inverse</h3>
<p>So now we need to actually work out how to compute this <em>partial inverse</em> of a symmetric positive definite matrix <img src="https://latex.codecogs.com/png.latex?A">. To do this, we are going to steal a technique that goes back to Takahashi, Fagan, and Chen<sup>42</sup> in 1973. (For this presentation, I’m basically pillaging <a href="https://www.sciencedirect.com/science/article/pii/S0378375807000845">Håvard Rue and Sara Martino’s 2007 paper.</a>)</p>
<p>Their idea starts from writing <img src="https://latex.codecogs.com/png.latex?A%20=%20VDV%5ET">, where <img src="https://latex.codecogs.com/png.latex?V"> is a lower-triangular matrix with ones on the diagonal and <img src="https://latex.codecogs.com/png.latex?D"> is diagonal. This links up with our usual Cholesky factorisation through the identity <img src="https://latex.codecogs.com/png.latex?L%20=%20VD%5E%7B1/2%7D">. It follows that if <img src="https://latex.codecogs.com/png.latex?S%20=%20A%5E%7B-1%7D">, then <img src="https://latex.codecogs.com/png.latex?VDV%5ETS%20=%20I">. Then, we make some magic manipulations<sup>43</sup>. <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AV%5ETS%20&amp;=%20D%5E%7B-1%7DV%5E%7B-1%7D%20%5C%5C%0AS%20+%20V%5ETS%20&amp;=%20S%20+%20D%5E%7B-1%7DV%5E%7B-1%7D%20%5C%5C%0AS%20&amp;=%20D%5E%7B-1%7DV%5E%7B-1%7D%20+%20(I%20-%20V%5ET)S.%0A%5Cend%7Balign*%7D"></p>
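<p>We can verify this identity numerically (a quick sketch of my own, recovering V and D from numpy's ordinary Cholesky factor):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)

# Recover V and D from the ordinary Cholesky factor L = V D^{1/2}.
L = np.linalg.cholesky(A)
d = np.diag(L)                 # d = diag(D)^{1/2}
V = L / d                      # unit lower triangular (divides column j by d[j])
D = np.diag(d**2)

# Check S = D^{-1} V^{-1} + (I - V^T) S, where S = A^{-1}.
S = np.linalg.inv(A)
rhs = np.linalg.inv(D) @ np.linalg.inv(V) + (np.eye(n) - V.T) @ S

err = np.max(np.abs(S - rhs))
```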
<p>Once again, this does not look super-useful. The trick is to notice two things.</p>
<ol type="1">
<li><p>Because <img src="https://latex.codecogs.com/png.latex?V"> is lower triangular, <img src="https://latex.codecogs.com/png.latex?V%5E%7B-1%7D"> is also lower triangular and the diagonal elements of <img src="https://latex.codecogs.com/png.latex?V%5E%7B-1%7D"> are the inverses of the diagonal elements of <img src="https://latex.codecogs.com/png.latex?V"> (aka they are all 1). Therefore, <img src="https://latex.codecogs.com/png.latex?D%5E%7B-1%7DV%5E%7B-1%7D"> is a lower triangular matrix with a diagonal given by the diagonal of <img src="https://latex.codecogs.com/png.latex?D%5E%7B-1%7D">.</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?I%20-%20V%5ET"> is an upper triangular matrix and <img src="https://latex.codecogs.com/png.latex?%5BI%20-%20V%5ET%5D_%7Bnn%7D%20=%200">.</p></li>
</ol>
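<p>The first observation is also easy to check numerically (a toy sketch of my own): the inverse of a unit lower-triangular matrix is again unit lower triangular.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
# A unit lower triangular matrix: ones on the diagonal, noise below.
V = np.tril(rng.normal(size=(n, n)), k=-1) + np.eye(n)
Vinv = np.linalg.inv(V)

upper_leak = np.max(np.abs(np.triu(Vinv, k=1)))  # inverse stays lower triangular
diag_err = np.max(np.abs(np.diag(Vinv) - 1.0))   # and keeps a unit diagonal
```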
<p>These two things together lead to the somewhat unexpected situation where the upper triangle of <img src="https://latex.codecogs.com/png.latex?S%20=%20D%5E%7B-1%7DV%5E%7B-1%7D%20+%20(I-%20%20V%5ET)S"> defines a set of recursions for the upper triangle of <img src="https://latex.codecogs.com/png.latex?S">. (And, therefore, all of <img src="https://latex.codecogs.com/png.latex?S"> because <img src="https://latex.codecogs.com/png.latex?S"> is symmetric!) These are sometimes referred to as the Takahashi recursions.</p>
<p>But we don’t want the whole upper triangle of <img src="https://latex.codecogs.com/png.latex?S">, we just want the entries that correspond to the non-zero elements of <img src="https://latex.codecogs.com/png.latex?A">. Unfortunately, the recursions are not, in general, solvable using only that subset of <img src="https://latex.codecogs.com/png.latex?S">. But we are in luck: they are solvable using the elements of <img src="https://latex.codecogs.com/png.latex?S"> that correspond to the non-zeros of <img src="https://latex.codecogs.com/png.latex?L%20+%20L%5ET">, which, as we know from a few posts ago, is a superset of the non-zero elements of <img src="https://latex.codecogs.com/png.latex?A">!</p>
<p>From this, we get the recursions running over <img src="https://latex.codecogs.com/png.latex?i%20=%20n,%20%5Cldots,%201"> and <img src="https://latex.codecogs.com/png.latex?j%20=%20n,%20%5Cldots,%20i"> (the order is important!), for the indices with <img src="https://latex.codecogs.com/png.latex?L_%7Bji%7D%20%5Cneq%200">: <img src="https://latex.codecogs.com/png.latex?%0AS_%7Bji%7D%20=%20%20%20%5Cbegin%7Bcases%7D%0A%5Cfrac%7B1%7D%7BL_%7Bii%7D%5E2%7D%20-%20%5Cfrac%7B1%7D%7BL_%7Bii%7D%7D%5Csum_%7Bk=i+1%7D%5E%7Bn%7D%20L_%7Bki%7D%20S_%7Bkj%7D%20%5Cqquad&amp;%20%20%5Ctext%7Bif%20%7D%20i=j,%20%5C%5C%20%20%20%20%20%20%20%20%20%0A-%20%5Cfrac%7B1%7D%7BL_%7Bii%7D%7D%5Csum_%7Bk=i+1%7D%5E%7Bn%7D%20L_%7Bki%7D%20S_%7Bkj%7D%20%20&amp;%20%5Ctext%7Botherwise%7D.%0A%5Cend%7Bcases%7D%0A"></p>
<p>If you recall our discussion way back when about the way the non-zero structure of the <img src="https://latex.codecogs.com/png.latex?j"> th column of <img src="https://latex.codecogs.com/png.latex?L"> relates to the non-zero structure of the <img src="https://latex.codecogs.com/png.latex?i"> th column for <img src="https://latex.codecogs.com/png.latex?j%20%5Cgeq%20i">, it’s clear that we have computed enough<sup>44</sup> of <img src="https://latex.codecogs.com/png.latex?S"> at every step to complete the recursions.</p>
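<p>Before moving to the sparse version, here is a dense sketch of the Takahashi recursions (my own toy code, not the post's), computing the whole of <img src="https://latex.codecogs.com/png.latex?S"> so that the sweep order and the two cases are easy to see:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)
L = np.linalg.cholesky(A)

# Takahashi recursions, dense version: sweep from the bottom-right corner,
# i = n-1..0 and, for each i, j = n-1..i (this order matters: every S[k, j]
# we read below has already been filled in).
S = np.zeros((n, n))
for i in range(n - 1, -1, -1):
    for j in range(n - 1, i - 1, -1):
        acc = sum(L[k, i] * S[k, j] for k in range(i + 1, n)) / L[i, i]
        if i == j:
            S[i, i] = 1.0 / L[i, i]**2 - acc
        else:
            S[i, j] = S[j, i] = -acc

err = np.max(np.abs(S - np.linalg.inv(A)))
```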
<p>Now we just need to Python it. (And thanks to Finn Lindgren who helped me understand how to implement this, which he may or may not remember because it happened about five years ago.)</p>
<p>Actually, we need this to be JAX-traceable, so we are going to implement a very basic primitive. In particular, we don’t need to implement a derivative or anything like that, just an abstract evaluation and an implementation.</p>
<div class="cell" data-execution_count="15">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb18-1">sparse_partial_inverse_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_partial_inverse"</span>)</span>
<span id="cb18-2"></span>
<span id="cb18-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_partial_inverse(L_indices, L_indptr, L_x, out_indices, out_indptr):</span>
<span id="cb18-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb18-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  Computes the elements (out_indices, out_indptr) of the inverse of a sparse matrix (A_indices, A_indptr, A_x)</span></span>
<span id="cb18-6"><span class="co" style="color: #5E5E5E;
background-color: null;
   with Chole">
font-style: inherit;">   with Cholesky factor (L_indices, L_indptr, L_x). (out_indices, out_indptr) is assumed to be either</span></span>
<span id="cb18-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">   the sparsity pattern of A or a subset of it in lower triangular form. </span></span>
<span id="cb18-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  """</span></span>
<span id="cb18-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_partial_inverse_p.bind(L_indices, L_indptr, L_x, out_indices, out_indptr)</span>
<span id="cb18-10"></span>
<span id="cb18-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_partial_inverse_p.def_abstract_eval</span></span>
<span id="cb18-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_partial_inverse_abstract_eval(L_indices, L_indptr, L_x, out_indices, out_indptr):</span>
<span id="cb18-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> abstract_arrays.ShapedArray(out_indices.shape, L_x.dtype)</span>
<span id="cb18-14"></span>
<span id="cb18-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_partial_inverse_p.def_impl</span></span>
<span id="cb18-16"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_partial_inverse_impl(L_indices, L_indptr, L_x, out_indices, out_indptr):</span>
<span id="cb18-17">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb18-18">  Linv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.dok_array((n,n), dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x.dtype)</span>
<span id="cb18-19">  counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb18-20">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> col <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb18-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> row <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> L_indices[L_indptr[col]:L_indptr[col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]][::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]:</span>
<span id="cb18-22">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> col:</span>
<span id="cb18-23">        Linv[row, col] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Linv[col, row] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span></span>
<span id="cb18-24">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb18-25">        Linv[row, col] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> L_x[L_indptr[col]]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb18-26">      L_col  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x[L_indptr[col]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:L_indptr[col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> L_x[L_indptr[col]]</span>
<span id="cb18-27"> </span>
<span id="cb18-28">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> k, L_kcol <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(L_indices[L_indptr[col]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:L_indptr[col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]], L_col):</span>
<span id="cb18-29">         Linv[col,row] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Linv[row,col] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  Linv[row, col] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>  L_kcol <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> Linv[k, row]</span>
<span id="cb18-30">        </span>
<span id="cb18-31">  Linv_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.tril(Linv, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"csc"</span>).data</span>
<span id="cb18-32">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(out_indices) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indices):</span>
<span id="cb18-33">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> Linv_x</span>
<span id="cb18-34"></span>
<span id="cb18-35">  out_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(out_indices))</span>
<span id="cb18-36">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> col <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb18-37">    ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(L_indices[L_indptr[col]:L_indptr[col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],</span>
<span id="cb18-38">      out_indices[out_indptr[col]:out_indptr[col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb18-39">    out_x[out_indptr[col]:out_indptr[col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Linv_x[L_indptr[col] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> ind]</span>
<span id="cb18-40">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> out_x</span></code></pre></div>
</div>
<p>The implementation makes use of the<sup>45</sup> <em>dictionary of keys</em> representation of a sparse matrix from <code>scipy.sparse</code>. This is an efficient storage scheme when you need to modify the sparsity structure (as we are doing here) or do a lot of indexing. It would definitely be possible to implement this directly on the CSC data structure, but it gets a little bit tricky to access the elements of <code>L_inv</code> that are above the diagonal. The resulting code is honestly a mess and there’s lots of non-local memory access anyway, so I implemented it this way.</p>
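<p>For a feel for the dictionary-of-keys format (a tiny illustrative sketch of my own): entries can be added or overwritten by plain indexing, and the matrix converted to CSC once the structure is settled.</p>

```python
from scipy import sparse

# DOK makes structure changes and random access cheap, unlike CSC.
M = sparse.dok_array((4, 4))
M[0, 0] = 1.0
M[2, 1] = -0.5
M[1, 2] = M[2, 1]        # symmetric assignment via easy random access

# Extract the lower triangle as CSC once we're done mutating.
tril = sparse.tril(M, format="csc")
```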
<p>But let’s be honest: this thing is crying out for a proper symmetric matrix class with sensible reverse iterators. But hey. Python.</p>
<p>The second chunk of the code is just the opposite of our <code>_structured_copy()</code> function. It takes a matrix with the sparsity pattern of <img src="https://latex.codecogs.com/png.latex?L"> and returns one with the sparsity pattern of <code>out</code> (which is assumed to be a subset, and is usually the sparsity pattern of <img src="https://latex.codecogs.com/png.latex?A"> or a diagonal matrix).</p>
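<p>That pattern-subsetting step can be illustrated on its own with toy CSC matrices (my own example; I use <code>np.isin</code>, the modern spelling of <code>np.in1d</code>):</p>

```python
import numpy as np
from scipy import sparse

# "big" plays the role of L's pattern; "small" is a subset of it (like A's).
big = sparse.csc_array(np.array([[1.0, 0.0, 0.0],
                                 [2.0, 3.0, 0.0],
                                 [4.0, 5.0, 6.0]]))
small = sparse.csc_array(np.array([[1.0, 0.0, 0.0],
                                   [0.0, 1.0, 0.0],
                                   [1.0, 0.0, 1.0]]))

# Copy big's values onto small's sparsity pattern, column by column.
out = np.zeros(small.nnz)
for col in range(3):
    big_rows = big.indices[big.indptr[col]:big.indptr[col + 1]]
    small_rows = small.indices[small.indptr[col]:small.indptr[col + 1]]
    # Positions within big's column whose row index also appears in small's column.
    ind = np.nonzero(np.isin(big_rows, small_rows))[0]
    out[small.indptr[col]:small.indptr[col + 1]] = big.data[big.indptr[col] + ind]
```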
<p>Let’s check that it works.</p>
<div class="cell" data-execution_count="16">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb19-1">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span>
<span id="cb19-2">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb19-3"></span>
<span id="cb19-4"></span>
<span id="cb19-5">L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky(A_indices, A_indptr, A_x)</span>
<span id="cb19-6"></span>
<span id="cb19-7">a_inv_L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_partial_inverse(L_indices, L_indptr, L_x, L_indices, L_indptr)</span>
<span id="cb19-8"></span>
<span id="cb19-9">col_counts_L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [L_indptr[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> L_indptr[i] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb19-10">cols_L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.repeat(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n), col_counts_L)</span>
<span id="cb19-11"></span>
<span id="cb19-12">true_inv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linalg.inv(A.todense())</span>
<span id="cb19-13">truth_L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> true_inv[L_indices, cols_L]</span>
<span id="cb19-14"></span>
<span id="cb19-15">a_inv_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_partial_inverse(L_indices, L_indptr, L_x, A_indices, A_indptr)</span>
<span id="cb19-16">col_counts_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [A_indptr[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> A_indptr[i] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb19-17">cols_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.repeat(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n), col_counts_A)</span>
<span id="cb19-18">truth_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> true_inv[A_indices, cols_A]</span>
<span id="cb19-19"></span>
<span id="cb19-20"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb19-21"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Error in partial inverse (all of L): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(a_inv_L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> truth_L)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb19-22"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Error in partial inverse (all of A): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(a_inv_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> truth_A)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb19-23"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Error in partial inverse (all of L):  1.57e-15
Error in partial inverse (all of A):  1.53e-15
</code></pre>
</div>
</div>
</section>
<section id="putting-the-log-determinant-together" class="level3">
<h3 class="anchored" data-anchor-id="putting-the-log-determinant-together">Putting the log-determinant together</h3>
<p>All of our bits are in place, so now all we need to do is implement the primitive for the log-determinant. One nice thing here is that we don’t need to implement a transposition rule, as the function is not structurally linear in any of its arguments. At this point we take our small wins where we can get them.</p>
<p>There isn’t anything particularly interesting in the implementation. But do note that the trace has been implemented in a way that’s aware that we’re only storing the lower triangle of <img src="https://latex.codecogs.com/png.latex?A">.</p>
<div class="cell" data-execution_count="17">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb21-1">sparse_log_det_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_log_det"</span>)</span>
<span id="cb21-2"></span>
<span id="cb21-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_log_det(A_indices, A_indptr, A_x):</span>
<span id="cb21-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_log_det_p.bind(A_indices, A_indptr, A_x)</span>
<span id="cb21-5"></span>
<span id="cb21-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_log_det_p.def_impl</span></span>
<span id="cb21-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_log_det_impl(A_indices, A_indptr, A_x):</span>
<span id="cb21-8">  L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky(A_indices, A_indptr, A_x)</span>
<span id="cb21-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(jnp.log(L_x[L_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))</span>
<span id="cb21-10"></span>
<span id="cb21-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_log_det_p.def_abstract_eval</span></span>
<span id="cb21-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_log_det_abstract_eval(A_indices, A_indptr, A_x):</span>
<span id="cb21-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> abstract_arrays.ShapedArray((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,), A_x.dtype)</span>
<span id="cb21-14"></span>
<span id="cb21-15"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_log_det_value_and_jvp(arg_values, arg_tangent):</span>
<span id="cb21-16">  A_indices, A_indptr, A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> arg_values</span>
<span id="cb21-17">  _, _, A_xt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> arg_tangent</span>
<span id="cb21-18">  L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky(A_indices, A_indptr, A_x)</span>
<span id="cb21-19">  value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(jnp.log(L_x[L_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))</span>
<span id="cb21-20">  Ainv_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_partial_inverse(L_indices, L_indptr, L_x, A_indices, A_indptr)</span>
<span id="cb21-21">  jvp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(Ainv_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_xt) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(Ainv_x[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_xt[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]])</span>
<span id="cb21-22">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> value, jvp</span>
<span id="cb21-23"></span>
<span id="cb21-24">ad.primitive_jvps[sparse_log_det_p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_log_det_value_and_jvp</span></code></pre></div>
</div>
<p>Finally, we can test it out.</p>
<div class="cell" data-execution_count="18">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb22-1">ld_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.log(np.linalg.det(A.todense())) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#np.sum(np.log(lu.U.diagonal()))</span></span>
<span id="cb22-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Error in log-determinant = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ld_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> sparse_log_det(A_indices, A_indptr, A_x)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb22-3"></span>
<span id="cb22-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f(theta):</span>
<span id="cb22-5">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n</span>
<span id="cb22-6">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb22-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_log_det(A_indices, A_indptr, Ax_theta)</span>
<span id="cb22-8"></span>
<span id="cb22-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f_jax(theta):</span>
<span id="cb22-10">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A.todense()) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n </span>
<span id="cb22-11">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[np.arange(n),np.arange(n)].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb22-12">  L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.linalg.cholesky(Ax_theta)</span>
<span id="cb22-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(jnp.log(jnp.diag(L)))</span>
<span id="cb22-14"></span>
<span id="cb22-15">primal1, jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb22-16">primal2, jvp2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f_jax, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]),))</span>
<span id="cb22-17"></span>
<span id="cb22-18">eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-4</span></span>
<span id="cb22-19">jvp_fd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (f(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]) ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> f(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]))) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> eps</span>
<span id="cb22-20"></span>
<span id="cb22-21">grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(f)(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]))</span>
<span id="cb22-22">grad2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(f_jax)(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>]))</span>
<span id="cb22-23"></span>
<span id="cb22-24"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb22-25"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Check the Derivatives!</span></span>
<span id="cb22-26"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable A:</span></span>
<span id="cb22-27"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb22-28"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb22-29"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference (FD): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp_fd)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb22-30"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb22-31"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Error in log-determinant =  0.00e+00</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>
Check the Derivatives!
Variable A:
  Primal difference: 0.0
  JVP difference: 0.000885009765625
  JVP difference (FD): 0.221893310546875
  Gradient difference: 1.526623782410752e-05
</code></pre>
</div>
</div>
<p>I’m not going to lie, I am <em>not happy</em> with that JVP difference. I was somewhat concerned that there was a bug somewhere in my code. I did a little bit of exploring and the error got larger as the problem got larger. It also depended a little more than I was comfortable with on how I had implemented<sup>46</sup> the baseline dense version.</p>
<p>That second fact suggested to me that it might be a floating point problem. By default, JAX uses single precision (32-bit) floating point. Most modern systems that don’t try to run on GPUs use double precision (64-bit) floating point. So I tried it with double precision and lo and behold, the problem disappeared.</p>
<p>Matrix factorisations are bloody hard in single precision.</p>
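<p>To get a feel for how much single precision alone costs a Cholesky-based log-determinant, here is a small NumPy check of my own (scratch code, not from the post):</p>

```python
import numpy as np

# An arbitrary, well-conditioned SPD test matrix of my own.
rng = np.random.default_rng(42)
B = rng.standard_normal((200, 200))
A = B @ B.T + 200.0 * np.eye(200)

# Log-determinant via the Cholesky diagonal, in double and single precision.
ld64 = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(A))))
ld32 = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(A.astype(np.float32)))))

# The gap is pure floating point error: same matrix, same algorithm.
print(abs(ld64 - float(ld32)))
```

<p>Even on a matrix this friendly, the single-precision factor loses a handful of digits, and it only gets worse as the problem grows or the conditioning degrades.</p>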
<div class="cell" data-execution_count="19">
<div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb25-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax.config <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> config</span>
<span id="cb25-2">config.update(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"jax_enable_x64"</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb25-3"></span>
<span id="cb25-4">ld_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.log(np.linalg.det(A.todense())) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#np.sum(np.log(lu.U.diagonal()))</span></span>
<span id="cb25-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Error in log-determinant = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ld_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> sparse_log_det(A_indices, A_indptr, A_x)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb25-6"></span>
<span id="cb25-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f(theta):</span>
<span id="cb25-8">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_x, dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n</span>
<span id="cb25-9">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[A_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb25-10">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_log_det(A_indices, A_indptr, Ax_theta)</span>
<span id="cb25-11"></span>
<span id="cb25-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f_jax(theta):</span>
<span id="cb25-13">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.array(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A.todense(), dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n </span>
<span id="cb25-14">  Ax_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ax_theta.at[np.arange(n),np.arange(n)].add(theta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb25-15">  L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.linalg.cholesky(Ax_theta)</span>
<span id="cb25-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(jnp.log(jnp.diag(L)))</span>
<span id="cb25-17"></span>
<span id="cb25-18">primal1, jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64),))</span>
<span id="cb25-19">primal2, jvp2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jvp(f_jax, (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64),), (jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64),))</span>
<span id="cb25-20"></span>
<span id="cb25-21">eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-7</span></span>
<span id="cb25-22">jvp_fd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (f(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64) ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> f(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64))) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> eps</span>
<span id="cb25-23"></span>
<span id="cb25-24">grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(f)(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64))</span>
<span id="cb25-25">grad2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(f_jax)(jnp.array([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.</span>], dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.float64))</span>
<span id="cb25-26"></span>
<span id="cb25-27"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb25-28"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Check the Derivatives!</span></span>
<span id="cb25-29"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Variable A:</span></span>
<span id="cb25-30"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Primal difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(primal1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> primal2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb25-31"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb25-32"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  JVP difference (FD): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(jvp1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> jvp_fd)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb25-33"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Gradient difference: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>linalg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>norm(grad1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> grad2)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb25-34"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Error in log-determinant =  0.00e+00</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>
Check the Derivatives!
Variable A:
  Primal difference: 0.0
  JVP difference: 8.526512829121202e-13
  JVP difference (FD): 4.171707900013644e-06
  Gradient difference: 8.881784197001252e-16
</code></pre>
</div>
</div>
<p>Much better!</p>
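<p>The identity doing the work in the JVP rule above is d log det(A) = tr(A⁻¹ dA). It is easy to sanity-check in dense land with <code>jax.custom_jvp</code>; this is my own sketch, not the post’s code, and it uses the full dense inverse where the sparse version gets away with the partial inverse:</p>

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

@jax.custom_jvp
def dense_log_det(A):
    # Log-determinant of an SPD matrix via its Cholesky factor.
    L = jnp.linalg.cholesky(A)
    return 2.0 * jnp.sum(jnp.log(jnp.diag(L)))

@dense_log_det.defjvp
def dense_log_det_jvp(primals, tangents):
    (A,), (A_dot,) = primals, tangents
    value = dense_log_det(A)
    # d log det(A) = tr(A^{-1} dA); for symmetric A this is an
    # elementwise sum against the (dense) inverse.
    tangent_out = jnp.sum(jnp.linalg.inv(A) * A_dot)
    return value, tangent_out
```

<p>Because the tangent rule is linear in <code>A_dot</code>, JAX can transpose it automatically, so <code>jax.grad</code> of this function works and returns the (dense) inverse.</p>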
</section>
</section>
<section id="wrapping-up" class="level2">
<h2 class="anchored" data-anchor-id="wrapping-up">Wrapping up</h2>
<p>And that is where we will leave it for today. Next up, I’m probably going to need to do the autodiff for the Cholesky factorisation. It’s not <em>hard</em>, but it is tedious<sup>47</sup> and this post is already very long.</p>
<p>After that we need a few more things:</p>
<ol type="1">
<li><p>Compilation rules for all of these things. For the most part, we can just wrap the relevant parts of <a href="https://github.com/libigl/eigen">Eigen</a>. The only non-trivial code would be the partial inverse. That will allow us to JIT shit.</p></li>
<li><p>We need to beef up the sparse matrix class a little. In particular, we are going to need addition and scalar multiplication at the very minimum to make this useful.</p></li>
<li><p>Work out how <a href="https://aesara.readthedocs.io/en/latest/">Aesara</a> works so we can try to prototype a PyMC model.</p></li>
</ol>
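<p>For point 2, scalar multiplication just scales the value array, but adding two CSC matrices with different sparsity patterns needs a sorted merge within each column. A hypothetical sketch (these helpers are mine, not part of the post’s sparse class):</p>

```python
import numpy as np

def csc_scale(alpha, Ax):
    # Scalar multiplication only touches the stored values.
    return alpha * np.asarray(Ax)

def csc_add(n_col, Ai, Ap, Ax, Bi, Bp, Bx):
    # Add two CSC matrices column by column. Within a column, CSC row
    # indices are sorted, so this is a classic two-pointer merge.
    Ci, Cx, Cp = [], [], [0]
    sentinel = float("inf")
    for col in range(n_col):
        a, a_end = Ap[col], Ap[col + 1]
        b, b_end = Bp[col], Bp[col + 1]
        while a < a_end or b < b_end:
            ra = Ai[a] if a < a_end else sentinel
            rb = Bi[b] if b < b_end else sentinel
            if ra < rb:
                Ci.append(ra); Cx.append(Ax[a]); a += 1
            elif rb < ra:
                Ci.append(rb); Cx.append(Bx[b]); b += 1
            else:  # entry present in both matrices: values add
                Ci.append(ra); Cx.append(Ax[a] + Bx[b]); a += 1; b += 1
        Cp.append(len(Ci))
    return np.array(Ci), np.array(Cp), np.array(Cx)
```

<p>When both matrices share a pattern (as the shifted matrices in this post do), the merge degenerates to <code>Ax + Bx</code>, which is why the diagonal-shift trick above never needed any of this machinery.</p>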
<p>That will be <em>a lot</em> more blog posts. But I’m having fun. So why the hell not.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I am sorry Cholesky factorisation, this blog is already too long and there is simply too much code I need to make nicer to even start on that journey. So it will happen in a later blog.↩︎</p></li>
<li id="fn2"><p>Which I have spent <em>zero</em> effort making pretty or taking to any level above scratch code↩︎</p></li>
<li id="fn3"><p>Like making it clear how this works for a <em>sparse</em> matrix compared to a general one↩︎</p></li>
<li id="fn4"><p>To the best of my knowledge, for example, we don’t know how to differentiate with respect to the order parameter <img src="https://latex.codecogs.com/png.latex?%5Cnu"> in the modified Bessel function of the second kind <img src="https://latex.codecogs.com/png.latex?K_%5Cnu(x)">. This is important in spatial statistics (and general GP stuff).↩︎</p></li>
<li id="fn5"><p><em>You</em> may need to convince yourself that this is possible. But it is. The cone of SPD matrices is very nice.↩︎</p></li>
<li id="fn6"><p>Don’t despair if you don’t recognise the third line, it’s the Neumann series, which gives an approximation to <img src="https://latex.codecogs.com/png.latex?(I%20+%20B)%5E%7B-1%7D"> whenever <img src="https://latex.codecogs.com/png.latex?%5C%7CB%5C%7C%20%5Cll%201">.↩︎</p></li>
<li id="fn7"><p>I recognise that I’ve not explained why everything needs to be JAX-traceable. Basically it’s because JAX does clever transformations to the Jacobian-vector product code to produce things like gradients. And the only way that can happen is if the JVP code can take abstract JAX types. So we need to make it traceable because we <em>really</em> want to have gradients!↩︎</p></li>
<li id="fn8"><p>Why not now, Daniel? Why not now? Well mostly because I might need to do some tweaking down the line, so I am not messing around until I am done.↩︎</p></li>
<li id="fn9"><p>This is the primary difference between implementing forward mode and reverse mode: there is only one output here. When we move onto reverse mode, we will output a tuple Jacobian-transpose-vector products, one for each input. You can see the structure of that reflected in the transposition rule we are going to write later.↩︎</p></li>
<li id="fn10"><p>Some things: Firstly your function needs to have the correct signature for this to work. Secondly, you could also use <code>ad.defjvp()</code> if you didn’t need to use the primal value to define the tangent (recall one of our tangents is <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D%5CDelta%20c">, where <img src="https://latex.codecogs.com/png.latex?c%20=%20A%5E%7B-1%7Db"> is the primal value).↩︎</p></li>
<li id="fn11"><p>This is because it is the efficient way of computing a gradient. Forward-mode autodiff chains together Jacobian-vector products in such a way that a single sweep of the entire function computes a single directional derivative. Reverse-mode autodiff chains together Jacobian-transpose-vector products (aka vector-Jacobian products) in such a way that a single sweep produces an entire gradient. (This happens at the cost of quite a bit of storage.) Depending on what you are trying to do, you usually want one or the other (or sometimes a clever combination of both).↩︎</p></li>
<li id="fn12"><p>or gradients or some sort of thing.↩︎</p></li>
<li id="fn13"><p>to be honest, in Stan we sometimes just don’t dick around with the forward-mode autodiff, because gradients are our bread and butter.↩︎</p></li>
<li id="fn14"><p>I mean, I love you, programming language people. But fuck me, this paper could’ve been written in Babylonian cuneiform for all I understood it.↩︎</p></li>
<li id="fn15"><p>That is, if you fix a value of <img src="https://latex.codecogs.com/png.latex?y">, <img src="https://latex.codecogs.com/png.latex?f_y(x)%20=%20f(x,%20y)"> is not an affine function.↩︎</p></li>
<li id="fn16"><p>Details bore me.↩︎</p></li>
<li id="fn17"><p>In general, there might need to be a little bit of reshaping, but it’s equivalent.↩︎</p></li>
<li id="fn18"><p>Have you noticed this is like the third name I’ve used for the same concept? Or the fourth? The code calls it a cotangent because that’s another damn synonym. I’m so very sorry.↩︎</p></li>
<li id="fn19"><p>not difficult, I’m just lazy and Mike does it better than I can. Read his paper.↩︎</p></li>
<li id="fn20"><p>For sparse matrices it’s just the non-zero mask of that.↩︎</p></li>
<li id="fn21"><p>Yes. I know. Central differences. I am what I am.↩︎</p></li>
<li id="fn22"><p>Some of the stuff I’ve done, like normalising all of the inputs, would help make these tests more stable. You should also just pick up Nick Higham’s backwards error analysis book to get some ideas of what your guarantees actually are in floating point, but I truly cannot be bothered. This is scratch code.↩︎</p></li>
<li id="fn23"><p>It should be slightly bigger; it isn’t.↩︎</p></li>
<li id="fn24"><p>The largest number <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> such that <code>float(1.0) == float(1.0 + machine_eps)</code> in single precision floating point.↩︎</p></li>
<li id="fn25"><p>Fun fact: I implemented this and the error never spawned, so I guess JAX is keeping the index arrays concrete, which is very nice of it!↩︎</p></li>
<li id="fn26"><p>actual damn numbers↩︎</p></li>
<li id="fn27"><p>We want that <a href="https://youtu.be/wrnUJoj14ag?t=288">auld triangle to go jingle bloody jangle</a>↩︎</p></li>
<li id="fn28"><p>We definitely do not want someone to write an eight hour, two part play that really seems to have the point of view that our Cholesky triangle deserved his downfall. Espoused while periodically reading deadshit tumblr posts. I mean, it would win a Tony. But we still do not want that.↩︎</p></li>
<li id="fn29"><p>There are more arguments. Read the help. This is what we need↩︎</p></li>
<li id="fn30"><p>What if I told you that this would work perfectly well if <img src="https://latex.codecogs.com/png.latex?A"> was a linear partial differential operator or an integral operator? Probably not much because why would you give a shit?↩︎</p></li>
<li id="fn31"><p>It can be more general, but it isn’t↩︎</p></li>
<li id="fn32"><p>I think there is a typo in the docs↩︎</p></li>
<li id="fn33"><p>Full disclosure: I screwed this up multiple times today and my tests caught it. What does that look like? The derivatives for <img src="https://latex.codecogs.com/png.latex?A"> being off, but everything else being good.↩︎</p></li>
<li id="fn34"><p>And some optional keyword arguments, but we don’t need to worry about those↩︎</p></li>
<li id="fn35"><p>This is not quite the same as, but similar to, something that functional programming people call <em>currying</em>, which was named after famous Australian Olympic swimmer Lisa Curry.↩︎</p></li>
<li id="fn36"><p>and a shitload simpler!↩︎</p></li>
<li id="fn37"><p>And we have to store a bunch more. This is less of a big deal when <img src="https://latex.codecogs.com/png.latex?L"> is sparse, but for an ordinary linear solve, we’d be hauling around an extra <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E2)"> floats containing tangents for no good reason.↩︎</p></li>
<li id="fn38"><p>If you are worrying about the suppressed constant, remember that <img src="https://latex.codecogs.com/png.latex?A"> (and therefore <img src="https://latex.codecogs.com/png.latex?n"> and <img src="https://latex.codecogs.com/png.latex?%5C%7CA%5C%7C">) is fixed.↩︎</p></li>
<li id="fn39"><p>I think I’ve made this mistake about four times already while writing this blog. So I am going to write it <em>out</em>.↩︎</p></li>
<li id="fn40"><p>Not to “some of my best friends are physicists”, but I do love them. I just wish a man would talk about me the way they talk about being coordinate free. Rather than with the same ambivalence physicists use when speaking about a specific atlas. I’ve been listening to lesbian folk music all evening. I’m having feelings.↩︎</p></li>
<li id="fn41"><p>pronoun on purpose↩︎</p></li>
<li id="fn42"><p>Takahashi, K., Fagan, J., Chen, M.S., 1973. Formation of a sparse bus impedance matrix and its application to short circuit study. In: Eighth PICA Conference Proceedings. IEEE Power Engineering Society, pp.&nbsp;63–69 (Papers Presented at the 1973 Power Industry Computer Application Conference in Minneapolis, MN).↩︎</p></li>
<li id="fn43"><p>Thanks to Jerzy Baranowski for finding a very very bad LaTeX error that made these questions quite wrong!↩︎</p></li>
<li id="fn44"><p>Indeed, in the notation of post two <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D_i%20%5Ccap%20%5C%7Bi+1,%20%5Cdots,%20n%5C%7D%20%5Csubseteq%20%5Cmathcal%7BL%7D_j"> for all <img src="https://latex.codecogs.com/png.latex?i%20%5Cleq%20j">, where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D_i"> is the set of non-zeros in the <img src="https://latex.codecogs.com/png.latex?i">th column of <img src="https://latex.codecogs.com/png.latex?L">.↩︎</p></li>
<li id="fn45"><p>The sparse matrix is stored as a dictionary <code>{(i,j): value}</code>, which is a very natural way to build a sparse matrix, even if it’s quite inefficient to do anything with it in that form.↩︎</p></li>
<li id="fn46"><p>You can’t just use <code>jnp.linalg.det()</code> because there’s a tendency towards <code>nan</code>s. (The true value is something like <code>exp(250.49306761204593)</code>!)↩︎</p></li>
<li id="fn47"><p>Would it be less tedious if my implementation of the Cholesky was less shit? Yes. But hey. It was the first non-trivial piece of python code I’d written in more than a decade (or maybe ever?) so it is what it is. Anyway. I’m gonna run into the same problem I had in <a href="https://dansblog.netlify.app/posts/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey/">Part 3</a>↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Sparse Matrices 6: {To} Catch a Derivative, First You’ve Got
    to Think Like a Derivative},
  date = {2022-05-30},
  url = {https://dansblog.netlify.app/to-catch-a-derivative-first-youve-got-to-think-like-a-derivative},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Sparse Matrices 6: To Catch a Derivative,
First You’ve Got to Think Like a Derivative.”</span> May 30, 2022. <a href="https://dansblog.netlify.app/to-catch-a-derivative-first-youve-got-to-think-like-a-derivative">https://dansblog.netlify.app/to-catch-a-derivative-first-youve-got-to-think-like-a-derivative</a>.
</div></div></section></div> ]]></description>
  <category>JAX</category>
  <category>Sparse matrices</category>
  <category>Autodiff</category>
  <guid>https://dansblog.netlify.app/posts/2022-05-20-to-catch-a-derivative-first-youve-got-to-think-like-a-derivative/to-catch-a-derivative-first-youve-got-to-think-like-a-derivative.html</guid>
  <pubDate>Sun, 29 May 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-05-20-to-catch-a-derivative-first-youve-got-to-think-like-a-derivative/sob.JPG" medium="image"/>
</item>
<item>
  <title>Sparse Matrices 5: I bind you Nancy</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-05-18-sparse4-some-primatives/sparse4-some-primatives.html</link>
  <description><![CDATA[ 





<p>This is part <em>five</em> of our <a href="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/">ongoing</a> <a href="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/">series</a> <a href="https://dansblog.netlify.app/posts/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey/">on</a> <a href="https://dansblog.netlify.app/posts/2022-05-16-design-is-my-passion-sparse-matrices-part-four/">implementing</a> differentiable sparse linear algebra in JAX. In some sense this is the last boring post before we get to the derivatives. Was this post going to include the derivatives? It sure was, but then I realised that a different choice was to go to bed so I could get up nice and early in the morning and vote in our election.</p>
<p>It goes without saying that before I split the posts, it was more than twice as long and I was nowhere near finished. So probably the split was a good choice.</p>
<section id="but-how-do-you-add-a-primative-to-jax" class="level2">
<h2 class="anchored" data-anchor-id="but-how-do-you-add-a-primative-to-jax">But how do you add a primitive to JAX?</h2>
<p>Well, the first step is you <a href="https://jax.readthedocs.io/en/latest/notebooks/How_JAX_primitives_work.html">read the docs.</a></p>
<p>They tell you that you need to implement a few things:</p>
<ul>
<li>An implementation of the call with “abstract types”</li>
<li>An implementation of the call with concrete types (aka evaluating the damn function)</li>
</ul>
<p>Then,</p>
<ul>
<li><p>if you want your primitive to be JIT-able, you need to implement a compilation rule.</p></li>
<li><p>if you want your primitive to be batch-able, you need to implement a batching rule.</p></li>
<li><p>if you want your primitive to be differentiable, you need to implement the derivatives in a way that allows them to be propagated appropriately.</p></li>
</ul>
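<p>Before we do that for the sparse routines, the registration pattern is worth seeing in isolation. Here is a minimal sketch with a toy primitive of my own (a hypothetical <code>double</code> that multiplies by two): it registers only the concrete implementation and the abstract evaluation rule, so eager calls and tracing work but nothing else does. The imports match the JAX version this post was written against; newer releases move <code>Primitive</code> to <code>jax.extend.core</code>, which the <code>try</code>/<code>except</code> covers.</p>

```python
import numpy as np
import jax
from jax import core  # for ShapedArray

try:  # newer JAX releases moved Primitive here
    from jax.extend.core import Primitive
except ImportError:  # older JAX, as used in this post
    from jax.core import Primitive

# A toy primitive that doubles its input (hypothetical; not part of JAX).
double_p = Primitive("double")

def double(x):
    """User-facing wrapper: hand the call off to JAX."""
    return double_p.bind(x)

@double_p.def_impl
def double_impl(x):
    # Concrete evaluation. This does not need to be JAX-traceable;
    # plain numpy is fine.
    return 2.0 * np.asarray(x)

@double_p.def_abstract_eval
def double_abstract_eval(x):
    # Abstract evaluation: propagate shape and dtype only, no values.
    return core.ShapedArray(x.shape, x.dtype)

out = double(np.arange(3.0))                 # eager call hits double_impl
jaxpr = jax.make_jaxpr(double)(np.ones(3))   # tracing hits the abstract rule
```

<p>Anything beyond this (jit, vmap, derivatives) will fail until the corresponding rules are registered, which is exactly the to-do list above.</p>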
<p>In this post, we are going to do the first task: we are going to register JAX-traceable versions of the four main primitives we are going to need for our task. For the most part, the implementations here will be replaced with C++ bindings (because only a fool writes their own linear algebra code). But this is the beginning<sup>1</sup> of our serious journey into JAX.</p>
</section>
<section id="first-things-first-some-primitives" class="level2">
<h2 class="anchored" data-anchor-id="first-things-first-some-primitives">First things first, some primitives</h2>
<p>In JAX-speak, a primitive is a function that is JAX-traceable<sup>2</sup>. It is not necessary for every possible transformation to be implemented. In fact, today I’m not going to implement <em>any</em> transformations. That is a problem for future Dan.</p>
<p>We have enough today problems.</p>
<p>Because today we need to write four new primitives.</p>
<p>But first of all, let’s build up a test matrix so we can at least check that this code runs. This is the same example from <a href="https://dansblog.netlify.app/posts/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey/">blog 3</a>. You can tell my PhD was in numerical analysis because I fucking love a 2D Laplacian.</p>
<div class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sparse</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> make_matrix(n):</span>
<span id="cb1-5">    one_d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.diags([[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n, [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)], [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb1-6">    A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (sparse.kronsum(one_d, one_d) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> sparse.eye(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n)).tocsc()</span>
<span id="cb1-7">    A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.tril(A, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"csc"</span>)</span>
<span id="cb1-8">    A_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indices</span>
<span id="cb1-9">    A_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indptr</span>
<span id="cb1-10">    A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.data</span>
<span id="cb1-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_index, A_indptr, A_x, A)</span>
<span id="cb1-12"></span>
<span id="cb1-13">A_indices, A_indptr, A_x, A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span></code></pre></div>
</div>
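<p>In case the earlier posts in the series are a distant memory: the three returned arrays are just standard CSC storage. Here is what they look like on a tiny 3 × 3 example of my own (not the Laplacian above).</p>

```python
import numpy as np
from scipy import sparse

# A small lower-triangular matrix, stored in CSC format.
M = sparse.csc_array(np.array([[2.0, 0.0, 0.0],
                               [1.0, 3.0, 0.0],
                               [0.0, 0.0, 4.0]]))

# data holds the non-zeros column by column, indices holds their row
# numbers, and indptr[j]:indptr[j+1] delimits column j in both arrays.
print(M.data)     # [2. 1. 3. 4.]
print(M.indices)  # [0 1 1 2]
print(M.indptr)   # [0 2 3 4]
```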
<section id="primitive-one-a-1b" class="level3">
<h3 class="anchored" data-anchor-id="primitive-one-a-1b">Primitive one: <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7Db"></h3>
<p>Because I’m feeling lazy today and we don’t actually need the Cholesky directly for any of this, I’m going to just use scipy. Why? Well, honestly, just because I’m lazy. But also so I can prove an important point: the implementation of the primitive <em>does not</em> need to be JAX traceable. So I’m implementing it in a way that is not now and will likely never be JAX traceable<sup>3</sup>.</p>
<p>First off, we need to write the solve function and bind it<sup>4</sup> to JAX. Specific information about what exactly some of these commands are doing can be found <a href="https://jax.readthedocs.io/en/latest/notebooks/How_JAX_primitives_work.html#primal-evaluation-rules">in the docs</a>, but the key thing is that there is <em>no reason</em> to dick around with JAX types in any of these implementation functions. They are only ever called using (essentially) numpy<sup>5</sup> arrays. So we can just program like normal human beings.</p>
<div class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jnp</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> core</span>
<span id="cb2-3"></span>
<span id="cb2-4">sparse_solve_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_solve"</span>)</span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_solve(A_indices, A_indptr, A_x, b):</span>
<span id="cb2-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""A JAX traceable sparse solve"""</span></span>
<span id="cb2-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_solve_p.bind(A_indices, A_indptr, A_x, b)</span>
<span id="cb2-9"></span>
<span id="cb2-10"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_solve_p.def_impl</span></span>
<span id="cb2-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_solve_impl(A_indices, A_indptr, A_x, b):</span>
<span id="cb2-12">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""The implementation of the sparse solve. This is not JAX traceable."""</span></span>
<span id="cb2-13">  A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.csc_array((A_x, A_indices, A_indptr)) </span>
<span id="cb2-14">  </span>
<span id="cb2-15">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> A_lower.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> A_lower.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb2-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> A_lower.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-17">  </span>
<span id="cb2-18">  A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> A_lower.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> sparse.diags(A_lower.diagonal())</span>
<span id="cb2-19">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse.linalg.spsolve(A, b)</span>
<span id="cb2-20"></span>
<span id="cb2-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Check it works</span></span>
<span id="cb2-22">b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.ones(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb2-23">x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_solve(A_indices, A_indptr, A_x, b)</span>
<span id="cb2-24"></span>
<span id="cb2-25"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The error in the sparse solve is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> x))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The error in the sparse solve is  0.00e+00</code></pre>
</div>
</div>
<p>In order to facilitate its transformations, JAX will occasionally<sup>6</sup> call functions using <em>abstract</em> data types. These data types know the shape of the inputs and their data type. So our next step is to specialise the <code>sparse_solve</code> function for this case. We might as well do some shape checking while we’re just hanging around. But the essential part of this function is just saying that the output of <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7Db"> is the same shape as <img src="https://latex.codecogs.com/png.latex?b"> (which is usually a vector, but the code is no more complex if it’s a [dense] matrix).</p>
<div class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax._src <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> abstract_arrays</span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_solve_p.def_abstract_eval</span></span>
<span id="cb4-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_solve_abstract_eval(A_indices, A_indptr, A_x, b):</span>
<span id="cb4-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> A_indices.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> A_x.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> A_indptr.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb4-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> abstract_arrays.ShapedArray(b.shape, b.dtype)</span></code></pre></div>
</div>
</section>
</section>
<section id="primitive-two-the-triangular-solve" class="level2">
<h2 class="anchored" data-anchor-id="primitive-two-the-triangular-solve">Primitive two: The triangular solve</h2>
<p>This is very similar. We need to have a function that computes <img src="https://latex.codecogs.com/png.latex?L%5E%7B-1%7Db"> and <img src="https://latex.codecogs.com/png.latex?L%5E%7B-T%7Db">. The extra wrinkle from the last time around is that we need to pass a keyword argument <code>transpose</code> to indicate which system should be solved.</p>
<p>Once again, we are going to use the appropriate <code>scipy</code> function (in this case <code>sparse.linalg.spsolve_triangular</code>). There’s a little bit of casting between sparse matrix types here as <code>sparse.linalg.spsolve_triangular</code> assumes the matrix is in CSR format.</p>
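<p>That casting is less arbitrary than it looks: transposing a CSC matrix just reinterprets the same index arrays as CSR, so <code>L.T</code> is already in the format <code>spsolve_triangular</code> wants, while the non-transposed solve needs an explicit <code>.tocsr()</code>. A tiny sketch (a 2 × 2 example of my own):</p>

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve_triangular

# A small lower-triangular matrix stored in CSC, like the post's matrices.
L = sparse.csc_array(np.array([[2.0, 0.0],
                               [1.0, 3.0]]))
b = np.array([2.0, 7.0])

# Forward solve L x = b: spsolve_triangular expects CSR, so convert.
x = spsolve_triangular(L.tocsr(), b, lower=True)

# Transpose solve L^T y = b: L.T is an (upper-triangular) CSR matrix
# "for free", so no conversion is needed.
y = spsolve_triangular(L.T, b, lower=False)
```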
<div class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1">sparse_triangular_solve_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_triangular_solve"</span>)</span>
<span id="cb5-2"></span>
<span id="cb5-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve(L_indices, L_indptr, L_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">bool</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb5-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""A JAX traceable sparse  triangular solve"""</span></span>
<span id="cb5-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_triangular_solve_p.bind(L_indices, L_indptr, L_x, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transpose)</span>
<span id="cb5-6"></span>
<span id="cb5-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_triangular_solve_p.def_impl</span></span>
<span id="cb5-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve_impl(L_indices, L_indptr, L_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb5-9">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""The implementation of the sparse triangular solve. This is not JAX traceable."""</span></span>
<span id="cb5-10">  L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.csc_array((L_x, L_indices, L_indptr)) </span>
<span id="cb5-11">  </span>
<span id="cb5-12">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> L.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb5-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> L.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb5-14">  </span>
<span id="cb5-15">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> transpose:</span>
<span id="cb5-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse.linalg.spsolve_triangular(L.T, b, lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-17">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb5-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse.linalg.spsolve_triangular(L.tocsr(), b, lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div>
</div>
<p>Now we can check if it works. We can use the fact that our matrix <code>(A_indices, A_indptr, A_x)</code> is lower-triangular (because we only store the lower triangle) to make our test case.</p>
<div class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Check if it works</span></span>
<span id="cb6-2">b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.standard_normal(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb6-3">x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(A_indices, A_indptr, A_x, b)</span>
<span id="cb6-4">x2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_triangular_solve(A_indices, A_indptr, A_x, b, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""Error in triangular solve: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> sparse.tril(A) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> x1))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span></span>
<span id="cb6-6"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Error in triangular transpose solve: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> sparse.triu(A) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> x2))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Error in triangular solve:  3.53e-15
Error in triangular transpose solve:  5.08e-15</code></pre>
</div>
</div>
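<p>As a sanity check on the test above, SciPy's built-in triangular solver should behave the same way. Here's a hedged cross-check on a made-up random SPD matrix (the names <code>M</code> and <code>n</code> and the density are illustrative, not from the post):</p>

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve_triangular

# Hypothetical SPD test matrix (illustrative, not the post's A).
rng = np.random.default_rng(42)
n = 10
M = sparse.random(n, n, density=0.3, random_state=rng).tocsc()
A = (M @ M.T + n * sparse.eye(n)).tocsc()
b = rng.standard_normal(n)

# Solve against the lower triangle, and against its transpose
# (which, since A is symmetric, is just triu(A)).
x1 = spsolve_triangular(sparse.tril(A).tocsr(), b, lower=True)
x2 = spsolve_triangular(sparse.triu(A).tocsr(), b, lower=False)

err1 = np.sum(np.abs(b - sparse.tril(A) @ x1))
err2 = np.sum(np.abs(b - sparse.triu(A) @ x2))
```

Both errors should be at machine-precision level, just like the hand-rolled solver's.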
<p>And we can also do the abstract evaluation.</p>
<div class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_triangular_solve_p.def_abstract_eval</span></span>
<span id="cb8-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_triangular_solve_abstract_eval(L_indices, L_indptr, L_x, b, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, transpose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb8-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> L_indices.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_x.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb8-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> b.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indptr.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb8-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> abstract_arrays.ShapedArray(b.shape, b.dtype)</span></code></pre></div>
</div>
<p>Great! Now on to the next one!</p>
<section id="primitive-three-the-sparse-cholesky" class="level3">
<h3 class="anchored" data-anchor-id="primitive-three-the-sparse-cholesky">Primitive three: The sparse Cholesky</h3>
<p>Ok. This one is gonna be a pain in the arse. But we need to do it. Why? Because we are going to need a JAX-traceable version further on down the track.</p>
<p>The issue here is that the non-zero pattern of the Cholesky decomposition is computed <em>on the fly</em>. This is absolutely not allowed in JAX. It <em>must</em> know the shape of all things at the moment it is called.</p>
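<p>A minimal illustration of that constraint (my example, not from the post): <code>jnp.nonzero</code> has a value-dependent output shape, so under <code>jit</code> it requires a static <code>size</code> argument.</p>

```python
import jax
import jax.numpy as jnp

x = jnp.array([0.0, 1.0, 0.0, 2.0])

# jax.jit(jnp.nonzero)(x) fails: the output length depends on the
# *values* of x, which are abstract at trace time. Supplying a static
# `size` makes the output shape known up front.
idx = jax.jit(lambda v: jnp.nonzero(v, size=2)[0])(x)
# idx == [1, 3]
```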
<p>This is going to make for a somewhat shitty user experience for this function. It’s unavoidable with JAX designed<sup>7</sup> the way it is.</p>
<p>The code in <code>jax.experimental.sparse.bcoo.fromdense</code> has this exact problem. In their case, they are turning a dense matrix into a sparse matrix and they can’t know until they see the dense matrix how many non-zeros there are. So they do the sensible thing and ask the user to specify it. They do this using the <code>nse</code> keyword parameter. If you’re curious what <code>nse</code> stands for, it turns out it’s not “non-standard evaluation” but rather “number of specified entries”. Most other systems use the abbreviation <code>nnz</code> for “number of non-zeros”, but I’m going to stick with the JAX notation.</p>
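<p>For the curious, here's roughly what that looks like with <code>BCOO</code> (a sketch on a made-up two-by-two matrix):</p>

```python
import jax
import jax.numpy as jnp
from jax.experimental import sparse as jsparse

M = jnp.array([[1.0, 0.0],
               [0.0, 2.0]])

# Outside of jit, the number of specified entries can be counted
# directly from the concrete matrix.
A1 = jsparse.BCOO.fromdense(M)

# Under jit the entries of M are abstract, so nse has to be supplied
# as a static quantity (here we assume we know it's 2).
A2 = jax.jit(lambda m: jsparse.BCOO.fromdense(m, nse=2))(M)
```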
<p>The one little thing we need to add to this code is a guard to make sure that if the <code>sparse_cholesky</code> function is called without specifying <code>L_nse</code>, it still works when the inputs are concrete, but fails with an informative error when it's being traced abstractly.</p>
<div class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb9-1">sparse_cholesky_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_cholesky"</span>)</span>
<span id="cb9-2"></span>
<span id="cb9-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_cholesky(A_indices, A_indptr, A_x, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb9-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""A JAX traceable sparse cholesky decomposition"""</span></span>
<span id="cb9-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> L_nse <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb9-6">    err_string <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You need to pass a value to L_nse when doing fancy sparse_cholesky."</span></span>
<span id="cb9-7">    _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.concrete_or_error(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, A_x, err_string)</span>
<span id="cb9-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_cholesky_p.bind(A_indices, A_indptr, A_x, L_nse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_nse)</span>
<span id="cb9-9"></span>
<span id="cb9-10"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_cholesky_p.def_impl</span></span>
<span id="cb9-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_cholesky_impl(A_indices, A_indptr, A_x, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb9-12">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""The implementation of the sparse Cholesky. This is not JAX traceable."""</span></span>
<span id="cb9-13">  </span>
<span id="cb9-14">  L_indices, L_indptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor(A_indices, A_indptr)</span>
<span id="cb9-15">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> L_nse <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb9-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indices) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_nse</span>
<span id="cb9-17">    </span>
<span id="cb9-18">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _structured_copy(A_indices, A_indptr, A_x, L_indices, L_indptr)</span>
<span id="cb9-19">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _sparse_cholesky_impl(L_indices, L_indptr, L_x)</span>
<span id="cb9-20">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr, L_x</span></code></pre></div>
</div>
<p>The rest of the code is just the sparse Cholesky code from <a href="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/">blog 2</a> and I’ve hidden it under the fold. (You would think I would package this up properly, but I simply haven’t. Why not? Who knows<sup>8</sup>.)</p>
<details>
<summary>
Click here to see the implementation
</summary>
<div class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _symbolic_factor(A_indices, A_indptr):</span>
<span id="cb10-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assumes A_indices and A_indptr index the lower triangle of $A$ ONLY.</span></span>
<span id="cb10-3">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-4">  L_sym <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb10-5">  children <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb10-6">  </span>
<span id="cb10-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb10-8">    L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[A_indptr[j]:A_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb10-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> children[j]:</span>
<span id="cb10-10">      tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[child][L_sym[child] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> j]</span>
<span id="cb10-11">      L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.unique(np.append(L_sym[j], tmp))</span>
<span id="cb10-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_sym[j]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb10-13">      p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[j][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb10-14">      children[p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.append(children[p], j)</span>
<span id="cb10-15">        </span>
<span id="cb10-16">  L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb10-17">  L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum([<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> L_sym])</span>
<span id="cb10-18">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate(L_sym)</span>
<span id="cb10-19">  </span>
<span id="cb10-20">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr</span>
<span id="cb10-21"></span>
<span id="cb10-22"></span>
<span id="cb10-23"></span>
<span id="cb10-24"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _structured_copy(A_indices, A_indptr, A_x, L_indices, L_indptr):</span>
<span id="cb10-25">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-26">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indices))</span>
<span id="cb10-27">  </span>
<span id="cb10-28">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n):</span>
<span id="cb10-29">    copy_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(L_indices[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],</span>
<span id="cb10-30">                                  A_indices[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb10-31">    L_x[L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> copy_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_x[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb10-32">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span>
<span id="cb10-33"></span>
<span id="cb10-34"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _sparse_cholesky_impl(L_indices, L_indptr, L_x):</span>
<span id="cb10-35">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-36">  descendant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [[] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n)]</span>
<span id="cb10-37">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n):</span>
<span id="cb10-38">    tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb10-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> bebe <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> descendant[j]:</span>
<span id="cb10-40">      k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bebe[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb10-41">      Ljk<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x[bebe[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb10-42">      pad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(                                                       <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb10-43">          L_indices[L_indptr[k]:L_indptr[k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices[L_indptr[j]])[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb10-44">      update_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(                                        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb10-45">                    L_indices[L_indptr[j]:L_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],                     <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb10-46">                    L_indices[(L_indptr[k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pad):L_indptr[k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb10-47">      tmp[update_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp[update_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>                                     <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb10-48">                        Ljk <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> L_x[(L_indptr[k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pad):L_indptr[k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb10-49">            </span>
<span id="cb10-50">    diag <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(tmp[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb10-51">    L_x[L_indptr[j]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> diag</span>
<span id="cb10-52">    L_x[(L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> diag</span>
<span id="cb10-53">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb10-54">      descendant[L_indices[idx]].append((j, idx))</span>
<span id="cb10-55">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span></code></pre></div>
</div>
</details>
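<p>One reason <code>L_nse</code> can't simply be read off <code>A</code>'s sparsity pattern, by the way, is <em>fill-in</em>: the Cholesky factor generally has more non-zeros than the lower triangle of <code>A</code>. A small dense sketch (made-up matrix, checked with NumPy's dense Cholesky):</p>

```python
import numpy as np

# A sparse SPD matrix whose Cholesky factor gains a non-zero at (2, 1),
# even though A[2, 1] == 0.
A = np.array([[4.0, 1.0, 1.0, 0.0],
              [1.0, 4.0, 0.0, 1.0],
              [1.0, 0.0, 4.0, 1.0],
              [0.0, 1.0, 1.0, 4.0]])
L = np.linalg.cholesky(A)

nnz_tril_A = np.count_nonzero(np.tril(A))   # 8
nnz_L = np.count_nonzero(np.round(L, 12))   # 9: fill-in at L[2, 1]
```

This is exactly why <code>_symbolic_factor</code> has to walk the elimination tree to find the factor's pattern before any numbers get computed.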
<p>Once again, we can check to see if this worked!</p>
<div class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb11-1">L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky(A_indices, A_indptr, A_x)</span>
<span id="cb11-2">L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.csc_array((L_x, L_indices, L_indptr))</span>
<span id="cb11-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The error in the sparse cholesky is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>((A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> L.T).todense()))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The error in the sparse cholesky is  1.02e-13</code></pre>
</div>
</div>
<p>And, of course, we can do abstract evaluation. Here is where we actually need to use <code>L_nse</code> to work out the dimension of our output.</p>
<div class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_cholesky_p.def_abstract_eval</span></span>
<span id="cb13-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_cholesky_abstract_eval(A_indices, A_indptr, A_x, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>, L_nse):</span>
<span id="cb13-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> core.ShapedArray((L_nse,), A_indices.dtype),                   <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb13-4">         core.ShapedArray(A_indptr.shape, A_indptr.dtype),             <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb13-5">         core.ShapedArray((L_nse,), A_x.dtype)</span></code></pre></div>
</div>
</section>
</section>
<section id="primitive-four-loga" class="level2">
<h2 class="anchored" data-anchor-id="primitive-four-loga">Primitive four: <img src="https://latex.codecogs.com/png.latex?%5Clog(%7CA%7C)"></h2>
<p>And now we have our final primitive: the log determinant! Wow. So much binding. For this one, we compute the Cholesky factorisation and note that <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%7CA%7C%20=%20%7CLL%5ET%7C%20=%20%7CL%7C%7CL%5ET%7C%20=%20%7CL%7C%5E2.%0A%5Cend%7Balign*%7D"> If we successfully remember that the determinant of a triangular matrix is the product of its diagonal entries, we have a formula we can implement.</p>
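<p>In dense form, that identity looks like this (a sketch on a made-up SPD matrix, cross-checked against <code>np.linalg.slogdet</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5.0 * np.eye(5)     # symmetric positive definite

# log|A| = log(|L|^2) = 2 * sum(log(diag(L)))
L = np.linalg.cholesky(A)
log_det = 2.0 * np.sum(np.log(np.diag(L)))

sign, ref = np.linalg.slogdet(A)
```

The two values agree (and <code>sign</code> is <code>1.0</code>, as it must be for an SPD matrix).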
<p>Same deal as last time.</p>
<div class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb14-1">sparse_log_det_p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> core.Primitive(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sparse_log_det"</span>)</span>
<span id="cb14-2"></span>
<span id="cb14-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_log_det(A_indices, A_indptr, A_x):</span>
<span id="cb14-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""A JAX traceable sparse log-determinant"""</span></span>
<span id="cb14-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sparse_log_det_p.bind(A_indices, A_indptr, A_x)</span>
<span id="cb14-6"></span>
<span id="cb14-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_log_det_p.def_impl</span></span>
<span id="cb14-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_log_det_impl(A_indices, A_indptr, A_x):</span>
<span id="cb14-9">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""The implementation of the sparse log-determinant. This is not JAX traceable.</span></span>
<span id="cb14-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  """</span></span>
<span id="cb14-11">  L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky_impl(A_indices, A_indptr, A_x)</span>
<span id="cb14-12">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.log(L_x[L_indptr[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))</span></code></pre></div>
</div>
<p>A canny reader may notice that I’m assuming that the first stored element in each column is the diagonal. This will be true as long as the row indices within each column are sorted and the diagonal elements of <img src="https://latex.codecogs.com/png.latex?L"> are non-zero; the latter holds whenever <img src="https://latex.codecogs.com/png.latex?A"> is symmetric positive definite.</p>
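<p>A tiny SciPy illustration of that storage assumption, using a hypothetical 3×3 lower triangle rather than the matrix used in this post:</p>

```python
import numpy as np
from scipy import sparse

# Lower triangle of a small SPD matrix, stored in CSC format.
A_low = sparse.csc_matrix(np.array([[4.0, 0.0, 0.0],
                                    [1.0, 3.0, 0.0],
                                    [0.0, 1.0, 5.0]]))
A_low.sort_indices()

# With sorted indices and a non-zero diagonal, the first stored entry in
# each column is the diagonal element, so indptr[:-1] indexes the diagonal.
first_rows = A_low.indices[A_low.indptr[:-1]]
assert np.array_equal(first_rows, np.arange(3))
assert np.allclose(A_low.data[A_low.indptr[:-1]], [4.0, 3.0, 5.0])
```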
<p>Let’s test<sup>9</sup> it out.</p>
<div class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb15-1">ld <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_log_det(A_indices, A_indptr, A_x)</span>
<span id="cb15-2">LU <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.linalg.splu(A)</span>
<span id="cb15-3">ld_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.log(LU.U.diagonal()))</span>
<span id="cb15-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The error in the log-determinant is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ld <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> ld_true<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: .2e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The error in the log-determinant is  0.00e+00</code></pre>
</div>
</div>
<p>Finally, we can do the abstract evaluation.</p>
<div class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@sparse_log_det_p.def_abstract_eval</span></span>
<span id="cb17-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_log_det_abstract_eval(A_indices, A_indptr, A_x):</span>
<span id="cb17-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> core.ShapedArray((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,), A_x.dtype)</span></code></pre></div>
</div>
</section>
<section id="where-are-we-now-but-nowhere" class="level2">
<h2 class="anchored" data-anchor-id="where-are-we-now-but-nowhere">Where are we now but nowhere?</h2>
<p>So we are done for today. Our next step will be to implement all of the bits that are needed to make the derivatives work. So in the next instalment we will differentiate log-determinants, Cholesky decompositions, and all kinds of other fun things.</p>
<p>It should be a blast.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>The second half of this post is half written but, to be honest, I want to go to bed more than I want to implement more derivatives, so I’m splitting the post.↩︎</p></li>
<li id="fn2"><p>aka JAX can map out how the pieces of the function go together and it can then use that map to make its weird transformations↩︎</p></li>
<li id="fn3"><p>But mostly because although I’m going to have to implement the Cholesky and triangular solves later on down the line, I’m writing this in order and I don’t wanna.↩︎</p></li>
<li id="fn4"><p>The JAX docs don’t use decorators for their bindings but I use decorators because I like decorators.↩︎</p></li>
<li id="fn5"><p>Something something duck type. They’re arrays with numbers in them that work in numpy and scipy. Get off my arse.↩︎</p></li>
<li id="fn6"><p>This is mostly for JIT, so it’s not necessary today, but to be very honest it’s the only easy thing to do here and I’m not above giving myself a participation trophy.↩︎</p></li>
<li id="fn7"><p>This is a … fringe problem in JAX-land, so it makes sense that there is a less than beautiful solution to the problem. I think this would be less of a design problem in Stan, where it’s possible to make the number of unknowns in the autodiff tree depend on <code>int</code> arrays is a complex way.↩︎</p></li>
<li id="fn8"><p>Well, me. I’m who knows. I’m still treating this like scratch code in a notepad. Although we are moving towards the point where I’m going to have to set everything out properly. Maybe that’s the next post?↩︎</p></li>
<li id="fn9"><p>Full disclosure: first time out I forgot to multiply by two. This is why we test.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Sparse {Matrices} 5: {I} Bind You {Nancy}},
  date = {2022-05-20},
  url = {https://dansblog.netlify.app/2022-05-18-sparse4-some-primatives},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Sparse Matrices 5: I Bind You Nancy.”</span>
May 20, 2022. <a href="https://dansblog.netlify.app/2022-05-18-sparse4-some-primatives">https://dansblog.netlify.app/2022-05-18-sparse4-some-primatives</a>.
</div></div></section></div> ]]></description>
  <category>Sparse matrices</category>
  <category>Sparse Cholesky factorisation</category>
  <category>Python</category>
  <category>JAX</category>
  <guid>https://dansblog.netlify.app/posts/2022-05-18-sparse4-some-primatives/sparse4-some-primatives.html</guid>
  <pubDate>Thu, 19 May 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-05-18-sparse4-some-primatives/nancy.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Sparse Matrices 4: Design is my passion</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-05-16-design-is-my-passion-sparse-matrices-part-four/design-is-my-passion-sparse-matrices-part-four.html</link>
  <description><![CDATA[ 





<p>This is the fourth post in a series where I try to squeeze autodiffable sparse matrices into JAX with the aim of speeding up some model classes in PyMC. So far, I have:</p>
<ul>
<li>Outlined the problem <a href="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/">Post 1</a></li>
<li>Worked through a basic python implementation of a sparse Cholesky decomposition <a href="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/">Post 2</a></li>
<li>Failed to get JAX to transform some numpy code into efficient, JIT-compileable code <a href="https://dansblog.netlify.app/posts/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey/">Post 3</a></li>
</ul>
<p>I am in the process of writing a blog on building new primitives<sup>1</sup> into JAX, but as I was doing it I accidentally wrote a long section about options for exposing sparse matrices. It really didn’t fit very well into that blog, so here it is.</p>
<section id="what-are-we-trying-to-do-here" class="level2">
<h2 class="anchored" data-anchor-id="what-are-we-trying-to-do-here">What are we trying to do here?</h2>
<p>If you recall from <a href="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/">the first blog</a>, we need to be able to compute the value and gradients of the (un-normalised) log-posterior <img src="https://latex.codecogs.com/png.latex?%0A%5Clog(p(%5Ctheta%20%5Cmid%20y))%20=%20%5Cfrac%7B1%7D%7B2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ETA%5ETW%5E%7B-1%7Dy%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Clog(%7CQ(%5Ctheta)%7C)%20-%20%5Cfrac%7B1%7D%7B2%7D%5Clog(%7CQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C)%20+%20%5Ctext%7Bconst%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> is a sparse matrix, and <img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20=%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20Q_%7Bu%5Cmid%20y,%5Ctheta%7D(%5Ctheta)%5E%7B-1%7D%20A%5ETW%5E%7B-1%7Dy.%0A"></p>
<p>Overall, our task is to design a system where this un-normalised log-posterior can be evaluated and differentiated efficiently. As with all design problems, there are a lot of different ways that we can implement it. They share a bunch of similarities, so we will actually end up implementing the guts of all of the systems.</p>
<p>To that end, let’s think of all of the ways we can implement our target<sup>2</sup>.</p>
</section>
<section id="option-1-the-direct-design" class="level2">
<h2 class="anchored" data-anchor-id="option-1-the-direct-design">Option 1: The direct design</h2>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?A%20%5Crightarrow%20%5Clog(%7CA%7C)">, for a sparse, symmetric positive definite matrix <img src="https://latex.codecogs.com/png.latex?A"></li>
<li><img src="https://latex.codecogs.com/png.latex?(A,b)%20%5Crightarrow%20A%5E%7B-1%7Db">, for a sparse, symmetric positive definite matrix <img src="https://latex.codecogs.com/png.latex?A"> and a vector <img src="https://latex.codecogs.com/png.latex?b"></li>
</ul>
<p>This option is, in some sense, the most straightforward. We implement primitives for both of the major components of our target and combine them using existing JAX primitives (like addition, scalar multiplication, and dot products).</p>
<p>This is a bad idea.</p>
<p>The problem is that both primitives require the Cholesky decomposition of <img src="https://latex.codecogs.com/png.latex?A">, so if we take this route we might end up computing an extra Cholesky decomposition. And you may ask yourself: <em>what’s an extra Cholesky decomposition between friends?</em></p>
<p>Well, Jonathan, it’s the most expensive operation we are doing for these models, so perhaps we should avoid the 1/3 increase in running time!</p>
<p>There are some ways around this. We might implement sparse, symmetric positive definite matrices as a class that, upon instantiation, computes the Cholesky factorisation.</p>
<div class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> SPDSparse: </span>
<span id="cb1-2">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, A_indices, A_indptr, A_x):</span>
<span id="cb1-3">    <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._perm, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._iperm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _find_perm(A_indices, A_indptr)</span>
<span id="cb1-4">    <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._A_indices, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._A_indptr, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _twist(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._perm, A_indices, A_indptr, A_x)</span>
<span id="cb1-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">try</span>:</span>
<span id="cb1-6">      <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._L_indices, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._L_indptr, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _compute_cholesky()</span>
<span id="cb1-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">except</span> SPDError:</span>
<span id="cb1-8">      <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Matrix is not symmetric positive definite to machine precision."</span>)</span>
<span id="cb1-9">  </span>
<span id="cb1-10">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _find_perm(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, indices, indptr):</span>
<span id="cb1-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Finds the best fill-reducing permutation"""</span></span>
<span id="cb1-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">NotImplemented</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_find_perm"</span>)</span>
<span id="cb1-13">  </span>
<span id="cb1-14">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _twist(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, perm, indices, indptr, x):</span>
<span id="cb1-15">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Returns A[perm, perm]"""</span></span>
<span id="cb1-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">NotImplemented</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_twist"</span>)</span>
<span id="cb1-17">  </span>
<span id="cb1-18">  <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _compute_cholesky():</span>
<span id="cb1-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Compute the Cholesky decomposition of the permuted matrix"""</span></span>
<span id="cb1-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">NotImplemented</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_compute_cholesky"</span>)</span>
<span id="cb1-21">  </span>
<span id="cb1-22">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Not pictured: a whole forest of gets</span></span></code></pre></div>
</div>
<p>In contexts where we need a Cholesky decomposition of every SPD matrix we instantiate, this design might be useful. It might also be useful to write a constructor that takes a <code>jax.experimental.CSCMatrix</code>, so that we could build a differentiable matrix and then just absolutely <em>slam</em> it into our filthy little Cholesky context<sup>3</sup>.</p>
<p>In order to use this type of pattern with JAX, we would need to register it as a Pytree class, which involves writing flatten and unflatten routines. The <a href="https://github.com/google/jax/blob/712ab66f2855acf8a3f3c3977f80edb4447e7644/jax/experimental/sparse/csr.py">CSCSparse class</a> is a good example of how to implement this type of thing. Some care would be needed to make sure the differentiation rules don’t try to do something stupid like differentiate with respect to <code>self.iperm</code> or <code>self.L_x</code>. This is beyond the extra <a href="https://github.com/google/jax/blob/712ab66f2855acf8a3f3c3977f80edb4447e7644/jax/experimental/sparse/ad.py">autodiff sugar</a> in the experimental sparse library.</p>
<p>Implementing this would be quite an undertaking, but it’s certainly an option. The most obvious downside of this pattern (plus a fully functional sparse matrix class) is that it may end up being quite delicate to have this volume of auxiliary information<sup>4</sup> in a pytree while making everything differentiate properly. This doesn’t seem to be how most parts of JAX have been built. There are also a couple of <a href="https://jax.readthedocs.io/en/latest/pytrees.html#custom-pytrees-and-initialization">sharp corners</a> we could run into with instantiation.</p>
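<p>For concreteness, here is a minimal sketch of the pytree registration dance, with a hypothetical <code>SparseVals</code> class that is much simpler than anything we’d actually ship: the index arrays go in as static auxiliary data so that autodiff only ever sees the numeric entries.</p>

```python
import jax
import jax.numpy as jnp
import numpy as np

class SparseVals:
    """Immutable toy container: static index arrays plus numeric entries."""
    def __init__(self, indices, indptr, x):
        self.indices = indices  # static: part of the tree structure
        self.indptr = indptr    # static: part of the tree structure
        self.x = x              # dynamic: the only differentiable leaf

def _flatten(m):
    # children are the differentiable leaves; aux data must be hashable
    return (m.x,), (m.indices, m.indptr)

def _unflatten(aux, children):
    return SparseVals(aux[0], aux[1], children[0])

jax.tree_util.register_pytree_node(SparseVals, _flatten, _unflatten)

m = SparseVals((0, 1), (0, 1, 2), jnp.array([2.0, 3.0]))
# Gradients flow through x only; the index arrays are never differentiated.
g = jax.grad(lambda mat: jnp.sum(mat.x ** 2))(m)
assert np.allclose(np.asarray(g.x), [4.0, 6.0])
```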
<p>To close this out, it’s worth noting a variation on this pattern that comes up: the optional Cholesky. The idea is that rather than compute the permutations and the Cholesky factorisation on initialisation, we store a boolean flag in the class <code>is_cholesky</code> and, whenever we need a Cholesky factor we check <code>is_cholesky</code> and if it’s <code>True</code> we use the computed Cholesky factor and otherwise we compute it and set <code>is_cholesky = True</code>.</p>
<p>This pattern introduces state to the object: it is no longer <em>set and forget</em>. This will not work within JAX<sup>5</sup>, where objects need to be immutable. It’s also not a great pattern in general: it is considerably easier to debug code with stateless objects.</p>
</section>
<section id="option-2-implement-all-of-the-combinations-of-functions-that-we-need" class="level2">
<h2 class="anchored" data-anchor-id="option-2-implement-all-of-the-combinations-of-functions-that-we-need">Option 2: Implement all of the combinations of functions that we need</h2>
<p>Rather than dicking around with classes, we could just implement primitives that compute</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?A%20%5Crightarrow%20%5Clog(%7CA%7C)">, for a sparse, symmetric positive definite matrix <img src="https://latex.codecogs.com/png.latex?A"></li>
<li><img src="https://latex.codecogs.com/png.latex?(A,b,%20c)%20%5Crightarrow%20%5Clog(%7CA%7C)%20+%20c%5ETA%5E%7B-1%7Db">, for a sparse, symmetric positive definite matrix <img src="https://latex.codecogs.com/png.latex?A"> and vectors <img src="https://latex.codecogs.com/png.latex?b"> and <img src="https://latex.codecogs.com/png.latex?c">.</li>
</ul>
<p>This is exactly what we need to do our task and nothing more. It won’t result in any unnecessary Cholesky factors. It doesn’t need us to store computed Cholesky factors. We can simply eat, pray, love.</p>
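<p>A dense sketch of what that fused second primitive computes, to make the point that a single Cholesky factor serves both terms:</p>

```python
import numpy as np
from scipy.linalg import solve_triangular

def logdet_plus_quad(A, b, c):
    """Dense version of the fused map (A, b, c) -> log|A| + c^T A^{-1} b.

    One Cholesky factor is shared between the two terms, which is the
    whole point of fusing them into a single primitive.
    """
    L = np.linalg.cholesky(A)
    log_det = 2.0 * np.log(np.diag(L)).sum()
    # c^T A^{-1} b = (L^{-1} c)^T (L^{-1} b)
    y = solve_triangular(L, b, lower=True)
    z = solve_triangular(L, c, lower=True)
    return log_det + z @ y

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4.0 * np.eye(4)
b, c = rng.standard_normal(4), rng.standard_normal(4)
ref = np.linalg.slogdet(A)[1] + c @ np.linalg.solve(A, b)
assert np.allclose(logdet_plus_quad(A, b, c), ref)
```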
<p>The obvious downside to this option is that it’s going to massively expand the codebase if there are more things that we want to do. It’s also not obvious why we would do this instead of just making <img src="https://latex.codecogs.com/png.latex?%5Clog%20p(%5Ctheta%20%5Cmid%20y)"> a primitive<sup>6</sup>.</p>
</section>
<section id="option-3-just-compute-the-cholesky" class="level2">
<h2 class="anchored" data-anchor-id="option-3-just-compute-the-cholesky">Option 3: Just compute the Cholesky</h2>
<p>Our third option is to simply compute (and differentiate) the Cholesky factor directly. We can then compute <img src="https://latex.codecogs.com/png.latex?%5Clog(%7CA%7C)"> and <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7Db"> through a combination of differentiable operations on the elements of the Cholesky factor (for <img src="https://latex.codecogs.com/png.latex?%5Clog(%7CA%7C)">) and triangular linear solves <img src="https://latex.codecogs.com/png.latex?L%5E%7B-1%7Db"> and <img src="https://latex.codecogs.com/png.latex?L%5E%7B-T%7Dc"> (for <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7Db">).</p>
<p>Hence we require the following two<sup>7</sup> JAX primitives:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?A%20%5Crightarrow%20%5Coperatorname%7Bchol%7D(A)">, where <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bchol%7D(A)"> is the Cholesky factor of <img src="https://latex.codecogs.com/png.latex?A">,</li>
<li><img src="https://latex.codecogs.com/png.latex?(L,%20b)%20%5Crightarrow%20L%5E%7B-1%7D%20b"> and <img src="https://latex.codecogs.com/png.latex?(L,%20b)%20%5Crightarrow%20L%5E%7B-T%7Db"> for lower-triangular sparse matrix <img src="https://latex.codecogs.com/png.latex?L">.</li>
</ul>
<p>This is pretty close to how the dense version of this function would be implemented.</p>
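<p>In dense form, those two primitives compose like this (NumPy/SciPy rather than JAX, purely for illustration):</p>

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4.0 * np.eye(4)
b = rng.standard_normal(4)

L = np.linalg.cholesky(A)                  # primitive 1: A -> chol(A)
y = solve_triangular(L, b, lower=True)     # primitive 2a: (L, b) -> L^{-1} b
x = solve_triangular(L.T, y, lower=False)  # primitive 2b: (L, y) -> L^{-T} y

# Together they give A^{-1} b, and diag(L) gives log|A| essentially for free.
assert np.allclose(A @ x, b)
```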
<p>There are two little challenges with this pattern:</p>
<ol type="1">
<li><p>We are adding another large-ish node <img src="https://latex.codecogs.com/png.latex?L"> to our autodiff tree. As we saw in other patterns, this is unnecessary storage for our problem at hand.</p></li>
<li><p>The number of non-zeros in <img src="https://latex.codecogs.com/png.latex?L"> is a function of the non-zero pattern of <img src="https://latex.codecogs.com/png.latex?A">. This means the Cholesky will need to be implemented very carefully to ensure that it’s traceable enough.</p></li>
</ol>
<p>The second point here might actually be an issue. To be honest, I have no idea. I think maybe it’s fine? But I need to do a close read on <a href="https://jax.readthedocs.io/en/latest/notebooks/How_JAX_primitives_work.html#reverse-differentiation">the adding primitives doc</a>. Essentially, as long as the abstract traces only need shapes, and not the concrete values, we should be ok.</p>
<p>For adding this to something like Stan, however, we will likely need to do some extra work to make sure we know the number of parameters.</p>
<p>The advantage of this type of design pattern is that it gives users the flexibility to do whatever perverted thing they want to do with the Cholesky triangle. For example, they might want to do a centring/non-centring transformation. In Option 1, we would need to write explicit functions to let them do that (not difficult, but there’s a lot of code to write, which has the annoying tendency to increase the maintenance burden).</p>
</section>
<section id="option-4-functors" class="level2">
<h2 class="anchored" data-anchor-id="option-4-functors">Option 4: Functors!</h2>
<p>A slightly wilder design pattern would be to abandon sparse matrices and just make functions <code>A(theta, ...)</code> that return a sparse matrix. If that function is differentiable wrt its first argument, then we can build this whole thing up that way.</p>
<p>In reality, the only way I can think of to implement this pattern would be to implement a whole differentiable sparse matrix arithmetic (make operations like <code>alpha * A + beta * B</code>, <code>C * D</code> work for sparse matrices). At which point, we’ve basically just recreated option 1.</p>
<p>I’m really only bringing up functors because unlike sparse matrices, it is actually a pretty good model for implementing Gaussian Processes with general covariance functions. There’s a little bit of the idea in <a href="https://github.com/stan-dev/math/issues/1011">this Stan issue</a> that, to my knowledge, hasn’t gone anywhere. More recently, a variant has been used successfully in the (as yet un-merged) <a href="https://github.com/stan-dev/math/tree/try-laplace_student/stan/math/laplace">Laplace approximation feature</a> in Stan.</p>
</section>
<section id="which-one-should-we-use" class="level2">
<h2 class="anchored" data-anchor-id="which-one-should-we-use">Which one should we use?</h2>
<p>We don’t really need to make that choice yet. So we won’t.</p>
<p>But personally, I like option 1. I expect everyone else on earth would prefer option 3. For densities that see a lot of action, it would make quite a bit of sense to consider making that density a primitive when it has a complex derivative (<em>à la</em> option 2).</p>
<p>But for now, let’s park this and start getting in on the implementations.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>functions that have explicit transformations written for them (eg explicit instruction on how to JIT or how to differentiate)↩︎</p></li>
<li id="fn2"><p>I get sick of typing “unnormalised log-posterior”↩︎</p></li>
<li id="fn3"><p>I am sorry. I have had some wine.↩︎</p></li>
<li id="fn4"><p>Permuations, cholesky, etc↩︎</p></li>
<li id="fn5"><p>This also won’t work in Stan, because all Stan objects are stateless.↩︎</p></li>
<li id="fn6"><p>This is actually what Stan has done for a bunch of its <a href="https://mc-stan.org/docs/2_29/functions-reference/poisson-log-glm.html">GLM-type models</a>. It’s very efficient and fast. But with a maintainance burden.↩︎</p></li>
<li id="fn7"><p>or three, but you can implement both triangular solves in one function↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Sparse {Matrices} 4: {Design} Is My Passion},
  date = {2022-05-16},
  url = {https://dansblog.netlify.app/2022-05-16-design-is-my-passion-sparse-matrices-part-four},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Sparse Matrices 4: Design Is My
Passion.”</span> May 16, 2022. <a href="https://dansblog.netlify.app/2022-05-16-design-is-my-passion-sparse-matrices-part-four">https://dansblog.netlify.app/2022-05-16-design-is-my-passion-sparse-matrices-part-four</a>.
</div></div></section></div> ]]></description>
  <category>Sparse matrices</category>
  <category>Sparse Cholesky factorisation</category>
  <category>Python</category>
  <category>JAX</category>
  <guid>https://dansblog.netlify.app/posts/2022-05-16-design-is-my-passion-sparse-matrices-part-four/design-is-my-passion-sparse-matrices-part-four.html</guid>
  <pubDate>Sun, 15 May 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-05-16-design-is-my-passion-sparse-matrices-part-four/scrod.JPG" medium="image"/>
</item>
<item>
  <title>Sparse Matrices 3: Failing at JAX</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey/jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey.html</link>
  <description><![CDATA[ 





<p>This is part three of an ongoing exercise in hubris. <a href="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/">Part one is here.</a> <a href="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/">Part two is here.</a> The overall aim of this series of posts is to look at how sparse Cholesky factorisations work, how JAX works, and how to marry the two with the ultimate aim of putting a bit of sparse matrix support into PyMC, which should allow for faster inference in linear mixed models and Gaussian spatial models. And hopefully, if anyone ever gets around to putting the Laplace approximation in, in all sorts of GLMMs and non-Gaussian models with splines and spatial effects.</p>
<p>It’s been a couple of weeks since the last blog, but I’m going to just assume that you are fully on top of all of those details. To that end, let’s jump in.</p>
<section id="what-is-jax" class="level2">
<h2 class="anchored" data-anchor-id="what-is-jax">What is JAX?</h2>
<p><a href="https://jax.readthedocs.io/en/latest/index.html">JAX</a> is a minor miracle. It will take python+numpy code and make it cool. It will let you JIT<sup>1</sup> compile it! It will let you differentiate it! It will let you batch<sup>2</sup>. JAX refers to these three operations as <em>transformations</em>.</p>
<p>But, as The Mountain Goats tell us <a href="https://www.youtube.com/watch?v=-E4XeV33TvE"><em>God is present in the sweeping gesture, but the devil is in the details</em></a>. And oh boy are those details going to be really fucking important to us.</p>
<p>There are going to be two key things that will make our lives more difficult:</p>
<ol type="1">
<li><p>Not every operation can be handled by every transformation. For example, you can’t always JIT or take gradients of a <code>for</code> loop. This means that some things have to be re-written carefully to make sure it’s possible to get the advantages we need.</p></li>
<li><p>JAX arrays are <em>immutable</em>. That means that once an array is created it <em>cannot be changed</em>. This means that in-place updates like <code>a[0] = 1.0</code> are not allowed! If you’ve come from an R/Python/C/Fortran world, this is the weirdest thing to deal with.</p></li>
</ol>
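<p>The functional workarounds JAX provides for these restrictions look like this (a minimal sketch using <code>jax.numpy</code>’s indexed-update syntax and <code>lax.fori_loop</code>, not code from this series):</p>

```python
import jax.numpy as jnp
from jax import lax

a = jnp.zeros(3)
# In-place assignment (a[0] = 1.0) would raise: JAX arrays are immutable.
b = a.at[0].set(1.0)  # functional update: returns a new array
assert float(a[0]) == 0.0 and float(b[0]) == 1.0

# A traceable loop: accumulate 0 + 1 + 2 + 3 + 4 with lax.fori_loop.
total = lax.fori_loop(0, 5, lambda i, s: s + i, 0)
assert int(total) == 10
```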
<p>There are really excellent reasons for both of these restrictions. And looking into the reasons is fascinating. But not a topic for this blog<sup>3</sup>.</p>
<p>JAX has some pretty decent<sup>4</sup> documentation, a core piece of which outlines some of the <a href="https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html">sharp edges</a> you will run into. As you read through the documentation, the design choices become clearer.</p>
<p>So let’s go and find some sharp edges together!</p>
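<p>As an appetiser, here is a minimal sketch (a toy example of my own, not from the JAX docs) of the immutability sharp edge we will run into later:</p>

```python
import jax.numpy as jnp

x = jnp.zeros(3)

# In-place item assignment is forbidden on JAX arrays
try:
    x[0] = 1.0
    mutated = True
except TypeError:
    mutated = False

# The functional alternative: build a new array with the update applied
y = x.at[0].set(1.0)
```

<p>Note that <code>x</code> itself is untouched; <code>y</code> is a fresh array.</p>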
</section>
<section id="to-jax-or-not-to-jax" class="level2">
<h2 class="anchored" data-anchor-id="to-jax-or-not-to-jax">To JAX or not to JAX</h2>
<p>But first, we need to ask ourselves <em>which functions do we need to JAX</em>?</p>
<p>In the context of our problem we, so far, have three functions:</p>
<ol type="1">
<li><code>_symbolic_factor_csc(A_indices, A_indptr)</code>, which finds the non-zero indices of the sparse Cholesky factor and returns them in CSC format,</li>
<li><code>_deep_copy_csc(A_indices, A_indptr, A_x, L_indices, L_indptr)</code>, which takes the <em>entries</em> of the matrix <img src="https://latex.codecogs.com/png.latex?A"> and re-creates them so they can be indexed within the larger pattern of non-zero elements of <img src="https://latex.codecogs.com/png.latex?L">,</li>
<li><code>_sparse_cholesky_csc_impl(L_indices, L_indptr, L_x)</code>, which actually does the sparse Cholesky factorisation.</li>
</ol>
<p>Let’s take them piece by piece, which is also a good opportunity to remind everyone what the code looked like.</p>
</section>
<section id="symbolic-factorisation" class="level2">
<h2 class="anchored" data-anchor-id="symbolic-factorisation">Symbolic factorisation</h2>
<div class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _symbolic_factor_csc(A_indices, A_indptr):</span>
<span id="cb1-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assumes A_indices and A_indptr index the lower triangle of $A$ ONLY.</span></span>
<span id="cb1-3">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-4">  L_sym <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb1-5">  children <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb1-6">  </span>
<span id="cb1-7">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb1-8">    L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[A_indptr[j]:A_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb1-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> children[j]:</span>
<span id="cb1-10">      tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[child][L_sym[child] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> j]</span>
<span id="cb1-11">      L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.unique(np.append(L_sym[j], tmp))</span>
<span id="cb1-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_sym[j]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb1-13">      p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[j][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb1-14">      children[p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.append(children[p], j)</span>
<span id="cb1-15">        </span>
<span id="cb1-16">  L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb1-17">  L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum([<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> L_sym])</span>
<span id="cb1-18">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate(L_sym)</span>
<span id="cb1-19">  </span>
<span id="cb1-20">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr</span></code></pre></div>
</div>
<p>This function only needs to be computed once per non-zero pattern. In the applications I outlined in the first post, this non-zero pattern is <em>fixed</em>. This means that you only need to run this function <em>once</em> per analysis (unlike the others, which you will have to run once per iteration!).</p>
<p>As a general rule, if you only do something once, it isn’t all that necessary to devote <em>too much</em> time to optimising it. There are, however, some obvious things we could do.</p>
<p>It is, for instance, pretty easy to see how you would implement this with an explicit tree<sup>5</sup> structure instead of constantly <code>np.append</code>ing the <code>children</code> array. This is <em>far</em> better from a memory standpoint.</p>
<p>It’s also easy to imagine this as a two-pass algorithm, where you build the tree and count the number of non-zero elements in the first pass and then build and populate <code>L_indices</code> in the second pass.</p>
<p>The thing is, neither of these things fixes the core problem for using JAX to JIT this: the dimensions of the internal arrays depend on the <em>values</em> of the inputs, and JIT compilation requires every array shape to be known at trace time. This is simply not possible here.</p>
<p>It seems like this would be a huge limitation, but in reality it isn’t. Most functions aren’t like this one! And, if we remember that JAX is a domain-specific language focussing mainly on ML applications, value-dependent shapes are <em>very rarely</em> needed there. It is always good to remember context!</p>
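<p>To make the limitation concrete, here is a tiny sketch (my own toy example) of a function whose output shape depends on input <em>values</em>: it runs fine eagerly, but cannot be JIT compiled as written.</p>

```python
import jax
import jax.numpy as jnp

def values_above(x, t):
    # The length of the result depends on how many entries exceed t:
    # a value-dependent shape.
    return x[x > t]

x = jnp.array([1.0, 5.0, 3.0])
eager = values_above(x, 2.0)  # works outside of jit

try:
    jax.jit(values_above)(x, 2.0)
    jit_ok = True
except Exception:
    # JAX refuses: boolean-mask indexing needs a concrete output shape
    jit_ok = False
```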
<p>So what are our options? We have two.</p>
<ol type="1">
<li>Leave it in Python and just eat the speed.</li>
<li>Build a <a href="https://jax.readthedocs.io/en/latest/notebooks/How_JAX_primitives_work.html">new JAX primitive</a> and write the XLA compilation rule<sup>6</sup>.</li>
</ol>
<p>Today we are opting for the first option!</p>
</section>
<section id="the-structure-changing-copy" class="level2">
<h2 class="anchored" data-anchor-id="the-structure-changing-copy">The structure-changing copy</h2>
<div class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _deep_copy_csc(A_indices, A_indptr, A_x, L_indices, L_indptr):</span>
<span id="cb2-2">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-3">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indices))</span>
<span id="cb2-4">  </span>
<span id="cb2-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n):</span>
<span id="cb2-6">    copy_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(L_indices[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],</span>
<span id="cb2-7">                                  A_indices[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-8">    L_x[L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> copy_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_x[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb2-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span></code></pre></div>
</div>
<p>This is, fundamentally, a piece of bookkeeping. An annoyance of sparse matrices. Or, if you will, an explicit <em>cast</em> between different sparse matrix types<sup>7</sup>. This is a thing that we do actually need to be able to differentiate, so it needs to live in JAX.</p>
<p>So where are the potential problems? Let’s go line by line.</p>
<ol type="1">
<li><p><code>n = len(A_indptr) - 1</code>: This is lovely. <code>n</code> is used in a for loop later, but because it is a function of the <em>shape</em> of <code>A_indptr</code>, it is considered static and we will be able to JIT over it!</p></li>
<li><p><code>L_x = np.zeros(len(L_indices))</code>: Again, this is fine. Sizes are derived from shapes, life is peachy.</p></li>
<li><p><code>for j in range(0, n):</code>: This could be a problem if <code>n</code> were an argument or derived from the <em>values</em> of the arguments, but it’s derived from a shape so it is static. Praise be! Well, actually it’s a bit more involved than that.</p></li>
</ol>
<p>The problem with the <code>for</code> loop is what will happen when it is JIT’d.&nbsp;Essentially, the loop will be statically unrolled<sup>8</sup>. That is fine for small loops, but it’s a bit of a pain in the arse when <code>n</code> is large.</p>
<p>In this case, we might want to use the structured control flow in <code>jax.lax</code><sup>9</sup>. Here we would need <code>jax.lax.fori_loop(start, end, body_fun, init_value)</code>. This makes the code look less <em>pythonic</em>, but probably should make it faster. It is also, and I cannot stress this enough, an absolute dick to use.</p>
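<p>For the record, here is a minimal sketch of what <code>fori_loop</code> looks like in use (a toy sum-of-squares of my own, not the copy function itself):</p>

```python
import jax
import jax.numpy as jnp
from jax import lax

def sum_of_squares(x):
    n = x.shape[0]  # static: comes from the shape, so jit is happy

    def body_fun(j, acc):
        # One iteration of the loop; acc is the carried running total
        return acc + x[j] ** 2

    # Compiles to a single loop primitive rather than n unrolled copies
    return lax.fori_loop(0, n, body_fun, 0.0)

total = jax.jit(sum_of_squares)(jnp.arange(4.0))  # 0 + 1 + 4 + 9
```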
<p>(In actuality, we will see that we do not need this particular corner of the language here!)</p>
<ol start="4" type="1">
<li><code>copy_idx = np.nonzero(...)</code>: This looks like it’s going to be complicated, but actually it is a perfectly reasonable composition of <code>numpy</code> functions. Hence, we can use the same <code>jax.numpy</code> functions with minimal changes. The one change that we are going to need to make in order to end up with a JIT-able and differentiable function is that we need to tell JAX how many non-zero elements there are. Thankfully, we know this! Because the non-zero pattern of <img src="https://latex.codecogs.com/png.latex?A"> is a subset of the non-zero pattern of <img src="https://latex.codecogs.com/png.latex?L">, we know that</li>
</ol>
<div class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1">np.in1d(L_indices[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]], A_indices[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]])</span></code></pre></div>
</div>
<p>will have exactly <code>len(A_indices[A_indptr[j]:A_indptr[j+1]])</code> <code>True</code> values, and so <code>np.nonzero(...)</code> will have that many. We can pass this information to <code>jnp.nonzero()</code> using the optional <code>size</code> argument.</p>
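<p>Here is a quick sketch of how that <code>size</code> argument behaves (with the optional <code>fill_value</code> used to make the padding visible):</p>

```python
import jax.numpy as jnp

mask = jnp.array([True, False, True, False])

# Eager: the result length depends on the values in mask
idx_eager = jnp.nonzero(mask)[0]

# With a static size, the result is padded out (or truncated) to that
# length; padded entries take fill_value
idx_sized = jnp.nonzero(mask, size=3, fill_value=7)[0]
```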
<p><strong>Oh no! We have a problem!</strong> This return size is <em>a function of the values</em> of <code>A_indptr</code> rather than a function of the shape. This means we’re a bit fucked.</p>
<p>There are two routes out:</p>
<ol type="1">
<li>Declare <code>A_indptr</code> to be a static parameter, or</li>
<li>Change the representation from CSC to something more convenient.</li>
</ol>
<p>In this case we could do either of these things, but I’m going to opt for the second option, as it’s going to be more useful going forward.</p>
<p>But before we do that, let’s look at the final line in the code.</p>
<ol start="5" type="1">
<li><code>L_x[L_indptr[j] + copy_idx] = A_x[A_indptr[j]:A_indptr[j+1]]</code>: The final non-trivial line of the code is also a problem. The issue is that these arrays are <em>immutable</em> and we are asking to change the values! That is not allowed!</li>
</ol>
<p>The solution here is to use a clunkier syntax. In JAX, we need to replace</p>
<div class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1">x[ind] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> a</span></code></pre></div>
</div>
<p>with the less pleasant</p>
<div class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1">x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.at[ind].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(a)</span></code></pre></div>
</div>
<p>What is going on under the hood to make the second option ok while the first is an error is well beyond the scope of this little post. But the important thing is that they <em>compile down</em> to an in-place<sup>10</sup> update, which is all we really care about.</p>
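<p>A small sketch (my own) of the idiom inside a JIT’d function; the input array is left untouched and the update appears in the returned array:</p>

```python
import jax
import jax.numpy as jnp

@jax.jit
def scatter_set(x, ind, a):
    # Functionally pure from the outside; XLA is free to perform the
    # update in place on its internal buffer
    return x.at[ind].set(a)

x = jnp.zeros(4)
y = scatter_set(x, jnp.array([1, 3]), 5.0)
```

<p>The <code>.at</code> property also supports <code>add</code>, <code>mul</code>, <code>min</code>, <code>max</code>, and friends for accumulating updates.</p>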
</section>
<section id="re-doing-the-data-structure." class="level2">
<h2 class="anchored" data-anchor-id="re-doing-the-data-structure.">Re-doing the data structure.</h2>
<p>Ok. So we need a new data structure. That’s annoying. The rule, I guess, is always that if you need to innovate, you should innovate very little if you can get away with it, or a lot if you have to.</p>
<p>We are going to innovate only the tiniest of bits.</p>
<p>The idea is to keep the core structure of the CSC data structure, but to replace the <code>indptr</code> array with explicitly storing the row indices and row values as a <em>list</em> of <code>np.arrays</code>. So <code>A_index</code> will now be a <em>list</em> of <code>n</code> arrays that contain the row indices of the non-zero elements of <img src="https://latex.codecogs.com/png.latex?A">, while <code>A_x</code> will now be a <em>list</em> of <code>n</code> arrays that contain the values of the non-zero elements of <img src="https://latex.codecogs.com/png.latex?A">.</p>
<p>This means that the matrix <img src="https://latex.codecogs.com/png.latex?%0AB%20=%20%5Cbegin%7Bpmatrix%7D%0A1%20&amp;&amp;5%20%5C%5C%0A2&amp;3&amp;%20%5C%5C%0A&amp;4&amp;6%0A%5Cend%7Bpmatrix%7D%0A"> would be stored as</p>
<div class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1">B_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]), np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])]</span>
<span id="cb6-2">B_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]), np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]), np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>])]</span></code></pre></div>
</div>
<p>This is a considerably more <em>pythonic</em><sup>11</sup> version of CSC. So I guess that’s an advantage.</p>
<p>We can easily go from CSC storage to this modified storage.</p>
<div class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> to_pythonic_csc(indices, indptr, x):</span>
<span id="cb7-2">  index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.split(indices, indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb7-3">  x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.split(x, indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb7-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> index, x</span></code></pre></div>
</div>
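<p>As a sanity check, here is the conversion applied to the matrix <code>B</code> from above (a self-contained sketch that restates the splitting logic inline):</p>

```python
import numpy as np
from scipy import sparse

# B in ordinary CSC form
B = sparse.csc_matrix(np.array([[1., 0., 5.],
                                [2., 3., 0.],
                                [0., 4., 6.]]))

# Split the indices and values at the interior column pointers
B_index = np.split(B.indices, B.indptr[1:-1])
B_x = np.split(B.data, B.indptr[1:-1])
```

<p>This reproduces exactly the <code>B_index</code> and <code>B_x</code> lists written out by hand above.</p>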
</section>
<section id="a-jax-tracable-structure-changing-copy" class="level2">
<h2 class="anchored" data-anchor-id="a-jax-tracable-structure-changing-copy">A JAX-tracable structure-changing copy</h2>
<p>So now it’s time to come back to that damn <code>for</code> loop. As flagged earlier, <code>for</code> loops can be a bit picky in JAX. If we use them <em>as is</em>, then the code that is generated and then compiled is <em>unrolled</em>. You can think of this as if the JIT compiler automatically writes a C++ program and then compiles it. If you were to examine that code, the for loop would be replaced by <code>n</code> almost identical blocks of code with only the index <code>j</code> changing between them. This leads to a potentially very large program to compile<sup>12</sup> and it limits the compiler’s ability to do clever things to make the compiled code run faster<sup>13</sup>.</p>
<p>The <code>lax.fori_loop()</code> function, on the other hand, compiles down to the equivalent of a single operation<sup>14</sup>. This lets the compiler be super clever.</p>
<p>But we don’t actually need this here. Because if you take a look at the original for loop, we are just applying the same two lines of code to each triple of arrays from <code>A_index</code>, <code>A_x</code>, and <code>L_index</code> (in our new<sup>15</sup> data structure).</p>
<p>This just <em>screams</em> out for a map applying a single function independently to each column.</p>
<p>The challenge is to find the right map function. An obvious hope would be <code>jax.vmap</code>. Sadly, <code>jax.vmap</code> does not do that. (At least not without more padding<sup>16</sup> than a drag queen.) The problem here is a misunderstanding of what different parts of JAX are for. Functions like <code>jax.vmap</code> are made for applying the same function to arrays <em>of the same size</em>. This makes sense in their context. (JAX is, after all, made for machine learning and these shape assumptions fit really well in that paradigm. They just don’t fit here.)</p>
<p>And I won’t lie. After this point I went <em>wild</em>. <code>lax.map</code> did not help. And I honest to god tried <code>lax.scan</code>, which will solve the problem but <a href="https://www.youtube.com/watch?v=AOGzY9xShEI">at what cost?</a>.</p>
<p>But at some point, you read enough of the docs to find the answer.</p>
<p>The correct answer here is to use the JAX concept of a <code>pytree</code>. Pytrees are essentially<sup>17</sup> lists of arrays. They’re very flexible and they have a <code>jax.tree_map</code> function that lets you map over them! We are saved!</p>
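<p>To see why this fits, here is a tiny sketch (my own) of <code>tree_map</code> zipping a function over two ragged lists of arrays — exactly the shape mismatch that <code>vmap</code> chokes on:</p>

```python
import jax.numpy as jnp
from jax.tree_util import tree_map  # the stable home of jax.tree_map

# Two "columns" of different lengths: vmap would refuse this
A = [jnp.array([1., 2.]), jnp.array([3., 4., 5.])]
B = [jnp.array([10., 20.]), jnp.array([1., 1., 1.])]

# tree_map pairs up matching leaves and applies the function leaf-wise
sums = tree_map(lambda a, b: a + b, A, B)
```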
<div class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> jnp</span>
<span id="cb8-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> tree_map</span>
<span id="cb8-4"></span>
<span id="cb8-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _structured_copy_csc(A_index, A_x, L_index):</span>
<span id="cb8-6">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> body_fun(A_rows, A_vals, L_rows):</span>
<span id="cb8-7">      out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_rows))</span>
<span id="cb8-8">      copy_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  jnp.nonzero(jnp.in1d(L_rows, A_rows), size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_rows))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] </span>
<span id="cb8-9">      out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> out.at[copy_idx].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(A_vals)</span>
<span id="cb8-10">      <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> out</span>
<span id="cb8-11">    L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tree_map(body_fun, A_index, A_x, L_index)</span>
<span id="cb8-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span></code></pre></div>
</div>
<section id="testing-it-out" class="level3">
<h3 class="anchored" data-anchor-id="testing-it-out">Testing it out</h3>
<p>OK, so now let’s see if it works. To do that I’m going to define a very simple function <img src="https://latex.codecogs.com/png.latex?%0Af(A,%20%5Calpha,%20%5Cbeta)%20=%20%5C%7C%5Calpha%20I%20+%20%5Cbeta%20%5Coperatorname%7Btril%7D(A)%5C%7C_F%5E2,%0A"> that is the sum of the squares of all of the elements of <img src="https://latex.codecogs.com/png.latex?%5Calpha%20I%20+%20%5Cbeta%20%5Coperatorname%7Btril%7D(A)">. There’s obviously an easy way to do this, but I’m going to do it in a way that uses the function we just built.</p>
<div class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> test_func(A_index, A_x, params):</span>
<span id="cb9-2">  I_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [jnp.array([j]) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_index))]</span>
<span id="cb9-3">  I_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [jnp.array([params[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]]) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_index))]</span>
<span id="cb9-4">  I_x2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _structured_copy_csc(I_index, I_x, A_index)</span>
<span id="cb9-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> jnp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>((jnp.concatenate(I_x2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> params[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> jnp.concatenate(A_x))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div>
</div>
<p>Next, we need a test case. Once again, we will use the 2D Laplacian on a regular <img src="https://latex.codecogs.com/png.latex?n%20%5Ctimes%20n"> grid (up to a scaling). This is a nice little function because it’s easy to make test problems of different sizes.</p>
<div class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sparse</span>
<span id="cb10-2"></span>
<span id="cb10-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> make_matrix(n):</span>
<span id="cb10-4">    one_d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.diags([[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n, [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)], [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb10-5">    A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.tril(sparse.kronsum(one_d, one_d) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> sparse.eye(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n), <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"csc"</span>)</span>
<span id="cb10-6">    A_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.split(jnp.array(A_lower.indices), A_lower.indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb10-7">    A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jnp.split(jnp.array(A_lower.data), A_lower.indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb10-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> (A_index, A_x)</span></code></pre></div>
</div>
<p>With our test case in hand, we can check to see if JAX will differentiate for us!</p>
<div class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> grad, jit</span>
<span id="cb11-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> jax.test_util <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> check_grads</span>
<span id="cb11-3"></span>
<span id="cb11-4">grad_func <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> grad(test_func, argnums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb11-5"></span>
<span id="cb11-6">A_index, A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb11-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The value at (2.0, 2.0) is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>test_func(A_index, A_x, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">."</span>)</span>
<span id="cb11-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The gradient is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>array(grad_func(A_index, A_x, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>)))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">."</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The value at (2.0, 2.0) is 379600.0.</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>The gradient is [ 60000. 319600.].</code></pre>
</div>
</div>
<p>Fabulous! That works!</p>
</section>
</section>
<section id="but-what-about-jit" class="level2">
<h2 class="anchored" data-anchor-id="but-what-about-jit">But what about JIT?</h2>
<p>JIT took fucking <em>ages</em>. I’m talking “it threw a message” amounts of time. I’m not even going to pretend that I understand why. But I can hazard a guess.</p>
<p>My running assumption, taken from the docs, is that as long as the function only relies on quantities that are derived from the <em>shapes</em> of the inputs (and not their values), then JAX will be able to trace through and JIT the function with ease.</p>
<p>This might not be true for <code>tree_map</code>s. The docs are, as far as I can tell, silent on this matter. And a cursory look through the github repo did not give me any hints as to how <code>tree_map()</code> is translated.</p>
<p>Let’s take a look to see if this is true.</p>
<div class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> timeit</span>
<span id="cb14-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> functools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> partial</span>
<span id="cb14-3">jit_test_func <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jit(test_func)</span>
<span id="cb14-4"></span>
<span id="cb14-5">A_index, A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb14-6">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(partial(jit_test_func, A_index, A_x, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>)), number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb14-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = 5: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 5: [1.6695, 0.0001, 0.0, 0.0, 0.0]</code></pre>
</div>
</div>
<p>We can see that the first run includes compilation time, but after that it runs a bunch faster. This is how a JIT system is supposed to work! But the question is: will it recompile when we run it for a different matrix?</p>
<div class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb16-1">_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jit_test_func(A_index, A_x, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>)) </span>
<span id="cb16-2">A_index, A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_matrix(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb16-3">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(partial(jit_test_func, A_index, A_x, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>)), number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb16-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = 20: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 20: [38.5779, 0.0006, 0.0003, 0.0003, 0.0003]</code></pre>
</div>
</div>
<p>Damn. It recompiles. But, as we will see, it does not recompile if we only change <code>A_x</code>.</p>
<div class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># What if we change A_x only</span></span>
<span id="cb18-2">_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jit_test_func(A_index, A_x, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>)) </span>
<span id="cb18-3">A_x2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tree_map(<span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> x: x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, A_x)</span>
<span id="cb18-4">times <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.repeat(partial(jit_test_func, A_index, A_x2, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>)), number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb18-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"n = 20, new A_x: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(t, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> times]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>n = 20, new A_x: [0.0006, 0.0007, 0.0005, 0.0003, 0.0003]</code></pre>
</div>
</div>
<p>This gives us some hope! This is because the <em>structure</em> of A (aka <code>A_index</code>) is fixed in our application, but the values <code>A_x</code> change. So as long as the initial JIT compilation is reasonable, we should be ok.</p>
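<p>To make the “compile once per structure” idea concrete, here is a toy model of a shape-keyed compilation cache. To be clear, this is <em>not</em> how JAX is implemented; it’s just an illustration of why changing <code>A_x</code> is cheap while changing <code>A_index</code> is not.</p>

```python
# Toy model of a compile cache keyed on input *structure*, not values.
# (An illustration only -- not JAX's actual mechanism.)
from functools import lru_cache

compile_count = 0

@lru_cache(maxsize=None)
def compile_for(structure):
    """Pretend to 'compile' a sum-of-squares kernel for one structure."""
    global compile_count
    compile_count += 1
    return lambda flat: sum(x * x for x in flat)

def run(list_of_arrays):
    structure = tuple(len(a) for a in list_of_arrays)  # shapes only
    flat = [x for a in list_of_arrays for x in a]
    return compile_for(structure)(flat)

a = [[1.0, 2.0], [3.0]]
b = [[4.0, 5.0], [6.0]]   # same structure, new values: cache hit
c = [[1.0], [2.0, 3.0]]   # new structure: triggers a "recompile"
run(a); run(b); run(c)
print(compile_count)  # 2
```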
<p>Unfortunately, there is something bad happening with the compilation. For <img src="https://latex.codecogs.com/png.latex?n=10">, it takes (on my machine) about 2 seconds for the initial compilation. For <img src="https://latex.codecogs.com/png.latex?n=20">, that increases to 16 seconds. Once <img src="https://latex.codecogs.com/png.latex?n%20=%2030">, this balloons up to 51 seconds. Once we reach the lofty peaks<sup>18</sup> of <img src="https://latex.codecogs.com/png.latex?n=40">, we are up at 149 seconds to compile.</p>
<p>This is not good. The function we are JIT-ing is <em>very</em> simple: just one <code>tree_map</code>. I do not know enough<sup>19</sup> about the internals of JAX, so I don’t want to speculate too wildly. But it seems like it might be unrolling the <code>tree_map</code> before compilation, which is … bad.</p>
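<p>One way to probe this guess (an experiment, not anything from the docs) is to look at the jaxpr that JAX traces out: if the <code>tree_map</code> is unrolled at trace time, the traced program should grow with the number of leaves in the pytree.</p>

```python
import jax
import jax.numpy as jnp
from jax.tree_util import tree_map

def sum_of_squares(xs):
    # One jnp.sum per leaf; the builtin sum() combines the per-leaf scalars.
    return sum(tree_map(lambda x: jnp.sum(x ** 2), xs))

small = [jnp.ones(3)] * 2
big = [jnp.ones(3)] * 20

# If tree_map unrolls, the jaxpr for the bigger pytree contains many
# more equations (roughly one square + one sum per leaf).
jaxpr_small = str(jax.make_jaxpr(sum_of_squares)(small))
jaxpr_big = str(jax.make_jaxpr(sum_of_squares)(big))
print(len(jaxpr_big) > len(jaxpr_small))  # True
```

If that holds, each leaf contributes its own equations to the traced program, which would be consistent with the compile times above growing with the size of the matrix.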
</section>
<section id="lets-admit-failure" class="level2">
<h2 class="anchored" data-anchor-id="lets-admit-failure">Let’s admit failure</h2>
<p>Ok. So that didn’t bloody work. I’m not going to make such broad statements as <em>you can’t use the JAX library in python to write a transformable sparse Cholesky factorisation</em>, but I am more than prepared to say that <em>I</em> cannot do such a thing.</p>
<p>But, if I’m totally honest, I’m not <em>enormously</em> surprised. Even in looking at the very simple operation we focussed on today, it’s pretty clear that the operations required to work on a sparse matrix don’t look an awful lot like the types of operations you need to do the types of machine learning work that is JAX’s <em>raison d’être</em>.</p>
<p>And it is <em>never</em> surprising to find that a library designed to do a fundamentally different thing does not easily adapt to whatever random task I decide to throw at it.</p>
<p>But there is a light: JAX is an extensible language. We can build a new JAX primitive (or, new JAX primitives) and manually write all of the transformations (batching, JIT, and autodiffing).</p>
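<p>As a taste of what that involves, here is the very first step from the “How JAX primitives work” notebook linked in the footnotes: registering a primitive and giving it a concrete implementation. A sketch only — a real sparse Cholesky primitive would also need the abstract evaluation, JVP/transpose, and batching rules.</p>

```python
import numpy as np

try:
    from jax.extend import core  # newer JAX versions
except ImportError:
    from jax import core

# Register a brand-new primitive that JAX knows nothing about.
multiply_add_p = core.Primitive("multiply_add")

def multiply_add(x, y, z):
    # User-facing function: bind the primitive to its arguments.
    return multiply_add_p.bind(x, y, z)

def multiply_add_impl(x, y, z):
    # Concrete (un-traced) evaluation falls back to numpy.
    return np.add(np.multiply(x, y), z)

multiply_add_p.def_impl(multiply_add_impl)

print(multiply_add(2.0, 3.0, 4.0))  # 10.0
```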
<p>And that is what we shall do next! It’s gonna be a blast!</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>If you’ve never come across this term before, you can Google it for actual details, but the squishy version is that it will <em>compile</em> your code so it runs fast (like C code) instead of slow (like python code). JIT stands for <em>just in time</em>, which means that the code is compiled when it’s needed rather than before everything else is run. It’s a good thing. It makes the machine go <em>bing</em> faster.↩︎</p></li>
<li id="fn2"><p>I give less of a shit about the third transformation in this context. I’m not completely sure what you would batch when you’re dealing with a linear mixed-ish model. But hey. Why not.↩︎</p></li>
<li id="fn3"><p>If you’ve ever spoken to a Scala advocate (or an advocate of any other pure functional language), you can probably see the edges of why the arrays need to be immutable.<img src="https://latex.codecogs.com/png.latex?%0A%5Cphantom%7Ba%7D%0A"> Restrictions to JIT-able control flow have to do with how it’s translated onto the XLA compiler, which involves <em>tracing</em> through the code with an abstract data type with the same shape as the one that it’s being called with. Because this abstract data type does not have any values, structural parts of the code that <em>require</em> knowledge of specific values of the arguments will be lost. You can get around this partially by declaring those important values to be <em>static</em>, which would make the JIT compiler re-compile the function each time that value changes. We are not going to do that. <img src="https://latex.codecogs.com/png.latex?%0A%5Cphantom%7Ba%7D%0A"> Restrictions to gradients have to do (I assume) with reverse-mode autodiff needing to construct the autodiff tree at compile time, which means you need to be able to compute the number of operations from the types and shapes of the input variables and not from their values.↩︎</p></li>
<li id="fn4"><p>Coverage is pretty good on the <em>using</em> bit, but, as is usual, the bits on extending the system are occasionally a bit … sparse. (What in the hairy Christ is a <a href="https://jax.readthedocs.io/en/latest/notebooks/How_JAX_primitives_work.html#transposition">transposition</a> rule actually supposed to do????)↩︎</p></li>
<li id="fn5"><p>Forest↩︎</p></li>
<li id="fn6"><p>aka implement the damn thing in C++ and then do some proper work on it.↩︎</p></li>
<li id="fn7"><p>It is useful to think of a sparse matrix type as the triple <code>(value_type, indices, indptr)</code>. This means that if we are going to do something like add sparse matrices, we need to first cast them both to have the same type. After the cast, addition of two different sparse matrices becomes the addition of their <code>x</code> attributes. The same holds for scalar multiplication. Sparse matrix-matrix multiplication is a bit different because you once again need to symbolically work out the sparsity structure (aka the type) of the product. ↩︎</p></li>
<li id="fn8"><p>I think. That’s certainly what’s implied <a href="https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#python-control-flow-jit">by the docs</a>, but I don’t want to give the impression that I’m sure. Because this is <a href="https://www.youtube.com/watch?v=5NPBIwQyPWE">complicated.</a>↩︎</p></li>
<li id="fn9"><p>What is <code>jax.lax</code>? Oh honey you don’t want to know.↩︎</p></li>
<li id="fn10"><p>aka there’s no weird copying↩︎</p></li>
<li id="fn11"><p><a href="https://www.youtube.com/watch?v=1hRvQqyeI2g">Whatever that means anyway</a>↩︎</p></li>
<li id="fn12"><p>slowwwwww to compile↩︎</p></li>
<li id="fn13"><p>The XLA compiler does very clever things. Incidentally, loop unrolling is actually one of the optimisations that compilers have in their pocket. Just not one that’s usually used for loops as large as this.↩︎</p></li>
<li id="fn14"><p>Read about XLA High Level Operations (HLOs) <a href="https://www.tensorflow.org/xla/architecture">here</a>. The XLA documentation is not extensive, but there’s still a lot to read.↩︎</p></li>
<li id="fn15"><p>This is why we have a new data structure.↩︎</p></li>
<li id="fn16"><p>My kingdom for a ragged array.↩︎</p></li>
<li id="fn17"><p>Yes. They are more complicated than this. But for our purposes they are lists of arrays.↩︎</p></li>
<li id="fn18"><p><img src="https://latex.codecogs.com/png.latex?n=50"> takes so long it prints a message telling us what to do if we want to file a bug! Compilation eventually clocks in at 361 seconds.↩︎</p></li>
<li id="fn19"><p>aka I know sweet bugger all↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Sparse {Matrices} 3: {Failing} at {JAX}},
  date = {2022-05-14},
  url = {https://dansblog.netlify.app/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Sparse Matrices 3: Failing at JAX.”</span>
May 14, 2022. <a href="https://dansblog.netlify.app/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey">https://dansblog.netlify.app/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey</a>.
</div></div></section></div> ]]></description>
  <category>Sparse matrices</category>
  <category>Sparse Cholesky factorisation</category>
  <category>Python</category>
  <category>JAX</category>
  <guid>https://dansblog.netlify.app/posts/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey/jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey.html</guid>
  <pubDate>Fri, 13 May 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-05-14-jax-ing-a-sparse-cholesky-factorisation-part-3-in-an-ongoing-journey/alien.JPG" medium="image"/>
</item>
<item>
  <title>Sparse Matrices 2: An invitation to a sparse Cholesky factorisation</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices.html</link>
  <description><![CDATA[ 





<p>This is part two of an ongoing exercise in hubris. <a href="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/">Part one is here.</a></p>
<section id="the-choleksy-factorisation" class="level1">
<h1>The Cholesky factorisation</h1>
<p>So first things first: Cholesky wasn’t Russian. I don’t know why I always thought he was, but you know. Sometimes you should do a little googling first. Cholesky was French and died in the First World War.</p>
<p>But now that’s out of the way, let’s talk about matrices. If <img src="https://latex.codecogs.com/png.latex?A"><sup>1</sup> is a symmetric positive definite matrix, then there is a unique lower-triangular matrix <img src="https://latex.codecogs.com/png.latex?L"> such that <img src="https://latex.codecogs.com/png.latex?A%20=%20LL%5ET">.</p>
<p>Like all good theorems in numerical linear algebra, the proof of the existence of the Cholesky decomposition gives a pretty clear algorithm for constructing <img src="https://latex.codecogs.com/png.latex?L">. To sketch<sup>2</sup> it, let us see what it looks like if we build up our Cholesky factorisation from left to right, so the first <img src="https://latex.codecogs.com/png.latex?j-1"> columns have been computed and we are looking at how to build the <img src="https://latex.codecogs.com/png.latex?j">th column. In order to make <img src="https://latex.codecogs.com/png.latex?L"> lower-triangular, we need the first <img src="https://latex.codecogs.com/png.latex?j-1"> elements of the <img src="https://latex.codecogs.com/png.latex?j">th column to be zero. Let’s see if we can work out what the other elements have to be.</p>
<p>Writing this as a matrix equation, we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bpmatrix%7D%20A_%7B11%7D%20&amp;%20a_%7B12%7D%20&amp;%20A_%7B32%7D%5ET%20%5C%5C%0Aa_%7B12%7D%5ET%20&amp;%20a_%7B22%7D%20&amp;%20a_%7B32%7D%5ET%20%5C%5C%0AA_%7B31%7D%20&amp;%20a_%7B32%7D%20&amp;%20A_%7B33%7D%5Cend%7Bpmatrix%7D%20=%0A%5Cbegin%7Bpmatrix%7D%20L_%7B11%7D&amp;&amp;%20%5C%5C%0Al_%7B12%7D%5ET%20&amp;%20l_%7B22%7D&amp;%5C%5C%0AL_%7B31%7D%20&amp;%20l_%7B32%7D%20&amp;%20L_%7B33%7D%5Cend%7Bpmatrix%7D%0A%5Cbegin%7Bpmatrix%7DL_%7B11%7D%5ET%20%20&amp;l_%7B12%7D%20&amp;%20L_%7B31%7D%5ET%5C%5C%0A&amp;%20l_%7B22%7D&amp;l_%7B32%7D%5ET%5C%5C%0A&amp;%20%20&amp;%20L_%7B33%7D%5ET%5Cend%7Bpmatrix%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?L_%7B11%7D"> is lower-triangular (and <img src="https://latex.codecogs.com/png.latex?A_%7B11%7D%20=%20L_%7B11%7DL_%7B11%7D%5ET">) and lower-case letters are vectors<sup>3</sup> and everything is of the appropriate dimension to make <img src="https://latex.codecogs.com/png.latex?A_%7B11%7D"> the top-left <img src="https://latex.codecogs.com/png.latex?(j-1)%20%5Ctimes%20(j-1)"> submatrix of <img src="https://latex.codecogs.com/png.latex?A">.</p>
<p>If we can find equations for <img src="https://latex.codecogs.com/png.latex?l_%7B22%7D"> and <img src="https://latex.codecogs.com/png.latex?l_%7B32%7D"> that don’t depend on <img src="https://latex.codecogs.com/png.latex?L_%7B33%7D"> (i.e. we can express them in terms of things we already know), then we have found an algorithm that marches from the left of the matrix to the right leaving a Cholesky factorisation in its wake!</p>
<p>If we do our matrix multiplications, we get the following equation for <img src="https://latex.codecogs.com/png.latex?a_%7B22%7D%20=%20A_%7Bjj%7D">: <img src="https://latex.codecogs.com/png.latex?%0Aa_%7B22%7D%20=%20l_%7B12%7D%5ETl_%7B12%7D%20+%20l_%7B22%7D%5E2.%0A"> Rearranging, we get <img src="https://latex.codecogs.com/png.latex?%0Al_%7B22%7D%20%20=%20%5Csqrt%7Ba_%7B22%7D%20-%20l_%7B12%7D%5ETl_%7B12%7D%7D.%0A"> The canny amongst you will be asking “yes but is that a real number”. The answer turns out to be “yes” for all diagonals if and only if<sup>4</sup> <img src="https://latex.codecogs.com/png.latex?A"> is symmetric positive definite.</p>
<p>Ok! We have expressed <img src="https://latex.codecogs.com/png.latex?l_%7B22%7D"> in terms of things we know, so we are half way there. Now to attack the vector <img src="https://latex.codecogs.com/png.latex?l_%7B3,2%7D">. Looking at the (3,2) equation implied by the above block matrices, we get <img src="https://latex.codecogs.com/png.latex?%0Aa_%7B32%7D%20=%20L_%7B31%7Dl_%7B12%7D%20+%20l_%7B32%7D%20l_%7B22%7D.%0A"> Remembering that <img src="https://latex.codecogs.com/png.latex?l_%7B22%7D"> is a scalar (that we have already computed!), we get <img src="https://latex.codecogs.com/png.latex?%0Al_%7B32%7D%20=%20(a_%7B32%7D%20-%20L_%7B31%7Dl_%7B12%7D)%20/%20l_%7B22%7D.%0A"></p>
<p>Success!</p>
<p>This then gives us the<sup>5</sup> Cholesky factorisation<sup>6</sup>:</p>
<pre><code>for j in range(0, n)  (using python slicing notation because why not)
  L[j, j] = sqrt(A[j, j] - L[j, :j] * L[j, :j]')
  L[(j+1):n, j] = (A[(j+1):n, j] - L[(j+1):n, :j] * L[j, :j]') / L[j, j]</code></pre>
<p>Easy as.</p>
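<p>To sanity-check the algebra, the pseudocode above translates almost line-for-line into plain (dense, unoptimised) Python. A sketch for checking the derivation, not an efficient implementation.</p>

```python
import math

def cholesky(A):
    """Dense left-looking Cholesky: returns lower-triangular L with A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # L[j, j] = sqrt(A[j, j] - L[j, :j] . L[j, :j])
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        # L[i, j] = (A[i, j] - L[i, :j] . L[j, :j]) / L[j, j]
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)
print(L[0][0], L[1][0])  # 2.0 1.0
```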
<p>When <img src="https://latex.codecogs.com/png.latex?A"> is a dense matrix, this costs <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E3)"> floating point operations<sup>7</sup>.</p>
<p>So how can we take advantage of the observation that most of the entries of <img src="https://latex.codecogs.com/png.latex?A"> are zero (aka <img src="https://latex.codecogs.com/png.latex?A"> is a sparse matrix)? Well. That is the topic of this post. In order, we are going to look at the following:</p>
<ol type="1">
<li>Storing a sparse matrix so it works with the algorithm</li>
<li>How sparse is a Cholesky factor?</li>
<li>Which elements of the Cholesky factor are non-zero (aka symbolic factorisation)</li>
<li>Computing the Cholesky factorisation</li>
<li><del>What about JAX? (or: fucking immutable arrays are trying to ruin my fucking life)</del> (This did not happen. Next time. The post is long enough.)</li>
</ol>
<section id="so-how-do-we-store-a-sparse-matrix" class="level2">
<h2 class="anchored" data-anchor-id="so-how-do-we-store-a-sparse-matrix">So how do we store a sparse matrix?</h2>
<p>If we look at the Cholesky algorithm, we notice that we are scanning through the matrix column-by-column. When a computer stores a matrix, it stores it as a long 1D array with some side information. How this array is constructed from the matrix depends on the language.</p>
<p>There are (roughly) two options: column-major or row-major storage. Column-major storage (used by Fortran<sup>8</sup>, R, Matlab, Julia, Eigen, etc) stacks a matrix column by column. A small example: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bpmatrix%7D1&amp;3&amp;5%5C%5C2&amp;4&amp;6%20%5Cend%7Bpmatrix%7D%20%5CRightarrow%20%5B1,2,3,4,5,6%5D.%0A"> Row-major ordering (C/C++ arrays, SAS, Pascal, numpy<sup>9</sup>) stores things row-by-row.</p>
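<p>The two orderings, on the small example above, in plain Python:</p>

```python
# Flatten the 2x3 example matrix both ways.
M = [[1, 3, 5],
     [2, 4, 6]]

row_major = [x for row in M for x in row]                   # C, numpy default
col_major = [M[i][j] for j in range(3) for i in range(2)]   # Fortran, R, Eigen

print(col_major)  # [1, 2, 3, 4, 5, 6]
print(row_major)  # [1, 3, 5, 2, 4, 6]
```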
<p>Which one do we use? Well. If you look at the Cholesky algorithm, it scans through the matrix column-by-column. It is much much much more memory efficient in this case to have the whole column available in one contiguous chunk of memory. So we are going to use column-major storage.</p>
<p>But there’s an extra wrinkle: most of the entries in our matrix are zero. It would be very inefficient to store all of those zeros. You may be sceptical about this, but it’s true. It helps to realize that even in the examples at the bottom of this post, which are not trying very hard to minimise the fill-in, only 3–4% of the potential elements in <img src="https://latex.codecogs.com/png.latex?L"> are non-zero.</p>
<p>It is far more efficient to just store the locations<sup>10</sup> of the non-zeros and their values. If only 4% of your matrix is non-zero, you are saving<sup>11</sup> a lot of memory!</p>
<p>The storage scheme we are inching towards is called <em>compressed sparse column (CSC)</em> storage. This stores the matrix in three arrays. The first array <code>indices</code> (which has as many entries as there are non-zeros) stores the row numbers for each non-zero element. So if <img src="https://latex.codecogs.com/png.latex?%0AB%20=%20%5Cbegin%7Bpmatrix%7D%0A1%20&amp;&amp;5%20%5C%5C%0A2&amp;3&amp;%20%5C%5C%0A&amp;4&amp;6%0A%5Cend%7Bpmatrix%7D%0A"> then (using zero-based indices because I’ve got to make this work in Python)</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1">B_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]</span></code></pre></div>
</div>
<p>The second array <code>indptr</code> is an <img src="https://latex.codecogs.com/png.latex?n+1">-dimensional array that indexes the first element of each column. The final element of <code>indptr</code> is <code>nnz(B)</code><sup>12</sup>. This leads to</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1">B_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>]</span></code></pre></div>
</div>
<p>This means that the entries in column<sup>13</sup> j have row numbers</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1">B_indices[B_indptr[j]:B_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span></code></pre></div>
</div>
<p>The third and final array is <code>x</code>, which stores the <em>values</em> of the non-zero entries of <img src="https://latex.codecogs.com/png.latex?B"> <em>column-by-column</em>. This gives</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1">B_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>]</span></code></pre></div>
</div>
<p>Using these three arrays we can get access to the values in the <code>j</code>th column of <img src="https://latex.codecogs.com/png.latex?B"> by accessing</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1">B_x[B_indptr[j]:B_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span></code></pre></div>
</div>
<p>This storage scheme is very efficient for what we are about to do. But it is fundamentally a static scheme: it is <em>extremely</em> expensive to add a new non-zero element. There are other sparse matrix storage schemes that make this work better.</p>
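<p>As a quick sanity check (this snippet is mine, not from the original post): scipy’s <code>csc_matrix</code> uses exactly this storage scheme, so we can build the little matrix B above as a dense array and read the three arrays straight off it.</p>

```python
import numpy as np
from scipy.sparse import csc_matrix

# The matrix B from above, with the blanks written as explicit zeros
B = np.array([[1, 0, 5],
              [2, 3, 0],
              [0, 4, 6]])

B_csc = csc_matrix(B)
print(B_csc.indices)  # row numbers of the non-zeros: [0 1 1 2 0 2]
print(B_csc.indptr)   # start of each column:         [0 2 4 6]
print(B_csc.data)     # values, column-by-column:     [1 2 3 4 5 6]
```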
</section>
<section id="how-sparse-is-a-cholesky-factor-of-a-sparse-matrix" class="level2">
<h2 class="anchored" data-anchor-id="how-sparse-is-a-cholesky-factor-of-a-sparse-matrix">How sparse is a Cholesky factor of a sparse matrix?</h2>
<p>Ok. So now we’ve got that out of the way, we need to work out the sparsity structure of a Cholesky factorisation. At this point we need to close our eyes, pray, and start thinking about graphs.</p>
<p>Why graphs? I promise, it is not because I love discrete<sup>14</sup> maths. It is because symmetric sparse matrices are strongly related to graphs.</p>
<p>To remind people, a graph<sup>15</sup> (in a mathematical sense) <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D%20=%20(%5Cmathcal%7BV%7D,%20%5Cmathcal%7BE%7D)"> consists of two lists:</p>
<ol type="1">
<li>A list of vertices <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BV%7D"> numbered from <img src="https://latex.codecogs.com/png.latex?1"> to <img src="https://latex.codecogs.com/png.latex?n"><sup>16</sup>.</li>
<li>A list of edges <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BE%7D"> in the graph (aka all the pairs <img src="https://latex.codecogs.com/png.latex?(i,j)"> such that <img src="https://latex.codecogs.com/png.latex?i%3Cj"> and there is an edge between <img src="https://latex.codecogs.com/png.latex?i"> and <img src="https://latex.codecogs.com/png.latex?j">).</li>
</ol>
<p>Every symmetric sparse matrix <img src="https://latex.codecogs.com/png.latex?A"> has a graph naturally associated with it. The relationship is that <img src="https://latex.codecogs.com/png.latex?(i,j)"> (for <img src="https://latex.codecogs.com/png.latex?i%5Cneq%20j">) is an edge in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> if and only if <img src="https://latex.codecogs.com/png.latex?A_%7Bij%7D%20%5Cneq%200">.</p>
<p>So, for instance, if <img src="https://latex.codecogs.com/png.latex?%0AA%20=%20%5Cbegin%7Bpmatrix%7D%0A1&amp;2&amp;&amp;8%20%5C%5C%0A2&amp;3&amp;&amp;%205%5C%5C%0A&amp;&amp;4&amp;6%20%5C%5C%0A8&amp;5&amp;6&amp;7%0A%5Cend%7Bpmatrix%7D,%0A"></p>
<p>then we can plot the associated graph, <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D">.</p>
<div class="cell">
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
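<p>For concreteness (this snippet is mine, not the post’s), here is the edge list of that graph read straight off the matrix, using zero-based vertex labels:</p>

```python
import numpy as np

# The matrix A from above, blanks written as zeros
A = np.array([[1, 2, 0, 8],
              [2, 3, 0, 5],
              [0, 0, 4, 6],
              [8, 5, 6, 7]])

n = A.shape[0]
# The edges are exactly the pairs (i, j) with i < j and A[i, j] != 0
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] != 0]
print(edges)  # [(0, 1), (0, 3), (1, 3), (2, 3)]
```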
<p>But why do we care about graphs?</p>
<p>We care because they let us answer our question for this section: <em>which elements of the Cholesky factor <img src="https://latex.codecogs.com/png.latex?L"> are non-zero?</em></p>
<p>It is useful to write the algorithm out for a second time<sup>17</sup>, but this time closer to how we will implement it.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb7-1">L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.tril(A)</span>
<span id="cb7-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb7-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(j):</span>
<span id="cb7-4">    L[j:n, j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-=</span> L[j, k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> L[j:n, k]</span>
<span id="cb7-5">  L[j,j]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(L[j,j])</span>
<span id="cb7-6">  L[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:n, j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:n, j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> L[j, j]</span></code></pre></div>
</div>
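<p>That column-at-a-time pseudocode translates directly into a runnable function. Here is a self-contained sketch (mine, not the post’s eventual implementation), checked against <code>numpy.linalg.cholesky</code>:</p>

```python
import numpy as np

def col_cholesky(A):
    # Left-looking, column-at-a-time Cholesky
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for j in range(n):
        for k in range(j):
            # subtract the contribution of every previous column
            L[j:n, j] -= L[j, k] * L[j:n, k]
        L[j, j] = np.sqrt(L[j, j])
        L[j + 1:n, j] /= L[j, j]
    return L

# Check against numpy on a random SPD matrix
rng = np.random.default_rng(42)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
assert np.allclose(col_cholesky(A), np.linalg.cholesky(A))
```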
<p>If we stare at this long enough we can work out when <img src="https://latex.codecogs.com/png.latex?L_%7Bij%7D"> is going to be potentially non-zero.</p>
<p>And here is where we have to take a quick zoom out. We are <em>not</em> interested if the numerical entry <img src="https://latex.codecogs.com/png.latex?L_%7Bij%7D"> is <em>actually</em> non-zero. We are interested if it <em>could be</em> non-zero. Why? Because this will allow us to set up our storage scheme for the sparse Cholesky factor. And it will tell us exactly which bits of the above loops we actually need to do!</p>
<p>So with that motivation in mind, can we spot the non-zeros? Well. I’ll be honest with you. I struggle at this game. This is part of why I do not like thinking about graphs<sup>18</sup>. But with a piece of paper and a bit of time, I can convince myself that <img src="https://latex.codecogs.com/png.latex?L_%7Bij%7D"> is potentially non-zero (or a <em>structural</em> non-zero) if:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?A_%7Bij%7D"> is non-zero (because <code>tmp[i-j]</code> is non-zero!), or</li>
<li><img src="https://latex.codecogs.com/png.latex?L_%7Bik%7D%20%5Cneq%200"> <em>and</em> <img src="https://latex.codecogs.com/png.latex?L_%7Bjk%7D%20%5Cneq%200"> for some <img src="https://latex.codecogs.com/png.latex?k%20%3C%20%5Cmin%5C%7Bi,%20j%5C%7D"> (because that is the only time an element of <code>tmp</code> is updated through <code>tmp[i] = tmp[i] - L[i, k] * L[j, k]</code>)</li>
</ul>
<p>If we dig into the second condition a bit more,<sup>19</sup> we notice that the second case can happen if and only if there is a path in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"><sup>20</sup> from node <img src="https://latex.codecogs.com/png.latex?i"> to node <img src="https://latex.codecogs.com/png.latex?j"> <img src="https://latex.codecogs.com/png.latex?%0Ai%20%5Crightarrow%20v_1%20%5Crightarrow%20v_2%20%5Crightarrow%20%5Cldots%20%5Crightarrow%20v_%7B%5Cell-1%7D%20%5Crightarrow%20j%0A"> with <img src="https://latex.codecogs.com/png.latex?v_1,%20%5Cldots%20v_%7B%5Cell-1%7D%20%3C%20%5Cmin%5C%7Bi,j%5C%7D">. The proof is an induction on <img src="https://latex.codecogs.com/png.latex?%5Cmin%5C%7Bi,j%5C%7D"> that I can’t be arsed typing out.</p>
<p>(As an aside, Theorem 2.8 in <a href="https://www.routledge.com/Gaussian-Markov-Random-Fields-Theory-and-Applications/Rue-Held/p/book/9781584884323">Rue and Held’s book</a> gives a very clean statistical proof of this result.)</p>
<p>This is enough to see that fill in patterns are going to be a complex thing.</p>
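<p>To make that path characterisation concrete, here is a brute-force sketch (a hypothetical helper of mine, hopeless at scale but fine for toy graphs, with zero-based vertex labels) that predicts the structural non-zeros directly from it:</p>

```python
def structural_nonzeros(n, edges):
    # Brute-force version of the rule above: L[i, j] (with i > j) is a
    # structural non-zero iff there is a path from j to i that only
    # passes through vertices numbered below min(i, j) = j.
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def connected_below(i, j):
        # depth-first search from j, only stepping through vertices < j
        stack, seen = [j], {j}
        while stack:
            v = stack.pop()
            if v == i:
                return True
            for w in adj[v]:
                if w not in seen and (w == i or w < j):
                    seen.add(w)
                    stack.append(w)
        return False

    return {(i, j) for j in range(n) for i in range(j + 1, n)
            if connected_below(i, j)}

# A star graph with the hub labelled 0: everything fills in
hub_first = [(0, k) for k in range(1, 6)]
print(len(structural_nonzeros(6, hub_first)))  # 15: dense below the diagonal

# The same star with the hub labelled last: no fill at all
hub_last = [(5, k) for k in range(5)]
print(sorted(structural_nonzeros(6, hub_last)))
# [(5, 0), (5, 1), (5, 2), (5, 3), (5, 4)]
```

This is exactly the pattern the toy example below confirms numerically.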
<section id="a-toy-example" class="level3">
<h3 class="anchored" data-anchor-id="a-toy-example">A toy example</h3>
<p>Consider the following graph</p>
<div class="cell">
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>It’s pretty clear that there is a path between <img src="https://latex.codecogs.com/png.latex?(i,j)"> for every pair <img src="https://latex.codecogs.com/png.latex?(i,j)"> (the path goes through the fully connected vertex, which is labelled <code>1</code>).</p>
<p>And indeed, we can check this numerically<sup>21</sup></p>
<div class="cell">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(Matrix)</span>
<span id="cb8-2">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span></span>
<span id="cb8-3">A <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sparseMatrix</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">i =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>n, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,n)), </span>
<span id="cb8-4">                  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">j =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,n),<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>n), </span>
<span id="cb8-5">                  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, </span>
<span id="cb8-6">                  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dims =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(n,n)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb8-7">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Diagonal</span>(n)</span>
<span id="cb8-8">A <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#print the non-zero structure</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>6 x 6 sparse Matrix of class "lgCMatrix"
                
[1,] | | | | | |
[2,] | | . . . .
[3,] | . | . . .
[4,] | . . | . .
[5,] | . . . | .
[6,] | . . . . |</code></pre>
</div>
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb10-1">L <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">chol</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(A))) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># transpose is for R reasons</span></span>
<span id="cb10-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(L, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fully dense!</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  0.8  0.0  0.0  0.0  0.0    0
[2,] -0.3  1.0  0.0  0.0  0.0    0
[3,] -0.3 -0.1  1.0  0.0  0.0    0
[4,] -0.3 -0.1 -0.1  1.0  0.0    0
[5,] -0.3 -0.1 -0.1 -0.1  1.0    0
[6,] -0.3 -0.1 -0.1 -0.1 -0.1    1</code></pre>
</div>
</div>
<p>But what if we changed the labels of our vertices? What is the fill in pattern implied by a labelling where the fully connected vertex is labelled last instead of first?</p>
<div class="cell">
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>There are now <em>no paths</em> from <img src="https://latex.codecogs.com/png.latex?i"> to <img src="https://latex.codecogs.com/png.latex?j"> that only go through lower-numbered vertices. So there is no fill in! We can check this numerically!<sup>22</sup></p>
<div class="cell">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb12-1">A2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> A[n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb12-2">L2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">chol</span>(A2))</span>
<span id="cb12-3">L2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>6 x 6 sparse Matrix of class "ltCMatrix"
                
[1,] | . . . . .
[2,] . | . . . .
[3,] . . | . . .
[4,] . . . | . .
[5,] . . . . | .
[6,] | | | | | |</code></pre>
</div>
</div>
</section>
<section id="so-what-is-the-lesson-here" class="level3">
<h3 class="anchored" data-anchor-id="so-what-is-the-lesson-here">So what is the lesson here?</h3>
<p>The lesson is that the sparse Cholesky algorithm cares <em>deeply</em> about what order the rows and columns of the matrix are in. This is why, <a href="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/">in the previous post</a>, we put the dense rows and columns of <img src="https://latex.codecogs.com/png.latex?Q_%7Bu%20%5Cmid%20y,%20%5Ctheta%7D"> at the <em>end</em> of the matrix!</p>
<p>Luckily, a lot of clever graph theorists got on the job a while back and found a number of good algorithms for finding decent<sup>23</sup> ways to reorder the vertices of a graph to minimise fill in. There are two particularly well-known reorderings: the approximate minimum degree (AMD) reordering and the nested-dissection reordering. Neither of these are easily available in Python<sup>24</sup>.</p>
<p>AMD is a bog-standard black-box greedy reordering: it tries to label the next vertex so that the graph you get after removing that vertex and adding edges between all of the nodes that connect to it isn’t too fucked.</p>
<p>Nested dissection tries to generalise the toy example above by finding nodes that separate the graph into two minimally connected components. The separator node is then labelled last. The process is repeated until you run out of nodes. This algorithm can be very efficient in some cases (eg if the graph is planar<sup>25</sup>, the sparse Cholesky algorithm using this reordering <a href="https://link.springer.com/article/10.1007/BF01396660">provably costs</a> at most <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E%7B3/2%7D)">).</p>
<p>Typically, you compute multiple reorderings<sup>26</sup> and pick the one that results in the least fill in.</p>
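<p>While AMD and nested dissection aren’t in scipy, it does ship one classical ordering: reverse Cuthill–McKee. It’s a bandwidth-reducing ordering rather than a fill-reducing one, but it shows the workflow: compute a permutation, permute, factorise, count the non-zeros. A sketch (mine, with a dense Cholesky and a hypothetical <code>chol_nnz</code> helper standing in for a real sparse factorisation):</p>

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# The "arrow" matrix from the toy example: a hub connected to everything
n = 6
A = np.eye(n)
A[0, 1:] = A[1:, 0] = -0.2

def chol_nnz(M):
    # Count the (numerically) non-zero entries of the Cholesky factor
    return np.count_nonzero(np.round(np.linalg.cholesky(M), 10))

perm = reverse_cuthill_mckee(csc_matrix(A), symmetric_mode=True)
A_perm = A[np.ix_(perm, perm)]
print(chol_nnz(A), chol_nnz(A_perm))  # the ordering changes the fill!
```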
</section>
</section>
<section id="which-elements-of-the-cholesky-factor-are-non-zero-aka-symbolic-factorisation" class="level2">
<h2 class="anchored" data-anchor-id="which-elements-of-the-cholesky-factor-are-non-zero-aka-symbolic-factorisation">Which elements of the Cholesky factor are non-zero (aka symbolic factorisation)</h2>
<p>Ok. So I guess we’ve got to work out an algorithm for computing the non-zero structure of a sparse Cholesky factor. Naively, this seems easy: just use the Cholesky algorithm and mark which elements are non-zero.</p>
<p>But this is slow and inefficient. You’re not thinking like a programmer! Or a graph theorist. So let’s talk about how to do this efficiently.</p>
<section id="the-elimination-tree" class="level3">
<h3 class="anchored" data-anchor-id="the-elimination-tree">The elimination tree</h3>
<p>Let’s consider the graph <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D_L"> that contains the sparsity pattern of <img src="https://latex.codecogs.com/png.latex?L">. We <em>know</em> that the non-zero structure consists of all <img src="https://latex.codecogs.com/png.latex?(i,j)"> such that <img src="https://latex.codecogs.com/png.latex?i%20%3C%20j"> and there is a path in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> from <img src="https://latex.codecogs.com/png.latex?i"> to <img src="https://latex.codecogs.com/png.latex?j"> that only passes through lower-numbered vertices. This means we could just compute that and make <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D_L">.</p>
<p>The thing that you should notice immediately is that there is a lot of redundancy in this structure. Remember that if <img src="https://latex.codecogs.com/png.latex?L_%7Bik%7D"> is non-zero and <img src="https://latex.codecogs.com/png.latex?L_%7Bjk%7D"> is also non-zero, then <img src="https://latex.codecogs.com/png.latex?L_%7Bij%7D"> is also non-zero.</p>
<p>This suggests that if we have <img src="https://latex.codecogs.com/png.latex?(i,k)"> and <img src="https://latex.codecogs.com/png.latex?(j,k)"> in the graph, we can remove the edge <img src="https://latex.codecogs.com/png.latex?(i,j)"> from <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D_L"> and still be able to work out that <img src="https://latex.codecogs.com/png.latex?L_%7Bij%7D"> is non-zero. This new graph is no longer the graph associated with <img src="https://latex.codecogs.com/png.latex?L"> but, for our purposes, it contains the same information.</p>
<p>If we continue pruning the graph this way, we are going to end up with a<sup>27</sup> rooted tree! From this tree, which is called the <em>elimination tree</em> of <img src="https://latex.codecogs.com/png.latex?A"><sup>28</sup>, we can easily work out the non-zero structure of <img src="https://latex.codecogs.com/png.latex?L">.</p>
<p>The elimination tree is the fundamental structure needed to build an efficient sparse Cholesky algorithm. We are not going to use it to its full potential, but it is very cheap to compute (roughly<sup>29</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Coperatorname%7Bnnz%7D(A))"> operations).</p>
<p>Once we have the elimination tree, it’s cheap to compute properties of <img src="https://latex.codecogs.com/png.latex?L"> like the number of non-zeros in a column, the exact sparsity pattern of every column, which columns can be grouped together to form supernodes<sup>30</sup>, and the approximate minimum degree reordering.</p>
<p>All of those things would be necessary for a modern, industrial-strength sparse Cholesky factorisation. But, and I cannot stress this enough, fuck that shit.</p>
</section>
<section id="the-symbolic-factorisation" class="level3">
<h3 class="anchored" data-anchor-id="the-symbolic-factorisation">The symbolic factorisation</h3>
<p>We are doing the easy version. Which is to say I <em>refuse</em> to do anything here that couldn’t be easily done in the early 90s. Specifically, we are going to use the version of this that <a href="http://heath.cs.illinois.edu/courses/cs598mh/george_liu.pdf">George, Liu, and Ng</a> wrote about<sup>31</sup> in the 90s. Understanding this is, I think, enough to see how things like supernodal factorisations work, but it’s so much less to keep track of.</p>
<p>The nice thing about this method is that we compute the elimination tree implicitly as we go along.</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D_j"> be the non-zero entries in the <img src="https://latex.codecogs.com/png.latex?j">th column of <img src="https://latex.codecogs.com/png.latex?L">. Then our discussion in the previous section tells us that we need to determine the <em>reach</em> of the node <img src="https://latex.codecogs.com/png.latex?j"> <img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BReach%7D(j,%20S_j)%20=%20%5Cleft%5C%7Bi:%20%5Ctext%7Bthere%20is%20a%20path%20from%20%7D%20i%5Ctext%7B%20to%20%7Dj%5Ctext%7B%20through%20%7DS_j%5Cright%5C%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?S_j%20=%20%5C%7B1,%5Cldots,%20j-1%5C%7D">.</p>
<p>If we can compute the reach, then <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D_j%20%20=%20%5Ctext%7BReach%7D(j,%20S_j)%20%5Ccup%5C%7Bj%5C%7D">!</p>
<p>This is where the elimination tree comes in: it is an efficient representation of these sets. Indeed, <img src="https://latex.codecogs.com/png.latex?i%20%5Cin%20%5Ctext%7BReach%7D(j,%20S_j)"> <em>if and only if</em> there is a directed<sup>32</sup> path from <img src="https://latex.codecogs.com/png.latex?j"> to <img src="https://latex.codecogs.com/png.latex?i"> in the elimination tree! Now this tree is ordered<sup>33</sup> so that if <img src="https://latex.codecogs.com/png.latex?i"> is a child of <img src="https://latex.codecogs.com/png.latex?j"> (aka directly below it in the tree), then <img src="https://latex.codecogs.com/png.latex?i%20%3C%20j">. This means that its column in the Cholesky factorisation has already been computed. So all of the nodes that can be reached from <img src="https://latex.codecogs.com/png.latex?j"> by going through <img src="https://latex.codecogs.com/png.latex?i"> are in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D_%7Bi%7D%20%5Ccap%20%5C%7Bj+1,%20%5Cldots,%20n%5C%7D">.</p>
<p>This means that we can compute the non-zeros of the <img src="https://latex.codecogs.com/png.latex?j">th column of <img src="https://latex.codecogs.com/png.latex?L"> efficiently from the non-zeros of all of the (very few, hopefully) columns associated with the child nodes of <img src="https://latex.codecogs.com/png.latex?j">.</p>
<p>So all that’s left is to ask “how can we find the child?” (as phones around the city start buzzing). Well, a little bit of thinking time should convince you that if <img src="https://latex.codecogs.com/png.latex?%0Ap%20=%20%5Cmin%5C%7Bi%20:%20i%20%5Cin%20%5Ctext%7BReach%7D(j,%20S_j)%20%5C%7D,%0A"> then <img src="https://latex.codecogs.com/png.latex?p"> is the parent of <img src="https://latex.codecogs.com/png.latex?j">. Or, the parent of column <img src="https://latex.codecogs.com/png.latex?j"> is the index of its first<sup>34</sup> non-zero below the diagonal.</p>
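<p>You can see this parent rule in action without any sparse machinery at all. A toy sketch (mine; it cheats by reading the pattern off a dense factor, which defeats the whole point, but it shows the rule):</p>

```python
import numpy as np

def etree_from_L(L, tol=1e-12):
    # parent[j] = row index of the first sub-diagonal non-zero in column j
    n = L.shape[0]
    parent = [-1] * n  # -1 marks a root of the elimination tree
    for j in range(n):
        below = np.nonzero(np.abs(L[j + 1:, j]) > tol)[0]
        if below.size > 0:
            parent[j] = int(j + 1 + below[0])
    return parent

# Arrow matrix with the hub labelled last: every leaf's parent is the hub
n = 6
A = np.eye(n)
A[5, :5] = A[:5, 5] = -0.2
L = np.linalg.cholesky(A)
print(etree_from_L(L))  # [5, 5, 5, 5, 5, -1]
```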
<p>We can put all of these observations together into the following algorithm. We assume that we are given the non-zero structure of <code>tril(A)</code> (aka the lower-triangle of <img src="https://latex.codecogs.com/png.latex?A">).</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb14-2"></span>
<span id="cb14-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _symbolic_factor_csc(A_indices, A_indptr):</span>
<span id="cb14-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assumes A_indices and A_indptr index the lower triangle of $A$ ONLY.</span></span>
<span id="cb14-5">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb14-6">  L_sym <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb14-7">  children <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [np.array([], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)]</span>
<span id="cb14-8">  </span>
<span id="cb14-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n):</span>
<span id="cb14-10">    L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_indices[A_indptr[j]:A_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb14-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> children[j]:</span>
<span id="cb14-12">      tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[child][L_sym[child] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> j]</span>
<span id="cb14-13">      L_sym[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.unique(np.append(L_sym[j], tmp))</span>
<span id="cb14-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_sym[j]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb14-15">      p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_sym[j][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb14-16">      children[p] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.append(children[p], j)</span>
<span id="cb14-17">        </span>
<span id="cb14-18">  L_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb14-19">  L_indptr[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cumsum([<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> L_sym])</span>
<span id="cb14-20">  L_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate(L_sym)</span>
<span id="cb14-21">  </span>
<span id="cb14-22">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr</span>
<span id="cb14-23">  </span></code></pre></div>
</div>
<p>This was the first piece of Python I’ve written in about 13 years<sup>35</sup>, so it’s a bit shit. Nevertheless, it works. It is possible to replace the <code>children</code> structure with a linked list implemented in a length-n integer array<sup>36</sup>, but why bother? This function is only run once.</p>
<p>It’s also worth noting that the <code>children</code> array expresses the elimination tree. If we were going to do something with it explicitly, we could just spit it out and reshape it into a more useful data structure.</p>
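<p>If you did want the elimination tree explicitly, the parent of node <code>j</code> is just the first off-diagonal row index in column <code>j</code> of <code>L</code>. A minimal sketch (not the post’s code, and assuming each column’s row indices are sorted, which the symbolic factorisation above guarantees):</p>

```python
import numpy as np

def etree_parents(L_indices, L_indptr):
    """Parent array of the elimination tree from the CSC pattern of L.

    Assumes each column's row indices are sorted, so the first
    off-diagonal entry of column j names its parent.
    """
    n = len(L_indptr) - 1
    parent = np.full(n, -1, dtype=int)  # roots keep parent -1
    for j in range(n):
        if L_indptr[j + 1] - L_indptr[j] > 1:
            parent[j] = L_indices[L_indptr[j] + 1]
    return parent

# Arrow-shaped pattern: every column hangs off the last one.
L_indptr = np.array([0, 2, 4, 6, 7])
L_indices = np.array([0, 3, 1, 3, 2, 3, 3])
print(etree_parents(L_indices, L_indptr))  # parent of the root is -1
```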
<p>There’s one more piece of tedium before we can get to the main event: we need to do a deep copy of <img src="https://latex.codecogs.com/png.latex?A"> into the data structure of <img src="https://latex.codecogs.com/png.latex?L">. There is no<sup>37</sup> avoiding this.</p>
<p>Here is the code.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _deep_copy_csc(A_indices, A_indptr, A_x, L_indices, L_indptr):</span>
<span id="cb15-2">  n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb15-3">  L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indices))</span>
<span id="cb15-4">  </span>
<span id="cb15-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n):</span>
<span id="cb15-6">    copy_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(L_indices[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],</span>
<span id="cb15-7">                                  A_indices[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb15-8">    L_x[L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> copy_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_x[A_indptr[j]:A_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb15-9">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span></code></pre></div>
</div>
</section>
</section>
<section id="computing-the-cholesky-factorisation" class="level2">
<h2 class="anchored" data-anchor-id="computing-the-cholesky-factorisation">Computing the Cholesky factorisation</h2>
<p>It feels like we’ve been going for a really long time and we still don’t have a Cholesky factorisation. Mate. I feel your pain. Believe me.</p>
<p>But we are here now: everything is in place. We can now write down the Cholesky algorithm!</p>
<p>The algorithm is as it was before, with the main difference being that we now know two things:</p>
<ol type="1">
<li>We only need to update <code>tmp</code> with the descendants of <code>j</code> in the elimination tree.</li>
<li>That’s it. That is the only thing we know.</li>
</ol>
<p>Of course, we could use the elimination tree to do this very efficiently, but, <em>as per my last email</em>, I do not care. So we will simply build up an explicit list of all of the descendants. This will obviously be less efficient, but it’s fine for our purposes. Let’s face it, we’re all going to die eventually.</p>
<p>So here it goes.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _sparse_cholesky_csc_impl(L_indices, L_indptr, L_x):</span>
<span id="cb16-2">    n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_indptr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb16-3">    descendant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [[] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n)]</span>
<span id="cb16-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, n):</span>
<span id="cb16-5">        tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x[L_indptr[j]:L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb16-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> bebe <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> descendant[j]:</span>
<span id="cb16-7">            k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bebe[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb16-8">            Ljk<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> L_x[bebe[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb16-9">            pad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(                                                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb16-10">              L_indices[L_indptr[k]:L_indptr[k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> L_indices[L_indptr[j]])[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb16-11">            update_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nonzero(np.in1d(                                 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb16-12">              L_indices[L_indptr[j]:L_indptr[j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],                          <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb16-13">              L_indices[(L_indptr[k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pad):L_indptr[k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb16-14">            tmp[update_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp[update_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>                              <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb16-15">              Ljk <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> L_x[(L_indptr[k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pad):L_indptr[k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb16-16">            </span>
<span id="cb16-17">        diag <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(tmp[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb16-18">        L_x[L_indptr[j]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> diag</span>
<span id="cb16-19">        L_x[(L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tmp[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> diag</span>
<span id="cb16-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(L_indptr[j] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, L_indptr[j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb16-21">            descendant[L_indices[idx]].append((j, idx))</span>
<span id="cb16-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_x</span></code></pre></div>
</div>
<p>The one thing that you’ll note in this code<sup>38</sup> is that we are implicitly using things that we know about the sparsity structure of the <img src="https://latex.codecogs.com/png.latex?j">th column. In particular, we <em>know</em> that the sparsity structure of the <img src="https://latex.codecogs.com/png.latex?j">th column is the <em>union</em> of the relevant parts of the sparsity structure of its descendant columns. This is what allows a lot of our faster indexing to work.</p>
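<p>That union property is easy to spot-check on a pattern. A small sanity-check sketch (not part of the post’s pipeline): for every column <code>k</code> and every off-diagonal row <code>j</code> stored in it, the part of column <code>k</code>’s pattern at or below row <code>j</code> must be contained in column <code>j</code>’s pattern.</p>

```python
import numpy as np

def check_union_property(L_indices, L_indptr):
    """Check the fact used above: if L[j, k] != 0 for k < j, then the
    row pattern of column k at or below j is contained in that of
    column j. A sanity-check sketch, not part of the factorisation."""
    n = len(L_indptr) - 1
    cols = [set(L_indices[L_indptr[j]:L_indptr[j + 1]]) for j in range(n)]
    for k in range(n):
        for j in L_indices[L_indptr[k] + 1:L_indptr[k + 1]]:
            tail = {i for i in cols[k] if i >= j}
            if not tail <= cols[j]:
                return False
    return True

# The (already-filled) pattern of a valid Cholesky factor passes.
L_indptr = np.array([0, 3, 5, 6])
L_indices = np.array([0, 1, 2, 1, 2, 2])
print(check_union_property(L_indices, L_indptr))  # True
```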
<p>Finally, we can put it all together.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sparse_cholesky_csc(A_indices, A_indptr, A_x):</span>
<span id="cb17-2">    L_indices, L_indptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _symbolic_factor_csc(A_indices, A_indptr)</span>
<span id="cb17-3">    L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _deep_copy_csc(A_indices, A_indptr, A_x, L_indices, L_indptr)</span>
<span id="cb17-4">    L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _sparse_cholesky_csc_impl(L_indices, L_indptr, L_x)</span>
<span id="cb17-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> L_indices, L_indptr, L_x</span></code></pre></div>
</div>
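<p>For what it’s worth, once you have the factor you would typically use it to solve <code>A x = b</code> with a forward and a backward triangular solve. A sketch of that (not from the post, using a dense-Cholesky stand-in for the factor and <code>scipy</code>’s <code>spsolve_triangular</code>, which wants CSR input):</p>

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve_triangular

# With A = L L^T, solve A x = b by a forward and a backward solve.
rng = np.random.default_rng(42)
n = 6
M = sparse.random(n, n, density=0.4, random_state=rng)
A = (M @ M.T + n * sparse.eye(n)).tocsc()          # small SPD test matrix
L = sparse.csc_array(np.linalg.cholesky(A.toarray()))  # stand-in factor

b = rng.standard_normal(n)
y = spsolve_triangular(L.tocsr(), b, lower=True)    # forward solve: L y = b
x = spsolve_triangular(L.T.tocsr(), y, lower=False) # backward solve: L^T x = y

print(np.allclose(A @ x, b))  # True
```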
<p>Right. Let’s test it. We’re going to work on a particular<sup>39</sup> sparse matrix.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sparse</span>
<span id="cb18-2"></span>
<span id="cb18-3">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb18-4">one_d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.diags([[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n, [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)], [<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb18-5">A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.kronsum(one_d, one_d) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> sparse.eye(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>n)</span>
<span id="cb18-6">A_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.tril(A, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"csc"</span>)</span>
<span id="cb18-7">A_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indices</span>
<span id="cb18-8">A_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.indptr</span>
<span id="cb18-9">A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_lower.data</span>
<span id="cb18-10"></span>
<span id="cb18-11">L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky_csc(A_indices, A_indptr, A_x)</span>
<span id="cb18-12">L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.csc_array((L_x, L_indices, L_indptr), shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb18-13"></span>
<span id="cb18-14">err <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>((A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> L.transpose()).todense()))</span>
<span id="cb18-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Error in Cholesky is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>err<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Error in Cholesky is 3.871041263071504e-12</code></pre>
</div>
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb20-1">nnz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_x)</span>
<span id="cb20-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Number of non-zeros is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>nnz<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (fill in of </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_x)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Number of non-zeros is 125049 (fill in of 117649)</code></pre>
</div>
</div>
<p>Finally, let’s demonstrate that we can reduce the amount of fill-in with a reordering. The only built-in permutation in <code>scipy</code> is reverse Cuthill–McKee, which is designed to reduce bandwidth rather than fill-in, so we will not see much of a difference. But nevertheless. It’s there.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb22-1">perm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.csgraph.reverse_cuthill_mckee(A, symmetric_mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb22-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(perm)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[2499 2498 2449 ...   50    1    0]</code></pre>
</div>
<div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb24-1">A_perm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A[perm[:,<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>], perm]</span>
<span id="cb24-2">A_perm_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.tril(A_perm, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"csc"</span>)</span>
<span id="cb24-3">A_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_perm_lower.indices</span>
<span id="cb24-4">A_indptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_perm_lower.indptr</span>
<span id="cb24-5">A_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_perm_lower.data</span>
<span id="cb24-6"></span>
<span id="cb24-7">L_indices, L_indptr, L_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse_cholesky_csc(A_indices, A_indptr, A_x)</span>
<span id="cb24-8">L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sparse.csc_array((L_x, L_indices, L_indptr), shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb24-9">err <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>((A_perm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> L <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> L.transpose()).todense()))</span>
<span id="cb24-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Error in Cholesky is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>err<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Error in Cholesky is 3.0580421951974465e-12</code></pre>
</div>
<div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb26-1">nnz_rcm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_x)</span>
<span id="cb26-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Number of non-zeros is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>nnz_rcm<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (fill in of </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(L_x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(A_x)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">),</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">which is less than the unpermuted matrix, which had </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>nnz<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> non-zeros."</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Number of non-zeros is 87025 (fill in of 79625),
which is less than the unpermuted matrix, which had 125049 non-zeros.</code></pre>
</div>
</div>
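<p>As an aside: the reverse Cuthill-McKee permutation behind the <code>nnz_rcm</code> count above is available directly from <code>scipy.sparse.csgraph</code>. Here is a self-contained toy sketch (a small cycle-shaped matrix, not the Laplacian from this post) showing the permutation shrink the bandwidth:</p>

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Toy symmetric matrix: a path graph plus one long-range "wrap" edge,
# which gives the unpermuted matrix a huge bandwidth.
n = 50
A = sp.diags([np.full(n - 1, -1.0), np.full(n, 4.0), np.full(n - 1, -1.0)],
             [-1, 0, 1], format="csr")
A = A.tolil()
A[0, n - 1] = -1.0
A[n - 1, 0] = -1.0
A = A.tocsr()

# RCM returns a permutation; apply it symmetrically to rows and columns.
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm, :][:, perm]

def bandwidth(M):
    # largest |i - j| over the non-zeros of M
    M = M.tocoo()
    return int(np.max(np.abs(M.row - M.col)))
```

<p>Bandwidth is a cruder measure than fill-in, but the Cholesky factor of a banded matrix fills in only within the band, so a smaller bandwidth also bounds the fill-in.</p>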
<p>And finally, let’s check that we’ve not made some fake non-zeros. To do this we need to wander back into <code>R</code> because <code>scipy</code> doesn’t have a sparse Cholesky<sup>40</sup> factorisation.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb28-1">ind <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> py<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>A_indices</span>
<span id="cb28-2">indptr <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> py<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>A_indptr</span>
<span id="cb28-3">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(py<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>A_x)</span>
<span id="cb28-4">A <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sparseMatrix</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">i =</span> ind <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p =</span> indptr, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">symmetric =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb28-5"></span>
<span id="cb28-6">L <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">chol</span>(A))</span>
<span id="cb28-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(L<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span>i <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> py<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>L_indices)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0</code></pre>
</div>
<div class="sourceCode cell-code" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb30-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(L<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span>p <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> py<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>L_indptr)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0</code></pre>
</div>
</div>
<p>Perfect.</p>
</section>
<section id="ok-we-are-done-for-today." class="level2">
<h2 class="anchored" data-anchor-id="ok-we-are-done-for-today.">Ok we are done for today.</h2>
<p>I was hoping that we were going to make it to the JAX implementation, but this is long enough now. And I suspect that there will be some <em>issues</em> that are going to come up.</p>
<p>If you want some references, I recommend:</p>
<ul>
<li><a href="http://heath.cs.illinois.edu/courses/cs598mh/george_liu.pdf">George, Liu, and Ng’s notes</a> (warning: FORTRAN).</li>
<li><a href="https://epubs.siam.org/doi/book/10.1137/1.9780898718881">Timothy Davis’ book</a> (warning: pure C).</li>
<li>Liu’s <a href="https://epubs.siam.org/doi/10.1137/0611010">survey paper about elimination trees</a> (warning: trees).</li>
<li><a href="https://www.routledge.com/Gaussian-Markov-Random-Fields-Theory-and-Applications/Rue-Held/p/book/9781584884323">Rue and Held’s book</a> (Statistically motivated).</li>
</ul>
<p>Obviously this is a massive area and I obviously did not do it justice in a single blog post. It’s well worth looking further into. It is very cool. And obviously, <em>I go through all this</em><sup>41</sup> to get a prototype that I can play with all of the bits of. For the love of god, use Cholmod or Eigen or MUMPS or literally anything else. The only reason to write these yourself is to learn how they work.</p>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>The old numerical linear algebra naming conventions: Symmetric letters are symmetric matrices, upper case is a matrix, lower case is a vector, etc etc etc. Obviously, all conventions in statistics go against this so who really cares. Burn it all down.↩︎</p></li>
<li id="fn2"><p>Go girl. Give us nothing.↩︎</p></li>
<li id="fn3"><p>or scalars↩︎</p></li>
<li id="fn4"><p>This is actually how you check if a matrix is SPD. Such a useful algorithm!↩︎</p></li>
<li id="fn5"><p>This variant is called the left-looking Cholesky. There are 6 distinct ways to rearrange these computations that lead to algorithms that are well-adapted to different structures. The left-looking algorithm is well adapted to matrices stored column-by-column. But it is not the only one! The variant of the sparse Cholesky in Matlab and Eigen is the upward-looking Cholesky. CHOLMOD uses the left-looking Cholesky (because that’s how you get supernodes). MUMPS uses the right-looking variant. Honestly this is a fucking fascinating wormhole you can fall down. A solid review of some of the possibilities is in Chapter 4 of Tim Davis’ book.↩︎</p></li>
<li id="fn6"><p>Here <code>A</code> is a <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20n"> matrix and <code>u'</code> is the transpose of the vector <code>u</code>.↩︎</p></li>
<li id="fn7"><p>You can also see that if <img src="https://latex.codecogs.com/png.latex?A"> is stored in memory by stacking the columns, this algorithm is set up to be fairly memory efficient. Of course, if you find yourself caring about what your cache is doing, you’ve gone astray somewhere. That is why professionals have coded this up (only a fool competes with LAPACK).↩︎</p></li>
<li id="fn8"><p>The ultimate language of scientific computing. Do not slide into my DMs and suggest Julia is.↩︎</p></li>
<li id="fn9"><p>You may be thinking <em>well surely we have to use a row-major ordering</em>. But honey let me tell you. We are building our own damn storage method, so we can order it however we bloody want. Also, somewhere down the line I’m going to do this in Eigen, which is column major by default.↩︎</p></li>
<li id="fn10"><p>If you look at the algorithm, you’ll see that we only need to store the diagonal and the entries below. This is enough (in general) because we know the matrix is symmetric!↩︎</p></li>
<li id="fn11"><p>CPU operations are a lot less memory-limited than they used to be, but nevertheless it piles up. GPU operations still very much are, but sparse matrix operations mostly don’t have the arithmetic intensity to be worth putting on a GPU.↩︎</p></li>
<li id="fn12"><p>(NB: zero-based indexing!) This is a superfluous entry (the information is available elsewhere), but having it in makes life just a million times easier because you don’t have to treat the final column separately!↩︎</p></li>
<li id="fn13"><p>ZERO BASED, PYTHON SLICES↩︎</p></li>
<li id="fn14"><p>I am not a headless torso that can’t host. I differentiate.↩︎</p></li>
<li id="fn15"><p>We only care about undirected graphs↩︎</p></li>
<li id="fn16"><p>Or from <img src="https://latex.codecogs.com/png.latex?0"> to <img src="https://latex.codecogs.com/png.latex?n-1"> if you have hate in your heart and darkness in your soul.↩︎</p></li>
<li id="fn17"><p>To get from the previous version of the algorithm to this, we unwound all of those beautiful vectorised matrix-vector products. This would be a terrible idea if we were doing a dense Cholesky, but as general rule if you are implementing your own dense Cholesky factorisation you have already committed to a terrible idea. (The same, to be honest, is true for sparse Choleskys. But nevertheless, she persisted.)↩︎</p></li>
<li id="fn18"><p>or trees or really any discrete structure.↩︎</p></li>
<li id="fn19"><p>Don’t kid yourself, <a href="https://epubs.siam.org/doi/10.1137/0205021">we look this shit up</a>.↩︎</p></li>
<li id="fn20"><p>This means that all of the pairs <img src="https://latex.codecogs.com/png.latex?(i,%20v_1)">, <img src="https://latex.codecogs.com/png.latex?(v_i,%20v_%7Bi+1%7D)"> and <img src="https://latex.codecogs.com/png.latex?(v_%7B%5Cell-1%7D,%20v_j)"> are all in the edge set <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BE%7D">↩︎</p></li>
<li id="fn21"><p>The specific choices made in building this matrix are there to make sure it’s positive definite. The transpose is there because in R, <code>R &lt;- chol(A)</code> returns an <em>upper</em> triangular matrix that satisfies <img src="https://latex.codecogs.com/png.latex?A%20=%20R%5ETR">. I assume this is because C has row-major storage, but I honestly don’t care enough to look it up.↩︎</p></li>
<li id="fn22"><p>Here the <code>pivot = FALSE</code> option is needed because the default for a sparse Cholesky decomposition in R is to re-order the vertices to try to minimise the fill-in. But that goes against the example!↩︎</p></li>
<li id="fn23"><p>Finding the minimum fill reordering is NP-hard, so everything is heuristic.↩︎</p></li>
<li id="fn24"><p>scipy has the reverse Cuthill-McKee reordering—which is shit—easily available. As far as I can tell, the easiest way to get AMD out is to factorise a sparse matrix in scipy and pull the reordering out. If I were less lazy, I’d probably just bind SuiteSparse’s AMD algorithm, which is permissively licensed. But nah. The standard nested-dissection implementation is in the METIS package, which used to have a shit license but is now Apache 2.0. Good on you METIS!↩︎</p></li>
<li id="fn25"><p>and some other cases↩︎</p></li>
<li id="fn26"><p>They are cheap to compute↩︎</p></li>
<li id="fn27"><p>Actually, you get a forest in general. You get a tree if <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> has a single connected component, otherwise you get a bunch of disjoint trees. But we still call it a tree because maths is wild.↩︎</p></li>
<li id="fn28"><p>Fun fact: it is the spanning tree of the graph of <img src="https://latex.codecogs.com/png.latex?L%20+%20L%5ET">. Was that fun? I don’t think that was fun.↩︎</p></li>
<li id="fn29"><p>This is morally but not actually true. There is a variant (slower in practice, faster asymptotically), that costs <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D%5Cleft(%5Coperatorname%7Bnnz%7D(A)%5Calpha(%5Coperatorname%7Bnnz%7D(A),%20n)%5Cright)">, where <img src="https://latex.codecogs.com/png.latex?%5Calpha(m,n)"> is the inverse Ackerman function, which is a very slowly growing function that is always equal to 4 for our purposes. The actual version that people use is technically <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Coperatorname%7Bnnz%7D(A)%20%5Clog%20n)">, but is faster and the <img src="https://latex.codecogs.com/png.latex?%5Clog%20n"> is never seen in practice.↩︎</p></li>
<li id="fn30"><p>This is beyond the scope, but basically it’s trying to find groups of nodes that can be eliminated as a block using dense matrix operations. This leads to a much more efficient algorithm.↩︎</p></li>
<li id="fn31"><p>There is, of course, a typo in the algorithm we’re about to implement. We’re using the correct version from <a href="https://epubs.siam.org/doi/10.1137/0611010">here</a>.↩︎</p></li>
<li id="fn32"><p>from parent to child (aka in descending node order)↩︎</p></li>
<li id="fn33"><p>by construction↩︎</p></li>
<li id="fn34"><p>If there are no non-zeros below the diagonal, then we have a root of one of the trees in the forest!↩︎</p></li>
<li id="fn35"><p>I did not make it prettier because a) I think it’s useful to show bad code sometimes, and b) I can’t be arsed. The real file has some comments in it because I am not a monster, but in some sense this whole damn blog is a code comment.↩︎</p></li>
<li id="fn36"><p>The George, Liu, Ng book does that in FORTRAN. Enjoy decoding it.↩︎</p></li>
<li id="fn37"><p>Well, there is some avoiding this. If the amount of fill in is small, it may be more efficient to do insertions instead. But again, I am not going to bother. And anyway. If <code>A_x</code> is a JAX array, it’s going to be immutable and we are not going to be able to avoid the deep copy.↩︎</p></li>
<li id="fn38"><p>and in the deep copy code↩︎</p></li>
<li id="fn39"><p>This is the discretisation of a 2D laplacian on a square with some specific boundary conditions↩︎</p></li>
<li id="fn40"><p>Cholmod, which is the natural choice, is GPL’d, which basically means it can’t be used in something like Scipy. R does not have this problem.↩︎</p></li>
<li id="fn41"><p>Björk voice↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Sparse {Matrices} 2: {An} Invitation to a Sparse {Cholesky}
    Factorisation},
  date = {2022-03-31},
  url = {https://dansblog.netlify.app/2022-03-23-getting-jax-to-love-sparse-matrices},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Sparse Matrices 2: An Invitation to a Sparse
Cholesky Factorisation.”</span> March 31, 2022. <a href="https://dansblog.netlify.app/2022-03-23-getting-jax-to-love-sparse-matrices">https://dansblog.netlify.app/2022-03-23-getting-jax-to-love-sparse-matrices</a>.
</div></div></section></div> ]]></description>
  <category>Sparse matrices</category>
  <category>Sparse Cholesky factorisation</category>
  <category>Python</category>
  <guid>https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/getting-jax-to-love-sparse-matrices.html</guid>
  <pubDate>Wed, 30 Mar 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-03-23-getting-jax-to-love-sparse-matrices/tori.JPG" medium="image"/>
</item>
<item>
  <title>Sparse Matrices 1: The linear algebra of linear mixed effects models and their generalisations</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/a-linear-mixed-effects-model.html</link>
  <description><![CDATA[ 





<p>Back in the early days of the pandemic I thought “I’ll have a pandemic project”. I never did my pandemic project.</p>
<p>But I did think briefly about what it would be. I want to get the types of models I like to use in everyday life efficiently implemented inside Stan. These models encapsulate (generalised) linear mixed models<sup>1</sup>, (generalised) additive models, Markovian spatial models<sup>2</sup>, and other models. A good description of the types of models I’m talking about <a href="https://arxiv.org/abs/1604.00860">can be found here</a>.</p>
<p>Many of these models can be solved efficiently via <a href="https://www.r-inla.org/">INLA</a><sup>3</sup>, a great R package for fast posterior inference for an extremely useful set of Bayesian models. In focussing on a particular class of Bayesian models, INLA leverages a bunch of structural features to make a very very fast and accurate posterior approximation. I love this stuff. It’s where I started my stats career.</p>
<p>None of the popular MCMC packages really implement the lessons learnt from INLA to help speed up their inference. I want to change that.</p>
<p>The closest we’ve gotten so far is the <a href="https://arxiv.org/abs/2004.12550">nice work Charles Margossian has been doing</a> to get Laplace approximations into Stan.</p>
<p>But I want to focus on the other key tool in INLA: <em>using sparse linear algebra to make things fast and scalable</em>.</p>
<p>I usually work with Stan, but the scale of the C++ coding<sup>4</sup> required to even tell if these ideas are useful in Stan was honestly just too intimidating.</p>
<p>But the other day I remembered Python. Now I am a shit Python programmer<sup>5</sup> and I’m not fully convinced I ever achieved object permanence. So it took me a while to remember it existed. But eventually I realised that I could probably make a decent prototype<sup>6</sup> of this idea using some modern Python tools (specifically JAX). I checked with some PyMC devs and they pointed me at what the appropriate bindings would look like.</p>
<p>So I decided to go for it.</p>
<p>Of course, I’m pretty busy and these sorts of projects have a way of dying in the arse. So I’m motivating myself by blogging it. I do not know if these ideas will work<sup>7</sup>. I do not know if my coding skills are up to it<sup>8</sup>. I do not know if I will lose interest. But it should be fun to find out.</p>
<p>So today I’m going to do the easiest part: I’m going to scope out the project. Read on, MacDuff.</p>
<section id="a-generalised-linear-mixed-effects-ish-model" class="level2">
<h2 class="anchored" data-anchor-id="a-generalised-linear-mixed-effects-ish-model">A generalised linear mixed effects-ish model</h2>
<p>If you were to open the correct textbook, or the <a href="https://www.jstatsoft.org/article/view/v067i01">Bates, Mächler, Bolker, and Walker 2015 masterpiece paper</a> that describes the workings of <code>lme4</code>, you will see the linear mixed model written as <img src="https://latex.codecogs.com/png.latex?%0Ay%20=%20X%5Cbeta%20+%20Zb%20+%20%5Cepsilon,%0A"> where</p>
<ul>
<li>the columns of <img src="https://latex.codecogs.com/png.latex?X"> contain the covariates<sup>9</sup>,</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cbeta"> is a vector of unknown regression coefficients,</li>
<li><img src="https://latex.codecogs.com/png.latex?Z"> is a known matrix that describes the random effects (basically which observation is linked to which random effect),</li>
<li><img src="https://latex.codecogs.com/png.latex?b%20%5Csim%20N(0,%20%5CSigma_b)"> is the vector of random effects with some unknown covariance matrix <img src="https://latex.codecogs.com/png.latex?%5CSigma_b">,</li>
<li>and <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%5Csim%20N(0%20,%5Csigma%5E2%20W)"> is the observation noise (here <img src="https://latex.codecogs.com/png.latex?W"> is a known diagonal matrix<sup>10</sup>).</li>
</ul>
<p>But unlike Doug Bates and his friends, my aim is to do Bayesian computation. In this situation, <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> <em>also</em> has a prior on it! In fact, I’m going to put a Gaussian prior <img src="https://latex.codecogs.com/png.latex?%5Cbeta%20%5Csim%20N(0,%20R)"> on it, for some typically known<sup>11</sup> matrix <img src="https://latex.codecogs.com/png.latex?R">.</p>
<p>This means that I can treat <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> and <img src="https://latex.codecogs.com/png.latex?b"> the same<sup>12</sup> way! And I’m going to do just that. I’m going to put them together into a vector <img src="https://latex.codecogs.com/png.latex?u%20=%20(%5Cbeta%5ET,%20b%5ET)%5ET">. Because the prior on <img src="https://latex.codecogs.com/png.latex?u"> is Gaussian<sup>13</sup>, I’m sometimes going to call <img src="https://latex.codecogs.com/png.latex?u"> the <em>Gaussian component</em> or even the <em>latent</em><sup>14</sup> Gaussian component.</p>
<p>Now that I’ve smooshed my fixed and random effects together, I don’t really need to keep <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Z"> separate. So I’m going to push them together into a rectangular matrix <img src="https://latex.codecogs.com/png.latex?%0AA%20=%20%5BX%20%5Cvdots%20Z%5D.%0A"></p>
<p>This allows us to re-write the model as <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ay%20%5Cmid%20u,%20%5Csigma%20&amp;%20%5Csim%20N(A%20u,%20%5Csigma%5E2%20W)%5C%5C%0Au%20%5Cmid%20%5Ctheta%20&amp;%5Csim%20N(0,%20Q(%5Ctheta)%5E%7B-1%7D).%0A%5Cend%7Balign*%7D"></p>
<p><em>What the hell is <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> and why are we suddenly parameterising a multivariate normal distribution by the inverse of its covariance matrix (which, if you’re curious, is known as a <em>precision</em> matrix)???</em></p>
<p>I will take your questions in reverse order.</p>
<p>We are parameterising by the precision<sup>15</sup> matrix because it will simplify our formulas and lead to faster computations. This will be a major topic for us later!</p>
<p>As to what <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> is, it is the matrix <img src="https://latex.codecogs.com/png.latex?%0AQ(%5Ctheta)%20=%20%5Cbegin%7Bpmatrix%7D%20R%5E%7B-1%7D%20&amp;%200%20%5C%5C%200%20&amp;%20%5CSigma_b%5E%7B-1%7D%5Cend%7Bpmatrix%7D%0A"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20=%20(%5Csigma,%20%5CSigma_b)"> is the collection of all<sup>16</sup> non-Gaussian parameters in the model. Later, we will assume<sup>17</sup> that <img src="https://latex.codecogs.com/png.latex?%5CSigma_b"> has quite a lot of structure.</p>
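<p>To make the smooshing concrete, here is a minimal <code>scipy.sparse</code> sketch (hypothetical dimensions, one random effect per observation, and scalar stand-ins <code>tau_beta</code> and <code>tau_b</code> for the prior precisions of <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> and <img src="https://latex.codecogs.com/png.latex?b">):</p>

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(42)
n, p, k = 100, 3, 10   # observations, fixed effects, random effect levels

# Fixed-effects design matrix X, and a random-effects matrix Z that
# assigns each observation to one of k groups (one 1 per row).
X = rng.normal(size=(n, p))
groups = rng.integers(0, k, size=n)
Z = sp.csc_matrix((np.ones(n), (np.arange(n), groups)), shape=(n, k))

# Smoosh them together: A = [X Z], acting on u = (beta, b).
A = sp.hstack([sp.csc_matrix(X), Z], format="csc")

# Block-diagonal prior precision for u: one block for the regression
# coefficients, one for the random effects (scalars purely for illustration).
tau_beta, tau_b = 0.01, 2.0
Q = sp.block_diag([tau_beta * sp.eye(p), tau_b * sp.eye(k)], format="csc")
```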
<p>This is a <em>very</em> generic model. It happily contains things like</p>
<ul>
<li>Linear regression!</li>
<li>Linear regression with horseshoe priors!</li>
<li>Linear mixed effects models!</li>
<li>Linear regression with splines (smoothing or basis)!</li>
<li>Spatial models like <a href="https://arxiv.org/abs/1601.01180">ICARs, BYMs</a>, etc etc etc</li>
<li>Gaussian processes (with the caveat that we’re mostly focussing on those that can be formulated via precision matrices rather than covariance matrices. <a href="https://dansblog.netlify.app/posts/2021-11-24-getting-into-the-subspace/">A whole blog post, I have.</a>)</li>
<li>Any combination of these things!</li>
</ul>
<p>So if I manage to get this implemented efficiently, all of these models will become efficient too. All it will cost is a truly shithouse<sup>18</sup> interface.</p>
<p>The only downside of this degree of flexibility compared to just implementing a straight linear mixed model with <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Z"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> and <img src="https://latex.codecogs.com/png.latex?b"> all living separately is that there are a couple of tricks<sup>19</sup> to improve numerical stability that we can’t use.</p>
</section>
<section id="lets-get-the-posterior" class="level2">
<h2 class="anchored" data-anchor-id="lets-get-the-posterior">Let’s get the posterior!</h2>
<p>The nice thing about this model is that it is a normal likelihood with a normal prior, so we can directly compute two key quantities:</p>
<ul>
<li><p>The “full conditional” distribution <img src="https://latex.codecogs.com/png.latex?p(u%20%5Cmid%20y,%20%5Ctheta)">, which is useful for getting posterior information about <img src="https://latex.codecogs.com/png.latex?b"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, and</p></li>
<li><p>The marginal posterior <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta%20%5Cmid%20y)">.</p></li>
</ul>
<p>This means that we do not need to do MCMC on the joint space <img src="https://latex.codecogs.com/png.latex?(u,%20%5Ctheta)">! We can instead write a model to draw samples from <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta%20%5Cmid%20y)">, which is much lower-dimensional and easier<sup>20</sup> to sample from, and then compute the joint posterior by sampling from the full conditional.</p>
<p>I talked a little about the mechanics of this in a <a href="https://dansblog.netlify.app/posts/2021-10-14-priors2/">previous blog post about conjugate priors</a>, but let’s do the derivations. Why? Because they’re not too hard and it’s useful to have them written out somewhere.</p>
<section id="the-full-conditional" class="level3">
<h3 class="anchored" data-anchor-id="the-full-conditional">The full conditional</h3>
<p>First we need to compute <img src="https://latex.codecogs.com/png.latex?p(u%20%5Cmid%20y%20,%20%5Ctheta)">. The first thing that we note is that conditional distributions are always proportional to the joint distribution (we’re literally just pretending some things are constant), so we get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(u%20%5Cmid%20y%20,%20%5Ctheta)%20&amp;%5Cpropto%20p(y%20%5Cmid%20u,%20%5Ctheta)%20p(u%20%5Cmid%20%5Ctheta)%20p(%5Ctheta)%20%5C%5C%0A&amp;%5Cpropto%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20(y%20-%20Au)%5ETW%5E%7B-1%7D(y-Au)%5Cright%5D%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7Du%5ETQ(%5Ctheta)u%5Cright%5D.%0A%5Cend%7Balign*%7D"></p>
<p>Now we just need to expand things out and work out what the mean and the precision matrix of <img src="https://latex.codecogs.com/png.latex?p(u%20%5Cmid%20y,%20%5Ctheta%20)"> (which is Gaussian by conjugacy!) are.</p>
<p>Computing posterior distributions by hand is a dying<sup>21</sup> art. So my best and only advice to you: don’t be a hero. Just pattern match like the rest of us. To do this, we need to know what the density of a multivariate normal distribution looks like <em>deep</em> down in its soul.</p>
<p>Behold: the ugly <code>div</code> box!<sup>22</sup></p>
<div class="note">
<p>If <img src="https://latex.codecogs.com/png.latex?u%20%5Csim%20N(m,%20P%5E%7B-1%7D)">, then <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(u)%20&amp;%5Cpropto%20%5Cexp%5Cleft%5B-%20%5Cfrac%7B1%7D%7B2%7D(u%20-%20m)%5ETP(u-m)%5Cright%5D%20%5C%5C%0A&amp;%5Cpropto%20%5Cexp%5Cleft%5B-%20%5Cfrac%7B1%7D%7B2%7Du%5ETPu%20+%20m%5ETPu%5Cright%5D,%0A%5Cend%7Balign*%7D"> where I just dropped all of the terms that didn’t involve <img src="https://latex.codecogs.com/png.latex?u">.</p>
</div>
<p>This means the plan is to</p>
<ol type="1">
<li>Expand out the quadratics in the exponential term so we get something that looks like <img src="https://latex.codecogs.com/png.latex?%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7Du%5ETPu%20+%20z%5ETu%5Cright%5D"></li>
<li>The matrix <img src="https://latex.codecogs.com/png.latex?P"> will be the precision matrix of <img src="https://latex.codecogs.com/png.latex?u%20%5Cmid%20y,%20%5Ctheta">.</li>
<li>The mean of <img src="https://latex.codecogs.com/png.latex?u%20%5Cmid%20y,%20%5Ctheta"> is <img src="https://latex.codecogs.com/png.latex?P%5E%7B-1%7Dz">.</li>
</ol>
<p>So let’s do it!</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(u%20%5Cmid%20y%20,%20%5Ctheta)%20&amp;%5Cpropto%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20u%5ETA%5ETW%5E%7B-1%7DAu%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D(A%5ETW%5E%7B-1%7Dy)%5ETu%5Cright%5D%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7Du%5ETQ(%5Ctheta)u%5Cright%5D%20%5C%5C%0A&amp;%5Cpropto%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7Du%5ET%5Cleft(Q%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7DA%5ETW%5E%7B-1%7DA%5Cright)u%20+%20%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D(A%5ETW%5E%7B-1%7Dy)%5ETu%5Cright%5D.%0A%5Cend%7Balign*%7D"></p>
<p>This means that <img src="https://latex.codecogs.com/png.latex?p(u%20%5Cmid%20y%20,%5Ctheta)"> is multivariate normal with</p>
<ul>
<li><p>precision matrix <img src="https://latex.codecogs.com/png.latex?Q_%7Bu%5Cmid%20y,%5Ctheta%7D(%5Ctheta)%20=%20%5Cleft(Q(%5Ctheta)%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7DA%5ETW%5E%7B-1%7DA%5Cright)"> and</p></li>
<li><p>mean<sup>23</sup> <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bu%5Cmid%20y,%5Ctheta%7D(%5Ctheta)%20=%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20Q_%7Bu%5Cmid%20y,%5Ctheta%7D(%5Ctheta)%5E%7B-1%7D%20A%5ETW%5E%7B-1%7Dy">.</p></li>
</ul>
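<p>These two formulas drop straight into sparse linear algebra. A minimal sketch (toy dimensions and a random <code>A</code>, with <img src="https://latex.codecogs.com/png.latex?W"> taken to be the identity and an identity stand-in for <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> — everything here is purely illustrative):</p>

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(7)
n, m = 200, 8        # observations, latent dimension (toy sizes)

A = sp.random(n, m, density=0.3, random_state=7, format="csc")
Q = sp.eye(m, format="csc")      # stand-in for the prior precision Q(theta)
W_inv = sp.eye(n, format="csc")  # W is known and diagonal; identity here
sigma = 0.5
y = rng.normal(size=n)

# Full-conditional precision: Q + sigma^{-2} A^T W^{-1} A
Q_post = (Q + (A.T @ W_inv @ A) / sigma**2).tocsc()

# Full-conditional mean: sigma^{-2} Q_post^{-1} A^T W^{-1} y
mu_post = spsolve(Q_post, (A.T @ (W_inv @ y)) / sigma**2)
```

<p>In practice you would factorise <code>Q_post</code> once with a sparse Cholesky and reuse the factor for both the mean solve and the simulation step, rather than calling <code>spsolve</code> afresh.</p>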
<p>This means if I build an MCMC scheme to give me <img src="https://latex.codecogs.com/png.latex?B"> samples <img src="https://latex.codecogs.com/png.latex?%5Ctheta_b%20%5Csim%20p(%5Ctheta%20%5Cmid%20y)">, <img src="https://latex.codecogs.com/png.latex?b%20=%201,%20%5Cldots,%20B">, then I can turn them into <img src="https://latex.codecogs.com/png.latex?B"> samples <img src="https://latex.codecogs.com/png.latex?(%5Ctheta_b,%20u_b)"> from <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta,%20u%20%5Cmid%20y)"> by doing the following.</p>
<div class="note">
<p>For <img src="https://latex.codecogs.com/png.latex?b%20=%201,%20%5Cldots,%20B"></p>
<ul>
<li><p>Simulate <img src="https://latex.codecogs.com/png.latex?u_b%20%5Csim%20N%5Cleft(%5Cmu_%7Bu%5Cmid%20y,%5Ctheta%7D(%5Ctheta_b),%20Q_%7Bu%5Cmid%20y,%5Ctheta%7D(%5Ctheta_b)%5E%7B-1%7D%5Cright)"></p></li>
<li><p>Store the pair <img src="https://latex.codecogs.com/png.latex?(%5Ctheta_b,%20u_b)"></p></li>
</ul>
</div>
<p>Easy<sup>24</sup> as!</p>
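<p>The simulation step is usually done through the Cholesky factor of the <em>precision</em> matrix: if <img src="https://latex.codecogs.com/png.latex?Q%20=%20LL%5ET"> and <img src="https://latex.codecogs.com/png.latex?z%20%5Csim%20N(0,%20I)">, then <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Cmu%20+%20L%5E%7B-T%7Dz"> has covariance <img src="https://latex.codecogs.com/png.latex?L%5E%7B-T%7DL%5E%7B-1%7D%20=%20Q%5E%7B-1%7D">, exactly as required. A dense toy sketch (made-up numbers; a real implementation would use a sparse Cholesky):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy full-conditional parameters (hypothetical small example):
Q_post = np.array([[2.0, 1.0, 0.0],
                   [1.0, 2.0, 1.0],
                   [0.0, 1.0, 2.0]])   # posterior precision (SPD)
mu_post = np.array([0.5, -1.0, 2.0])   # posterior mean

L = np.linalg.cholesky(Q_post)         # Q_post = L @ L.T, L lower triangular

def sample(n_draws):
    # u = mu + L^{-T} z, one draw per column of z
    z = rng.standard_normal((mu_post.size, n_draws))
    v = np.linalg.solve(L.T, z)        # back-substitution with L^T
    return (mu_post[:, None] + v).T

draws = sample(100_000)
```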
</section>
<section id="writing-down-ptheta-mid-y" class="level3">
<h3 class="anchored" data-anchor-id="writing-down-ptheta-mid-y">Writing down <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta%20%5Cmid%20y)"></h3>
<p>So now we just<sup>25</sup> have to get the marginal posterior for the non-Gaussian parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. We only need it up to a constant of proportionality, so we can express the joint probability <img src="https://latex.codecogs.com/png.latex?p(y,%20u,%20%5Ctheta)"> in two equivalent ways to get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(y,%20u%20,%20%5Ctheta)%20&amp;=%20p(y,%20u,%20%5Ctheta)%20%5C%5C%0Ap(u%20%5Cmid%20%5Ctheta,%20y)%20p(%5Ctheta%20%5Cmid%20y)%20p(y)%20&amp;=%20p(y%20%5Cmid%20u,%20%5Ctheta)%20p(u%20%5Cmid%20%5Ctheta)p(%5Ctheta).%20%5C%5C%0A%5Cend%7Balign*%7D"></p>
<p>Rearranging, we get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(%5Ctheta%20%5Cmid%20y)%20&amp;=%20%5Cfrac%7Bp(y%20%5Cmid%20u,%20%5Ctheta)%20p(u%20%5Cmid%20%5Ctheta)p(%5Ctheta)%7D%7Bp(u%20%5Cmid%20%5Ctheta,%20y)p(y)%7D%20%5C%5C%0A&amp;%5Cpropto%20%5Cfrac%7Bp(y%20%5Cmid%20u,%20%5Ctheta)%20p(u%20%5Cmid%20%5Ctheta)p(%5Ctheta)%7D%7Bp(u%20%5Cmid%20%5Ctheta,%20y)%7D.%0A%5Cend%7Balign*%7D"></p>
<p>This is a very nice relationship between the functional forms of the various densities we happen to know and the density we are trying to compute. This means that if you have access to the full conditional distribution<sup>26</sup> for <img src="https://latex.codecogs.com/png.latex?u"> you can marginalise <img src="https://latex.codecogs.com/png.latex?u"> out. No weird integrals required.</p>
<p>But there’s one oddity: there is a <img src="https://latex.codecogs.com/png.latex?u"> on the right hand side, but no <img src="https://latex.codecogs.com/png.latex?u"> on the left hand side. What we have actually found is a whole continuum of functions that are proportional to <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta%20%5Cmid%20y)">. It truly does not matter which one we choose.</p>
<p>But some choices make the algebra slightly nicer. (And remember, I’m gonna have to implement this later, so I should probably keep an eye on that.)</p>
<p>A good<sup>27</sup> generic choice is <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)">.</p>
<p>The algebra here can be a bit tricky<sup>28</sup>, so let’s write out each function evaluated at <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)">.</p>
<p>The bit from the likelihood is <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(y%20%5Cmid%20u%20=%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta),%20%5Ctheta)%20&amp;%5Cpropto%20%5Csigma%5E%7B-n%7D%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D(y%20-%20A%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta))%5ETW%5E%7B-1%7D(y-%20%20A%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta))%5Cright%5D%5C%5C%0A&amp;%5Cpropto%20%5Csigma%5E%7B-n%7D%5Cexp%5Cleft%5B%5Cfrac%7B-1%7D%7B2%5Csigma%5E2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ETA%5ETW%5E%7B-1%7DA%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20y%5ET%20W%5E%7B-1%7DA%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5Cright%5D,%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?n"> is the length of <img src="https://latex.codecogs.com/png.latex?y">.</p>
<p>The bit from the prior on <img src="https://latex.codecogs.com/png.latex?u"> is <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20%5Cmid%20%5Ctheta%20)%0A%5Cpropto%20%7CQ(%5Ctheta)%7C%5E%7B1/2%7D%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ETQ(%5Ctheta)%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5Cright%5D.%0A%5Cend%7Balign*%7D"></p>
<p>Finally, we get that the denominator is <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20%5Cmid%20y,%20%5Ctheta)%20%5Cpropto%20%7CQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C%5E%7B1/2%7D%0A"> as the exponential term<sup>29</sup> cancels!</p>
<p>Ok. Let’s finish this. (Incidentally, if you’re wondering why Bayesians love MCMC, this is why.)</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(%5Ctheta%20%5Cmid%20y)%20&amp;%5Cpropto%20p(%5Ctheta)%20%5Cfrac%7B%7CQ(%5Ctheta)%7C%5E%7B1/2%7D%7D%7B%7CQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C%5E%7B1/2%7D%7D%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ET(Q(%5Ctheta)%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7DA%5ETW%5E%7B-1%7DA)%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20y%5ET%20W%5E%7B-1%7DA%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5Cright%5D%20%5C%5C%0A&amp;=%20%20p(%5Ctheta)%20%5Cfrac%7B%7CQ(%5Ctheta)%7C%5E%7B1/2%7D%7D%7B%7CQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C%5E%7B1/2%7D%7D%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ETQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20y%5ET%20W%5E%7B-1%7DA%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5Cright%5D.%0A%5Cend%7Balign*%7D"></p>
<p>We can now use the fact that <img src="https://latex.codecogs.com/png.latex?Q_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20=%20A%5ETW%5E%7B-1%7Dy"> to get</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0Ap(%5Ctheta%20%5Cmid%20y)%20&amp;%5Cpropto%20p(%5Ctheta)%20%5Cfrac%7B%7CQ(%5Ctheta)%7C%5E%7B1/2%7D%7D%7B%7CQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C%5E%7B1/2%7D%7D%20%5Cexp%5Cleft%5B-%5Cfrac%7B1%7D%7B2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ETA%5ETW%5E%7B-1%7Dy%20+%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20y%5ET%20W%5E%7B-1%7DA%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5Cright%5D%20%5C%5C%0A&amp;=%20p(%5Ctheta)%5Cfrac%7B%7CQ(%5Ctheta)%7C%5E%7B1/2%7D%7D%7B%7CQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C%5E%7B1/2%7D%7D%20%5Cexp%5Cleft%5B%5Cfrac%7B1%7D%7B2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ETA%5ETW%5E%7B-1%7Dy%20%5Cright%5D%20.%0A%5Cend%7Balign*%7D"></p>
<p>For those who just love a log-density, this is <img src="https://latex.codecogs.com/png.latex?%0A%5Clog(p(%5Ctheta%20%5Cmid%20y))%20=%20%5Clog(p(%5Ctheta))%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%5ETA%5ETW%5E%7B-1%7Dy%20+%5Cfrac%7B1%7D%7B2%7D%20%5Clog(%7CQ(%5Ctheta)%7C)%20-%20%5Cfrac%7B1%7D%7B2%7D%5Clog(%7CQ_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C)%20+%20%5Ctext%7Bconst%7D.%0A"> A fairly simple expression<sup>30</sup> for all of that work.</p>
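<p>If you want the whole evaluation as code, here is a hedged dense-matrix sketch of my own (sparse versions of these operations are the whole point of what follows; the function returns the log-density up to the log-prior and additive constants, with the <code>1/sigma^2</code> bookkeeping done one consistent way):</p>

```python
import numpy as np
from scipy.linalg import cho_solve

def log_marginal_posterior(Q_prior, A, W_inv_diag, y, sigma2):
    # Dense stand-in for the sparse pipeline. Returns log p(theta | y)
    # up to log p(theta) and theta-independent constants. W is assumed
    # diagonal and is passed as the diagonal of its inverse.
    AtWinv = A.T * W_inv_diag                   # A^T W^{-1}
    b = AtWinv @ y / sigma2
    Q_cond = Q_prior + (AtWinv @ A) / sigma2    # Q_{u|y,theta}
    L_prior = np.linalg.cholesky(Q_prior)       # gives log|Q(theta)|
    L_cond = np.linalg.cholesky(Q_cond)         # shared by mean and logdet
    mu = cho_solve((L_cond, True), b)           # mu_{u|y,theta}
    logdet_prior = 2.0 * np.log(np.diag(L_prior)).sum()
    logdet_cond = 2.0 * np.log(np.diag(L_cond)).sum()
    return 0.5 * mu @ b + 0.5 * logdet_prior - 0.5 * logdet_cond
```

<p>Note that a single Cholesky factor of the conditional precision serves both the mean solve and its log-determinant, which is exactly the structure the sparse implementation will need to preserve.</p>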
</section>
</section>
<section id="so-why-isnt-this-just-a-gaussian-process" class="level2">
<h2 class="anchored" data-anchor-id="so-why-isnt-this-just-a-gaussian-process">So why isn’t this just a Gaussian process?</h2>
<p>These days, people<sup>31</sup> are more than passingly familiar<sup>32</sup> with Gaussian processes. And so they’re quite possibly wondering why this isn’t all just an extremely inconvenient way to do the exact same computations you do with a GP.</p>
<p>Let me tell you. It is <em>all</em> about <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> and <img src="https://latex.codecogs.com/png.latex?A">.</p>
<p>The prior precision matrix <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> is typically block diagonal. This special structure makes it pretty easy to compute the <img src="https://latex.codecogs.com/png.latex?%7CQ(%5Ctheta)%7C"> term<sup>33</sup>. But, of course, there’s more going on here.</p>
<p>In linear mixed effects models, the blocks on the diagonal are typically fairly small (their size is controlled by the number of levels in the variable you’re stratifying by). Moreover, the matrices on the diagonal of <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> are the inverses of either diagonal or block diagonal matrices that themselves have quite small blocks<sup>34</sup>.</p>
<p>In models that have more structured random effects<sup>35</sup>, the diagonal blocks of <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)"> can get quite large<sup>36</sup>. Moreover, the matrices on these blocks are usually not block diagonal.</p>
<p>Thankfully, these prior precision matrices do have something going for them: most of their entries are zero. We refer to these types of matrices as <em>sparse matrices</em>. There are some marvelous algorithms for factorising sparse matrices that are usually a lot more efficient<sup>37</sup> than algorithms for dense matrices.</p>
<p>Moreover, the formulation here decouples the dimension of the latent Gaussian component from the number of observations. The data only enters the posterior through the reduction <img src="https://latex.codecogs.com/png.latex?A%5ETy">, so if the number of observations is much larger than the number of latent variables<sup>38</sup> and <img src="https://latex.codecogs.com/png.latex?A"> is sparse<sup>39</sup>, the operation scales <em>linearly</em> in the number of observations (and superlinearly<sup>40</sup> in the column-dimension of <img src="https://latex.codecogs.com/png.latex?A">, that is, in the number of latent variables).</p>
<p>So the prior precision<sup>41</sup> is a sparse matrix. What about the precision matrix of <img src="https://latex.codecogs.com/png.latex?%5Bu%20%5Cmid%20y,%20%5Ctheta%5D">?</p>
<p>It is also sparse! Recall that <img src="https://latex.codecogs.com/png.latex?A%20=%20%5BZ%20%5Cvdots%20X%5D">. This means that <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B%5Csigma%5E2%7DA%5ETW%5E%7B-1%7DA%20=%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%5Cbegin%7Bpmatrix%7D%20Z%5ET%20W%5E%7B-1%7DZ%20&amp;%20Z%5ET%20W%5E%7B-1%7DX%20%5C%5C%20X%5ET%20W%5E%7B-1%7D%20Z%20&amp;%20X%5ETW%5E%7B-1%7DX%20%5Cend%7Bpmatrix%7D.%0A"> <img src="https://latex.codecogs.com/png.latex?Z"> is a matrix that links the stacked vector of random effects <img src="https://latex.codecogs.com/png.latex?b"> to each observation. Typically, the likelihood <img src="https://latex.codecogs.com/png.latex?p(y_i%20%5Cmid%20%5Ctheta)"> will only depend on a small number of entries of <img src="https://latex.codecogs.com/png.latex?b">, which suggests that most elements in each row of <img src="https://latex.codecogs.com/png.latex?Z"> will be zero. This, in turn, implies that <img src="https://latex.codecogs.com/png.latex?Z"> is sparse and so is<sup>42</sup> <img src="https://latex.codecogs.com/png.latex?Z%5ETW%5E%7B-1%7DZ">.</p>
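<p>That sparsity claim is easy to check on a toy example (the design here is hypothetical: a single random intercept with made-up sizes):</p>

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_obs, n_groups = 1000, 50
# Hypothetical random-intercept design: each observation belongs to
# exactly one group, so each row of Z has a single non-zero entry.
group = rng.integers(0, n_groups, size=n_obs)
Z = sp.csr_matrix((np.ones(n_obs), (np.arange(n_obs), group)),
                  shape=(n_obs, n_groups))
W_inv = sp.identity(n_obs, format="csr")   # W is diagonal (here: identity)
ZtWZ = (Z.T @ W_inv @ Z).tocsr()
# The groups partition the observations, so Z^T W^{-1} Z is diagonal:
# at most n_groups stored entries rather than n_groups^2.
```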
<p>On the other hand, the other three blocks are usually<sup>43</sup> fully dense. Thankfully, though, the usual situation is that <img src="https://latex.codecogs.com/png.latex?b"> has <em>far</em> more elements than <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, which means that <img src="https://latex.codecogs.com/png.latex?A%5ETW%5E%7B-1%7DA"> is still sparse and we can still use our special algorithms<sup>44</sup>.</p>
<p>All of this suggests that, under usual operating conditions, <img src="https://latex.codecogs.com/png.latex?Q_%7Bu%5Cmid%20y,%20%5Ctheta%7D"> is <em>also</em> a sparse matrix.</p>
<p>And that’s <em>great</em> because that means that we can compute the log-posterior using only 3 main operations:</p>
<ol type="1">
<li><p>Computing <img src="https://latex.codecogs.com/png.latex?%5Clog(%7CQ(%5Ctheta)%7C)">. This matrix is block diagonal so you can just multiply together the determinants<sup>45</sup> of the diagonal blocks, which are relatively cheap to compute.</p></li>
<li><p>Computing <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bu%20%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)">. This requires solving the sparse linear system <img src="https://latex.codecogs.com/png.latex?Q_%7Bu%20%5Cmid%20y,%20%5Ctheta%7D%20%5Cmu_%7Bu%20%5Cmid%20y,%20%5Ctheta%7D%20=%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7DA%5ETW%5E%7B-1%7Dy">. This is going to rely on some fancy pants sparse matrix algorithm.</p></li>
<li><p>Computing <img src="https://latex.codecogs.com/png.latex?%5Clog(%7CQ_%7Bu%20%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%7C)">. This is, thankfully, a by-product of the things we need to compute to solve the linear system in the previous task.</p></li>
</ol>
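<p>The second and third of those operations share a factorisation, which is the whole trick. As a sketch (a sparse LU via scipy standing in for the sparse Cholesky the post has in mind, e.g. CHOLMOD via scikit-sparse; <code>Equil=False</code> keeps the diagonal of <code>U</code> directly interpretable for the determinant):</p>

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def solve_and_logdet(Q, b):
    # One sparse factorisation gives both the solve and log|Q|: for a
    # symmetric positive definite Q, |det Q| is the product of the
    # absolute values of the U diagonal.
    lu = splu(sp.csc_matrix(Q), options=dict(Equil=False))
    logdet = np.sum(np.log(np.abs(lu.U.diagonal())))
    return lu.solve(b), logdet

# A small AR(1)-style tridiagonal precision matrix as a test case:
n, rho = 200, 0.6
main = np.full(n, 1.0 + rho**2)
main[0] = main[-1] = 1.0
Q = sp.diags([-rho * np.ones(n - 1), main, -rho * np.ones(n - 1)],
             offsets=[-1, 0, 1])
mu, logdet = solve_and_logdet(Q, np.ones(n))
```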
</section>
<section id="what-i-what-i-what-i-gotta-do-what-i-gotta-do-to-get-this-model-in-pymc" class="level2">
<h2 class="anchored" data-anchor-id="what-i-what-i-what-i-gotta-do-what-i-gotta-do-to-get-this-model-in-pymc">What I? What I? What I gotta do? <a href="https://www.youtube.com/watch?v=fqTSaMR75ns">What I gotta do to get this model in PyMC?</a></h2>
<p>So this is where shit gets real.</p>
<p>Essentially, I want to implement a new distribution in PyMC that will take appropriate inputs and output the log-density and its gradient. There are two ways to do this:</p>
<ul>
<li>Panic</li>
<li>Pray</li>
</ul>
<p>For the first option, you write a C++<sup>46</sup> backend and register it as an Aesara node. This is how, for example, differential equation solvers migrated into PyMC.</p>
<p>For the second option, which is going to be our goal, we light our Sinead O’Connor votive candle and program up the model using JAX. JAX is a glorious feat of engineering that makes Python code compilable and autodiff-able. In a lot of cases, it seamlessly lets you shift from CPUs to GPUs and is all around quite cool.</p>
<p>It also has approximately zero useful sparse matrix support. (It will let you do <em>very</em> basic things<sup>47</sup> but nothing as complicated as we are going to need.)</p>
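<p>To be concrete about “very basic things”, this is roughly the current state of play with <code>jax.experimental.sparse</code>:</p>

```python
import jax.numpy as jnp
from jax.experimental import sparse

# JAX can store a matrix in BCOO format and multiply with it
# (and jit/vmap/grad through the product)...
A = sparse.BCOO.fromdense(jnp.array([[2.0, 0.0, 0.0],
                                     [0.0, 3.0, 1.0],
                                     [0.0, 0.0, 4.0]]))
y = A @ jnp.ones(3)
# ...but there is no sparse Cholesky (or any sparse factorisation),
# which is exactly the operation this whole scheme lives and dies on.
```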
<p>So why am I taking this route? Well firstly I’m curious to see how well it works. So I am going to write JAX code to do all of my sparse matrix operations and see how efficiently it autodiffs it.</p>
<p>Now I’m going to pre-register my expectations. I expect it to be a little bit shit. Or, at least, I expect to be able to make it do better.</p>
<p>The problem is that computing a gradient requires a single reverse-mode<sup>48</sup> autodiff sweep. This does not seem like a problem until you look at how this sort of thing needs to be implemented and you realise that every gradient call is going to need to generate <em>and store</em> the entire damn autodiff tree for the log-density evaluation. And that autodiff tree is going to be <em>large</em>. So I am expecting the memory scaling on this to be truly shite.</p>
<p>Thankfully there are two ways to fix this. One of them is to implement a custom <em>Jacobian-vector product</em><sup>49</sup> and register it with JAX so it knows <em>most</em> of how to do the derivative. The other way is to implement this shit in C++ and register it as a JAX primitive. And to be honest I’m very tempted. But that is not where I am starting.</p>
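<p>In miniature, the custom-derivative escape hatch looks like this (a toy dense <code>logdet</code> standing in for the sparse one; the identity <code>d log|Q| = tr(Q^{-1} dQ)</code> does the work, and a real sparse version would reuse the Cholesky factor rather than forming an inverse):</p>

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def logdet(Q):
    # Primal computation; JAX will never trace through its internals
    # when differentiating.
    return jnp.linalg.slogdet(Q)[1]

def logdet_fwd(Q):
    return logdet(Q), Q             # save Q as the residual

def logdet_bwd(Q, ct):
    # For symmetric positive definite Q the vjp is just ct * Q^{-1};
    # the sparse version would do triangular solves here instead.
    return (ct * jnp.linalg.inv(Q).T,)

logdet.defvjp(logdet_fwd, logdet_bwd)
```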
<p>The other problem is going to be exposing this to users. The internal interface is going to be an absolute shit to use. So we are gonna have to get our Def Leppard on and sprinkle some syntactical sugar all over it.</p>
<p>I’m honestly less concerned about this challenge. It’s important but I am not expecting to produce anything good enough to put into PyMC (or any other package). But I do think it’s a good idea to keep this sort of question in mind: it can help you make cleaner, more useful code.</p>
<section id="what-comes-next" class="level3">
<h3 class="anchored" data-anchor-id="what-comes-next">What comes next?</h3>
<p>Well you will not get a solution today. This blog post is more than long enough.</p>
<p>My plan is to do three things (plus a maybe).</p>
<ol type="1">
<li><p>Implement the relevant sparse matrix solver in a JAX-able form. (This is mostly gonna be me trying to remember how to do something I haven’t done in a very long time.)</p></li>
<li><p>Bind<sup>50</sup> the (probably) inefficient version into PyMC to see how that process works.</p></li>
<li><p>Try the custom <code>jvp</code> and <code>vjp</code> interfaces in JAX to see if they speed things up relative to just autodiffing through my for loops.</p></li>
<li><p>(Maybe) Look into whether hand-rolling some C++ is worth the effort.</p></li>
</ol>
<p>Will I get all of this done? I mean, I’m skeptical. But hey. If I do it’ll be nice.</p>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>aka linear multilevel models↩︎</p></li>
<li id="fn2"><p>Popular in epidemiology↩︎</p></li>
<li id="fn3"><p>INLA = Laplace approximations + sparse linear algebra to do fast, fairly scalable, and accurate Bayesian inference on a variety of Bayesian models. It’s particularly good at things like spatial models.↩︎</p></li>
<li id="fn4"><p>In its guts, Stan is a fully templated C++ autodiff library, so I would need to add specific sparse matrix support. And then there’d be some truly gross stuff with the Stan language and its existing types. And so on and so on and honestly it just broke my damn brain. So I started a few times but never finished.↩︎</p></li>
<li id="fn5"><p>I just don’t ever use it. I semi-regularly read and debug other people’s code, but I don’t typically write very much myself. I use R because that’s what my job needs me to use. So a shadow aim here is to just put some time into my Python. By the end of this I’ll be like Britney doing I’m a Slave 4 U.↩︎</p></li>
<li id="fn6"><p>Or maybe more, but let’s not be too ambitious.↩︎</p></li>
<li id="fn7"><p>I’m pretty sure they will.↩︎</p></li>
<li id="fn8"><p>My sparse matrix data structures are <em>rusty</em> as fuck.↩︎</p></li>
<li id="fn9"><p>and the intercept if it’s needed↩︎</p></li>
<li id="fn10"><p>Really this costs me nothing and can be useful with multiple observations.↩︎</p></li>
<li id="fn11"><p>Default options include the identity matrix or some multiple of the identity matrix.↩︎</p></li>
<li id="fn12"><p>REML heads don’t dismay. You can do all kinds of weird shit by choosing some of these matrices in certain ways. I’m not gonna stop you. I love and support you. Good vibes only.↩︎</p></li>
<li id="fn13"><p>The priors on <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> and <img src="https://latex.codecogs.com/png.latex?b"> are independent Gaussian so it has to be.↩︎</p></li>
<li id="fn14"><p>homosexual↩︎</p></li>
<li id="fn15"><p>Inverse correlation matrix↩︎</p></li>
<li id="fn16"><p>excluding the fixed ones, like <img src="https://latex.codecogs.com/png.latex?W"> and <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?R">. ↩︎</p></li>
<li id="fn17"><p>Such a dirty word. For all of the models we care about, this is block diagonal. So this assumption is our restriction to a specific class of models.↩︎</p></li>
<li id="fn18"><p>I would suggest a lot of syntactic sugar if you were ever going to expose this stuff to users.↩︎</p></li>
<li id="fn19"><p>See the Bates <em>et al.</em> paper. Their formulation is fabulous but doesn’t extend nicely to the situations I care about! Basically they optimise for the situation where <img src="https://latex.codecogs.com/png.latex?%5CSigma_b"> can be singular, which is an issue when you’re doing optimisation. But I’m not doing optimisation and I care about the case where the precision matrix is defined as a singular matrix (and therefore <img src="https://latex.codecogs.com/png.latex?%5CSigma_b"> does not exist). This seems like a truly wild idea, but it occurs quite naturally in many important models like smoothing splines and ICAR models (which are extremely popular in spatial epidemiology).↩︎</p></li>
<li id="fn20"><p>It’s easier in two ways. Firstly, MCMC likes lower-dimensional targets. They are typically easier to sample from! Secondly, the posterior geometry of <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta%20%5Cmid%20y)"> is usually pretty simple, while the joint posterior <img src="https://latex.codecogs.com/png.latex?p(%5Ctheta,%20u%20%5Cmid%20y)"> has an annoying tendency to have a funnel in it, which forces us to do all kinds of annoying reparameterisation tricks to stop the sampler from shitting the bed.↩︎</p></li>
<li id="fn21"><p>Computers!↩︎</p></li>
<li id="fn22"><p>CSS is my passion.↩︎</p></li>
<li id="fn23"><p>It’s possible to rearrange things to lose that <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D">, which I admit looks a bit weird. It cancels out down the line.↩︎</p></li>
<li id="fn24"><p>I have, historically, not had the greatest grip on whether or not things are easy.↩︎</p></li>
<li id="fn25"><p>See previous footnote.↩︎</p></li>
<li id="fn26"><p>Or a good approximation to it. Laplace approximations work very well for this, and they extend everything we’re doing here from a linear mixed-ish model to a generalised linear mixed-ish model.↩︎</p></li>
<li id="fn27"><p>This is actually a bit dangerous on the face of it because it depends on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. You can convince yourself it’s ok. Choosing <img src="https://latex.codecogs.com/png.latex?u=0"> is less stress inducing, but I wanted to bring out the parallel to using a Laplace approximation to <img src="https://latex.codecogs.com/png.latex?p(u%20%5Cmid%20%5Ctheta,%20y)">, in which case we really want to evaluate the ratio at the point where the approximation is the best (aka the conditional mean).↩︎</p></li>
<li id="fn28"><p>A common mistake is to forget the parameter dependent proportionality constants from the normal distribution. You didn’t need them before because you were conditioning on <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> so they were all constant. But now <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is unknown and if we forget them an angel will cry.↩︎</p></li>
<li id="fn29"><p>Honest footnote: This started as <img src="https://latex.codecogs.com/png.latex?p(%5Cmu_%7Bu%5Cmid%20y,%20%5Ctheta%7D(%5Ctheta)%20%5Cmid%20y,%20%5Ctheta)%20%5Cpropto%201"> because I don’t read my own warnings.↩︎</p></li>
<li id="fn30"><p>The brave or foolish amongst you might want to convince yourselves that this collapses to <em>exactly</em> the marginal likelihood we would’ve gotten from Rasmussen and Williams had we made a sequence of different life choices. In particular if <img src="https://latex.codecogs.com/png.latex?A%20=%20I"> and <img src="https://latex.codecogs.com/png.latex?Q(%5Ctheta)%20=%20%5CSigma(%5Ctheta)%5E%7B-1%7D">.↩︎</p></li>
<li id="fn31"><p>Or, at least, people who have made it this far into the post.↩︎</p></li>
<li id="fn32"><p>You like GPs bro? <a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/">Give me a sequence of increasingly abstract definitions.</a> I’m waiting.↩︎</p></li>
<li id="fn33"><p>Multiply the determinants of the matrices along the diagonal.↩︎</p></li>
<li id="fn34"><p>Look at the Bates et al paper. Specifically section 2.2. <code>lme4</code> is a really clever thing.↩︎</p></li>
<li id="fn35"><p>examples: smoothing splines, AR(p) models, areal spatial models, <a href="https://dansblog.netlify.app/posts/2021-11-24-getting-into-the-subspace/">some Gaussian processes if you’re careful</a>↩︎</p></li>
<li id="fn36"><p><img src="https://latex.codecogs.com/png.latex?10%5E4">–<img src="https://latex.codecogs.com/png.latex?10%5E6"> is not unheard of↩︎</p></li>
<li id="fn37"><p>A dense matrix factorisation of an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20n"> matrix costs <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E3)">. The same factorisation of a sparse matrix can cost as little as <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n)"> if you’re very lucky. More typically it clocks in a <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E%7B1.5%7D)">–<img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E%7B2%7D)">, which is still a substantial saving!↩︎</p></li>
<li id="fn38"><p>This happens for a lot of designs, or when a basis spline or a Markovian Gaussian process is being used↩︎</p></li>
<li id="fn39"><p>This happens a lot, but not always. For instance subset-of-regressors/predictive process-type models have a dense <img src="https://latex.codecogs.com/png.latex?A">. In this case, if <img src="https://latex.codecogs.com/png.latex?A"> has <img src="https://latex.codecogs.com/png.latex?m"> rows and <img src="https://latex.codecogs.com/png.latex?n"> columns, this is an <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(mn)"> operation, which is more expensive than a sparse <img src="https://latex.codecogs.com/png.latex?A"> unless <img src="https://latex.codecogs.com/png.latex?A"> has roughly <img src="https://latex.codecogs.com/png.latex?n"> non-zeros per row.↩︎</p></li>
<li id="fn40"><p>but usually not cubically. See above footnote.↩︎</p></li>
<li id="fn41"><p>It’s important that we are talking about <em>precision</em> matrices here and not covariance matrices as the inverse of a sparse matrix is typically dense. For instance, an AR(1) prior with autocorrelation parameter <img src="https://latex.codecogs.com/png.latex?%5Crho"> has a sparse precision matrix that looks something like <img src="https://latex.codecogs.com/png.latex?%0AQ%20=%20%5Cfrac%7B1%7D%7B%5Ctau%5E2%7D%5Cbegin%7Bpmatrix%7D%0A1%20&amp;%20-%5Crho%20&amp;&amp;&amp;&amp;&amp;%20%5C%5C%0A-%5Crho&amp;1%20+%20%5Crho%5E2&amp;%20-%5Crho&amp;&amp;&amp;&amp;%20%5C%5C%0A&amp;-%5Crho&amp;%201%20+%20%5Crho%5E2%20&amp;-%20%5Crho&amp;&amp;&amp;%20%5C%5C%0A&amp;&amp;-%5Crho&amp;%201%20+%20%5Crho%5E2&amp;-%5Crho&amp;&amp;%20%5C%5C%0A&amp;&amp;&amp;-%5Crho&amp;1+%5Crho%5E2%20&amp;-%5Crho%20&amp;%20%5C%5C%0A&amp;&amp;&amp;&amp;-%5Crho&amp;1%20+%20%5Crho%5E2&amp;%20-%20%5Crho%20%5C%5C%0A&amp;&amp;&amp;&amp;&amp;-%5Crho&amp;1%0A%5Cend%7Bpmatrix%7D.%0A"> On the other hand, the <em>covariance matrix</em> is fully dense <img src="https://latex.codecogs.com/png.latex?%0AQ%5E%7B-1%7D%20%5Cpropto%20%5Ctau%5E2%5Cbegin%7Bpmatrix%7D%0A1&amp;%5Crho&amp;%5Crho%5E2&amp;%5Crho%5E3&amp;%5Crho%5E4&amp;%5Crho%5E5&amp;%5Crho%5E6%20%5C%5C%0A%5Crho&amp;1&amp;%5Crho&amp;%5Crho%5E2&amp;%5Crho%5E3&amp;%5Crho%5E4&amp;%5Crho%5E5%20%5C%5C%0A%5Crho%5E2&amp;%5Crho&amp;1&amp;%5Crho&amp;%5Crho%5E2&amp;%5Crho%5E3&amp;%5Crho%5E4%20%5C%5C%0A%5Crho%5E3&amp;%5Crho%5E2&amp;%5Crho&amp;1&amp;%5Crho&amp;%5Crho%5E2&amp;%5Crho%5E3%20%5C%5C%0A%5Crho%5E4&amp;%5Crho%5E3&amp;%5Crho%5E2&amp;%5Crho&amp;1&amp;%5Crho&amp;%5Crho%5E2%20%5C%5C%0A%5Crho%5E5&amp;%5Crho%5E4&amp;%5Crho%5E3&amp;%5Crho%5E2&amp;%5Crho&amp;1&amp;%5Crho%20%5C%5C%0A%5Crho%5E6&amp;%5Crho%5E5&amp;%5Crho%5E4&amp;%5Crho%5E3&amp;%5Crho%5E2&amp;%5Crho&amp;1%20%5C%5C%0A%5Cend%7Bpmatrix%7D.%0A"><br>
This is a generic property: the inverse of a sparse matrix is usually dense (more precisely, as long as the graph associated with the sparse matrix has a single connected component, there is a matrix with the same pattern of non-zeros that has a fully dense inverse) and the entries <a href="https://eudml.org/doc/130625">satisfy geometric decay bounds</a>.↩︎</p></li>
<li id="fn42"><p>Remember: <img src="https://latex.codecogs.com/png.latex?W"> is diagonal and known.↩︎</p></li>
<li id="fn43"><p>Not if you’re doing some wild dummy coding shit or modelling text, but typically.↩︎</p></li>
<li id="fn44"><p>You’d think that dense rows and columns would be a problem but they’re not. A little graph theory and a little numerical linear algebra says that as long as they are the last variables in the model, the algorithms will still be efficient. That said, if you want to <em>dig in</em>, it is possible to use supernodal (eg CHOLMOD) and multifrontal (eg MUMPS) methods to group the operations in such a way that it’s possible to use level-3 BLAS operations. CHOLMOD even spins this into a GPU acceleration scheme, which is fucking wild if you think about it: sparse linear algebra rarely has the arithmetic intensity or data locality required to make GPUs worthwhile (you spend all of your time communicating, which is great in a marriage, terrible in a GPU). But some clever load balancing, tree-based magic, and multithreading <a href="https://www.sciencedirect.com/science/article/pii/S1877750317312164">apparently makes it possible</a>. Like truly, I am blown away by this. We are not going to do <em>any</em> of this because absolutely fucking not. And anyway. It’s kinda rare to have a huge number of covariates in the sorts of models that use these complex random effects. (Or if you do, you better light your Sinead O’Connor votive candle because honestly you have a lot of problems and you’re gonna need healing.)↩︎</p></li>
<li id="fn45"><p>If you’ve been reading the footnotes, you’ll recall that sometimes one of these precision matrices on the diagonal will be singular. Sometimes that’s because you fucked up your programming. But other times it’s because you’re using something like an ICAR (intrinsic conditional autoregressive) prior on one of your components. The precision matrix for this model is <img src="https://latex.codecogs.com/png.latex?Q_%5Ctext%7BICAR%7D%20=%20%5Ctau%20%5Cleft(%5Coperatorname%7BDeg%7D(%5Cmathcal%7BG%7D)%20-%20%5Coperatorname%7BAdj%7D(%5Cmathcal%7BG%7D)%5Cright)">, where <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7BDeg%7D(%5Cmathcal%7BG%7D)"> is the diagonal matrix of vertex degrees and <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7BAdj%7D(%5Cmathcal%7BG%7D)"> is the adjacency matrix of some fixed graph <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> (typically describing something like which postcodes are next to each other). <a href="https://www.routledge.com/Gaussian-Markov-Random-Fields-Theory-and-Applications/Rue-Held/p/book/9781584884323">Some theory</a> suggests that if <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D"> has <img src="https://latex.codecogs.com/png.latex?d"> connected components, the zero determinant should be replaced with <img src="https://latex.codecogs.com/png.latex?%5Ctau%5E%7B(m%20-%20d)/2%7D">, where <img src="https://latex.codecogs.com/png.latex?m"> is the number of vertices in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BG%7D">.↩︎</p></li>
<li id="fn46"><p>I guess there’s nothing really stopping you from writing in pure Python except a creeping sense of inadequacy.↩︎</p></li>
<li id="fn47"><p>eg build a sparse matrix↩︎</p></li>
<li id="fn48"><p>Honey, we do not have time. Understanding autodiff is not massively important in the grand scheme of this blogpost (or, you know, probably in real life unless you do some fairly specific things). <a href="https://arxiv.org/abs/1811.05031">I’ll let Charles explain it.</a>↩︎</p></li>
<li id="fn49"><p>Or, a custom vector-Jacobian product, which is not a symmetrical choice.↩︎</p></li>
<li id="fn50"><p>I bind you Nancy!↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Sparse {Matrices} 1: {The} Linear Algebra of Linear Mixed
    Effects Models and Their Generalisations},
  date = {2022-03-22},
  url = {https://dansblog.netlify.app/2022-03-22-a-linear-mixed-effects-model},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Sparse Matrices 1: The Linear Algebra of
Linear Mixed Effects Models and Their Generalisations.”</span> March 22,
2022. <a href="https://dansblog.netlify.app/2022-03-22-a-linear-mixed-effects-model">https://dansblog.netlify.app/2022-03-22-a-linear-mixed-effects-model</a>.
</div></div></section></div> ]]></description>
  <category>Sparse matrices</category>
  <category>Linear mixed models</category>
  <guid>https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/a-linear-mixed-effects-model.html</guid>
  <pubDate>Mon, 21 Mar 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-03-22-a-linear-mixed-effects-model/patti.JPG" medium="image"/>
</item>
<item>
  <title>Barry Gibb came fourth in a Barry Gibb look alike contest (Repost)</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2022-01-26-barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost/barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost.html</link>
  <description><![CDATA[ 





<blockquote class="blockquote">
<p><em>Every day a little death, in the parlour, in the bed. On the lips and in the eyes. In the curtains in the silver, in the buttons, in the bread, in the murmurs, in the pauses, in the gestures, in the sighs.</em> <a href="https://www.youtube.com/watch?v=Snru5gtCyWA">Sondheim</a></p>
</blockquote>
<p>The most horrible sound in the world is that of a reviewer asking you to compare your computational method to another, existing method. Like bombing countries in the name of peace, the purity of intent drowns out the voices of our better angels as they whisper: at what cost.</p>
<p>Before the unnecessary drama of that last sentence<sup>1</sup> sends you running back to the still-open browser tab documenting the world’s slow slide into a deeper, danker, more complete darkness than we’ve seen before, I should say that I understand that for most people this isn’t a problem. Most people don’t do research in computational statistics. Most people are happy<sup>2</sup>.</p>
<p>So why does someone asking for a comparison of two methods for allegedly computing the same thing fill me with the sort of dread usually reserved for climbing down the ladder into my basement to discover, by the light of a single, swinging, naked light bulb, that the evil clown I keep chained in the corner has escaped? Because it’s almost impossible to do well.</p>
<section id="i-go-through-all-this-before-you-wake-up-so-i-can-feel-happier-to-be-safe-again-with-you" class="level1">
<h1>I go through all this before you wake up so I can feel happier to be safe again with you</h1>
<p>Many many years ago, when I still had all my hair and thought it was impressive when people proved things, I did a PhD in numerical analysis. These all tend to have the same structure:</p>
<ol type="1">
<li><p>survey your chosen area with a simulation study comparing all the existing methods,</p></li>
<li><p>propose a new method that should be marginally better than the existing ones,</p></li>
<li><p>analyse the new method, show that it’s at least not worse than the existing ones (or worse in an interesting way),</p></li>
<li><p>construct a simulation study that shows the superiority of your method on a problem that hopefully doesn’t look too artificial,</p></li>
<li><p>write a long discussion blaming the inconsistencies between the maths and the simulations on “pre-asymptotic artefacts”.</p></li>
</ol>
<p>Which is to say, I’ve done my share of simulation studies comparing algorithms.</p>
<p>So what changed? When did I start to get <a href="https://www.youtube.com/watch?v=ykdtNuKlHiA">the fear</a> every time someone mentioned comparing algorithms?</p>
<p>Well, I left numerical analysis and moved to statistics and I learnt the one true thing that all people who come to statistics must learn: statistics is hard.</p>
<p>When I used to compare deterministic algorithms it was easy. I would know the correct answer and so I could compare algorithms by comparing the error in their approximate solutions (perhaps taking into account things like how long it took to compute the answer).</p>
<p>But in statistics, the truth is random. Or the truth is a high-dimensional joint distribution that you cannot possibly know. So how can you really compare your algorithms, except possibly by comparing your answer to some sort of “gold standard” method that may or may not work.</p>
</section>
<section id="inte-ner-för-ett-stup.-inte-ner-från-en-bro.-utan-från-vattentornets-topp." class="level1">
<h1>Inte ner för ett stup. Inte ner från en bro. Utan från vattentornets topp<sup>3</sup>.</h1>
<p>The first two statistical things I ever really worked on (in an office overlooking a fjord) were computationally tractable ways of approximating posterior distributions for specific types of models. The first of these was <a href="https://en.wikipedia.org/wiki/Irish_National_Liberation_Army">INLA</a><sup>4</sup>. For those of you who haven’t heard of it, INLA (and its popular R implementation <a href="https://www.r-inla.org">R-INLA</a>) is a method for doing approximate posterior computation for a lot of the sorts of models you can fit in <code>rstanarm</code> and <code>brms</code>. So random effect models, multilevel models, models with splines, and spatial effects.</p>
<p>At the time, Stan didn’t exist (later, it barely existed), so I would describe INLA as being Bayesian inference for people who lacked the ideological purity to wait 14 hours for a poorly mixing BUGS chain to run, instead choosing to spend 14 seconds to get a better “approximate” answer. These days, Stan exists in earnest and that 14 hours is 20 minutes for small-ish models with only a couple of thousand observations, and the answer that comes out of Stan is probably as good as INLA.</p>
<p>Working on INLA I learnt a new fear: the fear that someone else was going to publish a simulation study comparing INLA with something else without checking with us first.</p>
<p>Now obviously, we wanted people to run their comparisons past us so we could ruthlessly quash any dissent and hopefully exile the poor soul who thought to critique our perfect method to the academic equivalent of a Siberian work camp.</p>
<p>Or, more likely, because comparing statistical models is really hard, and we could usually make the comparison much better by asking some questions about how it was being done.</p>
<p>Sometimes, learning from well-constructed simulation studies how INLA was failing led to improvements in the method.</p>
<p>But nothing could be learned if, for instance, the simulation study was reporting runs from code that wasn’t doing what the authors thought it was<sup>5</sup>. And I don’t want to suggest that bad or unfair comparisons come from malice (for the most part, we’re all quite conscientious and fairly nice), but rather that they happen because comparing statistical algorithms is hard.</p>
<p>And comparing algorithms fairly where you don’t understand them equally well is almost impossible.</p>
</section>
<section id="well-did-you-hear-the-one-about-mr-ed-he-said-im-this-way-because-of-the-things-ive-seen" class="level1">
<h1>Well did you hear the one about Mr Ed? He said I’m this way because of the things I’ve seen</h1>
<p>Why am I bringing this up? It’s because of the second statistical thing that I worked on while I was living in sunny Trondheim (in between looking at the fjord and holding onto the sides of buildings for dear life because for 8 months of the year Trondheim is a very pretty mess of icy hills).</p>
<p>During that time, I worked with <a href="https://www.maths.ed.ac.uk/~flindgre/">Finn Lindgren</a> and <a href="https://www.kaust.edu.sa/en/study/faculty/haavard-rue">Håvard “INLA” Rue</a> on computationally efficient approximations to Gaussian random fields (which is what we’re supposed to call Gaussian Processes when the parameter space is more complex than just “time” [<em>shakes fist at passing cloud</em>]). Finn (with Håvard and Johan Lindström) had proposed a new method, cannily named the <a href="https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2011.00777.x">Stochastic Partial Differential Equation</a> (SPDE) method, for exploiting the continuous-space Markov property in higher dimensions. Which all sounds very maths-y, but it isn’t.</p>
<p>The guts of the method say “all of our problems with working computationally with Gaussian random fields come from the fact that the set of all possible functions is too big for a computer to deal with, so we should do something about that”. The “something” is to replace the continuous function with a piecewise linear one defined over a fairly fine triangulation on the domain of interest.</p>
</section>
<section id="but-why-am-i-talking-about-this" class="level1">
<h1>But why am I talking about this?</h1>
<p>(Sorry. One day I’ll write a short post.)</p>
<p>A <a href="https://arxiv.org/pdf/1710.05013.pdf">very exciting paper popped up on arXiv on Monday</a><sup>6</sup> comparing a fairly exhaustive collection of recent methods for making spatial Gaussian random fields more computationally efficient.</p>
<p>Why am I not cringing in fear? Because if you look at the author list, they have included an author from each of the projects they have compared! This means that the comparison will probably be as good as it can be. In particular, it won’t suffer from the usual problem of the authors understanding some methods they’re comparing better than others.</p>
<h1>The world is held together by the wind that blows through Gena Rowlands’ hair</h1>
<p>So how did they go? Well, actually, they did quite well. I like that</p>
<ul>
<li><p>They describe each problem quite well</p></li>
<li><p>The simulation study and the real data analysis use a collection of different evaluation metrics</p></li>
<li><p>Some of these are proper scoring rules, which is the correct framework for evaluating probabilistic predictions</p></li>
<li><p>They acknowledge that the wall clock timings are likely to be more a function of how hard a team worked to optimise performance on this one particular model than a true representation of how these methods would work in practice.</p></li>
</ul>
</section>
<section id="not-the-lovin-kind" class="level1">
<h1>Not the lovin’ kind</h1>
<p>But I’m an academic statistician. And our key feature, as a people, is that we loudly and publicly dislike each other’s work. Even the stuff we agree with. Why? Because people with our skills who also have impulse control tend to work for more money in the private sector.</p>
<p>So with that in mind, let’s have some fun.</p>
<p>(Although seriously, this is the best comparison of this type I’ve ever seen. So, really, I’m just wanting it to be even bester.)</p>
<p>So what’s wrong with it?</p>
</section>
<section id="its-gotta-be-big.-i-said-it-better-be-big" class="level1">
<h1>It’s gotta be big. I said it better be big</h1>
<p>The most obvious problem with the comparison is that the problem that these methods are being compared on is not particularly large or complex. You can see that from the timings. Almost none of these implementations are sweating, which is a sign that we are not anywhere near the sort of problem that would really allow us to differentiate between methods.</p>
<p>So how small is small? The problem had 105,569 observations and required prediction at, at most, 4,431 other locations. To be challenging, this data needed to be another order of magnitude bigger.</p>
</section>
<section id="god-knows-i-know-ive-thrown-away-those-graces" class="level1">
<h1>God knows I know I’ve thrown away those graces</h1>
<p>(Can you tell what I’m listening to?)</p>
<p>The second problem with the comparison is that the problem is tooooooo easy. As the data is modelled with Gaussian observation noise and a multivariate Gaussian latent random effect, it is a straightforward piece of algebra to eliminate all of the latent Gaussian variables from the model. This leads to a model with only a small number of parameters, which should make inference much easier.</p>
<p>How do you do that? Well, suppose the data is <img src="https://latex.codecogs.com/png.latex?y">, the Gaussian random field is <img src="https://latex.codecogs.com/png.latex?x">, and the hyperparameters are <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. In this case, we can use conditional probability to write <img src="https://latex.codecogs.com/png.latex?%0Ap(%5Ctheta%20%5Cmid%20y)%20%5Cpropto%20%5Cfrac%7Bp(y,x,%5Ctheta)%7D%7Bp(x%20%5Cmid%20y,%20%5Ctheta)%7D,%0A"> which holds for every value of <img src="https://latex.codecogs.com/png.latex?x">, and in particular for <img src="https://latex.codecogs.com/png.latex?x=0">. Hence if you have a closed-form full conditional (which is the case when you have Gaussian observations), you can write the marginal posterior out exactly without having to do any integration.</p>
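<p>To make the trick concrete, here is a minimal numerical check of the identity in a toy one-dimensional conjugate model. The model and all numbers below are illustrative choices of mine, not anything from the paper:</p>

```python
import math

def npdf(x, mean, var):
    """Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy conjugate model (illustrative numbers): y | x ~ N(x, sigma2), x ~ N(0, theta).
sigma2, theta, y = 0.5, 2.0, 1.3

# The full conditional x | y, theta is Gaussian, by the standard conjugate update.
cond_var = 1.0 / (1.0 / sigma2 + 1.0 / theta)
cond_mean = cond_var * y / sigma2

x = 0.0  # the identity holds for every x; x = 0 is just the convenient choice
ratio = npdf(y, x, sigma2) * npdf(x, 0.0, theta) / npdf(x, cond_mean, cond_var)

# The ratio reproduces the marginal likelihood p(y | theta) = N(y; 0, sigma2 + theta)
# exactly, with no integration anywhere.
print(abs(ratio - npdf(y, 0.0, sigma2 + theta)) < 1e-12)  # True
```

<p>The same three densities appear in the general latent Gaussian case; only the conjugate update for the full conditional gets replaced by a sparse multivariate one.</p>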
<p>A much more challenging problem would have had Poisson or binomial data, where the full conditional doesn’t have a known form. In this case you cannot do this marginalisation analytically, so you put much more stress on your inference algorithm.</p>
<p>I guess there’s an argument to be made that some methods are really difficult to extend to non-Gaussian observations. But there’s also an argument to be made that I don’t care. Shit or get off the pot, as Americans would say.</p>
</section>
<section id="dont-take-me-back-to-the-range" class="level1">
<h1>Don’t take me back to the range</h1>
<p>The prediction quality is measured in terms of mean squared error and mean absolute error (which are fine), the continuous ranked probability score (CRPS) and the Interval Score (INT), both of which are proper scoring rules. <a href="https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf">Proper scoring rules</a> (and follow the link or google for more if you’ve never heard of them) are the correct way to compare probabilistic predictions, regardless of the statistical framework that’s used to make the predictions. So this is an excellent start!</p>
<p>But one of these measures does stand out: the prediction interval coverage (CVG) which is defined in the paper as “the percent of intervals containing the true predicted value”. I’m going to parse that as “the percent of prediction intervals containing the true value”. The paper suggests (through use of bold in the tables) that the correct value for CVG is 0.95. That is, the paper suggests the true value should lie within the 95% interval 95% of the time.</p>
<p><em>This is not true.</em></p>
<p>Or, at least, this is considerably more complex than the result suggests.</p>
<p>Or, at least, this is only true if you compute intervals that are specifically built to do this, which is mostly very hard to do. And you definitely don’t do it by providing a standard error (which is an option in this competition).</p>
<h1>Boys on my left side. Boys on my right side. Boys in the middle. And you’re not here.</h1>
<p>So what’s wrong with CVG?</p>
<p>Well, first of all, it’s a multiple testing problem. You are not testing the same interval multiple times; you are checking multiple intervals one time each. So it can only be meaningful if the prediction intervals were constructed jointly to solve this specific multiple testing problem.</p>
<p>Secondly, it’s extremely difficult to know what is considered random here. Coverage statements are statements about repeated tests, so how you repeat them<sup>7</sup> will affect whether or not a particular statement is true. It will also affect how you account for the multiple testing when building your prediction intervals. (Really, if anyone did opt to just return standard errors, nothing good is going to happen for them in this criterion!)</p>
<p>Thirdly, it’s already covered by the interval score. If your interval is <img src="https://latex.codecogs.com/png.latex?%5Bl,u%5D"> with nominal non-coverage level <img src="https://latex.codecogs.com/png.latex?%5Calpha">, the interval score for an observation <img src="https://latex.codecogs.com/png.latex?y"> is <img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BINT%7D_%5Calpha(l,%20u,%20y)%20=%20u%20-%20l%20+%20%5Cfrac%7B2%7D%7B%5Calpha%7D(l-y)%20%5Cmathbf%7B1%7D%5C%7By%20%3C%20l%5C%7D%20+%20%5Cfrac%7B2%7D%7B%5Calpha%7D(y-u)%5Cmathbf%7B1%7D%5C%7By%3Eu%5C%7D.%0A"> This score (where smaller is better) rewards you for having a narrow prediction interval, but penalises you every time the data does not lie in the interval. The score is minimised when <img src="https://latex.codecogs.com/png.latex?%5CPr(y%20%5Cin%20%5Bl,u%5D)%20=%201%20-%20%5Calpha">. So this really is a good measure of how well the interval estimate is calibrated, and it checks more aspects of the interval than CVG (which lacks the width term) does.</p>
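<p>The interval score is a direct transcription of the formula above (the function name is mine; here <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is the non-coverage, e.g.&nbsp;0.05 for a 95% interval):</p>

```python
def interval_score(l, u, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval [l, u].

    Smaller is better: you pay the width up front, plus a 2/alpha charge
    for every miss on either side.
    """
    score = u - l
    if y < l:
        score += (2.0 / alpha) * (l - y)
    if y > u:
        score += (2.0 / alpha) * (y - u)
    return score

# A wide interval that contains the observation pays only its width...
print(interval_score(-1.96, 1.96, 0.5, alpha=0.05))  # 3.92
# ...while a narrow interval that misses pays dearly: 1.0 + 40 * 1.5.
print(interval_score(-0.5, 0.5, 2.0, alpha=0.05))    # 61.0
```

<p>In practice you average this over all prediction locations, which is exactly what the INT column in the paper reports.</p>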
</section>
<section id="theres-the-part-youve-braced-yourself-against-and-then-theres-the-other-part" class="level1">
<h1>There’s the part you’ve braced yourself against, and then there’s the other part</h1>
<p>Any conversation about how to evaluate the quality of an interval estimate really only makes sense in the situation where everyone has constructed their intervals the same way. The authors’ code is <a href="https://github.com/finnlindgren/heatoncomparison/">here</a>, but even without seeing it we know there are essentially four options:</p>
<ol type="1">
<li><p>Compute pointwise prediction means <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmu%7D_i"> and standard errors <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Csigma%7D_i"> and build the pointwise intervals <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmu%7D_i%20%5Cpm%201.96%5Chat%7B%5Csigma%7D_i">.</p></li>
<li><p>Compute the pointwise Bayesian prediction intervals, which are formed from the appropriate quantiles (or the HPD region if you are Tony O’Hagan) of <img src="https://latex.codecogs.com/png.latex?%5Cint%20%5Cint%20p(%5Chat%7By%7D%20%5Cmid%20x,%5Ctheta)%20p(x,%5Ctheta%20%5Cmid%20y)%5C,dx%20d%5Ctheta">.</p></li>
<li><p>An interval of the form <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmu%7D_i%20%5Cpm%20c%5Chat%7B%5Csigma%7D_i">, where <img src="https://latex.codecogs.com/png.latex?c"> is chosen to ensure coverage.</p></li>
<li><p>Some sort of clever thing based on functional data analysis.</p></li>
</ol>
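<p>Given a bag of posterior predictive draws at one location, options 1 and 2 are both one-liners, and they can disagree badly. A sketch with synthetic draws (the skewed predictive is my own toy choice, nothing from the competition code):</p>

```python
import random
import statistics

random.seed(1)

# Synthetic, deliberately skewed posterior predictive draws at a single location.
draws = sorted(random.lognormvariate(0.0, 0.75) for _ in range(20000))

# Option 1: pointwise mean +/- 1.96 standard errors.
mu = statistics.fmean(draws)
se = statistics.stdev(draws)
opt1 = (mu - 1.96 * se, mu + 1.96 * se)

# Option 2: the central 95% interval read off the quantiles of the draws.
opt2 = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws)) - 1])

# For a skewed predictive the two disagree: option 1 dips below zero even
# though every draw is strictly positive.
print(opt1[0] < 0 < opt2[0])  # True
```

<p>Which of these a team submitted matters a lot for a coverage criterion, which is exactly the point.</p>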
<p>But how well these different options work will depend on how they’re being assessed (or what they’re being used for).</p>
<section id="option-1-we-want-to-fill-in-our-sparse-observation-by-predicting-at-more-and-more-points" class="level2">
<h2 class="anchored" data-anchor-id="option-1-we-want-to-fill-in-our-sparse-observation-by-predicting-at-more-and-more-points">Option 1: We want to fill in our sparse observation by predicting at more and more points</h2>
<p>(This is known as “in-fill asymptotics”). This type of question occurs when, for instance, we want to fill in the holes in satellite data (which are usually due to clouds).</p>
<p>This is the case that most closely resembles the design of the simulation study in this paper. In this case you refine your estimated coverage by computing more prediction intervals and checking if the true value lies within the interval.</p>
<p>Most of the easy-to-find results about coverage are from the 1D literature (specifically around smoothing splines and non-parametric regression). In these cases, it’s known that the first option is bad, the second option will <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-21/issue-2/An-Analysis-of-Bayesian-Inference-for-Nonparametric-Regression/10.1214/aos/1176349157.full">lead to conservative regions</a> (the coverage will be too high), the third option involves <a href="https://link.springer.com/book/10.1007/978-0-387-48116-6">some sophisticated understanding of how Gaussian random fields work</a>, and the fourth is not something I know anything about.</p>
</section>
<section id="option-2-we-want-to-predict-at-one-point-where-the-field-will-be-monitored-multiple-times" class="level2">
<h2 class="anchored" data-anchor-id="option-2-we-want-to-predict-at-one-point-where-the-field-will-be-monitored-multiple-times">Option 2: We want to predict at one point, where the field will be monitored multiple times</h2>
<p>This second option comes up when we’re looking at a long-term monitoring network. This type of data is common in environmental science, where a long-term network of sensors is set up to monitor, for example, air pollution. The new observations are not independent of the previous ones (there’s usually some sort of temporal structure), but independence can often be assumed if the observations are distant enough in time.</p>
<p>In this case, as you are repeating observations at a single site, option 1 will be the right way to construct your interval, option 2 will probably still be a bit broad but might be ok, and options 3 and 4 will probably be too narrow if the underlying process is smooth.</p>
</section>
<section id="option-3-mixed-asymptotics-you-do-both-at-once" class="level2">
<h2 class="anchored" data-anchor-id="option-3-mixed-asymptotics-you-do-both-at-once">Option 3: Mixed asymptotics! You do both at once</h2>
<p>Simulation studies are the last refuge of the damned.</p>
</section>
</section>
<section id="i-see-the-sun-go-down.-i-see-the-sun-come-up.-i-see-a-light-beyond-the-frame." class="level1">
<h1>I see the sun go down. I see the sun come up. I see a light beyond the frame.</h1>
<p>So what are my suggestions for making this comparison better (other than making it bigger, harder, and dumping the weird CVG criterion)?</p>
<ol type="1">
<li><p>randomise</p></li>
<li><p>randomise</p></li>
<li><p>randomise</p></li>
</ol>
<p>What do I mean by that? Well, in the simulation study, the paper only considered one possible set of data simulated from the correct model. All of the results in their Table 2, which contains the scores and timings on the simulated data, depend on this particular realisation. And hence Table 2 is a realisation of a random variable that will have a mean and standard deviation.</p>
<p>This should <em>not</em> be taken as an endorsement of the frequentist view that the observed data is random and estimators should be evaluated by their average performance over different realisations of the data. <em>This is an acknowledgement of the fact that in this case the data is actually a realisation of a random variable.</em> Reporting the variation in Table 2 would give an idea of the variation in the performance of the method. And would lead to a more nuanced and realistic comparison of the methods. It is not difficult to imagine that for some of these criteria there is no clear winner when averaged over data sets.</p>
</section>
<section id="where-did-you-get-that-painter-in-your-pocket" class="level1">
<h1>Where did you get that painter in your pocket?</h1>
<p>I have very mixed feelings about the timings column in the results table. On one hand, an “order of magnitude” estimate of how long this will actually take to fit is probably a useful thing for a person considering using a method. On the other hand, there is just no way for these results not to be misleading. And the paper acknowledges this.</p>
<p>Similarly, the competition does not specify things like priors for the Bayesian solutions. This makes it difficult to really compare things like interval estimates, which can strongly depend on the specified priors. You could certainly improve your chances of winning on the CVG computation for the simulation study by choosing your priors carefully!</p>
</section>
<section id="what-is-this-six-stringed-instrument-but-an-adolescent-loom" class="level1">
<h1>What is this six-stringed instrument but an adolescent loom?</h1>
<p>I haven’t really talked about the real data performance yet. Part of this is because <a href="https://statmodeling.stat.columbia.edu/2019/10/15/a-heart-full-of-hatred-8-schools-edition/">I don’t think real data is particularly useful for evaluating algorithms</a>. More likely, you’re evaluating your chosen data set as much as, or even more than, you are evaluating your algorithm.</p>
<p>Why? Because real data doesn’t follow the model, so even if a particular method gives a terrible approximation to the inference you’d get from the “correct” model, it might do very very well on the particular data set. I’m not sure how you can draw any sort of meaningful conclusion from this type of situation.</p>
<p>I mean, I should be happy I guess because the method I work on “won” three of the scores, and did fairly well in the other two. But there’s no way to say that wasn’t just luck.</p>
<p>What does luck look like in this context? It could be that the SPDE approximation is a better model for the data than the “correct” Gaussian random field model. It could just be Finn appealing to the old Norse gods. It’s really hard to tell.</p>
<p>If any real data is to be used to make general claims about how well algorithms work, I think it’s necessary to use <em>a lot</em> of different data sets rather than just one.</p>
<p>Similarly, a range of different simulation study scenarios would give a broader picture of when different approximations behave better.</p>
</section>
<section id="dont-dream-its-over" class="level1">
<h1>Don’t dream it’s over</h1>
<p><a href="https://www.youtube.com/watch?v=OtvdZ47h8y4">One more kiss before we part</a>: This field is still alive and kicking. One of the really exciting new ideas in the field (that’s probably too new to be in the comparison) is that you can speed up the computation of the unnormalised log-posterior through <a href="https://arxiv.org/abs/1709.04419">hierarchical decompositions of the covariance matrix</a> (there is also code). This is a really neat method for solving the problem and a really exciting new idea in the field.</p>
<p>There are a bunch of other things that are probably worth looking at in this article, but I’ve run out of energy for the moment. Probably the most interesting thing for me is that a lot of the methods that did well (SPDEs, Predictive Processes, Fixed Rank Kriging, Multi-resolution Approximation, Lattice Krig, Nearest-Neighbour Predictive Processes) are cut from very similar cloth. It would be interesting to look deeper at the similarities and differences in an attempt to explain these results.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>2021: Oh my giddy aunt what even was that?!↩︎</p></li>
<li id="fn2"><p>2021: The extent to which these blog posts captured the variations in my mental state around that time is notable to me, but not interesting to others. So I’m sorry about that. But also they give a small glimpse at just how bleak my sense of humour can be.↩︎</p></li>
<li id="fn3"><p>No I don’t speak Swedish, but <a href="https://www.youtube.com/watch?v=oS2ExAcW-Z8">one of my favourite songwriters/lyricists</a> does. And sometimes I’m just that unbearable. Also the next part of this story takes place in Norway, which is near Sweden but produces worse music (<a href="https://www.youtube.com/watch?v=Y_lEXa7VWcA">Susanne Sundfør</a> and <a href="https://www.youtube.com/watch?v=ZCFlT_FYnEE">M2M</a> being notable exceptions).↩︎</p></li>
<li id="fn4"><p>I once gave a truly mortifying talk called INLA: Past, Present, and Future at a conference in Dublin.↩︎</p></li>
<li id="fn5"><p>Or, as happened one time, they compared computation for a different model with an algorithm that failed its convergence checks and assumed that all of the hyperparameters were fixed. All of that is bad but the last part is like saying <code>lm</code> is faster than <code>lme4::lmer</code> for fitting mixed effects models because we only checked when the almost always unknown variance parameters were assumed known.↩︎</p></li>
<li id="fn6"><p>In 2017. A long time ago.↩︎</p></li>
<li id="fn7"><p>Repeat the same test or make a new test for different data↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2022,
  author = {Simpson, Dan},
  title = {Barry {Gibb} Came Fourth in a {Barry} {Gibb} Look Alike
    Contest {(Repost)}},
  date = {2022-01-26},
  url = {https://dansblog.netlify.app/2022-01-26-barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2022" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2022. <span>“Barry Gibb Came Fourth in a Barry Gibb Look
Alike Contest (Repost).”</span> January 26, 2022. <a href="https://dansblog.netlify.app/2022-01-26-barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost">https://dansblog.netlify.app/2022-01-26-barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost</a>.
</div></div></section></div> ]]></description>
  <category>Computation</category>
  <category>Assessing algorithms</category>
  <guid>https://dansblog.netlify.app/posts/2022-01-26-barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost/barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost.html</guid>
  <pubDate>Tue, 25 Jan 2022 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2022-01-26-barry-gibb-came-fourth-in-a-barry-gibb-look-alike-contest-repost/yetta.JPG" medium="image"/>
</item>
<item>
  <title>Why won’t you cheat with me? (Repost)</title>
  <dc:creator>Dan Simpson</dc:creator>
  <link>https://dansblog.netlify.app/posts/2021-12-09-why-wont-you-cheat-with-me-repost/why-wont-you-cheat-with-me-repost.html</link>
  <description><![CDATA[ 





<blockquote class="blockquote">
<p>But I got some ground rules I’ve found to be sound rules<br>
and you’re not the one I’m exempting.<br>
Nonetheless, I confess it’s tempting.<br>
– <a href="https://www.youtube.com/watch?v=K2sPdIsr7jY">Jenny Toomey sings Franklin Bruno</a></p>
</blockquote>
<p>It turns out that I did something a little controversial in <a href="https://dansblog.netlify.app/posts/2021-12-08-the-king-must-die-repost/">last week’s</a><sup>1</sup> post. As these things always go, it wasn’t the thing I was expecting to get push back from, but rather what I thought was a fairly innocuous scaling of the prior. <a href="http://statmodeling.stat.columbia.edu/2017/11/02/king-must-die/#comment-601142">One commenter</a> (and a few other people on other communication channels) pointed out that the dependence of the prior on the design didn’t seem kosher. Of course, we (Andrew, Mike and I) wrote a paper that was sort of about this a <a href="http://www.stat.columbia.edu/~gelman/research/published/entropy-19-00555-v2.pdf">few months ago</a><sup>2</sup>, but it’s one of those really interesting topics that we can probably all deal with thinking more about.</p>
<p>So in this post, I’m going to go into a couple of situations where it makes sense to scale the prior based on fixed information about the experiment. (The emerging theme for these posts is “things I think are interesting and useful but are probably not publishable” interspersed with “weird digressions into musical theatre / the personal mythology of Patti LuPone”.)</p>
<p>If you haven’t clicked yet, this particular post is going to be drier than Eve Arden in Mildred Pierce. If you’d rather be entertained, I’d recommend <a href="https://open.spotify.com/album/2qY9GSG0nLoJdcQNmYxMGE">Tempting: Jenny Toomey sings the songs of Franklin Bruno</a>. (Franklin Bruno is today’s stand in for Patti, because I’m still sad that War Paint closed<sup>3</sup>. I only got to see it twice.)</p>
<p>(Jenny Toomey was one of the most exciting American indie musicians in the 90s both through her bands [Tsunami was the notable one, but there were others] and her work with Simple Machines, the label she co-founded. These days she’s working in musician advocacy and hasn’t released an album since the early 2000s. Bruno’s current band is called The Human Hearts. He has had a long solo career and was also in an excellent powerpop band called Nothing Painted Blue, who had an album called The Monte Carlo Method. And, now<sup>4</sup> that I live in Canada, I should say that that album has a fabulous cover of Mark Szabo’s I Should Be With You. To be honest, the only reason I work with Andrew and the Stan crew is that I figure if I’m in New York often enough I’ll eventually coincide with a Human Hearts concert<sup>5</sup>.)</p>
<section id="sparsity" class="level2">
<h2 class="anchored" data-anchor-id="sparsity">Sparsity</h2>
<blockquote class="blockquote">
<p>Why won’t you cheat with me? You and I both know you’ve done it before. – <a href="https://www.youtube.com/watch?v=dL-4ZQthJ5w">Jenny Toomey sings Franklin Bruno</a></p>
</blockquote>
<p>The first object of our affliction is priors that promote sparsity in high-dimensional models. There has been a lot of work on this topic, but the cheater’s guide is basically this:</p>
<blockquote class="blockquote">
<p>While spike-and-slab models can exactly represent sparsity and have excellent theoretical properties, they are basically useless from a computational point of view. So we use scale-mixture of normal priors (also known as local-global priors) to achieve approximate sparsity, and then use some sort of decision rule to take our approximately sparse signal and make it exactly sparse.</p>
</blockquote>
<p>What is a scale-mixture of normals? Well it has the general form <img src="https://latex.codecogs.com/png.latex?%0A%5Cbeta_j%20%5Csim%20N(0,%20%5Ctau%5E2%20%5Cpsi%5E2_j),%0A"> where <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is a global standard deviation parameter, controlling how large the <img src="https://latex.codecogs.com/png.latex?%5Cbeta_j"> parameters are in general<sup>6</sup>, while the local standard deviation parameters <img src="https://latex.codecogs.com/png.latex?%5Cpsi_j"> control how big <img src="https://latex.codecogs.com/png.latex?%5Cbeta_j"> is <em>relative</em> to the other <img src="https://latex.codecogs.com/png.latex?%5Cbeta">s.</p>
<p>The priors for <img src="https://latex.codecogs.com/png.latex?%5Ctau"> and the <img src="https://latex.codecogs.com/png.latex?%5Cpsi_j"> are typically set to be independent. A lot of theoretical work just treats <img src="https://latex.codecogs.com/png.latex?%5Ctau"> as fixed (or as otherwise less important than the local parameters), but <a href="https://arxiv.org/abs/1610.05559">this is wrong</a>.</p>
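<p>To see what this local–global structure does in practice, here is a minimal simulation sketch (mine, not from the original post; the constants are made up for illustration). It draws coefficients with half-Cauchy local scales and compares them with an iid normal draw at the same global scale: the heavy-tailed local scales squash most coefficients towards zero while letting a few escape to be very large.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p, tau = 10_000, 0.1  # illustrative dimension and global scale

# Local-global prior: beta_j ~ N(0, tau^2 * psi_j^2) with half-Cauchy psi_j.
psi = np.abs(rng.standard_cauchy(p))
beta = tau * psi * rng.standard_normal(p)

# Comparison: iid normal with the same global scale tau (no local scales).
beta_normal = tau * rng.standard_normal(p)

print(np.median(np.abs(beta)))  # most coefficients are squashed near zero...
print(np.abs(beta).max())       # ...while a few are enormous
```
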
<p><em>Pedant’s corner:</em> Andrew likes to define mathematical statisticians as those who use <img src="https://latex.codecogs.com/png.latex?x"> for their data rather than <img src="https://latex.codecogs.com/png.latex?y">. I prefer to characterise them as those who think it’s a good idea to put a prior on variance (an un-elicitable quantity) rather than standard deviation (which is easy to have opinions about). Please, people, just stop doing this. You’re not helping yourselves!</p>
<p>Actually, maybe that last point isn’t for Pedant’s Corner after all. Because if you parameterise by standard deviation it’s pretty easy to work out what the marginal prior on <img src="https://latex.codecogs.com/png.latex?%5Cbeta_j"> (with <img src="https://latex.codecogs.com/png.latex?%5Ctau"> fixed) is.</p>
<p>This is quite useful because, with the notable exception of the “Bayesian” “Lasso” <a href="https://dansblog.netlify.app/posts/2021-12-08-the-king-must-die-repost/">which-does-not-work-but-will-never-die-because-it-was-inexplicably-published-in-the-leading-stats-journal-by-prominent-statisticians-and-has-the-word-Lasso-in-the-title-even-though-a-back-of-the-envelope-calculation-or-I-don’t-know-a-fairly-straightforward-simulation-by-the-reviewers-should-have-nixed-it</a> (to use its married name), we can’t compute the marginal prior for most scale-mixtures of normals.</p>
<p>The following result, which was killed by reviewers at some point during the PC prior paper’s long review process, but lives forever <a href="https://arxiv.org/abs/1403.4630v1">in the arXiv’d first version</a>, tells you everything you need to know. It’s a picture because, frankly, I’ve had a glass of wine and I’m not bloody typing it all again<sup>7</sup>.</p>
<div id="thm-prior" class="theorem">
<p><span class="theorem-title"><strong>Theorem 1</strong></span> Let <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)"> be a prior on the standard deviation of <img src="https://latex.codecogs.com/png.latex?v%20%5Csim%0A%7B%5Cmathcal%20N%7D(0,r%5E2)">. The induced prior <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(v)%20=%20%5Cint_0%5E%5Cinfty%0A%5Cfrac%7B1%7D%7B2%5Cpi%20r%7D%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2r%5E2%7D%7D%5Cright)%5Cpi_d(r)%5C,dr%0A"> has the following properties. Fix <img src="https://latex.codecogs.com/png.latex?%5Cdelta%3E%200">.</p>
<ol type="1">
<li><p>If <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)%20%5Cleq%20Cr%5Et"> for all <img src="https://latex.codecogs.com/png.latex?r%20%5Cin%20%5B0,%5Cdelta%5D"> and for some <img src="https://latex.codecogs.com/png.latex?C,t%20%3E0">, then <img src="https://latex.codecogs.com/png.latex?%5Cpi(v)"> is finite at <img src="https://latex.codecogs.com/png.latex?v=0">.</p></li>
<li><p>If <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)%20%5Cin%20(0,%5Cinfty)"> for every <img src="https://latex.codecogs.com/png.latex?r%20%5Cin%0A%5B0,%5Cdelta%5D">, then <img src="https://latex.codecogs.com/png.latex?%5Cpi(v)"> has a weak logarithmic spike at zero, that is <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(v)%20=%20%5Cmathcal%7BO%7D%5Cleft%5B%5Clog%5Cleft(1%20+%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%5Cfrac%7B%5Cdelta%5E2%7D%7Bv%5E2%7D%5Cright)%5Cright%5D,%20%5Cqquad%20v%20%5Crightarrow%200.%0A"></p></li>
<li><p>If <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E%5Cdelta%20%5Cfrac%7B1%7D%7B2%5Cpi%0A%20%20r%7D%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2r%5E2%7D%7D%5Cright)%5Cpi_d(r)%5C,dr%20%3C%0A%20%20%5Cinfty">, then <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(v)%20%5Cgeq%0A%5Cmathcal%7BO%7D%5Cleft(v%5E%7B-2%7D%5Cexp%5Cleft(-%5Cfrac%7Bv%5E2%7D%7B2%5Cdelta%5E2%7D%5Cright)%5Cright),%0A%5Cqquad%20%7Cv%7C%20%5Crightarrow%20%5Cinfty.%0A"></p></li>
<li><p>If <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)%20%7B%5Cleq%7D(%7B%5Cgeq%7D)%20Cr%5E%7B-t%7D"> for all <img src="https://latex.codecogs.com/png.latex?r%20%5Cin%20%5B0,%5Cdelta%5D"> and for some <img src="https://latex.codecogs.com/png.latex?C,t%20%3E0">, then <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(v)%0A%7B%5Cleq%7D(%7B%5Cgeq%7D)%20%5Cmathcal%7BO%7D(%7Cv%7C%5E%7B-t%7D),%5Cqquad%20v%20%5Crightarrow%200.%0A"></p></li>
<li><p>If <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)%20%7B%5Cleq%7D(%7B%5Cgeq%7D)%20Cr%5E%7B-t%7D"> for all <img src="https://latex.codecogs.com/png.latex?r%20%3E%5Cdelta"> and for some <img src="https://latex.codecogs.com/png.latex?C,t%20%3E0">, then <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(v)%0A%7B%5Cleq%7D(%7B%5Cgeq%7D)%20%5Cmathcal%7BO%7D(%7Cv%7C%5E%7B-t%7D),%5Cqquad%20%7Cv%7C%20%5Crightarrow%0A%5Cinfty.%0A"></p></li>
</ol>
</div>
<details>
<summary>
The proof is here.
</summary>
<p>For any <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(v)%20=%0A%5Cint_0%5E%5Cdelta%5Cfrac%7B1%7D%7B2%5Cpi%20r%7D%0A%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2r%5E2%7D%7D%5Cright)%0A%5Cpi_d(r)%5C,dr%20+%0A%5Cint_%5Cdelta%5E%5Cinfty%5Cfrac%7B1%7D%7B2%5Cpi%0Ar%7D%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2r%5E2%7D%7D%5Cright)%0A%5Cpi_d(r)%5C,dr%20=%20I_1%20+%20I_2.%0A"> Examining this splitting, we note that <img src="https://latex.codecogs.com/png.latex?I_1"> will control the behaviour of <img src="https://latex.codecogs.com/png.latex?%5Cpi(v)"> near zero, while <img src="https://latex.codecogs.com/png.latex?I_2"> will control the tails.</p>
<p>Assuming that <img src="https://latex.codecogs.com/png.latex?%5Cint_%5Cdelta%5E%5Cinfty%20r%5E%7B-1%7D%5Cpi_d(r)%5C,dr%20%3C%20%5Cinfty">, we can bound <img src="https://latex.codecogs.com/png.latex?I_2"> as <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B2%5Cpi%20%7D%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2%5Cdelta%5E2%7D%7D%5Cright)%0A%5Cint_%5Cdelta%5E%5Cinfty%20r%5E%7B-1%7D%5Cpi_d(r)%5C,dr%20%5Cleq%20I_2%20%5Cleq%20%5Cfrac%7B1%7D%7B2%5Cpi%7D%0A%5Cint_%5Cdelta%5E%5Cinfty%20r%5E%7B-1%7D%5Cpi_d(r)%5C,dr.%0A"></p>
<p>To prove part 1, let <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)%20%5Cleq%20Cr%5Et">, <img src="https://latex.codecogs.com/png.latex?r%20%5Cin%0A%5B0,%5Cdelta%5D"> for some <img src="https://latex.codecogs.com/png.latex?t%3E0">. Substituting this into <img src="https://latex.codecogs.com/png.latex?I_1"> and computing the resulting integral using Maple<sup>8</sup>, we get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AI_1%20&amp;%5Cleq%20-%20%5Cfrac%7BC%7D%7B2%5Cpi%20t%7D%5Cleft(%20%7B2%7D%5E%7B-1/2%5C,t%7D%7B%7Cv%7C%7D%5E%7Bt%7D%5CGamma%0A%5Cleft(%201-1/2%5C,t,1/2%5C,%7B%5Cfrac%20%7Bv%5E2%7D%7B%7B%5Cdelta%7D%5E%7B2%7D%7D%7D%20%5Cright)%20-%7B%7B%5Crm%0Ae%7D%5E%7B-1/2%5C,%7B%5Cfrac%20%7Bv%5E2%7D%7B%7B%5Cdelta%7D%5E%7B2%7D%7D%7D%7D%20%7D%7B%5Cdelta%7D%5E%7Bt%7D%0A%5Cright)%20=%20%5Cmathcal%7BO%7D(1),%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?%5CGamma(a,x)%20=%20%5Cint_x%5E%5Cinfty%0A%5Cexp%5Cleft(%7B-t%7D%5Cright)t%5E%7Ba-1%7D%5C,dt"> is the incomplete Gamma function.</p>
<p>To prove parts 2 and 3, we bound <img src="https://latex.codecogs.com/png.latex?I_1"> as follows. <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0A%5Cleft(%5Cinf_%7Br%5Cin%5B0,%5Cdelta%5D%7D%20%5Cpi_d(r)%0A%5Cright)%5Cint_0%5E%5Cdelta%5Cfrac%7B1%7D%7B2%5Cpi%0Ar%7D%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2r%5E2%7D%7D%5Cright)%20%5C,dr%20&amp;%5Cleq%20I_1%20%5Cleq%0A%5Cleft(%5Csup_%7Br%5Cin%5B0,%5Cdelta%5D%7D%20%5Cpi_d(r)%0A%5Cright)%5Cint_0%5E%5Cdelta%5Cfrac%7B1%7D%7B2%5Cpi%20r%7D%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2r%5E2%7D%7D%5Cright)%20%5C%5C%0A%5Cfrac%7B1%7D%7B4%5Cpi%7D%5Cleft(%5Cinf_%7Br%5Cin%5B0,%5Cdelta%5D%7D%20%5Cpi_d(r)%5Cright)%0A%5Ctext%7BE%7D_1%5Cleft(%5Cfrac%7Bv%5E2%7D%7B2%5Cdelta%5E2%7D%5Cright)%20&amp;%20%5Cleq%20I_1%20%5Cleq%0A%5Cfrac%7B1%7D%7B4%5Cpi%7D%5Cleft(%5Csup_%7Br%5Cin%5B0,%5Cdelta%5D%7D%20%5Cpi_d(r)%5Cright)%0A%5Ctext%7BE%7D_1%5Cleft(%5Cfrac%7Bv%5E2%7D%7B2%5Cdelta%5E2%7D%5Cright)%20%5C%5C%0A%5Cfrac%7B1%7D%7B8%5Cpi%7D%5Cleft(%5Cinf_%7Br%5Cin%5B0,%5Cdelta%5D%7D%20%5Cpi_d(r)%5Cright)%0A%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2%5Cdelta%5E2%7D%7D%5Cright)%5Clog%5Cleft(%201%20+%0A%5Cfrac%7B4%5Cdelta%5E2%7D%7Bv%5E2%7D%5Cright)%20&amp;%5Cleq%20I_1%0A%5Cleq%5Cfrac%7B1%7D%7B4%5Cpi%7D%5Cleft(%5Csup_%7Br%5Cin%5B0,%5Cdelta%5D%7D%20%5Cpi_d(r)%5Cright)%0A%5Cexp%5Cleft(%7B-%5Cfrac%7Bv%5E2%7D%7B2%5Cdelta%5E2%7D%7D%5Cright)%5Clog%5Cleft(%201%20+%0A%5Cfrac%7B2%5Cdelta%5E2%7D%7Bv%5E2%7D%5Cright),%0A%5Cend%7Balign*%7D"> where <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D_1(x)%20=%20%5Cint_1%5E%5Cinfty%20t%5E%7B-1%7D%5Cexp%5Cleft(%7B-tx%7D%5Cright)%5C,dt"> and the third line of inequalities follows using standard bounds in the exponential integral<sup>9</sup>.</p>
<p>Combining the lower and upper bounds, it follows that if <img src="https://latex.codecogs.com/png.latex?0%0A%3C%5Cinf_%7Br%5Cin%5B0,%5Cdelta%5D%7D%20%5Cpi_d(r)%20%5Cleq%20%5Csup_%7Br%5Cin%5B0,%5Cdelta%5D%7D%0A%5Cpi_d(r)%20%3C%20%5Cinfty">, then <img src="https://latex.codecogs.com/png.latex?%5Cpi(v)"> has a logarithmic spike near zero. Similarly, the lower bounds show that <img src="https://latex.codecogs.com/png.latex?%5Cpi(v)%20%5Cgeq%20C%0Av%5E%7B-2%7D%5Cexp%5Cleft(-%5Cfrac%7Bv%5E2%7D%7B2%5Cdelta%5E2%7D%5Cright)"> as <img src="https://latex.codecogs.com/png.latex?v%5Crightarrow%20%5Cinfty">.</p>
<p>To prove part 4, let <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)%20=%20Cr%5E%7B-t%7D">, <img src="https://latex.codecogs.com/png.latex?r%20%5Cin%20%5B0,%5Cdelta%5D"> for some <img src="https://latex.codecogs.com/png.latex?t%3E0">. Substituting this into <img src="https://latex.codecogs.com/png.latex?I_1"> and computing the resulting integral using Maple, we get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AI_1%20&amp;%20=%20%5Cfrac%7BC%7D%7B2%5Cpi%20t%7D%5Cleft(%20%7B%7Cv%7C%7D%5E%7B-t%7D%5CGamma%20%5Cleft(%0A1+1/2%5C,t,1/2%5C,%7B%5Cfrac%20%7Bv%5E2%7D%7B%7B%5Cdelta%7D%5E%7B2%7D%7D%7D%20%5Cright)%0A%7B2%7D%5E%7Bt/2%7D-%7B%5Cdelta%7D%5E%7B-t%7D%7B%7B%5Crm%20e%7D%5E%7B-1/2%5C,%7B%5Cfrac%0A%7Bv%5E2%7D%7B%7B%5Cdelta%7D%5E%7B2%7D%7D%7D%7D%7D%20%5Cright)%20%5Csim%0A%5Cmathcal%7BO%7D(v%5E%7B-t%7D)%0A%5Cend%7Balign*%7D"> as <img src="https://latex.codecogs.com/png.latex?v%20%5Crightarrow%200">. We note that <img src="https://latex.codecogs.com/png.latex?I_1%20=%0A%5Cmathcal%7BO%7D%5Cleft(%5Cexp%5Cleft(-v%5E2/(2%5Cdelta%5E2)%5Cright)%5Cright)"> as <img src="https://latex.codecogs.com/png.latex?%7Cv%7C%0A%5Crightarrow%20%5Cinfty">.</p>
<p>To prove part 5, let <img src="https://latex.codecogs.com/png.latex?%5Cpi_d(r)%20=%20Cr%5E%7B-t%7D">, <img src="https://latex.codecogs.com/png.latex?r%20%5Cin%0A(%5Cdelta,%5Cinfty)"> for some <img src="https://latex.codecogs.com/png.latex?t%3E0">. Substituting this into <img src="https://latex.codecogs.com/png.latex?I_2">, we get <img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign*%7D%0AI_2%20=%20%5Cfrac%7BC%7D%7B8%5Cpi%5E2%7D%5C,%7B2%7D%5E%7B1/2%5C,t%7D%7B%7Cv%7C%7D%5E%7B-t%7D%20%5Cleft(%20%5CGamma%0A%5Cleft(%201/2%5C,t%20%5Cright)%20-%20%5CGamma%20%5Cleft(%201/2%5C,t,1/2%5C,%7B%5Cfrac%0A%7B%7Bv%7D%5E%7B2%7D%7D%7B%7B%5Cdelta%7D%5E%7B2%7D%7D%7D%20%5Cright)%20%5Cright)%20=%0A%5Cmathcal%7BO%7D(%7Cv%7C%5E%7B-t%7D),%0A%5Cend%7Balign*%7D"> where we used the identity <img src="https://latex.codecogs.com/png.latex?%0A%5CGamma%20%5Cleft(%201/2%5C,t%20%5Cright)%20-%20%5CGamma%0A%5Cleft(%201/2%5C,t,1/2%5C,%7B%5Cfrac%20%7B%7Bv%7D%5E%7B2%7D%7D%7B%7B%5Cdelta%7D%5E%7B2%7D%7D%7D%20%5Cright)%0A%5Crightarrow%20%5CGamma%5Cleft(%201/2%5C,t%20%5Cright)%0A"> as <img src="https://latex.codecogs.com/png.latex?%7Cv%7C%5Crightarrow%0A%5Cinfty">.</p>
<strong>Done.</strong>
</details>
<p>All of this basically says the following:</p>
<ul>
<li><p>If the density of the prior on the standard deviation is finite at zero, then the implied prior on <img src="https://latex.codecogs.com/png.latex?%5Cbeta_j"> has a logarithmic spike at zero.</p></li>
<li><p>If the density of the prior on the standard deviation has a polynomial tail, then the implied prior on <img src="https://latex.codecogs.com/png.latex?%5Cbeta_j"> has the same polynomial tail.</p></li>
<li><p>Not in the result, but computed at the time: if the prior on the standard deviation is exponential, the prior on <img src="https://latex.codecogs.com/png.latex?%5Cbeta_j"> still has Gaussian-ish tails. (I couldn’t work out what happens in the hinterland between exponential tails and polynomial tails; I suspect at some point the tail on the standard deviation does eventually get heavy enough to be seen in the marginal, but I can’t tell you when.)</p></li>
</ul>
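<p>None of this is hard to check numerically. The sketch below is mine (not from the post), and it uses the standard normal density for the mixture. For a half-normal prior on the standard deviation, which is finite and positive at zero, the marginal is the product-of-normals density <img src="https://latex.codecogs.com/png.latex?K_0(%7Cv%7C)/%5Cpi">, and we can watch the logarithmic spike appear:</p>

```python
import numpy as np
from scipy import integrate, stats

def marginal_density(v, prior_sd_density):
    """pi(v) = int_0^inf N(v; 0, r^2) pi_d(r) dr, on a log-spaced grid."""
    r = np.logspace(-8, 2, 4001)
    integrand = stats.norm.pdf(v, loc=0.0, scale=r) * prior_sd_density(r)
    return integrate.trapezoid(integrand, r)

# half-normal prior on the sd: finite and positive at r = 0, so the
# theorem predicts a weak logarithmic spike in pi(v) at v = 0
half_normal = stats.halfnorm.pdf

vs = np.array([1e-1, 1e-2, 1e-3, 1e-4])
dens = np.array([marginal_density(v, half_normal) for v in vs])

ratios = dens / np.log(1 + 1 / vs ** 2)
print(dens)    # increases as v -> 0 (the spike)...
print(ratios)  # ...but only logarithmically: the ratio stays bounded
```
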
<p>With this sort of information, you can compute the equivalent of the bounds that I did on the Laplace prior for the general case (or, actually, for the case that will have at least a little bit of a chance, which is the monotonically decreasing priors on the standard deviation).</p>
<p>In particular, <a href="https://dansblog.netlify.app/posts/2021-12-08-the-king-must-die-repost/">if you run the argument from the last post</a>, you see that you need quite a heavy tail on the standard deviation prior to get a reasonable prior on the implied sparsity. For instance, <a href="https://arxiv.org/pdf/1403.4630v4.pdf">we showed</a> that, applying this reasoning to the horseshoe prior, where the prior on the local standard deviation is half-Cauchy, there is a choice of global scale that puts <em>a priori</em> weight on <img src="https://latex.codecogs.com/png.latex?p%5E%7B-1%7D">-sparse signals, while also letting you have a few very large <img src="https://latex.codecogs.com/png.latex?%5Cbeta_j">s.</p>
<section id="the-design-scaling-in-these-priors-links-directly-to-an-implied-decision-process" class="level3">
<h3 class="anchored" data-anchor-id="the-design-scaling-in-these-priors-links-directly-to-an-implied-decision-process">The design-scaling in these priors links directly to an implied decision process</h3>
<blockquote class="blockquote">
<p>You’d look better if your shadow didn’t follow you around, but it looks as though you’re tethered to the ground, just like every pound of flesh I’ve ever found. – <a href="https://www.youtube.com/watch?v=mIp4X7_cA3g">Franklin Bruno in a sourceless light</a>.</p>
</blockquote>
<p>For a very simple decision process (the deterministic threshold process described in the previous post), you can work out exactly how the threshold needs to interact with the prior. In particular, we can see that if we’re trying to detect a true signal that is exactly zero (no components are active), then we know that <img src="https://latex.codecogs.com/png.latex?%5C%7C%20%5Cmathbf%7BX%7D%20%5Cboldsymbol%7B%5Cbeta%7D%20%5C%7C%20=%200">. This is not possible for these scale-mixture models, but we can require that in this case all of the components are at most <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">, in which case <img src="https://latex.codecogs.com/png.latex?%0A%5C%7C%20%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D%20%5C%7C%20%5Cleq%20%5Cepsilon%20%5C%7C%20%5Cmathbf%7BX%7D%20%5C%7C,%0A"> which suggests we want <img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20%5Cll%20%5C%7C%20%5Cmathbf%7BX%7D%20%5C%7C_%5Cinfty%5E%7B-1%7D">. The calculation in the previous post shows that if we want this sort of almost zero signal to have any mass at all under the prior, we need to scale <img src="https://latex.codecogs.com/png.latex?%5Clambda"> using information about <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D">.</p>
<p>Of course, this is a very very simple decision process. I have absolutely no idea how to repeat these arguments for actually good decision processes, like the predictive loss minimization <a href="https://arxiv.org/abs/1707.01694">favoured by Aki</a>. But I’d still expect that we’d need to make sure there was a priori enough mass in the areas of the parameter space where the decision process is firmly one way or another (as well as <a href="https://statmodeling.stat.columbia.edu/2017/10/29/contour-as-a-verb/">mass in the indeterminate region</a>). I doubt that the Bayesian Lasso would magically start to work under these more complex losses.</p>
</section>
</section>
<section id="models-specified-through-their-full-conditionals" class="level2">
<h2 class="anchored" data-anchor-id="models-specified-through-their-full-conditionals">Models specified through their full conditionals</h2>
<blockquote class="blockquote">
<p>Why won’t you cheat with me? You and I both know that he’s done the same. – <a href="https://www.youtube.com/watch?v=Ozsc2AQqYKw">Franklin Bruno</a></p>
</blockquote>
<p>So we can view the design dependence of sparsity priors as preparation for the forthcoming decision process. (Those of you who just mentally broke into <a href="https://www.youtube.com/watch?v=c1SiaCV26aQ">Prepare Ye The Way Of The Lord</a> from Godspell, please come to the front of the class. You are my people.) Now let’s talk about a case where this isn’t true.</p>
<p>To do this, we need to cast our minds back to a time when people really did have the original cast recording of Godspell on their mind. In particular, we need to think about <a href="https://www.youtube.com/watch?v=pqoeM18vCaU">Julian Besag</a> (who I’m sure was really into musicals about Jesus. I have no information to the contrary, so I’m just going to assume it’s true.) who wrote a series of important papers, one in <a href="https://www.jstor.org/stable/2984812">1974</a> and one in <a href="https://www.jstor.org/stable/2987782">1975</a> (and several before and after, but I can’t be arsed linking to them all. We all have google.) about specifying models through conditional independence relations.</p>
<p>These models have a special place in time series modelling (where we all know about discrete-time Markovian processes) and in spatial statistics. In particular, generalisations of Besag’s (Gaussian) conditional autoregressive (CAR) models are <a href="https://arxiv.org/abs/1601.01180">widely used in spatial epidemiology</a>.</p>
<p>Mathematically, Gaussian CAR models (and more generally <a href="https://www.routledge.com/Gaussian-Markov-Random-Fields-Theory-and-Applications/Rue-Held/p/book/9781584884323">Gaussian Markov random fields</a> on graphs) are defined through their precision matrix, that is the inverse of the covariance matrix as <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7Bx%7D%20%5Csim%20N(%5Cmathbf%7B0%7D,%20%5Ctau%5E%7B-1%7D%5Cmathbf%7BQ%7D%5E%7B-1%7D).%0A"></p>
<p>For simple models, such as the popular CAR model, we assume <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BQ%7D"> is fixed, known, and sparse (i.e.&nbsp;it has a lot of zeros) and we typically interpret <img src="https://latex.codecogs.com/png.latex?%5Ctau"> to be the inverse of the variance of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D">.</p>
<p>This interpretation of <img src="https://latex.codecogs.com/png.latex?%5Ctau"> could not be more wrong.</p>
<p>Why? Well, let’s look at the marginal distribution <img src="https://latex.codecogs.com/png.latex?%0Ax_j%20%5Csim%20N%5Cleft(0,%20%5Ctau%5E%7B-1%7D%5BQ%5E%7B-1%7D%5D_%7Bjj%7D%5Cright).%0A"></p>
<p>To interpret <img src="https://latex.codecogs.com/png.latex?%5Ctau"> as the inverse variance, we need the diagonal elements of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BQ%7D%5E%7B-1%7D"> to all be around 1. <em>This is never the case.</em></p>
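<p>If you don’t believe me, compute one. The following sketch (mine; the chain graph and every constant in it are made up purely for illustration) builds a proper CAR-type precision matrix on a line graph with <img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%201"> and looks at the diagonal of its inverse:</p>

```python
import numpy as np

# A proper CAR-type precision on a chain graph: Q = D - rho * W + jitter * I.
# All constants here are illustrative, not a recommendation.
n, rho, jitter = 50, 0.99, 0.01
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # chain adjacency
D = np.diag(W.sum(axis=1))
Q = D - rho * W + jitter * np.eye(n)

# tau = 1, so if the folk interpretation were right these would all be about 1
marg_var = np.diag(np.linalg.inv(Q))
print(marg_var.min(), marg_var.max())  # nowhere near 1, and not constant
```
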
<p>A simple, mathematically tractable example is the first order random walk on a one-dimensional lattice, which can be written in terms of the increment process as <img src="https://latex.codecogs.com/png.latex?%0Ax_%7Bj+1%7D%20-%20x_j%20%5Csim%20N(0,%20%5Ctau%5E%7B-1%7D),%20%5Cqquad%20j%20=%201,%20%5Cldots%20J-1.%0A"></p>
<p>Conditioned on a particular starting point, this process looks a lot like a discrete version of Brownian motion as you move the lattice points closer together. This is a useful model for rough non-linear random effects, such as the baseline hazard rate in a Cox proportional hazard model. A long and detailed (and quite general) discussion of these models can be found in <a href="https://www.routledge.com/Gaussian-Markov-Random-Fields-Theory-and-Applications/Rue-Held/p/book/9781584884323">Rue and Held’s book</a>.</p>
<p>I am bringing this case up because you can actually work out the size of the diagonal of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BQ%7D%5E%7B-1%7D">. <a href="https://www.sciencedirect.com/science/article/abs/pii/S2211675313000407">Sørbye and Rue</a> talk about this in detail, but for this model maybe the easiest way to understand it is this: imagine we had a fixed lattice with <img src="https://latex.codecogs.com/png.latex?n"> points and had carefully worked out a sensible prior for <img src="https://latex.codecogs.com/png.latex?%5Ctau">. Now imagine that we get some new data and, instead of only <img src="https://latex.codecogs.com/png.latex?n"> points in the lattice, we have information at a finer scale, so the same interval is now covered by <img src="https://latex.codecogs.com/png.latex?nk"> equally spaced nodes. We model this with the new first order random walk prior <img src="https://latex.codecogs.com/png.latex?%0Ax'_%7Bj+1%7D%20-%20x'_j%20%5Csim%20N(0,%5B%5Ctau'%5D%5E%7B-1%7D).%0A"></p>
<p>It turns out that we can relate the inverse variances of these two increment processes as <img src="https://latex.codecogs.com/png.latex?%5Ctau'%20=%20k%20%5Ctau">: refining the lattice by a factor of <img src="https://latex.codecogs.com/png.latex?k"> means the increment precision has to grow by the same factor if the process is to keep the same marginal scale.</p>
<p>This strongly suggests that we should not use the same prior for <img src="https://latex.codecogs.com/png.latex?%5Ctau"> as we should for <img src="https://latex.codecogs.com/png.latex?%5Ctau'">, but that the prior should actually know about how many nodes there are on the lattice. Concrete suggestions are in the Sørbye and Rue paper linked above.</p>
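<p>A back-of-the-envelope version of this calculation, for an RW1 anchored at its first node (the anchoring is my simplification, to dodge the rank deficiency of the intrinsic model, and all constants are illustrative):</p>

```python
import numpy as np

def rw1_marginal_sd(n_nodes, tau):
    """Marginal sds of a first-order random walk on n_nodes lattice points,
    anchored at x_1 = 0: x_j is a sum of (j - 1) iid N(0, 1/tau) increments."""
    steps = np.arange(n_nodes)
    return np.sqrt(steps / tau)

tau, k = 1.0, 10
coarse = rw1_marginal_sd(10, tau)               # 10 nodes on the interval
fine_same_tau = rw1_marginal_sd(10 * k, tau)    # 100 nodes, same tau
fine_scaled = rw1_marginal_sd(10 * k, k * tau)  # 100 nodes, tau' = k * tau

# keeping tau fixed blows up the scale of the effect as the lattice refines;
# rescaling the precision keeps the endpoint sd essentially unchanged
print(coarse[-1], fine_same_tau[-1], fine_scaled[-1])
```
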
<section id="design-dependence-for-markov-random-fields" class="level3">
<h3 class="anchored" data-anchor-id="design-dependence-for-markov-random-fields">Design dependence for Markov random fields</h3>
<blockquote class="blockquote">
<p>Not to coin a phrase, but play it as it lays – <a href="https://open.spotify.com/track/5gs7YbjVEjjKMNNO219iJe?si=e20a167de87a4dc4">Franklin Bruno in Nothing Painted Blue</a></p>
</blockquote>
<p>This type of design dependence is a general problem for multivariate Gaussian models specified through their precision (so-called Gaussian Markov random fields). The critical thing here is that, unlike the sparsity case, the design dependence does not come from some type of decision process. It comes from the gap between the parameterisation (in terms of <img src="https://latex.codecogs.com/png.latex?%5Ctau"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BQ%7D">) and the elicitable quantity (the scale of the random effect).</p>
<p>This is kinda a general lesson. <em>When specifying multivariate priors, you must always check the implications of your prior on the one- and two-dimensional quantities of interest. Because weird things happen in multivariate land!</em></p>
</section>
</section>
<section id="gaussian-process-models" class="level2">
<h2 class="anchored" data-anchor-id="gaussian-process-models">Gaussian process models</h2>
<blockquote class="blockquote">
<p>And it’s not like we’re tearing down a house of more than gingerbread. It’s not like we’re calling down the wrath of heaven on our heads. – <a href="https://www.youtube.com/watch?v=dL-4ZQthJ5w">Jenny Toomey sings Franklin Bruno</a></p>
</blockquote>
<p>So the design dependence doesn’t necessarily come in preparation for some kind of decision; it can also arise because we have constructed (and therefore parameterised) our process in an inconvenient way. Let’s see if we can knock out another one before my bottle of wine dies.</p>
<p><a href="https://dansblog.netlify.app/posts/2021-11-03-yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness/">Gaussian processes</a>, the least exciting tool in the machine learner’s toolbox, are another example where your priors need to be design dependent. It will probably surprise you not a single sausage that in this case the need for design dependence comes from a completely different place.</p>
<p>For simplicity let’s consider a Gaussian process <img src="https://latex.codecogs.com/png.latex?f(t)"> in one dimension with isotropic covariance function <img src="https://latex.codecogs.com/png.latex?%0Ac(s,t)%20=%5Csigma%5E2%20(%5Ckappa%7Cs-t%7C)%5E%5Cnu%20K_%5Cnu(%5Ckappa%7Cs-t%7C).%0A"></p>
<p>This is the commonly encountered Whittle-Matérn family of covariance functions. The distinguished members are the exponential covariance function when <img src="https://latex.codecogs.com/png.latex?%5Cnu%20=%200.5"> and the squared exponential function <img src="https://latex.codecogs.com/png.latex?%0Ac(s,t)=%20%5Csigma%5E2%5Cexp%5Cleft(-%5Ckappa%20%7Cs-t%7C%5E2%20%5Cright),%0A"></p>
<p>which is the limit as <img src="https://latex.codecogs.com/png.latex?%5Cnu%20%5Crightarrow%20%5Cinfty">.</p>
<p>One of the inconvenient features of Matérn models in 1-3 dimensions is that it is impossible to consistently recover all of the parameters by simply observing more and more of the random effect on a fixed interval. You need to see new replicates in order to properly pin these down<sup>10</sup>.</p>
<p>So one might expect that this non-identifiability would be the source of some problems.</p>
<p>One would be wrong.</p>
<p>The squared exponential covariance function does not have this pathology, but it’s still very very hard to fit. Why? Well, the problem is that you can interpret <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> as an inverse-range parameter. Roughly, the interpretation is that if <img src="https://latex.codecogs.com/png.latex?%0A%7Cs%20-%20t%20%7C%20%3E%20%5Cfrac%7B%20%5Csqrt%7B%208%20%5Cnu%20%7D%20%7D%7B%5Ckappa%7D%0A"> then the value of <img src="https://latex.codecogs.com/png.latex?f(s)"> is approximately independent of the value of <img src="https://latex.codecogs.com/png.latex?f(t)">.</p>
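<p>This “practical range” heuristic is easy to check (my sketch, using SciPy’s modified Bessel function and the Matérn correlation normalised so that it is one at zero distance, which is an illustrative convention rather than anything from the post):</p>

```python
import numpy as np
from scipy.special import kv, gamma

def matern_corr(r, kappa, nu):
    """Matern correlation, normalised to tend to 1 as r -> 0:
    2^(1 - nu) / Gamma(nu) * (kappa r)^nu * K_nu(kappa r)."""
    x = kappa * np.asarray(r, dtype=float)
    return 2.0 ** (1.0 - nu) / gamma(nu) * x ** nu * kv(nu, x)

kappa = 2.0  # made-up inverse-range parameter
for nu in (0.5, 1.5, 2.5):
    r_practical = np.sqrt(8 * nu) / kappa
    # the correlation at the practical range is small (about 0.13-0.14)
    print(nu, matern_corr(r_practical, kappa, nu))
```
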
<p>This means that a fixed data set provides no information about <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> in large parts of the parameter space. In particular if <img src="https://latex.codecogs.com/png.latex?%5Ckappa%5E%7B-1%7D"> is bigger than the range of the measurement locations, then the data has almost no information about the parameter.</p>
<p>Similarly, if <img src="https://latex.codecogs.com/png.latex?%5Ckappa%5E%7B-1%7D"> is smaller than the smallest distance between two data points (or for irregular data, this should be something like “smaller than some low quantile of the set of distances between points”), then the data will have nothing to say about the parameter.</p>
<p>Of these two scenarios, it turns out that the inference is much less sensitive to the prior on small values of <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> (ie ranges longer than the data) than it is to the prior on large values of <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> (ie ranges shorter than the data).</p>
<p>Currently, we have two recommendations: one based around <a href="https://arxiv.org/abs/1503.00256">PC priors</a> and a very similar one based around <a href="https://mc-stan.org/docs/2_28/stan-users-guide/fit-gp.html#priors-gp.section">inverse gamma priors</a>. But both of these require you to specify the design-dependent quantity of a “minimum length scale we expect this data set to be informative about”.</p>
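<p>For the inverse gamma version, the tuning amounts to solving for the two parameters that put a given (small) amount of prior mass below the minimum informative length scale and above the maximum one. The sketch below (mine, with made-up bounds standing in for the design-derived ones) does this with a root finder:</p>

```python
import numpy as np
from scipy import optimize, stats

# Illustrative design-derived bounds: shortest and longest length scales
# the data could plausibly inform (e.g. minimum spacing and domain width),
# plus the tail mass we allow outside them. All three are assumptions.
l_min, l_max, tail = 0.1, 10.0, 0.01

def tail_mismatch(log_params):
    a, b = np.exp(log_params)  # work on the log scale to keep both positive
    dist = stats.invgamma(a, scale=b)
    return [dist.cdf(l_min) - tail, dist.sf(l_max) - tail]

a, b = np.exp(optimize.fsolve(tail_mismatch, x0=np.log([2.0, 1.0])))
dist = stats.invgamma(a, scale=b)
print(a, b, dist.cdf(l_min), dist.sf(l_max))  # both tails close to 1%
```
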
<section id="design-for-gaussian-processes-id-say-designing-women-but-im-aware-of-the-demographics" class="level3">
<h3 class="anchored" data-anchor-id="design-for-gaussian-processes-id-say-designing-women-but-im-aware-of-the-demographics">Design for Gaussian processes (I’d say “Designing Women”, but I’m aware of the demographics)</h3>
<blockquote class="blockquote">
<p>I’m a disaster, you’re a disaster, we’re a disaster area. – Franklin Bruno in The Human Hearts (featuring alto extraordinaire and cabaret god Ms Molly Pope)</p>
</blockquote>
<p>So in this final example we hit our ultimate goal. A case where design dependent priors are needed not because of a hacky decision process, or an awkward multivariate specification, but due to the limits of the data. In this case, priors that do not recognise the limitation of the design of the experiment will lead to poorly behaving posteriors. This manifests as the Gaussian processes severely over-fitting the data.</p>
<p>This is the ultimate expression of the point that we tried to make in the Entropy paper: <a href="http://www.stat.columbia.edu/~gelman/research/published/entropy-19-00555-v2.pdf">The prior can often only be understood in the context of the likelihood</a>.</p>
</section>
</section>
<section id="principles-can-only-get-you-so-far" class="level2">
<h2 class="anchored" data-anchor-id="principles-can-only-get-you-so-far">Principles can only get you so far</h2>
<blockquote class="blockquote">
<p>I’m making scenes, you’re constructing dioramas – <a href="https://open.spotify.com/track/4ZsjitFg4P22jukvBCSxO8?si=4c7b551f72054d92">Franklin Bruno in Nothing Painted Blue</a></p>
</blockquote>
<p>Just to round this off, I guess I should mention that the strong likelihood principle really does suggest that certain details of the design are not relevant to a fully Bayesian analysis. In particular, if the design only pops up in the normalising constant of the likelihood, it should not be relevant to a Bayesian. This seems at odds with everything I’ve said so far.</p>
<p>But it’s not.</p>
<p>In each of these cases, the design was only invoked in order to deal with some external information. For sparsity, design was needed to properly infer a sparse signal and came in through the structure of the decision process.</p>
<p>For the CAR models, the external information was that the elicitable quantity was the marginal standard deviation, which was a complicated function of the design and the standard parameter.</p>
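<p>To make the CAR point concrete, here is a minimal sketch of that dependence, following the Sørbye–Rue recipe of scaling an intrinsic CAR precision matrix so the marginal variances have geometric mean one. The 4-cycle graph is purely illustrative; swap in your own adjacency structure.</p>

```python
import numpy as np

# Illustrative graph (a 4-cycle): adjacency matrix.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# Intrinsic CAR precision: Q = D - A. It is singular, so marginal
# variances (under the sum-to-zero constraint) come from the diagonal
# of the Moore-Penrose pseudo-inverse.
Q = np.diag(A.sum(axis=1)) - A
marg_var = np.diag(np.linalg.pinv(Q))

# The elicitable quantity is the *marginal* standard deviation, which
# depends on the graph (the "design"). Rescaling Q so the geometric
# mean of the marginal variances is 1 makes a prior on the scale
# parameter mean the same thing on every graph.
scale = np.exp(np.mean(np.log(marg_var)))
Q_scaled = scale * Q

# Geometric mean of the scaled marginal variances is now 1.
print(np.exp(np.mean(np.log(np.diag(np.linalg.pinv(Q_scaled))))))  # approx 1.0
```

<p>The point of the sketch is that <code>marg_var</code> is a complicated function of the graph, so any prior stated directly on the precision scale parameter means something different for every design unless you do this scaling.</p>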
<p>For Gaussian processes, the same thing happened: the implicit decision criterion was that we wanted to make good predictions. The design told us which parts of the parameter space obstructed this goal, and a well specified prior removed the problem.</p>
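<p>And a minimal sketch of the Gaussian process case, using the PC prior on a Matérn-type range parameter: the data carry essentially no information about ranges below the smallest gap in the design, so the prior is chosen to put only a small probability <code>alpha</code> below that length scale. The design points and <code>alpha</code> here are illustrative.</p>

```python
import numpy as np

# Illustrative 1-d design: irregular observation locations.
x = np.sort(np.array([0.0, 0.05, 0.3, 0.45, 0.9, 1.0]))

# The smallest gap between design points: below this length scale
# the data cannot distinguish range values.
rho0 = np.min(np.diff(x))

# PC prior for the range: pi(rho) = (lam / rho**2) * exp(-lam / rho),
# with lam chosen so that P(rho < rho0) = alpha,
# i.e. lam = -rho0 * log(alpha).
alpha = 0.05
lam = -rho0 * np.log(alpha)

def pc_range_prior(rho):
    return (lam / rho**2) * np.exp(-lam / rho)

# Sanity check: the prior CDF at rho0 is exp(-lam / rho0) = alpha.
print(np.exp(-lam / rho0))  # approx 0.05
```

<p>This is exactly a prior that "removes the problem": the posterior is pushed away from the part of parameter space (tiny ranges) where the design makes the likelihood flat and over-fitting lives.</p>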
<p>There are also any number of cases in real practice where the decision at hand is stochastically dependent on the data gathering mechanism. This is why things like MRP exist.</p>
<p>I guess this is the tl;dr version of this post (because apparently I’m too wordy for some people. I suggest they read other things. Of course suggesting this in the final paragraph of such a wordy post is very me.):</p>
<p><em>Design matters even if you’re Bayesian. Especially if you want to do something with your posterior that’s more exciting than just sitting on it.</em></p>
<p><strong>Edited from an <a href="https://statmodeling.stat.columbia.edu/2017/11/05/why-wont-you-cheat-with-me/">original blog, posted November 2017</a>.</strong></p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Imagine it’s November 2017.↩︎</p></li>
<li id="fn2"><p>Again, 2017.↩︎</p></li>
<li id="fn3"><p>2021: I am still sad War Paint closed.↩︎</p></li>
<li id="fn4"><p>2017↩︎</p></li>
<li id="fn5"><p>I eventually did coincide with a Human Hearts concert and, to my extreme joy, Jenny Toomey did two songs with the band! They were supporting <a href="https://www.youtube.com/watch?v=oBwd4rAr3Rc">Gramercy Arms</a>, who I’d never heard before that night but have several perfect albums.↩︎</p></li>
<li id="fn6"><p>This is like the standard deviation we’d use in an iid normal prior for a non-sparse model.↩︎</p></li>
<li id="fn7"><p>2021: I did indeed type it all again. And a proof. Because why bother if you’re not going to do it well.↩︎</p></li>
<li id="fn8"><p>Yes. No open source for me!↩︎</p></li>
<li id="fn9"><p>Abramowitz, M. and Stegun, I. (1972). Handbook of Mathematical Functions. Formula 5.1.20↩︎</p></li>
<li id="fn10"><p>There’s a recent paper (2021) in JRSSB that says that these parameters are identifiable under infill asymptotics with a “nugget”, which is equivalent to observing with iid noise that magically stays independent as you observe locations closer and closer together. I will let you judge how relevant this case is to your practice. But regardless, for a <em>finite</em> set of data under any reasonable likelihood, you hit these identifiability problems. And in my personal experience, they persist even with a decent number of sites.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{simpson2021,
  author = {Simpson, Dan},
  title = {Why Won’t You Cheat with Me? {(Repost)}},
  date = {2021-12-09},
  url = {https://dansblog.netlify.app/2021-12-08-why-wont-you-cheat-with-me-repost/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-simpson2021" class="csl-entry quarto-appendix-citeas">
Simpson, Dan. 2021. <span>“Why Won’t You Cheat with Me?
(Repost).”</span> December 9, 2021. <a href="https://dansblog.netlify.app/2021-12-08-why-wont-you-cheat-with-me-repost/">https://dansblog.netlify.app/2021-12-08-why-wont-you-cheat-with-me-repost/</a>.
</div></div></section></div> ]]></description>
  <category>Prior distributions</category>
  <category>Fundamentals</category>
  <category>Design dependence</category>
  <guid>https://dansblog.netlify.app/posts/2021-12-09-why-wont-you-cheat-with-me-repost/why-wont-you-cheat-with-me-repost.html</guid>
  <pubDate>Wed, 08 Dec 2021 14:00:00 GMT</pubDate>
  <media:content url="https://dansblog.netlify.app/posts/2021-12-09-why-wont-you-cheat-with-me-repost/sylvia2.JPG" medium="image"/>
</item>
</channel>
</rss>
