Back in the early days of the pandemic I though “I’ll have a pandemic project”. I never did my pandemic project.

But I did think briefly about what it would be. I want to get the types of models I like to use in everyday life efficiently implemented inside Stan. These models encapsulate (generalised) linear mixed models^{1}, (generalised) additive models, Markovian spatial models^{2}, and other models. A good description of the types of models I’m talking about can be found here.

Many of these models can be solved efficiently via INLA^{3}, a great R package for fast posterior inference for an extremely useful set of Bayesian models. In focussing on a particular class of Bayesian models, INLA leverages a bunch of structural features to make a very very fast and accurate posterior approximation. I love this stuff. It’s where I started my stats career.

None of the popular MCMC packages really implement the lessons learnt from INLA to help speed up their inference. I want to change that.

The closest we’ve gotten so far is the nice work Charles Margossian has been doing to get Laplace approximations into Stan.

But I want to focus on the other key tool in INLA: *using sparse linear algebra to make things fast and scalable*.

I usually work with Stan, but the scale of the C++ coding^{4} required to even tell if these ideas are useful in Stan was honestly just too intimidating.

But the other day I remembered Python. Now I am a shit Python programmer^{5} and I’m not fully convinced I ever achieved object permanence. So it took me a while to remember it existed. But eventually I realised that I could probably make a decent prototype^{6} of this idea using some modern Python tools (specifically JAX). I checked with some PyMC devs and they pointed me at what the appropriate bindings would look like.

So I decided to go for it.

Of course, I’m pretty busy and these sort of projects have a way of dying in the arse. So I’m motivating myself by blogging it. I do not know if these ideas will work^{7}. I do not know if my coding skills are up to it^{8}. I do not know if I will lose interest. But it should be fun to find out.

So today I’m going to do the easiest part: I’m going to scope out the project. Read on, MacDuff.

## A generalised linear mixed effects-ish model

If you were to open the correct textbook, or the Bates, Mächler, Boler, and Walker 2015 masterpiece paper that describes the workings of `lme4`

, you will see the linear mixed model written as \[
y = X\beta + Zb + \epsilon,
\] where

- the columns of \(X\) contain the covariates
^{9}, - \(\beta\) is a vector of unknown regression coefficients,
- \(Z\) is a known matrix that describes the random effects (basically which observation is linked to which random effect),
- \(b \sim N(0, \Sigma_b)\) is the vector of random effects with some unknown covariance matrix \(\Sigma_b\),
- and \(\epsilon \sim N(0 ,\sigma^2 W)\) is the observation noise (here \(W\) is a known diagonal matrix
^{10}).

But unlike Doug Bates and his friends, my aim is to do Bayesian computation. In this situation, \(\beta\) *also* has a prior on it! In fact, I’m going to put a Gaussian prior \(\beta \sim N(0, R)\) on it, for some typically known^{11} matrix \(R\).

This means that I can treat \(\beta\) and \(b\) the same^{12} way! And I’m going to do just that. I’m going to put them together into a vector \(u = (\beta^T, b^T)^T\). Because the prior on \(u\) is Gaussian^{13}, I’m sometimes going to call \(u\) the *Gaussian component* or even the *latent*^{14} Gaussian component.

Now that I’ve smooshed my fixed and random effects together, I don’t really need to keep \(X\) and \(Z\) separate. So I’m going push them together into a rectangular matrix \[ A = [X \vdots Z]. \]

This allows us to re-write the model as \[\begin{align*} y \mid u, \sigma & \sim N(A u, \sigma^2 W)\\ u \mid \theta &\sim N(0, Q(\theta)^{-1}). \end{align*}\]

*What the hell is \(Q(\theta)\) and why are we suddenly parameterising a multivariate normal distribution by the inverse of its covariance matrix (which, if you’re curious, is known as a precision matrix)???*

I will take your questions in reverse order.

We are parameterising by the precision^{15} matrix because it will simplify our formulas and lead to faster computations. This will be a major topic for us later!

As to what \(Q(\theta)\) is, it is the matrix \[
Q(\theta) = \begin{pmatrix} \Sigma_b^{-1} & 0 \\ 0 & R^{-1}\end{pmatrix}
\] and \(\theta = (\sigma, \Sigma_b)\) is the collection of all^{16} non-Gaussian parameters in the model. Later, we will assume^{17} that \(\Sigma_b\) has quite a lot of structure.

This is a *very* generic model. It happily contains things like

- Linear regression!
- Linear regression with horseshoe priors!
- Linear mixed effects models!
- Linear regression with splines (smoothing or basis)!
- Spatial models like ICARs, BYMs, etc etc etc
- Gaussian processes (with the caveat that we’re mostly focussing on those that can be formulated via precision matrices rather than covariance matrices. A whole blog post, I have.)
- Any combination of these things!

So if I manage to get this implemented efficiently, all of these models will become efficient too. All it will cost is a truly shithouse^{18} interface.

The only downside of this degree of flexibility compared to just implementing a straight linear mixed model with \(X\) and \(Z\) and \(\beta\) and \(b\) all living separately is that there are a couple of tricks^{19} to improve numerical stability that we can’t use.

## Let’s get the posterior!

The nice thing about thing about this model is that it is a normal likelihood with a normal prior, so we can directly compute two key quantities:

The “full conditional” distribution \(p(u \mid y, \theta)\), which is useful for getting posterior information about \(b\) and \(\beta\), and

The marginal posterior \(p(\theta \mid y)\).

This means that we do not need to do MCMC on the joint space \((u, \theta)\)! We can instead write a model to draw samples from \(p(\theta \mid y)\), which is much lower-dimensional and easier^{20} to sample from, and then compute the joint posterior by sampling from the full conditional.

I talked a little about the mechanics of this in a previous blog post about conjugate priors, but let’s do the derivations. Why? Because they’re not too hard and it’s useful to have them written out somewhere.

### The full conditional

First we need to compute \(p(u \mid y , \theta)\). The first thing that we note is that conditional distributions are always proportional to the joint distribution (we’re literally just pretending some things are constant), so we get \[\begin{align*} p(u \mid y , \theta) &\propto p(y \mid u, \theta) p(u \mid \theta) p(\theta) \\ &\propto \exp\left[-\frac{1}{2\sigma^2} (y - Au)^TW^{-1}(y-Au)\right]\exp\left[-\frac{1}{2}u^TQ(\theta)u\right]. \end{align*}\]

Now we just need to expand things out and work out what the mean and the precision matrix of \(p(u \mid y, \theta )\) (which is Gaussian by conjugacy!) are.

Computing posterior distributions by hand is a dying^{21} art. So my best and only advice to you: don’t be a hero. Just pattern match like the rest of us. To do this, we need to know what the density of a multivarite normal distribution looks like *deep* down in its soul.

Behold: the ugly `div`

box!^{22}

If \(u \sim N(m, P^{-1})\), then \[\begin{align*} p(u) &\propto \exp\left[- \frac{1}{2}(u - m)^TP(u-m)\right] \\ &\propto \exp\left[- \frac{1}{2}u^TPu + m^TPu\right], \end{align*}\] where I just dropped all of the terms that didn’t involve \(u\).

This means the plan is to

- Expand out the quadratics in the exponential term so we get something that looks like \(\exp\left[-\frac{1}{2}u^TPu + z^Tu\right]\)
- The matrix \(P\) will be the precision matrix of \(u \mid y, \theta\).
- The mean of \(\mu \mid y, \theta\) is \(P^{-1}z\).

So let’s do it!

\[\begin{align*} p(u \mid y , \theta) &\propto \exp\left[-\frac{1}{2\sigma^2} u^TA^TW^{-1}Au + \frac{1}{\sigma^2}(A^TW^{-1}y)^Tu\right]\exp\left[-\frac{1}{2}u^TQ(\theta)u\right] \\ &\propto \exp\left[-\frac{1}{2}u^T\left(Q + \frac{1}{\sigma^2}A^TW^{-1}A\right)u + \frac{1}{\sigma^2}(A^TW^{-1}y)^Tu\right]. \end{align*}\]

This means that \(p(u \mid y ,\theta)\) is multivariate normal with

precision matrix \(Q_{u\mid y,\theta}(\theta) = \left(Q(\theta) + \frac{1}{\sigma^2}A^TW^{-1}A\right)\) and

mean

^{23}\(\mu_{u\mid y,\theta}(\theta) = \frac{1}{\sigma^2} Q_{u\mid y,\theta}(\theta)^{-1} A^TW^{-1}y\).

This means if I build an MCMC scheme to give me \(B\) samples \(\theta_b \sim p(\theta \mid y)\), \(b = 1, \ldots, B\), then I can turn them into \(B\) samples \((\theta_b, u_b)\) from \(p(\theta, u \mid y)\) by doing the following.

For \(b = 1, \ldots, B\)

Simulate \(u_b \sim N\left(\mu_{u\mid y,\theta}(\theta_b), Q_{u\mid y,\theta}(\theta_b)^{-1}\right)\)

Store the pair \((\theta_b, u_b)\)

Easy^{24} as!

### Writing down \(p(\theta \mid y)\)

So now we just^{25} have to get the marginal posterior for the non-Gaussian parameters \(\theta\). We only need it up to a constant of proportionality, so we can express the joint probability \(p(y, u, \theta)\) in two equivalent ways to get \[\begin{align*}
p(y, u , \theta) &= p(y, u, \theta) \\
p(u \mid \theta, y) p(\theta \mid y) p(y) &= p(y \mid u, \theta) p(u \mid \theta)p(\theta). \\
\end{align*}\]

Rearranging, we get \[\begin{align*} p(\theta \mid y) &= \frac{p(y \mid u, \theta) p(u \mid \theta)p(\theta)}{p(u \mid \theta, y)p(y)} \\ &\propto \frac{p(y \mid u, \theta) p(u \mid \theta)p(\theta)}{p(u \mid \theta, y)}. \end{align*}\]

This is a very nice relationship between the functional forms of the various densities we happen to know and the density we are trying to compute. This means that if you have access to the full conditional distribution^{26} for \(u\) you can marginalise \(u\) out. No weird integrals required.

But there’s one oddity: there is a \(u\) on the right hand side, but no \(u\) on the left hand side. What we have actually found is a whole continuum of functions that are proportional to \(p(\theta \mid y)\). It truly does not matter which one we choose.

But some choices make the algebra slightly nicer. (And remember, I’m gonna have to implement this later, so I should probably keep and eye on that.)

A good^{27} generic choice is \(u = \mu_{u\mid y, \theta}(\theta)\).

The algebra here can be a bit tricky^{28}, so let’s write out each function evaluated at \(u = \mu_{u\mid y, \theta}(\theta)\).

The bit from the likelihood is \[\begin{align*} p(y \mid u = \mu_{u\mid y, \theta}(\theta), \theta) &\propto \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}(y - A\mu_{u\mid y, \theta}(\theta))^TW^{-1}(y- A\mu_{u\mid y, \theta}(\theta))\right]\\ &\propto \sigma^{-n}\exp\left[\frac{-1}{2\sigma^2} \mu_{u\mid y, \theta}(\theta)^TA^TW^{-1}A\mu_{u\mid y, \theta}(\theta) + \frac{1}{\sigma^2} y^T W^{-1}A \mu_{u\mid y, \theta}(\theta)\right], \end{align*}\] where \(n\) is the length of \(y\).

The bit from the prior on \(u\) is \[\begin{align*} p(\mu_{u\mid y, \theta}(\theta) \mid \theta ) \propto |Q(\theta)|^{1/2}\exp\left[-\frac{1}{2} \mu_{u\mid y, \theta}(\theta)^TQ(\theta)\mu_{u\mid y, \theta}(\theta)\right]. \end{align*}\]

Finally, we get that the denominator is \[
p(\mu_{u\mid y, \theta}(\theta) \mid y, \theta) \propto |Q_{u\mid y, \theta}(\theta)|^{1/2}
\] as the exponential term^{29} cancels!

Ok. Let’s finish this. (Incidentally, if you’re wondering why Bayesians love MCMC, this is why.)

\[\begin{align*} p(\theta \mid y) &\propto p(\theta) \frac{|Q(\theta)|}{|Q_{u\mid y, \theta}(\theta)|} \exp\left[-\frac{1}{2} \mu_{u\mid y, \theta}(\theta)^T(Q(\theta) + \frac{1}{\sigma^2}A^TW^{-1}A)\mu_{u\mid y, \theta}(\theta) + \frac{1}{\sigma^2} y^T W^{-1}A \mu_{u\mid y, \theta}(\theta)\right] \\ &= p(\theta) \frac{|Q(\theta)|}{|Q_{u\mid y, \theta}(\theta)|} \exp\left[-\frac{1}{2} \mu_{u\mid y, \theta}(\theta)^TQ_{u\mid y, \theta}(\theta)\mu_{u\mid y, \theta}(\theta) + \frac{1}{\sigma^2} y^T W^{-1}A \mu_{u\mid y, \theta}(\theta)\right]. \end{align*}\]

We can now use the fact that \(Q_{u\mid y, \theta}(\theta)\mu_{u\mid y, \theta}(\theta) = A^TW^{-1}y\) to get

\[\begin{align*} p(\theta \mid y) &\propto p(\theta) \frac{|Q(\theta)|}{|Q_{u\mid y, \theta}(\theta)|} \exp\left[-\frac{1}{2} \mu_{u\mid y, \theta}(\theta)^TA^TW^{-1}y + \frac{1}{\sigma^2} y^T W^{-1}A \mu_{u\mid y, \theta}(\theta)\right] \\ &=\frac{|Q(\theta)|}{|Q_{u\mid y, \theta}(\theta)|} \exp\left[\frac{1}{2} \mu_{u\mid y, \theta}(\theta)^TA^TW^{-1}y \right] . \end{align*}\]

For those who just love a log-density, this is \[
\log(p(\theta \mid y)) = \frac{1}{2} \mu_{u\mid y, \theta}(\theta)^TA^TW^{-1}y +\frac{1}{2} \log(|Q(\theta)|) - \frac{1}{2}\log(|Q_{u\mid y, \theta}(\theta)|).
\] A fairly simple expression^{30} for all of that work.

## So why isn’t this just a Gaussian process?

These days, people^{31} are more than passingly familiar^{32} with Gaussian processes. And so they’re quite possibly wondering why this isn’t all just an extremely inconvenient way to do the exact same computations you do with a GP.

Let me tell you. It is *all* about \(Q(\theta)\) and \(A\).

The prior precision matrix \(Q(\theta)\) is typically block diagonal. This special structure makes it pretty easy to compute the \(|Q(\theta)|\) term^{33}. But, of course, there’s more going on here.

In linear mixed effects models, these blocks on the diagonal matrix are typically fairly small (their size is controlled by the number of levels in the variable you’re stratifying by). Moreover, the matrices on the diagonal of \(Q(\theta)\) are the inverses of either diagonal or block diagonal matrices that themselves have quite small blocks^{34}.

In models that have more structured random effects^{35}, the diagonal blocks of \(Q(\theta)\) can get quite large^{36}. Moreover, the matrices on these blocks are usually not block diagonal.

Thankfully, these prior precision matrices do have something going for them: most of their entries are zero. We refer to these types of matrices as *sparse matrices*. There are some marvelous algorithms for factorising sparse matrices that are usually a lot more efficient^{37} than algorithms for dense matrices.

Moreover, the formulation here decouples the dimension of the latent Gaussian component from the number of observations. The data only enters the posterior through the reduction \(A^Ty\), so if the number of observations is much larger than the number of latent variables^{38} and \(A\) is sparse^{39}, the operation scales *linearly* in the number of observations (and obviously superlinearly^{40} in the row-dimension of \(A\)).

So the prior precision^{41} is a sparse matrix. What about the precision matrix of \([u \mid y, \theta]\)?

It is also sparse! Recall that \(A = [Z \vdots X]\). This means that \[
\frac{1}{\sigma^2}A^TW^{-1}A = \frac{1}{\sigma^2}\begin{pmatrix} Z^T W^{-1}Z & Z^T W^{-1}X \\ X^T W^{-1} Z & X^TW^{-1}X \end{pmatrix}.
\] \(Z\) is a matrix that links the stacked vector of random effects \(b\) to each observation. Typically, the likelihood \(p(y_i \mid \theta)\) will only depend on a small number of entries of \(b\), which suggests that most elements in each row of \(Z\) will be zero. This, in turn, implies that \(Z\) is sparse and so is^{42} \(Z^TW^{-1}Z\).

On the other hand, the other three blocks are usually^{43} fully dense. Thankfully, though, the usual situation is that \(b\) has *far* more elements that \(\beta\), which means that \(A^TW^{-1}A\) is still sparse and we can still use our special algorithms^{44}

All of this suggests that, under usual operating conditions, \(Q_{u\mid y, \theta}\) is *also* a sparse matrix.

And that’s *great* because that means that we can compute the log-posterior using only 3 main operations:

Computing \(\log(|Q(\theta)|)\). This matrix is block diagonal so you can just multiply together the determinants

^{45}of the diagonal blocks, which are relatively cheap to compute.Computing \(\mu_{u \mid y, \theta}(\theta)\). This requires solving the sparse linear system \(Q_{u \mid y, \theta} \mu_{u \mid y, \theta} = \frac{1}{\sigma^2}A^TW^{-1}y\). This is going to rely on some fancy pants sparse matrix algorithm.

Computing \(\log(|Q_{u \mid y, \theta}(\theta)|)\). This is, thankfully, a by-product of the things we need to compute to solve the linear system in the previous task.

## What I? What I? What I gotta do? What I gotta do to get this model in PyMC?

So this is where shit gets real.

Essentially, I want to implement a new distribution in PyMC that will take approprite inputs and output the log-density and its gradient. There are two ways to do this:

- Panic
- Pray

For the first option, you write a C++^{46} backend and register it as an Aesara node. This is how, for example, differential equation solvers migrated into PyMC.

For the second option, which is going to be our goal, we light our Sinead O’Connor votive candle and program up the model using JAX. JAX is a glorious feat of engineering that makes compilable and autodiff-able Python code. In a lot of cases, it seamlessly lets you shift from CPUs to GPUs and is all around quite cool.

It also has approximately zero useful sparse matrix support. (It will let you do *very* basic things^{47} but nothing as complicated as we are going to need.)

So why am I taking this route? Well firstly I’m curious to see how well it works. So I am going to write JAX code to do all of my sparse matrix operations and see how efficiently it autodiffs it.

Now I’m going to pre-register my expectations. I expect it to be a little bit shit. Or, at least, I expect to be able to make it do better.

The problem is that computing a gradient requires a single reverse-mode^{48} autodiff sweep. This does not seem like a problem until you look at how this sort of thing needs to be implemented and you realise that every gradient call is going to need to generate *and store* the entire damn autodiff tree for the log-density evaluation. And that autodiff tree is going to be *large*. So I am expecting the memory scaling on this to be truly shite.

Thankfully there are two ways to fix this. One of them is to implement a custom *Jacobian-vector product*^{49} and register it with JAX so it knows *most* of how to do the derivative. The other way is to implement this shit in C++ and register it as a JAX primitive. And to be honest I’m very tempted. But that is not where I am starting.

The other problem is going to be exposing this to users. The internal interface is going to be an absolute shit to use. So we are gonna have to get our Def Leppard on and sprinkle some syntactical sugar all over it.

I’m honestly less concerned about this challenge. It’s important but I am not expecting to produce anything good enough to put into PyMC (or any other package). But I do think it’s a good idea to keep this sort of question in mind: it can help you make cleaner, more useful code.

### What comes next?

Well you will not get a solution today. This blog post is more than long enough.

My plan is to do three things.

Implement the relevant sparse matrix solver in a JAX-able form. (This is mostly gonna be me trying to remember how to do something I haven’t done in a very long time.)

Bind

^{50}the (probably) inefficient version into PyMC to see how that process works.Try the custom

`jvp`

and`vjp`

interfaces in JAX to see if they speed things up relative to just autodiffing through my for loops.(Maybe) Look into whether hand-rolling some C++ is worth the effort.

Will I get all of this done? I mean, I’m skeptical. But hey. If I do it’ll be nice.

## Footnotes

aka linear multilevel models↩︎

Popular in epidemiology↩︎

INLA = Laplace approximations + sparse linear algebra to do fast, fairly scalable, and accurate Bayesian inference on a variety of Bayesian models. It’s particularly good at things like spatial models.↩︎

In its guts, Stan is a fully templated C++ autodiff library, so I would need to add specific sparse matrix support. And then there’s be some truly gross stuff with the Stan language and its existing types. And so on and so on and honestly it just broke my damn brain. So I started a few times but never finished.↩︎

I just don’t ever use it. I semi-regularly read and debug other people’s code, but I don’t typically write very much myself. I use R because that’s what my job needs me to use. So a shadow aim here is to just put some time into my Python. By the end of this I’ll be like Britney doing I’m a Slave 4 U.↩︎

Or maybe more, but let’s not be too ambitious.↩︎

I’m pretty sure they will.↩︎

My sparse matrix data structures are

*rusty*as fuck.↩︎and the intercept if it’s needed↩︎

Really this costs me nothing and can be useful with multiple observations.↩︎

Default options include the identity matrix or some multiple of the identity matrix.↩︎

REML heads don’t dismay. You can do all kinds of weird shit by choosing some of these matrices in certain ways. I’m not gonna stop you. I love and support you. Good vibes only.↩︎

The priors on \(\beta\) and \(b\) are independent Gaussian so it has to be.↩︎

homosexual↩︎

Inverse correlation matrix↩︎

excluding the fixed ones, like \(W\) and \(A\) and \(R\). ↩︎

Such a dirty word. For all of the models we care about, this is block diagonal. So this assumption is our restriction to a specific class of models.↩︎

I would suggest a lot of syntactic sugar if you were ever going to expose this stuff to users.↩︎

See the Bates

*et al.*paper. Their formulation is fabulous but doesn’t extend nicely to the situations I care about! Basically they optimise for the situation where \(\Sigma_b\) can be singular, which is an issue when you’re doing optimisation. But I’m not doing optimisation and I care about the case where the precision matrix is defined as a singular matrix (and therefore \(\Sigma_b\) does not exist. This seems like a truly wild idea, but it occurs quite naturally in many important models like smoothing splines and ICAR models (which are extremely popular in spatial epidemiology).↩︎It’s easier in two ways. Firstly, MCMC likes lower-dimensional targets. They are typically easier to sample from! Secondly, the posterior geometry of \(p(\theta \mid y)\) is usually pretty simple, while the joint posterior \(p(\theta, u \mid y)\) has an annoying tendency to have a funnel in it, which forces us to do all kinds of annoying reparameterisation tricks to stop the sampler from shitting the bed.↩︎

Computers!↩︎

CSS is my passion.↩︎

It’s possible to rearrange things to lose that \(\frac{1}{\sigma^2}\), which I admit looks a bit weird. It cancels out down the line.↩︎

I have, historically, not had the greatest grip on whether or not things are easy.↩︎

See previous footnote.↩︎

Or a good approximation to it. Laplace approximations work very well for this to extend everything we’re doing here from a linear mixed-ish model to a generalised linear mixed-ish model.↩︎

This is actually a bit dangerous on the face of it because it depends on \(\theta\). You can convince yourself it’s ok. Choosing \(u=0\) is less stress inducing, but I wanted to bring out the parallel to using a Laplace approximation to \(p(u \mid \theta, y)\), in which case we really want to evaluate the ratio at the point where the approximation is the best (aka the conditional mean).↩︎

A common mistake is to forget the parameter dependent proportionality constants from the normal distribution. You didn’t need them before because you were conditioning on \(\theta\) so they were all constant. But now \(\theta\) is unknown and if we forget them an angel will cry.↩︎

Honest footnote: This started as \(p(\mu_{u\mid y, \theta}(\theta) \mid y, \theta) \propto 1\) because I don’t read my own warnings.↩︎

The brave or foolish amongst you might want to convince yourselves that this collapses to

*exactly*the marginal likelihood we would’ve gotten from Rasmussen and Williams had we made a sequence of different life choices. In particular if \(A = I\) and \(Q(\theta) = \Sigma(\theta)^{-1}\).↩︎Or, at least, people who have made it this far into the post.↩︎

You like GPs bro? Give me a sequence of increasingly abstract definitions. I’m waiting.↩︎

Multiply the determinants of the matrices along the diagonal.↩︎

Look at the Bates et al paper. Specifically section 2.2.

`lme4`

is a really clever thing.↩︎examples: smoothing splines, AR(p) models, areal spatial models, some Gaussian processes if you’re careful↩︎

\(10^4\)–\(10^6\) is not unheard of↩︎

A dense matrix factorisation of an \(n\times n\) matrix costs \(\mathcal{O}(n^3)\). The same factorisation of a sparse matrix can cost as little as \(\mathcal{O}(n)\) if you’re very lucky. More typically it clocks in a \(\mathcal{O}(n^{1.5})\)–\(\mathcal{O}(n^{2})\), which is still a substantial saving!↩︎

This happens for a lot of designs, or when a basis spline or a Markovian Gaussian process is being used↩︎

This happens a lot, but not always. For instance subset-of-regressors/predictive process-type models have a dense \(A\). In this case, if \(A\) has \(m\) rows an \(n\) columns, this is an \(\mathcal{O}(mn)\), which is more expensive than a sparse \(A\) unless \(A\) has roughly \(m\) non-zeros per row..↩︎

but usually not cubically. See above footnote.↩︎

It’s important that we are talking about

*precision*matrices here and not covariance matrices as the inverse of a sparse matrix is typically dense. For instance, an AR(1) prior with autocorrelation parameter \(\rho\) has a prior has a sparse precision matrix that looks something like \[ Q = \frac{1}{\tau^2}\begin{pmatrix} 1 & -\rho &&&&& \\ -\rho&1 + \rho^2& -\rho&&&& \\ &-\rho& 1 + \rho^2 &- \rho&&& \\ &&-\rho& 1 + \rho^2&-\rho&& \\ &&&-\rho&1+\rho^2 &-\rho & \\ &&&&-\rho&1 + \rho^2& - \rho \\ &&&&&-\rho&1 \end{pmatrix}. \] On the other hand, the*covariance matrix*is fully dense \[ Q^{-1} = \tau^2\begin{pmatrix} \rho&\rho^2&\rho^3&\rho^4&\rho^5&\rho^6&\rho^7 \\ \rho^2&\rho&\rho^2&\rho^3&\rho^4&\rho^5&\rho^6 \\ \rho^3&\rho^2&\rho&\rho^2&\rho^3&\rho^4&\rho^5 \\ \rho^4&\rho^3&\rho^2&\rho&\rho^2&\rho^3&\rho^4 \\ \rho^5&\rho^4&\rho^3&\rho^2&\rho&\rho^2&\rho^3 \\ \rho^6&\rho^5&\rho^4&\rho^3&\rho^2&\rho&\rho^2 \\ \rho^7&\rho^6&\rho^5&\rho^4&\rho^3&\rho^2&\rho \\ \end{pmatrix}. \]

This is a generic property: the inverse of a sparse matrix is usually dense (it’s dense as long as the graph associated with the sparse matrix has a single connected component there’s a matrix with the same pattern of non-zeros that has a fully dense inverse) and the entries satisfy geometric decay bounds.↩︎Remember: \(W\) is diagonal and known.↩︎

Not if you’re doing some wild dummy coding shit or modelling text, but typically.↩︎

You’d think that dense rows and columns would be a problem but they’re not. A little graph theory and a little numerical linear algebra says that as long as they are the last variables in the model, the algorithms will still be efficient. That said, if you want to

*dig in*, it is possible to use supernodal (eg CHOLMOD) and multifrontal (eg MUMPS) methods to group the operations in such a way that it’s possible to use level-3 BLAS operations. CHOLMOD even spins this into a GPU acceleration scheme, which is fucking wild if you think about it: sparse linear algebra rarely has the arithmetic intensity or data locality required to make GPUs worthwhile (you spend all of your time communicating, which is great in a marriage, terrible in a GPU). But some clever load balancing, tree-based magic, and multithreading apparently makes it possible. Like truly, I am blown away by this. We are not going to do*any*of this because absolutely fucking not. And anyway. It’s kinda rare to have a huge number of covariates in the sorts of models that use these complex random effects. (Or if you do, you better light your Sinead O’Connor votive candle because honestly you have a lot of problems and you’re gonna need healing.)↩︎If you’ve been reading the footnotes, you’ll recall that sometimes one of these precision matrices on the diagonal will be singular. Sometimes that’s because you fucked up your programming. But other times it’s because you’re using something like an ICAR (intrinsic conditional autoregressive) prior on one of your components. The precision matrix for this model is \(Q_\text{ICAR} = \tau_\text{ICAR} = \tau \text{Adj}(\mathcal{G})\), where \(\operatorname{Adj}(\mathcal{G})\) is the adjacency matrix of some fixed graph \(\mathcal{G}\) (typically describing something like which postcodes are next to each other). Some theory suggests that if \(\mathcal{G}\) has \(d\) connected components, the zero determinant should be replaced with \(\tau^{(m - d)/2}\), where \(m\) is the number of vertices in \(\mathcal{G}\).↩︎

I guess there’s nothing really stopping you from writing in pure Python except a creeping sense of inadequacy.↩︎

eg build a sparse matrix↩︎

Honey, we do not have time. Understanding autodiff is not massively important in the grand scheme of this blogpost (or, you know, probably in real life unless you do some fairly specific things). I’ll let Charles explain it.↩︎

Or, a custom vector-Jacobian product, which is not a symmetrical choice.↩︎

I bind you Nancy!↩︎

## Reuse

## Citation

```
@online{simpson2022,
author = {Simpson, Dan},
title = {Sparse {Matrices} 1: {The} Linear Algebra of Linear Mixed
Effects Models and Their Generalisations},
date = {2022-03-22},
url = {https://dansblog.netlify.app/2022-03-22-a-linear-mixed-effects-model},
langid = {en}
}
```