Conjugate Priors? The crystal deodorant of Bayesian statistics

If we’re going to talk about priors, let’s talk about priors. And let’s talk about the most prior-y priors in the whole damn prior universe. Let’s talk about *conjugate priors*.

Who cares.

90s revivalists.

Live your life.

Deep breaths. Your soul is an island of positivity.

Conjugate priors are wild and fabulous beasts. They roam the strange, mathematical plains and live forever in our dreams of a better future.

Too much?

OK.

Conjugate priors are a mathematical curiosity that occasionally turn out to be slightly useful.

A prior distribution \(p(\theta)\) is *conjugate* to the likelihood^{1} \(p(y \mid \theta)\) if the *posterior distribution* \(p(\theta \mid y)\) is in the same distributional family as the prior.

Moreover, there is a rule to update the parameters in the prior to get the parameters in the posterior based on some simple summaries of the data. This means that you can simply write the posterior down as a specific distribution that you can^{2} easily sample from and get on with your life.
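To make the "update rule" concrete, here is a minimal sketch using the textbook Beta-Binomial pair. The choice of example, the function names, and the numbers are all mine, not part of any library. The prior is \(\text{Beta}(a, b)\), the simple summaries of the data are the counts of successes and failures, and the posterior is \(\text{Beta}(a + \text{successes}, b + \text{failures})\). The grid computation is only there as a brute-force sanity check.

```python
import math


def beta_binomial_update(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial counts gives a Beta posterior."""
    return a + successes, b + failures


def beta_logpdf(theta, a, b):
    """Log density of Beta(a, b) at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)


def grid_posterior_mean(a, b, successes, failures, n_grid=20_000):
    """Brute-force check: normalise prior x likelihood with a midpoint rule."""
    num = den = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) / n_grid
        log_w = (beta_logpdf(theta, a, b)
                 + successes * math.log(theta)
                 + failures * math.log(1 - theta))
        w = math.exp(log_w)
        num += theta * w
        den += w
    return num / den


# Hypothetical data: 7 successes and 3 failures, with a Beta(2, 2) prior.
a_post, b_post = beta_binomial_update(2.0, 2.0, successes=7, failures=3)
conjugate_mean = a_post / (a_post + b_post)
grid_mean = grid_posterior_mean(2.0, 2.0, successes=7, failures=3)
print(a_post, b_post, conjugate_mean, grid_mean)
```

The conjugate route is two additions. The grid route is what you would be doing (in spirit) without conjugacy, and it should land on the same posterior mean up to quadrature error.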

Really, it seems like a pretty good thing. But there is, unsurprisingly, a hitch: almost no likelihoods have conjugate priors. And if you happen to have a model with a nice^{3} conjugate prior then good for you, but if you modify your model even slightly, you will no longer have one.

That is, the restriction to conjugate priors is a massive restriction on your *entire* model.

Conjugate priors are primarily used by two types of people:

- People who need to write exam questions for undergraduate Bayesian statistics courses^{4},
- People who need to implement a Gibbs sampler and don’t want to live through the nightmare^{5} that is Metropolis-within-Gibbs.

For the most part, we can ignore the first group of people as a drain on society.

The second group is made up of:

- people who are using software that forces it on them. And like we don’t all have time to learn new software^{6}. Leave, as hero^{7} Chris Crocker, Britney alone.
- people who are writing their own Gibbs sampler. Annie Lennox said it best: Why-y-y-y-y-y-y-y-y-y-y? For a very large variety of problems, *you do not have to do this*^{8}. The exception is when you have a discrete parameter in your model that you can’t marginalise out^{9}, like an exponential random graph model or something equally hideous. Thankfully, a lot of work in machine learning has expanded the options for Bayesian and pseudo-semi-kinda Bayesian^{10} estimation of these types of models. Anyway. Discrete parameters are disgusting. I am tremendously indiscrete.

The third type are the odd ducks who insist that the posterior and the prior being in the same family means that the prior can be interpreted as the outcome of a Bayesian analysis of a previous experiment. This instead of the much more realistic way of arriving at a conjugate prior, where you find yourself waking up alone in a bathtub full of ice and using an \(\text{Inverse-Gamma}(1/2, 0.0005)\) prior on the variance (which is conjugate for a Gaussian likelihood) because some paper from 1995^{11} told you it was a good choice.
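For the record, here is what that conjugacy looks like written out. This is a sketch under assumptions the text does not state: the mean \(\mu\) is known, the observations \(y_1, \ldots, y_n\) are independent \(N(\mu, \sigma^2)\), and \(\text{Inverse-Gamma}(a, b)\) is in the shape-scale parameterisation with density proportional to \((\sigma^2)^{-a-1} e^{-b/\sigma^2}\). Writing \(S = \sum_{i=1}^n (y_i - \mu)^2\), \[
p(\sigma^2 \mid y) \propto \underbrace{(\sigma^2)^{-n/2} e^{-S/(2\sigma^2)}}_{\text{likelihood}} \, \underbrace{(\sigma^2)^{-a-1} e^{-b/\sigma^2}}_{\text{prior}} = (\sigma^2)^{-(a + n/2) - 1} e^{-(b + S/2)/\sigma^2},
\] which is the kernel of an \(\text{Inverse-Gamma}(a + n/2, \, b + S/2)\) distribution. With the bathtub prior, \(a = 1/2\) and \(b = 0.0005\), so the posterior is \(\text{Inverse-Gamma}((n + 1)/2, \, 0.0005 + S/2)\).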

There is actually one situation where they can be pretty useful. If your parameter space breaks down into \(\theta = (\eta, \phi)\), where \(\eta\) is a high-dimensional variable, then if \(p(y \mid \theta) = p(y \mid \eta)\) and \(p(\eta \mid \phi)\) is conjugate for \(p(y \mid \eta)\), then a *magical* thing happens: you can compute \(p(\eta \mid y, \phi)\) explicitly (using the conjugate property) and then you can greatly simplify the posterior as \(p(\theta\mid y ) = p(\eta \mid y, \phi) p(\phi \mid y)\), where^{12} \[
p(\phi \mid y) = \frac{p(y \mid \eta)p(\eta \mid \phi)p(\phi)}{p(y) p(\eta \mid y, \phi)} \propto \left.\frac{p(y \mid \eta)p(\eta \mid \phi)p(\phi)}{p(\eta \mid y, \phi)}\right|_{\eta = \text{anything}},
\] where every term on the right hand side is able to be calculated^{13}. Even if this doesn’t have a known distribution form, it is much much lower-dimensional than the original problem and much more amenable to MCMC or possibly deterministic integration methods.

This really does feel a bit abstract, so I will give you the one case where I know it’s used very commonly. This is the case where \(y \sim N(A\eta, R)\) and^{14} \(\eta \mid \phi \sim N(0, \Sigma(\phi))\), where \(\Sigma(\phi)\) is a covariance matrix and \(A\) is a matrix (the dimension of \(\eta\) is often higher than the dimension of \(y\)).

This is an example of a class of models that occur *constantly* in statistics: Håvard Rue^{15} calls them Latent Gaussian models. They basically extend^{16} `(geostatistical)? linear|additive (mixed)? models`. So for all of these models, we can explicitly integrate out the high-dimensional Gaussian component, which makes inference *a breeze*^{17}.
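If you want to see the \(\eta = \text{anything}\) trick actually work, here is a small numerical sketch for the Gaussian case, assuming NumPy is available. Everything specific in it is made up for illustration: the dimensions, the random \(A\), the diagonal \(R\), the exponential covariance \(\Sigma(\phi)\), and the helper names. It uses the standard Gaussian conditioning result that \(\eta \mid y, \phi\) has precision \(\Sigma(\phi)^{-1} + A^T R^{-1} A\) and mean equal to the corresponding covariance times \(A^T R^{-1} y\), and it checks the ratio \(p(y \mid \eta) p(\eta \mid \phi) / p(\eta \mid y, \phi)\) against the directly computed marginal \(y \mid \phi \sim N(0, A \Sigma(\phi) A^T + R)\). Multiply by \(p(\phi)\) and you have the unnormalised \(p(\phi \mid y)\).

```python
import numpy as np


def gauss_logpdf(x, mean, cov):
    """Log density of N(mean, cov) at x, via a Cholesky factorisation."""
    resid = x - mean
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, resid)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + log_det + z @ z)


def eta_conditional(y, A, R, Sigma):
    """Mean and covariance of eta | y, phi (the conjugate Gaussian update)."""
    R_inv_A = np.linalg.solve(R, A)                  # R^{-1} A
    precision = np.linalg.inv(Sigma) + A.T @ R_inv_A
    cov = np.linalg.inv(precision)
    mean = cov @ (R_inv_A.T @ y)                     # cov @ A^T R^{-1} y
    return mean, cov


def log_marginal_via_ratio(y, A, R, Sigma, eta):
    """log p(y | phi) from the ratio, evaluated at an arbitrary eta."""
    mean, cov = eta_conditional(y, A, R, Sigma)
    return (gauss_logpdf(y, A @ eta, R)
            + gauss_logpdf(eta, np.zeros(len(eta)), Sigma)
            - gauss_logpdf(eta, mean, cov))


rng = np.random.default_rng(0)
n, d = 5, 8                                   # eta is higher-dimensional than y
A = rng.normal(size=(n, d))
R = 0.5 * np.eye(n)
phi = 1.3                                     # hypothetical hyperparameter value
idx = np.arange(d)
Sigma = phi * np.exp(-np.abs(np.subtract.outer(idx, idx)))  # illustrative covariance
y = rng.normal(size=n)

# Direct marginal: y | phi ~ N(0, A Sigma A^T + R).
direct = gauss_logpdf(y, np.zeros(n), A @ Sigma @ A.T + R)
at_zero = log_marginal_via_ratio(y, A, R, Sigma, np.zeros(d))
at_random = log_marginal_via_ratio(y, A, R, Sigma, rng.normal(size=d))
print(direct, at_zero, at_random)
```

All three numbers should agree up to floating point error, whichever \(\eta\) you plug in. The explicit inverses are fine at this size. For a real latent Gaussian model you would want sparse precision matrices and Cholesky solves instead.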

It gets slightly better than that because if you combine this observation with a clever asymptotic approximation, you get an approximately conjugate model and can produce Laplace approximations, nested Laplace approximations^{18}, and Integrated Nested Laplace approximations^{19}, depending on how hard you are willing to work.

Yes we have drifted somewhat from the topic, but that’s because the topic is boring.

Conjugate priors are mostly a mathematical curiosity and their role in Bayesian statistics is inexplicably inflated^{20} to make them seem like a core topic. If you never learn about conjugate priors your Bayesian education will not be lacking anything. It will not meaningfully impact your practice. But even stopped clocks are right 2-3 times a day^{21}.

^{1} Like all areas of Bayesian statistics, conjugate priors push back against the notion of Arianism.

^{2} Often, but not always.

^{3} Christian Robert’s book The Bayesian Choice has an example where a model has a conjugate prior but it doesn’t normalise easily.

^{4} Or, in my case, it’s explicitly listed on the syllabus.

^{5} Not a nightmare.

^{6} I’m equally likely to learn Julia and Stata. Which is to say I’m tremendously unlikely to put the effort into either. I wish them both well. Live your life and let me live mine.

^{7} I have not fact checked this recently, and we all know white gays sometimes go bad. But he started from a good place, so I’m sure it’s fine.

^{8} There is pedagogical value in learning how MCMC methods work by implementing them yourself. But girl it is 2021. Go fuck with a bouncy particle sampler or something. Live your life out loud! Young Bayesians run free.

^{9} Often, like with mixture models or hidden Markov models, you can; see e.g. https://mc-stan.org/docs/2_22/stan-users-guide/latent-discrete-chapter.html

^{10} The difference between these things is pretty slight in the usual situation where your MCMC scheme doesn’t explore the space particularly well. I’m not of the opinion that you either explore the full posterior or you don’t use the model. Most of the time you do perfectly fine with approximate exploration or, at least, you do as well as anything else will.

^{11} BERNARDINELLI, L., CLAYTON, D. and MONTOMOLI, C. (1995). Bayesian estimates of disease maps: How important are priors? Stat. Med. 14 2411–2431.

^{12} Multiply both sides of the first equation by the denominator and it’s equivalent to \(p(y, \eta, \phi) = p(y, \eta, \phi)\), which is tautologically true.

^{13} The constant of proportionality does not depend on \(\eta\). All of the \(\eta\) parts cancel!

^{14} The mean doesn’t have to be zero but you can usually make it zero using … magic.

^{16} Apologies for the regexp.

^{17} See also Rasmussen and Williams doing marginal inference with GPs. Exactly the same process.

^{20} I assume this is so people don’t need to update their lecture notes.

^{21} Daylight savings time fades the curtains and wreaks havoc with metaphors.


Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/dpsimpson/blog/tree/master/_posts/2021-10-14-priors2, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

For attribution, please cite this work as

Simpson (2021, Oct. 15). Un garçon pas comme les autres (Bayes): Priors: Whole New Way (Track 2). Retrieved from https://dansblog.netlify.app/posts/2021-10-14-priors2/

BibTeX citation

@misc{simpson2021priors:, author = {Simpson, Dan}, title = {Un garçon pas comme les autres (Bayes): Priors: Whole New Way (Track 2)}, url = {https://dansblog.netlify.app/posts/2021-10-14-priors2/}, year = {2021} }