MCMC with the wrong acceptance probability

Sometimes I chat about work with people. Sometimes an interesting factlet comes up. Sometimes I blog about it. This is one of those times.

Categories: Fundamentals, MCMC, Bayes

Author: Dan Simpson

Published: November 23, 2022

Just the other day[1] I was chatting with a friend[2] about MCMC and he asked me a fundamental, but seldom asked, question: what happens if my acceptance probability is a bit off?

This question comes up a bunch. In this context, they were switching from double to single precision[3] and were a little worried that some of their operations would be a bit more inexact than they were used to. Would this tank MCMC? Would everything still be fine?

What is Markov chain Monte Carlo?

Markov chain Monte Carlo (MCMC) is, usually, guess-and-check for people who want to be fancy.

It is a class of algorithms that allow you to construct a[4] Markov chain that has a given stationary distribution[5] \(\pi\). In Bayesian applications, we usually want to choose \(\pi = p(\theta \mid y)\), but there are other applications of MCMC.

Most[6] MCMC algorithms live in the Metropolis-Hastings family of algorithms. These methods require only one component: a proposal distribution \(q(\theta' \mid \theta)\). Given basically any[7] proposal distribution, we can go from our current state \(\theta_k\) to the new state \(\theta_{k+1}\) using the following three steps (a minimal code sketch follows the list):

  1. Propose a potential new state \(\theta' \sim q(\theta' \mid \theta_k)\)

  2. Sample a Bernoulli random variable \(r_{k+1}\) with \[ \Pr(r_{k+1} = 1 \mid \theta_k) = \alpha_{k+1} = \min\left\{1, \frac{\pi(\theta')}{\pi(\theta_k)}\frac{q(\theta_k \mid \theta')}{q(\theta' \mid \theta_k)}\right\} \]

  3. Set \(\theta_{k+1}\) according to the formula \[ \theta_{k+1} = \begin{cases} \theta', & r_{k+1}=1 \\ \theta_k, &r_{k+1} = 0.\end{cases} \]
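To make this concrete, here's a minimal sketch of those three steps in Python. The standard normal target, the random-walk proposal, the step size, and all of the names are mine, purely for illustration; they aren't from any of the references below.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(theta):
    # Log-density of the target; a standard normal, purely for illustration.
    return -0.5 * theta**2

def mh_step(theta, step_size=1.0):
    # 1. Propose from a symmetric random walk, so the q-ratio is 1.
    theta_prop = theta + step_size * rng.normal()
    # 2. The acceptance probability min{1, pi(theta') / pi(theta)}.
    alpha = min(1.0, np.exp(log_pi(theta_prop) - log_pi(theta)))
    # 3. Flip the Bernoulli(alpha) coin and either move or stay put.
    if rng.uniform() < alpha:
        return theta_prop
    return theta

theta, samples = 0.0, []
for _ in range(5000):
    theta = mh_step(theta)
    samples.append(theta)
```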

The acceptance probability[8] is chosen[9] to balance[10] out the proposal \(q(\cdot \mid \cdot)\) with the target distribution \(\pi\).

You can interpret the two ratios in the acceptance probability separately. The first one prefers proposals from high-density regions over proposals from low-density regions. The second ratio balances this by down-weighting proposed states that were easy to propose from the current location. When the proposal is symmetric, i.e. \(q(\theta'\mid \theta)= q(\theta \mid \theta')\), the second ratio is always 1. However, in better algorithms like MALA[11], the proposal is not symmetric. If we look at the MALA proposal \[ \theta'\mid \theta \sim N\left(\theta + \frac{1}{2}\Sigma\nabla \log \pi(\theta), \Sigma\right), \] it's pretty easy to see that we are biasing our samples towards the mode of the distribution. If we did not have the second ratio in the acceptance probability we would severely under-sample the tails of the distribution.
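If it helps, here is the same kind of sketch for a MALA step (scalar \(\theta\), \(\Sigma = \varepsilon\)), again with a made-up standard normal target and invented names; the point is just where the two ratios show up.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(theta):
    return -0.5 * theta**2            # illustrative target: standard normal

def grad_log_pi(theta):
    return -theta

def log_q(theta_to, theta_from, eps):
    # Log-density (up to a constant that cancels in the ratio) of the MALA
    # proposal N(theta_from + 0.5 * eps * grad log pi(theta_from), eps).
    mean = theta_from + 0.5 * eps * grad_log_pi(theta_from)
    return -0.5 * (theta_to - mean) ** 2 / eps

def mala_step(theta, eps=0.5):
    theta_prop = (theta + 0.5 * eps * grad_log_pi(theta)
                  + np.sqrt(eps) * rng.normal())
    # pi-ratio (prefers high density) plus the Hastings correction
    # (down-weights proposals that were easy to make).
    log_alpha = (log_pi(theta_prop) - log_pi(theta)
                 + log_q(theta, theta_prop, eps)
                 - log_q(theta_prop, theta, eps))
    return theta_prop if np.log(rng.uniform()) < log_alpha else theta
```

Dropping the two `log_q` terms gives exactly the tail-starved sampler described above.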

MCMC with approximate acceptance probabilities

With this definition in hand, it's now possible to re-cast the question my friend asked: what happens to my MCMC algorithm if, instead of \(\alpha_{k+1}\), I accidentally compute \(\tilde \alpha_{k+1}\) and use that instead to simulate \(r_{k+1}\)?

So let’s go about answering that!

A bit of a literature review

Unsurprisingly, this type of question has popped up over and over again in the literature:

  • This exact question was asked by Gareth Roberts and Jeff Rosenthal, first[12] with Peter Schwartz and a second, more[13][14] realistic, time with Laird Breyer. They found that as long as the chain's convergence is sufficiently nice[15], the perturbed chain will converge nicely and have[16] a central limit theorem.

  • About 10 years ago, an absolute orgy[17][18] of research happened around the question: what happens if the acceptance probability is random but unbiased? These are called exact approximate[19] or pseudo-marginal methods. They have had some success in situations[20] where the likelihood has a parameter-dependent normalising constant that can't be computed exactly, but can be estimated unbiasedly. The problem with this class of methods is that the extra noise tends to make the Markov chain perform pretty badly[21]. This limits their practical use to models where we really can't do anything else[22]. That said, there is some interesting literature on random sub-sampling of data, both where it doesn't really work and where it does.

  • A third branch of literature is on truly approximate algorithms. These try to understand what happens if you're just wrong about \(\alpha_{k+1}\) and you don't do anything to correct it. There are a lot of papers on this, and I'm not going to do anything approaching a thorough review. I have work[23][24] to do. So I will just list two older papers that were influential for me. The first was by Geoff Nicholls, Colin Fox, and Alexis Muir Watt, which looks at what happens when you don't correct your pseudo-marginal method properly. It's a really neat theory paper that is a great presentation[25] of the concepts. The second paper is by Pierre Alquier, Nial Friel, Richard Everitt, and Aidan Boland, which looks at general approximate Markov chains. They show empirically that these methods work extremely well relative to pseudo-marginal methods in practical settings. There are also some nice results on perturbations of Markov chains in general, for instance this paper by Daniel Rudolf and Nikolaus Schweizer.

Trying to understand noisy Markov chains

So how do I think about noisy Markov chains? Despite all appearances[26], I am not really a theory person. So while I know that there's a massive literature on the stability of Markov chains, it doesn't really influence how I think about this.

Instead, I think about it in terms of that Nicholls, Fox, and Muir Watt paper. Or, specifically, a talk I saw Colin give at some point that was really clear.

The important thing to recognise is that it is not important how well you compute \(\alpha_{k+1}\). What is important is whether you get the same outcome. Imagine we have two random variables \(r_{k+1} \sim \text{Bernoulli}(\alpha_{k+1})\) and \(\tilde r_{k+1} \sim \text{Bernoulli}(\tilde \alpha_{k+1})\). If our realisation of \(r_{k+1}\) is the same as our realisation of \(\tilde r_{k+1}\), then we get the same \(\theta_{k+1}\). Or, to put it another way, when \(r_{k+1} = \tilde r_{k+1}\), no one can tell[27] that it's an approximate Markov chain.
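(To pick some numbers out of the air: if \(\alpha_{k+1} = 0.8314\), \(\tilde \alpha_{k+1} = 0.8311\), and both coins happen to come up "accept", then this step carries no trace whatsoever of the approximation.)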

This means that one way to understand inexact MCMC is to think of the Markov chain \[ (\tilde{\theta}_k, s_k), \qquad k=0, 1, \ldots, \infty, \] where[28] \[ s_k = \begin{cases} 0, \quad & r_{k} = \tilde r_k \\ 1, &r_k \neq \tilde r_k\end{cases} \] indicates whether or not we made the wrong decision. It's important to note that while \(\tilde \theta_k\) is marginally a Markov chain, \(s_k\) is not. You can actually think of \(s_k\) as the observation of a hidden Markov model if you want to. I won't stop you. Nothing will. There is no morality, there is no law. It is The Purge.

Although we can never actually observe \(s_k\), thinking about it is really useful. In particular, we note that until \(s_k =1\) for the first time, the samples of \(\tilde \theta_k\) are identical to a correct Metropolis-Hastings algorithm. After this point, the approximate chain and the (imaginary) exact chain will be different. But we can iterate this argument.

To do this, we can define the decoupling times \(N_j\) by setting \(N_0=0\) and \[ N_j = \inf\{i > N_{j-1}: s_i = 1\}, \] so that \(N_j\) is the step at which the \(j\)th wrong decision happens and, between wrong decisions, the approximate chain is the same as an exact MCMC algorithm started at \(\theta_{N_{j-1}}\).

If we run our algorithm for \(N\) steps, we can then think of the output as being the same as running \(J = \sum_{k=1}^N s_k\) Markov chains of different lengths. The \(j\)th chain starts at \(\theta_{N_{j-1}}\) and has length \(N_{j}-N_{j-1}\). It is worth remembering that these chains are not started from independent points. In particular, if \(N_j - N_{j-1}\) is small, then the starting positions of the \(j\)th and the \((j+1)\)th chains will be heavily correlated.
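(A made-up example to fix ideas: if the first few values of \(s_k\) were \(0, 0, 0, 1, 0, 1, \ldots\), then \(N_1 = 4\) and \(N_2 = 6\), and \(\tilde \theta_0, \ldots, \tilde \theta_3\) look like an exact chain started at \(\tilde \theta_0\), while \(\tilde \theta_4, \tilde \theta_5\) look like an exact chain started at \(\tilde \theta_4\).)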

To think about this, we need to think about what happens after a certain number of steps of a Markov chain. We are going to need the notation \(\theta_k = P^k \theta_0\), which denotes \(k\) steps of the exact algorithm started at \(\theta_0\).

The topic of convergence of Markov chains is a complex business, but we are going to assume that our exact Markov chain is[29] geometrically ergodic, which means that \[ \|P^k \theta_0 - \pi\| \leq M(\theta_0) \rho^{k} \] for some function[30] \(M(\theta_0)\) and some \(0 < \rho < 1\).

Geometric ergodicity is a great condition because, among other things, it ensures that sample means from the Markov chain satisfy a central limit theorem. It’s also bloody impossible to prove. But usually indicators like R-hat do a decent job at suggesting that there might be problems. Also if you are spending a lot of time rejecting proposals in certain parts of the space, there’s a solid chance that you’re not geometrically ergodic.

Now let's assume that we are interested in computing \(\mathbb{E}_\pi(h(\theta))\) for some nice[31] function \(h\). Then the nice thing about Markov chains is that, give or take[32], \[ \left|\frac{1}{N_j-N_{j-1}}\sum_{k=N_{j-1}}^{N_j-1}h(\theta_k) - \mathbb{E}_\pi(h(\theta))\right| \leq C \frac{M(\theta_{N_{j-1}})}{N_j-N_{j-1}}\frac{1 - \rho^{N_{j}-N_{j-1}}}{1- \rho}, \] where \(C\) might depend on \(h\) if \(h\) is unbounded.

This suggests that the error is bounded by, roughly, \[ \left|\frac{1}{N}\sum_{k=1}^{N}h(\theta_k) - \mathbb{E}_\pi(h(\theta))\right| \leq \frac{C}{N} \sum_{j = 1}^J M(\theta_{N_{j-1}})\frac{1 - \rho^{N_{j}-N_{j-1}}}{1- \rho}. \]

This suggests a few things:

  • If \(J\) is small relative to \(N\), we are going to get very similar estimates to just running \(J\) parallel Markov chains and combining them without removing any warm-up iterations. In particular, if almost all of the gaps \(N_j - N_{j-1}\) are big, it will be a lot like combining \(J\) warmed-up independent chains.

  • Effective sample size and Monte Carlo standard error estimates will potentially be very wrong. This is because instead of computing them based on multiple dependent chains, we are pretending that all of our samples came from a single ergodic Markov chain. Is this a problem? I really don't know. Again, if the gaps \(N_j - N_{j-1}\) are usually large, we will be fine.

  • Because \(M(\theta)\) can be pretty large when \(\theta\) is large, we might have some problems. It's easy to imagine cases where we get stuck out in a tail and we just fire off a lot of events when \(\theta_{N_j}\) is really big. This will be a problem. But also, if we are stuck out in a tail, we are rightly fucked anyway and all of the MCMC diagnostics should be screaming at us. We can take heart that \(\mathbb{E}_\pi(M(\theta))\) is usually finite[33] and not, you know, massive.

What do the \(N_j\) look like?

So the takeaway from the last section was that if the gaps between successive decoupling times \(N_j\) are usually pretty big, then everything will work ok. Intuitively this makes sense. If the gaps were always small, it would be very difficult to ever get close to any sort of stationary distribution.

The paper by Nicholls, Fox, and Muir Watt talks about potential sizes for the \(N_j\). The general construction that they use is a coupling, which is a bivariate Markov chain \((\theta_k, \tilde \theta_k)\) whose components start from the same position and are updated as follows:

  1. Propose \(\theta' \sim q(\theta' \mid \tilde \theta_{k})\)
  2. Generate a uniform random number \(u_{k+1}\)
  3. Update \(\theta\) as \[ \theta_{k+1} = \begin{cases} \theta', \qquad & u_{k+1} \leq \alpha_{k+1} \\ \theta_{k}, & u_{k+1} > \alpha_{k+1}.\end{cases} \]
  4. Update \(\tilde \theta\) as \[ \tilde \theta_{k+1} = \begin{cases} \theta', \qquad & u_{k+1} \leq \tilde \alpha_{k+1} \\ \tilde \theta_{k}, & u_{k+1} > \tilde \alpha_{k+1}.\end{cases} \]

This Markov chain is coupled in three ways. The chains start at the same value \(\theta_0 = \tilde \theta_0\), the proposed \(\theta'\) is the same for both chains, and the randomness[34] used to do the accept/reject step is the same. Together, these things mean that \(\theta_k = \tilde \theta_k\) for all \(k < N_1\).

For this coupling construction, we can get the exact distribution of the \(s_k\). To do this, we remember that we will only make different decisions in the two chains (or decouple) if \(u_k\) is on different sides of the two acceptance probabilities. The probability of this happening is \[\begin{align*} \Pr(s_k = 1) &= \Pr( u_k \in [\min\{ \alpha_{k}, \tilde \alpha_k\}, \max\{ \alpha_{k}, \tilde \alpha_k\}]) \\ &= |\alpha_k - \tilde \alpha_k|. \end{align*}\]
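(To pick numbers out of the air again: if the exact acceptance probability at some step is \(\alpha_k = 0.8000\) and the sloppily computed one is \(\tilde \alpha_k = 0.8001\), the chains can only make different decisions at that step if \(u_k\) lands in \([0.8000, 0.8001]\), which happens with probability \(10^{-4}\).)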

I guess you could write down the distribution of the \(N_j\) in terms of this. In particular, you get \[ \Pr(N_1 = n) = |\alpha_n - \tilde \alpha_n|\prod_{k=1}^{n-1} (1- |\alpha_k - \tilde \alpha_k|), \] but honestly it would be an absolute nightmare.
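For the simulation-inclined, here is a rough sketch of that coupling in code. The standard normal target, the random-walk proposal, and the made-up relative perturbation of the acceptance probability are all mine, not anything from the papers above; it's just one way to watch when the first wrong decision (and hence \(N_1\)) happens.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_pi(theta):
    return -0.5 * theta**2                  # illustrative standard normal target

def alpha_exact(theta, theta_prop):
    return min(1.0, np.exp(log_pi(theta_prop) - log_pi(theta)))

def alpha_approx(theta, theta_prop, delta=1e-4):
    # A made-up relative perturbation: alpha_tilde = (1 + delta) * alpha, capped at 1.
    return min(1.0, (1.0 + delta) * alpha_exact(theta, theta_prop))

def first_decoupling_time(theta0=0.0, step_size=1.0, max_steps=100_000):
    # Before the first wrong decision the two chains are identical,
    # so we only need to track one state.
    theta = theta0
    for k in range(1, max_steps + 1):
        theta_prop = theta + step_size * rng.normal()   # shared proposal
        u = rng.uniform()                               # shared uniform
        accept_exact = u <= alpha_exact(theta, theta_prop)
        accept_approx = u <= alpha_approx(theta, theta_prop)
        if accept_exact != accept_approx:               # s_k = 1: first wrong decision
            return k
        theta = theta_prop if accept_approx else theta
    return max_steps                                    # never decoupled in the budget

# Empirical draws from the distribution of N_1.
n1_samples = [first_decoupling_time() for _ in range(200)]
```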

When people get stuck on probability questions, the natural thing to do is to make the problem so abstract that you can make the answer up. In that spirit, let's ask a slightly different question: what is the distribution of the maximal decoupling time between the exact and the approximate chain? This is the distribution of the longest possible coupling of the two chains over all[35] possible random sequences \((\theta_k, \tilde \theta_k)\) such that the distribution of \((\theta_1, \theta_2, \ldots)\) is the same as our exact Markov chain and the distribution of \((\tilde\theta_1,\tilde \theta_2, \ldots)\) is the same as our approximate Markov chain.

This maximal value of \(N_1\) is called the maximal agreement coupling time or, more whimsically, the MEXIT time. It turns out that getting the distribution of \(N_1\) is … difficult, but we[36] can construct a random variable \(\tau\) that is independent of \(\tilde \theta_k\) such that \(\tau \leq N_1\) almost surely and \[ \Pr(\tau = t\mid \tau \geq t) = 1 - \operatorname*{ess\,inf}_{B, \theta_{<t}} \left\{\frac{P(\theta_t \in B \mid \theta_{<t})}{\tilde P(\theta_t \in B \mid \theta_{<t})}\right\}, \] where \(P(\theta_t \mid \theta_{<t})\) is the transition distribution for the exact Markov[37] chain and \(\tilde P(\theta_t \mid \theta_{<t})\) is the transition distribution for the approximate Markov chain.

For a Metropolis-Hastings algorithm, the transition distribution has the form \[ P(B, \theta)= \begin{cases} \alpha(\theta)Q(B \mid \theta),\qquad & \theta \not \in B \\ \alpha(\theta)Q(B\mid \theta) + (1-\alpha(\theta)), &\theta \in B \end{cases} \] where \(Q(B\mid \theta)\) is the probability associated with the proposal density \(q(\cdot \mid \theta)\) and I have been very explicit about the dependence of the acceptance probability on \(\theta\). (The \((1-\alpha(\theta))\) term takes into account the probability of starting at \(\theta\) and not accepting the proposed state.)

That definition of \(\tau\) looks pretty nasty, but it's not too bad: in particular, the infimum only cares about whether \(\theta_{t-1}\in B\). This means that the condition simplifies to \[ \Pr(\tau = t\mid \tau \geq t) = 1 - \min\left\{\operatorname*{ess\,inf}_{B, \theta_{t-1}} \frac{\alpha_t(\theta_{t-1}) Q(B \mid \theta_{t-1})}{\tilde\alpha_t(\theta_{t-1}) Q(B \mid \theta_{t-1})}, \operatorname*{ess\,inf}_{B, \theta_{t-1}} \frac{\alpha_t(\theta_{t-1}) Q(B \mid \theta_{t-1}) + (1-\alpha_t(\theta_{t-1}))}{\tilde\alpha_t(\theta_{t-1}) Q(B \mid \theta_{t-1}) + (1- \tilde \alpha_t(\theta_{t-1}))}\right\}. \]

This simplifies further if we assume that the proposal distribution \(Q(\cdot \mid \theta_k)\) is absolutely continuous and has a strictly positive density. Then, it truly does not matter what \(B\) is. For the first term, it just cancels, while the second term is monotone[38] in \(Q(B \mid \theta_{t-1})\), so we can take this term to be either zero or one and get[39] \[ \Pr(\tau = t\mid \tau \geq t) = 1 - \min\left\{\operatorname*{ess\,inf}_{ \theta_{t-1}} \frac{\alpha_t(\theta_{t-1}) }{\tilde\alpha_t(\theta_{t-1})}, \operatorname*{ess\,inf}_{\theta_{t-1}} \frac{1-\alpha_t(\theta_{t-1})}{ 1- \tilde \alpha_t(\theta_{t-1})},1\right\}. \]

This is, as the Greeks would say, not too bad.

If, for instance, we know the relative error \[ \tilde\alpha(\theta) = (1 + \delta(\theta))\alpha(\theta), \] then \[ \frac{\alpha(\theta)}{\tilde \alpha(\theta)} = \frac{1}{1 + \delta(\theta)}, \] and if we know[40] \(\delta(\theta) \leq \bar \delta\), we get \[ \frac{\alpha(\theta)}{\tilde \alpha(\theta)} \geq \frac{1}{1 + \bar\delta}. \] Similarly, if \[ 1-\tilde \alpha(\theta) = (1-\alpha(\theta))(1+\epsilon(\theta)) \] and \(\epsilon(\theta) \leq \bar \epsilon\), then we get \[ \frac{1-\alpha(\theta)}{1-\tilde \alpha(\theta)} = \frac{1}{1+\epsilon(\theta)} \geq \frac{1}{1+\bar\epsilon}. \]

The nice thing is that we can choose our upper bounds so that \(\rho = (1+ \bar \delta)^{-1} = (1+ \bar\epsilon)^{-1}\) and get the bound \[ \Pr(\tau = t\mid \tau \geq t) \leq 1 - \rho. \] It follows that \[ \Pr(\tau > t) \geq \rho^{t}, \] so the two chains stay together for at least \(t\) steps with probability at least \(\rho^t\).

Now this is a bit nasty. It's a lower bound on the tail probability of a lower bound on the maximal decoupling time. Probability, eh.

Probably the most useful thing we can get from this is a lower bound on \(\mathbb{E}(\tau)\), which is[41] \[ \mathbb{E}(\tau) \geq \frac{1}{1-\rho} = 1 + \bar \delta^{-1}. \]

This confirms our intuition that if the relative error is small, the chains will, on average, stay coupled for a long time, so \(N_1\) (and, waving hands, the later gaps) will usually be big. It's not quite enough to show the converse (a big relative error begets small \(N_j\)), but that's probably true as well.
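To put some invented, back-of-the-envelope numbers on that: if switching to single precision introduced a relative error in the acceptance probability of around \(\bar\delta \approx 10^{-7}\) (waving away the fact that the real rounding error in \(\alpha\) is a messier accumulation than a single relative bound, see footnote 14), the bound says that, in the best possible coupling, the exact and approximate chains agree for, on average, at least around ten million iterations. Which, as far as the question that started all of this goes, is pretty reassuring.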

And that is where we end this saga. There is definitely more that could be said, but I decided to spend exactly one day writing this post and that time is now over.

Footnotes

  1. Usually this is a lie, but it was actually a thing that happened last week↩︎

  2. Don’t judge me (or my friends) based on this. I promise we also talk about other shit.↩︎

  3. Hi GPUs!↩︎

  4. usually reversible, although a lot of cool but not ready for prime time work is being done on non-reversible chains.↩︎

  5. A stationary distribution, if it exists, is the distribution that is preserved by the Markov chain. If \(\pi\) is the stationary distribution and \(x_1 \sim \pi\), then if we construct \(x_2, x_3,\ldots\) by running the Markov chain then for every \(k\), the marginal distribution is \(x_k \sim \pi\).↩︎

  6. But critically not all! The dynamic HMC algorithm used in Stan, for instance, is not a Metropolis-Hastings algorithm. Instead of doing an accept/reject step it samples from the proposed trajectory. Betancourt’s long intro to Hamiltonian Monte Carlo covers this very well.↩︎

  7. The conditions for this to work are very light. But that’s because the definition of “working” only thinks about what happens after infinitely many steps. To get a practically useful Metropolis-Hastings algorithm, you’ve got to work very hard on choosing your proposal density.↩︎

  8. sometimes called the Hastings correction↩︎

  9. This is not the only choice that will work, but in some sense it is the most efficient one.↩︎

  10. Technically, it is chosen by requiring that the Markov transition kernel \(P(\theta,\theta')\) satisfies the detailed balance condition \(\pi(\theta) P(\theta,\theta') = \pi(\theta')P(\theta', \theta)\), but everything about that equation is beyond the scope of this particular post.↩︎

  11. Metropolis-adjusted Langevin Algorithm↩︎

  12. Under the assumption that the total floating point error was bounded by a constant \(\delta\)↩︎

  13. This time the assumption was that the rounding error for the acceptance probability at state \(\theta_k\) was bounded by \(\delta \|\theta_k\|\). This is a lot closer to how floating point arithmetic actually works. The trade off is that it requires a tighter condition on the drift function \(V\).↩︎

  14. IEEE floating point arithmetic represents a real number using \(B\) bits. Typically \(B = 64\) (double precision) or \(B = 32\) (single precision). You can read a great intro to this on Nick Higham’s blog. But in general, the best we can do is represent a real number \(\theta\) by a floating point number \(\tilde \theta\) that satisfies \[ |\theta - \tilde \theta| \leq 2^{-N+1}|\theta|, \] where \(N=23\) in single precision and \(N=52\) in double precision. Of course, the acceptance probability is a non-linear combination of floating point numbers, so the actual error is going to be more complicated than that. I strongly recommend you read Nick Higham’s book on the subject.↩︎

  15. \(V\)-geometrically ergodic with some light conditions on \(V\)↩︎

  16. Geometric ergodicity implies the existence of a CLT! Which is nice, because all of our intuition about how to use the output from MCMC depends on a CLT.↩︎

  17. Like all good orgies, this one was mostly populated by men↩︎

  18. Yes, I know. My (limited) contribution to this literature was some small contributions to a paper led by Anne-Marie Lyne. But if years of compulsory Catholicism taught me anything (other than “If you’re drinking with a nun or an aging homosexual, don’t try to keep up”) it’s that something does not have to be literally true to be morally true.↩︎

  19. We have to slightly redefine the word “exact” to mean “targets the correct stationary distribution” for this name to make sense↩︎

  20. Random graph models and point processes are two great examples↩︎

  21. for instance, it gets stuck for long times at single values↩︎

  22. the aforementioned point process and graph models↩︎

  23. Playing God of War: Ragnarok↩︎

  24. The first run of God of War games was not my cup of tea, but the 2018 game, which is essentially a detailed simulation of what happens when a muscle bear is entrusted with walking an 11 year old up a hill, was really enjoyable. So far this is too.↩︎

  25. Does it talk about involutions for no fucking reason? Of course it does. Read past that.↩︎

  26. Yeah, like I have also read my blog. Think of it as being like social media. It is not a representation of me as a whole person. It’s actually biased towards stuff that I have either found or find difficult.↩︎

  27. A friend of mine has a “No one knows I’m a transexual” t-shirt that she likes to wear to supermarkets.↩︎

  28. Note that both \(r_k\) and \(\tilde r_k\) are computed using the same value \(\tilde \theta_{k-1}\).↩︎

  29. The norm here is usually either the total variation norm or the \(V\)-norm. But truly it’s not important for the hand waving.↩︎

  30. In most cases \(M(\theta) \rightarrow \infty\) as \(\|\theta\| \rightarrow \infty\).↩︎

  31. Bounded and continuous always works. But everything is probably ok for unbounded functions as long as \(h(\theta)\) has a pile of finite moments.↩︎

  32. This is roughly true. I basically used the geometric ergodicity bound to bound \[ \sum_{k=N_{j-1}}^{N_j-1} \left(h(\theta_k) - \mathbb{E}_\pi(h(\theta))\right) \] and summed it up. There are smarter things to do, but it’s close enough for government work. ↩︎

  33. Sometimes, if you squint, this term will kinda, sorta start to look like \(\mathbb{E}_\pi(\pi(\theta)^{-1/2})\), which isn’t usually toooo big. But also, sometimes it looks totally different. Theory is wild.↩︎

  34. If you’ve ever wondered how rbinom(1, 1, p) works, there you are.↩︎

  35. Think of this as the opposite of an adversarial example. We are trying to find the exact chain that is scared to leave the approximate chain behind. Which is either romantic or creepy, depending on finer details.↩︎

  36. Well not me. Florian Völlering did it in his Theorem 1.4. I most certainly could not have done it.↩︎

  37. Well the result does not need this to be a Markov chain!↩︎

  38. it goes up if \(\alpha>\tilde \alpha\) otherwise it goes down↩︎

  39. The 1 case can basically never happen except in the trivial case where both acceptance probabilities are the same. And if we thought that was going to happen we would’ve done something bloody else↩︎

  40. The relative error being bounded does not stop the absolute error growing!↩︎

  41. Look above and recognize the Geometric distribution↩︎


Citation

BibTeX citation:
@online{simpson2022,
  author = {Dan Simpson},
  editor = {},
  title = {MCMC with the Wrong Acceptance Probability},
  date = {2022-11-23},
  url = {https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html},
  langid = {en}
}
For attribution, please cite this work as:
Dan Simpson. 2022. “MCMC with the Wrong Acceptance Probability.” November 23, 2022. https://dansblog.netlify.app/posts/2022-11-23-wrong-mcmc/wrong-mcmc.html.