Un garçon pas comme les autres (Bayes) - Priors: Night work (Track 1)

I have feelings. Too many feelings. And ninety six point seven three percent of them are about prior distributions¹. So I am going to write a few blog posts about prior distributions.

To be very honest, this is mostly a writing exercise² to get me out of a slump.

So let’s do this.

No love, deep web

As far as I am concerned it’s really fucking stupid to try to write about priors on their own. They are meaningless outside of their context. But, you know, this is a blog. So I get to be stupid.

So what is a prior distribution? It is whatever you want it to be. It is a probability distribution³ that … I don’t know. Exists⁴.

Ok. This is not going well. Let’s try again.

A prior distribution is, most of the time, a probability distribution on the parameters of a statistical model. For all practical purposes, we tend to work with its density, so the parameter \(\theta\), which could be a scalar but, in any interesting case, isn’t, has prior density \(p(\theta)\).

Captain fantastic and the brown dirt cowboy

But what does it all meeeeeeeean?

We have a prior distribution specified, gloriously, by its density. And unlike destiny, density is meaningless. It only makes sense when we integrate it up to get a probability \[ \Pr(A) = \int_A p(\theta)\,d\theta. \]

So what does the prior probability \(\Pr(A)\) of a set \(A\) actually mean in real life? The answer may shock you: it means something between nothing and everything.

Scenario 1: Let’s imagine that we were trying to estimate the probability that someone in some relatively homogeneous subgroup of customers completed a purchase on our website. It’s a binary process, so the parameter of interest can probably just be the probability that a sale is made. While we don’t know what the probability of a sale is for the subgroup of interest, we know a lot about sales on our website in general (in particular, we know that about 3% of visits result in sales). So if I also believe that it would be wildly unlikely for 20% of visits to result in a sale, I could posit a prior like a \(\text{Beta}(0.4,5)\) prior that captures (a version of) these two pieces of information.

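## Step 1:
## Squared distance between the Beta(x[1], x[2]) quantiles and the targets
## (median 0.02, 90th percentile 0.2).
fn <- \(x) (qbeta(0.5, x[1], x[2]) - 0.02)^2 +
  (qbeta(0.9, x[1], x[2]) - 0.2)^2

## Minimise that distance over the two Beta shape parameters.
best <- optim(c(1/2, 1/2), fn)

## Step 3: Profit.
## (AKA round and check)
qbeta(0.9, 0.4, 5)
qbeta(0.5, 0.4, 5)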
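
To see the "integrate it up" step in action, here is a minimal sketch using the \(\text{Beta}(0.4, 5)\) prior from Scenario 1. The set \(A = \{\theta > 0.2\}\) and the variable names are choices made for illustration; `pbeta`, `dbeta`, and `integrate` are all base R.

```r
# Shape parameters of the Beta(0.4, 5) prior from Scenario 1.
a <- 0.4
b <- 5

# Pr(theta > 0.2) from the Beta distribution function.
p_closed <- pbeta(0.2, a, b, lower.tail = FALSE)

# The same probability, obtained by integrating the density over A = (0.2, 1).
p_numeric <- integrate(\(theta) dbeta(theta, a, b), lower = 0.2, upper = 1)$value

c(closed_form = p_closed, numeric = p_numeric)
```

The two numbers should agree up to integration error, which is all that "the density only means something once you integrate it" is saying.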

Scenario 2: Let’s imagine I want to do variable selection. I don’t know why. I was just told I want to do variable selection. So I fire up the Bayesian Lasso⁵ and then threshold in some way. In this case, the prior encodes a hoped-for property of my posterior. (To paraphrase Lana, hope is a dangerous thing for a woman like you to have, because the Bayesian Lasso does not work, to the point that the original paper doesn’t even suggest using it for variable selection⁶. It just, idk, liked the name. Statistics is wild.)

Scenario 3: I’m doing a regression with just one variable (because why not) and I think that the relationship between the response \(y\) and the covariate \(x\) is non-linear. That is, I think there is some unknown to me function \(f(x)\) such that \(\mathbb{E}(y_i) = f(x_i)\). So I ask a friend and they tell me to use a Gaussian Process prior for \(f(\cdot)\) with an exponential covariance function.

While I can write down the density for the joint prior of \((f(x_1), f(x_2), \ldots, f(x_n))\), I do not know⁷ what this prior means in any substantive sense. But I can tell you, you’re gonna need that maths degree to even try.
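
For what it is worth, writing down and sampling from that joint prior is the easy part. The sketch below assumes that "exponential covariance function" means \(k(x, x') = \sigma^2 \exp(-|x - x'|/\ell)\); the values of `sigma` and `ell` and the grid `x` are placeholders, not recommendations.

```r
set.seed(1)

x <- seq(0, 1, length.out = 50)   # hypothetical covariate values
sigma <- 1                        # assumed marginal standard deviation
ell <- 0.2                        # assumed length scale

# Covariance matrix of (f(x_1), ..., f(x_n)) under the assumed kernel.
K <- sigma^2 * exp(-abs(outer(x, x, "-")) / ell)

# chol() returns an upper-triangular R with t(R) %*% R = K, so t(R) %*% z
# has covariance K when z is a vector of independent standard normals.
R <- chol(K)
z <- matrix(rnorm(length(x) * 3), nrow = length(x))
f_draws <- t(R) %*% z             # each column is one prior draw of f at x
```

Calling `matplot(x, f_draws, type = "l")` plots the three draws. Being able to draw them is not the same as knowing what the prior means, which is rather the point.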

And should you look deeper, you will find more and more scenarios where priors are doing different things for different reasons⁸. For each of these priors in each of these scenarios, we will be able to compute the posterior (or a reasonable computational approximation to it) and then work with that posterior to answer our questions.

Different people⁹ will use priors in different ways even for very similar problems¹⁰. This remains true even though they are nominally working under the same inferential framework.

Bayesians are chaotic.

Mapping out a sky / What you feel like, planning a sky

Sondheim’s ode to pointillism feels relevant here. The reality of the prior distribution—and the whole reason the concept is so slippery and chaotic—is that you are, dot by dot, constructing the world of your inference. This act of construction is fundamental to understanding how Bayesian methods work, how to justify your choices, and how to use a Bayesian workflow to solve complex problems.

To torture the metaphor, our prior distribution is just our paint, unmixed, slowly congealing, possibly made of ground up mummies. It is nothing without a painter and a brush.

The painter is the likelihood or, more generally, the generative link between the parameter values and the actual data, \(p(y \mid \theta)\). The brush is the computational engine you use to actually produce the posterior painting¹¹.

This then speaks to the core challenge with writing about priors: it depends on how you use them. Thinking about them in isolation is a fallacy, or perhaps a foolishness, or perhaps a heresy¹². Hell, when trying to understand a single inference, The Prior Can Only Be Understood In The Context Of The Likelihood¹³. In the context of an entire workflow, The Experiment is just as Important as the Likelihood in Understanding the Prior.

For instance, using independent Cauchy priors for the coefficients in a linear regression model will result in a perfectly ok posterior. Whereas if you use the same priors in a logistic regression, you may end up with posteriors with such heavy tails that they don’t have a mean! (Do we care? Well, yes. If we want reasonable uncertainty intervals we probably want 2 or so moments, otherwise those large deviations are gonna getcha!)
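
One way to see where the trouble can come from is a toy logistic regression with a single slope \(\beta\), no intercept, and completely separated data. Separation is an assumption added here for illustration, as are the four made-up data points. In that setting the likelihood climbs to 1 as \(\beta \to \infty\), so the right tail of the posterior (likelihood times Cauchy density) is essentially the Cauchy tail, and a Cauchy has no mean.

```r
# Made-up, completely separated data: y = 1 exactly when x > 0.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# Likelihood of a no-intercept logistic regression, evaluated at each slope.
lik <- \(beta) sapply(beta, \(b) prod(dbinom(y, size = 1, prob = plogis(b * x))))

beta_grid <- c(1, 10, 100, 1000)
lik_vals <- lik(beta_grid)

# The likelihood never pulls the tail down, so the unnormalised posterior
# lik(beta) * dcauchy(beta) decays only as fast as the prior does.
data.frame(beta = beta_grid, likelihood = lik_vals)
```

In a linear regression with known noise variance, the Gaussian likelihood decays like \(\exp(-c\beta^2)\) in \(\beta\), which is more than enough to tame the Cauchy tails.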

So what?

All of this is fascinating. And it is a lot less chaotic than it initially sounds.

The reality is that two Bayesians may use different priors and, hence, produce different posteriors for the same data set. This can be extreme. For example, if I am trying to estimate the mean of data generated by \(y_i \sim N(\mu, 1)\), then I can choose a prior¹⁴ (that depends on the data) so that the posterior mean \(\mathbb{E}(\mu \mid y) = 1\). Or, to put it differently, I can get any answer I want if I choose a prior carefully (and in a data-dependent manner).
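
To see why the footnoted prior \(N(2-\bar{y}, n^{-1})\) does the trick, here is the standard conjugate normal calculation, sketched assuming \(n\) independent observations with sample mean \(\bar{y}\) and a generic prior mean \(m\) (a symbol introduced here just for the derivation). If \(\mu \sim N(m, n^{-1})\) and \(y_i \mid \mu \sim N(\mu, 1)\), then the prior precision \(n\) and the data precision \(n\) add, and the posterior mean is the precision-weighted average: \[ \mu \mid y \sim N\left(\frac{n m + n \bar{y}}{2n},\, \frac{1}{2n}\right) = N\left(\frac{m + \bar{y}}{2},\, \frac{1}{2n}\right). \] Setting \(m = 2 - \bar{y}\) gives \(\mathbb{E}(\mu \mid y) = (2 - \bar{y} + \bar{y})/2 = 1\), whatever the data say.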

But this isn’t necessarily a problem. This is because the posteriors produced by two sensible priors for the same problem will be fairly similar¹⁵. The prior I used to cheat in the previous example would not be considered sensible by anyone looking at it¹⁶.
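
As a rough illustration of "fairly similar", here is a sketch that goes back to the sales example from Scenario 1. The data (30 sales in 1000 visits) are invented purely for illustration, and the second prior, \(\text{Beta}(1, 30)\), is a hypothetical alternative that does not come from the original analysis. Both posteriors use the standard conjugate update for a binomial likelihood.

```r
# Invented data: 30 sales out of 1000 visits.
n <- 1000
s <- 30

# Two candidate priors, each given as Beta shape parameters.
priors <- list(
  A = c(0.4, 5),   # the prior from Scenario 1
  B = c(1, 30)     # hypothetical alternative, prior mean about 3%
)

# Conjugate update: Beta(a, b) prior -> Beta(a + s, b + n - s) posterior.
post_summary <- sapply(priors, \(p) {
  a <- p[1] + s
  b <- p[2] + n - s
  c(mean = a / (a + b), lower = qbeta(0.05, a, b), upper = qbeta(0.95, a, b))
})

post_summary
```

With this much data the two columns should be close. Whether they are close enough depends on the decision you are trying to make.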

But what is a sensible prior? Can you tell if a prior is sensible or not in its particular context? Well honey, how long have you got. The thing about starting a (potential) series of blog posts is that I don’t really know how far I’m going to get, but I would really like to talk a lot about that over the next little while.

Footnotes

¹ The rest are about the night I saw Patti LuPone trying to get through the big final scene in War Paint as part of her costume loudly disintegrated.
² I’m told it’s useful to warm up sometimes because this pandemic has me ice cold.
³ Sometimes.
⁴ Except, and I cannot stress this enough, when it doesn’t.
⁵ Please do not do this!
⁶ Except for once in the abstract in a sentence that is in no way shape or form backed up in the text. Park and Casella (2008)
⁷ I do know. I know a very large amount about Gaussian processes. But lord in heaven I have seen the greatest minds of my generation subtly fuck up the interpretation of GP priors. Because it’s incredibly hard. Maybe I’ll blog about it one day. Because this is in markdown so I can haz equations.
⁸ Some reasons are excellent. Some, like the poor Bayesian Lasso, are simply misguided.
⁹ Or the same person in different contexts.
¹⁰ Are any two statistical problems ever the same?
¹¹ Yes. I have a lot of feelings about this too, but meh. A good artist can make great art with minimal equipment (see When Doves Cry), but most people are not the genius Prince was so just use good tools and stress less!
¹² I have written extensively about priors in the context of the Arianist heresy because of course I fucking have. Part 1, Part 2, Part 3. Apologies for mathematics eaten by a recent formatting change!
¹³ Editors forced the word often into the published title and, like, who’s going to fight?
¹⁴ \(N(2-\bar{y},n^{-1})\)
¹⁵ What does this even mean? Depends on your context really. But a working definition is that the big picture features of the posterior are similar enough that if you were to use it to make a decision, that decision doesn’t change very much.
¹⁶ But omg subtle high dimensional stuff and I guess I’ll talk about that later maybe too?

Reuse

CC BY-NC 4.0

Citation

BibTeX citation:

@online{simpson2021,
  author = {Simpson, Dan},
  title = {Priors: {Night} Work {(Track} 1)},
  date = {2021-10-15},
  url = {https://dansblog.netlify.app/priors1},
  langid = {en}
}

For attribution, please cite this work as:

Simpson, Dan. 2021. “Priors: Night Work (Track 1).” October 15, 2021. https://dansblog.netlify.app/priors1.