How The Posterior Got Its Width - A visual story about uncertainty in Bayesian models.

Before you read

This post assumes you're comfortable with:

  1. Basic mathematical notation[1].
  2. Probability density functions (PDFs). What it means for a continuous random variable to have a density, and why density isn't probability. Fuzzy? Check out the What does density mean? section.
  3. Bayes' theorem in the inference setting. Specifically, the form $f(\theta \mid D) \propto f(D \mid \theta) f(\theta)$: prior, likelihood, posterior. More here: [2]
  4. Linear regression at the level of $y = \beta_0 + \beta_1 x + \epsilon$.
  5. The Gaussian distribution. That it's defined (i.e., parameterized) by a mean ($\mu$) and a variance ($\sigma^2$).

What is a probabilistic model doing?

What we are after is a model of an input-output relationship: a function $g : X \rightarrow Y$. We begin with the assumption that $Y$ is a random variable. For any given $x$, $y$ can take one of many values with some probability.

The uncertainty ladder

  1. Rung 1 - A deterministic model (e.g., a standard neural network or ordinary least squares regression (OLS)) maps a single input to a single point prediction: $g_{\theta} : x \rightarrow \hat{y}$, where $\theta$ denotes the set of parameters that helps define the function $g$ (e.g., the weights of a neural network or the $\beta$s in a linear regression). Take the example of a univariate OLS. We assume that the data-generating function is $Y = \beta_0 + \beta_1 x + \epsilon$, a linear function of $x$ plus error. That error is exactly aleatoric uncertainty: the irreducible variation in $y$ that remains even when $x$ is known exactly. For OLS we assume $\epsilon \sim \mathcal{N}(0, \sigma^2)$; all of the randomness is contained in this term, and this assumption also implies that $y \mid x$ is normally distributed. We model the mean of $Y$ at each value of $x$: $\mathbb{E}[Y \mid X; \theta] = \beta_0 + \beta_1 X$. The noise hasn't disappeared, it's still in $Y$; a mean-zero noise term simply drops out under expectation. $Y$ is a random variable, that's the whole premise, so at any fixed $x$ there is a distribution of possible $y$ values, and $\mathbb{E}[Y \mid X = x]$ collapses that distribution down to a single number, the mean. OLS gives you a function that takes $x$ in and gives a single number out. This is useful, but it has thrown away everything else about the distribution of $y$ at that $x$.
  2. Rung 2 - A probabilistic model avoids collapsing the uncertainty around $y \mid x$ to a single number. For a fixed $\theta$, the model specifies a full distribution over $y$ at each $x$. What this distribution should look like is a choice we make. We can suppose a normal distribution, $y \mid x \sim \mathcal{N}(g_{\theta}(x), \sigma^2)$: a Gaussian whose mean is given by some function (linear, a neural network, or something else) and whose variance captures the noise around that mean. This is the key difference from rung 1. There, we only modeled $\mathbb{E}[Y \mid X]$; $\sigma^2$ lived in our assumption about the data-generating process but wasn't part of what we fit. Here, it's a first-class parameter, estimated jointly with the rest via maximum likelihood: find a single $\theta$ that makes the observed $(x_i, y_i)$ pairs jointly most probable under the model [3]. This lets you make statements such as "there is a 95% chance that $y$ falls in this interval given this $x$", something rung 1 could not do. Fit via MLE, you end up with a single best guess for the parameters, $\hat{\theta}$. It tells you how uncertain $y$ is given $\theta$, but not how uncertain $\theta$ is given the data. (Quantifying parameter uncertainty is certainly possible at this level, but it requires extra machinery that still only provides an approximation of that uncertainty, among other problems that I don't quite understand yet (^_^).)
  3. Rung 3 - A Bayesian model takes it a step further and places a distribution over $\theta$ itself. We start with the prior $f(\theta)$ (in standard statistical lingo, $f$ denotes a probability density function and $F$ a cumulative distribution function), a belief about plausible parameter values before seeing any data, and update it to get the posterior, $f(\theta \mid D)$. The result now carries two layers of uncertainty: the noise around what the model predicts (aleatoric), and our uncertainty about the parameters themselves (epistemic: uncertainty due to lack of knowledge, whether from not seeing enough data or because the model is incomplete). Unlike rung 2, the model can now express the difference between having seen 10 data points and 10,000, the subject of the rest of this post.
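The three rungs can be contrasted in a minimal pure-Python sketch. All numbers here, including the tiny discrete "posterior", are made up for illustration:

```python
# Rung 1: a deterministic model returns a single number per input.
def rung1_predict(x, beta0=1.0, beta1=2.0):
    return beta0 + beta1 * x  # E[Y | X = x], nothing else

# Rung 2: a probabilistic model returns a full distribution per input,
# here summarized by its parameters (y | x ~ N(mu, sigma^2)).
def rung2_predict(x, beta0=1.0, beta1=2.0, sigma2=0.25):
    return {"mean": beta0 + beta1 * x, "var": sigma2}

# Rung 3: a Bayesian model returns one distribution per plausible theta,
# weighted by the posterior. A toy discrete posterior over (beta0, beta1):
posterior = [((0.9, 2.1), 0.5), ((1.1, 1.9), 0.5)]  # (theta, weight) pairs

def rung3_predict(x, sigma2=0.25):
    return [({"mean": b0 + b1 * x, "var": sigma2}, w)
            for (b0, b1), w in posterior]

print(rung1_predict(3.0))  # a single point
print(rung2_predict(3.0))  # a distribution at x = 3
print(rung3_predict(3.0))  # a weighted set of distributions
```

Rung 1 throws away the spread, rung 2 keeps it for a fixed $\theta$, and rung 3 keeps a whole family of spreads, one per surviving $\theta$.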

What is $\theta$?

Let's continue with our linear regression example. First consider a standard linear regression of the form:

$$
\theta = \{\beta_0, \beta_1\} \\
\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 \cdot X
$$

For a given $x$, we get a new $y$. Now, instead, suppose the parameters $\theta$ define a distribution over $y$ (for this toy example, the variance is modeled linearly, which would allow negative values; in reality, we would prevent this with a transformation, e.g., a log scale. Left out here for clarity):

$$
\theta = \{\beta_0, \beta_1, \beta_2, \beta_3\} \\
\mu = \beta_0 + \beta_1 \cdot x \\
\sigma^2 = \beta_2 + \beta_3 \cdot x \\
y \sim \mathcal{N}(\mu, \sigma^2)
$$

$\theta$ defines a normal distribution, and for each $x$, you get a new distribution. $x$ in, a distribution over $y$ at that $x$ out.
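As a sketch of that mapping (the $\beta$ values are made up): feed in an $x$, get back the parameters of a distribution over $y$.

```python
def predictive_params(theta, x):
    """Map an input x to the parameters of a normal distribution over y."""
    b0, b1, b2, b3 = theta
    mu = b0 + b1 * x
    sigma2 = b2 + b3 * x  # toy example: could go negative, see caveat above
    return mu, sigma2

theta = (1.0, 2.0, 0.5, 0.5)  # made-up values for illustration
mu, sigma2 = predictive_params(theta, x=3.0)
print(mu, sigma2)  # 7.0 2.0
```

One $\theta$, many $x$s, and each $x$ yields its own $(\mu, \sigma^2)$ pair.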

In traditional ML, we would try to minimize some loss to get to a single $\theta$ (i.e., a model). In the probabilistic setting, we instead start with a distribution over $\theta$. Say we start with the assumption that any value of $\theta$ is equally likely; then the probability mass (note that all of the mass sums to 1) is spread evenly over all possible $\theta$s within a range $[a, b]$. The idea is to shift this mass around as informed by the data (i.e., the evidence) we have collected.

We start with a prior distribution, $f(\theta)$ (a tricky object: in this example, $\theta$ consists of 4 parameters, each of which can take a value in some range, and they don't all have to lie in the same $[a, b]$. That is difficult to visualize, so read $\theta$ as a compact way to say "the joint distribution over the parameters that define the predictive distribution of $y$"):

$$
\theta \sim U[a, b]
$$

and update it using our evidence (i.e., the likelihood) to get a posterior distribution:

$$
f(\theta \mid D) \propto f(D \mid \theta) \times f(\theta)
$$

here $f$ denotes a PDF and $D$ is the data: a collection of samples that spans the input space, varying across the input space and carrying some noise in the output space. Let's suppose our data looks like this in the input space.

$(x, y)$ pairs are our evidence. A lot of it comes from the yellow region and very little from the green. Recall that for each $x$ we get a distribution over $y$ out. Let's consider an example.

What's happening here? For this data point, we evaluate how likely $y = 10$ is under each $\theta$. The full likelihood $f(D \mid \theta)$ multiplies these across all points.

For $x = 3$, you get a set of distributions out. Which of these could have plausibly generated $y = 10$? For the second, yes: 10 is comfortably around the peak (high density). For the third, less so; it's way out in the tail (low density).

Literally: what is the likelihood of the data point given these parameters?
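A minimal sketch of that evaluation, with made-up $\theta$ values: for each candidate $\theta$, build the normal distribution over $y$ at $x = 3$ and read off the density of $y = 10$.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def density_at(theta, x, y):
    """Density of y under the predictive distribution that theta builds at x."""
    b0, b1, b2, b3 = theta
    mu = b0 + b1 * x
    sigma2 = b2 + b3 * x
    return normal_pdf(y, mu, sigma2)

# Two candidate thetas (made up): one whose predictive mean at x = 3
# sits right at y = 10, and one whose mean leaves y = 10 far in the tail.
near = (1.0, 3.0, 0.5, 0.5)  # mu = 10.0 at x = 3 -> high density
far = (1.0, 0.5, 0.5, 0.5)   # mu = 2.5 at x = 3  -> low density
print(density_at(near, 3.0, 10.0) > density_at(far, 3.0, 10.0))  # True
```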

Now, let's plug the visual pieces into the Bayesian update, $f(\theta \mid D) \propto f(D \mid \theta) f(\theta)$:

The likelihood shapes $f(\theta \mid x = 3)$. High-likelihood $\theta$s get scaled up, low-likelihood $\theta$s get squashed down. Over many examples, the process looks like this:
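That loop can be sketched with a crude discrete grid over $\theta$, a stand-in for the continuous update (the grid, the fixed noise variance, and the data are all made up):

```python
import math

def normal_pdf(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# A crude grid over theta = (beta0, beta1); noise variance fixed for simplicity.
grid = [(b0, b1) for b0 in (0.0, 1.0, 2.0) for b1 in (1.0, 2.0, 3.0)]
weights = {theta: 1.0 / len(grid) for theta in grid}  # uniform prior

data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]  # roughly y = 1 + 2x + noise

for x, y in data:
    for (b0, b1) in grid:
        mu = b0 + b1 * x
        weights[(b0, b1)] *= normal_pdf(y, mu, 0.5)  # likelihood multiplication
    total = sum(weights.values())
    weights = {t: w / total for t, w in weights.items()}  # renormalize

best = max(weights, key=weights.get)
print(best)  # the theta the evidence agrees with most
```

Each data point reweights every $\theta$ on the grid by how well that $\theta$'s predictive distribution explains the point; $\theta$s that keep disagreeing get multiplied toward zero.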

How does the distribution of data in the input space play into the training process, and what information does the Bayesian model capture about it?

Recall again that each $x$ generates a distribution $f(y \mid x, \theta)$, but to do that, we first need to sample a $\theta$ from its distribution to get parameters. During training, $\theta$ was shaped by the likelihood (i.e., the data).

Most of the data will come from the yellow region. Over many samples, as the prior-to-posterior loop goes on, samples in the yellow region contribute many likelihood multiplications, constraining which $\theta$s are plausible for $x$ in that region. When shaping the posterior, for each example we check whether a given $\theta$ agrees with it (i.e., what's the likelihood of $y$ given this $\theta$?). High-agreement $\theta$s survive (weighted up by the likelihood), low-agreement $\theta$s don't (weighted down by the likelihood). These $\theta$s were mostly tested against data in the yellow region. They were asked, often: does the distribution over $y$ built with you give high likelihood to $y = 10$ when $x = 3$?

Note that these $\theta$s also say something about when $x = 10$. You pump $x = 10$ through and get many distributions over $y$: $f(y \mid x = 10, \theta)$. However, the $\theta$s were rarely tested at $x = 10$. That is, very few likelihood multiplications $f(y \mid x = 10, \theta)$ were applied to squash $\theta$s that disagreed with $y$ values at $x = 10$. What we want ideally is a distribution over $\theta$ that does well in both regions of $x$, but we don't have much evidence in this region. Many of the $\theta$s, when evaluated at $x = 10$, will produce distributions centered at very different values of $y$ because they were never forced to converge during training.

So when we pump a new example through the machinery and get a set of distributions $f(y \mid x = 10, \theta)$ out, the means of these distributions will not be clustered around some number. Thus when you take the average distribution over all of these $\theta$s, the posterior predictive distribution $f(y \mid x, D) = \int f(y \mid x, \theta)\, f(\theta \mid D)\, d\theta$, it will have a big spread, reflecting that there was not much training data in this region: epistemic uncertainty.
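A sketch of why that average fans out, using a hand-built, hypothetical set of surviving $\theta$s: they agree near $x = 3$ (where the data lived) but disagree at $x = 10$ (where it didn't), so the per-$\theta$ predictive means spread out far from the data.

```python
import statistics

# Hypothetical thetas that survived the update: all satisfy b0 + 3*b1 = 7,
# so they make the same prediction at x = 3 but diverge elsewhere.
survivors = [(1.0, 2.0), (1.3, 1.9), (0.7, 2.1), (1.6, 1.8), (0.4, 2.2)]

def predictive_means(x):
    """Mean of f(y | x, theta) for each surviving theta."""
    return [b0 + b1 * x for b0, b1 in survivors]

spread_near = statistics.pstdev(predictive_means(3.0))   # data-rich region
spread_far = statistics.pstdev(predictive_means(10.0))   # data-poor region
print(spread_near < spread_far)  # True: wider predictive at x = 10
```

Averaging these per-$\theta$ distributions is a Monte-Carlo version of the posterior predictive integral above: narrow where the means cluster, wide where they fan out.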

And so, that is how the posterior (predictive distribution) gets its width. In the regions the likelihood visited often, the posterior over $\theta$ got squeezed, the sampled distributions over $y$ agree, and their average is narrow. In the regions the likelihood barely touched, many $\theta$s survived the update, each telling a different story about $y$, and their average distribution fans out. Epistemic uncertainty is the visible residue of how much the data got to push the prior around at each $x$.

There's more to the story, though. This is the "more data would help" flavor: uncertainty in the parameters $\theta$. The other is uncertainty about the model family itself. A linear model fit to sinusoidal data will produce confident, narrow predictive bands in data-rich regions and be confidently wrong. Fixing that kind of uncertainty requires a better model, not a bigger dataset. But that's for another time.

References

  1. Glossary of mathematical notation. https://en.wikipedia.org/wiki/Glossary_of_mathematical_symbols
  2. Updating Priors. https://udesh.io/Updating_priors.html
  3. Maximum likelihood estimation. https://www.probabilitycourse.com/chapter8/8_2_3_max_likelihood_estimation.php