How The Posterior Got Its Width - A visual story about uncertainty in Bayesian models.

Before you read

This post assumes you're comfortable with:

  1. Basic mathematical notation[1].
  2. Probability density functions (PDFs). What it means for a continuous random variable to have a density, and why density isn't probability. Fuzzy? Check out the What does density mean? section.
  3. Bayes' theorem in the inference setting. Specifically, the form $f(\theta \mid D) \propto f(D \mid \theta) f(\theta)$: prior, likelihood, posterior. More here: [2]
  4. Linear regression at the level of $y = \beta_0 + \beta_1 x + \epsilon$.
  5. The Gaussian distribution. That it's defined (i.e., parameterized) by a mean ($\mu$) and a variance ($\sigma^2$).

What is a probabilistic model doing?

What we are after is a model of an input-output relationship: a function $g : X \rightarrow Y$. We begin with the assumption that $Y$ is a random variable. For any given $x$, $y$ can take one of many values with some probability.

The uncertainty ladder

  1. Rung 1 - A deterministic model (e.g., a standard neural network or ordinary least squares regression (OLS)) maps a single input to a single point prediction: $g_{\theta} : x \rightarrow \hat{y}$, where $\theta$ denotes the set of parameters that helps define the function $g$ (e.g., the weights of a neural network or the $\beta$s in a linear regression). Take the example of a univariate OLS. We assume that the data-generating function is $Y = \beta_0 + \beta_1 x + \epsilon$, a linear function of $x$ plus error. That error is exactly aleatoric uncertainty: the irreducible variation in $y$ that remains even when $x$ is known exactly. For OLS we assume $\epsilon \sim \mathcal{N}(0, \sigma^2)$; all of the randomness is contained in this term, and this assumption also implies that $y \mid x$ is normally distributed. We model the mean of $Y$ at each value of $x$: $\mathbb{E}[Y \mid X; \theta] = \beta_0 + \beta_1 X$. The noise hasn't disappeared, it's still in $Y$; a mean-zero noise term simply drops out under expectation. $Y$ is a random variable, that's the whole premise, so at any fixed $x$ there is a distribution of possible $y$ values, and $\mathbb{E}[Y \mid X = x]$ collapses that distribution down to a single number, the mean. OLS gives you a function that takes $x$ in and gives a single number out. This is useful, but it has thrown away everything else about the distribution of $y$ at that $x$.
  2. Rung 2 - A probabilistic model avoids collapsing the uncertainty around $y \mid x$ to a single number. For a fixed $\theta$, the model specifies a full distribution over $y$ at each $x$. What this distribution should look like is a choice we make. We can suppose a normal distribution, $y \mid x \sim \mathcal{N}(g_{\theta}(x), \sigma^2)$: a Gaussian whose mean is given by some function (linear, a neural network, or something else) and whose variance captures the noise around that mean. This is the key difference from rung 1. There, we only modeled $\mathbb{E}[Y \mid X]$; $\sigma^2$ lived in our assumption about the data-generating process but wasn't part of what we fit. Here, it's a first-class parameter, estimated jointly with the rest via maximum likelihood: find a single $\theta$ that makes the observed $(x_i, y_i)$ pairs jointly most probable under the model [3]. This lets you make statements such as "there is a 95% chance that $y$ falls in this interval given this $x$", something rung 1 could not do. Fit via MLE, you end up with a single best guess for the parameters, $\hat{\theta}$. It tells you how uncertain $y$ is given $\theta$, but not how uncertain $\theta$ is given the data. (Quantifying parameter uncertainty is certainly possible at this level, but it requires extra machinery that still only provides an approximation of that uncertainty, among other problems that I don't quite understand yet (^_^).)
  3. Rung 3 - A Bayesian model takes it a step further and places a distribution over $\theta$ itself. We start with the prior $f(\theta)$ (in standard statistical lingo, $f$ denotes a probability density function and $F$ a cumulative distribution function), a belief about plausible parameter values before seeing any data, and update it to get the posterior, $f(\theta \mid D)$. The result now carries two layers of uncertainty: the noise around what the model predicts (aleatoric), and our uncertainty about the parameters themselves (epistemic: uncertainty due to lack of knowledge, whether from not seeing enough data or because the model is incomplete). Unlike rung 2, the model can now express the difference between having seen 10 data points and 10,000, the subject of the rest of this post.
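The three rungs can be contrasted in a minimal pure-Python sketch. All numbers here, including the tiny discrete "posterior", are made up for illustration:

```python
# Rung 1: a deterministic model returns a single number per input.
def rung1_predict(x, beta0=1.0, beta1=2.0):
    return beta0 + beta1 * x  # E[Y | X = x], nothing else

# Rung 2: a probabilistic model returns a full distribution per input,
# here summarized by its parameters (y | x ~ N(mu, sigma^2)).
def rung2_predict(x, beta0=1.0, beta1=2.0, sigma2=0.25):
    return {"mean": beta0 + beta1 * x, "var": sigma2}

# Rung 3: a Bayesian model returns one distribution per plausible theta,
# weighted by the posterior. A toy discrete posterior over (beta0, beta1):
posterior = [((0.9, 2.1), 0.5), ((1.1, 1.9), 0.5)]  # (theta, weight) pairs

def rung3_predict(x, sigma2=0.25):
    return [({"mean": b0 + b1 * x, "var": sigma2}, w)
            for (b0, b1), w in posterior]

print(rung1_predict(3.0))  # a single point
print(rung2_predict(3.0))  # a distribution at x = 3
print(rung3_predict(3.0))  # a weighted set of distributions
```

Rung 1 throws away the spread, rung 2 keeps it for a fixed $\theta$, and rung 3 keeps a whole family of spreads, one per surviving $\theta$.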

What is $\theta$?

Let's continue with our linear regression example. First consider a standard linear regression of the form:

$$
\theta = \{\beta_0, \beta_1\} \\
\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 \cdot X
$$

For a given $x$, we get a new $y$. Now, instead, suppose the parameters $\theta$ define a distribution over $y$ (for this toy example, the variance is modeled linearly, which would allow negative values; in reality, we would prevent this with a transformation, e.g., a log scale. Left out here for clarity):

$$
\theta = \{\beta_0, \beta_1, \beta_2, \beta_3\} \\
\mu = \beta_0 + \beta_1 \cdot x \\
\sigma^2 = \beta_2 + \beta_3 \cdot x \\
y \sim \mathcal{N}(\mu, \sigma^2)
$$

$\theta$ defines a normal distribution, and for each $x$, you get a new distribution. $x$ in, a distribution over $y$ at that $x$ out.
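As a sketch of that mapping (the $\beta$ values are made up): feed in an $x$, get back the parameters of a distribution over $y$.

```python
def predictive_params(theta, x):
    """Map an input x to the parameters of a normal distribution over y."""
    b0, b1, b2, b3 = theta
    mu = b0 + b1 * x
    sigma2 = b2 + b3 * x  # toy example: could go negative, see caveat above
    return mu, sigma2

theta = (1.0, 2.0, 0.5, 0.5)  # made-up values for illustration
mu, sigma2 = predictive_params(theta, x=3.0)
print(mu, sigma2)  # 7.0 2.0
```

One $\theta$, many $x$s, and each $x$ yields its own $(\mu, \sigma^2)$ pair.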

In traditional ML, we would try to minimize some loss to get to a single $\theta$ (i.e., a model). In the probabilistic setting, we instead start with a distribution over $\theta$. Say we start with the assumption that any value of $\theta$ is equally likely; then the probability mass (note that all of the mass sums to 1) is spread evenly over all possible $\theta$s within a range $[a, b]$. The idea is to shift this mass around as informed by the data (i.e., the evidence) we have collected.

We start with a prior distribution, $f(\theta)$ (a tricky object: in this example, $\theta$ consists of 4 parameters, each of which can take a value in some range, and they don't all have to lie in the same $[a, b]$. That is difficult to visualize, so read $\theta$ as a compact way to say "the joint distribution over the parameters that define the predictive distribution of $y$"):

$$
\theta \sim U[a, b]
$$

and update it using our evidence (i.e., the likelihood) to get a posterior distribution:

$$
f(\theta \mid D) \propto f(D \mid \theta) \times f(\theta)
$$

here $f$ denotes a PDF and $D$ is the data: a collection of samples that spans the input space, varying across the input space and carrying some noise in the output space. Let's suppose our data looks like this in the input space.

$(x, y)$ pairs are our evidence. A lot of it comes from the yellow region and very little from the green. Recall that for each $x$ we get a distribution over $y$ out. Let's consider an example.

What's happening here? For this data point, we evaluate how likely $y = 10$ is under each $\theta$. The full likelihood $f(D \mid \theta)$ multiplies these across all points.

For $x = 3$, you get a set of distributions out. Which of these could have plausibly generated $y = 10$? For the second, yes: 10 is comfortably around the peak (high density). For the third, less so; it's way out in the tail (low density).

Literally: what is the likelihood of the data point given these parameters?
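A minimal sketch of that evaluation, with made-up $\theta$ values: for each candidate $\theta$, build the normal distribution over $y$ at $x = 3$ and read off the density of $y = 10$.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def density_at(theta, x, y):
    """Density of y under the predictive distribution that theta builds at x."""
    b0, b1, b2, b3 = theta
    mu = b0 + b1 * x
    sigma2 = b2 + b3 * x
    return normal_pdf(y, mu, sigma2)

# Two candidate thetas (made up): one whose predictive mean at x = 3
# sits right at y = 10, and one whose mean leaves y = 10 far in the tail.
near = (1.0, 3.0, 0.5, 0.5)  # mu = 10.0 at x = 3 -> high density
far = (1.0, 0.5, 0.5, 0.5)   # mu = 2.5 at x = 3  -> low density
print(density_at(near, 3.0, 10.0) > density_at(far, 3.0, 10.0))  # True
```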

Now, let's plug the visual pieces into the Bayesian update, $f(\theta \mid D) \propto f(D \mid \theta) f(\theta)$:

The likelihood shapes $f(\theta \mid x = 3)$. High-likelihood $\theta$s get scaled up, low-likelihood $\theta$s get squashed down. Over many examples, the process looks like this:
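That loop can be sketched with a crude discrete grid over $\theta$, a stand-in for the continuous update (the grid, the fixed noise variance, and the data are all made up):

```python
import math

def normal_pdf(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# A crude grid over theta = (beta0, beta1); noise variance fixed for simplicity.
grid = [(b0, b1) for b0 in (0.0, 1.0, 2.0) for b1 in (1.0, 2.0, 3.0)]
weights = {theta: 1.0 / len(grid) for theta in grid}  # uniform prior

data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]  # roughly y = 1 + 2x + noise

for x, y in data:
    for (b0, b1) in grid:
        mu = b0 + b1 * x
        weights[(b0, b1)] *= normal_pdf(y, mu, 0.5)  # likelihood multiplication
    total = sum(weights.values())
    weights = {t: w / total for t, w in weights.items()}  # renormalize

best = max(weights, key=weights.get)
print(best)  # the theta the evidence agrees with most
```

Each data point reweights every $\theta$ on the grid by how well that $\theta$'s predictive distribution explains the point; $\theta$s that keep disagreeing get multiplied toward zero.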

How does the distribution of data in the input space play into the training process, and what information does the Bayesian model capture about it?

Recall again that each $x$ generates a distribution $f(y \mid x, \theta)$, but to do that, we first need to sample a $\theta$ from its distribution to get parameters. During training, $\theta$ was shaped by the likelihood (i.e., the data).

Most of the data will come from the yellow region. Over many samples, as the prior-to-posterior loop goes on, samples in the yellow region contribute many likelihood multiplications, constraining which $\theta$s are plausible for $x$ in that region. When shaping the posterior, for each example we check whether a given $\theta$ agrees with it (i.e., what's the likelihood of $y$ given this $\theta$?). High-agreement $\theta$s survive (weighted up by the likelihood), low-agreement $\theta$s don't (weighted down by the likelihood). These $\theta$s were mostly tested against data in the yellow region. They were asked, often: does the distribution over $y$ built with you give high likelihood to $y = 10$ when $x = 3$?

Note that these $\theta$s also say something about when $x = 10$. You pump $x = 10$ through and get many distributions over $y$: $f(y \mid x = 10, \theta)$. However, the $\theta$s were rarely tested at $x = 10$. That is, very few likelihood multiplications $f(y \mid x = 10, \theta)$ were applied to squash $\theta$s that disagreed with $y$ values at $x = 10$. What we want ideally is a distribution over $\theta$ that does well in both regions of $x$, but we don't have much evidence in this region. Many of the $\theta$s, when evaluated at $x = 10$, will produce distributions centered at very different values of $y$ because they were never forced to converge during training.

So when we pump a new example through the machinery and get a set of distributions $f(y \mid x = 10, \theta)$ out, the means of these distributions will not be clustered around some number. Thus when you take the average distribution over all of these $\theta$s, the posterior predictive distribution $f(y \mid x, D) = \int f(y \mid x, \theta)\, f(\theta \mid D)\, d\theta$, it will have a big spread, reflecting that there was not much training data in this region: epistemic uncertainty.
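A sketch of why that average fans out, using a hand-built, hypothetical set of surviving $\theta$s: they agree near $x = 3$ (where the data lived) but disagree at $x = 10$ (where it didn't), so the per-$\theta$ predictive means spread out far from the data.

```python
import statistics

# Hypothetical thetas that survived the update: all satisfy b0 + 3*b1 = 7,
# so they make the same prediction at x = 3 but diverge elsewhere.
survivors = [(1.0, 2.0), (1.3, 1.9), (0.7, 2.1), (1.6, 1.8), (0.4, 2.2)]

def predictive_means(x):
    """Mean of f(y | x, theta) for each surviving theta."""
    return [b0 + b1 * x for b0, b1 in survivors]

spread_near = statistics.pstdev(predictive_means(3.0))   # data-rich region
spread_far = statistics.pstdev(predictive_means(10.0))   # data-poor region
print(spread_near < spread_far)  # True: wider predictive at x = 10
```

Averaging these per-$\theta$ distributions is a Monte-Carlo version of the posterior predictive integral above: narrow where the means cluster, wide where they fan out.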

And so, that is how the posterior (predictive distribution) gets its width. In the regions the likelihood visited often, the posterior over $\theta$ got squeezed, the sampled distributions over $y$ agree, and their average is narrow. In the regions the likelihood barely touched, many $\theta$s survived the update, each telling a different story about $y$, and their average distribution fans out. Epistemic uncertainty is the visible residue of how much the data got to push the prior around at each $x$.

There's more to the story, though. This is the "more data would help" flavor: uncertainty in the parameters $\theta$. The other is uncertainty about the model family itself. A linear model fit to sinusoidal data will produce confident, narrow predictive bands in data-rich regions and be confidently wrong. Fixing that kind of uncertainty requires a better model, not a bigger dataset. But that's for another time.

References

  1. Glossary of mathematical notation. https://en.wikipedia.org/wiki/Glossary_of_mathematical_symbols
  2. Updating Priors. https://udesh.io/Updating_priors.html
  3. Maximum likelihood estimation. https://www.probabilitycourse.com/chapter8/8_2_3_max_likelihood_estimation.php