How The Linear Regression Got Its Bowtie.

Figure 1: A standard simple linear regression plot. Confidence bands add uncertainty quantification. Wait, but why?

Suppose we want to investigate the relationship between height ($y$, cm) and weight ($x$, kg) in a population. We have a hypothesis: any given observation of height is generated by the following data generating process (DGP):

y = \beta_0 + \beta_1 \cdot x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),\ \sigma = 15

Say we have 5 people that are the exact same weight. Do we expect them to have the exact same height? With the above definition we say no. The error, $\epsilon$, contributes all the randomness in $y$: there will be a distribution around any given $x$, with a local mean defined by a line. It is a simple claim on the face of it, but it is actually a remarkably bold assumption. We assume that the relationship between $x$ and $y$ is a simple line. Think of all the factors that influence an individual's height at a given weight (genetics, childhood nutrition, developmental disorders); the world is a complex place. However, we just say: doesn't matter, all of that complexity is basically noise compared to the simple linear relationship.

Let's fix the DGP and $X$ as follows:

import numpy as np

n = 100
sigma = 15
X = np.random.normal(75, 10, n)
beta_0 = 1
beta_1 = 2
eps = np.random.normal(0, sigma, n)
Y = beta_0 + beta_1 * X + eps

With ordinary least squares (OLS) regression we model the following:

\mathbb{E}[Y|X] = \beta_0 + \beta_1 \cdot x

Notice that $\mathbb{E}[Y|X]$ is a single number, not a distribution. We are asking: what is the mean of $Y$ at each value of $x$? The noise hasn't disappeared; it's still in $Y$. We assumed $\epsilon \sim \mathcal{N}(0,\sigma^2)$, so the mean-zero noise term drops out under expectation. When we plug any given $x$ into a linear regression, we are asking: what is the expected value of the distribution of $y$ around this $x$?
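As a concrete sketch (assuming NumPy; the seed and variable names are my own choices), we can simulate the DGP above and fit this conditional mean with `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulate the DGP: y = 1 + 2x + noise
n, sigma = 100, 15
X = rng.normal(75, 10, n)
Y = 1 + 2 * X + rng.normal(0, sigma, n)

# OLS fit; np.polyfit with degree 1 returns [slope, intercept]
beta_1_hat, beta_0_hat = np.polyfit(X, Y, 1)

# The model's single-number answer: the mean of y at x = 85
y_mean_85 = beta_0_hat + beta_1_hat * 85
```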

Suppose we have a fitted linear regression and we want to know $y$ at $x = 85$. We can see this by sampling from the DGP.

Figure 2: Randomness in $y$

The blue line is the single number provided by the model. It has collapsed all of the local uncertainty into its prediction of the mean.

Prediction intervals - Two pieces of uncertainty.

Uncertainty around any single line

Usually, we only have one sample of data, $(x_i, y_i)$ pairs, from the population to work with. With this sample, we fit one best-fit line. Let's ignore bootstrapping for now, since we assume we have enough samples and the assumptions for linear regression are met.

Figure 3: Error around a single fitted line

Figure 3 shows the variation of our observed examples around the fitted line. What is the scatter of points around the fitted line? Recall that (by definition) anything not captured by our model is noise with standard deviation $\sigma$. We can estimate this randomness as follows. (We divide by $n - p - 1 = n - 2$, where $p$ is the number of covariates; this is a generalization of Bessel's correction [1]. The formula is undefined for $n \leq 2$: with 2 points, the line passes perfectly through both, both residuals are 0, and $\frac{0}{0}$ has no meaning. This is the math saying: you can't estimate noise from a perfect fit. You need at least one point of slack, $n \geq 3$, for $\hat{\sigma}$ to be defined at all.)

\hat{\sigma} = \sqrt{\frac{\sum(y - \hat{y})^2}{n-2}}
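Continuing the same simulated setup (a sketch; the variable names are mine), $\hat{\sigma}$ falls straight out of the residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 15
X = rng.normal(75, 10, n)
Y = 1 + 2 * X + rng.normal(0, sigma, n)

b1, b0 = np.polyfit(X, Y, 1)
residuals = Y - (b0 + b1 * X)

# Divide by n - 2: two parameters were spent fitting the line
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))
# sigma_hat should land near the true sigma of 15
```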

Even if we knew exactly what the data generating line was, we would still have some randomness around it; that is our assumption of the data generating process. (Here we assume the model is correctly specified. For this example, this is exactly aleatoric uncertainty, because we defined the DGP as a linear function with some noise. A nuance on this is discussed later.) Note that $\sigma$ is a fixed property of the world (as seen by our model), while $\hat{\sigma}$ is our estimate: it becomes more reliable, converging to $\sigma$, as we get more data (increasing $n$). The scatter we see around the model is our model-based definition of aleatoric uncertainty: uncertainty from the inherent randomness in the world. It's irreducible; no amount of additional data eliminates it.

Why can't we just slap $\pm\hat{\sigma}$ around a point prediction and call it a day? $\hat{\sigma}$ does capture randomness (our estimate of how much the points scatter around the line) but notice: there's no $x$ involved. The model assumes the noise is the same size at every $x$; this is the homoscedasticity-of-errors assumption, i.e., the model doesn't systematically perform worse for some parts of the input space. So $\hat{\sigma}$ is a single scalar that summarizes the vertical scatter at every $x$. However, our fitted line $\mathbb{E}[y|x]$ is built from a finite sample. The estimates of $\{\beta_0, \beta_1\}$ could have been different with a different sample from the same random process. That uncertainty does depend on $x$ but is absent from $\hat{\sigma}$. This uncertainty about the line itself, which depends on which $x$'s you happen to sample, is what's missing.

Uncertainty about the line itself

What about the line itself? Since we have the full data generating process, let's fit separate lines to 6 independent samples of data.

Figure 4: Lines fitted to six samples of data.

Figure 4 shows how the variance between samples affects the line. All of these are plausible lines generated from the same DGP.

Now, if we get the prediction for $x = 85$ from each of these models (fitted over many sample datasets), we get the following distribution of $\mathbb{E}[y|x = 85]$.

Figure 5: Distribution of $\mathbb{E}[y|x = 85]$ from many different samples of data from the DGP

This is the uncertainty we have about the model itself: specifically, the model parameters, because we have fixed the family of the model to be linear. It is the standard error of the conditional mean (recall that linear regression gives $\mathbb{E}[y|x]$). It quantifies how much the prediction for a given input could vary over many possible model fits.
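We can measure this spread empirically by refitting the line on many fresh samples from the DGP and recording each line's prediction at $x = 85$ (a sketch under the simulation setup above; names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 15

# One fitted line per independent sample; record each line's prediction at x = 85
preds = []
for _ in range(2000):
    X = rng.normal(75, 10, n)
    Y = 1 + 2 * X + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(X, Y, 1)
    preds.append(b0 + b1 * 85)

# Empirical standard error of the conditional mean at x = 85
se_empirical = float(np.std(preds))
```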

\widehat{\text{se}}_{\text{line}}(x_{\text{new}}) = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_{\text{new}} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

Let's look at the terms inside the square root, starting with the second one:

\frac{(x_{\text{new}} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

This term measures the distance of the new example from the sample mean of $x$, relative to the total spread of $x$. It is essentially a surprise score. If $x_{\text{new}}$ is very far from the mean but the overall spread of $x$ is wide (large denominator), it's not very surprising. A wider spread of training $x$'s puts data across more of the input space, which pins down the line better. The closer a given input is to the mean, the more confident you can be about the prediction there.

The term is zero when $x_{\text{new}} = \bar{x}$: predicting at the center of your data is the most confident you can be. As $x_{\text{new}}$ moves away from $\bar{x}$, the term grows quadratically, but a wide spread of training $x$'s damps that growth; you've earned the right to be confident further out. This quadratic growth is the mathematical signature of the bowtie shape we see in the bands: narrow at the center of the data, flaring outward as you move toward the edges.

The $\frac{1}{n}$ term keeps the value in the square root strictly positive, even when $x_{\text{new}} = \bar{x}$ zeros out the surprise term. Suppose the noise is high (large $\hat{\sigma}$) but we have lots of data. The noise on individual points averages out across many points, and the line gets pinned down even through a noisy cloud. We can't make the world less noisy, but we can buy a more confident estimate of the line by collecting more data.
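The whole formula fits in a few lines (a sketch; the helper name `se_line` is my own):

```python
import numpy as np

def se_line(x_new, X, sigma_hat):
    """Standard error of the fitted conditional mean at x_new."""
    x_bar = X.mean()
    surprise = (x_new - x_bar) ** 2 / np.sum((X - x_bar) ** 2)
    return sigma_hat * np.sqrt(1 / len(X) + surprise)

rng = np.random.default_rng(0)
X = rng.normal(75, 10, 100)

# Narrow at the center, flaring at the edges: the bowtie
at_center = se_line(X.mean(), X, sigma_hat=15)
at_edge = se_line(X.mean() + 30, X, sigma_hat=15)
```

At $x_{\text{new}} = \bar{x}$ the surprise term vanishes and only $\hat{\sigma}\sqrt{1/n}$ remains; 30 units out, the quadratic term takes over.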

This can be seen in the following animation. (Always keep in mind that the fitted line gives $\mathbb{E}[y|x]$, the mean of the distribution of $y$ at each $x$, not a single deterministic prediction.) For the first few data points, the fitted line jumps around a lot; a few points do very little to pin it down. As new points arrive from around the input space, the line settles towards the true line. (This isn't $\widehat{\text{se}}_{\text{line}}$ shrinking per se. That would be best illustrated with many lines fitted to $m$ samples of size $n$, with increasing $n$, showing the lines cluster together as $n$ increases.)

Figure 6: As more points come in, we get a better picture of where the line should be.

So the line uncertainty has two separable knobs: $\hat{\sigma}$, our estimate of the noise in the world, and $\frac{1}{n} + \text{surprise}$, the information factor, which we control through sample size and where we choose to sample. Drive the information factor toward zero with enough well-spread data, and the line uncertainty vanishes. What remains is just $\sigma$: the irreducible scatter of new points around a now-well-known line.

With $\widehat{\text{se}}_{\text{line}}$, we can calculate the confidence band for a specific $x$ within which we can expect the line to fall:

\widehat{\mathbb{E}}[y|x_{\text{new}}] \pm t \cdot \widehat{\text{se}}_{\text{line}}(x_{\text{new}})

where $t$ answers: how many standard deviations wide do I want the interval to be? (e.g., $t \approx 1.98$ to capture 95% of the probability mass). We have a finite-sample estimate of $\sigma$ which is itself noisy; a different sample would have given us a different estimate. What we want are bounds that capture some proportion (e.g., 95%) of the values $\mathbb{E}[y|x_{\text{new}}]$ can take, and for that we need a distribution. We use a $t$-distribution here because it has fatter tails, reflecting that we should be more uncertain than a normal distribution would suggest. And now, for a range of test inputs, we can create a band around the entire line which shrinks and grows depending on the value of $x$.
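A sketch of the band over a grid of inputs, using `scipy.stats.t.ppf` for the critical value (SciPy and the names are my own choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma = 100, 15
X = rng.normal(75, 10, n)
Y = 1 + 2 * X + rng.normal(0, sigma, n)

b1, b0 = np.polyfit(X, Y, 1)
sigma_hat = np.sqrt(np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2))

# Two-sided 95% critical value from a t-distribution with n - 2 dof
t_crit = stats.t.ppf(0.975, df=n - 2)

# Band over a grid of test inputs: narrow near x-bar, wide at the edges
x_grid = np.linspace(X.min(), X.max(), 50)
se = sigma_hat * np.sqrt(1 / n + (x_grid - X.mean()) ** 2 / np.sum((X - X.mean()) ** 2))
lower = b0 + b1 * x_grid - t_crit * se
upper = b0 + b1 * x_grid + t_crit * se
```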

This is our model-based definition of epistemic uncertainty: uncertainty due to a lack of knowledge. We could conceivably improve our model with more data (keywords being our model; more below).

Bringing the parts together

What we are after is the prediction interval: a band that tells us with some confidence where a new, unseen point will land. We need a way to combine the effects of both $\hat{\sigma}$ and $\widehat{\text{se}}_{\text{line}}$ into a standard error of prediction, $\widehat{\text{se}}_{\text{pred}}$.

Recall that we have two sources of uncertainty about the true $y$ of a new example: a) where the true line might actually be at $x_{\text{new}}$, and b) how a new observation will scatter around that line. Both contribute to the error of a new prediction, and both are measured in standard deviation units.

We can't just add them directly. Adding standard deviations would assume the two errors always reinforce each other, but they don't. They're independent: sometimes they push in the same direction (the line happens to be too high and the new point happens to land high, doubling up), and sometimes they push in opposite directions (the line is too high but the new point lands low, partially cancelling). Over many predictions, the average magnitude of their combined effect is less than the sum of their individual spreads. Picture the prediction error $y_{\text{new}} - \hat{y}(x_{\text{new}})$ as a quantity being pushed around by two independent random forces: one from the line being off, one from the new point's noise. The Pythagorean combination accounts for the mix of reinforcement and cancellation honestly.

To combine them properly, we work in variance units (where the cancellation/reinforcement bookkeeping works out cleanly), then translate back to a standard deviation:

\widehat{\text{se}}_{\text{pred}} = \sqrt{\hat{\sigma}^2 + \widehat{\text{se}}_{\text{line}}^2}

and finally, to establish the bounds for a new observation, we multiply by the critical value for a desired confidence level to get the final prediction interval:

\widehat{\mathbb{E}}[y|x] \pm t \cdot \widehat{\text{se}}_{\text{pred}}
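Putting it all together for a new observation at $x = 85$ (a sketch; names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 15
X = rng.normal(75, 10, n)
Y = 1 + 2 * X + rng.normal(0, sigma, n)

b1, b0 = np.polyfit(X, Y, 1)
sigma_hat = np.sqrt(np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2))

x_new = 85.0
se_line = sigma_hat * np.sqrt(1 / n + (x_new - X.mean()) ** 2 / np.sum((X - X.mean()) ** 2))

# Variances add for independent errors; take the square root afterwards
se_pred = np.sqrt(sigma_hat**2 + se_line**2)

t = 1.98  # ~95% critical value for n - 2 = 98 degrees of freedom
y_hat = b0 + b1 * x_new
interval = (y_hat - t * se_pred, y_hat + t * se_pred)
```

Note how close `se_pred` comes out to `sigma_hat` here: with this much data, the line uncertainty is tiny next to the scatter, which is why the prediction band barely tapers.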

Figure 7: The confidence band narrows and flares as a function of $x$. So does the prediction band; however, the $\hat{\sigma}$ term dominates the sum, so the taper is barely noticeable.

A tale of two uncertainties?

Why have I constantly reminded you that all of this is model-based? Let's look at all the calculations in one place.

\begin{aligned} \hat{\sigma} &= \sqrt{\frac{\sum(y - \hat{y})^2}{n-2}} \\ \widehat{\text{se}}_{\text{line}}(x_{\text{new}}) &= \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_{\text{new}} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \\ \widehat{\text{se}}_{\text{pred}} &= \sqrt{\hat{\sigma}^2 + \widehat{\text{se}}_{\text{line}}^2} \\ \text{Interval} &= \hat{y} \pm t \cdot \widehat{\text{se}}_{\text{pred}} \end{aligned}

Notice how $\hat{\sigma}$ shows up everywhere, and it is calculated as the error between the model and the observed $y$. We labelled this as aleatoric uncertainty (irreducible uncertainty about the world) but consider the following hypothetical DGP and fitted linear model.

y = \beta_0 + \beta_1 \cdot (x - 60)^2 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),\ \sigma = 15
n = 100
sigma = 15
X = np.random.normal(75, 10, n)
beta_0 = 5
beta_1 = 1
eps = np.random.normal(0, sigma, n)

Y = beta_0 + beta_1 * (X - 60)**2 + eps
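Fitting a straight line to this quadratic DGP (a sketch, reusing the simulation setup from above) shows $\hat{\sigma}$ blowing up far past the true noise level of 15:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 15
X = rng.normal(75, 10, n)
Y = 5 + 1 * (X - 60) ** 2 + rng.normal(0, sigma, n)  # quadratic world

# Misspecified model: a straight line fit to a curved DGP
b1, b0 = np.polyfit(X, Y, 1)
sigma_hat = np.sqrt(np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2))

# sigma_hat now absorbs the curvature the line cannot express,
# so it comes out far larger than the true noise level of 15
```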

Figure 8: $\hat{\sigma}$ is estimated as model error against the observed data.

We are using the model to estimate $\hat{\sigma}$, and if the model is badly misspecified, the world looks much noisier than it really is. This is because, according to the model, the world is linear, and any deviation from this hypothetical line is irreducible noise. Clearly, though, if we had a better model (i.e., more knowledge), we could reduce this "irreducible" uncertainty.

So, the split between epistemic and aleatoric uncertainty is not as clean as it sounds. One might say it's all epistemic uncertainty, and all we lack is a better model and more data. We could be missing a crucial covariate (e.g., gender) in our model; fixing that structural ignorance reduces the apparent aleatoric uncertainty. Or perhaps we could sample more widely, pinning down the parameters better, which reduces epistemic uncertainty once we've committed to a model. I find this a more optimistic position than resigning parts of the world to inscrutable randomness. It says that we can do better; we just have to keep going. That, though, is a philosophical debate better tackled elsewhere.

Thanks for reading.

References

  1. Why divide by n − 1? https://www.youtube.com/watch?v=_sdNwY0KKUA