Consider the task where someone has to predict the outcome of a coin flip. What do we think are the odds that they will get it right? Phrased differently, this is the probability of a success (a right answer) each time they have to make a choice. Let’s call this belief \(\theta\): the probability of a success. The important thing here is that we don’t know the long-run frequency of successes, which can take any value between 0 and 1.
Let’s suppose that we have no good reason to believe that any value of \(\theta\) (between 0 and 1) is more likely than any other: we have no strong prior beliefs about what this value might be. We haven’t done experiments, we don’t have any personal intuition, nothing.
Suppose further that we have a dataset where someone played this game 100 times. The dataset is simply a record of those 100 attempts, in which the person correctly called which side the coin would land on 65 times.
refit_models <- FALSE # Set to TRUE if you want to refit models below
stan_data <- list(N = 100, y = 65)
stan_data
## $N
## [1] 100
##
## $y
## [1] 65
The question is: how should we update our prior belief based on this evidence? To do so, we need a way of formalizing a couple of things:
Let’s start by defining \(\theta\) as a distribution of probabilities. Now, we have decided that we have no good intuition about what the value of \(\theta\) might be. This is equivalent to saying that any value of \(\theta\) in [0,1] (remember, \(\theta\) is a probability, so it can only lie in [0,1]) is equally likely. The Beta distribution, whose support is exactly the range [0,1], gives us a nice way to define our prior expectation in a “tunable” way, by varying its parameters \(a\) and \(b\). Note that the Beta distribution is defined by its probability density function (PDF) and its cumulative distribution function (CDF). The PDF is defined as:
\[ \begin{align} \text{Beta}(\theta; a, b) &= \left [\frac{1}{B(a, b)}\right]\theta^{a - 1} (1 - \theta)^{b - 1} &\quad{\text{(1)}} \end{align} \] where \(B(a,b)\) is the Beta function, defined as :
\[ B(a, b) = \int_{0}^{1} t^{a - 1} (1 - t)^{b - 1} \, dt \] The parameters of the Beta distribution can be interpreted as (number of successes, number of failures). The higher these numbers are, the more certain you are of where the distribution is centered. For example, having seen 10 successes and 10 failures (\(\text{Beta}(10,10)\)), you may be more confident that the true value of \(\theta\) is centered around 0.5 than having seen 1 success and 1 failure (\(\text{Beta}(1,1)\)).
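One way to see this sharpening is through the variance of the Beta distribution, which shrinks as \(a\) and \(b\) grow. The `beta_var` helper below is our own, using the standard closed-form variance of the Beta distribution:

```r
# Variance of Beta(a, b): a*b / ((a + b)^2 * (a + b + 1))
beta_var <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))
beta_var(1, 1)    # flat prior: variance 1/12, about 0.083
beta_var(10, 10)  # "seen 10 and 10": about 0.012, much more concentrated
```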
Figure 1: Probability density function (PDF) of Beta(1,1) Distribution.
# Create a data frame with x values and corresponding Beta(1,1) density values
data <- data.frame(x = seq(0, 1, length.out = 100))
# Get values from the distribution, here shape1 = a, shape2 = b.
data$y <- dbeta(data$x, shape1 = 1 , shape2 = 1)
# Plot the Beta(1,1) distribution using ggplot2
ggplot(data, aes(x = x, y = y)) +
geom_line(color = "skyblue", linewidth = 1) +
labs(x = expression(theta), y = "Density") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.grid = element_blank()) # Remove grid lines
Our prior is a (Beta) distribution of probabilities and the density (y-axis Figure 1) shows where the mass of this function is most concentrated. A way to think about density is as follows.
Imagine you’re a police officer trying to find a criminal. You don’t want to search the entire city; you want to search where you’re most likely to find the criminal. If you have no information at all about where they might be, then you might consider that any location in the city is equally likely: the probability of finding the true value is uniformly distributed (Figure 1) over its domain. Conversely, if you know that there is a shady bar where criminals routinely hang out, then you may decide that you are more likely to find your target in the vicinity of the bar, because there are more criminals per unit area as you get close to it.
Bringing it back to our problem, the true value of \(\theta\) is the criminal we’re looking for. Since \(\theta\) is a real number, between any two values (say 0.8 and 0.9) there are infinitely many real numbers (0.88, 0.888, etc.). That is, there are infinitely many values that \(\theta\) can take, so it is impossible for us to know its exact value, but the PDF tells us how the density for \(\theta\) is concentrated at each point in [0,1].
\[ x \rightarrow PDF \rightarrow \text{a real number that is }\textbf{density }\text{around } x \] The PDF takes a value (\(x\)) and returns a real number which is the density of the distribution at the value \(x\). Indeed, if you look at the code, you will see that the plots of the PDFs (Figures 1 and 2) are generated by taking a sequence of values (data) and getting the corresponding output from the Beta PDF of a specific shape. For the Beta(1,1) case, the density around all of the values of \(\theta\) that we tried is the same. Since the number of criminals per unit length (at a specific value of \(\theta\)) is the same everywhere, you are equally likely to bump into the criminal anywhere in [0,1].
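To make this concrete: for a continuous variable, probabilities come from integrating the density over an interval, and the density itself is not a probability (it can exceed 1). A quick check in R:

```r
# For Beta(1,1), the probability of theta landing in any interval equals its length:
pbeta(0.9, shape1 = 1, shape2 = 1) - pbeta(0.8, shape1 = 1, shape2 = 1)  # 0.1
# Density is mass per unit length, not a probability; it can exceed 1
# for sharply concentrated distributions:
dbeta(0.5, shape1 = 100, shape2 = 100)
```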
Alternatively, if we had some intuition that most of the mass for \(\theta\) is centered around 0.5 (the bar in our analogy), we could define a prior as Beta(10,10) :
Figure 2 : Probability density function (PDF) of Beta(10,10) Distribution.
# Create a data frame with x values and corresponding Beta(10,10) density values
data <- data.frame(x = seq(0, 1, length.out = 100))
data$y <- dbeta(data$x, shape1 = 10 , shape2 = 10)
# Plot the Beta(10,10) distribution using ggplot2
ggplot(data, aes(x = x, y = y)) +
geom_line(color = "skyblue", linewidth = 1) +
labs(x = expression(theta), y = "Density") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.grid = element_blank()) # Remove grid lines
What we see is that the density per unit length increases as we get close to \(\theta = 0.5\) : Our prior belief is that the true success rate is more likely to be found around 0.5.
Anyway, we will stick to the case of a very uninformative prior (i.e., a uniform distribution).
To update our prior beliefs means to shift our prior (distribution) around in some way, incorporating new data. How do we do this?
Indeed, Bayes’ rule gives us the way to do so.
\[ P(\theta|\text{data}) = \frac{P(\text{data}|\theta)\cdot P(\theta)}{P(\text{data})} \]
\[ \begin{align} \text{Posterior} &= \frac{\text{Likelihood}\cdot\text{Prior}}{\text{Marginal Likelihood}}\\ \end{align} \]
If we ignore the denominator for a moment, what this says is that we want to scale our prior belief (about \(\theta\)) in proportion to how likely the observed data is given our prior belief. This should make sense intuitively. The whole point is to update our belief about what \(\theta\) might be. If the data we see is pretty likely given our belief as it is, there is no need to do much updating. If we have no specific prior belief whatsoever (\(P(\theta) = 1\)), then we should update our belief directly by the likelihood.
\[ P(\theta|\text{data}) \propto P(\text{data}|\theta)\cdot P(\theta) \]
Substituting our prior with the PDF of the Beta distribution:
\[ \begin{align} \text{Posterior} = \frac{\text{Likelihood}\cdot \frac{1}{B(1,1)}\theta^{1-1}(1-\theta)^{1-1}}{\text{Marginal Likelihood}} \end{align} \] Here, \(B\) is the Beta function (which is not the same as the Beta distribution) which can be evaluated using :
\[ \begin{align} B(a,b) &= \int_0^1 \theta^{a-1}(1-\theta)^{b-1} \, d\theta\\ B(1,1) &= \int_0^1 \theta^{1-1}(1-\theta)^{1-1} \, d\theta \\ &= \int_0^1 \theta^{0}(1-\theta)^{0} \, d\theta \\ &= \int_0^1 1\, d\theta \\ &= \left[ \theta \right]_0^1 = 1 - 0 = 1 \end{align} \] Indeed, since \(\theta^0\cdot (1-\theta)^0 =1\) for \(a =1, b =1\), the prior evaluates to 1.
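We can confirm this derivation numerically: R’s built-in `beta()` evaluates the Beta function, and `integrate()` can evaluate the defining integral directly.

```r
# R's beta() evaluates the Beta function B(a, b):
beta(1, 1)  # 1, as derived above
# The same result by numerical integration of the defining integral:
integrate(function(t) t^(1 - 1) * (1 - t)^(1 - 1), lower = 0, upper = 1)$value
```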
We just need to figure out the rest of the components of this equation and it will update our prior distribution to give us our posterior distribution : our belief of where the values of \(\theta\) are concentrated after incorporating our new data.
Suppose we set \(\theta = 0.4\). We can ask: how likely is the result we have obtained in our data (65 successes out of 100) given \(\theta = 0.4\)? Note some specifics about our task:
The outcomes of \(n\) success-failure attempts (also known as Bernoulli trials), each with some probability of success \(\theta\), can be modelled by the binomial distribution. The probability of getting \(k\) successes out of \(n\) attempts is given by:
\[ \binom{n}{k}\cdot\theta^k\cdot(1 - \theta)^{n - k} \] Here \(\binom{n}{k}\) represents the number of ways \(k\) successes can be arranged among the \(n\) attempts. In our example, the player could have gotten the first five wrong, then 10 right, then two wrong, etc. This value, also called the binomial coefficient, gives you the number of possible orderings in which these 65/100 correct calls could have been generated. \(\theta^k\) is the probability of the successes and \((1-\theta)^{n-k}\) is the probability of the failures [source].
So, if we have \(\theta = 0.4\), just for illustration, to get the likelihood of seeing this outcome, we have:
\[ \binom{100}{65}\cdot(0.4)^{65}\cdot(1 - 0.4)^{100 - 65} = 2.562323e-07 \] Indeed, this is what the dbinom function will give you:
# where x = 65 is the number of successes
# n = 100 is the total number of attempts
# prob = 0.4 is the probability of success in each attempt
dbinom(65,100,0.4)
## [1] 2.562323e-07
Unhelpfully, the function is called dbinom to follow the conventions of R when it should rightly be called, perhaps, mbinom since this is a probability mass function (PMF).
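We can verify that `dbinom` is just evaluating the binomial formula above by computing it by hand with `choose()`:

```r
# The same likelihood computed directly from the binomial formula:
choose(100, 65) * 0.4^65 * (1 - 0.4)^(100 - 65)  # matches dbinom(65, 100, 0.4)
```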
We need the PDF for continuous random variables because the probability of the variable taking any one exact value is essentially zero. What is the probability of there being exactly 100 ml of water in a can if you have infinite precision in your measurement? Zero. A discrete random variable, however, takes on distinct, separate values. For a discrete variable that can take values 1 or 0, there is no in-between.
\[ x \rightarrow PMF \rightarrow \text{a real number that is }\textbf{probability of } x \]
For a continuous random variable, it only makes sense to ask “What is the probability of \(\theta\) taking a value between \(\theta_a\) and \(\theta_b\)?”. Since the PDF essentially tells you the probability density per unit length at any given value, we can integrate the PDF between \(\theta_a\) and \(\theta_b\) to get the answer. For a discrete random variable, it makes sense to ask “What is the probability of \(X\) taking value \(a\)?”. The PMF answers this question directly, whereas with a continuous variable, we integrate the density over a range to get the mass (in our case, the probability) within that range.
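The normalization requirement shows the same contrast: a PMF sums to 1 over its discrete support, while a PDF integrates to 1 over its continuous domain.

```r
# A PMF sums to 1 over its discrete support...
sum(dbinom(0:100, size = 100, prob = 0.4))
# ...while a PDF integrates to 1 over its continuous domain:
integrate(dbeta, lower = 0, upper = 1, shape1 = 10, shape2 = 10)$value
```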
Figure 3 : Binomial Distribution PMF (n =100,𝜃= 0.4).
# Define parameters
n <- 100 # Number of trials
theta <- 0.4 # Probability of success
# Create a data frame for the binomial distribution
df <- data.frame(k = 0:n,
pmf = dbinom(0:n, size = n, prob = theta))
# Plot the PMF using ggplot
ggplot(df, aes(x = k, y = pmf)) +
geom_bar(stat = "identity", fill = "skyblue") +
theme_minimal() +
labs(x = "Number of Successes (k)",
y = "Probability") +
theme(plot.title = element_text(hjust = 0.5),
panel.grid = element_blank())
Note the difference between the PMF plot (Figure 3) and the PDF plots (Figures 1 and 2). In the PMF, the horizontal axis is the number of discrete successes, whereas in the PDF, it’s the continuous value between 0 and 1.
As we can see, the likelihood of seeing the given results if \(\theta = 0.4\) is exceedingly small. If we don’t fix \(\theta\), then we have a function of \(\theta\), with \(n\) and \(k\) given by our data:
\[ P(\text{data}|\theta) =\binom{100}{65}\cdot\theta^{65}\cdot(1 - \theta)^{100 - 65} \] Updating our Bayes formula with this result:
\[ \begin{align} \text{Posterior} &=\frac{\text{Likelihood}\cdot Beta_{\theta}(1,1)}{\text{Marginal Likelihood}} \\ &=\frac{ \binom{100}{65}\cdot(\theta)^{65}\cdot(1 - \theta)^{100 - 65}\cdot \frac{1}{B(1,1)}\theta^{1-1}(1-\theta)^{1-1}}{\text{Marginal Likelihood}} \end{align} \]
The marginal likelihood is simply the integral of the numerator, which serves to ensure that, once evaluated, the posterior remains a valid probability distribution that integrates to 1. Wait, but why? Why do we need this?
To reiterate: the likelihood is a probability mass function. For any value \(\theta\), it tells you the probability of seeing the data. The prior is a probability density function. For any value \(\theta\), it tells you the density per unit length at that value. To ensure that the resulting posterior distribution (specifically, the posterior PDF of the Beta distribution) integrates to 1 (a requirement for any probability distribution), we normalize the proportional result.
\[ \begin{align} P(\theta|\text{data}) &\propto P(\text{data}|\theta)\cdot P(\theta)\\ &\propto \binom{100}{65}\cdot(\theta)^{65}\cdot(1 - \theta)^{100 - 65}\cdot \frac{1}{B(1,1)}\theta^{1-1}(1-\theta)^{1-1} \\ &\propto \left[ \binom{100}{65}\cdot\frac{1}{B(1,1)} \right]\theta^{65}\cdot(1-\theta)^{100-65}\cdot(1-\theta)^{1-1} \end{align} \] by dividing by the integral over the entire range (the marginal likelihood):
\[ \begin{align} \text{Posterior} &=\frac{\left[ \binom{100}{65}\cdot\frac{1}{B(1,1)} \right]\theta^{65}\cdot(1-\theta)^{100-65}\cdot(1-\theta)^{1-1} }{\int_0^1P(\text{data}|\theta)P(\theta)d\theta} \end{align} \]
\[ \begin{align} P(\text{data}) &= \int_0^1 P(\text{data} | \theta) P(\theta) \, d\theta \\ &= \int_0^1 \binom{100}{65} \cdot \theta^{65} (1 - \theta)^{35} \cdot \frac{1}{B(1, 1)} \cdot \theta^{1 - 1} (1 - \theta)^{1 - 1} \, d\theta \\ &= \binom{100}{65} \cdot \frac{1}{B(1, 1)} \int_0^1 \theta^{65} (1 - \theta)^{35} \, d\theta\\ &=\binom{100}{65}\cdot\frac{B(66,36)}{B(1,1)} \end{align} \] Plugging this result back into the Bayes’ rule formula:
\[ \begin{align} \text{Posterior} &=\frac{\left[ \binom{100}{65}\cdot\frac{1}{B(1,1)} \right]\theta^{65}\cdot(1-\theta)^{100-65}\cdot(1-\theta)^{1-1} }{\binom{100}{65}\cdot\frac{B(66,36)}{B(1,1)}}\\ &= \frac{\theta^{65}\cdot(1-\theta)^{100-65}\cdot(1-\theta)^{1-1}}{B(1,1)}\cdot\frac{B(1,1)}{B(66,36)}\\ &= \frac{\theta^{65}\cdot(1-\theta)^{100-65}\cdot(1-\theta)^{1-1}}{B(66,36)}\\ &= \frac{1}{B(66,36)}\cdot\theta^{65}\cdot(1-\theta)^{100-65}\\ &= \text{Beta}(\theta; 66, 36) \end{align} \] Et voilà! We have arrived at a new probability distribution for our belief of where the true value of \(\theta\) might be. As we did above, we can now visualize our posterior distribution by sending some values into the PDF and plotting the results.
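We can sanity-check this closed-form result numerically: normalize the likelihood-times-prior with a numerical integral and compare it to `dbeta(..., 66, 36)` at an arbitrary point.

```r
# Un-normalized posterior: likelihood x prior
unnorm <- function(theta) dbinom(65, size = 100, prob = theta) * dbeta(theta, 1, 1)
# Marginal likelihood: integral of the numerator over [0, 1]
marg_lik <- integrate(unnorm, lower = 0, upper = 1)$value
# The normalized posterior should match the Beta(66, 36) density everywhere:
theta0 <- 0.6
c(unnorm(theta0) / marg_lik, dbeta(theta0, shape1 = 66, shape2 = 36))
```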
Figure 4 : Probability density function (PDF) of Beta(1,1) and Beta(66,36) Distributions.
# Create a data frame with x values and corresponding Beta(1,1) and Beta(66,36) density values
data <- data.frame(x = seq(0, 1, length.out = 100))
# Add Beta(1,1) and Beta(66,36) density values to the data frame
data$beta_1_1 <- dbeta(data$x, shape1 = 1, shape2 = 1)
data$beta_66_36 <- dbeta(data$x, shape1 = 66, shape2 = 36)
# Convert data to long format for ggplot legend
library(tidyr)
data_long <- pivot_longer(data, cols = c(beta_1_1, beta_66_36),
names_to = "distribution", values_to = "density")
# Plot the Beta(1,1) and Beta(66,36) distributions using ggplot2
ggplot(data_long, aes(x = x, y = density, color = distribution, linetype = distribution)) +
geom_line(size = 1) +
scale_color_manual(values = c("skyblue", "steelblue"),
labels = c("Beta(1,1)", "Beta(66,36)")) + # Custom legend labels
scale_linetype_manual(values = c("solid", "dashed"),
labels = c("Beta(1,1)", "Beta(66,36)")) + # Custom line types with labels
labs(x = expression(theta), y = "Density",
color = "Distribution", linetype = "Distribution") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) + # Center the title and make it bold
theme(panel.grid = element_blank()) + # Remove grid lines
guides(color = guide_legend(keywidth = 2, keyheight = 1), # Set equal key widths in legend
linetype = guide_legend(keywidth = 2, keyheight = 1)) # Adjust linetype legend key size
Finally, we can get summary results of our posterior as follows:
1. What is the expected value of \(\theta\) under the posterior?
\[ \begin{align} \mathbb{E}[\theta|\text{data}] &= \frac{a}{a + b}\\ \mathbb{E}[\theta|\text{data}] &= \frac{66}{66 + 36} = \frac{66}{102} \approx 0.647 \end{align} \] 2. What is the variance of \(\theta\) under the posterior?
\[ \begin{align} \text{Var}[\theta|\text{data}] &= \frac{ab}{(a + b)^2 (a + b + 1)} \\ \text{Var}[\theta|\text{data}] &= \frac{66 \times 36}{(66 + 36)^2 (66 + 36 + 1)} = \frac{66 \times 36}{102^2 \times 103} \approx 0.0022 \end{align} \] 3. What is the 95% credible interval for \(\theta\)?
This question asks: between which two values does 95% of the mass of the posterior distribution fall? That is, we want the region that accounts for 95% of the total area under the curve. This corresponds to:
\[ \text{95% of mass}= [\text{area below the } 97.5^{th} \text{ percentile}] - [\text{area below the } 2.5^{th} \text{ percentile}] \]
Figure 5 : 95% Credible Interval for Beta(66,36).
# Define the shape parameters of the Beta distribution
alpha <- 66
beta <- 36
# Generate the quantiles (2.5th and 97.5th percentiles)
q_lower <- qbeta(0.025, alpha, beta) # the value below which 5% / 2 = 2.5% of the mass is
q_upper <- qbeta(0.975, alpha, beta) # the value below which 100% - (5% / 2) = 97.5% of the mass is
# Create a sequence of values for theta (the x-axis) between 0 and 1
theta <- seq(0, 1, length.out = 1000)
# Calculate the PDF of the Beta distribution for each value of theta
pdf_values <- dbeta(theta, alpha, beta)
# Create a data frame to hold the theta and PDF values
data <- data.frame(theta, pdf_values)
# Plot the Beta distribution and shade the 95% credible interval
ggplot(data, aes(x = theta, y = pdf_values)) +
geom_line(color = "skyblue", size = 1) + # Plot the PDF line
geom_area(data = subset(data, theta >= q_lower & theta <= q_upper),
aes(x = theta, y = pdf_values), fill = "skyblue") + # Shade the 95% credible interval
geom_vline(xintercept = q_lower, linetype = "dashed", color = "red", size = 1) + # Lower bound
geom_vline(xintercept = q_upper, linetype = "dashed", color = "red", size = 1) + # Upper bound
labs(x = expression(theta),
y = "Density") +
annotate("text", x = q_lower, y = 5, label = round(q_lower, 3), color = "red", vjust = -1.5) +
annotate("text", x = q_upper, y = 5, label = round(q_upper, 3), color = "red", vjust = -1.5) +
theme(plot.title = element_text(hjust = 0.5),panel.grid = element_blank(),panel.background = element_rect(fill = "white", color = NA))
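All three summaries can be cross-checked by simulating draws from the analytic posterior Beta(66, 36):

```r
set.seed(123)
draws <- rbeta(100000, shape1 = 66, shape2 = 36)  # samples from the posterior
round(mean(draws), 3)                       # close to a / (a + b) = 66/102
round(var(draws), 4)                        # close to 0.0022
round(quantile(draws, c(0.025, 0.975)), 3)  # close to the qbeta() bounds above
```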
Though we were able to step through that entire process analytically, in practice it is usually very difficult or impossible to do so, especially if we want to estimate multiple parameters. Thus, we need a way to approximate the posterior distribution. Specifically, what makes it difficult is taking the integral in the denominator:
\[ \begin{align} \text{Posterior} &=\frac{\left[ \binom{100}{65}\cdot\frac{1}{B(1,1)} \right]\theta^{65}\cdot(1-\theta)^{100-65}\cdot(1-\theta)^{1-1} }{\int_0^1P(\text{data}|\theta)P(\theta)d\theta} \end{align} \] It’s straightforward in this case because it’s only 1-dimensional (\(\theta\)), but it can quickly become difficult or impossible to solve analytically as the number of dimensions increases. Computational methods (like those used in Stan) allow us to approximate the answer to this problem in an efficient way. Without the denominator, we only have the proportional relationship:
\[ P(\theta|\text{data}) \propto P(\text{data}|\theta)\cdot P(\theta) \] Instead of trying to calculate this integral, we can go around it entirely by directly sampling from the un-normalized posterior.
Stan does this using a sampling technique called Hamiltonian Monte Carlo (HMC), which would require an entire document in itself to explain (refer 1 and 2 for details). Put briefly, Stan explores the space of \(\theta\) in the posterior probability density landscape (the log of the posterior) to collect representative samples.
We can visualize this as follows. The landscape we’re exploring is Figure 6(B). As you can see, there is a nice valley around the point of highest density (note that this is for illustration only, because the plot on the left is the properly normalized posterior). Essentially, once we have the un-normalized posterior, each iteration of HMC starts from the chain’s current position on the landscape and explores it for a defined number of steps, collecting a sample of \(\theta\) at the end of each trajectory. How each step is taken is a function of how the landscape is shaped at the current point (potential energy, momentum). Note that the sequence of samples forms a Markov chain: the next state depends only on the current state. Samples from the valleys (the regions of highest density) will be generated more often, and a random momentum kick at every new iteration lets the sampler also visit areas that are further away, so that we collect representative samples. Once we have these samples, we can directly ask the same questions we answered above (you can find all the beautiful mathy specifics here).
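HMC itself is beyond our scope, but the core idea, drawing samples using only the un-normalized posterior, can be illustrated with a much simpler relative: random-walk Metropolis. This sketch is ours, not what Stan does internally:

```r
set.seed(42)
# Log of the un-normalized posterior: log-likelihood + log-prior
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbinom(65, size = 100, prob = theta, log = TRUE) + dbeta(theta, 1, 1, log = TRUE)
}
n_iter <- 20000
samples <- numeric(n_iter)
theta <- 0.5  # starting point
for (i in seq_len(n_iter)) {
  proposal <- theta + rnorm(1, mean = 0, sd = 0.05)  # random-walk proposal
  # Accept with probability min(1, posterior ratio); the normalizing
  # constant cancels in the ratio, which is why we never need it.
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) theta <- proposal
  samples[i] <- theta
}
round(mean(samples[-(1:1000)]), 2)  # close to the analytic posterior mean 66/102
```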
Figure 6 : The posterior (A) and (B) the posterior landscape (negative log-posterior).
# Define the parameters for the posterior (Beta distribution)
alpha <- 66 # 65 successes + 1 for prior
beta <- 36 # 35 failures + 1 for prior
# Create a sequence of theta values (x-axis), avoiding the endpoints
# where the density is 0 and -log() would be infinite
theta_vals <- seq(0.001, 0.999, length.out = 1000)
# Calculate the posterior density values (Beta PDF)
posterior_density <- dbeta(theta_vals, shape1 = alpha, shape2 = beta)
# Calculate the "inverted landscape" (negative log-posterior)
inverted_landscape <- -log(posterior_density)
# Create a data frame for both plots
data <- data.frame(theta = theta_vals,
posterior_density = posterior_density,
inverted_landscape = inverted_landscape)
# Plot 1: Posterior density (Beta(66, 36) distribution)
p1 <- ggplot(data, aes(x = theta, y = posterior_density)) +
geom_line(color = "skyblue", size = 1) +
labs(title = "(A)",
x = expression(theta), y = "Density") +
theme_minimal() +
theme(
plot.background = element_rect(fill = "white", color = NA), # White background
panel.background = element_rect(fill = "white", color = NA),
plot.title = element_text(hjust = 0.5),
panel.grid = element_blank(), # No grid lines
axis.line = element_line(color = "gray") # Add axis lines
)
# Plot 2: Inverted landscape (-log(posterior density))
p2 <- ggplot(data, aes(x = theta, y = inverted_landscape)) +
geom_line(color = "skyblue", size = 1) +
labs(title = "(B)",
x = expression(theta), y = "Inverted Energy") +
theme_minimal() +
theme(
plot.background = element_rect(fill = "white", color = NA), # White background
panel.background = element_rect(fill = "white", color = NA),
plot.title = element_text(hjust = 0.5),
panel.grid = element_blank(), # No grid lines
axis.line = element_line(color = "gray") # Add axis lines
)
# Combine both plots in a 1-row, 2-column layout
grid.arrange(p1, p2, ncol = 2)
First, we define the model.
stan_model_code <- "
data {
int<lower=1> N; // number of trials (100)
int<lower=0> y; // number of successes (65 correct guesses)
}
parameters {
real<lower=0, upper=1> theta; // the probability of guessing correctly
}
model {
theta ~ beta(1,1); // Beta(1,1) prior for theta
y ~ binomial(N, theta); // Binomial likelihood
}
"
Then we fit it with our data.
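The fitting call itself is not shown above. A minimal sketch of what it might look like (assuming the rstan package is installed, and using the `stan_data` list defined earlier) is commented out below; compiling and sampling require a working Stan toolchain. Because the Beta prior is conjugate to the binomial likelihood, we also know analytically what the samples should center on:

```r
# Sketch only (requires rstan and a C++ toolchain):
# fit <- rstan::stan(model_code = stan_model_code, data = stan_data,
#                    chains = 4, iter = 2000, seed = 42)
# By conjugacy, the samples should be distributed as Beta(66, 36),
# so the posterior mean reported below should be close to:
66 / (66 + 36)
```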
Finally, we can answer the same questions as above.
Figure 7 : Posterior Density of θ
posterior_samples <- rstan::extract(fit)
theta_samples <- posterior_samples$theta
# Convert samples to data frame for ggplot
theta_df <- data.frame(theta = theta_samples)
# Create the density plot
ggplot(theta_df, aes(x = theta)) +
geom_density(fill = "skyblue", alpha = 0.5) +
labs(x = "θ", y = "Density")+
xlim(0, 1)+
theme(plot.title = element_text(hjust = 0.5),panel.grid = element_blank(),panel.background = element_rect(fill = "white", color = NA))
mean(theta_samples)
## [1] 0.6467528
var(theta_samples)
## [1] 0.002292304
quantile(theta_samples, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 0.5518046 0.7369175
We decided at the outset to have a uniform prior: we had no particular belief about the density of \(\theta\). Is this a reasonable prior to have? To check, we can generate some data from the prior distribution. That is, we draw samples of \(\theta\) from the prior, simulate outcomes with them, and visually inspect how those outcomes are distributed.
Figure 8 : Prior Predictive Distribution. Red line indicates our measured data
set.seed(42) # for reproducibility
n_simulations <- 1000 # number of prior predictive simulations
N <- 100 # number of attempts (same as our actual data)
# Step 1: Draw 1000 theta values from the prior Beta(1,1)
theta_prior_samples <- rbeta(n_simulations, shape1 = 1, shape2 = 1)
# Step 2: For each drawn theta value, simulate successes (y) of N attempts from the binomial distribution where
# theta_prior_samples is a list. That is, for each theta proposed by the prior distribution, generate an actual result.
# This generates a list of 1000 where each element is how many guesses were correct out of N attempts in that simulation.
y_prior_predictive <- rbinom(n_simulations, size = N, prob = theta_prior_samples)
# Step 3: Plot the distribution of simulated successes
library(ggplot2)
df_prior_predictive <- data.frame(successes = y_prior_predictive)
ggplot(df_prior_predictive, aes(x = successes)) +
geom_histogram(fill = "skyblue", bins = 30, color = "black", alpha = 0.7) +
geom_vline(xintercept = 65, color = "red", linetype = "dashed", size = 1.2) +
labs(x = "Number of Successes in 100 Trials",
y = "Frequency") +
theme_minimal() +
theme(panel.grid = element_blank())
Indeed, the results generated are all over the place, as expected. Can we do better? There are a variety of different priors we can use, each capturing a different level of belief about where \(\theta\) might lie before we incorporate the data we have; broadly, these range from uninformative to informative, with varying degrees of strength.
For example, if we used the \(Beta(10,10)\) distribution as our prior :
Figure 9 : Prior Predictive Distribution with Beta(10,10). Red line indicates our measured data
set.seed(42) # for reproducibility
n_simulations <- 1000 # number of prior predictive simulations
N <- 100 # number of attempts (same as our actual data)
# Step 1: Draw 1000 theta values from the prior Beta(10,10)
theta_prior_samples <- rbeta(n_simulations, shape1 = 10, shape2 = 10)
# Step 2: For each drawn theta value, simulate successes (y) of N attempts from the binomial distribution where
# theta_prior_samples is a list. That is, for each theta proposed by the prior distribution, generate an actual result.
# This generates a list of 1000 where each element is how many guesses were correct out of N attempts in that simulation.
y_prior_predictive <- rbinom(n_simulations, size = N, prob = theta_prior_samples)
# Step 3: Plot the distribution of simulated successes
library(ggplot2)
df_prior_predictive <- data.frame(successes = y_prior_predictive)
ggplot(df_prior_predictive, aes(x = successes)) +
geom_histogram(fill = "skyblue", bins = 30, color = "black", alpha = 0.7) +
geom_vline(xintercept = 65, color = "red", linetype = "dashed", size = 1.2) +
labs(x = "Number of Successes in 100 Trials",
y = "Frequency") +
theme_minimal() +
theme(panel.grid = element_blank())
Indeed, the simulated outcomes generated by this prior are now centered around 50 successes out of 100 (i.e., \(\theta\) near 0.5), which might be more reasonable.
Let’s now investigate how our choice of prior influences our posterior distribution (a.k.a. sensitivity analysis). First, we define the two models, uniform and informed.
stan_model_uniform_prior <- "
data {
int<lower=1> N; //number of trials (100)
int<lower=0> y; // number of successes (65 correct guesses)
}
parameters {
real<lower=0, upper=1> theta; // the probability of guessing correctly
}
model {
theta ~ beta(1,1); // Beta(1,1) prior for theta
y ~ binomial(N, theta); // Binomial likelihood
}
"
stan_model_informed_prior <- "
data {
int<lower=1> N; //number of trials (100)
int<lower=0> y; // number of successes (65 correct guesses)
}
parameters {
real<lower=0, upper=1> theta; // the probability of guessing correctly
}
model {
theta ~ beta(10,10); // Beta(10,10) prior for theta
y ~ binomial(N, theta); // Binomial likelihood
}
"
Then we fit both models with our data.
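Again, a sketch of the two fitting calls (assuming rstan is installed), plus what conjugacy predicts for each posterior:

```r
# Sketch only (requires rstan and a C++ toolchain):
# fit_uniform  <- rstan::stan(model_code = stan_model_uniform_prior,  data = stan_data)
# fit_informed <- rstan::stan(model_code = stan_model_informed_prior, data = stan_data)
# Conjugate posteriors: Beta(1+65, 1+35) and Beta(10+65, 10+35); their means:
c(uniform = 66 / 102, informed = 75 / 120)
```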
Let’s take a look at how the two posterior distributions compare.
Figure 10 : Posterior Density of θ under uninformed (Beta(1,1)) and informed (Beta(10,10)) priors.
# Extract posterior samples from both models
posterior_samples_uninformed <- rstan::extract(fit_uniform)
posterior_samples_informed <- rstan::extract(fit_informed)
# Convert samples to data frame for ggplot
theta_uninformed <- posterior_samples_uninformed$theta
theta_informed <- posterior_samples_informed$theta
# Create a combined data frame with a column to indicate which model the samples came from
theta_df <- data.frame(
theta = c(theta_uninformed, theta_informed),
model = rep(c("Uninformed", "Informed"), each = length(theta_uninformed))
)
# Create the density plot
ggplot(theta_df, aes(x = theta, fill = model)) +
geom_density(alpha = 0.5) + # Adjust transparency to differentiate the plots
labs(x = expression(theta), y = "Density") +
xlim(0, 1) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
panel.grid = element_blank(),
panel.background = element_rect(fill = "white", color = NA)
) +
scale_fill_manual(values = c("skyblue", "steelblue"))
Interestingly, the two posteriors are quite similar, though it need not always be the case. It’s left to the reader to experiment with priors that have a more noticeable impact on the posterior.
As we did with the prior, we can now draw data from our posterior distribution. We can then check whether the distribution of predicted values generated by our model makes sense given the data that we have. From the plot, we can see that, given the red line, our posterior predictive distribution seems reasonable.
Figure 11 : Data from the posterior predictive distribution of θ under the uninformed (Beta(1,1)) prior.
# 1. Extract posterior samples
posterior_samples <- rstan::extract(fit)
theta_samples <- posterior_samples$theta
# 2. Generate posterior predictive data
N <- 100 # Number of trials in your actual data
n_samples <- length(theta_samples) # Number of posterior samples = 4000
# draw 4000 values of number of successes from 4000 distributions defined by each theta value extracted from the posterior
posterior_predictive <- rbinom(n_samples, size = N, prob = theta_samples)
# 3. Define the observed number of successes
observed_successes <- 65
# 4. Plot posterior predictive distribution with a red line for observed data
posterior_predictive_df <- data.frame(successes = posterior_predictive)
ggplot(posterior_predictive_df, aes(x = successes)) +
geom_histogram(fill = "skyblue", bins = 30, color = "black", alpha = 0.7) +
geom_vline(xintercept = observed_successes, color = "red", linetype = "dashed", size = 1.2) + # Add red dashed line
labs(x = "Number of Successes", y = "Frequency") +
theme_minimal() +
theme(panel.grid = element_blank())