Metropolis-Hastings

Author

Dr. Alexander Fisher

See libraries used in these notes
library(tidyverse)
library(latex2exp)
library(patchwork)
library(tidymodels)
library(mvtnorm)
library(coda)
library(animation)

Notation

  • \(\theta\) is some parameter of interest.
  • \(\pi(\theta)\) represents the target distribution of the parameter.

Question: what does the phrase “explore parameter space” mean?

Metropolis algorithm

  1. Sample \(\theta^* | \theta^{(s)} \sim J(\theta | \theta^{(s)})\)
  2. Compute the acceptance ratio \(r = \frac{\pi(\theta^*)}{\pi(\theta^{(s)})}\)
  3. Let

\[ \theta^{(s+1)} = \begin{cases} \theta^* \text{ with probability } \min(r, 1)\\ \theta^{(s)} \text{ with probability } 1 - \min(r, 1) \end{cases} \]
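
Before turning to a concrete target, here is a minimal sketch of these three steps as a reusable R function; the names log_target, theta0, and delta are our own placeholders, and the proposal is a symmetric normal, so \(r\) needs no correction. We work with log densities to avoid numerical underflow.

metropolis = function(log_target, theta0, delta, S) {
  theta = theta0
  draws = numeric(S)
  for (s in 1:S) {
    theta_star = rnorm(1, mean = theta, sd = delta)     # 1. propose
    log_r = log_target(theta_star) - log_target(theta)  # 2. log acceptance ratio
    if (log(runif(1)) < log_r) theta = theta_star       # 3. accept or reject
    draws[s] = theta
  }
  draws
}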

Example 1

Let \(\pi(\theta) = \text{dnorm}(\theta, 10, 1)\) and let \(J(\theta | \theta^{(s)}) = \text{normal}(\theta^{(s)},\delta^2)\).

We have to choose \(\delta\). How should we choose it? Let’s gain some intuition by trying out three different values of \(\delta\).

set.seed(360)
theta_s = 0 # starting point
S = 10000 # number of iterations
THETA = numeric(S) # preallocated vector to store the chain
delta = 1 # proposal standard deviation
accept = 0 # keep track of acceptance rate

for (s in 1:S) {
  # log everything for numerical stability #
  
  ### generate proposal and compute ratio r ###
  theta_proposal = rnorm(1, mean = theta_s, sd = delta) 
  log.r = dnorm(theta_proposal, mean = 10, sd = 1, log = TRUE) - 
    dnorm(theta_s, mean = 10, sd = 1, log = TRUE)
  
  ### accept or reject proposal and add to chain ###
  if(log(runif(1)) < log.r)  {
    theta_s = theta_proposal
    accept = accept + 1 
  }
  THETA[s] = theta_s
}
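
The accept counter gives the empirical acceptance rate of this chain:

accept / S # fraction of proposals accepted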

Let’s look at a trace plot:

df = data.frame(theta = THETA, iteration = 1:S)
df %>%
  ggplot(aes(x = iteration, y = theta)) + 
  geom_line() +
  theme_bw() +
  labs(x = "iteration", y = TeX("\\theta"))

Let’s look at how various \(\delta\) let us sample the target:
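
As a sketch of that comparison, we can wrap the sampler above in a function and run it at a few illustrative values, say \(\delta \in \{0.1, 1, 100\}\). A small \(\delta\) accepts nearly every proposal but creeps through parameter space; a large \(\delta\) proposes distant values that are usually rejected; a moderate \(\delta\) balances the two.

run_chain = function(delta, S = 10000, theta0 = 0) {
  theta_s = theta0
  THETA = numeric(S)
  accept = 0
  for (s in 1:S) {
    theta_proposal = rnorm(1, mean = theta_s, sd = delta)
    log.r = dnorm(theta_proposal, mean = 10, sd = 1, log = TRUE) -
      dnorm(theta_s, mean = 10, sd = 1, log = TRUE)
    if (log(runif(1)) < log.r) {
      theta_s = theta_proposal
      accept = accept + 1
    }
    THETA[s] = theta_s
  }
  list(chain = THETA, accept_rate = accept / S)
}

chains = lapply(c(0.1, 1, 100), run_chain)
sapply(chains, function(x) x$accept_rate) # compare acceptance rates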

Example 2

We model the number of fledglings raised by female song sparrows as a function of the mother’s age. To begin, let’s load the data.

Load the data
yX = structure(c(3, 1, 1, 2, 0, 0, 6, 3, 4, 2, 1, 6, 2, 3, 3, 4, 7, 
2, 2, 1, 1, 3, 5, 5, 0, 2, 1, 2, 6, 6, 2, 2, 0, 2, 4, 1, 2, 5, 
1, 2, 1, 0, 0, 2, 4, 2, 2, 2, 2, 0, 3, 2, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 5, 
5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 6, 1, 1, 9, 9, 1, 1, 1, 1, 1, 1, 
1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 25, 25, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 25, 16, 16, 16, 16, 25, 25, 25, 25, 
9, 9, 9, 9, 9, 9, 9, 36, 1, 1), .Dim = c(52L, 4L), .Dimnames = list(
    NULL, c("fledged", "intercept", "age", "age2")))
yX %>%
  head(n = 5)
     fledged intercept age age2
[1,]       3         1   3    9
[2,]       1         1   3    9
[3,]       1         1   1    1
[4,]       2         1   1    1
[5,]       0         1   1    1
y = yX[,1]
X = yX[,-1]

The model:

\[ \begin{aligned} Y_i | \boldsymbol{x}_i &\sim \text{Poisson}(\exp[ \beta^T \boldsymbol{x}_i])\\ \beta &\sim MVN(0, 10^2 I) \end{aligned} \]

The Metropolis algorithm with

\[ J(\beta | \beta^{(s)}) = MVN(\beta^{(s)}, \hat{\sigma}^2(X^TX)^{-1}) \]

where \(\hat{\sigma}^2\) is the sample variance of \(\{\log(y_1 + 1/2), \ldots, \log(y_n + 1/2)\}\).

Note
  • This is an intuitively useful choice for the proposal variance, since the posterior variance of \(\beta\) would be approximately \(\sigma^2 (X^TX)^{-1}\) in a normal regression problem.
  • We use \(\log(y + 1/2)\) instead of \(\log y\) because if \(y=0\), \(\log y\) would be \(-\infty\).
set.seed(360)
n = length(y)
p = ncol(X)

pmn.beta = rep(0, p) # prior mean beta
psd.beta = rep(10, p) # prior sd beta

var.prop = var(log(y + 1/2)) * solve(t(X) %*% X) # proposal variance

S = 10000
beta = rep(0, p); accepts = 0
BETA = matrix(0, nrow = S, ncol = p)
set.seed(1)

for (s in 1:S) {
  # multivariate proposal of beta
  beta.p = t(rmvnorm(1, beta, var.prop))
  
  # log ratio
  lhr = sum(dpois(y, exp(X %*% beta.p), log = TRUE)) -
    sum(dpois(y, exp(X %*% beta), log = TRUE)) + 
    sum(dnorm(beta.p, pmn.beta, psd.beta, log = TRUE)) -
    sum(dnorm(beta, pmn.beta, psd.beta, log = TRUE)) 
  
  if (log(runif(1)) < lhr) {
    beta = beta.p ; accepts = accepts + 1
  }
  
  BETA[s,] = beta
}

The empirical acceptance rate, accepts / S, is 0.428.

Let’s examine convergence.

df = data.frame(
  value = c(BETA[, 1], BETA[, 2], BETA[, 3]),
  beta = rep(c("beta1", "beta2", "beta3"), each = S),
  iteration = rep(1:S, times = 3)
)

df %>%
  ggplot(aes(x = iteration, y = value)) + 
  geom_line() + 
  facet_wrap(~ beta) +
  theme_bw() +
  labs(x = "iteration")
# effective sample size
BETA %>%
  apply(2, effectiveSize)
[1] 867.4750 825.6214 692.0495
par(mfrow=c(1,3))
acf(BETA[,1])
acf(BETA[,2])
acf(BETA[,3])
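
The same information can be read off numerically; for example, the lag-1 autocorrelation of each chain:

# acf()$acf[1] is lag 0, so index 2 is lag 1
apply(BETA, 2, function(x) acf(x, plot = FALSE)$acf[2])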

Metropolis-Hastings

If we have a vector of parameters \(\theta = (\theta_1, \ldots, \theta_p)\), then we can choose a proposal \(J(\theta | \theta^{(s)})\) that updates all elements of \(\theta\) simultaneously (as seen above with \(J(\beta | \beta^{(s)})\)).

Alternatively, we could update blocks of \(\theta\), e.g., propose an update for the first \(j\) elements, \(J_1(\theta_1,\ldots, \theta_j | \theta_1^{(s)}, \ldots, \theta_j^{(s)})\), and then an update for the remaining \(p-j\) elements, \(J_2(\theta_{j+1}, \ldots, \theta_{p} | \theta_{j+1}^{(s)}, \ldots, \theta_{p}^{(s)})\).

We could even update each element of \(\theta\) individually, i.e., use a different proposal \(J_i\) for each \(\theta_i\), \(i \in \{1, \ldots, p\}\) (see the sketch below).

We might even combine block updates with individual updates.
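
Here is a minimal sketch of the one-at-a-time strategy, assuming a joint log-density function log_target on a p-vector and a vector delta of per-component proposal standard deviations (both names are our own placeholders):

# one sweep of component-wise Metropolis updates
componentwise_update = function(theta, log_target, delta) {
  for (i in seq_along(theta)) {
    theta_star = theta
    theta_star[i] = rnorm(1, mean = theta[i], sd = delta[i]) # perturb element i only
    log_r = log_target(theta_star) - log_target(theta)
    if (log(runif(1)) < log_r) theta = theta_star
  }
  theta
}

Since each \(\theta_i\) is accepted or rejected separately, a slowly mixing component can be given its own tuned \(\delta_i\).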

Question: Where have we seen block updates and individual updates within MCMC before?

The Metropolis-Hastings algorithm is a generalization of both the Metropolis algorithm and the Gibbs sampler.

The Metropolis-Hastings algorithm

Let \(\pi(\theta_1, \theta_2)\) be the target distribution. The Metropolis-Hastings algorithm proceeds:

  • Update \(\theta_1\):
  1. sample \(\theta_1^* \sim J_1(\theta_1 | \theta_1^{(s)}, \theta_2^{(s)})\);
  2. compute the acceptance ratio

\[ r = \frac{\pi(\theta_1^*, \theta_2^{(s)})}{\pi(\theta_1^{(s)}, \theta_2^{(s)})} \times \frac{J_1(\theta_1^{(s)}| \theta_1^*, \theta_2^{(s)})}{ J_1(\theta_1^{*}| \theta_1^{(s)}, \theta_2^{(s)}) } \]

  3. set \(\theta_1^{(s+1)}\) to \(\theta_1^*\) with probability \(\min(1, r)\), otherwise set \(\theta_1^{(s+1)}\) to \(\theta_1^{(s)}\).
  • Repeat the above to update \(\theta_2\) given \(\theta_1^{(s+1)}\).
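
This substantiates the claim above that Metropolis-Hastings generalizes the Gibbs sampler: if we propose directly from the full conditional, \(J_1(\theta_1 | \theta_1^{(s)}, \theta_2^{(s)}) = \pi(\theta_1 | \theta_2^{(s)})\), the acceptance ratio is always 1:

\[ r = \frac{\pi(\theta_1^*, \theta_2^{(s)})}{\pi(\theta_1^{(s)}, \theta_2^{(s)})} \times \frac{\pi(\theta_1^{(s)} | \theta_2^{(s)})}{\pi(\theta_1^{*} | \theta_2^{(s)})} = \frac{\pi(\theta_1^* | \theta_2^{(s)})\, \pi(\theta_2^{(s)})}{\pi(\theta_1^{(s)} | \theta_2^{(s)})\, \pi(\theta_2^{(s)})} \times \frac{\pi(\theta_1^{(s)} | \theta_2^{(s)})}{\pi(\theta_1^{*} | \theta_2^{(s)})} = 1, \]

so every proposal is accepted and the update is exactly a Gibbs step.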
Important

Here, the proposal distribution \(J\) need not be symmetric! The second factor in \(r\) corrects for the asymmetry; when \(J\) is symmetric, that factor equals 1 and we recover the Metropolis algorithm.
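
To see the correction term at work, here is a hedged sketch (our own illustrative example, not from the data above): we target a \(\text{gamma}(3, 1)\) density for a positive parameter and propose with an asymmetric log-normal random walk, so the two \(J\) terms in \(r\) do not cancel.

set.seed(360)
S = 10000
delta = 0.5 # proposal sd on the log scale (illustrative tuning value)
theta_s = 1 # starting point
THETA = numeric(S)

for (s in 1:S) {
  # propose theta* | theta_s ~ log-normal(log(theta_s), delta^2): asymmetric!
  theta_star = rlnorm(1, meanlog = log(theta_s), sdlog = delta)
  
  # log acceptance ratio: target ratio plus the proposal correction
  log.r = dgamma(theta_star, shape = 3, rate = 1, log = TRUE) -
    dgamma(theta_s, shape = 3, rate = 1, log = TRUE) +
    dlnorm(theta_s, meanlog = log(theta_star), sdlog = delta, log = TRUE) -
    dlnorm(theta_star, meanlog = log(theta_s), sdlog = delta, log = TRUE)
  
  if (log(runif(1)) < log.r) {
    theta_s = theta_star
  }
  THETA[s] = theta_s
}

Dropping the two dlnorm terms here would target the wrong distribution; the log-normal proposal density is not symmetric in its arguments.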