7  Maximum likelihood estimation

Preparation

Download influenza incidence data here:

library(tidyr)
library(odin)
library(bbmle)
library(ggplot2)
library(glue)
library(ggsci)
library(RColorBrewer)

7.1 Likelihood

\[\mathbb{L}(\theta | x) = \mathbb{P}(x | \theta)\]

7.2 Methods

The maximum likelihood estimation follows 4 steps:

  • Step 1. Write down the likelihood function \(L(\theta)\). Likelihood function is a function of \(\theta\), which is the product of the \(n\) mass/density function terms of the observed values \(y_i\).

\[\mathbb{L}(\theta) = \prod_{i = 1}^{n} f_Y(y_i| \theta)\]

  • Step 2. Take the natural log of the likelihood \(log L(\theta)\), or log-likelihood. The production in likelihood now becomes the sum in log-likelihood. See Section 2.2 for more details.

\[log \mathbb{L}(\theta) = \sum_{i = 1}^{n} log \left[ f_Y(y_i| \theta) \right]\]

  • Step 3. Take the negative log-likelihood \(- log L(\theta)\). Because optimizers work by minimizing a function, we have to work on negative log-likelihood. See Section 2.3 for more details.

\[- log \mathbb{L}(\theta) = - \sum_{i = 1}^{n} log \left[ f_Y(y_i| \theta) \right]\]

  • Step 4. The optimizer try different values of \(\theta\) in the paramater space \(\Theta\) for which negative log-likelihood \(- log \mathbb{L}(\theta)\) is minimized.

7.3 Statistical distributions for fitting

Choosing the right distribution depends on your data type.

7.3.1 Number of cases

7.3.1.1 Poisson distribution

\(Pois(\lambda)\) is the probability of observing \(k\) events occurring in a fixed interval of time if these events occur with a known constant mean rate \(\lambda\) and independently of the time since the last event:

\[\mathbb{P}(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]

  • \(\text{Mean} = \text{Variance} = \lambda\).

\(Pois(\lambda)\) is simple as it only requires one parameter, \(\lambda\). However, a limitation is that the mean equals the variance (\(\text{mean} = \text{variance} = \lambda\)), which may not match your data if it has more variability (overdispersion).

Below is examples of \(Pois(\lambda = 4)\) and \(Pois(\lambda = 10)\).

Code
df_plot <- data.frame(k = 0:20,
                      prob04 = dpois(0:20, 4),
                      prob10 = dpois(0:20, 10))

df_plot <- pivot_longer(df_plot,
                        contains("prob"),
                        names_to = "rate",
                        values_to = "prob")

ann_line_4 <- data.frame(
  rate = "prob04",
  x_mean = 4,
  y_mean = 0,
  yend_mean = max(dpois(0:20, 4)),
  x_var = 2,
  xend_var = 6,
  y_var = 0
)

ann_line_10 <- data.frame(
  rate = "prob10",
  x_mean = 10,
  y_mean = 0,
  yend_mean = max(dpois(0:20, 10)),
  x_var = 5,
  xend_var = 15,
  y_var = 0
)

ggplot(df_plot, aes(x = k, y = prob, color = rate)) +
  geom_point() +
  geom_line() +
  geom_segment(data = ann_line_4,
               aes(x = x_mean, y = y_mean, yend = yend_mean),
               linetype = "dashed") +
  geom_segment(
    data = ann_line_4,
    aes(x = x_var, xend = xend_var, y = y_var),
    arrow = arrow(length = unit(0.03, "npc"), ends = "both")
  ) +
  geom_segment(data = ann_line_10,
               aes(x = x_mean, y = y_mean, yend = yend_mean),
               linetype = "dashed") +
  geom_segment(
    data = ann_line_10,
    aes(x = x_var, xend = xend_var, y = y_var),
    arrow = arrow(length = unit(0.03, "npc"), ends = "both")
  ) +
  facet_grid( ~ rate, labeller = as_labeller(c(prob04 = "lambda == 4", prob10 = "lambda == 10"), default = label_parsed)) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  scale_color_lancet() +
  labs(y = "P(X = k)") +
  theme_bw() +
  theme(legend.position = "none")

7.3.1.2 Binomial distribution

Imagine a sequence of independent Bernoulli trials: each trial has two potential outcomes called “success” (with probability \(p\)) and “failure” (with probability \(1 - p\)). \(B(n, p)\) is the probability of getting exactly \(k\) successes with success probability \(p\) in a sequence of \(n\) independent Bernoulli trials:

\[\mathbb{P}(X = k) = {n \choose k} p^k (1 - p)^{n - k} \tag{7.1}\]

  • \(\text{Mean} = \text{Median} = \text{Mode} = np\).
  • \(\text{Variance} = np(1 - p)\).

The Poisson distribution is a limiting case of a Binomial distribution when the number of trials \(n\) is large and the success probability \(p\) is small. If \(n \geq 100\) and \(np \leq 10\) (meaning \(p \leq 0.1\)), the Poisson distribution with \(\lambda = np\) is a good approximation of the Binomial distribution (Oxford College of Emory University, n.d.).

Let’s start with \(B(n, p)\): the probability of getting \(k\) successes with success probability \(p\) in a sequence of \(n\) independent Bernoulli trials with Equation 7.1.

\[\mathbb{P}(X = k) = {n \choose k} p^k (1 - p)^{n - k}\]

The expected value, or the mean number of successes is \(np\). Let say this is equal to the mean number of observing events in \(Pois(\lambda)\) which is \(\lambda\):

\[np = \lambda \Leftrightarrow p = \frac{\lambda}{n}\]

Substitute this into Equation 7.1 becomes:

\[\mathbb{P}(X = k) = {n \choose k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k}\]

7.3.1.3 Negative binomial distribution

Imagine a sequence of independent Bernoulli trials: each trial has two potential outcomes called “success” (with probability \(p\)) and “failure” (with probability \(1 - p\)). We keep observing the trials until exactly \(r\) successes occur. \(NB(r, p)\) (or Pascal) distribution is the probability of getting \(k\) failures until \(r\) successes occurs in a sequence of \(n\) independent Bernoulli trials:

Binomial distribution gives the probability of \(k\) successes in a fixed number of \(n\) trials.

Negative binomial distribution gives the probability of \(k\) failures, given that we have \(r\) successes in \(n\) trials.

The Negative binomial has two parameters: \(r\), the number of successes, and \(p\), the probability of success. Its key advantage is that it allows for variance greater than the mean, which makes it suitable for overdispersed data where variability exceeds the average.

7.4 Code

We will use the same data of a H1N1 influenza outbreak in an elementary school in Section 6.2 and fit the deterministic SEIR model with code Listing 2.1.

df <- read.csv("data/flu_inc.csv")

Looking at the original paper, the school has 370 students (Cauchemez et al., 2011). So we set the initial values as below.

S0 <- 369
E0 <- 0
I0 <- 1
R0 <- 0
  • Step 1. Write down the likelihood function \(L(\theta)\).

We assume the incidence data is generated from a normal distribution with mean \(\mu_{inc}\) is the predicted incidence and a standard deviation \(\sigma_{inc}\).

\[L(\theta) = \prod_{i = 1}^{n} \mathcal{N}(\mu_{inc}, \sigma_{inc})\]

We use the dnorm() function to define this likelihood.

dnorm(x = data, mean = pred, sd = sd_inc)
  • Step 2. Take the natural log of the likelihood \(log L(\theta)\), product becomes sum.

\[log L(\theta) = \sum_{i = 1}^{n} log \left[ \mathcal{N}(\mu_{inc}, \sigma_{inc}) \right]\]

sum(dnorm(x = data, mean = pred, sd = sd_inc, log = T))
  • Step 3. Take the negative log-likelihood \(- log L(\theta)\).

\[- log L(\theta) = - \sum_{i = 1}^{n} log \left[ \mathcal{N}(\mu_{inc}, \sigma_{inc}) \right]\]

- sum(dnorm(x = data, mean = pred, sd = sd_inc, log = T))
  • Step 4. Pass the negative log-likelihood \(- log L(\theta)\) to the optimizer.

Root mean squared error as SD

mll <- function(beta, sigma, gamma, sd_inc) {
  # Make sure that parameters are positive
  beta <- exp(beta) 
  sigma <- exp(sigma)
  gamma <- exp(gamma)
  sd_inc <- exp(sd_inc)
  
  pred <- seir_mod(beta = beta, sigma = sigma, gamma = gamma, S0 = S0, E0 = E0, I0 = I0, R0 = R0, times = 0:(length(df$inc) - 1))
  pred <- pred$Inc
  # Return the negative log-likelihood
  - sum(dnorm(x = df$inc, mean = pred, sd = sd_inc, log = T))
}

starting_param_val <- list(beta = 0.004, sigma = 0.5, gamma = 0.5, sd_inc = 3)

estimates <- mle2(minuslogl = mll, start = lapply(starting_param_val, log), method = "Nelder-Mead")

summary(estimates)
Maximum likelihood estimation

Call:
mle2(minuslogl = mll, start = lapply(starting_param_val, log), 
    method = "Nelder-Mead")

Coefficients:
        Estimate Std. Error  z value     Pr(z)    
beta   -5.063479   0.092641 -54.6567 < 2.2e-16 ***
sigma   0.708384   0.134498   5.2669 1.388e-07 ***
gamma   0.600399   0.092968   6.4581 1.060e-10 ***
sd_inc  0.530870   0.099628   5.3285 9.902e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

-2 log L: 198.8602 
params <- exp(coef(estimates))
params
       beta       sigma       gamma      sd_inc 
0.006323524 2.030707475 1.822846603 1.700410622 

Compare the model with data.

Code
pred <- seir_mod(beta = params[1], sigma = params[2], gamma = params[3], S0 = S0, E0 = E0, I0 = I0, R0 = R0, times = df$day)
df_plot <- pred[,c("t", "Inc")]

# Compute 95% confidence interval
lwr <- qnorm(p = 0.025, mean = pred$Inc, sd = params[4])
upr <- qnorm(p = 0.975, mean = pred$Inc, sd = params[4])

my_palette <- brewer.pal(11, "PuOr")[c(10, 1, 4, 3, 8)]

ggplot(df_plot, aes(x = t, y = Inc)) +
  geom_line(color = my_palette[3], linewidth = 1.2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = my_palette[3], alpha = 0.15) +
  geom_point(data = df, aes(x = day, y = inc)) +
  coord_cartesian(ylim = c(0, max(upr))) +
  labs(x = "Day", y = "Incidence") +
  theme_minimal()

It looks very similar to the model we fitted with LSE at Figure 6.1.

Let compute \(R_0\). For a “closed population” SEIR model, \(R_0 = \frac{\beta}{\gamma} S_0\). Again, this is similar to the \(R_0\) estimated from LSE at Listing 6.1. The reason is MLE is equal to LSE when we pick the normal distribution likelihood, see Section 4 for more details.

rnum0 <- params[1] * S0 / params[3]
rnum0
    beta 
1.280075