8  Maximum likelihood estimation

8.1 Likelihood

\[\mathbb{L}(\theta | x) = \mathbb{P}(x | \theta)\]

The likelihood function is the probability of observing the data \(x\) assuming that \(\theta\) are the true parameters.
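
As a quick illustration (an assumed example, not from the original text), suppose we observe \(x = 7\) successes out of 10 trials. Evaluating the binomial mass function at two candidate values of \(\theta\) gives the likelihood of each:

dbinom(x = 7, size = 10, prob = 0.5) # L(theta = 0.5 | x), about 0.117
dbinom(x = 7, size = 10, prob = 0.7) # L(theta = 0.7 | x), about 0.267, so theta = 0.7 is more plausible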

8.2 Methods

Maximum likelihood estimation follows 4 steps:

  • Step 1. Write down the likelihood function \(\mathbb{L}(\theta)\). The likelihood function is a function of \(\theta\) and is the product of the \(n\) mass/density function terms evaluated at the observed values \(y_i\).

\[\mathbb{L}(\theta) = \prod_{i = 1}^{n} f_Y(y_i| \theta)\]

  • Step 2. Take the natural log of the likelihood, \(\log \mathbb{L}(\theta)\), known as the log-likelihood. The product in the likelihood now becomes a sum in the log-likelihood. See Section 1.2.2 for more details.

\[\log \mathbb{L}(\theta) = \sum_{i = 1}^{n} \log \left[ f_Y(y_i| \theta) \right]\]

  • Step 3. Take the negative log-likelihood \(-\log \mathbb{L}(\theta)\). Because optimizers work by minimizing a function, we have to work with the negative log-likelihood. See Section 1.2.3 for more details.

\[-\log \mathbb{L}(\theta) = - \sum_{i = 1}^{n} \log \left[ f_Y(y_i| \theta) \right]\]

  • Step 4. The optimizer tries different values of \(\theta\) in the parameter space \(\Theta\) until it finds the one for which the negative log-likelihood \(-\log \mathbb{L}(\theta)\) is minimized. A minimal worked example of these 4 steps is sketched below.
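
Below is a minimal, self-contained sketch of the 4 steps (an illustrative example with simulated data, not part of the original analysis), estimating the rate \(\lambda\) of a Poisson sample:

set.seed(42)
y <- rpois(100, lambda = 4.2) # simulated count data

# Steps 1-3: write the negative log-likelihood as a function of theta = lambda
nll <- function(log_lambda) {
  lambda <- exp(log_lambda) # optimize on the log scale to keep lambda positive
  - sum(dpois(y, lambda = lambda, log = TRUE))
}

# Step 4: the optimizer minimizes the negative log-likelihood
fit <- optim(par = log(1), fn = nll, method = "Brent", lower = log(0.01), upper = log(100))
exp(fit$par) # MLE of lambda, close to mean(y)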

8.3 Statistical distributions for fitting

Choosing the right distribution depends on the nature of your data (a small comparison example is sketched after the lists below):

Count data (number of cases)

  • Poisson: when mean \(\approx\) variance.
  • Negative Binomial: if data are overdispersed (variance > mean).
  • Zero-Inflated Poisson/Negative Binomial: if there are excess zeros.

Non-negative continuous (positive real)

  • Gamma: often used for waiting times or skewed data.
  • Weibull: used in reliability and survival analysis.
  • Log-Normal: when the log of the data is normally distributed.

Binary or proportions (0 or 1, or strictly between)

  • Binomial: if data are 0 or 1, or aggregated counts out of a total.
  • Beta: if values lie strictly in (0, 1).

Unbounded continuous (all real numbers)

  • Gaussian (Normal): commonly assumed for symmetric data.
  • Student’s t: if data have heavier tails or potential outliers.
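
As a rough sketch of how candidate distributions can be compared in practice (an assumed example with simulated data, not from the original text), each can be fitted by maximum likelihood, for example with MASS::fitdistr(), and compared by AIC:

library(MASS)

set.seed(1)
counts <- rnbinom(200, size = 2, mu = 5) # simulated overdispersed count data

c(mean = mean(counts), variance = var(counts)) # variance well above the mean

fit_pois <- fitdistr(counts, "Poisson")
fit_nb <- fitdistr(counts, "negative binomial")
AIC(fit_pois, fit_nb) # the lower AIC favours the negative binomial here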

8.4 Code

Data

Download influenza incidence data here:

We will use the same data from the H1N1 influenza outbreak in an elementary school as in Section 7.2 and fit the deterministic SEIR model using the code in Listing 2.1.

library(tidyr)
library(odin)
library(bbmle)
library(ggplot2)
library(glue)
library(ggsci)
library(RColorBrewer)

df <- read.csv("data/flu_inc.csv")

According to the original paper, the school has 370 students (Cauchemez et al., 2011), so we set the initial values as below.

S0 <- 369
E0 <- 0
I0 <- 1
R0 <- 0
  • Step 1. Write down the likelihood function \(L(\theta)\).

We assume the incidence data are generated from a normal distribution whose mean \(\mu_{inc}\) is the model-predicted incidence at each time point and whose standard deviation is \(\sigma_{inc}\).

\[L(\theta) = \prod_{i = 1}^{n} \mathcal{N}(y_i \mid \mu_{inc}, \sigma_{inc})\]

We use the dnorm() function to define this likelihood.

dnorm(x = data, mean = pred, sd = sd_inc)
  • Step 2. Take the natural log of the likelihood, \(\log L(\theta)\); the product becomes a sum.

\[\log L(\theta) = \sum_{i = 1}^{n} \log \left[ \mathcal{N}(y_i \mid \mu_{inc}, \sigma_{inc}) \right]\]

sum(dnorm(x = data, mean = pred, sd = sd_inc, log = TRUE))
  • Step 3. Take the negative log-likelihood \(-\log L(\theta)\).

\[-\log L(\theta) = - \sum_{i = 1}^{n} \log \left[ \mathcal{N}(y_i \mid \mu_{inc}, \sigma_{inc}) \right]\]

- sum(dnorm(x = data, mean = pred, sd = sd_inc, log = TRUE))
  • Step 4. Pass the negative log-likelihood \(-\log L(\theta)\) to the optimizer.

Root mean squared error as SD

mll <- function(beta, sigma, gamma, sd_inc) {
  # Make sure that parameters are positive
  beta <- exp(beta) 
  sigma <- exp(sigma)
  gamma <- exp(gamma)
  sd_inc <- exp(sd_inc)
  
  pred <- seir_mod(beta = beta, sigma = sigma, gamma = gamma, S0 = S0, E0 = E0, I0 = I0, R0 = R0, times = 0:(length(df$inc) - 1))
  pred <- pred$Inc
  # Return the negative log-likelihood
  - sum(dnorm(x = df$inc, mean = pred, sd = sd_inc, log = TRUE))
}

starting_param_val <- list(beta = 0.004, sigma = 0.5, gamma = 0.5, sd_inc = 3)
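
Before optimizing, it can be useful to check that the negative log-likelihood is finite at the starting values (this quick check is an addition, not part of the original code):

# The NLL should evaluate to a single finite number at the starting values
do.call(mll, lapply(starting_param_val, log))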

estimates <- mle2(minuslogl = mll, start = lapply(starting_param_val, log), method = "Nelder-Mead")

summary(estimates)
Maximum likelihood estimation

Call:
mle2(minuslogl = mll, start = lapply(starting_param_val, log), 
    method = "Nelder-Mead")

Coefficients:
        Estimate Std. Error  z value     Pr(z)    
beta   -5.063479   0.092641 -54.6567 < 2.2e-16 ***
sigma   0.708384   0.134498   5.2669 1.388e-07 ***
gamma   0.600399   0.092968   6.4581 1.060e-10 ***
sd_inc  0.530870   0.099628   5.3285 9.902e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

-2 log L: 198.8602 
params <- exp(coef(estimates))
params
       beta       sigma       gamma      sd_inc 
0.006323524 2.030707475 1.822846603 1.700410622 
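
The estimates and standard errors reported by summary() are on the log scale. One possible way (not shown in the original) to get approximate 95% confidence intervals on the natural scale is to exponentiate the profile-likelihood intervals from bbmle:

# Profile-likelihood intervals on the log scale, back-transformed to the natural scale
exp(confint(estimates))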

Compare the model with data.

pred <- seir_mod(beta = params[1], sigma = params[2], gamma = params[3], S0 = S0, E0 = E0, I0 = I0, R0 = R0, times = df$day)
df_plot <- pred[,c("t", "Inc")]

# Compute a 95% prediction interval around the fitted incidence
lwr <- qnorm(p = 0.025, mean = pred$Inc, sd = params[4])
upr <- qnorm(p = 0.975, mean = pred$Inc, sd = params[4])

my_palette <- brewer.pal(11, "PuOr")[c(10, 1, 4, 3, 8)]

ggplot(df_plot, aes(x = t, y = Inc)) +
  geom_line(color = my_palette[3], linewidth = 1.2) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = my_palette[3], alpha = 0.15) +
  geom_point(data = df, aes(x = day, y = inc)) +
  coord_cartesian(ylim = c(0, max(upr))) +
  labs(x = "Day", y = "Incidence") +
  theme_minimal()

It looks very similar to the model we fitted with LSE in Figure 7.1.

Let us compute \(R_0\). For a “closed population” SEIR model, \(R_0 = \frac{\beta}{\gamma} S_0\). Again, this is similar to the \(R_0\) estimated with LSE in Listing 7.1. The reason is that MLE is equivalent to LSE when we choose a normal likelihood; see Section 1.2.4 for more details.

rnum0 <- params[1] * S0 / params[3]
rnum0
    beta 
1.280075
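
To get a rough sense of the uncertainty around \(R_0\) (an additional sketch, not part of the original analysis), one can draw parameter sets from the asymptotic multivariate normal distribution of the log-scale estimates and propagate them through the \(R_0\) formula:

library(MASS)

set.seed(123)
# Sample log-scale parameter vectors from the asymptotic normal approximation
draws <- mvrnorm(n = 10000, mu = coef(estimates), Sigma = vcov(estimates))
r0_draws <- exp(draws[, "beta"]) * S0 / exp(draws[, "gamma"])
quantile(r0_draws, probs = c(0.025, 0.975)) # approximate 95% interval for R0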