library(tidyr)
library(odin)
library(bbmle)
library(ggplot2)
library(glue)
library(ggsci)
library(RColorBrewer)
7 Maximum likelihood estimation
7.1 Likelihood
\[\mathbb{L}(\theta | x) = \mathbb{P}(x | \theta)\]
7.2 Methods
The maximum likelihood estimation follows 4 steps:
- Step 1. Write down the likelihood function \(L(\theta)\). Likelihood function is a function of \(\theta\), which is the product of the \(n\) mass/density function terms of the observed values \(y_i\).
\[\mathbb{L}(\theta) = \prod_{i = 1}^{n} f_Y(y_i| \theta)\]
- Step 2. Take the natural log of the likelihood \(log L(\theta)\), or log-likelihood. The production in likelihood now becomes the sum in log-likelihood. See Section 2.2 for more details.
\[log \mathbb{L}(\theta) = \sum_{i = 1}^{n} log \left[ f_Y(y_i| \theta) \right]\]
- Step 3. Take the negative log-likelihood \(- log L(\theta)\). Because optimizers work by minimizing a function, we have to work on negative log-likelihood. See Section 2.3 for more details.
\[- log \mathbb{L}(\theta) = - \sum_{i = 1}^{n} log \left[ f_Y(y_i| \theta) \right]\]
- Step 4. The optimizer try different values of \(\theta\) in the paramater space \(\Theta\) for which negative log-likelihood \(- log \mathbb{L}(\theta)\) is minimized.
7.3 Statistical distributions for fitting
Choosing the right distribution depends on your data type.
7.3.1 Number of cases
7.3.1.1 Poisson distribution
\(Pois(\lambda)\) is the probability of observing \(k\) events occurring in a fixed interval of time if these events occur with a known constant mean rate \(\lambda\) and independently of the time since the last event:
\[\mathbb{P}(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
- \(\text{Mean} = \text{Variance} = \lambda\).
\(Pois(\lambda)\) is simple as it only requires one parameter, \(\lambda\). However, a limitation is that the mean equals the variance (\(\text{mean} = \text{variance} = \lambda\)), which may not match your data if it has more variability (overdispersion).
Below is examples of \(Pois(\lambda = 4)\) and \(Pois(\lambda = 10)\).
Code
<- data.frame(k = 0:20,
df_plot prob04 = dpois(0:20, 4),
prob10 = dpois(0:20, 10))
<- pivot_longer(df_plot,
df_plot contains("prob"),
names_to = "rate",
values_to = "prob")
<- data.frame(
ann_line_4 rate = "prob04",
x_mean = 4,
y_mean = 0,
yend_mean = max(dpois(0:20, 4)),
x_var = 2,
xend_var = 6,
y_var = 0
)
<- data.frame(
ann_line_10 rate = "prob10",
x_mean = 10,
y_mean = 0,
yend_mean = max(dpois(0:20, 10)),
x_var = 5,
xend_var = 15,
y_var = 0
)
ggplot(df_plot, aes(x = k, y = prob, color = rate)) +
geom_point() +
geom_line() +
geom_segment(data = ann_line_4,
aes(x = x_mean, y = y_mean, yend = yend_mean),
linetype = "dashed") +
geom_segment(
data = ann_line_4,
aes(x = x_var, xend = xend_var, y = y_var),
arrow = arrow(length = unit(0.03, "npc"), ends = "both")
+
) geom_segment(data = ann_line_10,
aes(x = x_mean, y = y_mean, yend = yend_mean),
linetype = "dashed") +
geom_segment(
data = ann_line_10,
aes(x = x_var, xend = xend_var, y = y_var),
arrow = arrow(length = unit(0.03, "npc"), ends = "both")
+
) facet_grid( ~ rate, labeller = as_labeller(c(prob04 = "lambda == 4", prob10 = "lambda == 10"), default = label_parsed)) +
scale_x_continuous(breaks = seq(0, 20, 2)) +
scale_color_lancet() +
labs(y = "P(X = k)") +
theme_bw() +
theme(legend.position = "none")
7.3.1.2 Binomial distribution
Imagine a sequence of independent Bernoulli trials: each trial has two potential outcomes called “success” (with probability \(p\)) and “failure” (with probability \(1 - p\)). \(B(n, p)\) is the probability of getting exactly \(k\) successes with success probability \(p\) in a sequence of \(n\) independent Bernoulli trials:
\[\mathbb{P}(X = k) = {n \choose k} p^k (1 - p)^{n - k} \tag{7.1}\]
- \(\text{Mean} = \text{Median} = \text{Mode} = np\).
- \(\text{Variance} = np(1 - p)\).
The Poisson distribution is a limiting case of a Binomial distribution when the number of trials \(n\) is large and the success probability \(p\) is small. If \(n \geq 100\) and \(np \leq 10\) (meaning \(p \leq 0.1\)), the Poisson distribution with \(\lambda = np\) is a good approximation of the Binomial distribution (Oxford College of Emory University, n.d.).
Let’s start with \(B(n, p)\): the probability of getting \(k\) successes with success probability \(p\) in a sequence of \(n\) independent Bernoulli trials with Equation 7.1.
\[\mathbb{P}(X = k) = {n \choose k} p^k (1 - p)^{n - k}\]
The expected value, or the mean number of successes is \(np\). Let say this is equal to the mean number of observing events in \(Pois(\lambda)\) which is \(\lambda\):
\[np = \lambda \Leftrightarrow p = \frac{\lambda}{n}\]
Substitute this into Equation 7.1 becomes:
\[\mathbb{P}(X = k) = {n \choose k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k}\]
7.3.1.3 Negative binomial distribution
Imagine a sequence of independent Bernoulli trials: each trial has two potential outcomes called “success” (with probability \(p\)) and “failure” (with probability \(1 - p\)). We keep observing the trials until exactly \(r\) successes occur. \(NB(r, p)\) (or Pascal) distribution is the probability of getting \(k\) failures until \(r\) successes occurs in a sequence of \(n\) independent Bernoulli trials:
Binomial distribution gives the probability of \(k\) successes in a fixed number of \(n\) trials.
Negative binomial distribution gives the probability of \(k\) failures, given that we have \(r\) successes in \(n\) trials.
The Negative binomial has two parameters: \(r\), the number of successes, and \(p\), the probability of success. Its key advantage is that it allows for variance greater than the mean, which makes it suitable for overdispersed data where variability exceeds the average.
7.4 Code
We will use the same data of a H1N1 influenza outbreak in an elementary school in Section 6.2 and fit the deterministic SEIR model with code Listing 2.1.
<- read.csv("data/flu_inc.csv") df
Looking at the original paper, the school has 370 students (Cauchemez et al., 2011). So we set the initial values as below.
<- 369
S0 <- 0
E0 <- 1
I0 <- 0 R0
- Step 1. Write down the likelihood function \(L(\theta)\).
We assume the incidence data is generated from a normal distribution with mean \(\mu_{inc}\) is the predicted incidence and a standard deviation \(\sigma_{inc}\).
\[L(\theta) = \prod_{i = 1}^{n} \mathcal{N}(\mu_{inc}, \sigma_{inc})\]
We use the dnorm()
function to define this likelihood.
dnorm(x = data, mean = pred, sd = sd_inc)
- Step 2. Take the natural log of the likelihood \(log L(\theta)\), product becomes sum.
\[log L(\theta) = \sum_{i = 1}^{n} log \left[ \mathcal{N}(\mu_{inc}, \sigma_{inc}) \right]\]
sum(dnorm(x = data, mean = pred, sd = sd_inc, log = T))
- Step 3. Take the negative log-likelihood \(- log L(\theta)\).
\[- log L(\theta) = - \sum_{i = 1}^{n} log \left[ \mathcal{N}(\mu_{inc}, \sigma_{inc}) \right]\]
- sum(dnorm(x = data, mean = pred, sd = sd_inc, log = T))
- Step 4. Pass the negative log-likelihood \(- log L(\theta)\) to the optimizer.
<- function(beta, sigma, gamma, sd_inc) {
mll # Make sure that parameters are positive
<- exp(beta)
beta <- exp(sigma)
sigma <- exp(gamma)
gamma <- exp(sd_inc)
sd_inc
<- seir_mod(beta = beta, sigma = sigma, gamma = gamma, S0 = S0, E0 = E0, I0 = I0, R0 = R0, times = 0:(length(df$inc) - 1))
pred <- pred$Inc
pred # Return the negative log-likelihood
- sum(dnorm(x = df$inc, mean = pred, sd = sd_inc, log = T))
}
<- list(beta = 0.004, sigma = 0.5, gamma = 0.5, sd_inc = 3)
starting_param_val
<- mle2(minuslogl = mll, start = lapply(starting_param_val, log), method = "Nelder-Mead")
estimates
summary(estimates)
Maximum likelihood estimation
Call:
mle2(minuslogl = mll, start = lapply(starting_param_val, log),
method = "Nelder-Mead")
Coefficients:
Estimate Std. Error z value Pr(z)
beta -5.063479 0.092641 -54.6567 < 2.2e-16 ***
sigma 0.708384 0.134498 5.2669 1.388e-07 ***
gamma 0.600399 0.092968 6.4581 1.060e-10 ***
sd_inc 0.530870 0.099628 5.3285 9.902e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-2 log L: 198.8602
<- exp(coef(estimates))
params params
beta sigma gamma sd_inc
0.006323524 2.030707475 1.822846603 1.700410622
Compare the model with data.
Code
<- seir_mod(beta = params[1], sigma = params[2], gamma = params[3], S0 = S0, E0 = E0, I0 = I0, R0 = R0, times = df$day)
pred <- pred[,c("t", "Inc")]
df_plot
# Compute 95% confidence interval
<- qnorm(p = 0.025, mean = pred$Inc, sd = params[4])
lwr <- qnorm(p = 0.975, mean = pred$Inc, sd = params[4])
upr
<- brewer.pal(11, "PuOr")[c(10, 1, 4, 3, 8)]
my_palette
ggplot(df_plot, aes(x = t, y = Inc)) +
geom_line(color = my_palette[3], linewidth = 1.2) +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = my_palette[3], alpha = 0.15) +
geom_point(data = df, aes(x = day, y = inc)) +
coord_cartesian(ylim = c(0, max(upr))) +
labs(x = "Day", y = "Incidence") +
theme_minimal()
It looks very similar to the model we fitted with LSE at Figure 6.1.
Let compute \(R_0\). For a “closed population” SEIR model, \(R_0 = \frac{\beta}{\gamma} S_0\). Again, this is similar to the \(R_0\) estimated from LSE at Listing 6.1. The reason is MLE is equal to LSE when we pick the normal distribution likelihood, see Section 4 for more details.
<- params[1] * S0 / params[3]
rnum0 rnum0
beta
1.280075