Model fitting
Model fitting, or calibration, is the process of adjusting a model’s parameters so that the model’s predictions align as closely as possible with real-world observations.
In simple terms, you run the model with a given set of parameters and compare its simulated predictions with observed data points. Consider the example of an SEIR model with the following differential equations:
\[\frac{dS}{dt} = -\beta SI\]
\[\frac{dE}{dt} = \beta SI - \sigma E\]
\[\frac{dI}{dt} = \sigma E - \gamma I\]
\[\frac{dR}{dt} = \gamma I\]
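If you want to reproduce the curve outside the interactive plot, the system can be integrated numerically. Below is a minimal sketch using scipy's `solve_ivp`; the initial conditions, time grid and parameter values are illustrative assumptions (not the ones behind the plot), the state variables are treated as proportions of the population, and daily incidence is taken here as \(\sigma E\), the rate of new infectious cases.

```python
import numpy as np
from scipy.integrate import solve_ivp

def seir_rhs(t, y, beta, sigma, gamma):
    """Right-hand side of the SEIR equations above; S, E, I, R are proportions."""
    S, E, I, R = y
    return [-beta * S * I,
            beta * S * I - sigma * E,
            sigma * E - gamma * I,
            gamma * I]

def simulate_incidence(beta, sigma, gamma, n_days, y0):
    """Integrate the model and return the predicted daily incidence, sigma * E(t)."""
    t_eval = np.arange(n_days)
    sol = solve_ivp(seir_rhs, (0, n_days - 1), y0,
                    args=(beta, sigma, gamma), t_eval=t_eval)
    return sigma * sol.y[1]

y0 = [0.999, 0.001, 0.0, 0.0]   # assumed: one exposed case per 1,000 people
incidence = simulate_incidence(beta=0.5, sigma=0.2, gamma=0.1, n_days=60, y0=y0)
```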
Below you’ll find an interactive plot. The dots represent observed data points, and the line shows the model’s predicted incidence cases. Drag the sliders for \(\beta\), \(\sigma\) and \(\gamma\) and watch how the curve responds, aiming to pull it through the dots. The readout tells you how well the current parameters score - smaller is better on both the residual sum of squares (RSS) and the negative log-likelihood (\(-\log L\)). Click Optimize to jump to the values a computer picks.
Methods
Model fitting methods typically include (Rui et al., 2024):
- Least squares estimation (LSE)
- Maximum likelihood estimation (MLE), with optimisation routines such as:
  - Nelder-Mead
  - Simulated annealing
  - MCMC
  - Newton-Raphson
- Bayesian methods
Least squares estimation
For each day, the residual is the gap between the data point and the model’s prediction. Least squares penalises squared residuals, so a single large miss is much worse than many small ones. The blue squares below have side length equal to the residual — their total area is the RSS. Slide the parameters until the squares shrink.
\[RSS = \sum_{i = 1}^{n}[y_i - f(x_i)]^2 \tag{1}\]
Least squares estimation corresponds to maximum likelihood estimation when the noise is normally distributed with equal variances.
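As a concrete sketch of Equation 1, the code below computes the residuals between an observed incidence series and the model's prediction and lets scipy's `least_squares` minimise their sum of squares. It reuses `simulate_incidence` and `y0` from the earlier sketch; since the real observations are not reproduced here, a synthetic series is generated from assumed "true" parameters plus Gaussian noise.

```python
from scipy.optimize import least_squares

rng = np.random.default_rng(42)
true_params = (0.5, 0.2, 0.1)                       # assumed "true" beta, sigma, gamma
y_obs = simulate_incidence(*true_params, n_days=60, y0=y0) \
        + rng.normal(0.0, 5e-4, size=60)            # synthetic noisy observations

def residuals(params):
    """Vector of residuals y_i - f(x_i); least_squares minimises the sum of their squares."""
    beta, sigma, gamma = params
    return y_obs - simulate_incidence(beta, sigma, gamma, n_days=60, y0=y0)

fit_lse = least_squares(residuals, x0=[0.3, 0.3, 0.2], bounds=(0, np.inf))
rss_value = np.sum(fit_lse.fun ** 2)                # the RSS at the fitted parameters
print(fit_lse.x, rss_value)
```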
Maximum likelihood estimation
Instead of minimising squared distance, maximum likelihood asks: if our model were correct, how probable is the data we actually observed? For count data like incidence, a natural choice is the Poisson distribution — each day’s prediction \(\hat{y}_i\) plays the role of the Poisson mean \(\lambda_i\), and the probability of observing exactly \(y_i\) cases is
\[P(Y_i = y_i \mid \lambda_i) = \frac{\lambda_i^{y_i}\, e^{-\lambda_i}}{y_i!}\]
The purple lollipops below show each day’s Poisson PMF, with one stick per integer count. The solid stick sits at the observed \(y_i\) — that height is its likelihood. Summing the log-likelihoods across days (and flipping sign) gives the negative log-likelihood, which optimisers minimise.
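Here is a sketch of that calculation: each day's observed count is scored against a Poisson distribution whose mean is the model's predicted count. Because the earlier sketches work in proportions, the predictions are scaled to counts by an assumed population size `N = 10_000`, and the "observed" counts are again synthetic stand-ins generated from the assumed true parameters.

```python
from scipy.stats import poisson

N = 10_000                                          # assumed population size
lam_true = N * simulate_incidence(*true_params, n_days=60, y0=y0)
counts_obs = rng.poisson(lam_true)                  # synthetic observed daily counts

def poisson_nll(params):
    """Negative log-likelihood: -sum_i log P(Y_i = y_i | lambda_i)."""
    beta, sigma, gamma = params
    lam = N * simulate_incidence(beta, sigma, gamma, n_days=60, y0=y0)
    lam = np.clip(lam, 1e-12, None)                 # guard against log(0) for tiny predictions
    return -np.sum(poisson.logpmf(counts_obs, lam))

print(poisson_nll([0.5, 0.2, 0.1]))                 # smaller is better
```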
A generalization of the least squares idea is the likelihood.
Likelihood
The likelihood is the probability of observing data \(x\) given that our model is true[^2], \(\mathbb{P}(x|M)\).
Log-likelihood
We have many data points, and they are independent. The probability of observing all of them at the same time is the product of their likelihoods, \(\mathbb{P}(x_1|M) \times \mathbb{P}(x_2|M) \times \mathbb{P}(x_3|M) \times \dots = \prod\mathbb{P}(x|M)\).
Multiplying many numbers together loses precision when those numbers are small. Each likelihood is a probability (a value < 1), so multiplying 100 probabilities of around 0.1 each leaves you with a number on the order of \(10^{-100}\).
But remember that \(\log(a \times b) = \log(a) + \log(b)\), and, very conveniently, \(\log(10^{-100}) \approx -230.2585\), a perfectly manageable number.
So \(\log(\mathbb{P}(x_1|M) \times \mathbb{P}(x_2|M) \times \mathbb{P}(x_3|M) \times \dots)\) \(=\) \(\log(\mathbb{P}(x_1|M)) + \log(\mathbb{P}(x_2|M)) + \log(\mathbb{P}(x_3|M)) + \dots\), which is much easier to handle.
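A quick numerical illustration of why the log matters, assuming nothing beyond standard double-precision floating point:

```python
import numpy as np

print(np.prod(np.full(100, 0.1)))         # ~1e-100: already vanishingly small
print(np.prod(np.full(400, 0.1)))         # 0.0 -- the product underflows entirely
print(np.sum(np.log(np.full(400, 0.1))))  # -921.034..., perfectly safe on the log scale
```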
Negative log-likelihood
Optimisers in statistical packages typically work by minimising a function rather than maximising one. You can think of minimisation as shrinking the discrepancy between the model and the data towards its lowest value, close to 0 (much like the trick used in hypothesis testing).
Minimising the negative log-likelihood (that is, \(-1 \times \text{log-likelihood}\)) is equivalent to maximising the log-likelihood, which is what we want to do in MLE (maximum likelihood estimation).
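A tiny sketch of the sign flip, using a one-parameter Poisson likelihood over a grid of candidate means; the sample below is made up purely for illustration.

```python
import numpy as np
from scipy.stats import poisson

sample = np.array([3, 5, 4, 6, 2])                 # made-up observed counts
grid = np.linspace(0.5, 10, 200)                   # candidate values of the Poisson mean

loglik = np.array([poisson.logpmf(sample, lam).sum() for lam in grid])
negloglik = -loglik

# The maximiser of the log-likelihood and the minimiser of its negative coincide.
assert grid[np.argmax(loglik)] == grid[np.argmin(negloglik)]
print(grid[np.argmax(loglik)])                     # close to the sample mean, 4.0
```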
When MLE is equal to LSE
Assume each data point \(y_i\) is generated from a normal distribution with mean \(\mu_i\) (the model's prediction) and a common standard deviation \(\sigma\).
- Step 1. Write down the likelihood function \(L(\theta)\).
\[L(\theta) = \prod_{i = 1}^{n} \mathcal{N}(\mu_i, \sigma) = \prod_{i = 1}^{n} \frac{1}{\sigma\sqrt{2 \pi}} \text{exp} \left[ \frac{-1}{2 \sigma^2} (y_i - \mu_i)^2 \right]\]
\[L(\theta) = \frac{1}{(\sigma \sqrt{2 \pi})^n} \text{exp} \left[ \frac{-1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 \right]\]
Since \(\sigma\) and \(\sqrt{2 \pi}\) do not depend on the parameters we are fitting, the prefactor is just a constant:
\[L(\theta) \propto \text{exp} \left[ \frac{-1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 \right]\]
- Step 2. Take the natural log of the likelihood, \(\log L(\theta)\).
\[\log L(\theta) = -\frac{1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 + \text{const}\]
- Step 3. Take the negative log-likelihood, \(-\log L(\theta)\).
\[-\log L(\theta) = \frac{1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 + \text{const}\]
Apart from the constant factor \(\frac{1}{2\sigma^2}\) and the additive constant, this is exactly the residual sum of squares in Equation 1.
- Step 4. The optimiser minimises the negative log-likelihood \(-\log L(\theta)\). Because \(\sigma\) is fixed, it finds the same optimal parameters as LSE does by minimising the RSS.
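To see this numerically, the sketch below evaluates both objectives for the SEIR fit over a one-dimensional slice in \(\beta\), holding \(\sigma\), \(\gamma\) and the noise standard deviation fixed at the assumed values used to generate the synthetic data above, and checks that they are minimised at the same point. It reuses `simulate_incidence`, `y_obs` and `y0` from the earlier sketches.

```python
from scipy.stats import norm

sd = 5e-4                                           # assumed, fixed noise standard deviation
betas = np.linspace(0.3, 0.7, 81)                   # slice through parameter space in beta

def rss_at(beta):
    y_hat = simulate_incidence(beta, 0.2, 0.1, n_days=60, y0=y0)
    return np.sum((y_obs - y_hat) ** 2)

def gaussian_nll_at(beta):
    y_hat = simulate_incidence(beta, 0.2, 0.1, n_days=60, y0=y0)
    return -np.sum(norm.logpdf(y_obs, loc=y_hat, scale=sd))

rss_vals = np.array([rss_at(b) for b in betas])
nll_vals = np.array([gaussian_nll_at(b) for b in betas])

# With the noise standard deviation fixed, both curves are minimised at the same beta.
print(betas[np.argmin(rss_vals)], betas[np.argmin(nll_vals)])
```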
Nelder-Mead
Nelder-Mead is a numerical algorithm for finding the minimum (or maximum) of a multidimensional function. It is a direct search method: it evaluates the function and compares the values against others. It does not use derivatives.
It searches using a simplex. A simplex is a geometric shape with \(n + 1\) vertices in \(n\) dimensions (a triangle in two dimensions, a tetrahedron in three); at each step the algorithm reflects, expands, contracts or shrinks the simplex to move it towards lower function values.
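A sketch of a Nelder-Mead fit with scipy, minimising the Poisson negative log-likelihood `poisson_nll` defined in the earlier sketch; the starting values and tolerances are arbitrary choices.

```python
from scipy.optimize import minimize

fit_nm = minimize(poisson_nll, x0=[0.3, 0.3, 0.2], method="Nelder-Mead",
                  options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 2000})

beta_hat, sigma_hat, gamma_hat = fit_nm.x
print(fit_nm.x, fit_nm.fun)   # fitted (beta, sigma, gamma) and the NLL at the optimum
```

Because Nelder-Mead only needs function values, it is convenient when the objective involves an ODE simulation and derivatives are awkward to obtain, though it can be slow and sensitive to the starting point.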
Uncertainty
Uncertainties in the parameter estimates are described by confidence intervals (Bolker, 2008). A confidence interval \([\sigma_i^-, \sigma_i^+]\) for a parameter estimate \(\hat{\theta}_i\) at confidence level \(1 - \alpha\) signifies that the true value \(\theta_i^*\) is located within this interval with probability \(1 - \alpha\) (Bolker, 2008).
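One simple, general way to attach confidence intervals to the fit above is a parametric bootstrap: resample synthetic data from the fitted model, refit, and take percentiles of the refitted parameters. The sketch below reuses the Poisson setup and `fit_nm` from the earlier sketches; the number of replicates and the 95% level are arbitrary choices, and the replicate count is kept small for speed.

```python
n_boot = 200
lam_hat = np.clip(N * simulate_incidence(*fit_nm.x, n_days=60, y0=y0), 1e-12, None)

def nll_for(counts):
    """Poisson negative log-likelihood for a given (resampled) count series."""
    def nll(params):
        lam = np.clip(N * simulate_incidence(*params, n_days=60, y0=y0), 1e-12, None)
        return -np.sum(poisson.logpmf(counts, lam))
    return nll

boot_estimates = []
for _ in range(n_boot):
    counts_boot = rng.poisson(lam_hat)              # resample data from the fitted model
    refit = minimize(nll_for(counts_boot), x0=fit_nm.x, method="Nelder-Mead")
    boot_estimates.append(refit.x)

lower, upper = np.percentile(boot_estimates, [2.5, 97.5], axis=0)
print(lower, upper)                                 # approximate 95% CIs for beta, sigma, gamma
```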