Model fitting

Model fitting, or calibration, is the process of adjusting a model’s parameters so that the model’s predictions align as closely as possible with real-world observations.

In simple terms, you run the model with a given set of parameters and compare its simulated predictions with observed data points. Consider the example of an SEIR model with the following differential equations:

\[\frac{dS}{dt} = -\beta SI\]

\[\frac{dE}{dt} = \beta SI - \sigma E\]

\[\frac{dI}{dt} = \sigma E - \gamma I\]

\[\frac{dR}{dt} = \gamma I\]

Below you’ll find an interactive plot. The dots represent observed data points, and the line shows the model’s predicted incidence cases. You can adjust the values of \(\beta\), \(\sigma\) and \(\gamma\) using the controls. Your task is to change these parameters and see how the line shifts. Keep tweaking them until the line closely follows the dots - this means the model’s predictions align well with the observed data.
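If you prefer code to sliders, the same exercise can be sketched in a few lines. Below is a minimal Python sketch (using numpy and scipy; the parameter values, the initial state and the choice of \(\sigma E\) as the incidence are illustrative assumptions, and this is not the code behind the interactive plot) that simulates the SEIR system and produces the predicted incidence curve you would compare against the observed dots.

```python
# A minimal sketch (illustrative, not the code behind the interactive plot):
# simulate the SEIR system above for given beta, sigma, gamma and compute
# the predicted incidence curve to compare against observed data points.
import numpy as np
from scipy.integrate import solve_ivp

def seir(t, y, beta, sigma, gamma):
    S, E, I, R = y
    dS = -beta * S * I
    dE = beta * S * I - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I
    return [dS, dE, dI, dR]

# Illustrative parameter values and initial state (proportions of the population)
beta, sigma, gamma = 0.5, 0.2, 0.1
y0 = [0.99, 0.01, 0.0, 0.0]
t_eval = np.arange(0, 160)

sol = solve_ivp(seir, (0, 160), y0, args=(beta, sigma, gamma), t_eval=t_eval)
S, E, I, R = sol.y

# One common choice of predicted incidence: the flow from E to I, sigma * E
predicted_incidence = sigma * E
```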

Methods

Model fitting methods typically include (Rui et al., 2024):

  • Least squares estimation (LSE)
  • Maximum likelihood estimation (MLE), with optimisation algorithms such as:
    • Nelder-Mead
    • Simulated annealing
    • MCMC
    • Newton-Raphson
  • Bayesian methods

Least squares estimation

\[RSS = \sum_{i = 1}^{n}[y_i - f(x_i)]^2 \tag{1}\]

Least squares estimation corresponds to maximum likelihood estimation when the noise is normally distributed with equal variances.
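As a concrete sketch of least squares fitting (assuming Python with scipy, and reusing the hypothetical `seir` function from the SEIR sketch above; the "observed" data are simulated here purely for illustration, and the derivative-free Nelder-Mead option used below is discussed later), the optimiser simply searches for the parameters that minimise the RSS of Equation 1:

```python
# A minimal least-squares sketch, reusing the hypothetical `seir` function
# from the SEIR sketch above. The "observed" incidence is simulated here
# purely for illustration: model output plus a little noise.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def predicted_incidence(params, t):
    beta, sigma, gamma = params
    y0 = [0.99, 0.01, 0.0, 0.0]                      # illustrative initial state
    sol = solve_ivp(seir, (t[0], t[-1]), y0,
                    args=(beta, sigma, gamma), t_eval=t)
    return sigma * sol.y[1]                           # incidence taken as sigma * E

def rss(params, t, observed):
    residuals = observed - predicted_incidence(params, t)
    return np.sum(residuals ** 2)                     # Equation 1

# Hypothetical observed data: the "true" model plus noise
rng = np.random.default_rng(0)
t_obs = np.arange(0, 160, 5.0)
incidence_obs = predicted_incidence([0.5, 0.2, 0.1], t_obs) \
    + rng.normal(scale=5e-4, size=t_obs.size)

fit = minimize(rss, x0=[0.4, 0.3, 0.2], args=(t_obs, incidence_obs),
               method="Nelder-Mead")
print(fit.x)   # fitted (beta, sigma, gamma)
```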

Maximum likelihood estimation

A generalization of the least squares idea is maximum likelihood estimation, which is built on the likelihood.

Likelihood

The likelihood is the probability of observing data \(x\) given that our model is true[^2], \(\mathbb{P}(x|M)\).

Log-likelihood

We usually have many data points, and we assume they are independent. The probability of observing all of them at the same time is the product of their likelihoods: \(\mathbb{P}(x_1|M) \times \mathbb{P}(x_2|M) \times \mathbb{P}(x_3|M) \times ... = \prod\mathbb{P}(x|M)\).

Multiplying many probabilities together quickly loses numerical precision, because each probability is a value < 1: multiplying 100 probabilities of around 0.1 gives a number of the order of 1e-100.

But remember that \(log(a \times b) = log(a) + log(b)\), and, very conveniently, \(log(10^{-100}) \approx -230.2585\), a much more manageable number.

So \(log(\mathbb{P}(x_1|M) \times \mathbb{P}(x_2|M) \times \mathbb{P}(x_3|M)...)\) \(=\) \(log(\mathbb{P}(x_1|M)) + log(\mathbb{P}(x_2|M)) + log(\mathbb{P}(x_3|M)) + ...\), which is much easier to handle numerically.
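A tiny numeric check (a Python sketch with numpy, using illustrative values, not part of the original text) makes the point:

```python
# Why we work on the log scale: the product of many probabilities underflows,
# while the sum of their logs stays comfortably within floating-point range.
import numpy as np

probs = np.full(100, 0.1)        # 100 likelihoods, each equal to 0.1

print(np.prod(probs))            # ~1e-100: tiny, and more points soon underflow to 0.0
print(np.sum(np.log(probs)))     # -230.2585...: the same information, safely representable
```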

Negative log-likelihood

Why the negative sign? Because the optimizers in statistical packages work by minimizing a function. Minimizing is a natural framing here: we are shrinking the discrepancy between the model and the data towards its lowest value, close to 0 (much like the trick used in hypothesis testing).

Minimizing the negative log-likelihood (that is, \(-1 \times \text{log-likelihood}\)) is equivalent to maximizing the log-likelihood, which is what we want to do (MLE: maximum likelihood estimation).
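As a minimal sketch (in Python with numpy/scipy; the data are simulated here just for illustration), maximum likelihood estimation then amounts to handing the negative log-likelihood to an optimiser:

```python
# A minimal MLE sketch: estimate the mean and standard deviation of normally
# distributed data by minimising the negative log-likelihood (simulated data).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=200)     # simulated observations

def neg_log_lik(params, x):
    mu, sd = params
    if sd <= 0:                                      # keep the optimiser away from invalid sd
        return np.inf
    return -np.sum(norm.logpdf(x, loc=mu, scale=sd))

fit = minimize(neg_log_lik, x0=[0.0, 1.0], args=(data,), method="Nelder-Mead")
print(fit.x)   # estimates of (mu, sd), close to the values used to simulate
```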

When MLE is equal to LSE

Assume each observation \(y_i\) is generated from a normal distribution with mean \(\mu_i\) (the model's prediction for that observation) and a common standard deviation \(\sigma\).

  • Step 1. Write down the likelihood function \(L(\theta)\).

\[L(\theta) = \prod_{i = 1}^{n} \mathcal{N}(\mu_i, \sigma) = \prod_{i = 1}^{n} \frac{1}{\sigma\sqrt{2 \pi}} \text{exp} \left[ \frac{-1}{2 \sigma^2} (y_i - \mu_i)^2 \right]\]

\[L(\theta) = \frac{1}{(\sigma \sqrt{2 \pi})^n} \text{exp} \left[ \frac{-1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 \right]\]

Since \(\sigma\) and \(\sqrt{2 \pi}\) are constants, these factors (including the \(\frac{1}{2 \sigma^2}\) inside the exponential) do not change where the maximum occurs, so:

\[L(\theta) \propto \text{exp} \left[ - \sum_{i = 1}^n (y_i - \mu_i)^2 \right]\]

  • Step 2. Take the natural log of the likelihood \(log L(\theta)\).

\[log L(\theta) \propto - \sum_{i = 1}^n (y_i - \mu_i)^2\]

  • Step 3. Take the negative log-likelihood \(- log L(\theta)\).

\[- log L(\theta) \propto \sum_{i = 1}^n (y_i - \mu_i)^2\]

This looks exactly like the residual sum of squares in Equation 1.

  • Step 4. The optimizer minimizes the negative log-likelihood \(- log L(\theta)\). This is exactly what LSE does: it finds the optimal parameters by minimizing the RSS.
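A quick numerical check of this equivalence (a Python sketch with simulated data and a simple linear model, chosen here only for illustration, not from the text): fitting once by minimising the RSS and once by minimising the normal negative log-likelihood gives essentially the same parameter estimates.

```python
# A quick numerical check (simulated data, simple linear model): minimising the
# RSS and minimising the normal negative log-likelihood give the same estimates.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)   # "observed" data

def rss(theta):
    a, b = theta
    return np.sum((y - (a + b * x)) ** 2)                 # Equation 1

def nll(theta):
    a, b, sd = theta
    if sd <= 0:
        return np.inf
    return -np.sum(norm.logpdf(y, loc=a + b * x, scale=sd))

lse_fit = minimize(rss, x0=[0.0, 0.0], method="Nelder-Mead")
mle_fit = minimize(nll, x0=[0.0, 0.0, 1.0], method="Nelder-Mead")

print(lse_fit.x)       # (a, b) from least squares
print(mle_fit.x[:2])   # the same (a, b) from MLE, which also estimates sigma
```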

Nelder-Mead

Nelder-Mead is a numerical algorithm for finding the minimum (or maximum) of a multidimensional function. It is a direct search method: it evaluates the function at a set of points and compares those values, rather than using derivatives.

It uses a simplex for the search. A simplex is a geometric figure with \(n + 1\) vertices in \(n\) dimensions (a triangle in two dimensions, a tetrahedron in three).
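As a minimal usage sketch (assuming scipy, whose `minimize` function offers a Nelder-Mead option; the toy function is an illustration, not from the text):

```python
# A minimal Nelder-Mead sketch with scipy: minimise a simple two-dimensional
# function without supplying any derivatives (toy function, illustrative only).
from scipy.optimize import minimize

def f(p):
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2   # minimum at (1, -2)

result = minimize(f, x0=[0.0, 0.0], method="Nelder-Mead")
print(result.x)   # approximately [1.0, -2.0]
```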