Calibration
Model fitting methods typically include (Rui et al., 2024):
- Least squares estimation (LSE).
- Maximum likelihood estimation (MLE).
- Root mean square error (RMSE).
- Akaike information criterion (AIC).
- Bayesian information criterion (BIC).
Least squares estimation
\[RSS = \sum_{i = 1}^{n}[y_i - f(x_i)]^2 \tag{1}\]
Least squares estimation corresponds to maximum likelihood estimation when the noise is normally distributed with equal variances.
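As a minimal sketch of the idea (not code from Rui et al., 2024), the RSS in Equation 1 can be handed directly to a general-purpose optimizer; the straight-line model \(f(x) = a + bx\) and the simulated data below are assumptions for illustration.

```python
# Minimal least-squares sketch: minimize the RSS of Equation 1.
# The linear model f(x) = a + b*x and the simulated data are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # noisy observations

def rss(params):
    a, b = params
    return np.sum((y - (a + b * x)) ** 2)  # residual sum of squares

fit = minimize(rss, x0=[0.0, 0.0])
print(fit.x)  # estimated intercept and slope, close to (2.0, 0.5)
```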
Maximum likelihood estimation
A generalization of the least squares idea is the likelihood.
Likelihood
The likelihood is the probability of observing data \(x\) given that our model is true[^2], \(\mathbb{P}(x|M)\).
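For concreteness, here is a small sketch of evaluating \(\mathbb{P}(x|M)\) when the assumed model \(M\) is a normal distribution; the mean and standard deviation below are illustrative values, not parameters from the analysis.

```python
# Likelihood of individual observations under an assumed model M:
# here M is Normal(mean=5, sd=2); these values are purely illustrative.
from scipy.stats import norm

x = [4.2, 5.1, 6.3]
likelihoods = norm.pdf(x, loc=5.0, scale=2.0)  # P(x_i | M) for each data point
print(likelihoods)
```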
Log-likelihood
We have many data points, and they are independent. The probability of observing all of them at the same time is therefore the product of the individual likelihoods, \(\mathbb{P}(x_1|M) \times \mathbb{P}(x_2|M) \times \mathbb{P}(x_3|M) \times ... = \prod_i\mathbb{P}(x_i|M)\).
Multiplying many numbers together loses precision when the numbers are small. Here we are dealing with probabilities (values below 1): multiplying 100 probabilities of about 0.1 each gives roughly 1e-100, and with more data points the product can underflow entirely.
But remember that \(log(a \times b) = log(a) + log(b)\), and, very conveniently, \(log(10^{-100}) \approx -230.2585\).
So \(log(\mathbb{P}(x_1|M) \times \mathbb{P}(x_2|M) \times \mathbb{P}(x_3|M)...) = log(\mathbb{P}(x_1|M)) + log(\mathbb{P}(x_2|M)) + log(\mathbb{P}(x_3|M)) + ...\), which is much easier to handle.
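A quick numerical illustration of the underflow problem and of the log trick (the probabilities are made up, and 1,000 of them are used so that the product actually underflows in double precision):

```python
# Product of many small probabilities underflows; the sum of their logs does not.
import numpy as np

p = np.full(1000, 0.1)    # 1,000 made-up probabilities of 0.1 each
print(np.prod(p))         # 0.0 -- underflows, the true value is 1e-1000
print(np.sum(np.log(p)))  # about -2302.585, easy to represent
```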
Negative log-likelihood
Statistical packages' optimizers work by minimizing a function. Minimizing means driving the distance between the two distributions as low as possible, which is convenient because the target is a distance close to 0 (just like the trick we use in hypothesis testing).
Minimizing the negative log-likelihood (that is, \(-1 \times \text{log-likelihood}\)) is equivalent to maximizing the log-likelihood, which is exactly what we want (MLE: maximum likelihood estimation).
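A sketch of what the optimizer actually does: define the negative log-likelihood and hand it to a minimizer. The normal model and the simulated data below are assumptions for illustration only.

```python
# Maximum likelihood via minimizing the negative log-likelihood.
# Assumed model: observations drawn from Normal(mu, sigma); the data are simulated.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=1.5, size=200)

def neg_log_lik(params):
    mu, log_sigma = params           # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

fit = minimize(neg_log_lik, x0=[0.0, 0.0])
print(fit.x[0], np.exp(fit.x[1]))    # close to the true mu = 3.0 and sigma = 1.5
```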
Poisson distribution
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time[^1].
It is suitable here because the quantity being fitted is a count (the number of prevalent cases) occurring in this period.
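As a sketch, a Poisson fit with a single constant rate \(\lambda\) looks like the following; the simulated counts and the one-parameter setup are illustrative assumptions, not the calibration itself.

```python
# Poisson negative log-likelihood for count data with a single rate parameter lambda.
# The simulated counts are illustrative; in a calibration the rate would come from the model.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(3)
counts = rng.poisson(lam=4.0, size=100)

def neg_log_lik(lam):
    return -np.sum(poisson.logpmf(counts, mu=lam))

fit = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print(fit.x)  # MLE of lambda; for a Poisson this equals the sample mean of the counts
```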
When MLE is equal to LSE
Assume the data are generated from a normal distribution with mean \(\mu_i\) (the model prediction \(f(x_i)\) for observation \(i\)) and a common standard deviation \(\sigma\).
- Step 1. Write down the likelihood function \(L(\theta)\).
\[L(\theta) = \prod_{i = 1}^{n} \mathcal{N}(\mu_i, \sigma) = \prod_{i = 1}^{n} \frac{1}{\sigma\sqrt{2 \pi}} \text{exp} \left[ \frac{-1}{2 \sigma^2} (y_i - \mu_i)^2 \right]\]
\[L(\theta) = \frac{1}{(\sigma \sqrt{2 \pi})^n} \text{exp} \left[ \frac{-1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 \right]\]
Since \(\sigma\) and \(\sqrt{2 \pi}\) are constants, the leading factor \(\frac{1}{(\sigma \sqrt{2 \pi})^n}\) can be dropped:
\[L(\theta) \varpropto \text{exp} \left[ \frac{-1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 \right]\]
- Step 2. Take the natural log of the likelihood, \(log L(\theta)\). The positive constant \(\frac{1}{2 \sigma^2}\) does not change where the maximum lies, so it can be dropped too.
\[log L(\theta) \varpropto - \sum_{i = 1}^n (y_i - \mu_i)^2\]
- Step 3. Take the negative log-likelihood \(- log L(\theta)\).
\[- log L(\theta) \varpropto \sum_{i = 1}^n (y_i - \mu_i)^2\]
This is exactly the residual sum of squares in Equation 1.
- Step 4. The optimizer minimizes the negative log-likelihood \(- log L(\theta)\), which is the same thing LSE does when it finds the optimal parameters by minimizing the RSS, as the numerical sketch below illustrates.
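A numerical check of Steps 1 to 4, under the same assumptions (normal noise with a known \(\sigma\), a linear mean \(\mu_i = a + b x_i\)); the model and data are illustrative, and the two objectives should return the same parameter estimates up to optimizer tolerance.

```python
# Minimizing the Gaussian negative log-likelihood (sigma fixed) and minimizing
# the RSS should give the same parameter estimates.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 80)
y = 1.0 + 0.8 * x + rng.normal(scale=2.0, size=x.size)
sigma = 2.0                                    # treated as a known constant

def rss(params):
    a, b = params
    return np.sum((y - (a + b * x)) ** 2)

def neg_log_lik(params):
    a, b = params
    return -np.sum(norm.logpdf(y, loc=a + b * x, scale=sigma))

print(minimize(rss, x0=[0.0, 0.0]).x)          # LSE estimates of (a, b)
print(minimize(neg_log_lik, x0=[0.0, 0.0]).x)  # MLE estimates: the same values
```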