Model selection

Bias and variance

Suppose we want to estimate the true parameter \(\theta\) of a distribution. We collect some samples and use them to construct an estimator \(\hat{\theta}\) (the hat denotes an estimated quantity). How do we know whether \(\hat{\theta}\) is any good?

The mean squared error (MSE) answers this:

\[\text{MSE}(\hat{\theta}) = \mathbf{E}\big[(\hat{\theta} - \theta)^2\big]\]

The MSE is the expected squared distance from \(\hat{\theta}\) to the truth. It penalises any deviation, regardless of source. Lower MSE means a more accurate estimator overall.
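To make the definition concrete, the expectation can be approximated by simulation: repeat the experiment many times, compute \(\hat{\theta}\) each time, and average the squared errors. A minimal sketch in Python (the population, sample size, and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, n_repeats = 0.0, np.sqrt(2), 20, 100_000

# One experiment per row: draw n points, estimate theta with the sample mean.
samples = rng.normal(theta, sigma, size=(n_repeats, n))
theta_hat = samples.mean(axis=1)

# Monte Carlo approximation of E[(theta_hat - theta)^2].
mse = np.mean((theta_hat - theta) ** 2)
print(mse)  # close to sigma^2 / n = 0.1 for the sample mean
```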

A single number is convenient, but it hides the question of why an estimator misses. Two estimators can have the same MSE for very different reasons: one might be systematically off-target, another might be on-target on average but jump around from sample to sample. Decomposing the MSE makes these two failure modes visible.

Decomposing the MSE

Add and subtract \(\mathbf{E}[\hat{\theta}]\) inside the square:

\[ \begin{aligned} \text{MSE}(\hat{\theta}) &= \mathbf{E}\big[(\hat{\theta} - \mathbf{E}[\hat{\theta}] + \mathbf{E}[\hat{\theta}] - \theta)^2\big] \\[6pt] &= \mathbf{E}\big[(\hat{\theta} - \mathbf{E}[\hat{\theta}])^2\big] + 2\,(\mathbf{E}[\hat{\theta}] - \theta)\,\underbrace{\mathbf{E}\big[\hat{\theta} - \mathbf{E}[\hat{\theta}]\big]}_{\substack{=\, \mathbf{E}[\hat{\theta}] - \mathbf{E}[\mathbf{E}[\hat{\theta}]] \\ =\, \mathbf{E}[\hat{\theta}] - \mathbf{E}[\hat{\theta}] =\, 0}} + (\mathbf{E}[\hat{\theta}] - \theta)^2 \\[12pt] &= \underbrace{\mathbf{E}\big[(\hat{\theta} - \mathbf{E}[\hat{\theta}])^2\big]}_{\text{variance}} + \underbrace{(\mathbf{E}[\hat{\theta}] - \theta)^2}_{\text{bias}^2} \end{aligned} \tag{1}\]

The MSE splits into two non-negative pieces. The first measures how much \(\hat{\theta}\) scatters around its own mean. The second measures how far that mean sits from the truth.
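Equation 1 is easy to verify numerically. The sketch below uses a deliberately biased estimator, a hypothetical shrinkage rule \(\hat{\theta} = 0.8\,\overline{x} + 0.1\) chosen only so the bias term is visibly non-zero, and checks that variance plus squared bias reproduces the MSE:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, n_repeats = 1.0, 1.0, 10, 200_000

samples = rng.normal(theta, sigma, size=(n_repeats, n))
# Illustrative biased estimator: shrink the sample mean and add an offset.
theta_hat = 0.8 * samples.mean(axis=1) + 0.1

mse      = np.mean((theta_hat - theta) ** 2)
variance = np.var(theta_hat)                  # E[(theta_hat - E[theta_hat])^2]
bias_sq  = (np.mean(theta_hat) - theta) ** 2  # (E[theta_hat] - theta)^2

# The two sides of Equation 1 agree up to Monte Carlo error.
print(mse, variance + bias_sq)
```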

The plot below walks through this.

  • The truth is \(\theta = 0\) (red dashed line) and the population is \(\mathcal{N}(\theta, 2)\).

  • Each click of Draw a new dataset is one experiment: \(n\) points are drawn from the population and the estimator \(\hat{\theta} = \overline{x}\) (the sample mean) is computed. Each dataset gets a unique colour, so you can read off how much they vary from one experiment to the next.

  • The second panel zooms into the slice of the first panel’s axis around \(\theta\): you can still see each \(\hat{\theta}\) and the distance whose square gets averaged into the MSE.

  • The third panel uses the same axis but switches the deviation arms to run from \(\mathrm{mean}(\hat{\theta})\); together these make up the variance. A red bracket at the top shows the gap between \(\theta\) and \(\mathrm{mean}(\hat{\theta})\), which squared is the bias².

The MSE compares each \(\hat{\theta}\) with the true \(\theta\). When we decompose it, it splits into two steps via the mediator \(\mathbf{E}(\hat{\theta})\):

  • Variance: compare each \(\hat{\theta}\) with the mediator \(\mathbf{E}(\hat{\theta})\)
  • Bias: then compare this \(\mathbf{E}(\hat{\theta})\) with the true \(\theta\)

Variance

Variance is the spread of \(\hat{\theta}\) around its own mean:

\[\text{Var}(\hat{\theta}) = \mathbf{E}\big[(\hat{\theta} - \mathbf{E}[\hat{\theta}])^2\big] \tag{2}\]

A high-variance estimator jumps around from sample to sample. It might be right on average, but in practice you only see one sample, so the single estimate you get can land far from the truth.
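A quick way to see this spread in numbers: the sample mean of \(n\) draws and a single draw are both centred on \(\theta\), but their variances differ by a factor of \(n\). A sketch under the same kind of normal setup as above:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n, n_repeats = 0.0, 1.0, 25, 100_000

samples = rng.normal(theta, sigma, size=(n_repeats, n))

mean_estimator   = samples.mean(axis=1)   # theta_hat = sample mean of n points
single_estimator = samples[:, 0]          # theta_hat = the first observation only

# Both scatter around theta, but the single observation scatters far more.
print(np.var(mean_estimator))    # about sigma^2 / n = 0.04
print(np.var(single_estimator))  # about sigma^2     = 1.0
```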

Bias

Bias is how far the estimator lands from the truth on average:

\[\text{Bias}(\hat{\theta}, \theta) = \mathbf{E}[\hat{\theta}] - \theta \tag{3}\]

An estimator with zero bias is unbiased: across repeated samples, it lands on \(\theta\) on average.
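A classic biased estimator is the plug-in variance, which divides by \(n\) instead of \(n - 1\): on average it underestimates \(\sigma^2\) by a factor of \((n-1)/n\). A quick check, assuming a normal population:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, n_repeats = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_repeats, n))

plug_in  = samples.var(axis=1, ddof=0)   # divides by n     (biased)
unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1 (unbiased)

print(plug_in.mean() - sigma2)   # about -sigma2 / n = -0.8  (systematic undershoot)
print(unbiased.mean() - sigma2)  # about 0
```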

But unbiased is not enough. An estimator can be unbiased on average and still useless if its variance is high.

With both terms defined, we can see how they play out together. The four scenarios below cover every combination. Click one to switch. Use the Side / Above toggle to swap viewpoints: the side view is the axis we’ve been using, and the above view shows each estimate as an arrow on a target, with the bullseye at the true \(\theta\).
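For a rough numerical analogue of those four scenarios (an illustrative construction, not the widget’s actual code), one can dial bias and variance separately by shrinking the sample mean and by changing how many points are averaged:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n_repeats = 1.0, 1.0, 100_000

def bias_and_variance(shrink, n):
    # theta_hat = shrink * mean(x_1..x_n): shrink != 1 adds bias, small n adds variance.
    xbar = rng.normal(theta, sigma, size=(n_repeats, n)).mean(axis=1)
    est = shrink * xbar
    return (np.mean(est) - theta) ** 2, np.var(est)

for label, shrink, n in [("low bias,  low variance ", 1.0, 100),
                         ("low bias,  high variance", 1.0, 2),
                         ("high bias, low variance ", 0.5, 100),
                         ("high bias, high variance", 0.5, 2)]:
    bias_sq, var = bias_and_variance(shrink, n)
    print(f"{label}: bias^2 = {bias_sq:.3f}, variance = {var:.3f}")
```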

Prediction

Now we predict a random outcome \(Y\) using a variable \(x\).

\[Y = f(x) + \varepsilon, \quad \mathbf{E}[\varepsilon] = 0, \quad \text{Var}(\varepsilon) = \sigma^2\]

Your predictor \(\hat{f}(x)\) is trained on random data, and the noise \(\varepsilon\) in a new observation is independent of that training data (and therefore of \(\hat{f}\)). The prediction MSE is:

\[\begin{aligned} \mathbf{E}[(\underbrace{Y}_{f(x) + \varepsilon} - \hat{f}(x))^2] &= \mathbf{E}\left[\left((f(x) - \hat{f}(x)) + \varepsilon\right)^2\right] \\ &= \mathbf{E}\left[(f(x) - \hat{f}(x))^2\right] + \underbrace{2 \cdot \mathbf{E}[(f(x) - \hat{f}(x))\varepsilon]}_{= 2 \cdot \mathbf{E}[f(x) - \hat{f}(x)] \cdot \underbrace{\mathbf{E}[\varepsilon]}_{= 0}} + \mathbf{E}[\varepsilon^2] \\ &= \mathbf{E}\left[(f(x) - \hat{f}(x))^2\right] + \mathbf{E}[\varepsilon^2] \end{aligned}\]

The first term \(\mathbf{E}[(f(x) - \hat{f}(x))^2]\) plays the same role as \(\mathbf{E}\big[(\hat{\theta} - \theta)^2\big]\) in Equation 1, so the same decomposition applies:

\[\mathbf{E}\left[(f(x) - \hat{f}(x))^2\right] = \underbrace{\mathbf{E}\left[\left(\hat{f}(x) - \mathbf{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\left(\mathbf{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}\]

The second term is \(\mathbf{E}[\varepsilon^2]\). By definition, the error is centred at zero, so let \(\mu = \mathbf{E}[\varepsilon] = 0\), and remember that \(\text{Var}(\varepsilon) = \sigma^2\).

\[\begin{aligned}\text{Var}(\varepsilon) &= \mathbf{E}\left[(\varepsilon - \mu)^2\right] \\ &= \mathbf{E}[\varepsilon^2 - 2\varepsilon\mu + \mu^2] \\ &= \mathbf{E}[\varepsilon^2] - \mathbf{E}[2\varepsilon\mu] + \mathbf{E}[\mu^2] \\ &= \mathbf{E}[\varepsilon^2] - 2\mu\mathbf{E}[\varepsilon] + \mu^2 \\ &= \mathbf{E}[\varepsilon^2] - 2\mu(\mu) + \mu^2 \\ &= \mathbf{E}[\varepsilon^2] - \underbrace{\mu^2}_{= 0} \\ &= \mathbf{E}[\varepsilon^2]\end{aligned}\]

Now plug both terms back into the MSE:

\[\text{MSE} = \underbrace{\mathbf{E}\left[\left(\hat{f}(x) - \mathbf{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\vphantom{\mathbf{E}\left[\left(\hat{f}(x) - \mathbf{E}[\hat{f}(x)]\right)^2\right]}\left(\mathbf{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\vphantom{\mathbf{E}\left[\left(\hat{f}(x) - \mathbf{E}[\hat{f}(x)]\right)^2\right]}\text{Var}(\varepsilon)}_{\text{irreducible error}}\]
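All three terms can be estimated by simulation. The sketch below is illustrative only: it assumes \(f(x) = \sin(x)\), noise with \(\sigma = 0.5\), a fixed query point, and a straight-line fit (so the bias is clearly non-zero), then checks that the three pieces add up to the prediction MSE:

```python
import numpy as np

rng = np.random.default_rng(5)
f = np.sin                          # assumed true function
sigma, n_train, n_repeats = 0.5, 30, 50_000
x_test = 2.0                        # fixed query point

preds = np.empty(n_repeats)
for i in range(n_repeats):
    # Fresh training set -> fresh fitted line -> fresh prediction at x_test.
    x = rng.uniform(0, np.pi, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    preds[i] = np.polyval(np.polyfit(x, y, deg=1), x_test)

variance = np.var(preds)
bias_sq  = (np.mean(preds) - f(x_test)) ** 2
y_test   = f(x_test) + rng.normal(0, sigma, n_repeats)   # fresh noisy outcomes

print(np.mean((y_test - preds) ** 2))   # prediction MSE, estimated directly
print(variance + bias_sq + sigma**2)    # the three-term decomposition
```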

Tradeoff

So far the picture has been static — one estimator, one decomposition. In practice you choose the complexity of your model, and that choice trades bias against variance directly. To make this concrete, take the canonical example: polynomial regression. Given training data \(\{(x_i, y_i)\}\), fit

\[\hat f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d\]

by ordinary least squares. The single knob is the degree \(d\), which is also the number of fitted parameters minus one:

  • Small \(d\) → the polynomial is too rigid to follow \(f(x)\). Across different training sets the predictions stay close to each other (low variance), but they systematically miss the truth (high bias).
  • Large \(d\) → the polynomial bends to chase every wiggle in the noise. Averaged over training sets the predictions track the truth (low bias), but any single fit swings wildly (high variance).

The plot on top shows what happens at the current \(d\): the dashed red curve is the truth \(f\), the gray dots are one training sample, the solid teal line is \(\mathbf{E}[\hat f(x)]\) averaged over many training sets, the teal band is \(\pm\) one standard deviation around it (variance), and the red band between the truth and the mean prediction is the bias. The plot below traces those two quantities — averaged over \(x\) — as \(d\) varies. The gray dashed curve is their sum plus the irreducible \(\sigma^2\), so its minimum is the best \(d\) for this problem.

The U-curve is the bias-variance tradeoff in one picture. There’s no \(d\) that makes both terms small at once — you slide along the trade. The minimum of the gray dashed Total curve is where the two errors balance, and is the best \(d\) for this particular function and noise level. Pick a different \(f(x)\), or change \(\sigma\), and the optimum moves. Other complexity knobs (number of spline knots, hidden units in a small neural net, depth of a decision tree) tell exactly the same story — only the parameter on the x-axis changes.
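A sketch of how such a U-curve can be produced numerically (the true function, noise level, and number of training sets below are assumptions for illustration, not the plot’s exact parameters):

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(2 * np.pi * x)       # assumed true function
sigma, n_train, n_sets = 0.3, 30, 200
x_grid = np.linspace(0, 1, 50)            # where bias and variance are averaged

for d in range(1, 10):                    # complexity knob: polynomial degree
    preds = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, deg=d), x_grid)

    variance = preds.var(axis=0).mean()                       # averaged over x
    bias_sq  = ((preds.mean(axis=0) - f(x_grid)) ** 2).mean()
    total    = bias_sq + variance + sigma**2
    print(f"d={d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  total={total:.4f}")
```

Bias² falls and variance rises as \(d\) grows; the printed totals trace the same U-shape as the gray dashed Total curve.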