27  Generalised additive model

27.1 Model structure

\[\underbrace{g}_{\text{Link function}}(\underbrace{\mathbb{E}(Y)}_{\text{Expected value}}) = \underbrace{\beta_0}_{\text{Intercept}} + \underbrace{f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)}_{\text{Smooth functions}}\]
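As a concrete sketch of this structure (assuming R's mgcv package; the data and true functions below are invented purely for illustration):

```r
library(mgcv)

# Simulated example: g = log link, Y Poisson, predictor = beta_0 + f1(x1) + f2(x2).
set.seed(42)
n  <- 300
x1 <- runif(n)
x2 <- runif(n)
mu <- exp(0.5 + sin(2 * pi * x1) + 0.5 * x2^2)  # g^{-1} of the additive predictor
y  <- rpois(n, mu)

fit <- gam(y ~ s(x1) + s(x2), family = poisson(link = "log"))
summary(fit)          # "(Intercept)" is beta_0; each s() term is one smooth f_j
plot(fit, pages = 1)  # the estimated f_1(x1) and f_2(x2)
```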

Each smooth function \(f_j\) is represented as a linear combination of basis functions, introduced next.

27.2 Basis functions

Each smooth is a linear combination of \(k\) known basis functions \(b_j\), with coefficients \(\beta_j\) to be estimated:

\[f(x) = \sum_{j=1}^{k} b_j(x)\beta_j\]

For example, if we choose a cubic polynomial as the basis, we set \(k=4\) and define the individual basis functions as the powers of \(x\):

\(b_1(x) = 1\)

\(b_2(x) = x\)

\(b_3(x) = x^2\)

\(b_4(x) = x^3\)

\[f(x) = \beta_1 + x\beta_2 + x^2\beta_3 + x^3\beta_4\]

Assuming we have only one covariate \(x\), the GAM becomes:

\[g(\mathbb{E}(Y)) = \beta_0 + \beta_1 + x\beta_2 + x^2\beta_3 + x^3\beta_4\]

Note that \(\beta_0\) and \(\beta_1\) are both constants, so they cannot be estimated separately; in practice the constant basis function is dropped (or the smooth is centred) to keep the model identifiable.
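Since every basis function is a fixed transform of \(x\), this cubic-basis smooth is just a linear model in \((1, x, x^2, x^3)\). A minimal sketch with an identity link and simulated data:

```r
# The constant column plays the intercept role here (beta_0 absorbed into beta_1).
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

X   <- cbind(1, x, x^2, x^3)  # columns are b_1(x), ..., b_4(x)
fit <- lm(y ~ 0 + X)          # beta_1, ..., beta_4 by ordinary least squares
coef(fit)
```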

27.3 Smooth terms in GAM

  • Isotropic smoothers: rotating the covariate coordinate system does not change the result of smoothing.

    • B-splines (basis splines)
    • P-splines (penalised B-splines)
    • Thin plate regression splines (the default basis in the mgcv package)
  • Tensor product smooths: te() smooths carry one penalty per marginal basis, which suits covariates on different scales (section 27.6). A short comparison of the two follows.
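A minimal comparison of the two constructions in mgcv (simulated data; the bs argument selects the basis type):

```r
library(mgcv)

set.seed(2)
n  <- 400
x1 <- runif(n)
x2 <- runif(n)
y  <- sin(3 * x1) * cos(3 * x2) + rnorm(n, sd = 0.2)

iso <- gam(y ~ s(x1, x2, bs = "tp"))            # isotropic: a single lambda
ten <- gam(y ~ te(x1, x2, bs = c("ps", "ps")))  # tensor product: one per margin
length(iso$sp)  # 1 smoothing parameter
length(ten$sp)  # 2 smoothing parameters
```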

27.4 B-spline

Degree 0. Each basis function is simply the indicator of one knot interval (here the \(x_i\) denote knots):

\[ B_i^0(x) = \begin{cases} 1 & \text{when } x_i \le x < x_{i+1} \\ 0 & \text{otherwise} \end{cases} \]

Recursion — for degree \(m \ge 1\), two adjacent B-splines of degree \(m-1\) are blended together via two linear “ramps”:

\[ B_i^m(x) \;=\; \underbrace{\frac{x - x_i}{x_{i+m} - x_i}}_{\alpha(x)} \, B_i^{m-1}(x) \;+\; \underbrace{\frac{x_{i+m+1} - x}{x_{i+m+1} - x_{i+1}}}_{\beta(x)} \, B_{i+1}^{m-1}(x) \]

Key properties. These consequences of the construction are what make B-splines such a widely used GAM basis:

  • Local support. \(B_i^m(x)\) is non-zero only on \([x_i,\, x_{i+m+1}]\) — the \(m+1\) adjacent knot intervals (spanned by \(m+2\) knots). Moving a single coefficient \(\beta_j\) only perturbs \(f\) in a small neighbourhood.
  • Non-negative and partition of unity. \(B_i^m(x) \ge 0\) and \(\sum_i B_i^m(x) = 1\) inside the domain, so \(f(x)\) behaves like a local weighted average of the \(\beta_j\).
  • Smoothness. At simple knots, a degree-\(m\) B-spline is \(C^{m-1}\). The usual default is cubic (\(m=3\)) — \(C^2\) curves that look smooth to the eye.
  • Count. Given \(K\) knots (boundary and interior), there are \(K - m - 1\) basis functions of degree \(m\).

The GAM smooth is then \[f(x) = \sum_{j=1}^{K-m-1} \beta_j\, B_j^m(x),\] and the \(\beta_j\) are estimated by penalised least squares — a roughness penalty (e.g. on \(\int f''^2\)) keeps the fit from over-wiggling as the number of basis functions grows.
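The recursion is implemented in base R's splines package; a short sketch that evaluates a cubic basis and checks the count and partition-of-unity properties:

```r
library(splines)

m     <- 3             # degree (cubic)
knots <- (-3:13) / 10  # full knot vector: K = 17 knots, inner range [0, 1]
x     <- seq(0, 1, length.out = 200)

# Design matrix: one column per basis function, K - m - 1 = 13 columns.
B <- splineDesign(knots, x, ord = m + 1)
dim(B)             # 200 x 13
range(rowSums(B))  # partition of unity: every row sums to 1
```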

27.5 Thin plate spline

B-splines require the user to place knots. Thin plate splines (TPS) instead build the basis straight from the data — the fit “bends like a thin metal sheet” pinned at the observations, with the tension controlled by a single smoothing parameter.

Penalised criterion. For data \((\mathbf{x}_i, y_i)\) with \(\mathbf{x}_i \in \mathbb{R}^d\), TPS minimises \[\sum_{i=1}^{n} \{y_i - f(\mathbf{x}_i)\}^2 \;+\; \lambda \, J(f),\]

where \(J(f)\) is the bending energy — the integrated squared second derivatives. In 2D, \[J(f) = \iint \left( f_{x_1x_1}^2 + 2\,f_{x_1x_2}^2 + f_{x_2x_2}^2 \right)\, dx_1\, dx_2.\]

Closed-form minimiser. The solution is a sum of radial basis functions centred on the data points plus a low-order polynomial: \[f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, \eta\!\left(\|\mathbf{x} - \mathbf{x}_i\|\right) + \beta_0 + \sum_{k=1}^{d} \beta_k\, x_k,\]

with the thin-plate radial basis \[\eta(r) = \begin{cases} |r|^3 & d = 1 \\[2pt] r^2 \log r & d = 2 \end{cases} \qquad (\eta(0) \equiv 0).\]

The coefficients solve a single linear system; the constraints \(\sum_i \alpha_i = 0\) and \(\sum_i \alpha_i x_{ki} = 0\) (one per polynomial term) keep the radial part orthogonal to the polynomial part, so the fit collapses to that polynomial as \(\lambda \to \infty\).

Key properties.

  • No knot placement. Basis centres are the data locations — no user choice needed.
  • \(\lambda\) tunes smoothness. \(\lambda \to 0\) interpolates every point; \(\lambda \to \infty\) collapses the fit to the polynomial null space (plain OLS on the linear part).
  • Isotropic. Bending is the same in every direction — natural for spatial (\(d=2\)) smoothing.
  • Works in any dimension. Change \(\eta\) and the polynomial part to match \(d\); the rest carries over.

Full TPS costs \(O(n^3)\); in practice, mgcv uses a thin plate regression spline (bs = "tp"), a truncated-eigendecomposition basis that keeps these properties while scaling to large \(n\).

Illustration (interactive in the original notes). The 1D case with \(\eta(r) = |r|^3/12\): as \(\lambda\) increases, the fit morphs from interpolation (small \(\lambda\)) to a straight line (large \(\lambda\)).
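A static stand-in for that illustration, assuming mgcv (its sp argument fixes \(\lambda\) by hand instead of estimating it):

```r
library(mgcv)

set.seed(1)
x  <- runif(200)
y  <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
xs <- data.frame(x = seq(0, 1, length.out = 200))

fit_low  <- gam(y ~ s(x, bs = "tp", k = 20), sp = 1e-6)  # near interpolation
fit_high <- gam(y ~ s(x, bs = "tp", k = 20), sp = 1e3)   # near a straight line
fit_auto <- gam(y ~ s(x, bs = "tp", k = 20))             # lambda chosen by GCV

plot(x, y, col = "grey")
lines(xs$x, predict(fit_low, xs), lty = 3)
lines(xs$x, predict(fit_high, xs), lty = 2)
lines(xs$x, predict(fit_auto, xs))
```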

27.6 Tensor product smooth

When a GAM smooth involves two or more covariates at once, there are two standard constructions:

  • Isotropic smooth (s(x1, x2)). A single distance metric, one smoothness — TPS is the canonical example. Natural when the covariates share units (spatial coordinates, image pixels).
  • Tensor product smooth (te(x1, x2)). Built from per-direction 1D bases and — crucially — equipped with one penalty per marginal basis. Natural when covariates live on different scales (time vs temperature, age vs income).

Marginal basis — the 1D building blocks. Each input direction gets its own 1D basis, chosen independently:

  • Direction 1: \(\{b_j(x_1)\}_{j=1}^{k_1}\) — say, a cubic regression spline with \(k_1\) knots.
  • Direction 2: \(\{c_\ell(x_2)\}_{\ell=1}^{k_2}\) — possibly a different type of basis.

These “marginal” bases don’t depend on each other — each only describes shape along its own direction.

Tensor product basis — outer product of the marginals. The 2D basis takes all pairwise products of the marginal functions: \[B_{j\ell}(x_1, x_2) \;=\; b_j(x_1) \cdot c_\ell(x_2), \qquad j = 1, \dots, k_1, \;\; \ell = 1, \dots, k_2.\]

That’s \(k_1 k_2\) bumps tiling the 2D input rectangle, and the smooth is \[f(x_1, x_2) \;=\; \sum_{j=1}^{k_1} \sum_{\ell=1}^{k_2} \beta_{j\ell}\, b_j(x_1)\, c_\ell(x_2).\]
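The construction is easy to write down directly; a tiny sketch with hypothetical toy marginals, using kronecker() on each row to form all \(k_1 k_2\) pairwise products:

```r
# Hypothetical marginal bases: k1 = 3 polynomial terms, k2 = 3 Fourier terms.
b_marg1 <- function(x) cbind(1, x, x^2)
b_marg2 <- function(x) cbind(1, sin(2 * pi * x), cos(2 * pi * x))

set.seed(4)
x1 <- runif(10)
x2 <- runif(10)
B1 <- b_marg1(x1)  # 10 x k1
B2 <- b_marg2(x2)  # 10 x k2

# Row i of X holds every product b_j(x1[i]) * c_l(x2[i]):
X <- t(sapply(seq_len(nrow(B1)), function(i) kronecker(B1[i, ], B2[i, ])))
dim(X)  # 10 x 9 = n x (k1 * k2)
```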

One penalty per marginal — what makes te distinctive. Each marginal carries its own 1D roughness penalty matrix \(S_k\) (e.g. on the second derivative). In the 2D basis these are lifted using Kronecker products: \[S_1^{(te)} \;=\; S_1 \otimes I_{k_2}, \qquad S_2^{(te)} \;=\; I_{k_1} \otimes S_2,\]

giving the penalised least-squares criterion \[\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \;+\; \lambda_1\, \boldsymbol{\beta}^{\top} S_1^{(te)} \boldsymbol{\beta} \;+\; \lambda_2\, \boldsymbol{\beta}^{\top} S_2^{(te)} \boldsymbol{\beta}.\]

Here \(\lambda_1\) penalises wiggliness in the \(x_1\) direction only, \(\lambda_2\) does the same for \(x_2\), and the two are estimated as separate parameters.
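The lifting can be written with R's Kronecker operator %x%; a sketch with hypothetical second-difference penalties (the kind P-splines use):

```r
k1 <- 5
k2 <- 4

# 1D roughness penalties: sums of squared second differences of coefficients.
D1 <- diff(diag(k1), differences = 2); S1 <- crossprod(D1)  # k1 x k1
D2 <- diff(diag(k2), differences = 2); S2 <- crossprod(D2)  # k2 x k2

S1_te <- S1 %x% diag(k2)  # penalises wiggliness along x1 only
S2_te <- diag(k1) %x% S2  # penalises wiggliness along x2 only
dim(S1_te)  # 20 x 20, matching the k1 * k2 tensor coefficients
```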

Why tensor products?

  • Different units / scales. If \(x_1\) is in seconds and \(x_2\) in °C, an isotropic smoother must pick some distance in \((x_1, x_2)\)-space — an arbitrary choice. te sidesteps this: each direction has its own \(\lambda_k\).
  • Anisotropic smoothness. A fit can be very smooth in one direction and bumpy in the other when the data support it — impossible with a single smoothness parameter.
  • Scale invariance. Rescaling \(x_1 \to a x_1\) leaves the estimated surface unchanged; only \(\lambda_1\) absorbs the change.
  • Modular marginals. The two marginals can be different bases — e.g. a cyclic spline for time-of-day with a cubic regression spline for temperature (see the sketch after this list).
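A sketch of that mixed-marginal setup in mgcv (simulated data; bs = "cc" is a cyclic cubic spline, bs = "cr" a cubic regression spline):

```r
library(mgcv)

set.seed(3)
n    <- 500
hour <- runif(n, 0, 24)   # time of day: cyclic
temp <- runif(n, -5, 30)  # temperature: a completely different scale
y    <- sin(2 * pi * hour / 24) * temp / 10 + rnorm(n, sd = 0.2)

# Cyclic marginal for hour (endpoints matched at 0 = 24 h), cubic for temp:
fit <- gam(y ~ te(hour, temp, bs = c("cc", "cr")),
           knots = list(hour = c(0, 24)))
fit$sp  # two smoothing parameters, one per direction
```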

Illustration (interactive in the original notes). Each tensor product basis function \(B_{j\ell}(x_1, x_2)\) is literally the outer product of a marginal-1 curve and a marginal-2 curve: for a chosen \(j\) and \(\ell\), the figure highlights the selected curve in each marginal panel and shows the corresponding tensor basis function as a heatmap.