
How Do We Estimate f? Turning Data into a Working Rule

4 min read

Mission recap: We want $\widehat f$, an estimate of the hidden rule $f$ that links predictors $\mathbf X$ to outcome $Y$. Today we’ll see how we actually build $\widehat f$ from data.


1 The four-step recipe

  1. Pick a loss function to quantify wrongness.
  2. Optimise that loss on training data.
  3. Validate on unseen data to spot over-fitting.
  4. Tune & repeat until performance stabilises.

That’s it. Everything from linear regression to GPT follows this loop.


2 Step 1: Choosing a loss function

Loss = how bad a guess is, measured in one number. Lower is better.

| Problem type | Common loss | Intuition (plain English) |
|---|---|---|
| Regression | Mean Squared Error (MSE) | Penalises big errors extra hard |
| Classification | Cross-Entropy / Log-Loss | Rewards correct probabilities, punishes over-confidence |
| Robust tasks | Mean Absolute Error (MAE) | Treats all errors linearly; good with outliers |

Analogy: Think of loss as the distance between a dart and the bullseye. The further away, the higher the penalty.
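To make the table concrete, here is a minimal sketch of the three losses written out by hand (in practice you would reach for `sklearn.metrics`); the toy arrays are invented purely for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: squares each residual, so large errors dominate."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: every unit of error costs the same."""
    return np.mean(np.abs(y_true - y_pred))

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy: rewards confident correct probabilities,
    punishes confident wrong ones."""
    p = np.clip(p_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y, y_hat = np.array([3.0, 5.0, 7.0]), np.array([2.5, 5.0, 9.0])
print(mse(y, y_hat), mae(y, y_hat))            # ≈ 1.417, 0.833

labels, probs = np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])
print(log_loss(labels, probs))                 # ≈ 0.28
```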


3 Step 2: Optimisation—the search for lowest loss

Most models frame training as

$$\min_{\theta} \; \mathcal L(\theta; \text{data}),$$

where $\theta$ are the model’s parameters.

Gradient Descent (GD) in one breath

  1. Start with random $\theta$.
  2. Compute the gradient—direction of steepest uphill loss.
  3. Step downhill by a learning rate $\eta$.
  4. Repeat until steps become tiny.

Variants (stochastic GD, Adam, L-BFGS) juggle step size against noise, but the spirit is the same: follow the slope down into the valley.
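A toy sketch of that loop for least-squares linear regression, on synthetic data with an arbitrary learning rate of 0.1, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with intercept
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = rng.normal(size=3)            # 1. start with random parameters
eta = 0.1                             # learning rate
for step in range(1000):
    grad = 2 / len(y) * X.T @ (X @ theta - y)        # 2. gradient of the MSE
    new_theta = theta - eta * grad                   # 3. step downhill
    if np.linalg.norm(new_theta - theta) < 1e-8:     # 4. stop when steps are tiny
        break
    theta = new_theta

print(theta)   # close to [1, 2, -3]
```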

Closed-form solutions

Some simple models (ordinary least squares) can solve for $\theta$ in one matrix formula, no looping required.
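A sketch of that closed-form route, again on synthetic data; `np.linalg.lstsq` does the same job with better numerical behaviour.

```python
import numpy as np

# Closed-form OLS: solve the normal equations XᵀX θ = Xᵀy in one shot.
rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.1, size=50)

theta = np.linalg.solve(X.T @ X, X.T @ y)   # no gradient-descent loop needed
print(theta)                                # ≈ [0.5, 1.5, -2.0]
```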


4 Step 3: Train-Validation-Test splits

Why split? To know if we’re learning patterns or memorising noise.

| Split | Seen by the model? | Purpose |
|---|---|---|
| Training set | Yes | Fit $\theta$ |
| Validation set | During tuning | Pick hyper-parameters, early stopping |
| Test set | Only once | Report final unbiased performance |

Rule-of-thumb sizes: 60%-20%-20% or 70%-15%-15%, but adjust for data volume.
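One common way to get roughly 70-15-15 with scikit-learn is to split twice; the synthetic `X` and `y` below are placeholders for your own data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=1000)

# Peel off the test set first, then split the remainder into train and validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=42)  # 15% of the original

print(len(X_train), len(X_val), len(X_test))   # ≈ 700, 150, 150
```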


5 Cross-Validation: squeezing more insight from small data

K-fold CV: Split data into k equal folds. Train on k-1 folds, validate on the leftover fold. Repeat k times and average the metric.

Benefits: every observation gets used for validation exactly once, and averaging over the $k$ folds gives a less noisy performance estimate than a single split.

Common choices: 5-fold, 10-fold. For time series, use rolling-window CV to respect chronology.
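A minimal 5-fold CV sketch with scikit-learn on synthetic regression data; for time series you would swap `KFold` for `TimeSeriesSplit`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# 5-fold CV: average validation MSE across the folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("mean validation MSE:", -scores.mean())

# Time-series variant: no shuffling, each fold validates only on later samples.
ts_cv = TimeSeriesSplit(n_splits=5)
```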


6 Regularisation: taming variance without starving the model

Add a penalty term to the loss.

| Technique | Added term | Effect |
|---|---|---|
| L2 (Ridge) | $\lambda\sum\theta^2$ | Shrinks parameters smoothly |
| L1 (Lasso) | $\lambda\sum\lvert\theta\rvert$ | Drives some parameters exactly to zero (feature selection) |

Choosing $\lambda$ (the penalty weight) is a hyper-parameter task—tune it on the validation set or via cross-validation.
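In scikit-learn the penalty weight is called `alpha` rather than $\lambda$; a sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, RidgeCV

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)     # L2: shrinks all coefficients smoothly
lasso = Lasso(alpha=1.0).fit(X, y)     # L1: drives some coefficients exactly to zero
print(sum(abs(c) < 1e-8 for c in lasso.coef_), "coefficients zeroed by the Lasso")

# Tuning the penalty weight by cross-validation:
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
print("best alpha:", ridge_cv.alpha_)
```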


7 Early stopping & checkpoints

Plot validation loss each epoch. If it starts rising while training loss keeps falling, you’re over-fitting. Stop training, keep the weights from the best epoch.

Illustration of diverging train vs validation loss.
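A framework-agnostic sketch of that logic; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for your own training and evaluation code.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5):
    """Train until validation loss stops improving; return the best checkpoint."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val = validation_loss(model)
        if val < best_loss:                     # new best epoch: checkpoint it
            best_loss, best_state = val, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # validation loss keeps rising
                break
    return best_state, best_loss               # weights from the best epoch
```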


8 Putting it all together—mini walkthrough

Task: Predict CO₂ emissions of cars from engine size and weight (regression).

  1. Loss: pick MSE.
  2. Model: start simple—linear regression.
  3. Split: 70-15-15.
  4. Train: closed-form solution (no GD needed).
  5. Validate: check $R^2$ on the validation set. Add a polynomial term if it under-fits.
  6. Regularise: Ridge to avoid wild coefficients.
  7. Test: final MSE = 18 (g/km)² → interpret as ±4.2 g/km typical error.

Loop done; model shipped.
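Since the original dataset isn’t given, here is a sketch of the whole loop on invented engine-size/weight data; the numbers it prints will differ from those above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic cars: engine size (L), weight (t), CO₂ (g/km) with a mild nonlinearity.
rng = np.random.default_rng(0)
engine, weight = rng.uniform(1.0, 5.0, 500), rng.uniform(0.9, 2.5, 500)
co2 = 60 * engine + 40 * weight + 5 * engine**2 + rng.normal(0, 5, 500)
X, y = np.c_[engine, weight], co2

# 70-15-15 split.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=0)

# Linear model with a polynomial term, ridge-regularised; fit has a closed form.
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print("validation R²:", model.score(X_val, y_val))

test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {test_mse:.1f} (g/km)², typical error ≈ ±{test_mse**0.5:.1f} g/km")
```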


9 Common gotchas & guardrails


10 Where next?

| Upcoming topic | Teaser |
|---|---|
| Flexibility vs Interpretability | How trees, splines, and ensembles trade clarity for power. |
| Bias-Variance in practice | Hands-on demo with code to visualise the sweet spot. |

Key takeaways

  - Pick a loss that matches the problem, then optimise it on training data only.
  - Held-out validation (or cross-validation) tells you whether you learned patterns or memorised noise.
  - Regularisation and early stopping keep variance in check; report final performance on the test set exactly once.

Next stop: making sense of model flexibility and how to keep explanations human-friendly even as algorithms grow complex.
