
How Do We Estimate f? Turning Data into a Working Rule

4 min read

Mission recap: We want $\widehat f$, an estimate of the hidden rule $f$ that links predictors $\mathbf X$ to outcome $Y$. Today we’ll see how we actually build $\widehat f$ from data.


1 The four-step recipe

  1. Pick a loss function to quantify wrongness.
  2. Optimise that loss on training data.
  3. Validate on unseen data to spot over-fitting.
  4. Tune & repeat until performance stabilises.

That’s it. Everything from linear regression to GPT follows this loop.


2 Step 1: Choosing a loss function

Loss = how bad a guess is, measured in one number. Lower is better.

| Problem type | Common loss | Intuition (plain English) |
|---|---|---|
| Regression | Mean Squared Error (MSE) | Penalises big errors extra hard |
| Classification | Cross-Entropy / Log-Loss | Rewards correct probabilities, punishes over-confidence |
| Robust tasks | Mean Absolute Error (MAE) | Treats all errors linearly; good with outliers |

Analogy: Think of loss as the distance between a dart and the bullseye. The further away, the higher the penalty.
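To make the table concrete, here is a minimal sketch of the three losses written out by hand (in practice you would reach for `sklearn.metrics`); the toy arrays are invented purely for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: squares each residual, so large errors dominate."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: every unit of error costs the same."""
    return np.mean(np.abs(y_true - y_pred))

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy: rewards confident correct probabilities,
    punishes confident wrong ones."""
    p = np.clip(p_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y, y_hat = np.array([3.0, 5.0, 7.0]), np.array([2.5, 5.0, 9.0])
print(mse(y, y_hat), mae(y, y_hat))            # ≈ 1.417, 0.833

labels, probs = np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])
print(log_loss(labels, probs))                 # ≈ 0.28
```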


3 Step 2: Optimisation—the search for lowest loss

Most models frame training as

$$\min_{\theta} \; \mathcal L(\theta; \text{data}),$$

where $\theta$ are the model’s parameters.

Gradient Descent (GD) in one breath

  1. Start with random $\theta$.
  2. Compute the gradient—direction of steepest uphill loss.
  3. Step downhill by a learning rate $\eta$.
  4. Repeat until steps become tiny.

Variants (stochastic GD, Adam, L-BFGS) juggle step size against noise, but the spirit is the same: follow the slope down into the valley.
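A toy sketch of that loop for least-squares linear regression, on synthetic data with an arbitrary learning rate of 0.1, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with intercept
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = rng.normal(size=3)            # 1. start with random parameters
eta = 0.1                             # learning rate
for step in range(1000):
    grad = 2 / len(y) * X.T @ (X @ theta - y)        # 2. gradient of the MSE
    new_theta = theta - eta * grad                   # 3. step downhill
    if np.linalg.norm(new_theta - theta) < 1e-8:     # 4. stop when steps are tiny
        break
    theta = new_theta

print(theta)   # close to [1, 2, -3]
```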

Closed-form solutions

Some simple models (ordinary least squares) can solve for $\theta$ in one matrix formula, no looping required.
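A sketch of that closed-form route, again on synthetic data; `np.linalg.lstsq` does the same job with better numerical behaviour.

```python
import numpy as np

# Closed-form OLS: solve the normal equations XᵀX θ = Xᵀy in one shot.
rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.1, size=50)

theta = np.linalg.solve(X.T @ X, X.T @ y)   # no gradient-descent loop needed
print(theta)                                # ≈ [0.5, 1.5, -2.0]
```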


4 Step 3: Train-Validation-Test splits

Why split? To know if we’re learning patterns or memorising noise.

| Split | Seen by the model? | Purpose |
|---|---|---|
| Training set | Yes | Fit $\theta$ |
| Validation set | During tuning | Pick hyper-parameters, early stopping |
| Test set | Only once | Report final unbiased performance |

Rule-of-thumb sizes: 60%-20%-20% or 70%-15%-15%, but adjust for data volume.
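One common way to get roughly 70-15-15 with scikit-learn is to split twice; the synthetic `X` and `y` below are placeholders for your own data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=1000)

# Peel off the test set first, then split the remainder into train and validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=42)  # 15% of the original

print(len(X_train), len(X_val), len(X_test))   # ≈ 700, 150, 150
```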


5 Cross-Validation: squeezing more insight from small data

K-fold CV: Split data into k equal folds. Train on k-1 folds, validate on the leftover fold. Repeat k times and average the metric.

Benefits: every observation gets used for validation exactly once, and averaging over the $k$ folds gives a less noisy performance estimate than a single split.

Common choices: 5-fold, 10-fold. For time series, use rolling-window CV to respect chronology.
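A minimal 5-fold CV sketch with scikit-learn on synthetic regression data; for time series you would swap `KFold` for `TimeSeriesSplit`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# 5-fold CV: average validation MSE across the folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("mean validation MSE:", -scores.mean())

# Time-series variant: no shuffling, each fold validates only on later samples.
ts_cv = TimeSeriesSplit(n_splits=5)
```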


6 Regularisation: taming variance without starving the model

Add a penalty term to the loss.

| Technique | Added term | Effect |
|---|---|---|
| L2 (Ridge) | $\lambda\sum\theta^2$ | Shrinks parameters smoothly |
| L1 (Lasso) | $\lambda\sum\lvert\theta\rvert$ | Drives some parameters exactly to zero (feature selection) |

Choosing $\lambda$ (the penalty weight) is a hyper-parameter task—tune it on the validation set or via cross-validation.
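In scikit-learn the penalty weight is called `alpha` rather than $\lambda$; a sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, RidgeCV

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)     # L2: shrinks all coefficients smoothly
lasso = Lasso(alpha=1.0).fit(X, y)     # L1: drives some coefficients exactly to zero
print(sum(abs(c) < 1e-8 for c in lasso.coef_), "coefficients zeroed by the Lasso")

# Tuning the penalty weight by cross-validation:
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
print("best alpha:", ridge_cv.alpha_)
```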


7 Early stopping & checkpoints

Plot validation loss each epoch. If it starts rising while training loss keeps falling, you’re over-fitting. Stop training, keep the weights from the best epoch.

Illustration of diverging train vs validation loss.
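A framework-agnostic sketch of that logic; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for your own training and evaluation code.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5):
    """Train until validation loss stops improving; return the best checkpoint."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val = validation_loss(model)
        if val < best_loss:                     # new best epoch: checkpoint it
            best_loss, best_state = val, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # validation loss keeps rising
                break
    return best_state, best_loss               # weights from the best epoch
```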


8 Putting it all together—mini walkthrough

Task: Predict CO₂ emissions of cars from engine size and weight (regression).

  1. Loss: pick MSE.
  2. Model: start simple—linear regression.
  3. Split: 70-15-15.
  4. Train: closed-form solution (no GD needed).
  5. Validate: check $R^2$ on the validation set. Add a polynomial term if it under-fits.
  6. Regularise: Ridge to avoid wild coefficients.
  7. Test: final MSE = 18 (g/km)² → interpret as ±4.2 g/km typical error.

Loop done; model shipped.
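Since the original dataset isn’t given, here is a sketch of the whole loop on invented engine-size/weight data; the numbers it prints will differ from those above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic cars: engine size (L), weight (t), CO₂ (g/km) with a mild nonlinearity.
rng = np.random.default_rng(0)
engine, weight = rng.uniform(1.0, 5.0, 500), rng.uniform(0.9, 2.5, 500)
co2 = 60 * engine + 40 * weight + 5 * engine**2 + rng.normal(0, 5, 500)
X, y = np.c_[engine, weight], co2

# 70-15-15 split.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=0)

# Linear model with a polynomial term, ridge-regularised; fit has a closed form.
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print("validation R²:", model.score(X_val, y_val))

test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {test_mse:.1f} (g/km)², typical error ≈ ±{test_mse**0.5:.1f} g/km")
```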


9 Common gotchas & guardrails


10 Where next?

| Upcoming topic | Teaser |
|---|---|
| Flexibility vs Interpretability | How trees, splines, and ensembles trade clarity for power. |
| Bias-Variance in practice | Hands-on demo with code to visualise the sweet spot. |

Key takeaways

  - Pick a loss that matches the problem, then optimise it on training data only.
  - Held-out validation (or cross-validation) tells you whether you learned patterns or memorised noise.
  - Regularisation and early stopping keep variance in check; report final performance on the test set exactly once.

Next stop: making sense of model flexibility and how to keep explanations human-friendly even as algorithms grow complex.
