
How Do We Estimate f? Turning Data into a Working Rule
Mission recap: we want $\hat{f}$, an estimate of the hidden rule $f$ that links predictors $X$ to outcome $Y$. Today we'll see how we actually build $\hat{f}$ from data.
1 The four-step recipe
- Pick a loss function to quantify wrongness.
- Optimise that loss on training data.
- Validate on unseen data to spot over-fitting.
- Tune & repeat until performance stabilises.
That’s it. Everything from linear regression to GPT follows this loop.
2 Step 1: Choosing a loss function
Loss = how bad a guess is, measured in one number. Lower is better.
Problem type | Common loss | Intuition (plain English) |
---|---|---|
Regression | Mean Squared Error (MSE) | Penalises big errors extra hard |
Classification | Cross-Entropy / Log-Loss | Rewards correct probabilities, punishes over-confidence |
Robust tasks | Mean Absolute Error (MAE) | Treats all errors linearly—good with outliers |
Analogy: Think of loss as the distance between a dart and the bullseye. The further away, the higher the penalty.
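A few lines of NumPy make these losses concrete; the numbers are toy values chosen only for illustration:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])

mse = np.mean((y_true - y_pred) ** 2)   # squares the misses, so big errors dominate
mae = np.mean(np.abs(y_true - y_pred))  # every unit of error counts the same

# Binary cross-entropy: true labels vs predicted probabilities.
labels = np.array([1, 0, 1])
probs  = np.array([0.9, 0.2, 0.6])
log_loss = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(mse, mae, log_loss)
```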
3 Step 2: Optimisation—the search for lowest loss
Most models frame training as the optimisation problem

$$\hat{\theta} = \arg\min_{\theta} L(\theta),$$

where $\theta$ are the model's parameters and $L$ is the loss averaged over the training data.
Gradient Descent (GD) in one breath
- Start with random parameters $\theta$.
- Compute the gradient $\nabla_\theta L(\theta)$, the direction of steepest increase in loss.
- Step downhill by a learning rate $\eta$: update $\theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta)$.
- Repeat until the steps become tiny.
Variants (stochastic GD, Adam, L-BFGS) trade off step size against gradient noise, but the spirit is the same: follow the slope down into the valley, as in the sketch below.
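A minimal NumPy sketch of plain gradient descent for linear regression under MSE; the toy data, learning rate, and step count are illustrative choices, not prescriptions.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_steps=1000):
    """Fit linear-regression weights by minimising MSE with plain gradient descent."""
    n, d = X.shape
    theta = np.random.randn(d)            # start with random parameters
    for _ in range(n_steps):
        residuals = X @ theta - y         # prediction errors on the training data
        grad = 2.0 / n * X.T @ residuals  # gradient of MSE with respect to theta
        theta -= lr * grad                # step downhill by the learning rate
    return theta

# Toy usage: recover known weights from noisy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.5]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))             # ≈ [3.0, -1.5]
```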
Closed-form solutions
Some simple models (ordinary least squares, for instance) can solve for $\hat{\theta}$ in one matrix formula, the normal equations $\hat{\theta} = (X^\top X)^{-1}X^\top y$, with no looping required.
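Continuing the toy example above, one way to get the same answer without any loop (using `np.linalg.lstsq` rather than an explicit matrix inverse, which is the numerically safer habit):

```python
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min ||X @ theta - y||^2 in one call
print(theta_hat)                                   # matches the gradient-descent estimate
```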
4 Step 3: Train-Validation-Test splits
Why split? To know if we’re learning patterns or memorising noise.
Split | Seen by the model? | Purpose |
---|---|---|
Training set | Yes | Fit the parameters of $\hat{f}$
Validation set | During tuning | Pick hyper-parameters, early stopping |
Test set | Only once | Report final unbiased performance |
Rule-of-thumb sizes: 60%-20%-20% or 70%-15%-15%, but adjust for data volume.
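A sketch of the 70-15-15 split with scikit-learn, reusing the toy `X`, `y` arrays from earlier; the two-step call is just one common convention:

```python
from sklearn.model_selection import train_test_split

# Peel off 15% as the untouched test set first...
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# ...then split the remainder into training and validation (0.15 / 0.85 ≈ 17.6% of what is left).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=42
)
```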
5 Cross-Validation: squeezing more insight from small data
K-fold CV: Split data into k equal folds. Train on k-1 folds, validate on the leftover fold. Repeat k times and average the metric.
Benefits:
- Uses every sample for both training and validation.
- Gives a more stable estimate than a single split.
Common choices: 5-fold, 10-fold. For time series, use rolling-window CV to respect chronology.
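A minimal scikit-learn sketch of 5-fold CV on the training data; the linear model and the (negated) MSE scorer are illustrative choices:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring="neg_mean_squared_error",  # sklearn maximises scores, hence the negation
)
print(-scores.mean())  # average validation MSE across the 5 folds
```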
6 Regularisation: taming variance without starving the model
Add a penalty term to the loss.
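In symbols, the regularised objective takes the familiar form

$$L_{\text{reg}}(\theta) = L_{\text{data}}(\theta) + \lambda \cdot \text{penalty}(\theta),$$

where a larger $\lambda$ means stronger shrinkage of the parameters.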
Technique | Added term | Effect |
---|---|---|
L2 (Ridge) | $\lambda\sum_j \theta_j^2$ | Shrinks parameters smoothly towards zero
L1 (Lasso) | $\lambda\sum_j \lvert\theta_j\rvert$ | Drives some parameters exactly to zero (built-in feature selection)
Choosing $\lambda$ (the penalty weight) is a hyper-parameter task: tune it on the validation set or via cross-validation.
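A sketch of tuning $\lambda$ (called `alpha` in scikit-learn) for Ridge on the validation set; the candidate grid is arbitrary:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

best_alpha, best_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)   # fit on training data only
    mse = mean_squared_error(y_val, model.predict(X_val))  # score on the validation set
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse
print(best_alpha, best_mse)
```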
7 Early stopping & checkpoints
Plot validation loss each epoch. If it starts rising while training loss keeps falling, you’re over-fitting. Stop training, keep the weights from the best epoch.
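A bare-bones early-stopping loop, reusing the gradient-descent step and the split from the earlier sketches; the patience of 5 epochs is an arbitrary choice:

```python
best_val, best_theta, patience, since_best = float("inf"), None, 5, 0
theta = np.zeros(X_train.shape[1])

for epoch in range(1000):
    grad = 2.0 / len(y_train) * X_train.T @ (X_train @ theta - y_train)
    theta -= 0.01 * grad                              # one training step
    val_loss = np.mean((X_val @ theta - y_val) ** 2)  # validation MSE this epoch
    if val_loss < best_val:                           # new best: checkpoint the weights
        best_val, best_theta, since_best = val_loss, theta.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:                    # no improvement for 5 epochs
            break

theta = best_theta                                    # restore the best checkpoint
```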
8 Putting it all together—mini walkthrough
Task: Predict CO₂ emissions of cars from engine size and weight (regression).
- Loss: pick MSE.
- Model: start simple—linear regression.
- Split: 70-15-15.
- Train: closed-form solution (no GD needed).
- Validate: check on the validation set. Add a polynomial term if it under-fits.
- Regularise: Ridge to avoid wild coefficients.
- Test: final MSE = 18 (g/km)² → RMSE √18 ≈ 4.2, i.e. a typical error of about ±4.2 g/km.
Loop done; model shipped.
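The whole loop in one hypothetical script; the file name `cars.csv` and the column names are made up for illustration:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical dataset: engine size, weight, and CO2 emissions per car.
cars = pd.read_csv("cars.csv")
X, y = cars[["engine_size", "weight"]], cars["co2"]

# 70-15-15 split: peel off the test set first, then carve out validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=0
)

model = Ridge(alpha=1.0).fit(X_train, y_train)  # closed-form fit with an L2 penalty
print("val MSE:", mean_squared_error(y_val, model.predict(X_val)))      # tune alpha here
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))   # report once, at the very end
```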
9 Common gotchas & guardrails
- Leaking test data into preprocessing: standardise after splitting (see the pipeline sketch after this list).
- Ignoring non-random missingness—impute wisely or model it.
- Huge learning rate—model oscillates and never converges.
- Too small learning rate—training takes forever.
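One way to make the "standardise after splitting" rule automatic is a scikit-learn Pipeline, which re-fits the scaler on the training portion of each fold:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# Inside cross_val_score the scaler is fit on each training fold only,
# so the validation folds never influence the standardisation statistics.
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="neg_mean_squared_error")
```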
10 Where next?
Upcoming topic | Teaser |
---|---|
Flexibility vs Interpretability | How trees, splines, and ensembles trade clarity for power. |
Bias-Variance in practice | Hands-on demo with code to visualise the sweet spot. |
Key takeaways
- Loss functions translate mistakes into numbers we can minimise.
- Optimisers search parameter space for the lowest loss.
- Validation guards against over-fitting; cross-validation makes the guard stronger.
- Regularisation balances model flexibility and stability.
- Always keep a pristine test set—your final “exam” score.
Next stop: making sense of model flexibility and how to keep explanations human-friendly even as algorithms grow complex.