Today we’re going to dive into a hugely important task in statistical modelling and machine learning: prediction. Given some existing data, can we predict what a new value will be? This topic spans a lot of complex ground, but I believe the fundamentals are actually quite straightforward.
By the end, I hope you will have an intuition about the following:
- That prediction is a function
- That parameterisation allows you to choose from a space of prediction functions
- A loss function evaluates the quality of a prediction function (and thus the parameters that define it) by comparing its predictions to real data
- Optimisation automatically explores the parameter space to find the best prediction function
It’s okay if this doesn’t mean much to you yet; we’ll explain the terms as we go. Let’s dive in, shall we?
NOTE: This post relies heavily on interactive figures that won’t work properly on mobile devices so it unfortunately won’t be of much value unless you’re on a desktop/laptop.
Manual Prediction
Let’s imagine you manage a coal shop. You sell all kinds of coal. Heating coal, activated charcoal, deactivated charcoal, reactivated charcoal, gender-affirming coal for that manly blue-collar aesthetic after a hard day’s investment banking. You want to know how much coal you need to put out on display every day. Funnily, people in these parts buy more coal when it’s hot, so you stock more when the temperature rises1. But if you order too much or too little, your boss docks your pay by the largest discrepancy that month, and you need that money to treat a definitely unrelated case of bronchitis.
You want to make things easier for yourself, so you collect some data by measuring the temperature at 6am every day and the kilos of coal sold that day. We’ll call the temperature the predictor variable (because it predicts the outcome) and the amount of coal the response variable.2 After a month of measurements your hypothetical data looks like one of these four datasets3:
It’s coming up to Christmas and your boss gets particularly grumpy this time of year. Luckily, you have this data to help you predict the day’s sales. For the coming month, you can now use this data to make more accurate purchases and hopefully save a little for a Christmas present for little Timmy. For those who are familiar, we’ve just created a train and a test set. That is, “training” data gathered to help us make predictions and “testing” data used to evaluate those predictions.
The interactive figure below tells you the temperature on a new day. Given the existing data and today’s temperature, predict how much coal you will sell. After you make a guess, it will show how far your guess was from the true sales that day.
If you’re anything like me, your predictions pretty much show a line passing through the centre of the existing data.
Prediction Function
The data is plotted and stuck to the wall, but you’re getting sick of having to examine the chart every time. Speaking of sickness, you’ve started to cough up lumps of coal whole. It seems like the dust is forming deposits in your lungs. Because you’re feeling so frail, you want to develop a rule that will tell you automatically how much coal to put on display: just enter the current temperature and you get an output. No thinking required.
This sounds an awful lot like a mathematical function, which maps an input space of temperature to an output space of mass of coal. Let’s start by drawing a line that represents your best guess at the relationship between temperature and sales. Not ideal for all of the datasets, but it’s a start.
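If it helps to see this in code, here’s what such a rule might look like; a minimal sketch in Python, with made-up numbers standing in for whatever line you drew:

```python
def predict_coal_kg(temperature_c: float) -> float:
    """A hand-crafted prediction rule: temperature in, kilos of coal out.

    The slope and intercept below are illustrative guesses,
    not values fitted to any of the datasets above.
    """
    return 2.0 * temperature_c + 5.0

print(predict_coal_kg(30.0))  # 65.0 kg of coal on a 30 degree day
```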
Parameterisation
That whole line drawing thing was quite useful, but since you’re drawing with coal, you’re inhaling even more coal dust than before. Also, your boss is now talking about ghosts. Might be the carbon monoxide. He won’t pay to repair the furnace, but the coal dust and bronchitis appear to have formed a protective layer in your lungs4, so you’re immune to it. Let’s see if we can formalise the idea of defining a function and hopefully save you some of that precious mental oxygen.
If you think about it, the line that we’ve chosen is fully described by the positions of the two anchor points \((x_1,y_1)\) and \((x_2,y_2)\). A more concise way of describing the same line would be \(y=ax+b\), the famous line equation that I’m sure you’ve already encountered. But we’re also hinting at a larger pattern here. What are \(a\) and \(b\)?
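Before answering that, it’s worth noting that the two descriptions really are interchangeable: given the two anchor points, the slope and intercept follow directly as

\[ a = \frac{y_2 - y_1}{x_2 - x_1}, \qquad b = y_1 - a x_1 \]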
\(y\) in this case is actually a function of three variables, i.e. \(y=f(x,a,b)\). The input value is \(x\), i.e. the temperature. Inputs \(a\) and \(b\) change the function that relates \(x\) and \(y\). This allows us to split the domain, i.e. the inputs, of this function into two parts: the parameter space \(A\times B\), in this case two-dimensional, and the input space \(X\), in this case one-dimensional5. The upside is that, by varying \(a\) and \(b\), we’re able to choose from a family of functions, one for each combination of \(a\) and \(b\) (or point in the 2d parameter space)6.
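This split between parameters and inputs is very natural to express in code. A small sketch (the names are mine, not from the figures): a “factory” that takes a point in the parameter space and hands back one member of the linear family:

```python
def make_line(a: float, b: float):
    """Fix a point (a, b) in the parameter space to select
    one prediction function out of the linear family."""
    def f(x: float) -> float:
        return a * x + b
    return f

# Two points in the 2d parameter space give two different prediction functions
steep = make_line(3.0, 0.0)
gentle = make_line(0.5, 20.0)
print(steep(30.0), gentle(30.0))  # 90.0 35.0
```

This mirrors the \(f_{a,b}(x)\) notation: `make_line(a, b)` picks the function, and calling it supplies the input \(x\).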
That was a lot. The following figure shows how choosing different parameter values defines a different function. There are four function families to choose from, each used to generate one of the datasets7. Choose between them and try to find prediction functions that match the existing data.
Choosing a parameterisation (or family of prediction functions) is flexible. You can easily define your own function like above8. However, there are some practical factors involved, e.g. changes in the parameters might cause rapid, chaotic changes in the function, making it hard to tune, or the family might not contain a good match for the data at all, as you might have seen above. Sometimes there are clearer constraints: for example, sales can’t be below 0. Should we allow families of functions that can predict sales below 0? Often there are different conventions in different fields of study. Fundamentally though, you have a lot of families to choose from and the proof is in the Christmas pudding, that is, can it predict well?
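To make the sales-can’t-be-below-0 example concrete, here’s one toy way to bake such a constraint into the family itself, by clamping the output (again a sketch, not one of the families in the figures):

```python
def make_clamped_line(a: float, b: float):
    """A linear family whose predictions are clamped at zero,
    so it can never predict negative sales."""
    def f(x: float) -> float:
        return max(0.0, a * x + b)
    return f

f = make_clamped_line(2.0, -30.0)
print(f(10.0), f(25.0))  # 0.0 20.0 -- cold days predict zero sales, never negative
```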
Speaking of predicting well, you’ve now automated the process of choosing the amount of coal at the start of the day. After twiddling some knobs you can generate a prediction rule to use. No brain power required after choosing the right parameters. Maybe your boss will start docking the chart’s pay instead of yours. But how do we know what is a “good” line to draw? We know we want to reduce the error. Is there a way that we can better define it?
Loss
Your boss is really talking a lot about ghosts now. He also said that he’s not paying you to twiddle knobs9 while looking at funny charts. Let’s see if you can quickly take a step towards automating the whole process and make your life easier, so you can focus more of your efforts on the elaborate Christmas-themed window display. There’s even a nativity made entirely from coal.
Using parameters, we are able to choose from a wide range of functions. You probably have an intuitive understanding that the best prediction function is somewhere in the “middle” of the cloud. The way to formalise this concept is to apply what is called a loss function to the existing data (i.e. the training data). This takes the existing data and the prediction function and gives a score for how bad your predictions are.
Two common losses are the mean squared error (MSE) and mean absolute error (MAE)10. They take the average of all the squared differences or absolute differences between the predicted value and the true value for each datapoint. The absolute and squared values are largely there to make the error symmetric, that is, it is the same if you guess 10 units too high or 10 units too low. The key point is that for each known data point, the loss makes a prediction with the prediction function for the given parameters, and compares it to the true value, penalising differences11.
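For the code-inclined, here’s a minimal sketch of both losses for the linear family, on a tiny made-up training set:

```python
def mse(params, xs, ys):
    """Mean squared error of the linear prediction function y = a*x + b."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mae(params, xs, ys):
    """Mean absolute error of the same family."""
    a, b = params
    return sum(abs(a * x + b - y) for x, y in zip(xs, ys)) / len(xs)

# Made-up training data: (temperature, kilos of coal sold)
xs, ys = [20.0, 25.0, 30.0], [45.0, 55.0, 65.0]
print(mse((2.0, 5.0), xs, ys), mae((2.0, 5.0), xs, ys))  # 0.0 0.0 -- a perfect fit
print(mse((1.0, 5.0), xs, ys), mae((1.0, 5.0), xs, ys))  # worse parameters, higher loss
```

Both functions take the parameters and the training data and return a single number: lower is better.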
You might have noticed I’ve already used some notion of error, i.e. how much your boss docks your pay. However, it’s very important to distinguish where it’s being applied. Above, it was used to tell how good your predictions on new data were, i.e. the test loss. Now, we want to use it to help us choose the parameters in the first place, i.e. the training loss. In our slightly tortured narrative, the training loss is what you use to develop your prediction function from the training data you collected last month. The test loss will tell you how well it actually performs in the coming month.
The following figure should help demonstrate how the loss can help us find a good predictive function. Once again, the parameters are on the left and the prediction function is on the right. This time, we’ve calculated the loss values for many different parameter combinations and shown them in a heat map on the left. Play with different parameter values by moving the marker or sliders and see how well regions of low loss correspond to good function fits.
You can see that there isn’t a huge difference in the heatmap between the two loss functions. They’re both serving a very similar purpose: to quantify how far the responses of the training data points are from your predicted values.
Now let’s try a little experiment. Without looking at the training data at all, just with the loss function, try to choose a good set of parameters. The loss will be calculated in the background. Calculating the heat map involves evaluating the loss for every square in the parameter plot, which is kind of cheating, so we won’t include it.
So now you can see that it’s possible to use just the loss function on the original data to make a good prediction, without ever seeing the data itself. The final piece of the puzzle is passing this loss on to an algorithm that can automatically change the parameters, searching for the lowest loss.
Optimisation
Your boss is now saying that ghosts helped him see the future and that he really loves the work you’ve been doing on these predictions. With the new loss function, they’re better than ever, he said. He even smiled. You should really check on that furnace. It’s only a few days to Christmas and with the extra demand for premium stocking coal, you don’t have time to sort out next year’s predictions yourself, but you want to be ready for the new year. Is there a way to automate it?
Optimisation is the process of searching through a parameter space, guided by a loss function, for the best solution12. It should be quite clear how this field might be useful in helping us to find the best prediction function, without having to do the work ourselves. There are many ways of going about this. Some are better with more dimensions. Others are better when not all parameter combinations are possible (i.e. there are constraints). You could try 100 points at random and choose the best one - also a form of optimisation, as sketched below. In some lucky cases (like with the linear parameter space), we know the solution “analytically”. That is, we get the final optimum in a single step with some straightforward calculations.
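The try-some-random-points idea is about the simplest optimiser you can write, and it already captures the whole loop: propose parameters, score them with the loss, keep the best. A sketch, reusing the toy MSE from before (the search ranges are arbitrary):

```python
import random

def mse(params, xs, ys):
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def random_search(loss, xs, ys, n_tries=100):
    """Sample random points in the parameter space and keep the best one."""
    best_params, best_loss = None, float("inf")
    for _ in range(n_tries):
        params = (random.uniform(-10, 10), random.uniform(-100, 100))
        current = loss(params, xs, ys)
        if current < best_loss:
            best_params, best_loss = params, current
    return best_params, best_loss

xs, ys = [20.0, 25.0, 30.0], [45.0, 55.0, 65.0]
print(random_search(mse, xs, ys))  # the best of 100 random guesses -- crude, but it's optimisation
```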
Gradient descent is a type of optimisation that takes steps towards the lowest-loss parameters, and it’s a nice way to visualise this process. To keep this article short, I won’t describe how it works here. Just keep in mind the notion that it moves step-by-step to regions of lower loss, like walking down a hill to the lowest point. It will allow us to find a good (but not necessarily the best) solution for each of our predictive function families.
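For the curious, here’s a bare-bones sketch of the idea (not the implementation behind the figure): estimate the slope of the loss numerically by nudging each parameter a tiny amount, then step downhill.

```python
def mse(params, xs, ys):
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(loss, params, xs, ys, lr=1e-3, steps=200_000, eps=1e-6):
    """Walk downhill: estimate the loss's slope in each parameter
    by finite differences, then take a small step against it."""
    params = list(params)
    for _ in range(steps):
        base = loss(params, xs, ys)
        grads = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grads.append((loss(bumped, xs, ys) - base) / eps)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

xs, ys = [20.0, 25.0, 30.0], [45.0, 55.0, 65.0]
print(gradient_descent(mse, [0.0, 0.0], xs, ys))  # creeps towards a = 2, b = 5
```

Note how many steps it takes even on this tiny problem: the loss surface is much steeper in \(a\) than in \(b\), which forces a small learning rate and a slow crawl along the shallow direction. A small taste of why optimisation is a field of its own.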
Drag and drop the parameter marker around to vary the starting point of optimisation. You can take a single step or many at once. As always, you can test your final function on the following month’s data as well.
You can see that being able to come up with a parameter space doesn’t mean that it will be easy to find an optimum solution for it. Gradient descent certainly isn’t the best solution for many problems. Some of these function families would be better taxidermied and put on display somewhere than in active use.
Conclusion
The boss is so happy with the work you’ve done that he joined the family for Christmas dinner. He even brought a whole turkey with only 5% sawdust in the stuffing13 and some artisanal Peruvian stocking coal for Timmy. He says the ghosts told him to, but you know the real reason. With prediction you saved Christmas!
A huge part of machine learning and predictive modelling is exactly what you’ve just done:
- Defining a family of predictive functions with a parameter space
- Choosing a loss function that compares the predictions to training data
- Searching within that space for the best possible parameters
Of course, the devil is absolutely in the details here, as you saw above. Some combinations of family and loss effectively can’t be optimised, even if the true solution exists somewhere. Often the parameter spaces are vast, with millions or even trillions of parameters. These spaces can’t be graphed or really grasped by the human brain. There are vast subfields of machine learning and statistics dedicated to each one of these components. Nevertheless, the principles and intuition are largely universal and hopefully provide a scaffold for learning the details.
I hope that if you encounter a predictive problem in the future that hasn’t been solved before, you’re not afraid to define your own parameter space of predictive functions, to choose a loss that makes sense and to pick an optimisation algorithm (even a very simple one) to search that space for a good solution.
Footnotes
Definitely not because I changed the theme after making the diagrams↩︎
In general, because the concept is shared across so many fields, the terminology around this is a bit of a mess. You’ll see the former called covariates, features, inputs or instances and the latter also called labels or outputs↩︎
Realistically, the data from a single month would be highly autocorrelated (i.e. the temperature on one day is usually quite similar to the previous day) and we’d need to be more clever about how we analysed them. It also wouldn’t be so bloody hot at 6am↩︎
I can’t fully claim that one. Give Danger 5 a watch - it’s fantastic! ↩︎
These functions are often written like \(f_{a,b}(x)\) or \(f(x;a,b)\) to make clear the difference between these two spaces.↩︎
Sometimes it’s said that the parameters index a family of functions↩︎
Some of the families that I’ve chosen might seem strange to the maths professionals and enthusiasts reading this. I chose them to reinforce that really any function can be chosen. There’s a good reason we’ve chosen the standard functions in ML and stats (you’ll see that later when trying to optimise some of these), but I think it’s good to highlight that others are fundamentally possible.↩︎
An example from one of my statistics professors, who worked with industry, stuck with me for many years. They were modelling how successful TV ads were at bringing hits to a website. Their final model had an uptick with an exponential decay at the airtime of each ad. They then modelled the height of the uptick and the rate of decay as a function of the time of day, number of viewers, etc., and they managed to fit the data really well! I was struck by how tailored this solution was to the problem at hand, as opposed to throwing a generic ML algorithm at the problem and calling it a day. Really elegant!↩︎
He put it slightly differently.↩︎
Error and loss are synonymous here. I’ve also used the mean values because they make more sense during the iterative animation. Usually the total squared error or total absolute error is used to avoid the extra operation of dividing by the sample size, which is just a constant.↩︎
Here are the equations for reference.
\[ \begin{align*} MSE_{\mathbf{x},\mathbf{y}}(a,b) &= \frac{1}{n}\sum_{i=1}^{n} (f_{a,b}(x_i) - y_i)^2 \\ MAE_{\mathbf{x},\mathbf{y}}(a,b) &= \frac{1}{n}\sum_{i=1}^{n} |f_{a,b}(x_i) - y_i| \end{align*} \]
with \(\mathbf{x},\mathbf{y}\) being the vectors of predictor and response values in your training data. The notation is quite informative here. \(MSE_{\mathbf{x},\mathbf{y}}(a,b)\) means that, even though MSE is a function of both \(\mathbf{x},\mathbf{y}\) and \(a,b\), we consider the collected dataset to be fixed and we’re interested in varying \(a\) and \(b\). On the other hand, in the sum over the dataset, we are varying \(x\) and \(y\) while \(a\) and \(b\) are fixed, so the notation is \(f_{a,b}(x_i)\). Both hint at intent rather than functional difference. When working quickly, someone might even drop the dependency on the dataset and simply write \(MSE(a,b)\).
Tangent incoming, but sometimes it’s better, instead of expecting full rigour, to see equations and expressions like this as a language with many registers of formality. The goal is usually to communicate an idea for another human to understand. Just like a drawing (or interactive figure) might better convey an idea than purely words. This tripped me up a lot when I was learning, usually because I didn’t understand something that the author considered so obvious that they didn’t include it.
It’s also a good chance to talk about the word loss, which is colloquially used both for the function comparing two individual datapoints and for the aggregate over the entire dataset/batch. Other terms can also be added, e.g. for regularisation. In statistical learning theory, some of what I’m calling losses would more precisely be called risk. I hope the distinction is clear.↩︎
For example, if we use the squared loss and the linear function, this can be written as
\[ \begin{align*} &\text{argmin}_{a,b}\, MSE_{\mathbf{x},\mathbf{y}}(a,b) \\ =\ &\text{argmin}_{a,b}\, \frac{1}{n}\sum_{i=1}^{n} (f_{a,b}(x_i) - y_i)^2 \\ =\ &\text{argmin}_{a,b}\, \frac{1}{n}\sum_{i=1}^{n} ((ax_i+b) - y_i)^2 \end{align*} \] This basically condenses the whole article into a single expression. We’re finding the parameters that minimise the average squared loss between the prediction, made by the function our parameters select from the family, and the true value at each data point of the training data.↩︎
Honestly, you like a little bit of sawdust for the digestion.
For those who are confused by the talk of ghosts (I expect that many non-native speakers might read this), this strained plot is based on A Christmas Carol by Charles Dickens, a book that many of us had to read in middle school and which is hopefully a little more inspiring than I’ve portrayed it here.↩︎