Purpose Statement
In this article, I shall explain the general approach behind supervised machine learning. To do so, we shall program, from scratch, a machine-learning algorithm that solves a linear regression problem.
Introduction
Machine learning, abbreviated ML, is a process whereby a computer program observes various examples and then learns to mimic them. We thus say that a computer program uses ML to solve a problem when the following occurs: first, the program observes multiple examples of a desired behavior; then, the program exhibits that behavior itself. The usefulness of ML lies in that a programmer does not need to specify exactly how to solve a problem while coding. Instead, he needs only to provide examples of the desired behavior.
Consider the following example. A programmer is developing a program that should be able to receive a message and then classify it as having been written by a native English speaker or by someone who speaks the language non-natively. He thinks for a while and cannot come up with a reasonable set of rules that he could program to detect a native versus a non-native speaker. Most people have access to a spell checker and grammar checker, so obvious spelling or grammar checks could not do the job. However, his intuition is stellar: he can tell, without knowing exactly how, that "He got slapped with a killer fine" is written by a native speaker and "He was slapped with a fine that is killing" is not.
Using this intuition, our programmer spends four years of his life collecting texts on the internet and labeling them as either native or non-native written. He saves these data on his computer and now is ready to use machine learning to solve his problem. He shows his massive set of labeled texts to a machine-learning algorithm that therefrom learns to identify a text as either native or non-native written.
Since Mr. Programmer is an expert data scientist, he of course programmed this all in Python.
Python
To get started with Python, you will need some way to write Python code and to observe outputs from the code that you write. I recommend Google Colaboratory to get started. Once you have a way to write code and to view outputs from Python, you may proceed with this article.
Our Problem
We want a program that can convert from Celsius to Fahrenheit. We have forgotten the formula, but we remember that it is of the form f = m*c + b, where f is Fahrenheit, c is Celsius, and m and b are some parameters that we cannot remember. We shall use machine learning to determine what values we need to use for m and b to have a working program.
A problem of this form, where the general formula is the equation of a line and where we want to find the parameters m and b, is called a linear regression problem.
To determine these parameters, we leverage two sources of data: C, which holds multiple Celsius measurements, and F, which holds the corresponding Fahrenheit measurements.
C = [0, 7, 9, 10, 11, 13, 15, 29, 30, 32, 40]
F = [32, 44.6, 48.2, 50, 51.8, 55.4, 59, 84.2, 86.0, 89.6, 104]
e.g. 7 Celsius is 44.6 Fahrenheit.
Problem Formalization
Let us write a function, predict(c, m, b), that will predict the Fahrenheit measurement corresponding to a Celsius measurement, c, given parameters m and b.
# Function to predict F from C
def predict(c, m, b):
    return c * m + b
# Test function with arbitrary c, m, b
predict(10, 2, 1) # Out: 21
We see that our function works, but our choice of m and b is obviously not correct because we severely underpredicted 10 degrees Celsius to be 21 degrees Fahrenheit instead of the proper 50 degrees Fahrenheit found in the data that we have from earlier.
Let us now compose a function that will measure the error, that is the absolute difference, between the predicted Fahrenheit and the actual Fahrenheit:
def get_error(c, f, m, b):
    '''
    Get difference between the predicted
    and the actual Fahrenheit measurement
    '''
    return abs((m * c + b) - f)
We know from the given data, C and F, that 10 Celsius corresponds to 50 Fahrenheit. Using this function, we can see the error in the current parameters' modeling of this conversion:
# Error for prediction with
# m = 2, b = 1
get_error(10, 50, 2, 1) # Out: 29
Instead of evaluating our prediction function on one Celsius-Fahrenheit pair at a time, we shall save ourselves some time and write a function to get the average difference between each pair of predicted and actual Fahrenheit values. This way, we shall know how accurately our parameters are modeling Celsius-to-Fahrenheit conversion.
This function will take as input the two arrays of corresponding degree values along with our prediction parameters. The output will be the average value of get_error(c, f, m, b) over all pairs of values in C and F, that is, the average absolute difference between the predicted and actual values. We use the absolute value because we do not want one prediction that is 10 too high to be canceled out by a prediction that is 10 too low: the two erroneous predictions both must contribute to a poor evaluation. In this way, the function will tell us, on average, how far off these parameters place our predicted Fahrenheit values from the actual values.
def evaluate(C, F, m, b):
    '''
    Get average difference between
    predicted and actual values of
    F given parameters m, b
    '''
    # Accumulator
    total_diff = 0
    # Iterate through c,f pairs,
    # summing the absolute differences
    for c, f in zip(C, F):
        total_diff += get_error(c, f, m, b)
    average_diff = total_diff / len(C)
    return average_diff
We hence can determine less anecdotally how our predictions are doing by running this function. Note that a higher value will here correspond to a worse set of parameters because the output is the average difference. An output of zero would indicate a perfect model.
evaluate(C, F, 2, 1) # Out: 27.436
Our predictions, on average, differ from the actual Fahrenheit value by about 27 degrees. A better set of parameters would cause evaluate to return a lower value. We thus formally state our problem as follows:
Find parameters m, b such that evaluate(C, F, m, b) returns the lowest possible value.
In finding such m and b, we should have the best model of the conversion between temperature units, given our assumption that Fahrenheit has a relationship with Celsius of the form f = m*c + b.
Parameter Updating
Right now, we have m = 2 and b = 1. What we need, then, is a programmatic way to change these parameters such that the output of evaluate would decrease. Allow me first to dispel an idea that would not work, though it may seem a good one.
As a trivial, yet non-functional, method to update the parameters, we could see whether our predictions are usually overshooting or undershooting and then update the parameters accordingly: i.e. if we were undershooting, we should increase m and b by a set amount, and we should decrease them if we were overshooting. However, this method is not very general because it does not account for situations wherein we need to update the parameters by different amounts. If, for example, we were always to update both parameters by the same constant amount, the difference between m and b would never change; starting from m = 2 and b = 1, we could only arrive at the proper values if the ideal parameters also had a difference of 1 between them, with m being larger than b.
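A minimal sketch of the failure just described, assuming the same starting parameters m = 2 and b = 1: when both parameters are nudged by the same fixed amount, the gap between them never changes, so ideal parameters with a different gap are unreachable.

```python
# Naive update rule: nudge both parameters by the same fixed amount.
# The gap m - b never changes, so parameter pairs whose true values
# differ by anything other than the starting gap can never be reached.
m, b = 2.0, 1.0
step = 0.001
for _ in range(1000):
    m += step
    b += step
print(m - b)  # the gap is still (essentially) 1.0 after every update
```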
That aside, let us now consider a method that would work. To do this, we shall work through how we could choose to update these parameters.
Earlier, we wrote evaluate, which allows us to measure how well our parameters model the measurement conversion. The lower its output is, the better our parameters are. We can leverage evaluate to update our parameters with the following principle: if increasing a parameter a little bit causes evaluate to return a lower value, then we ought to increase the parameter; and if increasing a parameter a little causes evaluate to return a higher value, then we ought to decrease the parameter. We shall work through an example below.
Let us then choose a small value, eps = 0.01. Then, we see the difference between evaluate's output with m and with m + eps, and between its output with b and with b + eps.
eps = 0.01
em1 = evaluate(C,F,2,1)
em2 = evaluate(C,F,2+eps,1)
print(em2 - em1) # Out: -0.178
eb1 = evaluate(C,F,2,1)
eb2 = evaluate(C,F,2,1+eps)
print(eb2 - eb1) # Out: -0.010
Our difference for both is negative, telling us that both parameters, when increased, cause evaluate to lower in value, improving our model of the Celsius-to-Fahrenheit conversion. We cannot easily know exactly how much we ought to increase our parameters because, for all we know, increasing either parameter too much could eventually have the opposite effect, perhaps taking us from undershooting to overshooting. We do however know that increasing them a tiny bit would have a positive result. Moreover, we see that increasing m seems to have a better effect than increasing b because, though we simulated increasing both parameters by the same value, eps, our increasing of m decreased our average error by 17.8 times the amount that increasing b did. We should then be wise to increase m by 17.8 times the amount by which we increase b.
Therefore, let us increase b by 0.001 and m by 0.0178, 17.8 times the amount by which we increased b. I chose to increase b by 0.001 somewhat arbitrarily, mostly considering that the value is small relative to the numbers that we are working with in this problem.
More generally, we ought to select a base rate, which in this case was 0.1, by which to update our parameters; then, we decrease each parameter by that base rate multiplied by the corresponding difference above (e.g. for m we would compute m - 0.1 * (-0.178), increasing m by 0.0178). We decrease each parameter by this product because a negative difference above means that we need to increase the parameter, and a positive difference means that we need to decrease the parameter. This base rate is called a learning rate, named thus because it controls the amount by which the parameters are updated, affecting the rate at which the proper parameters are learned.
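As a sketch of one such update, using the differences measured above (-0.178 for m and -0.010 for b) and a base rate of 0.1; the data and evaluate are restated here so the snippet runs on its own:

```python
# Data and evaluate as defined earlier in the article
C = [0, 7, 9, 10, 11, 13, 15, 29, 30, 32, 40]
F = [32, 44.6, 48.2, 50, 51.8, 55.4, 59, 84.2, 86.0, 89.6, 104]

def evaluate(C, F, m, b):
    return sum(abs(m * c + b - f) for c, f in zip(C, F)) / len(C)

m, b = 2, 1
a = 0.1                          # learning rate (base rate)
m_diff, b_diff = -0.178, -0.010  # differences measured with eps = 0.01

before = evaluate(C, F, m, b)
m -= a * m_diff   # m: 2 -> 2.0178
b -= a * b_diff   # b: 1 -> 1.001
after = evaluate(C, F, m, b)
print(before, after)  # the average error decreases slightly
```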
After we update these parameters, we then can repeat the process to update them again and again until evaluate returns an average error that is low enough for us to tolerate.
To summarize, we use evaluate to see what effect increasing either of the parameters has on the performance of our model. Then, we update each parameter according to how big of an effect on the return value of evaluate we expect the update to have.
The code below implements this algorithm and repeats it 100,000 times, which will be enough here to arrive close to the proper values for m and b. We introduce one new meta-parameter in the code, a, which is our learning rate. The smaller the learning rate, the more precisely our parameters can be updated, which may lead to a better model; but a lower learning rate also means that we need to run more iterations before we arrive at our final model.
# Initial parameters for the model
m = 2
b = 1
# Meta parameters for updating
# m and b
eps = 0.01
a = 0.1
# Update the parameters 100000 times
for _ in range(100000):
    err = evaluate(C, F, m, b)
    erm = evaluate(C, F, m + eps, b)
    erb = evaluate(C, F, m, b + eps)
    # Get the effect of increasing m
    # and of increasing b
    m_diff = erm - err
    b_diff = erb - err
    # Update parameters according to
    # measured effect and the
    # learning rate
    m -= a * m_diff
    b -= a * b_diff
# Final average error
err = evaluate(C,F,m,b)
print(err) # Out: 0.181
# Final parameter values
print(m) # Out: 1.790
print(b) # Out: 31.995
The actual parameters for conversion from Celsius to Fahrenheit are m = 1.8 and b = 32, which are very close to the values that our machine-learning algorithm calculated.
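As a quick sanity check, we can plug the approximate learned parameters printed above into predict and compare against the true conversion f = 1.8c + 32 (the particular test inputs here are my own choice):

```python
def predict(c, m, b):
    return c * m + b

# Approximate learned parameters from the run above
m, b = 1.790, 31.995

# Compare the learned model against the true formula f = 1.8c + 32
for c in [0, 10, 37, 100]:
    learned = predict(c, m, b)
    true = 1.8 * c + 32
    print(c, round(learned, 2), true)
# The learned predictions track the true values to within a degree or two
```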
Summary & Onward
We began with
- some data, herein C and F, including input-output pairs and
- a general model, herein a line equation, for translating from inputs to outputs.
We coded
- a function, herein evaluate, to measure the performance of our model and
- an algorithm to minimize this function's output by updating our parameters.
This is the general approach to all supervised machine-learning problems. Supervised ML refers to algorithms that optimize a parameter set by considering known input-output pairs. I simplified the algorithm here slightly to avoid the use of any calculus, but the themes translate nicely. The amazing part of this algorithm is that it can be used for a massive class of functions, not just that of a line. The method used here to update parameters could thus be used to optimize parameters for arbitrarily large functions, such as those classifying an image or a text.
For those curious and desiring starting points for further learning: the function called evaluate throughout this article is analogous to a loss function in a proper ML algorithm. A loss function takes the current model along with input-output pairs as input, then returns a value corresponding to the current model's performance. The loss function is what an ML algorithm tries to minimize. Therefore, a good loss function must return lower values for better models and higher values for worse models.
Also, our calculation of the difference erm - err is a stand-in for the more typical derivative used in an ML algorithm. Normally, this finite difference would be replaced with the derivative of the loss function, evaluate, with respect to m (and likewise erb - err with the derivative with respect to b).
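The finite-difference quantity just described can be checked against the analytic derivative. Here is a minimal, self-contained sketch using the article's data; the sign helper is my addition, not part of the article, and the derivative formula holds wherever the absolute-value terms are differentiable:

```python
C = [0, 7, 9, 10, 11, 13, 15, 29, 30, 32, 40]
F = [32, 44.6, 48.2, 50, 51.8, 55.4, 59, 84.2, 86.0, 89.6, 104]

def evaluate(C, F, m, b):
    return sum(abs(m * c + b - f) for c, f in zip(C, F)) / len(C)

def sign(x):
    # Helper returning -1, 0, or 1
    return (x > 0) - (x < 0)

m, b, eps = 2, 1, 0.01

# Finite-difference estimate of the slope, as in the article
fd = (evaluate(C, F, m + eps, b) - evaluate(C, F, m, b)) / eps

# Analytic derivative of the mean absolute error with respect to m:
# d/dm mean|m*c + b - f| = mean(sign(m*c + b - f) * c)
exact = sum(sign(m * c + b - f) * c for c, f in zip(C, F)) / len(C)

print(fd, exact)  # the two values agree closely
```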
If nothing else, remember this: machine learning works by updating model parameters such that an evaluation (i.e. loss) function is minimized.