The Mathematics Behind Machine Learning

15 min read · Aug 8, 2023

We use machine learning (ML) every day, whether we are asking ChatGPT a question or typing a search query into Google. But how do these human-like algorithms actually work under the hood?

Well, that is the question I asked myself after I had spent some time building machine learning models in PyTorch (Diagnosing Mental Health Conditions Using Deep Learning and Creating a CNN in PyTorch for the CIFAR10 Dataset). It is striking how accessible ML programming has become: all it takes to create an ML model is a programmatic understanding of a framework such as PyTorch or TensorFlow.

This accessibility and democratization is great for increasing the use of ML throughout society; however, it leads many people to overlook the under-the-hood processes that allow ML to work. Lacking an understanding of the basics of ML limits a developer’s ability to invent new ML techniques or to apply the approaches introduced in research papers. Thus, to take advantage of ML’s full potential and become an expert in the field, it is important to understand the underlying concepts behind ML.

So let’s get into unraveling the concepts that make ML so powerful…

*ML Model Predicting Pneumonia based on a Chest Radiograph*

How to Think About a Machine Learning Model

A Machine Learning (ML) model predicts outcomes (in other words, outputs) based on given inputs. Inputs to a model can be anything (images, videos, audio files, numerical data, etc.), and so can the outputs.

Now with this in mind, it is important to recognize that every form of data, whether it be an image, a video, etc., can be boiled down to numbers. For instance, an image is just a matrix of pixel data. Recognizing this, we can see that an ML model is just predicting numbers (outputs) based on numbers (inputs). As a result, an ML model is just like any other mathematical function (f(x), g(x), etc.) and it is purely driven by the language of math.

*Image showing a ML Model as a Mathematical Function*

Taking a Look at a Perceptron and “Learning”

Now that we understand that an ML model is a mathematical function, let’s take a look at one of the simplest ML models that can be created, a perceptron. Multiple perceptrons chained together are what create complex deep learning neural networks, a type of advanced ML model (for more information about the structure of neural networks and understanding the perceptron, please visit my article: Artificial Intelligence and Medicine).

A perceptron consists of a weight for each input (w1 to wn), a bias (b) and an activation function. The perceptron below has one weight (w1), a bias (b1) and the tanh activation function.

*Diagram of a Perceptron with One Input*

The output (f) of the perceptron above can be modeled with the following equation:
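In symbols, and consistent with the tanh(inp*w1+b1) expression used in the training program later in this article, the output is the input multiplied by the weight, plus the bias, passed through tanh:

```latex
f = \tanh(\text{input}_1 \cdot w_1 + b_1)
```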

Thinking of the perceptron as a function of input1, w1 and b1 (which it is), we can model the output of the perceptron mathematically as follows:
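As a minimal sketch, the same function can be written in Python. The function name perceptron is chosen here purely for illustration, and math.tanh from the standard library stands in for the tanh activation:

```python
import math

def perceptron(input1, w1, b1):
    """Output of a one-input perceptron with a tanh activation."""
    return math.tanh(input1 * w1 + b1)

# The output depends on all three arguments: the input, the weight and the bias
print(perceptron(1.0, 0.5, 0.0))
print(perceptron(1.0, 0.5, 1.0))
```

Changing any of the three arguments changes the output, which is what lets us treat w1 and b1 as knobs to tune.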

Within machine learning problems, our ML model (our function) is supposed to “learn” from given labeled data which contains inputs and corresponding correct outputs. Then using this learning, the model (our function) should be able to predict outputs to a pretty high degree of accuracy based on new inputs.

The concept of “learning” in ML involves tuning the weights and biases based on labeled data. This tuning is done in a way that increases the accuracy of the model.

Recognizing this, let’s attempt to make the perceptron learn to predict the probability of someone getting cancer based on age using the following data (this data is not real; it was generated by a computer algorithm):

Graphing the Data:

import matplotlib.pyplot as plt

#Ages (the inputs)
x = [21, 57, 23, 75, 15, 90, 39, 46, 10, 26, 77, 36, 16, 19]
#Probability of getting cancer for each age (the labels)
y = [0.16257491909664504, 0.4498912027529245, 0.17450833117876016, 0.531685242634573, 0.1862944760997236, 0.5723139374289526, 0.33334961096655963, 0.3442800291772686, 0.1450107473774238, 0.19879042910822511, 0.5717675652170906, 0.3285269754996645, 0.16914684898481053, 0.12500664046884424]

fig = plt.figure(facecolor='white')
ax = fig.add_subplot()

ax.set_xlabel('Age', color='black')
ax.set_ylabel('Probability of Individual Getting Cancer', color='black')

ax.scatter(x, y)
plt.show()

*Scatterplot of Age versus Probability of Individual getting Cancer*

The output (i.e. the prediction) of the perceptron for our first sample’s age value (input) can be modeled as follows:
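For the first sample, whose age is 21, substituting the raw age into the perceptron's function gives the following (w1 and b1 are left as symbols because they have not been set yet):

```latex
\text{output}_1 = \tanh(21 \cdot w_1 + b_1)
```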

Simple, right? But do you notice a problem with this way of representing the output?

Well, if you do, you have a great career in mathematics ahead of you. If you don’t, do not worry about it. The problem is that the age values (between 10 and 90) are large, so for weights of ordinary size the input to the tanh function, (Input1)(w1) + b1, is large as well. Because tanh flattens out for large inputs (see the graph below), the output of the perceptron ends up very close to 1 at every age (or very close to -1 if w1 is negative), and small changes in w1 or b1 barely move it. This makes the perceptron practically impossible to train.
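A short sketch makes the problem concrete. It assumes, for illustration only, a weight of 0.3 and a bias of -0.7 (the starting values used later in this article) and feeds the raw ages straight into tanh:

```python
import math

ages = [21, 57, 23, 75, 15, 90, 39, 46, 10, 26, 77, 36, 16, 19]

# Illustrative weight and bias (assumed here; any weight of similar size behaves the same way)
w1, b1 = 0.3, -0.7

# Perceptron output for each raw, unnormalized age
raw_outputs = [math.tanh(age * w1 + b1) for age in ages]

# Slope of tanh at each of those points, using d/dz tanh(z) = 1 - tanh(z)^2
slopes = [1 - out ** 2 for out in raw_outputs]

print('smallest output:', min(raw_outputs))
print('largest slope:', max(slopes))
```

Every output sits just below 1 and every slope is close to 0, so nudging w1 or b1 changes the prediction very little.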

Graphing the Tanh Function:

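import math
import numpy as np
import matplotlib.pyplot as plt

#100 evenly spaced input values between -10 and 10
x = np.linspace(-10,10,100).tolist()

#tanh written out from its definition
def tanh(inp):
    return (math.exp(inp) - math.exp(-1*inp))/(math.exp(inp) + math.exp(-1*inp))

y = [tanh(i) for i in x]

plt.figure(facecolor='white')

plt.plot(x,y)
plt.show()

#The equation shown on the graph was added in afterwards through Photoshop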

Therefore, for the perceptron to actually be trained from the data, the age values need to be normalized (made smaller). In this way, the output of the perceptron will vary with w1 and b1 and will not always be close to one.

Normalization, in the form used here (often called standardization), involves calculating the number of standard deviations each data point is away from the mean. By doing this, the data values become smaller, but their positions relative to one another are preserved.

The first step in normalization is to calculate the mean (µ) and standard deviation (σ):
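Using the population form of the standard deviation (dividing by N, which is what the normalize function in the training program further down does), the two quantities for the N = 14 age values work out to:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{550}{14} \approx 39.29
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \approx 25.10
```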

From here we can calculate the number of standard deviations each age value is away from the mean age value (z-score):
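The z-score of an age value x is its distance from the mean, measured in standard deviations. For the first sample (age 21), with the mean of about 39.29 and standard deviation of about 25.10 of this dataset:

```latex
z = \frac{x - \mu}{\sigma}
\qquad\Rightarrow\qquad
z_1 = \frac{21 - 39.29}{25.10} \approx -0.729
```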

Next, we can replace each age value with its z-score. This normalizes the age data:
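The replacement can be sketched in a few lines of Python. This mirrors the normalize function from the training program later in the article; the variable names mean, std and z_scores are chosen here for illustration:

```python
import math

ages = [21, 57, 23, 75, 15, 90, 39, 46, 10, 26, 77, 36, 16, 19]

# Mean and (population) standard deviation of the age values
mean = sum(ages) / len(ages)
std = math.sqrt(sum((a - mean) ** 2 for a in ages) / len(ages))

# Replace every age with the number of standard deviations it lies from the mean
z_scores = [(a - mean) / std for a in ages]

for age, z in zip(ages, z_scores):
    print(age, '->', round(z, 3))
```

The normalized values are all small numbers centered on zero, which keeps the input to tanh away from its flat regions.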

Having the age data normalized, the perceptron can now be trained. The first step in training any ML model is to randomly set the weights and biases. In the case of our perceptron, the w1 value is randomly initialized at 0.3 and the b1 value is randomly initialized at -0.7.

With w1 and b1 set, the output of the perceptron for the first sample’s normalized age value of -0.729 can be calculated as follows:
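Substituting the normalized age of -0.729, w1 = 0.3 and b1 = -0.7 into the perceptron's function (values rounded to three decimal places):

```latex
f = \tanh\big((-0.729)(0.3) + (-0.7)\big) = \tanh(-0.919) \approx -0.725
```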

The model’s prediction for this sample (about -0.725) is far off from the actual probability that should have been given (0.163). To make the model learn from this first sample, we need to adjust w1 and b1 so that the model’s output becomes closer to 0.163 (becomes more accurate).

A method used for accomplishing this task involves measuring the inaccuracy of the model’s output (done mostly through a loss function) and then moving the weights and biases in a specific direction to decrease that inaccuracy (gradient descent). By doing so, the model’s output gets closer to the wanted value (becomes more accurate) and it “learns” from the sample.

Applying this method to our perceptron, we can measure the inaccuracy of our model’s output with the mean squared error (MSE) loss function, a common choice for problems like this one where the output is a continuous number. The MSE loss function squares the difference between the predicted value and the actual target value (and averages those squares when there is more than one sample). In this manner, it provides a numerical value (loss) that measures the inaccuracy of the model (the higher the loss, the higher the inaccuracy). The MSE loss of the perceptron’s output for the first normalized age value can be calculated as shown:
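With a single sample there is nothing to average, so the loss is simply the squared difference between the prediction f and the target y. Using the rounded prediction of about -0.725 and the target of 0.163:

```latex
\text{MSE loss} = (f - y)^2 = (-0.725 - 0.163)^2 \approx 0.788
```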

With the MSE loss calculated, we can now go ahead and adjust the w1 and b1 values of our perceptron to lower this loss. But how do we do this exactly?

Well we can do this by thinking of our MSE loss as a function of our w1 and b1 (which it is):
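For the first sample, the input (-0.729) and the target (0.163) are fixed numbers, so the only things left to vary are w1 and b1:

```latex
\text{Loss}(w_1, b_1) = \big(\tanh(-0.729 \cdot w_1 + b_1) - 0.163\big)^2
```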

Graphing this function, we can see that as our w1 and b1 change, the values of MSE loss change. Moreover, we can see how currently we are at the point (w1 = 0.3, b1 = -0.7, MSE loss = 0.788) on this loss function.

*3-Dimensional Graph of the MSE Loss Function Plotted in MATLAB*

Just analyzing the graph, we can see that to decrease the MSE Loss, we have to decrease our w1 and increase our b1.
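The same conclusion can be checked numerically without the graph. The sketch below evaluates the loss for the first sample at the current point and at two nearby points; the step size of 0.1 and the function name mse_loss are arbitrary choices made for illustration:

```python
import math

def mse_loss(w1, b1, inp=-0.729, label=0.163):
    """Squared error of the perceptron's output for the first sample."""
    return (math.tanh(inp * w1 + b1) - label) ** 2

current = mse_loss(0.3, -0.7)      # where we are now
smaller_w1 = mse_loss(0.2, -0.7)   # w1 decreased by 0.1
larger_b1 = mse_loss(0.3, -0.6)    # b1 increased by 0.1

print('current loss:   ', current)
print('with smaller w1:', smaller_w1)
print('with larger b1: ', larger_b1)
```

Both nearby points give a lower loss than the current one, which agrees with what the graph suggests.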

However, in ML, it is inefficient for us to plot our loss function for every sample and then decide in which direction to adjust the weights and biases to lower the loss (increase accuracy). Therefore, to adjust the weights and biases correctly, we use a mathematical approach. In this approach, we calculate the gradients (i.e. the slopes) of the loss function with respect to each weight and bias (called back-propagation). The gradient of the loss with respect to a weight or bias tells us how much the loss changes when there is a change in that weight or bias (i.e. the rate of change).

Each gradient is then multiplied by a learning rate and subtracted from its corresponding weight or bias (called gradient descent). This subtraction moves the weights and biases in the direction in which the loss decreases, resulting in new adjusted weights and biases that have a lower loss (higher accuracy). An example of a weight being adjusted can be seen below:
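In symbols, with α standing for the learning rate (the symbol is a choice made here for illustration), the update for a single weight w is:

```latex
w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial\, \text{Loss}}{\partial w}
```

The same rule applies to a bias, with b in place of w.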

The learning rate is part of the expression as it makes sure the weights and biases aren’t adjusted too much to the point where they may lead to higher losses (overshooting).

Coming back to our perceptron, to calculate the gradient of the MSE loss with respect to w1 and b1, we can use two concepts from calculus: the derivative (the slope of the tangent line at a specific point on a function | watch this video to learn more) and the partial derivative (the derivative of a multi-variable function with respect to a single variable | watch this video to learn more and watch this video to truly understand the multivariable chain rule):
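Writing z for the input to tanh, f for the perceptron's output and y for the target, the chain rule gives the two gradients below. They match the gradw1 and gradb1 functions in the training program later in this article, and they use the fact that the derivative of tanh(z) is 1 - tanh²(z):

```latex
\begin{aligned}
z &= \text{input}_1 \cdot w_1 + b_1, \qquad f = \tanh(z), \qquad \text{Loss} = (f - y)^2 \\[4pt]
\frac{\partial\, \text{Loss}}{\partial w_1}
  &= \frac{\partial\, \text{Loss}}{\partial f} \cdot \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w_1}
   = 2\,(f - y)\,\big(1 - \tanh^2(z)\big)\cdot \text{input}_1 \\[4pt]
\frac{\partial\, \text{Loss}}{\partial b_1}
  &= \frac{\partial\, \text{Loss}}{\partial f} \cdot \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial b_1}
   = 2\,(f - y)\,\big(1 - \tanh^2(z)\big)
\end{aligned}
```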

Now that the gradients of the MSE loss with respect to w1 and b1 have been calculated, we can adjust w1 and b1 as follows:
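Here is a sketch of this single adjustment in Python for the first sample. It assumes a learning rate of 0.05, the value used in the training program later in this article; the helper names loss and gradients are illustrative:

```python
import math

def loss(w1, b1, inp, label):
    """Squared error of the perceptron's output for one sample."""
    return (math.tanh(inp * w1 + b1) - label) ** 2

def gradients(w1, b1, inp, label):
    """Partial derivatives of the loss with respect to w1 and b1."""
    out = math.tanh(inp * w1 + b1)
    common = 2 * (out - label) * (1 - out ** 2)
    return common * inp, common

inp, label = -0.729, 0.163   # first sample: normalized age and target probability
w1, b1 = 0.3, -0.7           # initial weight and bias
l_rate = 0.05                # learning rate (assumed, see above)

grad_w1, grad_b1 = gradients(w1, b1, inp, label)

# Gradient descent: move each parameter against its gradient
new_w1 = w1 - l_rate * grad_w1
new_b1 = b1 - l_rate * grad_b1

print('gradients:', grad_w1, grad_b1)
print('new w1, b1:', new_w1, new_b1)
print('loss before:', loss(w1, b1, inp, label))
print('loss after: ', loss(new_w1, new_b1, inp, label))
```

Under these assumptions the gradient for w1 is positive and the gradient for b1 is negative, so w1 moves down to about 0.269 and b1 moves up to about -0.658, and the loss for this sample drops.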

The adjusted w1 and b1 lead to a lower loss value which can be seen graphically (meaning the model is now more accurate and has learned from the first sample):

*3-Dimensional Graph Depicting the Effect of w1 and b1 Being Adjusted*

There we go! The perceptron has “learned” from the first sample. The process of conducting this “learning” can be summarised as follows:

Forward propagation (getting the output/prediction of the model based on the sample’s input value and then calculating the loss of the output)
Backward propagation (calculating the gradients of the loss function with respect to the weights and biases)
Gradient descent (adjusting the weights and biases by subtracting their corresponding gradients, multiplied by the learning rate → causing them to move in the direction in which the loss decreases)

Moving on, to fully train the perceptron, all we have to do is repeat the process detailed above with all of the other samples of the dataset and then keep looping over the dataset until we start getting high accuracy. Doing this by hand would be tedious, so I have written a Python program that trains our perceptron by looping over the dataset 10 times:

import math
import matplotlib.pyplot as plt

#Ages data that will be used for training
age = [21, 57, 23, 75, 15, 90, 39, 46, 10, 26, 77, 36, 16, 19]

#Probability of getting cancer data that will be used for training
prob = [0.16257491909664504, 0.4498912027529245, 0.17450833117876016, 0.531685242634573, 0.1862944760997236 ,0.5723139374289526, 0.33334961096655963, 0.3442800291772686, 0.1450107473774238, 0.19879042910822511, 0.5717675652170906, 0.3285269754996645, 0.16914684898481053, 0.12500664046884424]

#Function that normalizes an array of data and returns that array
def normalize(array):
    avg = sum(array)/len(array)
    stddev = math.sqrt(sum([(i-avg)**2 for i in array])/len(array))
    
    normalized = [(l-avg)/stddev for l in array]
    
    print('----')
    print('avg:',avg)
    print('stddev:',stddev)
    print('----')
    
    return normalized

#Function that calculates the gradient of the loss function with respect to w1
def gradw1(w1, b1, inp, label):
    return 2*(tanh((inp)*(w1)+b1)-label)*(1-(tanh((inp)*(w1)+b1))**2)*inp

#Function that calculates the gradient of the loss function with respect to b1
def gradb1(w1, b1, inp, label):
    return 2*(tanh((inp)*(w1)+b1)-label)*(1-(tanh((inp)*(w1)+b1))**2)

#Function that calculates the value of tanh for a given input --> tanh(inp) = ?
def tanh(inp):
    return (math.exp(inp) - math.exp(-1*inp))/(math.exp(inp) + math.exp(-1*inp))

#Function that returns a new array with the values ordered from smallest to biggest
def order(array):
    return sorted(array)

#Normalize the age data
agen = normalize(age) 

#Initialize the weight and bias (w1 and b1)
w1 = 0.3
b1 = -0.7

#Initialize the learning rate
l_rate = 0.05

#Conduct training
for x in range(10): #loop for the number of times to train over the dataset (loop for epochs)
    for i in range(len(agen)): #looping over the entire dataset
        #Calculate both gradients before changing w1 or b1 so that they use the same values
        gw1 = gradw1(w1,b1,agen[i],prob[i])
        gb1 = gradb1(w1,b1,agen[i],prob[i])
        w1 += -1*l_rate*gw1 #adjusting w1
        b1 += -1*l_rate*gb1 #adjusting b1

#printing the w1 after training
print('w1 =', w1)
#printing the b1 after training
print('b1 =', b1)

#Creating a plot with a white-background color and setting the figure variable
figure = plt.figure(facecolor='white')

#Setting the axes labels (the x-axis shows the normalized age values)
ax = figure.add_subplot()
ax.set_xlabel('Normalized Age', color='black')
ax.set_ylabel('Probability of Individual Getting Cancer', color='black')

#Adding a scatter plot to the previously created plot of the normalized age values vs the corresponding probability of getting cancer
plt.scatter(agen, prob)

#Ordering the normalized age values
agen = order(agen)

#Calculating the model's outputs for the ordered normalized age values
g = [tanh(a*w1+b1) for a in agen]

#Drawing a line on the previously created plot of the model's outputs based on the normalized age value
plt.plot(agen,g)
plt.show()

After training, we plot a scatter plot of the dataset with the normalized age values (blue dots) and the outputs of the model for each of the normalized age values (blue line). Through this plot, we can see that our model has fit the training data very well:

*Scatter Plot of Training Dataset and Line Plot of the Model’s Predictions*

To use this model in practice, we just have to save the w1 and b1 values of the trained model as well as the average and standard deviation values used to normalize the training dataset’s age values. For instance, given an age value of 30, we will first normalize it as follows using the average and standard deviation values.
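With the training set's average of about 39.29 and standard deviation of about 25.10 (the values saved in the prediction code further down), the normalization of an age of 30 works out to:

```latex
z = \frac{30 - 39.29}{25.10} \approx -0.370
```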

Next, we can input this normalized value into our trained model to get a prediction of the probability of the individual getting cancer based on their age.

To make predictions in Python using the trained model, the following code can be run after you have already executed the code block above.

#Saving the average and standard deviation of the training dataset
avg = 39.285714285714285
stddev = 25.095328453799556

#Function that normalizes input based on the training dataset's average and standard deviation
def norm(inp, avg, stddev):
    return (inp-avg)/stddev

#New input
inp = 30

#Normalize input before feeding into the model
inp = norm(inp, avg, stddev)

#Get the model's prediction on the input (using the trained w1 and b1)
modelpred = tanh(inp*w1+b1)

#Print the prediction
print(modelpred)

Applying our Knowledge of how a Perceptron “Learns” to a Neural Network

Up till now, we have discussed the mathematical basis for training a perceptron (a type of basic ML model) and have also implemented this training in code (in Python). In this section, we look to apply our knowledge to a more advanced and also very popular ML model, the neural network.

As stated before, the neural network is just layers of perceptrons attached together (to learn more about structure, read my previous article). Due to this structure, the actual training of a neural network is very similar to that of a perceptron and it applies the same mathematical concepts.

So let’s dive into it! Below is the structure of an example neural network that takes the inputs of Age, Monthly Alcohol Intake and Biological sex (0 for female, 0.5 for intersex and 1 for male) to predict the probability of an individual getting cancer:

*Neural Network with 2 Hidden Layers that Predicts the Probability of an Individual Getting Cancer Based on Certain Characteristics*

The output (i.e. prediction or pred) of the model can be modeled by the following function:

To train this model and adjust all the weights and biases that contribute to the output, we need, just as with a perceptron, a loss function that calculates the loss after the model makes a prediction. In this scenario, the mean squared error (MSE) loss function suits the neural network well.

*Diagram Emphasizing the Step of Calculating the Loss after the Model Makes a Prediction*

Following the calculation of the loss, to make the network learn from a given sample, the gradients of the loss function with respect to the weights and biases need to be calculated (back-propagation). Then, the gradients multiplied by some learning rate are subtracted from the weights and biases, adjusting them (gradient descent). Sound familiar? Well, it should, as this is the same way that a perceptron is trained. The only difference is that our neural network has many more weights and biases, so its loss function and gradients exist in a space with far more than three dimensions.

Recognizing this, let’s go ahead and algebraically show how we would adjust w1 in the model based on some given sample (see below). During actual training, we would have numerical values for each of the variables in the equations below. Additionally, we would carry out this adjustment for all the weights and biases, not just w1. Finally, our model would loop through the dataset and the weights and biases would be adjusted after each sample, not just once.
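To make the idea concrete in code, here is a minimal NumPy sketch of one training step for a small network with 3 inputs, two hidden layers and 1 output. It is not the exact network from the diagram: the hidden-layer widths (4 and 3), the use of tanh in every layer, the random starting weights, and the sample values are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 inputs, two hidden layers (4 and 3 units), 1 output
sizes = [3, 4, 3, 1]
weights = [rng.normal(0.0, 0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]


def forward(x, weights, biases):
    """Forward propagation: return the activations of every layer (input included)."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(np.tanh(W @ activations[-1] + b))
    return activations


def backward(activations, label, weights):
    """Back-propagation: gradients of the squared-error loss for one sample."""
    grads_W, grads_b = [], []
    delta = 2 * (activations[-1] - label)  # dLoss/dPrediction
    for layer in reversed(range(len(weights))):
        # Chain rule through tanh, whose derivative is 1 - tanh(z)^2
        delta = delta * (1 - activations[layer + 1] ** 2)
        grads_W.insert(0, np.outer(delta, activations[layer]))
        grads_b.insert(0, delta)
        # Gradient with respect to the previous layer's activations
        delta = weights[layer].T @ delta
    return grads_W, grads_b


def loss_fn(x, label, weights, biases):
    """Squared error between the network's prediction and the label."""
    pred = forward(x, weights, biases)[-1]
    return float(np.sum((pred - label) ** 2))


# Made-up sample: normalized age, normalized alcohol intake, biological sex
x = np.array([-0.37, 0.80, 1.0])
label = 0.2
l_rate = 0.05

loss_before = loss_fn(x, label, weights, biases)
grads_W, grads_b = backward(forward(x, weights, biases), label, weights)

# Gradient descent: every weight and bias moves against its own gradient
weights = [W - l_rate * gW for W, gW in zip(weights, grads_W)]
biases = [b - l_rate * gb for b, gb in zip(biases, grads_b)]

print('loss before:', loss_before)
print('loss after: ', loss_fn(x, label, weights, biases))
```

The loop in backward is the chain rule from the perceptron applied layer by layer: each layer multiplies the incoming gradient by the tanh derivative, records the gradients for its own weights and bias, and hands the rest back to the layer before it.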

And that’s it… It is spectacular how everything in ML relies on mathematics, and it really sheds light on how powerful the language of math truly is.

Overall, I hope you enjoyed reading through this (pretty lengthy) article and you gained insight into how ML models truly “learn”.

Stay in Touch 💬

I am an 18-year-old machine learning researcher at the University of Toronto, and I am working on building various projects in ML. To get updates on my projects and my personal life, follow my Medium and connect with me on LinkedIn.

Have an amazing day!

Written by Karan Chahal