There’s a pattern in machine learning that blows my mind every time, even though I’ve seen it more times than I can count.
Look at this expression:

ax + b
Yes, I know. You are more than familiar with linear regression; we are not here to discuss that. I want to share a wonderful mathematical principle with you, through the example of linear regression.
Depending on what we understand by a, x, b, +, and ·, the expression “ax + b” can either be the very first machine learning model a student encounters or the main component of a powerful neural network.
Its evolution from basic to state-of-the-art was shaped by the two great forces of mathematics:
generalizing the meaning of simple symbols such as + and · to make them do more,
and then abstracting the complex symbols into a-s, b-s, and x-es to keep them simple.
This dance of generalization and abstraction is the essence of mathematics; it’s why we can treat functions as vectors, use matrices as exponents, build the foundations of mathematics by drawing dots and arrows, and much more.
Let’s see the profound lesson that ax + b can teach us.
Fitting a line to a cluster of points
The fundamental problem of machine learning: predicting one variable from another. Mathematically speaking, we are looking for a function y = h(x) that describes the relation between the predictor variable x and the target variable y.
On the ground floor of machine learning, the simplest idea is to assume that the relation is linear; that is, y = ax + b for some parameters a and b.
In other words, if x is the predictor and a quantifies its effect on the target variable y, then the unwritten “·” operation calculates x’s total contribution to y.

To put it yet another way, we are fitting a line to a cluster of points.
This is called linear regression. You know all about it, but let’s recap two essential facts: the parameter a quantifies how the target variable y changes when the predictor variable x moves, and b is the bias, the baseline value of y when x is zero.
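To make the recap concrete, here is a minimal sketch of fitting a and b with NumPy; the toy dataset and the choice of np.polyfit are just illustrative assumptions.

```python
import numpy as np

# a toy dataset, assuming the true relation is y = 2x + 1 plus some noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# least-squares fit of y = ax + b; polyfit returns coefficients
# from the highest degree down, so the slope a comes first
a, b = np.polyfit(x, y, deg=1)
print(a, b)  # roughly 2 and 1
```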
How will ax + b become the ubiquitous building block of neural networks?
Let’s kickstart that cycle of mathematical generalization and abstraction by stepping out of the plane and into space.
Launching into higher dimensions
Is ax + b a good model?
Not in most cases. One of the first things that comes to mind is its failure to deal with multiple predictor variables. Say, y describes the USD/m² price of real estate or the lactic acid production of a microbial culture.
Do we only have a single predictor variable?
No. Real estate prices are influenced by several factors, and hundreds of various metabolites drive lactic acid-producing microbial culture processes. Life is complex, and it’s extremely rare that an effect only has a singular cause. Instead of a lonely x, we have a sequence: x₁, x₂, …, xₙ.
As each predictor variable has a different effect on the target, the simplest approach is to compute the individual effects aᵢxᵢ for some parameters aᵢ, then mix all the effects by summing them together, obtaining

y = a₁x₁ + a₂x₂ + … + aₙxₙ + b.
Now that we have generalized our model, it’s time for abstraction.
Mathematically speaking, our predictor variables are stored in the vector

x = (x₁, x₂, …, xₙ),

where each xᵢ describes a feature; and similarly, we have a vector of parameters

a = (a₁, a₂, …, aₙ).

If we define the operation “·” between two vectors by

a · x = a₁x₁ + a₂x₂ + … + aₙxₙ,

then our model takes the form

y = a · x + b,
which is the very same expression as the previous one, with an overloaded multiplication operator. Recall what I claimed a few lines above:
if x is the predictor and a quantifies its effect on the target variable y, then the unwritten “·” operation calculates x’s total contribution to y.

We generalized “·” to keep this specific property. The result is the famous dot product, a cornerstone of mathematics, full of useful properties and geometric interpretations. Check out this post:
How to measure the angle between two functions
This will surprise you: sine and cosine are orthogonal to each other. Even the notion of the enclosed angle between sine and cosine is unclear, let alone its exact value. How do we even define the angle between two functions? This is not a trivial matter. What’s intuitive for vectors in the Euclidean plane is a mystery for objects such as functions.
The process of abstracting multiple scalars into a vector is called vectorization.
Vectorization has two layers: a mathematical and a computational one. So far, we’ve seen the mathematical part, overloading the symbols of the expression ax + b to do more.
However, there’s another side of the coin. Computationally speaking, vectorized code is much faster than its scalar counterpart because it is
massively parallelizable,
and the array data structure enables contiguous memory allocation.
(Contiguous memory allocation means that the elements of the array are stored next to each other in memory. This makes a huge difference in performance, as you don’t have to jump around to access the elements.)
Vectorized code makes machine learning possible!
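To see what this means in practice, here is a rough sketch comparing a plain Python loop with NumPy’s vectorized dot product; the array size is arbitrary and the exact timings depend on your machine, but the gap is typically orders of magnitude.

```python
import time
import numpy as np

n = 10_000_000
a = np.random.rand(n)
x = np.random.rand(n)

# scalar version: one multiplication and one addition at a time
start = time.perf_counter()
total = 0.0
for i in range(n):
    total += a[i] * x[i]
loop_time = time.perf_counter() - start

# vectorized version: a single dot product over contiguous memory
start = time.perf_counter()
total_vectorized = a @ x
vectorized_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f} s, vectorized: {vectorized_time:.3f} s")
```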
High dimensions → high dimensions
Is the model

y = a · x + b

good enough?
It’s getting better, but there’s still something missing: we usually have multiple target variables. Think about a microbial colony in a biofuel factory that takes in a bunch of metabolites, producing a different set of molecules. There, we don’t just want to predict ethanol production; we want to think in systems and processes, so we track all the produced metabolites.
In other words, instead of a single target variable y, we have a bunch: y₁, y₂, …, yₘ. Thus, we are jumping from a vector-scalar model to a vector-vector model.
We can think about the vector-vector function h as a vector of vector-scalar functions:

h(x) = (h₁(x), h₂(x), …, hₘ(x)).
If each hⱼ is linear, then h(x) takes the form

h(x) = (a₁ · x + b₁, a₂ · x + b₂, …, aₘ · x + bₘ),
where each aⱼ is an n-dimensional parameter vector.
This was the generalization part. Now, let’s see the abstraction. Is there a way to simplify this? First of all, we can collect the biases into a vector b = (b₁, b₂, …, bₘ) and move the addition outside the vector:

h(x) = (a₁ · x, a₂ · x, …, aₘ · x) + b.
However, there’s an even more important pattern there. Instead of working with n- and m-dimensional vectors, let’s shift our perspective and use matrices instead! If we treat an n-dimensional vector as a 1 × n matrix, our updated model can be written as

h(x) = (xa₁ᵀ, xa₂ᵀ, …, xaₘᵀ) + b.
We are one step away from the final form.
To make that step, notice that the aⱼᵀ-s are column vectors that we can horizontally concatenate into an n × m matrix A:

A = ( a₁ᵀ  a₂ᵀ  …  aₘᵀ ).
With this, we have reached the final form

h(x) = xA + b.
Compare this one with the starting point:

y = ax + b.
It’s almost the same (except for the order in which a and x appear), but the vectorized one is light-years ahead of the starting point.
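Here is a minimal NumPy sketch of this abstraction; the sizes and parameter values are made up for illustration.

```python
import numpy as np

# hypothetical sizes: n = 3 features, m = 2 targets
a1 = np.array([1.0, 0.5, -2.0])   # parameters of h1
a2 = np.array([0.0, 3.0, 1.0])    # parameters of h2
b = np.array([0.1, -0.3])         # bias vector

# concatenate the parameter vectors as columns: A has shape (3, 2)
A = np.stack([a1, a2], axis=1)

x = np.array([2.0, -1.0, 0.5])    # a single sample, treated as a row vector
print(x @ A + b)                  # equals (a1 · x + 0.1, a2 · x - 0.3)
```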
Computing everything at once
There is one more step before we reach the final form.
Data usually comes in the form of tables, batches, and other bundles. In other words, we have a sequence of data samples x₁, x₂, …, xₛ, each forming a row vector, which we stack on top of each other to obtain the data matrix X, defined by

X =
⎛ x₁ ⎞
⎜ x₂ ⎟
⎜ ⋮  ⎟
⎝ xₛ ⎠
Stacking s row vectors of dimension n yields an s × n matrix. It’s an abuse of notation, but a good one.
With another abuse of notation, we might define the function h on the matrix X by

h(X) =
⎛ x₁A + b ⎞
⎜ x₂A + b ⎟
⎜    ⋮    ⎟
⎝ xₛA + b ⎠
This looks quite repetitive, so there should be a pattern. Must we apply h(x) row by row? First, notice that adding the row vector b can be moved outside of the matrix via

⎛ x₁A + b ⎞   ⎛ x₁A ⎞   ⎛ b ⎞
⎜ x₂A + b ⎟ = ⎜ x₂A ⎟ + ⎜ b ⎟
⎜    ⋮    ⎟   ⎜  ⋮  ⎟   ⎜ ⋮ ⎟
⎝ xₛA + b ⎠   ⎝ xₛA ⎠   ⎝ b ⎠
The matrix with the x-es and A looks familiar! Indeed, it is, by definition, the product of X and A:

⎛ x₁A ⎞
⎜ x₂A ⎟ = XA
⎜  ⋮  ⎟
⎝ xₛA ⎠
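Or, if you’d like NumPy to confirm it, here is a quick sketch with hypothetical shapes.

```python
import numpy as np

s, n, m = 4, 3, 2                          # hypothetical sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(s, n))                # data matrix, one sample per row
A = rng.normal(size=(n, m))                # parameter matrix

row_by_row = np.stack([x @ A for x in X])  # apply the model to each row separately
all_at_once = X @ A                        # a single matrix product

print(np.allclose(row_by_row, all_at_once))  # True
```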
Feel free to check it by hand. If you are confused by matrix multiplication, here’s my earlier post to guide you:
Epsilons, no. 2: Understanding matrix multiplication
Matrix multiplication is not easy to understand. Even looking at the definition used to make me sweat, let alone trying to comprehend the pattern. Yet, there is a stunningly simple explanation behind it. Let's pull back the curtain!
One thing that sticks out like a sore thumb is that matrix of b-s. Can we get rid of it? Sure.

Here’s that typical math trick: overload the operations. Instead of defining another matrix B by stacking the b-s, let’s define the addition between an s × m matrix M and a 1 × m row vector b by

(M + b)ᵢⱼ = Mᵢⱼ + bⱼ,

making our model

h(X) = XA + b,
which is as clear as it gets.
This last trick of blowing up row vectors is called broadcasting, and it’s built into tensor frameworks such as NumPy. Check it out:
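Here is a minimal sketch of how that looks; the shapes and random values are just placeholders.

```python
import numpy as np

s, n, m = 4, 3, 2            # hypothetical sizes
X = np.random.rand(s, n)     # data matrix, one sample per row
A = np.random.rand(n, m)     # parameter matrix
b = np.random.rand(1, m)     # bias, a single row vector

Y = X @ A + b                # b is broadcast to every row of X @ A
print(Y.shape)               # (4, 2)
```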

(The symbol @ denotes matrix multiplication in NumPy.)
Our loop of generalization and abstraction becomes complete: the linear regression model is vectorized. We have gained
the ability to handle multiple features and target variables,
no additional complexity in the model,
and a massively parallel implementation that uses efficient data structures.
If you are familiar with neural networks, you know that the simple expression XA + b is the so-called linear layer, one of the most frequent building blocks. (Even the convolutional layer is often implemented as an extremely sparse linear layer.)
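For instance, assuming you use PyTorch, torch.nn.Linear computes exactly this XA + b, with the small twist that the weight matrix is stored transposed. A minimal sketch:

```python
import torch

linear = torch.nn.Linear(in_features=3, out_features=2)

X = torch.randn(4, 3)              # a batch of 4 samples with 3 features
out = linear(X)                    # computes X @ W.T + b under the hood

# the same result by hand, using the layer's parameters
manual = X @ linear.weight.T + linear.bias
print(torch.allclose(out, manual)) # True
```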
Even though this post was about generalization and abstraction, the computational aspects are so interesting that they are worth an entire post.
See you next time with a deep dive into how vectors and matrices work in silico!