There’s a pattern in machine learning that blows my mind every time, even though I’ve seen it more times than I can count.

Look at this expression:

*ax* + *b*

Yes, I know: you are more than familiar with linear regression; we are not here to discuss that. I want to share a wonderful mathematical principle with you, using linear regression as our example.

Depending on what we understand by *a*, *x*, *b*, +, and ·, the expression “*ax* + *b*” can either be the very first machine learning model a student encounters or the main component of a powerful neural network.

Its evolution from basic to state-of-the-art was shaped by the two great forces of mathematics:

generalizing the meaning of simple symbols such as + and · to make them do more,

and then abstracting the complex symbols into

*a*-s, *b*-s, and *x*-es to keep them simple.

This dance of generalization and abstraction is the essence of mathematics; it’s why we can treat functions as vectors, use matrices as exponents, build the foundations of mathematics by drawing dots and arrows, and much more.

Let’s see the profound lesson that *ax* + *b* can teach us.

# Fitting a line to a cluster of points

The fundamental problem of machine learning: predicting one variable from another. Mathematically speaking, we are looking for a function *y* = *h*(*x*) that describes the relation between the predictor variable *x* and the target variable *y*.

On the ground floor of machine learning, the simplest idea is to assume that the relation is linear; that is, *y* = *ax* + *b* for some parameters *a* and *b*.

In other words, if *x* is the predictor and *a* quantifies its effect on the target variable *y*, then the unwritten “·” operation calculates *x*’s total contribution to *y*.

Put yet another way, we are fitting a line to a cluster of points.

This is called linear regression. You know all about it, but let’s recap two essential facts: the parameter *a* quantifies how the target variable *y* changes when the predictor variable *x* moves, and *b* describes the bias.
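To make the recap concrete, here is a minimal sketch of fitting *a* and *b* with NumPy’s least-squares solver. The data points are made up for illustration; they scatter around the line *y* = 2*x* + 1.

```python
import numpy as np

# Noisy points scattered around the line y = 2x + 1 (illustrative data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least-squares fit of a and b: the design matrix stacks [x, 1] as columns,
# so the solution vector holds the slope a and the bias b
design = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")  # close to the true slope 2 and bias 1
```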

How will *ax* + *b* become the ubiquitous building block of neural networks?

Let’s kickstart that cycle of mathematical generalization and abstraction by stepping out of the plane into space.

# Launching into higher dimensions

Is *ax* + *b* a good model?

Not in most cases. One of the first things that comes to mind is its failure to deal with multiple predictor variables. Say, *y* describes the USD/m² price of real estate or the lactic acid production of a microbial culture.

Do we only have a single predictor variable?

No. Real estate prices are influenced by several factors, and hundreds of various metabolites drive lactic acid-producing microbial culture processes. Life is complex, and it’s extremely rare that an effect only has a singular cause. Instead of a lonely *x*, we have a sequence: *x₁*, *x₂*, …, *xₙ*.

As each predictor variable has a different effect on the target, the simplest approach is to compute the individual effects *aᵢxᵢ* for some parameters *aᵢ*, then mix all the effects by summing them together, obtaining

*y* = *a₁x₁* + *a₂x₂* + … + *aₙxₙ* + *b*.

Now that we have *generalized* our model, it’s time for *abstraction*.

Mathematically speaking, our predictor variable is stored in the vector

**x** = (*x₁*, *x₂*, …, *xₙ*),

where each *xᵢ* describes a feature; and similarly, we have a vector of parameters

**a** = (*a₁*, *a₂*, …, *aₙ*).

If we define the operation “·” between two vectors by

**a** · **x** = *a₁x₁* + *a₂x₂* + … + *aₙxₙ*,

then our model takes the form

*y* = **a** · **x** + *b*,

which is the very same expression as the previous one, with an overloaded multiplication operator. Recall what I claimed a few lines above:

> if **x** is the predictor and **a** quantifies its effect on the target variable *y*, then the unwritten “·” operation calculates **x**’s total contribution to *y*.

We generalized “·” to keep this specific property. The result is the famous dot product, a cornerstone of mathematics, full of useful properties and geometric interpretations.
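In code, the generalized model is a one-liner. A minimal sketch with illustrative values for the features and parameters:

```python
import numpy as np

# Hypothetical feature vector and parameters (illustrative values)
x = np.array([1.0, 2.0, 3.0])   # predictors x₁, x₂, x₃
a = np.array([0.5, -1.0, 2.0])  # their effects a₁, a₂, a₃
b = 0.1

# The generalized "·" is the dot product: a · x = a₁x₁ + a₂x₂ + a₃x₃
y = np.dot(a, x) + b
print(y)  # 0.5 - 2.0 + 6.0 + 0.1 = 4.6
```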

The process of abstracting multiple scalars into a vector is called *vectorization*.

Vectorization has two layers: a mathematical and a computational one. So far, we’ve seen the mathematical part, overloading the symbols of the expression *ax* + *b* to do more.

However, there’s another side of the coin. Computationally speaking, vectorized code is much faster than its scalar counterpart because it is

massively parallelizable,

and the array data structure enables contiguous memory allocation.

(Contiguous memory allocation means that the elements of the array are stored next to each other in the memory. This makes a huge difference in performance, as you don’t have to jump around to access the elements.)
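The two versions compute the same number, but only one of them hands the whole computation to optimized, contiguous-memory machinery. A minimal sketch comparing an explicit Python loop with the vectorized call (the array contents are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal(100_000)
x = rng.standard_normal(100_000)

# Scalar version: an explicit Python loop over the elements
slow = 0.0
for ai, xi in zip(a, x):
    slow += ai * xi

# Vectorized version: one call over contiguous memory, parallel-friendly
fast = np.dot(a, x)

# Same result, up to floating-point accumulation order
assert np.isclose(slow, fast)
```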

Vectorized code makes machine learning possible!

# High dimensions → high dimensions

Is the model

*y* = **a** · **x** + *b*

good enough?

It’s getting better, but there’s still something missing: we usually have multiple target variables. Think about a microbial colony in a biofuel factory that takes in a bunch of metabolites, producing a different set of molecules. There, we don’t just want to predict ethanol production; we want to think in systems and processes, so we track *all* the produced metabolites.

In other words, instead of a single target variable *y*, we have a bunch: *y₁*, *y₂*, …, *yₘ*. Thus, we are jumping from a vector-scalar model to a vector-vector model.

We can think about the vector-vector function **h** as a vector of vector-scalar functions:

**h**(**x**) = (*h₁*(**x**), *h₂*(**x**), …, *hₘ*(**x**)).

If each *hⱼ* is linear, then **h**(**x**) takes the form

**h**(**x**) = (**a***₁* · **x** + *b₁*, **a***₂* · **x** + *b₂*, …, **a***ₘ* · **x** + *bₘ*),

where each **a***ⱼ* is an *n*-dimensional parameter vector.

This was the generalization part. Now, let’s see the abstraction. Is there a way to simplify this? First of all, we can form a vector **b** = (*b₁*, *b₂*, …, *bₘ*) from the biases and move the addition outside the vector:

**h**(**x**) = (**a***₁* · **x**, **a***₂* · **x**, …, **a***ₘ* · **x**) + **b**.

However, there’s an even more important pattern here. Instead of working with *n*- and *m*-dimensional vectors, let’s shift our perspective and use matrices instead! If we treat an *n*-dimensional vector as a 1 × *n* matrix, our updated model can be written as

**h**(**x**) = (**x** **a***₁ᵀ*, **x** **a***₂ᵀ*, …, **x** **a***ₘᵀ*) + **b**.

We are one step away from the final form.

To make that step, notice that the **a***ⱼᵀ*-s are column vectors that we can horizontally concatenate into an *n* × *m* matrix *A*:

*A* = (**a***₁ᵀ* **a***₂ᵀ* … **a***ₘᵀ*).

With this, we have reached the final form

**h**(**x**) = **x***A* + **b**.

Compare this one with the starting point:

*y* = *ax* + *b*.

It’s almost the same (except for the order in which *a* and *x* appear), but the vectorized one is light-years ahead of the starting point.
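A minimal sketch of the single-sample form **x***A* + **b**, with small illustrative sizes and hand-picked values so the result is easy to check:

```python
import numpy as np

# n = 3 features, m = 2 targets (illustrative sizes)
x = np.array([[1.0, 2.0, 3.0]])  # one sample as a 1 × n matrix

A = np.array([[1.0,  0.0],
              [0.0,  1.0],
              [1.0, -1.0]])      # n × m parameters: column j is aⱼᵀ

b = np.array([[0.5, -0.5]])      # 1 × m bias vector

h = x @ A + b                    # the model xA + b
print(h)  # [[ 4.5 -1.5]]
```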

# Computing everything at once

There is one more step before we reach the final form.

Data usually comes in the form of tables, batches, and other bundles. In other words, we have a sequence of data samples **x**₁, **x**₂, …, **x***ₛ*, each forming a row vector, which we stack on top of each other to obtain the *data matrix X,* defined by

*X* = (**x**₁; **x**₂; …; **x***ₛ*),

where the semicolons indicate that each **x***ᵢ* is a row of *X*.

Stacking *s* number of *n*-dimensional vectors yields an *s* × *n* matrix. It’s an abuse of notation, but a good one.
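This stacking is a one-liner in NumPy. A minimal sketch with three made-up two-feature samples:

```python
import numpy as np

# Three hypothetical 2-feature samples, each a row vector
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
x3 = np.array([5.0, 6.0])

# Stacking s = 3 vectors of dimension n = 2 yields an s × n data matrix
X = np.stack([x1, x2, x3])
print(X.shape)  # (3, 2)
```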

With another abuse of notation, we might define the function **h** on the matrix *X* by

**h**(*X*) = (**h**(**x**₁); **h**(**x**₂); …; **h**(**x***ₛ*)) = (**x**₁*A* + **b**; **x**₂*A* + **b**; …; **x***ₛA* + **b**),

This looks quite repetitive, so there should be a pattern. Must we apply **h**(**x**) row by row? First, notice that adding the row vector **b** can be moved outside of the matrix via

(**x**₁*A* + **b**; …; **x***ₛA* + **b**) = (**x**₁*A*; …; **x***ₛA*) + (**b**; …; **b**).

The matrix with the **x**-es and *A* looks familiar! Indeed, this is, by definition, the product of *X* and *A*:

(**x**₁*A*; **x**₂*A*; …; **x***ₛA*) = *XA*.

Feel free to check it by hand.

One thing that sticks out like a sore thumb is that matrix of **b**-s. Can we get rid of them? Sure.

Here’s that typical math trick: overload the operations. Instead of defining another matrix *B* via stacking the **b**-s, let’s *define* the addition between an *s* × *m* matrix *M* and a 1 × *m* vector **b** by

(*M* + **b**)*ᵢⱼ* = *Mᵢⱼ* + *bⱼ*,

making our model

**h**(*X*) = *XA* + **b**,

which is as clear as it gets.

This last trick of blowing up row vectors is called *broadcasting*, and it’s built into tensor frameworks such as NumPy. (The symbol `@` denotes matrix multiplication in NumPy.)
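A minimal sketch of the batched model *XA* + **b** in NumPy, with random illustrative values; the final loop confirms that broadcasting gives the same result as applying the model row by row:

```python
import numpy as np

rng = np.random.default_rng(0)
s, n, m = 4, 3, 2                  # batch size, features, targets (illustrative)

X = rng.standard_normal((s, n))    # data matrix: one sample per row
A = rng.standard_normal((n, m))    # parameter matrix
b = rng.standard_normal((1, m))    # bias as a 1 × m row vector

H = X @ A + b                      # broadcasting adds b to every row of XA

# Same as applying h(x) = xA + b row by row:
for i in range(s):
    assert np.allclose(H[i], X[i] @ A + b[0])
```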

Our loop of generalization and abstraction becomes complete: the linear regression model is vectorized. We have gained

the ability to handle multiple features and target variables,

no additional complexity in the model,

and a massively parallel implementation that uses efficient data structures.

If you are familiar with neural networks, you know that the simple expression *XA* + **b** is the so-called linear layer, one of the most frequent building blocks. (Even the convolutional layer is often implemented as an extremely sparse linear layer.)

Even though this post was about generalization and abstraction, the computational aspects are so interesting that they are worth an entire post.

See you next time with a deep dive into how vectors and matrices work in silico!