Yesterday, I got the following question in the Mathematics of Machine Learning early access Discord:

Hi

@Tivadar I was reading the Gemma-10M Technical Overview on Medium. In the last paragraph under Infini-Attention, the author gave an intuition of a kernel:

“But remember, while these simple linear operations generally work, there’s a large graveyard of papers showing that we can’t effectively model softmax(QKᵀ) with just linear matrix multiplications. Therefore, instead of just embedding Q and K into our matrix directly, we instead apply a kernel to both Q and K, with the hope that the softmax operation can be learned as an affine transformation of Q and K in the kernel, σ. This can be a lot to digest, and I find it helpful to think of kernels in terms of how (x+y)² = x² + 2xy + y² is non-linear in both x, y; however, linear in [x², xy, y²]. In this case, our kernel σ.”

I went back to your book to look for a definition of a kernel. It was mentioned in your book for orthogonality, but not in the above context. So, I looked at the code on GitHub and found that a linear layer is applied to Q, K, and V: https://github.com/mustafaaljadery/gemma-2B-10M/blob/main/src/gemma.py#L508-L510. Is the linear operation on the input (feature maps?) considered a kernel, as mentioned in this overview? Thanks!

(Yes, there is a private Discord server for the early access members, where I answer every question that I can. Sometimes, I even write a whole post about it, as you can see.)

As the word “kernel” is one of the most overloaded terms in mathematics, I figured it would be best to answer in a newsletter post, a topic perfectly suited for the Epsilons series.

So, what is a kernel? Here we go!

# What is a kernel?

## The anatomy of inner products

Let’s talk about the inner product, or dot product, as many of you know it.

For any pair of *n*-dimensional vectors **x** and **y**, the Euclidean inner product is defined by

<**x**, **y**> = x₁y₁ + x₂y₂ + ⋯ + xₙyₙ.
You are probably closely familiar with this formula as well. However, this is not *the* inner product; it is merely an example. Admittedly, it is the most ubiquitous one, but the Euclidean product is still just one of many.

In general, given a (real) vector space *V*, an inner product is a two-variable function

<·, ·>: *V* × *V* → ℝ

that satisfies three properties:

linearity,

symmetry,

and positive definiteness,

that is,

<a**x** + b**y**, **z**> = a<**x**, **z**> + b<**y**, **z**>,

<**x**, **y**> = <**y**, **x**>,

<**x**, **x**> ≥ 0, with equality if and only if **x** = **0**.

(Because of the symmetry, inner products are also linear in the second variable.)

Let’s see one more example. For any two-dimensional vectors x = (x₁, x₂) and y = (y₁, y₂), the function

is a proper inner product. Feel free to verify the linearity, symmetry, and positive definiteness by hand to test your understanding; it’s not that hard.

Here’s another example, or, to be precise, a bundle of examples. This time, we’re back in ℝ*ⁿ*, talking about *n*-dimensional vectors. If *K* is a symmetric and positive definite *n* × *n* matrix, then the formula

<**x**, **y**>ₖ = **x**ᵀ*K***y**
also defines an inner product. If *K* is the identity matrix, then <·, ·>ₖ is the good old Euclidean product.

In case you are wondering: it’s not a coincidence that for matrices, symmetry and positive definiteness are defined as they are. But let’s not get ahead of ourselves.
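We can verify this connection numerically. Below is a minimal NumPy sketch, with a symmetric, positive definite matrix of my own choosing (it is not from the post), that checks the three inner product properties for <**x**, **y**>ₖ = **x**ᵀ*K***y**:

```python
import numpy as np

# An illustrative symmetric, positive definite 2 x 2 matrix (my choice)
K = np.array([[2.0, 1.0],
              [1.0, 2.0]])

def inner_K(x, y, K):
    """The bilinear form <x, y>_K = x^T K y."""
    return x @ K @ y

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 2))
a, b = 1.5, -0.7

# Symmetry: <x, y>_K == <y, x>_K, because K is symmetric
assert np.isclose(inner_K(x, y, K), inner_K(y, x, K))

# Linearity in the first variable: <a x + b z, y>_K == a <x, y>_K + b <z, y>_K
assert np.isclose(inner_K(a * x + b * z, y, K),
                  a * inner_K(x, y, K) + b * inner_K(z, y, K))

# Positive definiteness: all eigenvalues of K are positive,
# so <x, x>_K > 0 for any nonzero x
assert np.all(np.linalg.eigvalsh(K) > 0)
assert inner_K(x, x, K) > 0
```

Swapping in a matrix with a negative eigenvalue would break the last two assertions, which is exactly why symmetry and positive definiteness are required of *K*.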

What does all of this have to do with kernels?

## Taking inner products apart

Let’s go back to our good old friend, the *n*-dimensional real vector space ℝ*ⁿ*, and select a *basis* **e**₁, **e**₂, …, **e**ₙ:

**x** = x₁**e**₁ + x₂**e**₂ + ⋯ + xₙ**e**ₙ,
that is, every vector can be written as a linear combination of the **e***ᵢ*-s. As inner products are linear in both variables, the inner products of the basis vectors determine the inner product completely! Check it out:

<**x**, **y**> = <∑ᵢ xᵢ**e**ᵢ, ∑ⱼ yⱼ**e**ⱼ> = ∑ᵢ ∑ⱼ xᵢyⱼ<**e**ᵢ, **e**ⱼ>.
If we arrange these into a matrix, we obtain what is called the inner product’s *kernel* (as in the seed):

*K*ᵢⱼ = <**e**ᵢ, **e**ⱼ>, i, j = 1, …, n.
We can do all kinds of useful things with the kernel representation. For instance, we can rewrite the inner product in terms of matrix multiplication:

<**x**, **y**> = **x**ᵀ*K***y**.
Think about this: while the inner product is a bivariate function, the kernel is a matrix; that is, a finite set of values. A discrete object. *n* × *n* numbers!
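Here is a short NumPy sketch of this construction. The custom inner product below is an illustrative choice of mine (not from the post); the point is that its kernel, built purely from the pairwise inner products of the basis vectors, reproduces the inner product for all vectors:

```python
import numpy as np

# An illustrative custom inner product on R^2 (my choice):
# <x, y> = 2*x1*y1 + x2*y2
def inner(x, y):
    return 2 * x[0] * y[0] + x[1] * y[1]

# The standard basis of R^2, as rows of the identity matrix
e = np.eye(2)

# The kernel: the matrix of pairwise inner products of the basis vectors
K = np.array([[inner(e[i], e[j]) for j in range(2)] for i in range(2)])

# The n x n numbers in K determine the whole bivariate function:
# <x, y> == x^T K y for every x and y
rng = np.random.default_rng(1)
x, y = rng.normal(size=(2, 2))
assert np.isclose(inner(x, y), x @ K @ y)
```

Here *K* works out to [[2, 0], [0, 1]]: four numbers encode the entire inner product.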

But hold on! We’ve seen **x**ᵀ*K***y** before. Can we reverse our thinking? What if, instead of obtaining the kernel from the inner product, we define the inner product in terms of the kernel? For any matrix *K*, we can *define* the bivariate function

<**x**, **y**>ₖ = **x**ᵀ*K***y**,
which, depending on the properties of *K*, is an inner product! In fact, if K is a symmetric and positive definite matrix, then <·, ·>ₖ is an inner product.

Thus, kernels and inner products are two sides of the same coin! Mathematically speaking, the set of inner products and the set of symmetric and positive definite matrices are in one-to-one correspondence with each other.

But again, we are here to do machine learning. Why is this kernel thing useful for us?

## Kernels in machine learning

A dream scenario in data science: linearly separable datasets.

What’s on one side is predicted to be yellow, while the other is predicted to be blue.

All we want is a linearly separable dataset in the Euclidean space, where a decision is as simple as looking at the sign of **w**ᵀ**x**, the good old Euclidean product of the data sample **x** and the weight vector **w**.

I have some news. The bad news is that, unfortunately, this is almost never possible for real datasets. The good news is that the entire field of machine learning is about finding — excuse me, *learning* — feature transformations that make the data linearly separable.

Let me show you a simple example.

Here, the extremely simple linear model **w**ᵀ𝜙(**x**) with **w** = (0, 1) works perfectly on our toy dataset.

In the general case, let 𝜙: ℝ*ⁿ* → ℝ*ᵐ* be a data transformation that supposedly “straightens out” the dataset and makes it linearly separable.

How can we calculate the inner product of two transformed data points? Simple. As the transformed data points live in the Euclidean space ℝ*ᵐ*, we can just use the Euclidean product there:

*k*(**x**, **y**) = <𝜙(**x**), 𝜙(**y**)> = 𝜙(**x**)ᵀ𝜙(**y**).
This is what a *kernel function* means in machine learning. (Which is slightly different from what a kernel is in mathematics, but the two are closely related.)
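To make this concrete, here is a sketch with a hypothetical quadratic feature map of my own choosing, 𝜙(**x**) = (x₁², √2·x₁x₂, x₂²), the classic factorization behind the degree-2 polynomial kernel. Its induced kernel collapses to a formula involving only the original vectors:

```python
import numpy as np

# A hypothetical quadratic feature map on R^2 (my choice):
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

# The induced kernel is just the Euclidean product of the transformed points...
def k(x, y):
    return phi(x) @ phi(y)

# ...and it collapses to the degree-2 polynomial kernel (x^T y)^2,
# so we never have to compute phi explicitly.
rng = np.random.default_rng(2)
x, y = rng.normal(size=(2, 2))
assert np.isclose(k(x, y), (x @ y) ** 2)
```

This is the kernel trick in miniature: evaluating (**x**ᵀ**y**)² in two dimensions replaces an explicit trip through the three-dimensional feature space.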

Why is this beneficial? Because sometimes, we can rewrite models in terms of *k*(*x*, *y*), making the problem easier to handle. This way, we don’t have to handle the feature maps explicitly. For instance, the so-called radial basis function kernel

*k*(**x**, **y**) = exp(−‖**x** − **y**‖² / (2σ²))
encodes an infinite-dimensional feature mapping which we are otherwise unable to handle.
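Even though we cannot write its feature map down, the RBF kernel itself is trivial to evaluate. A minimal sketch (the bandwidth σ = 1 is an arbitrary choice of mine) that also checks the resulting Gram matrix behaves like a kernel, i.e., it is symmetric and positive semidefinite:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """The radial basis function kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Gram matrix of pairwise kernel values for a few random points
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))
G = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# Symmetric and positive semidefinite, just like a kernel matrix should be
assert np.allclose(G, G.T)
assert np.all(np.linalg.eigvalsh(G) > -1e-10)
```

The positive semidefiniteness of every such Gram matrix is precisely what makes the RBF function a legitimate kernel, mirroring the role of positive definite matrices earlier.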

So, finally, we can answer the original question:

Is the linear operation on the input considered a kernel?

To the best of my understanding, as linear operations are feature maps, they determine a kernel. If *A* is the matrix encoding our linear feature map, then its corresponding machine learning kernel is *k*(**x**, **y**) = **x**ᵀ*A*ᵀ*A***y**.
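This identity is a one-liner to check numerically. In the sketch below, the matrix *A* is a random placeholder standing in for a learned linear layer (my choice, not the actual Gemma weights):

```python
import numpy as np

rng = np.random.default_rng(4)

# A hypothetical linear feature map phi(x) = A x; A stands in for a learned layer
A = rng.normal(size=(3, 2))
x, y = rng.normal(size=(2, 2))

# Kernel induced by the feature map: k(x, y) = phi(x)^T phi(y)
lhs = (A @ x) @ (A @ y)

# ...which equals x^T (A^T A) y, so A^T A is the kernel matrix
rhs = x @ (A.T @ A) @ y
assert np.isclose(lhs, rhs)
```

Note that *A*ᵀ*A* is automatically symmetric and positive semidefinite, so the induced bivariate function is a proper (semi-)inner product of the transformed features.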


Thank you so much for such a lovely post! It ties math to ML perfectly!

A good example of kernels in machine learning is the Support Vector Machine (SVM). We can use different kernels to transform the data points so that they become linearly separable by a hyperplane; the radial basis function is a popular kernel choice in SVMs.