Yesterday, I got the following question in the Mathematics of Machine Learning early access Discord:

Hi

@Tivadar I was reading the Gemma-10M Technical Overview on Medium. In the last paragraph under Infini-Attention, the author gave an intuition for kernels:

“But remember, while these simple linear operations generally work, there’s a large graveyard of papers showing that we can’t effectively model softmax(Q K-T) with just a linear matrix multiplications. Therefore, instead of just embedding Q and K into our matrix directly, we instead apply a kernel to both Q and K, with the hope that the softmax operation can be learned as an affine transformation of Q and K in the kernel, σ. This can be a lot to digest, and I find it helpful to think of kernels in terms of how (x+y)² = x² + 2xy + y² are non-linear in both x, y; however, linear in [x², xy, y²]. In this case, our kernel σ.”

I went back to your book to look for a definition of a kernel. It was mentioned in your book for orthogonality but not…
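The identity in the quoted passage is easy to check numerically. Here is a minimal sketch in plain Python, assuming we take the map (x, y) → [x², xy, y²] as the feature map; the names `feature_map`, `WEIGHTS`, and `squared_sum_via_features` are hypothetical and chosen only for this illustration, not taken from the Gemma-10M article.

```python
import random

def feature_map(x, y):
    """Hypothetical feature map for illustration: (x, y) -> [x^2, xy, y^2]."""
    return [x * x, x * y, y * y]

# Fixed coefficients from the expansion (x + y)^2 = 1*x^2 + 2*xy + 1*y^2.
WEIGHTS = [1.0, 2.0, 1.0]

def squared_sum_via_features(x, y):
    """Compute (x + y)^2 as a linear combination of the features."""
    return sum(w * f for w, f in zip(WEIGHTS, feature_map(x, y)))

# Compare the direct formula with the linear-in-features version
# on a few random points.
random.seed(0)
points = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(5)]
max_error = max(
    abs((x + y) ** 2 - squared_sum_via_features(x, y)) for x, y in points
)
print(max_error < 1e-9)
```

The function (x + y)² is not linear in x and y, but after mapping the inputs to [x², xy, y²], it becomes a plain dot product with a fixed weight vector. That is the sense in which the quote calls the expression "linear" in the new features.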
