Explore LLM word representations using similarity analysis (part 2)

Investigate semantic information inside the attention matrices of GPT-2

May 20, 2026

What you will learn in this 2-part post series

The primary goal of this post series is to teach you Representational Similarity Analysis (RSA), a machine-learning analysis used to compare distributed representations in different systems.

If you haven’t already read Part 1 in this series, please do so! It provides necessary background about how the RSA score is calculated and interpreted.

As a brief reminder, an RSA (representational similarity analysis) works by comparing cosine similarity matrices across different embeddings spaces (layers, blocks, models, etc.). The idea is that different embeddings spaces may have distinct coordinate systems and even different dimensionalities, but if their internal representational structures are similar, the relative similarities should be strongly correlated even if the vectors are distinct.
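As a quick refresher in code, here is a minimal sketch of that idea. It assumes only NumPy; the function names (`cosine_similarity_matrix`, `rsa_score`) and the random "embeddings" are made up for illustration and are not taken from the accompanying notebook. Two spaces with different dimensionalities get a high RSA score when one is a linear re-expression of the other, and a score near zero when they are unrelated.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X (items x dims)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def rsa_score(X, Y):
    """Correlate the unique off-diagonal similarities of two spaces."""
    iu = np.triu_indices(X.shape[0], k=1)   # upper triangle, no diagonal
    sim_x = cosine_similarity_matrix(X)[iu]
    sim_y = cosine_similarity_matrix(Y)[iu]
    return np.corrcoef(sim_x, sim_y)[0, 1]

rng = np.random.default_rng(0)
space_a = rng.normal(size=(34, 32))            # 34 "words" in 32 dimensions
space_b = space_a @ rng.normal(size=(32, 512)) # same words, 512 dimensions
unrelated = rng.normal(size=(34, 512))         # no shared structure

rsa_related = rsa_score(space_a, space_b)
rsa_unrelated = rsa_score(space_a, unrelated)
print(f"related spaces:   RSA = {rsa_related:.2f}")
print(f"unrelated spaces: RSA = {rsa_unrelated:.2f}")
```

Note that `space_a` and `space_b` cannot be correlated coordinate-by-coordinate at all (they have different shapes), yet their similarity structures can still be compared.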

The additional goals of this second post are (1) to learn more about RSA and category specificity, and (2) to learn how to dissect the “hidden layers” of an LLM, in particular the Query, Keys, and Values vectors inside the transformer block. Those Q, K, and V vectors are part of the mechanism by which LLMs figure out which information from previous words is relevant for using the current word to make predictions about subsequent words.

You will discover that the adjustment vectors are largely uncorrelated while their RSA scores are quite high. These results show that although individual attention matrices have idiosyncratic internal calculations, they learn meaningful representations of words that allow them to interact in elegant and semantically meaningful ways.

If you want to learn more about the attention algorithm, I can humbly recommend my post on the topic.

This post roughly corresponds to Project 38 in my recent book on using machine-learning projects to understand how LLMs work. Don’t worry, you don’t need the book to follow this post.

How to use the code with these posts

The accompanying code file will reproduce all the figures in this post — but you can do so much more by thinking of the code as a starting-point for your continued explorations. Try changing parameters, adding new words or categories, using different similarity/distance metrics, different models, etc.

The code is available here on my GitHub. In the video below, I show how to get and run the code using Google Colab. You can also download the notebook file and run it locally, but I recommend using Colab because you won’t need to worry about local installations or library versions.

What are the Q, K, and V vectors in the attention algorithm?

You’ve probably heard of the “attention algorithm” at the heart of the LLM transformer block. Attention is an elegant and clever trick that allows language models to determine how much information is contained in each pair of tokens and how that information is relevant for generating predictions about new text.

There are three sets of activations for each token position, called the Query (Q), Keys (K), and Values (V) matrices. The idea is this: For each pair of tokens, the dot product between their corresponding Q and K vectors creates a scalar weighting value, with higher dot products indicating that more importance (attention) should be paid to that pair; that weight value then determines how much relevant information in V gets added onto the embeddings vectors.
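To make that description concrete, here is a minimal single-head sketch of standard scaled dot-product attention with a causal mask, using NumPy and random numbers in place of real activations. The sizes (5 tokens, 8 dimensions) and variable names are assumptions for illustration; the sketch ignores multiple heads and the output projection that a real transformer block applies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d = 5, 8                      # toy sizes, not GPT-2's
Q = rng.normal(size=(n_tokens, d))
K = rng.normal(size=(n_tokens, d))
V = rng.normal(size=(n_tokens, d))

# One scalar score per (query token, key token) pair
scores = Q @ K.T / np.sqrt(d)

# Causal mask: a token may only attend to itself and earlier tokens
causal = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
scores = np.where(causal, scores, -np.inf)

# Softmax over each row turns scores into weights that sum to 1
scores = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)

# Each token's adjustment is a weighted mixture of the V vectors
adjustment = weights @ V
print(weights.round(2))
```

The first token can only attend to itself, so its adjustment is simply its own V vector; later tokens mix information from everything that came before them.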

The result of the attention algorithm is an adjustment to the embeddings vectors as they pass through the transformer stack. In other words, the embeddings vectors you worked with in the previous post are rotated and scaled by each transformer layer, and those adjustments nudge the vectors from pointing towards the tokens in the input (e.g., the text prompt you gave to Claude) towards other tokens to generate an appropriate and context-relevant output.

The attention calculation is separated into “heads,” which are low-dimensional views of the hidden states that capture distinct features of the text and are combined at the end of the attention algorithm. I decided to ignore the attention heads for this post. That’s partly because we’re not working with the QKᵀ dot products, which is where the representations become head-specific, and partly in the interest of focusing on the mechanics and interpretations of RSA. An interesting extension of this project could involve running the analyses separately per-head, although that would create a dimensionality-explosion due to the number of heads and matrices that could be compared with RSA.

Import and inspect GPT-2-XL

I decided to use GPT-2-XL for this post. The code to import the model and its tokenizer from Hugging Face was shown in the previous post. The screenshot below shows an overview of the model architecture.

If you’re new to PyTorch models and LLMs, then this overview might look intimidating. The relevant information for us here is that there are 48 transformer layers (h is for “hidden layer”), and each hidden layer contains an attention block (attn) among other components like layerNorm and MLP. The Q, K, and V vectors are calculated in the c_attn layer. That matrix is 4800×1600. The 1600 corresponds to the embeddings dimensionality, and 4800 corresponds to the Q, K, and V matrices concatenated into one wide matrix (4800 = 3×1600).

Access the internal calculations using hooks

In the previous post, we didn’t need to prompt the model because we could just grab the embeddings vectors for each word we wanted to analyze. However, accessing the internal calculations of an LLM is a little more involved. The reason is that the transformer modifies each embeddings vector according to context (previous words in the text); in other words, the representation of the word “the” depends on all the words that come before it. Thus, we need to prompt the model with some text in order to analyze its internals.

But that creates a new problem: Even small LLMs create huge data matrices during each forward pass, and storing all of those internal calculations for each prompt would require terabytes of space. Therefore, the internal calculations are destroyed as soon as they’re no longer needed.

Fortunately, PyTorch provides a special method (like a function) that we can implant into the model that allows us to grab the internal calculations before they are destroyed. It’s called a “hook function” and looks like this:

There’s a lot going on in that code, and it might look intimidating if you’re new to working with LLMs in Python. But the idea is to implant the “hook” function into the attention block of each transformer layer, make a copy of the Q, K, and V activations, convert them to NumPy, and store them in a dictionary called activations that we can access later. This function gets called each time we prompt the model with some text.
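If you'd like to see the hook mechanism in isolation, here is a minimal sketch on a toy model rather than GPT-2-XL. The classes `ToyAttention` and `ToyModel` and the helper `make_hook` are hypothetical stand-ins; only `register_forward_hook` and the hook signature `(module, inputs, output)` are PyTorch's own. The sketch assumes the Q, K, and V activations come out of one joint layer (named `c_attn` here, as in GPT-2) and are concatenated in that order. The dictionary keys follow the `attn_<layer>_<q|k|v>` pattern that appears in the output later in this post.

```python
import torch
import torch.nn as nn

d_model, n_layers = 16, 2    # toy sizes; GPT-2-XL uses 1600 and 48

class ToyAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.c_attn = nn.Linear(d_model, 3 * d_model)  # joint Q, K, V layer

    def forward(self, x):
        q, k, v = self.c_attn(x).split(d_model, dim=-1)
        return x + v   # placeholder; real attention mixes V using Q-K weights

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.h = nn.ModuleList([ToyAttention() for _ in range(n_layers)])

    def forward(self, x):
        for block in self.h:
            x = block(x)
        return x

activations = {}

def make_hook(layer_idx):
    # The closure remembers which layer this hook belongs to
    def hook(module, inputs, output):
        q, k, v = output.detach().split(d_model, dim=-1)
        activations[f"attn_{layer_idx}_q"] = q.cpu().numpy()
        activations[f"attn_{layer_idx}_k"] = k.cpu().numpy()
        activations[f"attn_{layer_idx}_v"] = v.cpu().numpy()
    return hook

model = ToyModel()
handles = [block.c_attn.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.h)]

with torch.no_grad():
    model(torch.randn(34, 5, d_model))   # 34 sequences of 5 tokens each

for handle in handles:
    handle.remove()                      # detach the hooks when finished

print(sorted(activations), activations["attn_0_q"].shape)
```

Removing the hooks at the end is optional, but it prevents later forward passes from silently overwriting the stored activations.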

Now we’re ready to prompt the model using the 34 words in 3 categories that you used in the previous post.

But here’s the thing about language models: They’re not trained to process isolated words; they’re trained to extract rich and context-specific meaning from sequences of words like sentences and paragraphs that contain hundreds or thousands of words. Presenting one token at a time to an LLM will elicit unusual and outlier-like activations. Indeed, most interpretability analyses of LLMs specifically exclude the first token in the sequence because of its extreme activation patterns.

So I’ve done something very simple: I’ve presented to the model the sentence “The next word is ___”, replacing the “___” with each of the 34 words that we want to analyze. This is good experimental design because it means that all words have identical context, and thus any differences and similarities can only be attributed to world-knowledge that the model learned about each word.

The screenshot below shows code that creates the batch of token sequences and prompts GPT-2 with those sequences. When the model runs through its calculations, the hook function is activated, copying and storing the attention activations vectors into the dictionary.

Output:

(dict_keys(['attn_0_q', 'attn_0_k', 'attn_0_v', 'attn_1_q', 'attn_1_k', 'attn_1_v', 'attn_2_q', 'attn_2_k', 'attn_2_v', 'attn_3_q', 'attn_3_k', 'attn_3_v', 'attn_4_q', 'attn_4_k', 'attn_4_v', 'attn_5_q', 'attn_5_k', 'attn_5_v', 'attn_6_q', 'attn_6_k', 'attn_6_v', 'attn_7_q', 'attn_7_k', 'attn_7_v', 'attn_8_q', 'attn_8_k', 'attn_8_v', 'attn_9_q', 'attn_9_k', 'attn_9_v', 'attn_10_q', 'attn_10_k', 'attn_10_v', 'attn_11_q', 'attn_11_k', 'attn_11_v', 'attn_12_q', 'attn_12_k', 'attn_12_v', 'attn_13_q', 'attn_13_k', 'attn_13_v', 'attn_14_q', 'attn_14_k', 'attn_14_v', 'attn_15_q', 'attn_15_k', 'attn_15_v', 'attn_16_q', 'attn_16_k', 'attn_16_v', 'attn_17_q', 'attn_17_k', 'attn_17_v', 'attn_18_q', 'attn_18_k', 'attn_18_v', 'attn_19_q', 'attn_19_k', 'attn_19_v', 'attn_20_q', 'attn_20_k', 'attn_20_v', 'attn_21_q', 'attn_21_k', 'attn_21_v', 'attn_22_q', 'attn_22_k', 'attn_22_v', 'attn_23_q', 'attn_23_k', 'attn_23_v', 'attn_24_q', 'attn_24_k', 'attn_24_v', 'attn_25_q', 'attn_25_k', 'attn_25_v', 'attn_26_q', 'attn_26_k', 'attn_26_v', 'attn_27_q', 'attn_27_k', 'attn_27_v', 'attn_28_q', 'attn_28_k', 'attn_28_v', 'attn_29_q', 'attn_29_k', 'attn_29_v', 'attn_30_q', 'attn_30_k', 'attn_30_v', 'attn_31_q', 'attn_31_k', 'attn_31_v', 'attn_32_q', 'attn_32_k', 'attn_32_v', 'attn_33_q', 'attn_33_k', 'attn_33_v', 'attn_34_q', 'attn_34_k', 'attn_34_v', 'attn_35_q', 'attn_35_k', 'attn_35_v', 'attn_36_q', 'attn_36_k', 'attn_36_v', 'attn_37_q', 'attn_37_k', 'attn_37_v', 'attn_38_q', 'attn_38_k', 'attn_38_v', 'attn_39_q', 'attn_39_k', 'attn_39_v', 'attn_40_q', 'attn_40_k', 'attn_40_v', 'attn_41_q', 'attn_41_k', 'attn_41_v', 'attn_42_q', 'attn_42_k', 'attn_42_v', 'attn_43_q', 'attn_43_k', 'attn_43_v', 'attn_44_q', 'attn_44_k', 'attn_44_v', 'attn_45_q', 'attn_45_k', 'attn_45_v', 'attn_46_q', 'attn_46_k', 'attn_46_v', 'attn_47_q', 'attn_47_k', 'attn_47_v']), (34, 5, 1600))

The size of each data tensor is 34×5×1600. There are 34 target words embedded into a sentence comprising 5 tokens (“The next word is ___”) with an embeddings dimensionality of 1600.

We’re almost ready for the analyses! The last preparatory step is to create two matrix masks that will allow us to identify the word pairs that are within-category (e.g., galaxy-comet and bed-wall) vs. across-category (e.g., star-sofa and window-banana).

To create those masks, I identified the diagonal blocks corresponding to the word pairs of the same category, and subtracted that from an upper-triangular matrix. The result is the two matrices in panels C and D in Figure 1 below.

Figure 1: Creating two binary matrix masks (panels C and D) to isolate and extract the within- vs. across-category similarity values from symmetric matrices. Panels A and B show the two intermediate matrix masks from which the key masks are created.

The upshot is this: I can apply those mask matrices to cosine similarity matrices to extract the similarity values within- vs. across-categories, and then apply the RSA method you learned about in the previous post.
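Here is a minimal sketch of one way to build such masks. It uses boolean logic instead of the subtraction described above, which should give equivalent masks. The category sizes (12, 11, 11) are a hypothetical split of the 34 words; substitute the real sizes from your word lists.

```python
import numpy as np

category_sizes = [12, 11, 11]           # hypothetical split of 34 words
labels = np.repeat(np.arange(len(category_sizes)), category_sizes)
n = labels.size

# True wherever the row word and column word share a category (diagonal blocks)
same_category = labels[:, None] == labels[None, :]

# True strictly above the diagonal: each word pair once, no self-pairs
upper = np.triu(np.ones((n, n), dtype=bool), k=1)

within_mask = upper & same_category
across_mask = upper & ~same_category

print(within_mask.sum(), across_mask.sum())
```

Indexing a 34×34 cosine similarity matrix with `within_mask` or `across_mask` returns a flat vector containing only the similarity values for those word pairs.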

Correlating Q, K, and V activations

The goal of this section is to correlate the activations between pairs of attention activation matrices. Spoiler alert: The correlations will be close to zero. But you need to see how small these correlations are in order to appreciate the insights gained from applying the RSA technique.

To begin, I ran the analysis in layer index 6. The scatter plots in Figure 2 show the correlations between all pairs of attention vectors for the second token position.

Figure 2: Direct correlations amongst the three attention matrices. Each dot in the scatter plots reflects the activation value for one embeddings dimension for one token position.

The correlation between Q and K is weakly negative, while the other two pairs have correlations near zero. The negative correlation between Q and K stems from the shift towards negative values in QKᵀ, which is an important part of the attention algorithm but isn’t relevant here. Overall, it seems that the activations in one matrix are unrelated to the activations in the other matrices — even though the tokens and contexts are identical.
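The direct-correlation computation itself is short. The sketch below is one plausible reading of it, not the notebook's code: it takes one token position, flattens words × dimensions into a long vector for each matrix, and correlates the vectors pairwise. The `activations` dictionary is filled with random placeholder data of the right shape, so the resulting numbers say nothing about GPT-2; swap in the real hooked activations to reproduce the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
shape = (34, 5, 1600)       # words x tokens x embedding dimensions
layer, token_idx = 6, 1     # layer index 6, second token position

# Placeholder data standing in for the hooked Q, K, and V activations
activations = {f"attn_{layer}_{m}": rng.normal(size=shape) for m in "qkv"}

def flat(matrix_name):
    """All words and dimensions at one token position, as one long vector."""
    return activations[f"attn_{layer}_{matrix_name}"][:, token_idx, :].ravel()

pairs = [("q", "k"), ("q", "v"), ("k", "v")]
direct_r = {f"{a}-{b}": np.corrcoef(flat(a), flat(b))[0, 1] for a, b in pairs}

for pair, r in direct_r.items():
    print(f"{pair}: r = {r:+.3f}")
```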

By the way, I extracted the data from the second token instead of the final (target) token to demonstrate that the weak correlations are not trivially due to processing different tokens; all 34 sequences are identical up to this point. In fact, some of the correlations are even closer to zero when using the final token. You can explore that yourself in the code.

Next I repeated the analysis for each transformer layer in a for-loop. That’s a lot of scatter plots to look at, so instead I visualized the correlation coefficients (Figure 3).

Figure 3: The correlations in Figure 2 were repeated for all layers and visualized here.

The results are similar throughout the model: Near-zero correlations for the pairs involving the V vectors, and weakly negative correlations between Q and K across most layers (those negative correlations impose sparsity for reasons that I detail in other posts).

What have we learned so far? In the previous post, you saw that although the embeddings spaces in different models are not directly comparable, their internal statistical relational structures are highly consistent, at least in the small sample we examined. How about the different attention matrices here in this post; do they have high RSA scores despite low direct correlations? Perhaps you’ve guessed that the answer is Yes, but let’s not trust our intuition; instead, let’s gather statistical evidence to make data-informed decisions.

We will start with focusing on data from one transformer layer to build visualization, intuition, and code, and then we’ll expand that analysis to all the layers.

Cosine similarities and RSA (one layer)

Remember from the previous post that RSA involves correlating similarity values, not correlating dimension-coordinate values. So to calculate an RSA score, we first need to calculate the within-matrix similarity values across all word pairs.

I’ll start with data from layer 6. In the code, I extract the Q, K, and V activations from the final token from all batch sequences, and calculate the token × token cosine similarity matrices within each attention matrix.

The cosine similarity matrices are interesting to look at: The within-category similarities (block-diagonals) are visually higher than the across-category off-diagonal elements, and the overall similarities are highest for K and weakest for V.

Figure 4: Cosine similarity matrices across all word pairs (x- and y-axes) within each attention matrix, from one layer. Dashed lines indicate category boundaries. Colormaps have the same limits for all matrices.

Now we can calculate the pairwise RSA scores by correlating the non-redundant and non-trivial cosine similarity values across the pairs of attention matrices. I’m not separating into within- vs. across-category submatrices yet; the RSA scores here are based on all words.

Figure 5: Scatter plots showing the RSA results for the three pairs of matrices. Each marker is one cosine similarity value from the upper-triangle of the matrices shown in Figure 4. The correlation (r) value is the RSA score.

After having seen the weak correlations of activation values across the attention matrices, the RSA is a remarkable and refreshing change: Even the “weakest” RSA is still around .85.

The interpretation of the results so far is that the Q, K, and V matrices represent token updates in distinct (often orthogonal) ways, but the nature of how those adjustments relate to each other is comparable across the matrices. In other words, the internal representations are similar while the coordinate spaces are distinct.
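An idealized toy case shows how both results can hold at once. In the sketch below (NumPy only, synthetic data, hypothetical variable names), a second "space" is created by rotating the first with a random orthogonal matrix. The coordinates become essentially uncorrelated, but cosine similarities are unchanged by rotation, so the RSA score is 1. Real attention matrices are not exact rotations of each other, which is why the RSA scores in this post are high but below 1.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(34, 64))                 # 34 "words" in 64 dimensions

# Random orthogonal matrix from a QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(64, 64)))
Y = X @ R                                     # same geometry, new coordinates

# Direct correlation of coordinate values
direct_r = np.corrcoef(X.ravel(), Y.ravel())[0, 1]

def upper_cosine_similarities(A):
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    return (An @ An.T)[np.triu_indices(A.shape[0], k=1)]

# RSA: correlation of similarity values
rsa = np.corrcoef(upper_cosine_similarities(X),
                  upper_cosine_similarities(Y))[0, 1]

print(f"direct correlation: {direct_r:+.3f}")
print(f"RSA score:          {rsa:.3f}")
```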

That is no accident: If the three sets of attention vectors were already so closely correlated, then they would be redundant and the attention algorithm wouldn’t be terribly useful. Instead, each of the three matrices (especially V) is trained into orthogonal spaces so that their unique contributions to the hidden-state adjustments can be information-rich and context-selective.

But before we get too excited about this result, let’s see if these observations are unique to this layer, or whether we see similar results in other layers.

Laminar profile of RSA scores

The online code file shows how to repeat the RSA analysis in a for-loop over all transformer layers. Figure 6 below shows the laminar profiles of the correlation coefficients (RSA scores).

Figure 6: The RSA values calculated in Figure 5 were repeated for each layer and visualized.

With the exception of the first transformer layer, the RSA scores are all roughly equally strong, around .9, with no visually obvious change with depth into the model.

Category separability in one layer

We still haven’t incorporated the categories into the analyses. That’s the goal of the last two sections of this post. We will quantify the category separability of cosine similarity values within each of the attention matrices using an effect size calculation called Cohen’s d (unrelated 🙂). And then we’ll calculate the RSA scores separately for the similarity values within- vs. across-categories.

Before getting into the details, let’s build some intuition by visualizing distributions of cosine similarity values. I’ll use the within- and across-category mask matrices I created earlier to isolate the within- and across-category cosine similarity values from one transformer layer. See Figure 7.

Figure 7: Histograms of cosine similarity values, separated by within- (blue) vs. across- (orange) category, for each of the attention matrices. Notice the differences in the similarity values (x-axis ticks in different panels). The title of each panel also indicates Cohen’s d, a measure of effect size used here to quantify category separability.

The distributions are clearly well-separated: For each attention matrix, the similarity values are stronger within- vs. across-categories. This result shows that even deep inside the attention algorithm, LLMs have some relational structure that incorporates semantic world-knowledge into their token adjustment vectors.

The d-values in the titles are Cohen’s d effect size values that I calculated using the compute_effsize function in the pingouin library. Cohen’s d is the difference between the means of the two distributions, scaled by their pooled standard deviation. It’s closely related to the t-value, but is scaled to give more interpretable results. Effect sizes of around 2.5 are very large; in the experimental psychology literature, by comparison, researchers are very happy with effect sizes of around .6 to .8.
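If you want to see what is inside that calculation, here is a minimal NumPy sketch of Cohen's d for two independent samples using the pooled standard deviation. I believe this matches pingouin's default for unpaired data, but treat that as an assumption and check the pingouin documentation. The function name `cohens_d` and the eight similarity values are made up for illustration.

```python
import numpy as np

def cohens_d(x, y):
    """Mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Made-up cosine similarity values, just to exercise the function
within = np.array([0.6, 0.7, 0.8, 0.9])
across = np.array([0.2, 0.3, 0.4, 0.5])

d = cohens_d(within, across)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in units of standard deviations, it can be compared across attention matrices even when their raw similarity values sit in different ranges.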

Laminar profile of category separability and RSA

The final part of this post is to calculate the effect size and RSA score for each layer, in order to determine whether category specificity evolves across the transformer stack.

The coding here is fairly straightforward, and requires only some minor modifications, for example splitting the RSA calculations by category class and storing the results per-layer.
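As a rough illustration of that split, the sketch below computes one RSA score from the within-category word pairs and another from the across-category pairs. Everything in it is a stand-in: the category sizes are hypothetical, and `acts_a` and `acts_b` are synthetic clustered data playing the role of two attention matrices from one layer. In the real analysis, this would sit inside the loop over layers.

```python
import numpy as np

rng = np.random.default_rng(4)
labels = np.repeat([0, 1, 2], [12, 11, 11])        # hypothetical category sizes
n = labels.size

# Synthetic data: shared category structure, two different coordinate systems
centers = rng.normal(size=(3, 32))
latent = centers[labels] + 0.8 * rng.normal(size=(n, 32))
acts_a = latent @ rng.normal(size=(32, 128))       # stand-in for one matrix
acts_b = latent @ rng.normal(size=(32, 128))       # stand-in for another

def cosine_similarity_matrix(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

upper = np.triu(np.ones((n, n), dtype=bool), k=1)
same = labels[:, None] == labels[None, :]
masks = {"within": upper & same, "across": upper & ~same}

sim_a = cosine_similarity_matrix(acts_a)
sim_b = cosine_similarity_matrix(acts_b)

# One RSA score per category class
rsa = {name: np.corrcoef(sim_a[mask], sim_b[mask])[0, 1]
       for name, mask in masks.items()}
print(rsa)
```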

Here are the results:

Figure 8: Category separability effect size (from Figure 7) and RSA scores are shown across all layers.

The results show high category separability and RSA scores throughout the depth of the model, with some relative decrease in separability around the middle layers.

Interestingly, the across-category RSA scores seem to be weaker than the within-category scores. Let’s investigate that observation more.

The scatter plot below shows the direct comparison of within- to across-category RSA scores, with each marker indicating one attention matrix in one layer, and the diagonal line showing unity.

Figure 9: Comparing RSA scores within- vs. across-category. Each marker is an RSA score from one attention matrix and one layer. The color indicates the layer index, going from earlier transformer layers in dark purple, to later layers in yellow. The diagonal line of unity indicates equal RSA scores; data values above the line indicate stronger within- compared to across-category RSA.

Every single marker is above the line of unity, meaning that each within-category RSA score is larger than the corresponding across-category score from the same matrix and layer. You don’t need a statistical test to see that it’s a real effect!

By the way, this is not simply due to a ceiling effect of RSA scores (they are bounded by 1), because correcting for the ceiling by re-running the analysis using the Fisher-z transform doesn’t change the key result. I don’t show that re-analysis here, but you can test it by applying the np.atanh function (np.arctanh in NumPy versions before 2.0) to the RSA scores.
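Here is a minimal sketch of that transform, using made-up RSA scores rather than values from the figures. It uses `np.arctanh`, which is available in all NumPy versions. The Fisher-z transform is monotonic, so it can never flip which of two scores is larger; what it changes is the size of the gaps, stretching out differences that are compressed near the ceiling of 1.

```python
import numpy as np

# Made-up RSA scores for three layers, for illustration only
rsa_within = np.array([0.95, 0.93, 0.96])
rsa_across = np.array([0.90, 0.88, 0.91])

# Fisher-z transform: maps (-1, 1) onto the whole real line
z_within = np.arctanh(rsa_within)
z_across = np.arctanh(rsa_across)

raw_gap = rsa_within - rsa_across
z_gap = z_within - z_across

print("gap in r:", raw_gap.round(3))
print("gap in z:", z_gap.round(3))
```

Any statistics you run on the within- vs. across-category difference (a paired t-test, for example) would be better done on the z-values than on the raw correlations.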

These results reveal that semantic information is present within each attention matrix, can be quantified by a linear analysis (though this does not prove that the representations are fully linear), and the relational structures are comparable across matrices despite them being nearly orthogonal spaces.

So you wanna learn more?

If you think that leveraging machine-learning techniques to investigate and understand LLM architecture and internal calculations is a useful approach, then please consider checking out my book from which these posts were adapted, and/or my 90-hour video-based course on LLM architecture, training, and interpretability. And of course, you can check out my other Substack posts.

A guest post by

Mike X Cohen, PhD

ex-neuroscience professor | textbook author (linear algebra, stats, calculus) | best-selling Udemy instructor (AI, machine-learning, coding, math) | LinkedIn non-influencer | founder @ Sincxpress.com. You can learn a lot of math with a bit of code.

The Palindrome
