ML Katas

Deconstructing Self-Attention Scores

hard (>1 hr) · Attention · Transformers · Softmax · NLP

The self-attention mechanism is a core component of Transformers. Let's break down how attention scores are calculated.

  1. Query, Key, Value: In self-attention, each input token (or its embedding) is transformed into a Query (Q), Key (K), and Value (V) vector. Assume we have three input tokens, and for a specific head, their corresponding Q, K, and V vectors are given:
    • Q1=[1,0]
    • K1=[1,1]
    • V1=[0.5,0.5]
    • Q2=[0,1]
    • K2=[0,1]
    • V2=[1.0,0.0]
    • Q3=[1,1]
    • K3=[0,0]
    • V3=[0.0,1.0] (For simplicity, we'll omit the scaling factor 1/√d_k for now.)
  2. Dot-Product Attention: Calculate the unnormalized attention scores (dot products) for Query Q1 with all Keys (K1,K2,K3). These are often called "logits". (Steps 2–4 are worked through numerically after this list.)
  3. Softmax: Apply the softmax function to these unnormalized scores to get the attention weights for Q1. Recall: softmax(x_i) = e^{x_i} / Σ_j e^{x_j}.
  4. Weighted Sum: Using the attention weights for Q1 and the Value vectors (V1,V2,V3), compute the output vector for the first token, which is a weighted sum of the Value vectors.
  5. Intuition: What do these attention weights signify? How does this mechanism allow a token to "pay attention" to other tokens?
  6. Verification: Implement the dot-product attention and softmax steps in Python.
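For reference, here are steps 2–4 worked out by hand with the vectors above; the Python below should reproduce these numbers:

  • Logits: Q1·K1 = 1·1 + 0·1 = 1, Q1·K2 = 1·0 + 0·1 = 0, Q1·K3 = 1·0 + 0·0 = 0, giving [1, 0, 0].
  • Softmax: with e^1 ≈ 2.718 and e^0 = 1, the weights are [2.718, 1, 1] / 4.718 ≈ [0.576, 0.212, 0.212].
  • Weighted sum: 0.576·[0.5, 0.5] + 0.212·[1.0, 0.0] + 0.212·[0.0, 1.0] ≈ [0.5, 0.5].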
import numpy as np

# Q, K, V vectors from the problem, one row per token
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
V = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])

def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Unnormalized scores (logits) for Q1 against all keys
scores_q1 = Q[0] @ K.T                     # [1.0, 0.0, 0.0]
attention_weights_q1 = softmax(scores_q1)  # ~[0.576, 0.212, 0.212]
output_q1 = attention_weights_q1 @ V       # ~[0.5, 0.5]

print("logits:", scores_q1)
print("weights:", attention_weights_q1)
print("output:", output_q1)
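Continuing from the snippet above, a minimal vectorized sketch: the same computation extends to all queries at once, with row i of attn holding the attention weights for Q_i. In a real Transformer the logits would first be divided by √d_k (here √2) before the softmax, which is the scaling we omitted in step 1.

# All queries at once: row i of `attn` holds the attention weights for Q_i
attn = softmax(Q @ K.T)   # softmax normalizes each row (axis=-1)
outputs = attn @ V        # row i is the output vector for token i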