ML Katas

Deconstructing Self-Attention Scores

hard (>1 hr) · Attention · Transformers · Softmax · NLP

The self-attention mechanism is a core component of Transformers. Let's break down how attention scores are calculated.

  1. Query, Key, Value: In self-attention, each input token (or its embedding) is transformed into a Query (Q), Key (K), and Value (V) vector. Assume we have three input tokens, and for a specific head, their corresponding Q, K, and V vectors are given:
    • Q1=[1,0]
    • K1=[1,1]
    • V1=[0.5,0.5]
    • Q2=[0,1]
    • K2=[0,1]
    • V2=[1.0,0.0]
    • Q3=[1,1]
    • K3=[0,0]
    • V3=[0.0,1.0] (For simplicity, we'll omit the scaling factor 1/√d_k for now.)
  2. Dot-Product Attention: Calculate the unnormalized attention scores (dot products) for Query Q1 with all Keys (K1,K2,K3). These are often called "logits". (Steps 2–4 are worked through numerically after this list.)
  3. Softmax: Apply the softmax function to these unnormalized scores to get the attention weights for Q1. Recall: softmax(x_i) = e^{x_i} / Σ_j e^{x_j}.
  4. Weighted Sum: Using the attention weights for Q1 and the Value vectors (V1,V2,V3), compute the output vector for the first token, which is a weighted sum of the Value vectors.
  5. Intuition: What do these attention weights signify? How does this mechanism allow a token to "pay attention" to other tokens?
  6. Verification: Implement the dot-product attention and softmax steps in Python.
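For reference, here are steps 2–4 worked out by hand with the vectors above; the Python below should reproduce these numbers:

  • Logits: Q1·K1 = 1·1 + 0·1 = 1, Q1·K2 = 1·0 + 0·1 = 0, Q1·K3 = 1·0 + 0·0 = 0, giving [1, 0, 0].
  • Softmax: with e^1 ≈ 2.718 and e^0 = 1, the weights are [2.718, 1, 1] / 4.718 ≈ [0.576, 0.212, 0.212].
  • Weighted sum: 0.576·[0.5, 0.5] + 0.212·[1.0, 0.0] + 0.212·[0.0, 1.0] ≈ [0.5, 0.5].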
import numpy as np

# Q, K, V vectors from the problem, one row per token
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
V = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])

def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Unnormalized scores (logits) for Q1 against all keys
scores_q1 = Q[0] @ K.T                     # [1.0, 0.0, 0.0]
attention_weights_q1 = softmax(scores_q1)  # ~[0.576, 0.212, 0.212]
output_q1 = attention_weights_q1 @ V       # ~[0.5, 0.5]

print("logits:", scores_q1)
print("weights:", attention_weights_q1)
print("output:", output_q1)
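Continuing from the snippet above, a minimal vectorized sketch: the same computation extends to all queries at once, with row i of attn holding the attention weights for Q_i. In a real Transformer the logits would first be divided by √d_k (here √2) before the softmax, which is the scaling we omitted in step 1.

# All queries at once: row i of `attn` holds the attention weights for Q_i
attn = softmax(Q @ K.T)   # softmax normalizes each row (axis=-1)
outputs = attn @ V        # row i is the output vector for token i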