Deconstructing Self-Attention Scores
The self-attention mechanism is a core component of Transformers. Let's break down how attention scores are calculated.
- Query, Key, Value: In self-attention, each input token (or its embedding) is transformed into a Query (Q), Key (K), and Value (V) vector. Assume we have three input tokens, and for a specific head, their corresponding Q, K, and V vectors are given in the problem (a sketch of this projection step follows the verification code below).
- (For simplicity, we'll omit the $1/\sqrt{d_k}$ scaling factor for now; it reappears in the full sketch after the verification code.)
- Dot-Product Attention: Calculate the unnormalized attention scores (dot products) for Query $Q_1$ with all Keys ($Q_1 \cdot K_1$, $Q_1 \cdot K_2$, $Q_1 \cdot K_3$). These are often called "logits".
- Softmax: Apply the softmax function to these unnormalized scores to get the attention weights for $Q_1$. Recall: $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$.
- Weighted Sum: Using the attention weights for $Q_1$ and the Value vectors ($V_1$, $V_2$, $V_3$), compute the output vector for the first token, which is a weighted sum of the Value vectors.
- Intuition: What do these attention weights signify? How does this mechanism allow a token to "pay attention" to other tokens?
- Verification: Implement the dot-product attention and softmax steps in Python.
import numpy as np

# Example Q, K, V vectors (use the ones provided in the problem)
# Q = np.array([...])  # all three Query vectors, one row per token
# K = np.array([...])  # all three Key vectors, one row per token
# V = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])  # all three Value vectors

def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Scores for Q1: dot product of the first Query with every Key
# scores_q1 = np.dot(Q[0], K.T)
# attention_weights_q1 = softmax(scores_q1)
# output_q1 = np.dot(attention_weights_q1, V)
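As an aside on the first step: the Q, K, and V vectors are produced by multiplying each token embedding by learned projection matrices. The sketch below uses made-up embeddings and weight matrices (not the values given in the problem) purely to show the shapes involved.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: 3 tokens, embedding dim 4, head dim 2 (illustrative only)
X = rng.normal(size=(3, 4))    # token embeddings, one row per token
W_Q = rng.normal(size=(4, 2))  # learned Query projection
W_K = rng.normal(size=(4, 2))  # learned Key projection
W_V = rng.normal(size=(4, 2))  # learned Value projection
Q = X @ W_Q  # (3, 2): one Query vector per token
K = X @ W_K  # (3, 2): one Key vector per token
V = X @ W_V  # (3, 2): one Value vector per token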
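Putting the dot-product, softmax, and weighted-sum steps together, and restoring the $1/\sqrt{d_k}$ scaling omitted above, here is a minimal sketch of scaled dot-product attention over all three tokens at once. The Q and K arrays are illustrative placeholders, not the vectors from the problem; V reuses the example values from the code above.

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))  # row-wise shift for stability
    return e_x / e_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (3, 3) scaled logits: one row per Query
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights      # weighted sums of the Value vectors

# Placeholder Q and K (illustrative only); V is the example from the code above
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])

outputs, weights = scaled_dot_product_attention(Q, K, V)
print(weights[0])  # attention weights for the first token
print(outputs[0])  # output vector for the first token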