Implementing a Multi-Headed Attention Mechanism
Expand on the previous attention exercise by implementing a multi-head attention mechanism from scratch. A single attention head is the scaled dot-product attention you have already implemented. Multi-head attention projects the queries, keys, and values into several lower-dimensional 'heads,' performs attention for each head in parallel, concatenates the head outputs, and passes the result through a final linear layer. This allows the model to jointly attend to information from different representation subspaces, and it is the core component of the Transformer architecture. A sketch of one possible implementation follows.
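The snippet below is a minimal sketch of such a module in PyTorch, assuming an embedding dimension that divides evenly by the number of heads; the class and attribute names (`MultiHeadAttention`, `q_proj`, `head_dim`, etc.) are illustrative choices, not names prescribed by the exercise.

```python
# Minimal multi-head attention sketch (illustrative names, not prescribed).
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Learned projections for queries, keys, values, and the final output.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value):
        # query/key/value: (batch, seq_len, embed_dim)
        batch, seq_len, embed_dim = query.shape

        def split_heads(x):
            # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
            return x.view(batch, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(query))
        k = split_heads(self.k_proj(key))
        v = split_heads(self.v_proj(value))

        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v  # (batch, num_heads, seq_len, head_dim)

        # Concatenate the heads back into (batch, seq_len, embed_dim) and project.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
        return self.out_proj(context)
```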
Verification: Verify that your multi-head attention output has the correct shape: after concatenating the heads and applying the final linear layer, the output dimension should match the input dimension. Then compare the output of your custom implementation to torch.nn.MultiheadAttention to confirm correctness, as sketched below.
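One way to make the comparison meaningful is to copy your projection weights into the reference module so both compute with identical parameters. The check below is a sketch under that assumption and reuses the hypothetical `MultiHeadAttention` class from the previous snippet.

```python
# Verification sketch: shape check plus numerical comparison against PyTorch's
# built-in nn.MultiheadAttention, with weights copied from the custom module.
import torch
import torch.nn as nn

embed_dim, num_heads, batch, seq_len = 16, 4, 2, 5
x = torch.randn(batch, seq_len, embed_dim)

custom = MultiHeadAttention(embed_dim, num_heads)      # class from the sketch above
reference = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Pack the custom q/k/v projection weights into the reference's fused in_proj.
with torch.no_grad():
    reference.in_proj_weight.copy_(torch.cat(
        [custom.q_proj.weight, custom.k_proj.weight, custom.v_proj.weight], dim=0))
    reference.in_proj_bias.copy_(torch.cat(
        [custom.q_proj.bias, custom.k_proj.bias, custom.v_proj.bias], dim=0))
    reference.out_proj.weight.copy_(custom.out_proj.weight)
    reference.out_proj.bias.copy_(custom.out_proj.bias)

ours = custom(x, x, x)
theirs, _ = reference(x, x, x)

assert ours.shape == (batch, seq_len, embed_dim)    # output shape matches input
assert torch.allclose(ours, theirs, atol=1e-5)      # outputs agree numerically
print("Custom multi-head attention matches torch.nn.MultiheadAttention")
```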