- **Implementing a Multi-Headed Attention Mechanism**
Expand on the previous attention exercise by implementing a **Multi-Headed Attention mechanism** from scratch. A single attention head is the scaled dot-product attention you have already implemented; multi-head attention runs several such heads in parallel over learned linear projections of the queries, keys, and values, then concatenates their outputs and applies a final output projection.
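
As a rough guide, a solution could take the following shape. This is a minimal sketch, not a prescribed answer: the class name `MultiHeadAttention`, the `d_model`/`num_heads` parameter names, and the choice of four separate `nn.Linear` projections are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention (names and layout are illustrative)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for queries, keys, values, plus the output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape

        def split(t):
            # Split the model dimension into (num_heads, d_head) and move heads forward.
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Scaled dot-product attention per head: scores are (batch, heads, seq, seq).
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (batch, heads, seq, d_head)
        # Concatenate the heads back into d_model and apply the output projection.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)
```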
- **Implementing the Adam Optimizer from Scratch**
Implement the **Adam optimizer from scratch** as a subclass of `torch.optim.Optimizer`. You'll need to manage the first-moment vector (moving average of gradients) and the second-moment vector (moving average of squared gradients) for each parameter, apply the bias corrections, and combine them in the update rule.
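
One possible sketch is below, under the assumption that the moment buffers are stored in the optimizer's per-parameter `state` dict; the state keys `m` and `v` and the default hyperparameters are illustrative choices.

```python
import torch
from torch.optim import Optimizer

class Adam(Optimizer):
    """Adam from scratch; hyperparameter names follow the usual convention."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                # Lazily create the per-parameter moment buffers on first use.
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)  # first moment (mean of gradients)
                    state["v"] = torch.zeros_like(p)  # second moment (mean of squared gradients)
                state["step"] += 1
                t, m, v = state["step"], state["m"], state["v"]
                # Exponential moving averages of the gradient and its square.
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                # Bias-corrected estimates, then the parameter update.
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])
        return loss
```

Usage mirrors the built-in optimizers: construct it with `Adam(model.parameters(), lr=1e-3)` and drive it with the usual `zero_grad()` / `backward()` / `step()` loop.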
- **Building a Transformer Encoder from Scratch**
Implement a single layer of a **Transformer Encoder** from scratch, without using `torch.nn.TransformerEncoderLayer`. This requires implementing a multi-head self-attention module and a position-wise feed-forward network, and wiring them together with residual connections and layer normalization.
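
A sketch of one way the layer could be assembled is shown below. It assumes the `MultiHeadAttention` class from the first exercise's sketch is in scope; the post-norm ordering, the ReLU feed-forward block, and the `d_ff` parameter name are illustrative assumptions rather than required choices.

```python
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """Single encoder layer: self-attention and a position-wise feed-forward
    network, each wrapped in a residual connection followed by layer norm.
    Assumes the MultiHeadAttention class from the earlier sketch is defined."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sublayer 1: multi-head self-attention with residual connection + layer norm.
        attn_out = self.self_attn(x, mask=mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sublayer 2: position-wise feed-forward with residual connection + layer norm.
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x
```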