Building a Transformer Encoder from Scratch
Implement a single Transformer encoder layer from scratch, without using torch.nn.TransformerEncoderLayer. This requires implementing a multi-head self-attention module and a position-wise feed-forward network. You will need to handle the query/key/value projections, the attention weights, and the weighted sum over the value vectors; pay close attention to the softmax and matmul operations.
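
A minimal sketch of what such a layer can look like is shown below. It assumes the default nn.TransformerEncoderLayer conventions (post-norm residuals, ReLU activation, input shaped as seq_len × batch × d_model, no dropout); the class name CustomEncoderLayer and the helper split_heads are illustrative, not part of any library API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048):
        super().__init__()
        self.d_model = d_model
        self.nhead = nhead
        self.head_dim = d_model // nhead
        # Fused query/key/value projection, mirroring nn.MultiheadAttention's in_proj
        self.in_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Position-wise feed-forward network
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        # Post-norm layer norms (the nn.TransformerEncoderLayer default)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def self_attention(self, x):
        # x: (seq_len, batch, d_model), the default (non batch_first) layout
        seq_len, batch, _ = x.shape
        q, k, v = self.in_proj(x).chunk(3, dim=-1)

        # Reshape each projection to (batch * nhead, seq_len, head_dim)
        def split_heads(t):
            return t.reshape(seq_len, batch * self.nhead, self.head_dim).transpose(0, 1)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        weights = F.softmax(scores, dim=-1)
        attn = torch.matmul(weights, v)
        # Merge heads back to (seq_len, batch, d_model) and apply output projection
        attn = attn.transpose(0, 1).reshape(seq_len, batch, self.d_model)
        return self.out_proj(attn)

    def forward(self, x):
        # Post-norm residual structure: Add & Norm after each sub-layer
        x = self.norm1(x + self.self_attention(x))
        x = self.norm2(x + self.linear2(F.relu(self.linear1(x))))
        return x
```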
Verification: Compare the output of your custom encoder layer with a standard torch.nn.TransformerEncoderLayer of the same dimensions, with the reference layer's weights copied into your implementation and dropout disabled (otherwise the comparison is not deterministic). The outputs should match (within floating-point precision) for the same input tensor.
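
One way to run that comparison is sketched below, assuming the CustomEncoderLayer class from the previous block. The attribute names on the reference layer (self_attn.in_proj_weight, out_proj, linear1, linear2, norm1, norm2) follow the standard nn.TransformerEncoderLayer layout; the small dimensions and tolerance are arbitrary choices for the test.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, dim_ff = 16, 4, 64

# dropout=0.0 keeps both layers deterministic; post-norm (norm_first=False) is the default
reference = nn.TransformerEncoderLayer(d_model, nhead, dim_ff, dropout=0.0)
custom = CustomEncoderLayer(d_model, nhead, dim_ff)

# Copy the reference layer's weights into the custom layer
with torch.no_grad():
    custom.in_proj.weight.copy_(reference.self_attn.in_proj_weight)
    custom.in_proj.bias.copy_(reference.self_attn.in_proj_bias)
    custom.out_proj.weight.copy_(reference.self_attn.out_proj.weight)
    custom.out_proj.bias.copy_(reference.self_attn.out_proj.bias)
    custom.linear1.load_state_dict(reference.linear1.state_dict())
    custom.linear2.load_state_dict(reference.linear2.state_dict())
    custom.norm1.load_state_dict(reference.norm1.state_dict())
    custom.norm2.load_state_dict(reference.norm2.state_dict())

x = torch.randn(10, 2, d_model)  # (seq_len, batch, d_model)
reference.eval()
custom.eval()
with torch.no_grad():
    out_ref = reference(x)
    out_custom = custom(x)

# Expect True; a slightly loose tolerance accounts for fused-kernel fast paths
print(torch.allclose(out_ref, out_custom, atol=1e-5))
```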