Building a Transformer Encoder from Scratch
Implement a single Transformer encoder layer from scratch, without using torch.nn.TransformerEncoderLayer. This requires implementing a multi-head self-attention module and a position-wise feed-forward network. You will need to handle the query/key/value projections, the attention weights, and the weighted sum over the value vectors; pay close attention to the softmax and matmul operations.
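
A minimal sketch of what such a layer can look like is shown below. It assumes the default nn.TransformerEncoderLayer conventions (post-norm residuals, ReLU activation, input shaped as seq_len × batch × d_model, no dropout); the class name CustomEncoderLayer and the helper split_heads are illustrative, not part of any library API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048):
        super().__init__()
        self.d_model = d_model
        self.nhead = nhead
        self.head_dim = d_model // nhead
        # Fused query/key/value projection, mirroring nn.MultiheadAttention's in_proj
        self.in_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Position-wise feed-forward network
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        # Post-norm layer norms (the nn.TransformerEncoderLayer default)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def self_attention(self, x):
        # x: (seq_len, batch, d_model), the default (non batch_first) layout
        seq_len, batch, _ = x.shape
        q, k, v = self.in_proj(x).chunk(3, dim=-1)

        # Reshape each projection to (batch * nhead, seq_len, head_dim)
        def split_heads(t):
            return t.reshape(seq_len, batch * self.nhead, self.head_dim).transpose(0, 1)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        weights = F.softmax(scores, dim=-1)
        attn = torch.matmul(weights, v)
        # Merge heads back to (seq_len, batch, d_model) and apply output projection
        attn = attn.transpose(0, 1).reshape(seq_len, batch, self.d_model)
        return self.out_proj(attn)

    def forward(self, x):
        # Post-norm residual structure: Add & Norm after each sub-layer
        x = self.norm1(x + self.self_attention(x))
        x = self.norm2(x + self.linear2(F.relu(self.linear1(x))))
        return x
```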
Verification: Compare the output of your custom encoder layer with a standard torch.nn.TransformerEncoderLayer of the same dimensions, with the reference layer's weights copied into your implementation and dropout disabled (otherwise the comparison is not deterministic). The outputs should match (within floating-point precision) for the same input tensor.
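
One way to run that comparison is sketched below, assuming the CustomEncoderLayer class from the previous block. The attribute names on the reference layer (self_attn.in_proj_weight, out_proj, linear1, linear2, norm1, norm2) follow the standard nn.TransformerEncoderLayer layout; the small dimensions and tolerance are arbitrary choices for the test.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, dim_ff = 16, 4, 64

# dropout=0.0 keeps both layers deterministic; post-norm (norm_first=False) is the default
reference = nn.TransformerEncoderLayer(d_model, nhead, dim_ff, dropout=0.0)
custom = CustomEncoderLayer(d_model, nhead, dim_ff)

# Copy the reference layer's weights into the custom layer
with torch.no_grad():
    custom.in_proj.weight.copy_(reference.self_attn.in_proj_weight)
    custom.in_proj.bias.copy_(reference.self_attn.in_proj_bias)
    custom.out_proj.weight.copy_(reference.self_attn.out_proj.weight)
    custom.out_proj.bias.copy_(reference.self_attn.out_proj.bias)
    custom.linear1.load_state_dict(reference.linear1.state_dict())
    custom.linear2.load_state_dict(reference.linear2.state_dict())
    custom.norm1.load_state_dict(reference.norm1.state_dict())
    custom.norm2.load_state_dict(reference.norm2.state_dict())

x = torch.randn(10, 2, d_model)  # (seq_len, batch, d_model)
reference.eval()
custom.eval()
with torch.no_grad():
    out_ref = reference(x)
    out_custom = custom(x)

# Expect True; a slightly loose tolerance accounts for fused-kernel fast paths
print(torch.allclose(out_ref, out_custom, atol=1e-5))
```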