- 
                
                    Sliding Window Attention Preparation### Description Full self-attention has a quadratic complexity with respect to sequence length, which is prohibitive for very long sequences. Models like Longformer introduce **sliding window... 
- 
                
                    MoE Gating: Top-K Selection### Description In a Mixture of Experts (MoE) model, a gating network is responsible for routing each input token to a subset of 'expert' networks. [6, 14] A common strategy is Top-K gating, where... 
- 
                
                    Multi-Head Attention: Splitting Heads### Description In multi-head attention, the query, key, and value tensors are split into multiple heads. Given a tensor of shape `(B, N, D)`, where `D` is the embedding dimension, you need to... 
- 
                
                    Multi-Head Attention: Merging Heads### Description The inverse of splitting heads. After computing attention for each head, you need to merge them back. Given a tensor of shape `(B, H, N, D//H)`, you need to merge it back to `(B,... 
- 
                
                    Pixel Unshuffle (Pixel to Channel)### Description This is another name for the space-to-depth operation, common in super-resolution models. It involves rearranging blocks of spatial data into the channel dimension. Given a tensor... 
- 
                
                    Pixel Shuffle (Channel to Pixel)### Description The inverse of pixel unshuffle, also known as depth-to-space. It is used to upscale an image by rearranging elements from the channel dimension into spatial blocks. Given a tensor... 
- 
                
                    Grouped Matrix Multiplication### Description Perform matrix multiplication on groups of matrices within a batch. For an input tensor of shape `(B, G, M, K)` and another of shape `(B, G, K, N)`, the output should be of shape... 
- 
                
                    Bilinear Attention Pooling### Description In some attention mechanisms, you need to compute a bilinear interaction between two sets of features. Given two tensors of shapes `(B, N, D)` and `(B, M, D)`, compute a bilinear... 
- 
                
                    Custom `nn.Module` with a Non-standard InitializationCreate a **custom `nn.Module`** for a simple feed-forward layer. Instead of the default PyTorch initialization, you'll apply a specific, non-standard initialization scheme. For example, you could...