MoE Aggregator: Combining Expert Outputs
After tokens have been dispatched to and processed by their respective experts, the outputs need to be combined based on the weights from the gating network. This exercise focuses on that aggregation step: taking each expert's output for a token and summing them, weighted by the gating scores, into a single output vector per token.
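A minimal sketch of what this aggregation might look like in PyTorch. The function name, tensor shapes, and the assumption that gating weights are already normalised (and zero for experts a token was not routed to) are illustrative choices, not the exercise's required interface:

```python
import torch

def aggregate_expert_outputs(expert_outputs: torch.Tensor,
                             gate_weights: torch.Tensor) -> torch.Tensor:
    """Combine per-expert outputs into one output per token.

    expert_outputs: (num_tokens, num_experts, d_model) - each expert's output
                    for each token (zeros where a token was not routed).
    gate_weights:   (num_tokens, num_experts) - normalised gating weights
                    (zero for experts the token was not dispatched to).
    Returns:        (num_tokens, d_model) - weighted sum over experts.
    """
    # Broadcast the gate weights over the feature dimension, then sum over experts.
    return (gate_weights.unsqueeze(-1) * expert_outputs).sum(dim=1)


# Tiny usage example with made-up shapes.
tokens, experts, d_model = 4, 3, 8
outs = torch.randn(tokens, experts, d_model)
gates = torch.softmax(torch.randn(tokens, experts), dim=-1)
combined = aggregate_expert_outputs(outs, gates)
print(combined.shape)  # torch.Size([4, 8])
```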
Building a Simple Mixture of Experts (MoE) Layer
Now, let's combine the concepts of dispatching and aggregating into a full, albeit simplified, `torch.nn.Module` for a Mixture of Experts layer. This layer will replace a standard feed-forward network (FFN) sub-layer in a Transformer block; see the sketch below.
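To make the structure concrete, here is one possible sketch of such a layer in PyTorch with top-k routing. The class name, constructor arguments, and routing details are assumptions made for illustration, not the exercise's required API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """A minimal top-k Mixture of Experts layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is a small feed-forward network, mirroring the FFN it replaces.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to a stream of tokens for routing.
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)

        # Gating: pick the top-k experts per token and renormalise their scores.
        logits = self.gate(tokens)                            # (T, E)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)   # (T, k)
        top_weights = F.softmax(top_vals, dim=-1)             # (T, k)

        # Dispatch each token to its selected experts, then aggregate the
        # weighted expert outputs back into a single vector per token.
        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            mask = top_idx == expert_id                       # (T, k) routing mask
            token_ids, slot_ids = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            expert_out = expert(tokens[token_ids])            # (n, d_model)
            out.index_add_(0, token_ids,
                           expert_out * top_weights[token_ids, slot_ids].unsqueeze(-1))

        return out.reshape(batch, seq, d_model)


# Usage sketch: drop-in replacement for a Transformer block's FFN sub-layer.
layer = SimpleMoELayer(d_model=16, d_hidden=32, num_experts=4, top_k=2)
y = layer(torch.randn(2, 5, 16))
print(y.shape)  # torch.Size([2, 5, 16])
```

Looping over experts rather than tokens keeps each expert's work as one batched matrix multiply, which is the usual way simple MoE implementations stay efficient without custom dispatch kernels.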