-
MoE Aggregator: Combining Expert Outputs
After tokens have been dispatched to and processed by their respective experts, the outputs need to be combined according to the weights produced by the gating network. This exercise focuses on that aggregation step.
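Below is a minimal sketch of the aggregation step, assuming dense routing (every expert produces an output for every token, with zero weight for experts a token was not routed to); the function name `aggregate_expert_outputs` and the tensor shapes are illustrative choices, not fixed by the exercise.

```python
import torch

def aggregate_expert_outputs(expert_outputs: torch.Tensor,
                             gating_weights: torch.Tensor) -> torch.Tensor:
    """Combine per-expert outputs into one vector per token.

    expert_outputs: (num_experts, num_tokens, d_model)
    gating_weights: (num_tokens, num_experts), e.g. softmax scores from the gate
    returns:        (num_tokens, d_model)
    """
    # For each token t: sum over experts e of w[t, e] * expert_outputs[e, t, :]
    return torch.einsum("te,etd->td", gating_weights, expert_outputs)

# Quick shape check with random data
num_experts, num_tokens, d_model = 4, 8, 16
outs = torch.randn(num_experts, num_tokens, d_model)
gates = torch.softmax(torch.randn(num_tokens, num_experts), dim=-1)
combined = aggregate_expert_outputs(outs, gates)
assert combined.shape == (num_tokens, d_model)
```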
-
Building a Simple Mixture of Experts (MoE) Layer
Now, let's combine the concepts of dispatching and aggregating into a full, albeit simplified, `torch.nn.Module` for a Mixture of Experts layer. This layer will replace a standard feed-forward network.
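One possible shape for such a layer is sketched below; it evaluates every expert densely and weights the results with a top-k softmax gate. The class name `SimpleMoE` and all dimensions are illustrative, and a real implementation would dispatch only the selected tokens to each expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Simplified MoE layer: linear gate + small MLP experts, dense evaluation."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                                   # (tokens, experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = torch.zeros_like(logits)
        weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
        # Run every expert on every token (wasteful but simple), then aggregate
        expert_outs = torch.stack([e(x) for e in self.experts])  # (experts, tokens, d_model)
        return torch.einsum("te,etd->td", weights, expert_outs)

layer = SimpleMoE(d_model=32, d_hidden=64, num_experts=4)
y = layer(torch.randn(10, 32))
assert y.shape == (10, 32)
```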
-
Batch Normalization From Scratch
Implement 1D batch normalization manually (without using `nn.BatchNorm1d`). Steps:
1. Compute the batch mean and variance.
2. Normalize the inputs.
3. Scale and shift with learnable $$\gamma, \beta$$.
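A minimal sketch of the training-mode computation follows (running statistics for inference are left out); the class name `ManualBatchNorm1d` is an illustrative choice.

```python
import torch
import torch.nn as nn

class ManualBatchNorm1d(nn.Module):
    """1D batch norm written by hand; training-mode statistics only."""

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_features))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_features))   # learnable shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features)
        mean = x.mean(dim=0)                                   # 1. batch mean
        var = x.var(dim=0, unbiased=False)                     # 1. batch variance (biased, as BN uses)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)        # 2. normalize
        return self.gamma * x_hat + self.beta                  # 3. scale and shift

# Training-mode outputs should match the built-in layer
x = torch.randn(64, 8)
manual, builtin = ManualBatchNorm1d(8), nn.BatchNorm1d(8)
assert torch.allclose(manual(x), builtin(x), atol=1e-5)
```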
-
Debug Exploding Gradients
Create a deep feedforward net (20 layers, ReLU). Train it on dummy data. Track gradient norms across layers. Observe whether gradients explode. Experiment with:
- A smaller learning rate.
- Gradient clipping.
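A possible starting script is sketched below; the layer width, the deliberately oversized weight initialization, and the dummy regression target are illustrative choices made so the explosion is easy to observe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# 20-layer ReLU MLP with an intentionally large init so gradients blow up
layers = []
for _ in range(20):
    lin = nn.Linear(128, 128)
    nn.init.normal_(lin.weight, std=0.3)      # too large on purpose
    layers += [lin, nn.ReLU()]
model = nn.Sequential(*layers, nn.Linear(128, 1))

x, y = torch.randn(64, 128), torch.randn(64, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # try lowering lr as well
use_clipping = False                                 # flip to True to compare

for step in range(5):
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    # Per-layer gradient norms; watch how they grow toward the input side
    norms = [p.grad.norm().item() for n, p in model.named_parameters() if "weight" in n]
    print(f"step {step}: loss={loss.item():.3e} "
          f"first-layer grad={norms[0]:.3e} last-layer grad={norms[-1]:.3e}")
    if use_clipping:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```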
-
Implement a Siamese Network
Implement a Siamese network for MNIST digit similarity:
- Two identical CNNs sharing weights.
- Contrastive loss function.
- Train on pairs of digits (same/different).
Evaluate on test pairs.
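The sketch below covers the shared-weight encoder and the contrastive loss; the architecture, embedding size, and margin are illustrative, and the MNIST pair sampling and evaluation loop are left as the exercise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseCNN(nn.Module):
    """Both inputs pass through the same encoder, so the weights are shared."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embed_dim),
        )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(z1, z2, same, margin: float = 1.0):
    """same = 1 for pairs of the same digit, 0 for different digits."""
    dist = F.pairwise_distance(z1, z2)
    return (same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)).mean()

# Shape check with a fake MNIST-sized batch of pairs
model = SiameseCNN()
a, b = torch.randn(8, 1, 28, 28), torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 2, (8,)).float()
contrastive_loss(*model(a, b), labels).backward()
```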
-
Create a Transformer Encoder Block
Implement a single Transformer encoder block:
- Multi-head self-attention.
- Layer normalization.
- Feedforward network.
Compare output with `nn.TransformerEncoderLayer`.
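A sketch of the post-norm variant (the default layout of `nn.TransformerEncoderLayer`) is below; the class name `EncoderBlock` and the dimensions are illustrative. Note that matching the built-in layer's output numerically also requires copying its weights into this block.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: self-attention -> add & norm -> FFN -> add & norm."""

    def __init__(self, d_model: int = 64, nhead: int = 4, d_ff: int = 256, dropout: float = 0.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention with residual connection, then LayerNorm
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feedforward with residual connection, then LayerNorm
        return self.norm2(x + self.drop(self.ff(x)))

x = torch.randn(2, 10, 64)          # (batch, seq_len, d_model)
out = EncoderBlock()(x)
assert out.shape == x.shape
```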
-
Distributed DataParallel Basics
Simulate training with `torch.nn.DataParallel`:
- Define a simple CNN.
- Run it on 2 GPUs (if available).
- Verify the batch is split across devices.
Inspect `model.module` usage.
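A minimal sketch, assuming a machine where `torch.cuda.device_count()` reports at least two GPUs (it falls back to CPU otherwise); the class name `TinyCNN` and the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.head = nn.Linear(16, 10)

    def forward(self, x):
        # Each replica prints the device and the size of the batch slice it received
        print(f"forward on {x.device} with {x.size(0)} samples")
        return self.head(self.conv(x).mean(dim=(2, 3)))   # global average pool

model = TinyCNN()
if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
    x = torch.randn(8, 3, 32, 32).cuda()
else:
    x = torch.randn(8, 3, 32, 32)        # CPU fallback: no splitting happens

out = model(x)                           # with 2 GPUs: two prints, 4 samples each
print(out.shape)                         # outputs are gathered back to (8, 10)

# The wrapped model exposes the original module as model.module
underlying = model.module if isinstance(model, nn.DataParallel) else model
print(type(underlying).__name__)
```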