## Tensor Manipulation: Implement Layer Normalization

### Description

Layer Normalization is a key component in many modern deep learning models, especially Transformers. It normalizes the inputs across the feature dimension. Your task is to...
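A minimal sketch of one possible solution, normalizing over the last (feature) dimension; the signature and `eps` default are illustrative, not taken from the task:

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last (feature) dimension, then apply a learnable
    # affine transform. gamma and beta have shape (d_model,).
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased, as in nn.LayerNorm
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta
```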
---

## Tensor Manipulation: Numerically Stable Softmax

### Description

Implement the softmax function, which converts a vector of numbers into a probability distribution. A naive implementation can be numerically unstable if the input values are very...
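The standard fix is to subtract the per-row maximum before exponentiating, since softmax is invariant to shifts; a hedged sketch:

```python
import torch

def stable_softmax(x, dim=-1):
    # Subtracting the max leaves the result unchanged (shift invariance)
    # but keeps exp() from overflowing for large inputs.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exps = shifted.exp()
    return exps / exps.sum(dim=dim, keepdim=True)
```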
---

## Einops: Reversing a Sequence

### Description

Reversing the order of elements in a sequence is a common operation. While it can be done with `torch.flip` (PyTorch tensors do not support negative-step slicing), let's practice doing it with `einops` for a different...
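The text is cut off before the intended pattern, and notably `einops` has no flip primitive, so a pure-`einops` reverse is not straightforward. For reference, a sketch of the two standard PyTorch ways to reverse an axis (both return copies):

```python
import torch

x = torch.arange(12).reshape(3, 4)   # (batch, seq)

# Baseline 1: torch.flip copies the tensor with the given axes reversed.
rev1 = torch.flip(x, dims=[1])

# Baseline 2: advanced indexing with a descending index (also a copy).
idx = torch.arange(x.shape[1] - 1, -1, -1)
rev2 = x[:, idx]

assert torch.equal(rev1, rev2)
```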
---

## Tensor Manipulation: Dropout Layer

### Description

Implement the dropout layer from scratch. During training, dropout randomly zeroes some of the elements of the input tensor with probability `p`. The remaining elements are scaled...
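A sketch of "inverted" dropout, assuming the common convention of rescaling the surviving activations at train time so evaluation needs no correction:

```python
import torch

def dropout(x, p=0.5, training=True):
    # During evaluation dropout is the identity function.
    if not training or p == 0.0:
        return x
    # Keep each element with probability 1 - p, then rescale by 1/(1 - p)
    # so the expected activation is unchanged.
    mask = (torch.rand_like(x) >= p).to(x.dtype)
    return x * mask / (1.0 - p)
```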
---

## Tensor Manipulation: One-Hot Encoding

### Description

Implement one-hot encoding for a batch of class indices. Given a 1D tensor of integer labels, create a 2D tensor where each row is a vector of zeros except for a `1` at the index...
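A possible solution using advanced indexing (the function name is illustrative):

```python
import torch

def one_hot(labels, num_classes):
    # labels: (N,) integer class indices -> (N, num_classes) one-hot floats.
    out = torch.zeros(labels.shape[0], num_classes)
    out[torch.arange(labels.shape[0]), labels] = 1.0
    return out

print(one_hot(torch.tensor([2, 0, 1]), 3))
# tensor([[0., 0., 1.],
#         [1., 0., 0.],
#         [0., 1., 0.]])
```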
---

## Einops: Batched Matrix Multiplication

### Description

Perform a batched matrix multiplication `(B, N, D) @ (B, D, M) -> (B, N, M)` using `einops.einsum`. While `torch.bmm` is the standard, this is a good exercise to understand how...
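A sketch assuming einops ≥ 0.5, where `einops.einsum` takes the tensors first and the pattern last:

```python
import torch
from einops import einsum

a = torch.randn(4, 8, 16)   # (B, N, D)
b = torch.randn(4, 16, 32)  # (B, D, M)

# Axes appearing in both inputs and the output (b) are kept aligned;
# the axis appearing only in the inputs (d) is summed over.
out = einsum(a, b, 'b n d, b d m -> b n m')
assert out.shape == (4, 8, 32)
```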
---

## Tensor Manipulation: Using `gather` for selection

### Description

`torch.gather` is a powerful but sometimes confusing function for selecting elements from a tensor based on an index tensor. Your task is to use it to select specific elements...
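A small illustration of the semantics, picking one logit per row (the task's actual selection target is in the truncated text):

```python
import torch

logits = torch.randn(4, 10)            # (batch, num_classes)
targets = torch.tensor([3, 7, 0, 9])   # one class index per row

# The index must have the same number of dims as the input; along dim=1,
# output[row, col] = logits[row, index[row, col]].
picked = logits.gather(1, targets.unsqueeze(1)).squeeze(1)  # (4,)
assert torch.equal(picked, logits[torch.arange(4), targets])
```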
---

## Tensor Manipulation: Creating `unfold` with `as_strided`

### Description

**Warning: `as_strided` is an advanced and potentially unsafe operation that can crash your program if used incorrectly, as it creates a view on memory without checks.** With that...
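A hedged sketch of 1D sliding windows, equivalent to `x.unfold(0, window, 1)`. It assumes `x` is contiguous, which is exactly the kind of invariant `as_strided` will not check for you:

```python
import torch

def sliding_windows(x, window):
    # x: contiguous 1D tensor. Returns overlapping views of length `window`
    # with stride 1, without copying memory.
    n = x.shape[0]
    s = x.stride(0)
    return x.as_strided(size=(n - window + 1, window), stride=(s, s))

x = torch.arange(6.)
print(sliding_windows(x, 3))
# tensor([[0., 1., 2.],
#         [1., 2., 3.],
#         [2., 3., 4.],
#         [3., 4., 5.]])
```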
---

## Einops: Repeat for Tiling/Broadcasting

### Description

The `einops.repeat` function is a powerful and readable alternative to `Tensor.expand` or `torch.tile` for broadcasting or repeating a tensor along new or existing dimensions. ...
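Two representative patterns (axis names are arbitrary):

```python
import torch
from einops import repeat

x = torch.randn(32, 32)

# Broadcast along a brand-new trailing axis (grayscale -> 3-channel).
rgb = repeat(x, 'h w -> h w c', c=3)        # (32, 32, 3)

# Repeat along an existing axis: each row appears r times consecutively.
rows = repeat(x, 'h w -> (h r) w', r=2)     # (64, 32)
```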
---

## Tensor Manipulation: Using `scatter_add_`

### Description

`Tensor.scatter_add_` is used to add values into a tensor at specified indices. It's useful in cases like converting an edge list in a graph to an adjacency matrix or pooling...
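A small example that sums values into buckets by index, i.e. a weighted bincount:

```python
import torch

values = torch.tensor([1., 2., 3., 4.])
index = torch.tensor([0, 2, 0, 1])

# out[index[i]] += values[i] for every i; duplicates accumulate.
out = torch.zeros(3)
out.scatter_add_(0, index, values)
print(out)  # tensor([4., 4., 2.])
```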
---

## Implement a Neural Ordinary Differential Equation

### Description

Instead of modeling a function directly, a Neural ODE models its derivative with a neural network. The output is then found by integrating this derivative over time. [1] Your task...
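A minimal fixed-step Euler sketch; real implementations typically use adaptive solvers and the adjoint method (e.g. the torchdiffeq library), so treat this as illustrative only:

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    # The network models dz/dt = f(t, z).
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, z):
        # t is a python float; feed it in as an extra input feature.
        t_col = torch.full((z.shape[0], 1), t)
        return self.net(torch.cat([z, t_col], dim=-1))

def odeint_euler(func, z0, t0=0.0, t1=1.0, steps=50):
    # Fixed-step Euler integration; gradients flow through every step.
    z, dt = z0, (t1 - t0) / steps
    for i in range(steps):
        z = z + dt * func(t0 + i * dt, z)
    return z
```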
---

## Model-Agnostic Meta-Learning (MAML) Update Step

### Description

Model-Agnostic Meta-Learning (MAML) is a meta-learning algorithm that trains a model's initial parameters such that it can adapt to a new task with only a few gradient steps. [1]...
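A sketch of the differentiable inner step, assuming PyTorch ≥ 2.0 for `torch.func.functional_call`; the single-step setup and `inner_lr` default are illustrative:

```python
import torch
from torch.func import functional_call

def maml_inner_step(model, loss_fn, x_support, y_support, inner_lr=0.01):
    # One inner-loop SGD step. create_graph=True keeps the update
    # differentiable so the meta-loss can backprop into the initial params.
    params = dict(model.named_parameters())
    preds = functional_call(model, params, (x_support,))
    grads = torch.autograd.grad(loss_fn(preds, y_support),
                                list(params.values()), create_graph=True)
    return {name: p - inner_lr * g
            for (name, p), g in zip(params.items(), grads)}

# Meta-objective: evaluate the adapted parameters on the query set,
# then backprop through the inner step:
#   adapted = maml_inner_step(model, loss_fn, xs, ys)
#   meta_loss = loss_fn(functional_call(model, adapted, (xq,)), yq)
```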
---

## Build a Transformer Encoder Block from Scratch

### Description

The Transformer architecture is built upon a fundamental component: the Encoder block. [1] Each block is responsible for processing a sequence of embeddings and refining them. Your...
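A compact pre-LN sketch; it leans on `nn.MultiheadAttention` for brevity, whereas the exercise presumably wants the attention written out by hand:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # Pre-LN variant: LayerNorm -> sublayer -> residual add.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.drop(attn_out)
        x = x + self.drop(self.ff(self.norm2(x)))
        return x
```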
---

## Differentiable Additive Synthesizer

### Description

Differentiable Digital Signal Processing (DDSP) is a technique that combines classic signal processing with deep learning by making the parameters of synthesizers learnable via...
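A hedged sketch of a harmonic additive synthesizer: instantaneous phase is the cumulative sum of angular frequency, and every op is differentiable so gradients reach the amplitude envelopes. Shapes and the sample rate are assumptions:

```python
import math
import torch

def additive_synth(f0, amplitudes, sample_rate=16000):
    # f0:         (T,) fundamental frequency in Hz per sample
    # amplitudes: (T, K) per-harmonic amplitude envelopes (network outputs)
    T, K = amplitudes.shape
    harmonics = torch.arange(1, K + 1)                            # (K,)
    # Instantaneous phase = cumulative sum of angular frequency.
    phase = 2 * math.pi * torch.cumsum(f0 / sample_rate, dim=0)   # (T,)
    # (T, K): harmonic k oscillates at k * f0.
    waves = torch.sin(phase.unsqueeze(-1) * harmonics)
    return (amplitudes * waves).sum(dim=-1)                       # (T,)
```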
---

## Soft Actor-Critic (SAC) Critic Loss

### Description

Soft Actor-Critic (SAC) is a state-of-the-art reinforcement learning algorithm known for its stability and sample efficiency. [1] A key component is its critic (or Q-network)...
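A sketch of the soft Bellman target with clipped double-Q; tensor shapes and hyperparameter defaults are illustrative:

```python
import torch
import torch.nn.functional as F

def sac_critic_loss(q1, q2, q1_target, q2_target, log_prob_next,
                    reward, done, gamma=0.99, alpha=0.2):
    # q1, q2:               Q(s, a) from the two online critics, shape (B,)
    # q1_target, q2_target: target-network Q(s', a') for a' ~ pi(.|s')
    # log_prob_next:        log pi(a'|s'), shape (B,)
    with torch.no_grad():
        # Clipped double-Q minus the entropy term (SAC's soft Bellman backup).
        min_q_next = torch.min(q1_target, q2_target) - alpha * log_prob_next
        target = reward + gamma * (1.0 - done) * min_q_next
    return F.mse_loss(q1, target) + F.mse_loss(q2, target)
```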
---

## Implement a Knowledge Distillation Loss

### Description

Knowledge Distillation is a model compression technique where a small "student" model is trained to mimic a larger, pre-trained "teacher" model. [1] This is achieved by training...
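A sketch of the classic Hinton et al. formulation, blending a temperature-scaled KL term with ordinary cross-entropy; the `T²` factor keeps gradient magnitudes comparable across temperatures:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # Soft targets: KL between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean') * temperature ** 2
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```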
---

## Masked Autoencoder (MAE) Input Preprocessing

### Description

Masked Autoencoders (MAE) are a powerful self-supervised learning technique for vision transformers. The core idea is simple: randomly mask a large portion of the input image...
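A sketch of the two preprocessing steps: patchify, then per-sample random masking via argsort of noise, mirroring the approach in the MAE paper (names are illustrative):

```python
import torch

def patchify(imgs, patch=16):
    # (B, C, H, W) -> (B, N, patch*patch*C) non-overlapping patches.
    B, C, H, W = imgs.shape
    x = imgs.unfold(2, patch, patch).unfold(3, patch, patch)  # (B,C,h,w,p,p)
    return x.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, patch * patch * C)

def random_masking(x, mask_ratio=0.75):
    # Shuffle patch indices independently per sample and keep only the
    # first (1 - mask_ratio) fraction; ids_shuffle allows unshuffling later.
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N).argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    x_kept = x.gather(1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_kept, ids_shuffle
```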
---

## Neural Cellular Automata (NCA) Update Step

### Description

Neural Cellular Automata (NCA) are a fascinating generative model where complex global patterns emerge from simple, local rules learned by a neural network. [1] A grid of "cells,"...
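A simplified sketch of one update step (identity + Sobel perception filters, a 1×1-conv update network, stochastic firing); it omits the alive-masking of the original work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCAUpdate(nn.Module):
    def __init__(self, channels=16, hidden=128):
        super().__init__()
        # Fixed perception filters: identity + Sobel x/y, applied per channel.
        ident = torch.tensor([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 8
        kernels = torch.stack([ident, sobel_x, sobel_x.T])           # (3,3,3)
        self.register_buffer(
            'kernels', kernels.repeat(channels, 1, 1).unsqueeze(1))  # (3C,1,3,3)
        self.channels = channels
        # Tiny per-cell update network (1x1 convs = a shared MLP over cells).
        self.net = nn.Sequential(
            nn.Conv2d(3 * channels, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1))

    def forward(self, state, fire_rate=0.5):
        # state: (B, C, H, W). Perceive each cell's 3x3 neighbourhood.
        y = F.conv2d(state, self.kernels, padding=1, groups=self.channels)
        dx = self.net(y)
        # Stochastic update: each cell fires independently.
        mask = (torch.rand_like(state[:, :1]) < fire_rate).float()
        return state + dx * mask
```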
---

## Bayesian Neural Network Layer

### Description

In a standard neural network, weights are single point estimates. In a Bayesian Neural Network (BNN), we learn a probability distribution over each weight. [1] This allows for...
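A sketch of a Bayesian linear layer using the reparameterization trick; the KL/prior term needed for full variational training is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    # Each weight has a learned Gaussian posterior N(mu, sigma^2);
    # sigma = softplus(rho) keeps it positive. Sampling w = mu + sigma * eps
    # lets gradients flow to mu and rho.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -5.0))

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)
        b_sigma = F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return F.linear(x, w, b)
```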
---

## Deep Canonical Correlation Analysis (DCCA) Loss

### Description

Canonical Correlation Analysis (CCA) is a statistical method for finding correlations between two sets of variables. Deep CCA (DCCA) uses neural networks to first project two...
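A hedged sketch of the correlation objective: whiten the cross-covariance matrix and take the negative trace norm (sum of singular values); the `eps` regularization is an assumption for numerical stability:

```python
import torch

def dcca_loss(h1, h2, eps=1e-4):
    # h1, h2: (N, d) projections from the two networks.
    # Loss = negative sum of canonical correlations (trace norm of T).
    n = h1.shape[0]
    h1 = h1 - h1.mean(dim=0)
    h2 = h2 - h2.mean(dim=0)
    s12 = h1.T @ h2 / (n - 1)
    s11 = h1.T @ h1 / (n - 1) + eps * torch.eye(h1.shape[1], device=h1.device)
    s22 = h2.T @ h2 / (n - 1) + eps * torch.eye(h2.shape[1], device=h2.device)

    def inv_sqrt(s):
        # Matrix inverse square root via eigendecomposition.
        vals, vecs = torch.linalg.eigh(s)
        return vecs @ torch.diag(vals.clamp_min(eps).rsqrt()) @ vecs.T

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    return -torch.linalg.svdvals(t).sum()
```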
---

## Siamese Network for One-Shot Image Verification

### Description

Your task is to implement a Siamese network that can determine if two images are of the same class, given only one or a few examples of that class at test time. You'll train a...
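A minimal sketch pairing a shared encoder with a contrastive loss; the architecture and margin are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # One shared encoder processes both images; weight sharing is the
        # defining property of a Siamese network.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(embed_dim))

    def forward(self, x1, x2):
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(z1, z2, same, margin=1.0):
    # same: 1 if the pair shares a class, else 0. Pulls positives together,
    # pushes negatives apart up to the margin.
    d = F.pairwise_distance(z1, z2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()
```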
---

## Physics-Informed Neural Network (PINN) for an ODE

### Description

Solve a simple Ordinary Differential Equation (ODE) using a Physics-Informed Neural Network. A PINN is a neural network that is trained to satisfy both the data and the underlying...
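A sketch for the toy problem dy/dt = -y, y(0) = 1, whose exact solution is exp(-t); the specific ODE is an assumption, since the task statement is truncated:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    # Random collocation points on [0, 2].
    t = (torch.rand(128, 1) * 2.0).requires_grad_()
    y = net(t)
    # dy/dt via autograd on the network output.
    dy_dt = torch.autograd.grad(y, t, torch.ones_like(y), create_graph=True)[0]
    residual = (dy_dt + y).pow(2).mean()               # physics loss
    ic = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()  # initial condition
    loss = residual + ic
    opt.zero_grad()
    loss.backward()
    opt.step()
```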
---

## Graph Convolutional Network for Node Classification

### Description

Implement a simple Graph Convolutional Network (GCN) to perform node classification on a graph dataset like Cora. [1] A GCN layer aggregates information from a node's neighbors to...
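A dense-matrix sketch of the Kipf & Welling layer, H' = D^(-1/2)(A + I)D^(-1/2) H W; a real Cora pipeline would use sparse operations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # Symmetric normalization with self-loops added.
        a_hat = adj + torch.eye(adj.shape[0], device=adj.device)
        d_inv_sqrt = a_hat.sum(dim=1).rsqrt()           # degree^{-1/2}
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return a_norm @ self.linear(x)

class GCN(nn.Module):
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.l1 = GCNLayer(in_dim, hidden)
        self.l2 = GCNLayer(hidden, n_classes)

    def forward(self, x, adj):
        return self.l2(F.relu(self.l1(x, adj)), adj)
```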
---

## HyperNetwork for Weight Generation

### Description

Implement a simple HyperNetwork. A HyperNetwork is a neural network that generates the weights for another, larger network (the "target network"). [1] This allows for dynamic...
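A sketch where a small MLP emits the weight matrix and bias of a target linear layer, applied with `F.linear` (all sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    # A small hypernetwork maps a conditioning vector z to the parameters
    # of a target linear layer.
    def __init__(self, z_dim, in_features, out_features, hidden=64):
        super().__init__()
        self.in_f, self.out_f = in_features, out_features
        n_params = out_features * in_features + out_features
        self.hyper = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_params))

    def forward(self, x, z):
        flat = self.hyper(z)                              # (n_params,)
        w = flat[:self.out_f * self.in_f].view(self.out_f, self.in_f)
        b = flat[self.out_f * self.in_f:]
        return F.linear(x, w, b)

layer = HyperLinear(z_dim=8, in_features=16, out_features=4)
y = layer(torch.randn(10, 16), torch.randn(8))            # (10, 4)
```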
---

## Normalizing Flow for Density Estimation

### Description

Implement a simple 2D Normalizing Flow model. Normalizing Flows transform a simple base distribution (like a Gaussian) into a more complex distribution by applying a sequence of...
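A sketch of one RealNVP-style affine coupling layer for 2D data, returning the per-sample log-determinant needed for the change-of-variables log-likelihood:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # One coordinate passes through unchanged and parameterizes an affine
    # map of the other, so the inverse and log|det J| are both cheap;
    # alternating `flip` across layers lets every coordinate be transformed.
    def __init__(self, hidden=64, flip=False):
        super().__init__()
        self.flip = flip
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, x):
        x1, x2 = x[:, :1], x[:, 1:]
        if self.flip:
            x1, x2 = x2, x1
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                  # keep the scale bounded
        y2 = x2 * log_s.exp() + t
        y = torch.cat([x1, y2] if not self.flip else [y2, x1], dim=-1)
        return y, log_s.sum(dim=-1)                # log|det J| per sample

# Density estimation: push x through each coupling layer, accumulate the
# log-dets, and evaluate the base (standard normal) log-prob at the end.
```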