-
Implementing a simplified Beam Search Decoder with `gather` and `scatter`
Beam search is a popular decoding algorithm used in machine translation and text generation. A key step in beam search is to select the top-k most likely next tokens and update the corresponding...
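A minimal sketch of one decoding step, assuming the model has already produced `log_probs` of shape `(beam_size, vocab_size)`; the function and variable names here are illustrative:

```python
import jax
import jax.numpy as jnp

def beam_step(sequences, scores, log_probs, beam_size):
    # sequences: (beam_size, t) token ids; scores: (beam_size,) running log-probs
    vocab_size = log_probs.shape[-1]
    # Add running scores and flatten so top-k runs over all (beam, token) pairs.
    total = (scores[:, None] + log_probs).reshape(-1)       # (beam*vocab,)
    top_scores, flat_idx = jax.lax.top_k(total, beam_size)
    beam_idx = flat_idx // vocab_size                        # which beam each candidate extends
    token_idx = flat_idx % vocab_size                        # which token it appends
    # Gather the surviving prefixes, then append the new tokens.
    new_sequences = jnp.concatenate([sequences[beam_idx], token_idx[:, None]], axis=-1)
    return new_sequences, top_scores
```

With a preallocated, fixed-length sequence buffer, the append could instead be written as a scatter, e.g. `sequences[beam_idx].at[:, t].set(token_idx)`.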
-
Forward-mode vs. Reverse-mode autodiff
JAX supports both forward-mode and reverse-mode automatic differentiation. While `grad` uses reverse-mode, you can use `jax.jvp` for forward-mode, which computes Jacobian-vector products....
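A small comparison, assuming a toy scalar-valued function `f`; `jax.jvp` returns both the primal output and the directional derivative along the tangent:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.arange(3.0)

# Reverse mode: the full gradient in one backward pass.
g = jax.grad(f)(x)

# Forward mode: the directional derivative along tangent v, no backward pass needed.
v = jnp.array([1.0, 0.0, 0.0])
y, dy = jax.jvp(f, (x,), (v,))

print(g, dy)  # dy equals g @ v
```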
-
Custom VJP
For some functions, you may want to define a custom vector-Jacobian product (VJP). This can be useful for numerical stability or for implementing algorithms that are not easily expressed in terms...
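A classic numerical-stability example, sketched with `jax.custom_vjp`: a hand-written backward pass for `log(1 + exp(x))` that uses the sigmoid instead of the traced derivative:

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    return jnp.log1p(jnp.exp(x))

def log1pexp_fwd(x):
    return log1pexp(x), x                 # save x as the residual

def log1pexp_bwd(x, g):
    return (g * jax.nn.sigmoid(x),)       # d/dx log(1 + e^x) = sigmoid(x)

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

print(jax.grad(log1pexp)(100.0))  # 1.0, instead of the nan the traced derivative produces
```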
-
Parallelization with `pmap`
For large models, it is often necessary to train on multiple devices (e.g., GPUs or TPUs). JAX's `pmap` transformation allows for easy parallelization of computations across devices. In this...
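A sketch of a data-parallel gradient step with `pmap`, assuming the batch is sharded along a leading device axis and the parameters are replicated; it runs on however many local devices are present (including just one):

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def parallel_step(w, x, y, lr):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")   # average gradients across devices
    return w - lr * grads

n_dev = jax.local_device_count()
key = jax.random.PRNGKey(0)
w = jnp.broadcast_to(jnp.zeros((4,)), (n_dev, 4))       # replicate parameters
x = jax.random.normal(key, (n_dev, 8, 4))               # shard the batch per device
y = jnp.ones((n_dev, 8))
lr = jnp.full((n_dev,), 0.1)
w = parallel_step(w, x, y, lr)
```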
-
The Elegant Gradient of Softmax-Cross-Entropy
One of the most satisfying derivations in deep learning is the gradient of the combined Softmax and Cross-Entropy loss. For a multi-class classification problem with $K$ classes, given true labels...
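The punchline of the derivation is that the gradient with respect to the logits is simply `softmax(z) - y`. A small numerical check of that claim (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = 5
z = torch.randn(K, requires_grad=True)
y = F.one_hot(torch.tensor(2), num_classes=K).float()   # one-hot true label

# L = -sum_k y_k log softmax(z)_k
loss = -(y * F.log_softmax(z, dim=-1)).sum()
loss.backward()

print(torch.allclose(z.grad, F.softmax(z.detach(), dim=-1) - y))  # True
```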
-
Deconstructing Self-Attention Scores
The self-attention mechanism is a core component of Transformers. Let's break down how attention scores are calculated. 1. **Query, Key, Value**: In self-attention, each input token (or its...
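A single-head sketch of the score computation, assuming an input of shape `(seq_len, d_model)` and illustrative projection matrices `W_q`, `W_k`, `W_v`:

```python
import math
import torch

def attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)             # each row is a distribution
    return weights @ V, weights

x = torch.randn(4, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
out, attn = attention(x, W_q, W_k, W_v)
print(out.shape, attn.sum(dim=-1))  # (4, 8); attention rows sum to 1
```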
-
Dissecting the Variational Autoencoder's ELBO
Variational Autoencoders (VAEs) are powerful generative models that optimize a lower bound on the data log-likelihood, known as the Evidence Lower Bound (ELBO). The ELBO for a single data point...
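A sketch of the negative ELBO for a batch, assuming a Gaussian encoder `q(z|x) = N(mu, diag(exp(logvar)))`, a standard-normal prior, and a Bernoulli decoder:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon_logits, mu, logvar):
    # -E_q[log p(x|z)], single-sample Monte Carlo estimate via binary cross-entropy
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # minimizing this maximizes the ELBO
```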
-
PCA from First Principles
Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique. It works by transforming the data into a new coordinate system such that the greatest variance by any...
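A from-scratch sketch via the eigendecomposition of the covariance matrix; function and variable names are illustrative:

```python
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # keep the top-k directions of variance
    components = eigvecs[:, order]
    return X_centered @ components, components

X = np.random.randn(100, 5)
Z, W = pca(X, 2)
print(Z.shape)  # (100, 2)
```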
-
SVD for Image Compression
Singular Value Decomposition (SVD) is a powerful matrix factorization technique with numerous applications, including dimensionality reduction, noise reduction, and data compression. Any real $m...
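A rank-k compression sketch on a stand-in grayscale image; keeping only the k largest singular values gives the best rank-k approximation in the Frobenius norm:

```python
import numpy as np

def compress(image, k):
    U, S, Vt = np.linalg.svd(image, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k reconstruction

img = np.random.rand(256, 256)                      # stand-in for a grayscale image
approx = compress(img, 20)
print(np.linalg.norm(img - approx) / np.linalg.norm(img))  # relative reconstruction error
```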
-
Softmax and its Jacobian
The softmax function is a critical component in multi-class classification, converting a vector of arbitrary real values into a probability distribution. Given an input vector $\mathbf{z} = [z_1,...
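The Jacobian has the closed form `J = diag(s) - s s^T` with `s = softmax(z)`; a small sketch that also uses the standard max-subtraction trick for stability:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)   # J_ij = s_i (delta_ij - s_j)

z = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(z)
print(J.sum(axis=1))  # ~0: rows sum to zero because softmax outputs sum to 1
```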
-
Backpropagation for a Single-Layer Network
Backpropagation is the cornerstone algorithm for training neural networks. It efficiently calculates the gradients of the loss function with respect to all the weights and biases in the network by...
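A manual forward/backward sketch for one hidden layer with a sigmoid activation and squared-error loss, applying the chain rule layer by layer (all names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = x @ W1 + b1
    a1 = sigmoid(z1)
    y_hat = a1 @ W2 + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass (chain rule)
    d_yhat = y_hat - y                 # dL/dy_hat
    dW2 = np.outer(a1, d_yhat)
    db2 = d_yhat
    d_a1 = W2 @ d_yhat                 # propagate through the output layer
    d_z1 = d_a1 * a1 * (1 - a1)        # sigmoid derivative
    dW1 = np.outer(x, d_z1)
    db1 = d_z1
    return loss, (dW1, db1, dW2, db2)

x, y = np.random.randn(3), np.random.randn(2)
W1, b1, W2, b2 = np.random.randn(3, 4), np.zeros(4), np.random.randn(4, 2), np.zeros(2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)
```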
-
MoE Aggregator: Combining Expert Outputs
After tokens have been dispatched to and processed by their respective experts, the outputs need to be combined based on the weights from the gating network. This exercise focuses on this...
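A sketch of the dense case, assuming every expert produced an output for every token and the gate weights are a per-token softmax over experts:

```python
import torch

def aggregate(expert_outputs, gate_weights):
    # expert_outputs: (num_experts, batch, d_model)
    # gate_weights:   (batch, num_experts), rows sum to 1
    return torch.einsum("ebd,be->bd", expert_outputs, gate_weights)

expert_outputs = torch.randn(4, 16, 32)
gate_weights = torch.softmax(torch.randn(16, 4), dim=-1)
print(aggregate(expert_outputs, gate_weights).shape)  # (16, 32)
```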
-
Building a Simple Mixture of Experts (MoE) Layer
Now, let's combine the concepts of dispatching and aggregating into a full, albeit simplified, `torch.nn.Module` for a Mixture of Experts layer. This layer will replace a standard feed-forward...
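A simplified, dense sketch of such a module: a linear gate scores the experts and each token's output is the gate-weighted mix of all expert FFNs (a production MoE would route each token to only its top-k experts):

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts))

    def forward(self, x):                                   # x: (batch, d_model)
        gates = torch.softmax(self.gate(x), dim=-1)         # (batch, E)
        outs = torch.stack([e(x) for e in self.experts])    # (E, batch, d_model)
        return torch.einsum("ebd,be->bd", outs, gates)      # gate-weighted combination

layer = MoELayer(32, 64, num_experts=4)
print(layer(torch.randn(8, 32)).shape)  # (8, 32)
```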
-
Implementing a Siamese Network with Triplet Loss
Building on the previous exercise, let's switch to **Triplet Loss**. This loss function is more powerful because it enforces a margin between the anchor-positive distance and the anchor-negative distance. The...
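A from-scratch sketch of the loss itself, assuming the three inputs are embeddings already produced by the shared network:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Hinge: penalize unless d_neg exceeds d_pos by at least the margin.
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

a, p, n = torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128)
print(triplet_loss(a, p, n))
```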
-
Implementing Self-Supervised Learning with BYOL
Implement the core logic of **Bootstrap Your Own Latent (BYOL)**. BYOL is a self-supervised learning method that learns image representations without using negative pairs. It consists of two...
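A sketch of the two pieces that make BYOL work, assuming the full encoder/projector/predictor architecture is built elsewhere: the negative-cosine loss with a stop-gradient on the target branch, and the EMA update of the target weights:

```python
import copy
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target.detach(), dim=-1)      # stop-gradient through the target
    return (2 - 2 * (p * z).sum(dim=-1)).mean()     # equivalent to negative cosine similarity

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    # Target weights are an exponential moving average of the online weights.
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(tau).add_(o, alpha=1 - tau)

online = torch.nn.Sequential(torch.nn.Linear(32, 32))   # stand-in for the online network
target = copy.deepcopy(online)                           # target starts as a copy
```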
-
Implementing a Multi-Headed Attention Mechanism
Expand on the previous attention exercise by implementing a **Multi-Headed Attention mechanism** from scratch. A single attention head is the scaled dot-product attention you've already implemented. Multi-head...
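A self-contained sketch: project into `h` heads, attend per head in parallel, then concatenate and project back (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5  # (B, h, T, T)
        out = torch.softmax(scores, dim=-1) @ v             # (B, h, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, -1)         # concatenate the heads
        return self.out_proj(out)

mha = MultiHeadAttention(64, num_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```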
-
Implementing a Masked Language Model
Implement a **Masked Language Model (MLM)**, a technique at the heart of models like BERT. Given a sentence, you'll randomly mask some of the words and then train a model to predict those masked...
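A sketch of the masking step itself, assuming token ids and a `[MASK]` id chosen elsewhere; positions that are not masked get label `-100` so `CrossEntropyLoss` ignores them:

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob   # choose ~15% of positions
    labels[~mask] = -100                              # loss is computed only at masked positions
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id                  # replace chosen tokens with [MASK]
    return masked_ids, labels

ids = torch.randint(0, 1000, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=999)
```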
-
Building a Graph Autoencoder
Implement a **Graph Autoencoder (GAE)** for graph representation learning. The encoder will use a GNN to produce node embeddings, and the decoder will reconstruct the graph's adjacency matrix from...
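A minimal dense-adjacency sketch: one GCN-style propagation step as the encoder and an inner-product decoder that scores every node pair:

```python
import torch
import torch.nn as nn

class GAE(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, hidden_dim, bias=False)

    def encode(self, X, A):
        A_hat = A + torch.eye(A.size(0))                   # add self-loops
        d = A_hat.sum(dim=1).pow(-0.5)
        A_norm = d[:, None] * A_hat * d[None, :]           # D^{-1/2} A_hat D^{-1/2}
        return torch.relu(A_norm @ self.W(X))              # node embeddings Z

    def decode(self, Z):
        return torch.sigmoid(Z @ Z.t())                    # reconstructed adjacency probabilities

X = torch.randn(5, 8)
A = torch.randint(0, 2, (5, 5)).float()
A = ((A + A.t()) > 0).float()                              # symmetrize the toy graph
model = GAE(8, 16)
A_recon = model.decode(model.encode(X, A))
```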
-
Adversarial Training for Robustness
Implement **adversarial training** on a simple classification model like a small CNN on MNIST. The goal is to make the model robust to adversarial attacks. You'll need to generate adversarial...
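A sketch of one FGSM-based training step, assuming inputs in `[0, 1]` (as with MNIST) and a standard classifier/optimizer pair:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.1):
    # Perturb the input in the direction of the sign of the loss gradient.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_step(model, optimizer, x, y, epsilon=0.1):
    x_adv = fgsm(model, x, y, epsilon)          # craft adversarial examples
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)     # train on the perturbed batch
    loss.backward()
    optimizer.step()
    return loss.item()
```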
-
Neural Style Transfer
Implement **Neural Style Transfer**. Given a content image and a style image, generate a new image that combines the content of the former with the style of the latter. Use a pre-trained VGG...
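A sketch of the two losses at the heart of the method, assuming `features` are activations from a chosen pre-trained VGG layer with shape `(channels, H, W)`; the full exercise optimizes the generated image against a weighted sum of these:

```python
import torch

def gram_matrix(features):
    C, H, W = features.shape
    F_flat = features.view(C, H * W)
    return F_flat @ F_flat.t() / (C * H * W)   # channel-wise feature correlations

def content_loss(gen_feat, content_feat):
    return torch.mean((gen_feat - content_feat) ** 2)

def style_loss(gen_feat, style_feat):
    return torch.mean((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2)
```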
-
Training a Variational Autoencoder (VAE)
Implement and train a **Variational Autoencoder (VAE)** on a dataset like MNIST. The encoder should map the input to a latent space distribution (mean and variance), and the decoder should...
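A module skeleton with the reparameterization trick so gradients can flow through the sampling step; the negative-ELBO sketch from the ELBO entry above can serve as the training loss (layer sizes are illustrative, e.g. 784 for flattened MNIST):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 400), nn.ReLU())
        self.mu = nn.Linear(400, latent_dim)
        self.logvar = nn.Linear(400, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                 nn.Linear(400, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar                            # reconstruction logits
```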
-
Implementing the Adam Optimizer from Scratch
Implement the **Adam optimizer from scratch** as a subclass of `torch.optim.Optimizer`. You'll need to manage the first-moment vector (moving average of gradients) and the second-moment vector...
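A sketch of such a subclass, following the update rule from the original paper (moment buffers with bias correction); treat it as a starting point rather than a drop-in replacement for `torch.optim.Adam`:

```python
import torch

class MyAdam(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:                              # lazy per-parameter state init
                    state["t"] = 0
                    state["m"] = torch.zeros_like(p)       # first moment
                    state["v"] = torch.zeros_like(p)       # second moment
                state["t"] += 1
                m, v, t = state["m"], state["v"], state["t"]
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)               # bias correction
                v_hat = v / (1 - beta2 ** t)
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-group["lr"])
```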
-
Building a Transformer Encoder from Scratch
Implement a single layer of a **Transformer Encoder** from scratch, without using `torch.nn.TransformerEncoderLayer`. This requires implementing a multi-head self-attention module and a...
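A sketch of one post-norm encoder layer, reusing the `MultiHeadAttention` sketch from the multi-headed attention entry above; hyperparameter names are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from the earlier sketch
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        x = self.norm1(x + self.drop(self.attn(x)))          # residual + layer norm
        return self.norm2(x + self.drop(self.ffn(x)))        # position-wise feed-forward

layer = EncoderLayer(64, num_heads=8, d_ff=256)
print(layer(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```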
-
Implementing a Siamese Network for Similarity Learning
Build and train a **Siamese network** on a dataset like MNIST. The network takes pairs of images as input and learns to determine if they belong to the same class (a positive pair) or different...
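A sketch of the contrastive loss used on the pair embeddings (both branches share one embedding network with tied weights); here label 1 marks a positive pair:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    d = F.pairwise_distance(emb1, emb2)
    # Pull positive pairs together, push negative pairs at least `margin` apart.
    return (label * d.pow(2) +
            (1 - label) * torch.clamp(margin - d, min=0).pow(2)).mean()

e1, e2 = torch.randn(16, 64), torch.randn(16, 64)
label = torch.randint(0, 2, (16,)).float()
print(contrastive_loss(e1, e2, label))
```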
-
Generative Adversarial Network (GAN) on MNIST
Implement and train a simple **Generative Adversarial Network (GAN)**. The network consists of a generator and a discriminator. The generator takes a random noise vector and tries to generate a...
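A sketch of one training iteration, assuming `G` maps noise to an image and `D` outputs a single real/fake logit per example:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=64):
    B = real.size(0)
    fake = G(torch.randn(B, z_dim))

    # Discriminator update: real -> 1, fake -> 0 (fake detached so G is untouched).
    opt_d.zero_grad()
    d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(B, 1)) +
              F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(B, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    opt_g.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(B, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```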