-
Implementing a simplified Beam Search Decoder with `gather` and `scatter`
Beam search is a popular decoding algorithm used in machine translation and text generation. A key step in beam search is to select the top-k most likely next tokens and update the corresponding...
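A minimal sketch of one decoding step, assuming the model has already produced `log_probs` of shape `(beam_size, vocab_size)`; the function and variable names here are illustrative:

```python
import jax
import jax.numpy as jnp

def beam_step(sequences, scores, log_probs, beam_size):
    # sequences: (beam_size, t) token ids; scores: (beam_size,) running log-probs
    vocab_size = log_probs.shape[-1]
    # Add running scores and flatten so top-k runs over all (beam, token) pairs.
    total = (scores[:, None] + log_probs).reshape(-1)       # (beam*vocab,)
    top_scores, flat_idx = jax.lax.top_k(total, beam_size)
    beam_idx = flat_idx // vocab_size                        # which beam each candidate extends
    token_idx = flat_idx % vocab_size                        # which token it appends
    # Gather the surviving prefixes, then append the new tokens.
    new_sequences = jnp.concatenate([sequences[beam_idx], token_idx[:, None]], axis=-1)
    return new_sequences, top_scores
```

With a preallocated, fixed-length sequence buffer, the append could instead be written as a scatter, e.g. `sequences[beam_idx].at[:, t].set(token_idx)`.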
-
Forward-mode vs. Reverse-mode autodiff
JAX supports both forward-mode and reverse-mode automatic differentiation. While `grad` uses reverse-mode, you can use `jax.jvp` for forward-mode, which computes Jacobian-vector products....
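A small comparison, assuming a toy scalar-valued function `f`; `jax.jvp` returns both the primal output and the directional derivative along the tangent:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.arange(3.0)

# Reverse mode: the full gradient in one backward pass.
g = jax.grad(f)(x)

# Forward mode: the directional derivative along tangent v, no backward pass needed.
v = jnp.array([1.0, 0.0, 0.0])
y, dy = jax.jvp(f, (x,), (v,))

print(g, dy)  # dy equals g @ v
```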
-
Custom VJP
For some functions, you may want to define a custom vector-Jacobian product (VJP). This can be useful for numerical stability or for implementing algorithms that are not easily expressed in terms...
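A classic numerical-stability example, sketched with `jax.custom_vjp`: a hand-written backward pass for `log(1 + exp(x))` that uses the sigmoid instead of the traced derivative:

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    return jnp.log1p(jnp.exp(x))

def log1pexp_fwd(x):
    return log1pexp(x), x                 # save x as the residual

def log1pexp_bwd(x, g):
    return (g * jax.nn.sigmoid(x),)       # d/dx log(1 + e^x) = sigmoid(x)

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

print(jax.grad(log1pexp)(100.0))  # 1.0, instead of the nan the traced derivative produces
```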
-
Parallelization with `pmap`
For large models, it is often necessary to train on multiple devices (e.g., GPUs or TPUs). JAX's `pmap` transformation allows for easy parallelization of computations across devices. In this...
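A sketch of a data-parallel gradient step with `pmap`, assuming the batch is sharded along a leading device axis and the parameters are replicated; it runs on however many local devices are present (including just one):

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def parallel_step(w, x, y, lr):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")   # average gradients across devices
    return w - lr * grads

n_dev = jax.local_device_count()
key = jax.random.PRNGKey(0)
w = jnp.broadcast_to(jnp.zeros((4,)), (n_dev, 4))       # replicate parameters
x = jax.random.normal(key, (n_dev, 8, 4))               # shard the batch per device
y = jnp.ones((n_dev, 8))
lr = jnp.full((n_dev,), 0.1)
w = parallel_step(w, x, y, lr)
```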
-
The Elegant Gradient of Softmax-Cross-Entropy
One of the most satisfying derivations in deep learning is the gradient of the combined Softmax and Cross-Entropy loss. For a multi-class classification problem with $K$ classes, given true labels...
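The punchline of the derivation is that the gradient with respect to the logits is simply `softmax(z) - y`. A small numerical check of that claim (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = 5
z = torch.randn(K, requires_grad=True)
y = F.one_hot(torch.tensor(2), num_classes=K).float()   # one-hot true label

# L = -sum_k y_k log softmax(z)_k
loss = -(y * F.log_softmax(z, dim=-1)).sum()
loss.backward()

print(torch.allclose(z.grad, F.softmax(z.detach(), dim=-1) - y))  # True
```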
-
Deconstructing Self-Attention Scores
The self-attention mechanism is a core component of Transformers. Let's break down how attention scores are calculated. 1. **Query, Key, Value**: In self-attention, each input token (or its...
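A single-head sketch of the score computation, assuming an input of shape `(seq_len, d_model)` and illustrative projection matrices `W_q`, `W_k`, `W_v`:

```python
import math
import torch

def attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)             # each row is a distribution
    return weights @ V, weights

x = torch.randn(4, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
out, attn = attention(x, W_q, W_k, W_v)
print(out.shape, attn.sum(dim=-1))  # (4, 8); attention rows sum to 1
```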
-
Dissecting the Variational Autoencoder's ELBO
Variational Autoencoders (VAEs) are powerful generative models that optimize a lower bound on the data log-likelihood, known as the Evidence Lower Bound (ELBO). The ELBO for a single data point...
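A sketch of the negative ELBO for a batch, assuming a Gaussian encoder `q(z|x) = N(mu, diag(exp(logvar)))`, a standard-normal prior, and a Bernoulli decoder:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon_logits, mu, logvar):
    # -E_q[log p(x|z)], single-sample Monte Carlo estimate via binary cross-entropy
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # minimizing this maximizes the ELBO
```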
-
PCA from First Principles
Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique. It works by transforming the data into a new coordinate system such that the greatest variance by any...
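A from-scratch sketch via the eigendecomposition of the covariance matrix; function and variable names are illustrative:

```python
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # keep the top-k directions of variance
    components = eigvecs[:, order]
    return X_centered @ components, components

X = np.random.randn(100, 5)
Z, W = pca(X, 2)
print(Z.shape)  # (100, 2)
```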
-
SVD for Image Compression
Singular Value Decomposition (SVD) is a powerful matrix factorization technique with numerous applications, including dimensionality reduction, noise reduction, and data compression. Any real $m...
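A rank-k compression sketch on a stand-in grayscale image; keeping only the k largest singular values gives the best rank-k approximation in the Frobenius norm:

```python
import numpy as np

def compress(image, k):
    U, S, Vt = np.linalg.svd(image, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k reconstruction

img = np.random.rand(256, 256)                      # stand-in for a grayscale image
approx = compress(img, 20)
print(np.linalg.norm(img - approx) / np.linalg.norm(img))  # relative reconstruction error
```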
-
Softmax and its Jacobian
The softmax function is a critical component in multi-class classification, converting a vector of arbitrary real values into a probability distribution. Given an input vector $\mathbf{z} = [z_1,...
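The Jacobian has the closed form `J = diag(s) - s s^T` with `s = softmax(z)`; a small sketch that also uses the standard max-subtraction trick for stability:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)   # J_ij = s_i (delta_ij - s_j)

z = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(z)
print(J.sum(axis=1))  # ~0: rows sum to zero because softmax outputs sum to 1
```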
-
Backpropagation for a Single-Layer Network
Backpropagation is the cornerstone algorithm for training neural networks. It efficiently calculates the gradients of the loss function with respect to all the weights and biases in the network by...
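A manual forward/backward sketch for one hidden layer with a sigmoid activation and squared-error loss, applying the chain rule layer by layer (all names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = x @ W1 + b1
    a1 = sigmoid(z1)
    y_hat = a1 @ W2 + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass (chain rule)
    d_yhat = y_hat - y                 # dL/dy_hat
    dW2 = np.outer(a1, d_yhat)
    db2 = d_yhat
    d_a1 = W2 @ d_yhat                 # propagate through the output layer
    d_z1 = d_a1 * a1 * (1 - a1)        # sigmoid derivative
    dW1 = np.outer(x, d_z1)
    db1 = d_z1
    return loss, (dW1, db1, dW2, db2)

x, y = np.random.randn(3), np.random.randn(2)
W1, b1, W2, b2 = np.random.randn(3, 4), np.zeros(4), np.random.randn(4, 2), np.zeros(2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)
```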
-
MoE Aggregator: Combining Expert Outputs
After tokens have been dispatched to and processed by their respective experts, the outputs need to be combined based on the weights from the gating network. This exercise focuses on this...
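A sketch of the dense case, assuming every expert produced an output for every token and the gate weights are a per-token softmax over experts:

```python
import torch

def aggregate(expert_outputs, gate_weights):
    # expert_outputs: (num_experts, batch, d_model)
    # gate_weights:   (batch, num_experts), rows sum to 1
    return torch.einsum("ebd,be->bd", expert_outputs, gate_weights)

expert_outputs = torch.randn(4, 16, 32)
gate_weights = torch.softmax(torch.randn(16, 4), dim=-1)
print(aggregate(expert_outputs, gate_weights).shape)  # (16, 32)
```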
-
Building a Simple Mixture of Experts (MoE) Layer
Now, let's combine the concepts of dispatching and aggregating into a full, albeit simplified, `torch.nn.Module` for a Mixture of Experts layer. This layer will replace a standard feed-forward...
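A simplified, dense sketch of such a module: a linear gate scores the experts and each token's output is the gate-weighted mix of all expert FFNs (a production MoE would route each token to only its top-k experts):

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts))

    def forward(self, x):                                   # x: (batch, d_model)
        gates = torch.softmax(self.gate(x), dim=-1)         # (batch, E)
        outs = torch.stack([e(x) for e in self.experts])    # (E, batch, d_model)
        return torch.einsum("ebd,be->bd", outs, gates)      # gate-weighted combination

layer = MoELayer(32, 64, num_experts=4)
print(layer(torch.randn(8, 32)).shape)  # (8, 32)
```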
-
Implementing a Siamese Network with Triplet Loss
Building on the previous exercise, let's switch to **Triplet Loss**. This loss function is more powerful because it enforces a margin between the anchor-positive distance and the anchor-negative distance. The...
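A from-scratch sketch of the loss itself, assuming the three inputs are embeddings already produced by the shared network:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Hinge: penalize unless d_neg exceeds d_pos by at least the margin.
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

a, p, n = torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128)
print(triplet_loss(a, p, n))
```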
-
Implementing Self-Supervised Learning with BYOL
Implement the core logic of **Bootstrap Your Own Latent (BYOL)**. BYOL is a self-supervised learning method that learns image representations without using negative pairs. It consists of two...
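A sketch of the two pieces that make BYOL work, assuming the full encoder/projector/predictor architecture is built elsewhere: the negative-cosine loss with a stop-gradient on the target branch, and the EMA update of the target weights:

```python
import copy
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target.detach(), dim=-1)      # stop-gradient through the target
    return (2 - 2 * (p * z).sum(dim=-1)).mean()     # equivalent to negative cosine similarity

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    # Target weights are an exponential moving average of the online weights.
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(tau).add_(o, alpha=1 - tau)

online = torch.nn.Sequential(torch.nn.Linear(32, 32))   # stand-in for the online network
target = copy.deepcopy(online)                           # target starts as a copy
```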
-
Implementing a Multi-Headed Attention Mechanism
Expand on the previous attention exercise by implementing a **Multi-Headed Attention mechanism** from scratch. A single attention head is the scaled dot-product attention you've already implemented. Multi-head...
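A self-contained sketch: project into `h` heads, attend per head in parallel, then concatenate and project back (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5  # (B, h, T, T)
        out = torch.softmax(scores, dim=-1) @ v             # (B, h, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, -1)         # concatenate the heads
        return self.out_proj(out)

mha = MultiHeadAttention(64, num_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```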
-
Implementing a Masked Language Model
Implement a **Masked Language Model (MLM)**, a technique at the heart of models like BERT. Given a sentence, you'll randomly mask some of the words and then train a model to predict those masked...
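A sketch of the masking step itself, assuming token ids and a `[MASK]` id chosen elsewhere; positions that are not masked get label `-100` so `CrossEntropyLoss` ignores them:

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob   # choose ~15% of positions
    labels[~mask] = -100                              # loss is computed only at masked positions
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id                  # replace chosen tokens with [MASK]
    return masked_ids, labels

ids = torch.randint(0, 1000, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=999)
```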
-
Building a Graph Autoencoder
Implement a **Graph Autoencoder (GAE)** for graph representation learning. The encoder will use a GNN to produce node embeddings, and the decoder will reconstruct the graph's adjacency matrix from...
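A minimal dense-adjacency sketch: one GCN-style propagation step as the encoder and an inner-product decoder that scores every node pair:

```python
import torch
import torch.nn as nn

class GAE(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, hidden_dim, bias=False)

    def encode(self, X, A):
        A_hat = A + torch.eye(A.size(0))                   # add self-loops
        d = A_hat.sum(dim=1).pow(-0.5)
        A_norm = d[:, None] * A_hat * d[None, :]           # D^{-1/2} A_hat D^{-1/2}
        return torch.relu(A_norm @ self.W(X))              # node embeddings Z

    def decode(self, Z):
        return torch.sigmoid(Z @ Z.t())                    # reconstructed adjacency probabilities

X = torch.randn(5, 8)
A = torch.randint(0, 2, (5, 5)).float()
A = ((A + A.t()) > 0).float()                              # symmetrize the toy graph
model = GAE(8, 16)
A_recon = model.decode(model.encode(X, A))
```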
-
Adversarial Training for Robustness
Implement **adversarial training** on a simple classification model like a small CNN on MNIST. The goal is to make the model robust to adversarial attacks. You'll need to generate adversarial...
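A sketch of one FGSM-based training step, assuming inputs in `[0, 1]` (as with MNIST) and a standard classifier/optimizer pair:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.1):
    # Perturb the input in the direction of the sign of the loss gradient.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_step(model, optimizer, x, y, epsilon=0.1):
    x_adv = fgsm(model, x, y, epsilon)          # craft adversarial examples
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)     # train on the perturbed batch
    loss.backward()
    optimizer.step()
    return loss.item()
```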
-
Neural Style Transfer
Implement **Neural Style Transfer**. Given a content image and a style image, generate a new image that combines the content of the former with the style of the latter. Use a pre-trained VGG...
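A sketch of the two losses at the heart of the method, assuming `features` are activations from a chosen pre-trained VGG layer with shape `(channels, H, W)`; the full exercise optimizes the generated image against a weighted sum of these:

```python
import torch

def gram_matrix(features):
    C, H, W = features.shape
    F_flat = features.view(C, H * W)
    return F_flat @ F_flat.t() / (C * H * W)   # channel-wise feature correlations

def content_loss(gen_feat, content_feat):
    return torch.mean((gen_feat - content_feat) ** 2)

def style_loss(gen_feat, style_feat):
    return torch.mean((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2)
```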
-
Training a Variational Autoencoder (VAE)
Implement and train a **Variational Autoencoder (VAE)** on a dataset like MNIST. The encoder should map the input to a latent space distribution (mean and variance), and the decoder should...
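A module skeleton with the reparameterization trick so gradients can flow through the sampling step; the negative-ELBO sketch from the ELBO entry above can serve as the training loss (layer sizes are illustrative, e.g. 784 for flattened MNIST):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 400), nn.ReLU())
        self.mu = nn.Linear(400, latent_dim)
        self.logvar = nn.Linear(400, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                 nn.Linear(400, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar                            # reconstruction logits
```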
-
Implementing the Adam Optimizer from Scratch
Implement the **Adam optimizer from scratch** as a subclass of `torch.optim.Optimizer`. You'll need to manage the first-moment vector (moving average of gradients) and the second-moment vector...
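A sketch of such a subclass, following the update rule from the original paper (moment buffers with bias correction); treat it as a starting point rather than a drop-in replacement for `torch.optim.Adam`:

```python
import torch

class MyAdam(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:                              # lazy per-parameter state init
                    state["t"] = 0
                    state["m"] = torch.zeros_like(p)       # first moment
                    state["v"] = torch.zeros_like(p)       # second moment
                state["t"] += 1
                m, v, t = state["m"], state["v"], state["t"]
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)               # bias correction
                v_hat = v / (1 - beta2 ** t)
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-group["lr"])
```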
-
Building a Transformer Encoder from Scratch
Implement a single layer of a **Transformer Encoder** from scratch, without using `torch.nn.TransformerEncoderLayer`. This requires implementing a multi-head self-attention module and a...
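A sketch of one post-norm encoder layer, reusing the `MultiHeadAttention` sketch from the multi-headed attention entry above; hyperparameter names are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from the earlier sketch
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        x = self.norm1(x + self.drop(self.attn(x)))          # residual + layer norm
        return self.norm2(x + self.drop(self.ffn(x)))        # position-wise feed-forward

layer = EncoderLayer(64, num_heads=8, d_ff=256)
print(layer(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```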
-
Implementing a Siamese Network for Similarity Learning
Build and train a **Siamese network** on a dataset like MNIST. The network takes pairs of images as input and learns to determine if they belong to the same class (a positive pair) or different...
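A sketch of the contrastive loss used on the pair embeddings (both branches share one embedding network with tied weights); here label 1 marks a positive pair:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    d = F.pairwise_distance(emb1, emb2)
    # Pull positive pairs together, push negative pairs at least `margin` apart.
    return (label * d.pow(2) +
            (1 - label) * torch.clamp(margin - d, min=0).pow(2)).mean()

e1, e2 = torch.randn(16, 64), torch.randn(16, 64)
label = torch.randint(0, 2, (16,)).float()
print(contrastive_loss(e1, e2, label))
```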
-
Generative Adversarial Network (GAN) on MNIST
Implement and train a simple **Generative Adversarial Network (GAN)**. The network consists of a generator and a discriminator. The generator takes a random noise vector and tries to generate a...
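A sketch of one training iteration, assuming `G` maps noise to an image and `D` outputs a single real/fake logit per example:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=64):
    B = real.size(0)
    fake = G(torch.randn(B, z_dim))

    # Discriminator update: real -> 1, fake -> 0 (fake detached so G is untouched).
    opt_d.zero_grad()
    d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(B, 1)) +
              F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(B, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    opt_g.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(B, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```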