- Implementing a Simplified Beam Search Decoder with `gather` and `scatter`
  Beam search is a popular decoding algorithm used in machine translation and text generation. A key step in beam search is to select the top-k most likely next tokens and update the corresponding...
- Forward-mode vs. Reverse-mode autodiff
  JAX supports both forward-mode and reverse-mode automatic differentiation. While `grad` uses reverse-mode, you can use `jax.jvp` for forward-mode, which computes Jacobian-vector products....
- Custom VJP
  For some functions, you may want to define a custom vector-Jacobian product (VJP). This can be useful for numerical stability or for implementing algorithms that are not easily expressed in terms...
- Parallelization with `pmap`
  For large models, it is often necessary to train on multiple devices (e.g., GPUs or TPUs). JAX's `pmap` transformation allows for easy parallelization of computations across devices. In this...
- The Elegant Gradient of Softmax-Cross-Entropy
  One of the most satisfying derivations in deep learning is the gradient of the combined Softmax and Cross-Entropy loss. For a multi-class classification problem with $K$ classes, given true labels...
- Deconstructing Self-Attention Scores
  The self-attention mechanism is a core component of Transformers. Let's break down how attention scores are calculated. 1. **Query, Key, Value**: In self-attention, each input token (or its...
- Dissecting the Variational Autoencoder's ELBO
  Variational Autoencoders (VAEs) are powerful generative models that optimize a lower bound on the data log-likelihood, known as the Evidence Lower Bound (ELBO). The ELBO for a single data point...
- PCA from First Principles
  Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique. It works by transforming the data into a new coordinate system such that the greatest variance by any...
- SVD for Image Compression
  Singular Value Decomposition (SVD) is a powerful matrix factorization technique with numerous applications, including dimensionality reduction, noise reduction, and data compression. Any real $m...
- Softmax and its Jacobian
  The softmax function is a critical component in multi-class classification, converting a vector of arbitrary real values into a probability distribution. Given an input vector $\mathbf{z} = [z_1,...
- Backpropagation for a Single-Layer Network
  Backpropagation is the cornerstone algorithm for training neural networks. It efficiently calculates the gradients of the loss function with respect to all the weights and biases in the network by...
- MoE Aggregator: Combining Expert Outputs
  After tokens have been dispatched to and processed by their respective experts, the outputs need to be combined based on the weights from the gating network. This exercise focuses on this...
- Building a Simple Mixture of Experts (MoE) Layer
  Now, let's combine the concepts of dispatching and aggregating into a full, albeit simplified, `torch.nn.Module` for a Mixture of Experts layer. This layer will replace a standard feed-forward...
- Implementing a Siamese Network with Triplet Loss
  Building on the previous exercise, let's switch to **Triplet Loss**. This loss function is more powerful as it enforces a margin between an anchor-positive pair and an anchor-negative pair. The...
- Model Compression with Pruning
  Implement **model pruning** to reduce the size and computational cost of a trained model. Start with a simple, over-parameterized model (e.g., a fully-connected network on MNIST). Train it to a...
- Implementing Self-Supervised Learning with BYOL
  Implement the core logic of **Bootstrap Your Own Latent (BYOL)**. BYOL is a self-supervised learning method that learns image representations without using negative pairs. It consists of two...
- Implementing a Multi-Headed Attention Mechanism
  Expand on the previous attention exercise by implementing a **Multi-Headed Attention mechanism** from scratch. A single attention head is a dot-product attention as you've implemented. Multi-head...
- Implementing a Masked Language Model
  Implement a **Masked Language Model (MLM)**, a technique at the heart of models like BERT. Given a sentence, you'll randomly mask some of the words and then train a model to predict those masked...
- Building a Graph Autoencoder
  Implement a **Graph Autoencoder (GAE)** for graph representation learning. The encoder will use a GNN to produce node embeddings, and the decoder will reconstruct the graph's adjacency matrix from...
- Adversarial Training for Robustness
  Implement **adversarial training** on a simple classification model like a small CNN on MNIST. The goal is to make the model robust to adversarial attacks. You'll need to generate adversarial...
- Neural Style Transfer
  Implement **Neural Style Transfer**. Given a content image and a style image, generate a new image that combines the content of the former with the style of the latter. Use a pre-trained VGG...
- Training a Variational Autoencoder (VAE)
  Implement and train a **Variational Autoencoder (VAE)** on a dataset like MNIST. The encoder should map the input to a latent space distribution (mean and variance), and the decoder should...
- Implementing the Adam Optimizer from Scratch
  Implement the **Adam optimizer from scratch** as a subclass of `torch.optim.Optimizer`. You'll need to manage the first-moment vector (moving average of gradients) and the second-moment vector...
- Building a Transformer Encoder from Scratch
  Implement a single layer of a **Transformer Encoder** from scratch, without using `torch.nn.TransformerEncoderLayer`. This requires implementing a multi-head self-attention module and a...
- Implementing a Siamese Network for Similarity Learning
  Build and train a **Siamese network** on a dataset like MNIST. The network takes pairs of images as input and learns to determine if they belong to the same class (a positive pair) or different...
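
The softmax-cross-entropy entry in the list above is cut off before it states its result. As a rough sketch of the standard identity that derivation usually arrives at (an assumption here, since the teaser is truncated): for logits $\mathbf{z}$ and a one-hot label vector $\mathbf{y}$, the gradient of the loss with respect to $\mathbf{z}$ is $\mathrm{softmax}(\mathbf{z}) - \mathbf{y}$. The function names below are illustrative and not taken from the exercise; the sketch uses only the Python standard library and checks the closed form against central finite differences.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, y):
    """Cross-entropy between softmax(z) and a one-hot label vector y."""
    p = softmax(z)
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p))


def analytic_grad(z, y):
    """Closed-form gradient of cross_entropy w.r.t. the logits: p - y."""
    p = softmax(z)
    return [p_k - y_k for p_k, y_k in zip(p, y)]


def numeric_grad(z, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for k in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[k] += eps
        z_minus[k] -= eps
        diff = cross_entropy(z_plus, y) - cross_entropy(z_minus, y)
        grads.append(diff / (2 * eps))
    return grads


# Hypothetical example values: three classes, true class at index 1.
logits = [2.0, -1.0, 0.5]
one_hot = [0.0, 1.0, 0.0]

grad_closed_form = analytic_grad(logits, one_hot)
grad_finite_diff = numeric_grad(logits, one_hot)
```

The two gradients should agree to within finite-difference error, and the closed-form entries sum to zero because both the softmax output and the one-hot label sum to one.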