-
Working with an optimizer library: Optax
Optax is a popular library for optimization in JAX. It provides a wide range of optimizers and is designed to be highly modular. In this exercise, you will use Optax to train the Flax MLP from the...
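A minimal Optax training-step sketch, assuming a toy parameter pytree and a squared-error loss in place of the full Flax MLP from the referenced exercise:

```python
import jax
import jax.numpy as jnp
import optax

# Toy parameters standing in for the Flax MLP's params pytree (assumption:
# the real exercise would obtain these from model.init(...)).
params = {"w": jnp.ones((3,)), "b": jnp.zeros(())}

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

optimizer = optax.adam(learning_rate=1e-2)
opt_state = optimizer.init(params)

@jax.jit
def train_step(params, opt_state, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss

x, y = jnp.ones((8, 3)), jnp.zeros((8,))
params, opt_state, loss = train_step(params, opt_state, x, y)
```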
-
Parallelization with `pmap`
For large models, it is often necessary to train on multiple devices (e.g., GPUs or TPUs). JAX's `pmap` transformation allows for easy parallelization of computations across devices. In this...
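A small `pmap` sketch, assuming the batch's leading axis has been split to match `jax.local_device_count()`; the per-device means are combined with an all-reduce:

```python
import functools
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

@functools.partial(jax.pmap, axis_name="devices")
def device_mean(x):
    # Each device averages its own shard, then the partial means are
    # averaged across devices with jax.lax.pmean (an all-reduce).
    return jax.lax.pmean(jnp.mean(x), axis_name="devices")

# The leading axis must equal the number of devices; each row goes to one device.
batch = jnp.arange(n_dev * 4, dtype=jnp.float32).reshape(n_dev, 4)
print(device_mean(batch))  # the same global mean, replicated on every device
```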
-
Checkpointing
When training large models, it is important to save the model's parameters periodically. This is known as checkpointing and allows you to resume training from a saved state in case of an...
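A bare-bones checkpointing sketch using `pickle` on the parameter pytree; the pytree and path below are hypothetical, and production training loops typically use a dedicated checkpointing library such as Orbax:

```python
import pickle
import jax
import jax.numpy as jnp

# Hypothetical params pytree, purely for illustration.
params = {"dense": {"kernel": jnp.ones((4, 2)), "bias": jnp.zeros((2,))}}

def save_checkpoint(path, params, step):
    # Pull arrays back to host memory before serializing.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "params": jax.device_get(params)}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["params"], ckpt["step"]

save_checkpoint("/tmp/ckpt_100.pkl", params, step=100)
restored_params, step = load_checkpoint("/tmp/ckpt_100.pkl")
```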
-
The Elegant Gradient of Softmax-Cross-Entropy
One of the most satisfying derivations in deep learning is the gradient of the combined Softmax and Cross-Entropy loss. For a multi-class classification problem with $K$ classes, given true labels...
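The well-known result is $\frac{\partial L}{\partial z_i} = \text{softmax}(z)_i - y_i$ for one-hot labels $y$; a quick numerical check against `jax.grad`:

```python
import jax
import jax.numpy as jnp

def cross_entropy(z, y_onehot):
    log_p = jax.nn.log_softmax(z)
    return -jnp.sum(y_onehot * log_p)

z = jnp.array([2.0, -1.0, 0.5, 0.0])
y = jax.nn.one_hot(2, 4)

autodiff_grad = jax.grad(cross_entropy)(z, y)
closed_form = jax.nn.softmax(z) - y          # the "elegant" result: p - y
print(jnp.allclose(autodiff_grad, closed_form, atol=1e-6))  # True
```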
-
Softmax's Numerical Stability: The Max Trick
While the standard softmax formula $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ is mathematically correct, a direct implementation can lead to numerical instability due to potential...
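A quick demonstration of the instability and of the max-subtraction fix, with illustrative values chosen to overflow `float32`:

```python
import jax.numpy as jnp

def naive_softmax(z):
    e = jnp.exp(z)                      # overflows once z is large
    return e / jnp.sum(e)

def stable_softmax(z):
    z_shifted = z - jnp.max(z)          # largest exponent is now exp(0) = 1
    e = jnp.exp(z_shifted)
    return e / jnp.sum(e)

z = jnp.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(z))   # [nan nan nan] -- exp(1000) overflows to inf
print(stable_softmax(z))  # well-behaved probabilities
```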
-
Tracing Gradient Descent on a Parabola
Imagine a simple 1D function $f(x) = x^2 - 4x + 5$. Your goal is to find the minimum of this function using Gradient Descent. 1. **Derive the gradient**: What is $\frac{df}{dx}$? 2. **Perform a...
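A short sketch tracing the descent numerically with `jax.grad`; the learning rate and iteration count are illustrative choices:

```python
import jax

f = lambda x: x**2 - 4.0 * x + 5.0      # minimum at x = 2, f(2) = 1
grad_f = jax.grad(f)                    # analytically, df/dx = 2x - 4

x, lr = 0.0, 0.1
for step in range(50):
    x = x - lr * grad_f(x)              # x <- x - lr * df/dx
print(x, f(x))                          # x approaches 2.0, f(x) approaches 1.0
```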
-
The Gradients of Activation Functions
Activation functions introduce non-linearity into neural networks, but their derivatives are crucial for backpropagation. 1. **Sigmoid**: Given $\sigma(x) = \frac{1}{1 + e^{-x}}$, derive...
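A numerical check of two standard derivatives, $\sigma'(x) = \sigma(x)(1-\sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$, against `jax.grad`:

```python
import jax
import jax.numpy as jnp

sigmoid = lambda x: 1.0 / (1.0 + jnp.exp(-x))

x = 0.7
auto = jax.grad(sigmoid)(x)
manual = sigmoid(x) * (1.0 - sigmoid(x))     # sigma'(x) = sigma(x)(1 - sigma(x))
print(jnp.allclose(auto, manual))            # True

auto_tanh = jax.grad(jnp.tanh)(x)
manual_tanh = 1.0 - jnp.tanh(x) ** 2         # tanh'(x) = 1 - tanh(x)^2
print(jnp.allclose(auto_tanh, manual_tanh))  # True
```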
-
Cross-Entropy: A Measure of Surprise
Cross-entropy loss is fundamental for classification tasks. Let's build some intuition for its formulation. 1. **Definition**: For a binary classification problem, the binary cross-entropy (BCE)...
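A minimal BCE sketch, assuming the usual form $-[y\log p + (1-y)\log(1-p)]$ with clipping for numerical safety:

```python
import jax.numpy as jnp

def bce(y_true, p_pred, eps=1e-7):
    # Clipping keeps log() finite when a prediction saturates at 0 or 1.
    p = jnp.clip(p_pred, eps, 1.0 - eps)
    return -(y_true * jnp.log(p) + (1.0 - y_true) * jnp.log(1.0 - p))

# A confident correct prediction is "unsurprising" (low loss); a confident
# wrong prediction is very "surprising" (high loss).
print(bce(1.0, 0.99))   # ~0.01
print(bce(1.0, 0.01))   # ~4.6
```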
-
L2 Regularization's Gradient Impact
L2 regularization (also known as weight decay) is a common technique to prevent overfitting. 1. **Loss Function**: Consider a simple linear regression loss with L2 regularization: $J(\mathbf{w},...
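A tiny sketch, assuming the $\frac{\lambda}{2}\lVert\mathbf{w}\rVert^2$ convention, showing that the penalty's gradient is simply $\lambda\mathbf{w}$, i.e. a constant pull of every weight toward zero:

```python
import jax
import jax.numpy as jnp

lam = 0.1  # regularization strength (illustrative value)

def l2_penalty(w):
    return 0.5 * lam * jnp.sum(w ** 2)

w = jnp.array([1.0, -2.0, 3.0])
print(jax.grad(l2_penalty)(w))   # equals lam * w -> the "weight decay" term
print(lam * w)
```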
-
Deconstructing Self-Attention Scores
The self-attention mechanism is a core component of Transformers. Let's break down how attention scores are calculated. 1. **Query, Key, Value**: In self-attention, each input token (or its...
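A single-head, unbatched sketch of scaled dot-product self-attention; the shapes and random projection matrices are illustrative:

```python
import jax
import jax.numpy as jnp

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to Q, K, V
    d_k = q.shape[-1]
    scores = q @ k.T / jnp.sqrt(d_k)             # scaled dot-product scores
    weights = jax.nn.softmax(scores, axis=-1)    # each row sums to 1
    return weights @ v                           # attention-weighted mix of values

seq_len, d_model, d_head = 4, 8, 8
x = jax.random.normal(jax.random.PRNGKey(0), (seq_len, d_model))
w_q, w_k, w_v = (jax.random.normal(jax.random.PRNGKey(i), (d_model, d_head))
                 for i in range(1, 4))
print(self_attention(x, w_q, w_k, w_v).shape)    # (4, 8)
```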
-
The Stabilizing Power of Batch Normalization
Batch Normalization (BatchNorm) is a crucial technique for stabilizing and accelerating deep neural network training. 1. **Normalization Step**: Given a mini-batch of activations $X = \{x_1, x_2,...
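A training-mode sketch of the normalization step (running statistics for inference are omitted):

```python
import jax.numpy as jnp

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = jnp.mean(x, axis=0)                  # per-feature batch mean
    var = jnp.var(x, axis=0)                  # per-feature batch variance
    x_hat = (x - mu) / jnp.sqrt(var + eps)    # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

x = jnp.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
out = batch_norm(x, gamma=jnp.ones(2), beta=jnp.zeros(2))
print(out.mean(axis=0), out.std(axis=0))      # ~0 and ~1 per feature
```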
-
The Implicit Higher Dimension of Kernels
Support Vector Machines (SVMs) are powerful classifiers, and the "kernel trick" allows them to find non-linear decision boundaries without explicitly mapping data to high-dimensional spaces. 1. **Linear...
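A small check that a degree-2 polynomial kernel on 2-D inputs matches an explicit feature map into 3-D, even though that map is never formed during kernel evaluation:

```python
import jax.numpy as jnp

def poly2_kernel(x, y):
    return jnp.dot(x, y) ** 2                  # computed entirely in the original 2-D space

def phi(x):
    # Explicit degree-2 feature map: the 3-D space the kernel works in implicitly.
    return jnp.array([x[0] ** 2, jnp.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = jnp.array([1.0, 2.0])
y = jnp.array([3.0, -1.0])
print(poly2_kernel(x, y), jnp.dot(phi(x), phi(y)))   # both equal (x . y)^2 = 1, up to float rounding
```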
-
Dissecting the Variational Autoencoder's ELBO
Variational Autoencoders (VAEs) are powerful generative models that optimize a lower bound on the data log-likelihood, known as the Evidence Lower Bound (ELBO). The ELBO for a single data point...
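A sketch of two ELBO ingredients for a diagonal Gaussian posterior: the closed-form KL term against a standard normal prior, and the reparameterization trick (the encoder and decoder networks are omitted):

```python
import jax
import jax.numpy as jnp

def gaussian_kl(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior.
    return 0.5 * jnp.sum(mu ** 2 + jnp.exp(logvar) - 1.0 - logvar)

def reparameterize(key, mu, logvar):
    # z = mu + sigma * eps keeps the sample differentiable w.r.t. mu and logvar.
    eps = jax.random.normal(key, mu.shape)
    return mu + jnp.exp(0.5 * logvar) * eps

mu = jnp.array([0.5, -0.2])
logvar = jnp.array([-1.0, 0.3])
z = reparameterize(jax.random.PRNGKey(0), mu, logvar)
print(gaussian_kl(mu, logvar), z)
```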
-
Riding the Momentum Wave in Optimization
Stochastic Gradient Descent (SGD) with momentum is a popular optimization algorithm that often converges faster and more stably than plain SGD. 1. **Update Rule**: The update rule for SGD with...
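A sketch using one common convention, $v \leftarrow \beta v + \nabla f(w)$ and $w \leftarrow w - \eta v$ (conventions vary; some fold the learning rate into $v$):

```python
import jax
import jax.numpy as jnp

f = lambda w: jnp.sum(w ** 2)        # toy convex objective with minimum at the origin
grad_f = jax.grad(f)

w = jnp.array([5.0, -3.0])
v = jnp.zeros_like(w)
lr, beta = 0.1, 0.9

for _ in range(300):
    g = grad_f(w)
    v = beta * v + g                 # exponentially decaying accumulation of gradients
    w = w - lr * v                   # step along the accumulated "velocity"
print(w)                             # converges toward the minimum at the origin
```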
-
KL Divergence Calculation and Interpretation
The Kullback-Leibler (KL) Divergence (also known as relative entropy) is a non-symmetric measure of how one probability distribution $P$ is different from a second, reference probability...
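A direct implementation for discrete distributions, with a small example exposing the asymmetry:

```python
import jax.numpy as jnp

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.
    p = jnp.clip(p, eps, 1.0)
    q = jnp.clip(q, eps, 1.0)
    return jnp.sum(p * jnp.log(p / q))

p = jnp.array([0.5, 0.3, 0.2])
q = jnp.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # D_KL(P||Q) != D_KL(Q||P)
```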
-
PCA from First Principles
Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique. It works by transforming the data into a new coordinate system such that the greatest variance by any...
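A covariance-plus-eigendecomposition sketch of PCA (an SVD of the centered data matrix is an equally valid route):

```python
import jax
import jax.numpy as jnp

def pca(x, k):
    x_centered = x - x.mean(axis=0)                       # center each feature
    cov = x_centered.T @ x_centered / (x.shape[0] - 1)    # sample covariance matrix
    eigvals, eigvecs = jnp.linalg.eigh(cov)               # eigh returns ascending eigenvalues
    top = eigvecs[:, ::-1][:, :k]                         # k directions of greatest variance
    return x_centered @ top                               # project onto principal components

x = jax.random.normal(jax.random.PRNGKey(0), (100, 5))
print(pca(x, 2).shape)    # (100, 2)
```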
-
SVD for Image Compression
Singular Value Decomposition (SVD) is a powerful matrix factorization technique with numerous applications, including dimensionality reduction, noise reduction, and data compression. Any real $m...
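A rank-$k$ truncation sketch on a random stand-in for a grayscale image:

```python
import jax
import jax.numpy as jnp

def rank_k_approx(a, k):
    u, s, vt = jnp.linalg.svd(a, full_matrices=False)
    # Keep only the k largest singular values/vectors: the best rank-k
    # approximation in the Frobenius-norm sense.
    return u[:, :k] @ jnp.diag(s[:k]) @ vt[:k, :]

img = jax.random.uniform(jax.random.PRNGKey(0), (64, 64))    # stand-in for a grayscale image
approx = rank_k_approx(img, k=8)
print(jnp.linalg.norm(img - approx) / jnp.linalg.norm(img))  # relative reconstruction error
```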
-
Numerical Gradient Verification
Understanding and correctly implementing backpropagation is crucial in deep learning. A common way to debug backpropagation is using numerical gradient checking. This involves approximating the...
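A central-difference checker compared against `jax.grad` on a small test function; the achievable agreement is limited mainly by `float32` precision:

```python
import jax
import jax.numpy as jnp

def f(w):
    return jnp.sum(jnp.sin(w) * w ** 2)

def numerical_grad(f, w, h=1e-3):
    # Central differences: perturb one coordinate at a time.
    grads = []
    for i in range(w.shape[0]):
        e = jnp.zeros_like(w).at[i].set(h)
        grads.append((f(w + e) - f(w - e)) / (2.0 * h))
    return jnp.stack(grads)

w = jnp.array([0.3, -1.2, 2.0])
num, ana = numerical_grad(f, w), jax.grad(f)(w)
print(jnp.max(jnp.abs(num - ana)))   # small; limited mainly by float32 round-off
```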
-
Softmax and its Jacobian
The softmax function is a critical component in multi-class classification, converting a vector of arbitrary real values into a probability distribution. Given an input vector $\mathbf{z} = [z_1,...
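The Jacobian is $J_{ij} = s_i(\delta_{ij} - s_j)$, i.e. $\text{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top$; a quick check against `jax.jacfwd`:

```python
import jax
import jax.numpy as jnp

z = jnp.array([1.0, 2.0, 0.5])
s = jax.nn.softmax(z)

analytic = jnp.diag(s) - jnp.outer(s, s)             # J_ij = s_i (delta_ij - s_j)
autodiff = jax.jacfwd(jax.nn.softmax)(z)
print(jnp.allclose(analytic, autodiff, atol=1e-6))   # True
```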
-
Numerical Stability: Log-Sum-Exp
When dealing with probabilities, especially in log-space, sums of exponentials can lead to numerical underflow or overflow. For example, computing $\log \left( \sum_i \exp(x_i) \right)$ can be...
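A demonstration of the overflow and of the shifted form $m + \log\sum_i \exp(x_i - m)$ with $m = \max_i x_i$:

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def naive_lse(x):
    return jnp.log(jnp.sum(jnp.exp(x)))          # overflows for large x

def stable_lse(x):
    m = jnp.max(x)
    return m + jnp.log(jnp.sum(jnp.exp(x - m)))  # exponents are now <= 0

x = jnp.array([1000.0, 999.0, 998.0])
print(naive_lse(x))    # inf
print(stable_lse(x))   # ~1000.41
print(logsumexp(x))    # JAX's built-in agrees
```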
-
L2 Regularization Gradient
L2 regularization (also known as Ridge Regression or weight decay) is a common technique to prevent overfitting in machine learning models by adding a penalty proportional to the square of the...
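A sketch checking the ridge-regression gradient, assuming the loss $\frac{1}{2n}\lVert X\mathbf{w} - \mathbf{y}\rVert^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2$, whose gradient is $\frac{1}{n}X^\top(X\mathbf{w}-\mathbf{y}) + \lambda\mathbf{w}$:

```python
import jax
import jax.numpy as jnp

lam = 0.5  # regularization strength (illustrative value)

def ridge_loss(w, X, y):
    residual = X @ w - y
    return 0.5 * jnp.mean(residual ** 2) + 0.5 * lam * jnp.sum(w ** 2)

X = jax.random.normal(jax.random.PRNGKey(0), (20, 3))
y = jax.random.normal(jax.random.PRNGKey(1), (20,))
w = jnp.array([1.0, -1.0, 0.5])

closed_form = X.T @ (X @ w - y) / X.shape[0] + lam * w
print(jnp.allclose(jax.grad(ridge_loss)(w, X, y), closed_form, atol=1e-5))   # True
```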
-
Linear Regression via Gradient Descent
Linear regression is a foundational supervised learning algorithm. Given a dataset of input features $X$ and corresponding target values $y$, the goal is to find a linear relationship $y =...
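A gradient-descent sketch on synthetic data drawn from a known line; the data-generating line and hyperparameters are illustrative:

```python
import jax
import jax.numpy as jnp

# Synthetic data from y = 3x + 2 plus a little noise.
x = jax.random.uniform(jax.random.PRNGKey(0), (100,))
y = 3.0 * x + 2.0 + 0.05 * jax.random.normal(jax.random.PRNGKey(1), (100,))

def mse(params, x, y):
    w, b = params
    return jnp.mean((w * x + b - y) ** 2)

params = jnp.array([0.0, 0.0])      # (w, b) initialized at zero
lr = 0.5
grad_fn = jax.jit(jax.grad(mse))
for _ in range(500):
    params = params - lr * grad_fn(params, x, y)
print(params)    # close to the true (w, b) = (3, 2)
```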
-
Backpropagation for a Single-Layer Network
Backpropagation is the cornerstone algorithm for training neural networks. It efficiently calculates the gradients of the loss function with respect to all the weights and biases in the network by...
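A single-layer sketch (sigmoid activation and squared-error loss, both illustrative choices) with the backward pass written out by hand and checked against `jax.grad`:

```python
import jax
import jax.numpy as jnp

def forward(params, x):
    W, b = params
    return jax.nn.sigmoid(x @ W + b)

def loss(params, x, y):
    return 0.5 * jnp.sum((forward(params, x) - y) ** 2)

x = jax.random.normal(jax.random.PRNGKey(0), (5, 3))
y = jnp.ones((5, 2))
W = jax.random.normal(jax.random.PRNGKey(1), (3, 2))
b = jnp.zeros(2)

# Manual backward pass via the chain rule.
a = forward((W, b), x)
dz = (a - y) * a * (1.0 - a)        # dL/da * da/dz for sigmoid
dW_manual = x.T @ dz                # dL/dW
db_manual = dz.sum(axis=0)          # dL/db, summed over the batch

dW_auto, db_auto = jax.grad(loss)((W, b), x, y)
print(jnp.allclose(dW_manual, dW_auto, atol=1e-5),
      jnp.allclose(db_manual, db_auto, atol=1e-5))   # True True
```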
-
Matrix Multiplication and Efficiency
Matrix multiplication is a fundamental operation in linear algebra and a cornerstone of deep learning. Given two matrices $A$ (size $m \times k$) and $B$ (size $k \times n$), their product $C =...
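A naive triple-loop implementation, making the $O(m \cdot k \cdot n)$ scalar multiply-adds explicit, checked against the optimized `@` operator:

```python
import jax
import jax.numpy as jnp

def naive_matmul(a, b):
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = jnp.zeros((m, n))
    # Three nested loops: m * n * k scalar multiply-adds, i.e. O(mkn) work.
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c = c.at[i, j].add(a[i, p] * b[p, j])
    return c

a = jax.random.normal(jax.random.PRNGKey(0), (4, 5))
b = jax.random.normal(jax.random.PRNGKey(1), (5, 3))
print(jnp.allclose(naive_matmul(a, b), a @ b, atol=1e-5))   # True
```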
-
Einops Warm-up: Reshaping Tensors for Expert Batching
In Mixture of Experts (MoE) models, we often need to reshape tensors to efficiently process data across multiple 'experts'. Imagine you have a batch of sequences, and for each token in each...
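A warm-up sketch with `einops.rearrange`, using hypothetical shapes and the simplifying assumption that tokens are already arranged into equal-sized chunks per expert:

```python
import jax
import jax.numpy as jnp
from einops import rearrange

batch, seq, d_model, n_experts = 2, 8, 16, 4
x = jax.random.normal(jax.random.PRNGKey(0), (batch, seq, d_model))

# Flatten (batch, seq) into a single token axis so every token can be
# routed independently of which sequence it came from.
tokens = rearrange(x, "b s d -> (b s) d")                          # (16, 16)

# Group the flat token axis into equal-sized per-expert chunks
# (a simplifying assumption; real routing assigns tokens unevenly).
per_expert = rearrange(tokens, "(e c) d -> e c d", e=n_experts)    # (4, 4, 16)
print(tokens.shape, per_expert.shape)
```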