-
Advanced Indexing with `gather` for NLP
In Natural Language Processing, it's common to work with sequences of varying lengths. A frequent task is to extract the activations of the last token in each sequence from a tensor of shape...
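A minimal sketch of the kind of `gather` call this exercise is after; since the prompt is truncated, the shapes, the `lengths` tensor, and the right-padding convention below are assumptions:

```python
import torch

# Hypothetical setup: activations of shape (batch, seq_len, hidden) and the
# true (unpadded) length of each sequence.
batch, seq_len, hidden = 3, 5, 4
activations = torch.randn(batch, seq_len, hidden)
lengths = torch.tensor([5, 2, 3])  # number of real tokens per sequence

# Index of the last real token, broadcast across the hidden dimension.
last_idx = (lengths - 1).view(batch, 1, 1).expand(-1, 1, hidden)  # (batch, 1, hidden)

# Gather along the sequence dimension, then drop it.
last_token = activations.gather(dim=1, index=last_idx).squeeze(1)  # (batch, hidden)
```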
-
Sparse Updates with `scatter_add_`
In graph neural networks and other sparse data applications, you often need to update a tensor based on sparse indices. Your exercise is to implement a function that takes a tensor of `values`, a...
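One possible shape for such a function, assuming 1-D `values` and integer indices that may repeat (the name and exact signature are illustrative, not the exercise's required API):

```python
import torch

def scatter_sum(values: torch.Tensor, index: torch.Tensor, num_bins: int) -> torch.Tensor:
    """Accumulate values[i] into the output slot given by index[i]."""
    out = torch.zeros(num_bins, dtype=values.dtype)
    # scatter_add_ sums duplicates, unlike plain indexed assignment which keeps only one.
    out.scatter_add_(dim=0, index=index, src=values)
    return out

values = torch.tensor([1.0, 2.0, 3.0, 4.0])
index = torch.tensor([0, 2, 0, 1])
print(scatter_sum(values, index, num_bins=3))  # tensor([4., 4., 2.])
```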
-
Selecting RoIs (Regions of Interest) with `index_select`
In object detection tasks, after a region proposal network (RPN) suggests potential object locations, these regions of interest (RoIs) need to be extracted from the feature map for further...
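A simplified illustration of the selection step only (real detectors typically follow it with RoI pooling/align on spatial feature maps; the shapes and indices below are made up):

```python
import torch

# One feature vector per candidate region, e.g. after pooling.
features = torch.randn(10, 256)          # 10 candidate regions, 256-d features each
keep = torch.tensor([0, 3, 7])           # indices of the RoIs kept after the RPN

rois = torch.index_select(features, dim=0, index=keep)  # shape (3, 256)
# Equivalent to features[keep], but index_select makes the dimension explicit.
```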
-
Replicating `torch.nn.Embedding` with `gather`
The `torch.nn.Embedding` layer is fundamental in many deep learning models, especially in NLP. Your task is to replicate its forward pass functionality using `torch.gather`. You'll create a...
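One way the replication could look, checked against `torch.nn.Embedding.from_pretrained` initialized with the same weight table (sizes are arbitrary):

```python
import torch

vocab_size, embed_dim = 10, 4
weight = torch.randn(vocab_size, embed_dim)       # the embedding table
token_ids = torch.tensor([1, 5, 5, 0])            # a batch of token indices

# Expand each id across the embedding dimension, then gather rows from the table.
idx = token_ids.unsqueeze(-1).expand(-1, embed_dim)    # (4, embed_dim)
emb_gather = weight.gather(dim=0, index=idx)           # (4, embed_dim)

# Compare with the built-in layer holding the same table.
emb_layer = torch.nn.Embedding.from_pretrained(weight)
assert torch.equal(emb_gather, emb_layer(token_ids))
```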
-
Updating model parameters
In this exercise, you will implement a full training step for the regression problem that you have been working on. 1. Instantiate your model parameters, `W` and `b`, and your data `x` and...
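A sketch of what one such training step might look like, assuming the JAX setting of the neighbouring exercises, a linear model, and a mean-squared-error loss (the synthetic data and learning rate are placeholders):

```python
import jax
import jax.numpy as jnp

# Hypothetical linear model and squared-error loss; names follow the exercise (W, b, x, y).
def loss_fn(W, b, x, y):
    pred = x @ W + b
    return jnp.mean((pred - y) ** 2)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 3))
y = x @ jnp.array([1.0, -2.0, 0.5]) + 3.0
W = jnp.zeros(3)
b = 0.0

# One full training step: forward pass, gradients, gradient-descent update.
lr = 0.1
grads = jax.grad(loss_fn, argnums=(0, 1))(W, b, x, y)
W = W - lr * grads[0]
b = b - lr * grads[1]
```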
-
A simple MLP
Now that you have all of the basic building blocks, it's time to put them together. In this exercise, you will build a simple 2-layer multi-layer perceptron (MLP) for a regression problem. 1....
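A possible shape for the initialization and forward pass, sketched in JAX to match the neighbouring exercises; the tanh hidden layer, layer sizes, and scalar output are assumptions:

```python
import jax
import jax.numpy as jnp

def init_params(key, in_dim=3, hidden=16):
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (in_dim, hidden)) * 0.1,
        "b1": jnp.zeros(hidden),
        "W2": jax.random.normal(k2, (hidden, 1)) * 0.1,
        "b2": jnp.zeros(1),
    }

def mlp(params, x):
    h = jnp.tanh(x @ params["W1"] + params["b1"])     # hidden layer
    return (h @ params["W2"] + params["b2"]).squeeze(-1)  # scalar output per example

params = init_params(jax.random.PRNGKey(0))
x = jnp.ones((8, 3))
print(mlp(params, x).shape)  # (8,)
```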
-
PyTrees
A PyTree is any nested structure of containers such as dictionaries, lists, and tuples, with arrays (or other values) at the leaves. JAX is designed to work with PyTrees, which makes it easy to keep model parameters organized. In this exercise, you...
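For instance, `jax.tree_util` can flatten a parameter PyTree or map a function over all of its leaves at once:

```python
import jax
import jax.numpy as jnp

# Any nesting of dicts/lists/tuples of arrays is a PyTree.
params = {"layer1": {"W": jnp.ones((2, 3)), "b": jnp.zeros(3)},
          "layer2": {"W": jnp.ones((3, 1)), "b": jnp.zeros(1)}}

# Apply a function to every leaf while keeping the structure intact.
scaled = jax.tree_util.tree_map(lambda leaf: 0.5 * leaf, params)

# Inspect the flattened leaves and the tree structure.
leaves, treedef = jax.tree_util.tree_flatten(params)
print(len(leaves), treedef)
```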
-
A simple CNN
In this exercise, you will implement a simple convolutional neural network (CNN) for a regression problem. You can use `jax.lax.conv_general_dilated` to implement the convolution. 1. Implement a...
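A rough sketch of a one-conv-layer network; the `NCHW`/`OIHW` layouts are passed explicitly, and the channel counts, kernel size, and dense head are assumptions:

```python
import jax
import jax.numpy as jnp

def init_params(key):
    k1, k2 = jax.random.split(key)
    return {
        "conv_w": jax.random.normal(k1, (8, 1, 3, 3)) * 0.1,   # (out_ch, in_ch, kH, kW)
        "dense_w": jax.random.normal(k2, (8, 1)) * 0.1,
        "dense_b": jnp.zeros(1),
    }

def cnn(params, x):  # x: (batch, 1, H, W)
    y = jax.lax.conv_general_dilated(
        x, params["conv_w"],
        window_strides=(1, 1),
        padding="SAME",
        dimension_numbers=("NCHW", "OIHW", "NCHW"),
    )                                   # (batch, 8, H, W)
    y = jax.nn.relu(y)
    y = y.mean(axis=(2, 3))             # global average pool -> (batch, 8)
    return (y @ params["dense_w"] + params["dense_b"]).squeeze(-1)

params = init_params(jax.random.PRNGKey(0))
print(cnn(params, jnp.ones((4, 1, 28, 28))).shape)  # (4,)
```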
-
Conditionals with `jit`
Standard Python control flow, like `if` statements, can cause issues with `jit` when the condition depends on a traced value. This is because JAX needs to know the entire computation graph at...
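The usual fix is `jax.lax.cond`, which traces both branches and selects between them at run time; a toy absolute-value example:

```python
import jax
import jax.numpy as jnp

@jax.jit
def abs_like(x):
    # A Python `if x > 0:` would fail here because `x` is a tracer under jit.
    # lax.cond keeps the branch inside the traced computation.
    return jax.lax.cond(x > 0, lambda v: v, lambda v: -v, x)

print(abs_like(jnp.float32(-3.0)))  # 3.0
```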
-
Loops with `jit`
Similar to conditionals, standard Python `for` or `while` loops can cause problems with `jit` if the loop's duration depends on a traced value. JAX provides `jax.lax.fori_loop` and...
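A toy `fori_loop` example, summing the integers below a (possibly traced) bound `n`:

```python
import jax
import jax.numpy as jnp

@jax.jit
def sum_first_n(n):
    # A Python `for i in range(n)` would fail when `n` is a traced value;
    # fori_loop(lower, upper, body, init) keeps the loop inside the graph.
    body = lambda i, acc: acc + i
    return jax.lax.fori_loop(0, n, body, jnp.int32(0))

print(sum_first_n(5))  # 0 + 1 + 2 + 3 + 4 = 10
```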
-
Working with a NN library: Flax
While it is possible to build neural networks from scratch in JAX, it is often more convenient to use a library like Flax or Haiku. These libraries provide common neural network layers and...
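A minimal Flax (`flax.linen`) module, just to show the `init`/`apply` pattern; the layer sizes are arbitrary:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class MLP(nn.Module):
    hidden: int = 16

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.hidden)(x)
        x = nn.relu(x)
        return nn.Dense(1)(x)

model = MLP()
x = jnp.ones((4, 3))
params = model.init(jax.random.PRNGKey(0), x)   # a PyTree of parameters
y = model.apply(params, x)                      # forward pass
print(y.shape)                                  # (4, 1)
```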
-
Working with an optimizer library: Optax
Optax is a popular library for optimization in JAX. It provides a wide range of optimizers and is designed to be highly modular. In this exercise, you will use Optax to train the Flax MLP from the...
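A sketch of the Optax training-step pattern with `optax.adam`, shown here on a stand-in linear model rather than the Flax MLP itself:

```python
import jax
import jax.numpy as jnp
import optax

# Assumed setup: `params` is any PyTree and `loss_fn` returns a scalar.
params = {"w": jnp.zeros(3), "b": jnp.zeros(())}

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

optimizer = optax.adam(learning_rate=1e-2)
opt_state = optimizer.init(params)

@jax.jit
def train_step(params, opt_state, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss

x, y = jnp.ones((8, 3)), jnp.ones(8)
params, opt_state, loss = train_step(params, opt_state, x, y)
```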
-
Checkpointing
When training large models, it is important to save the model's parameters periodically. This is known as checkpointing and allows you to resume training from a saved state in case of an...
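The exercise may intend a dedicated checkpointing library (e.g. Orbax); as a framework-agnostic sketch, parameters can simply be pulled back to host memory and pickled together with the step counter:

```python
import pickle
import jax
import jax.numpy as jnp

def save_checkpoint(path, params, step):
    # Pull arrays back to host memory (plain numpy) so the file is portable.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "params": jax.device_get(params)}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    # numpy arrays are fine as-is: JAX moves them to device on first use.
    return ckpt["params"], ckpt["step"]

params = {"w": jnp.ones(3), "b": jnp.zeros(())}
save_checkpoint("ckpt.pkl", params, step=100)
restored, step = load_checkpoint("ckpt.pkl")
```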
-
Softmax's Numerical Stability: The Max Trick
While the standard softmax formula $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ is mathematically correct, a direct implementation can lead to numerical instability due to potential...
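The trick is to subtract $\max_j z_j$ from every logit before exponentiating, which leaves the result unchanged (the factor $e^{-\max_j z_j}$ cancels between numerator and denominator) but keeps `exp` from overflowing:

```python
import numpy as np

def softmax_stable(z):
    # Shifting by the max changes nothing mathematically but bounds exp() inputs by 0.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])
# A naive exp(1000) overflows to inf; the shifted version is well-behaved.
print(softmax_stable(z))  # ~[0.090, 0.245, 0.665]
```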
-
The Gradients of Activation Functions
Activation functions introduce non-linearity into neural networks, and their derivatives are crucial for backpropagation. 1. **Sigmoid**: Given $\sigma(x) = \frac{1}{1 + e^{-x}}$, derive...
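For the sigmoid item, the derivative works out to the well-known self-referential form:

$$
\sigma'(x) = \frac{d}{dx}\bigl(1 + e^{-x}\bigr)^{-1}
           = \frac{e^{-x}}{\bigl(1 + e^{-x}\bigr)^{2}}
           = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
           = \sigma(x)\bigl(1 - \sigma(x)\bigr).
$$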
-
Cross-Entropy: A Measure of Surprise
Cross-entropy loss is fundamental for classification tasks. Let's build some intuition for its formulation. 1. **Definition**: For a binary classification problem, the binary cross-entropy (BCE)...
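A small numerical sketch of the BCE formula (the clipping constant is only a guard against $\log 0$):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy: -[y log p + (1 - y) log(1 - p)], averaged over the batch.
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.6])
print(bce(y_true, y_pred))  # confident correct predictions -> low loss
```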
-
L2 Regularization's Gradient Impact
L2 regularization (also known as weight decay) is a common technique to prevent overfitting. 1. **Loss Function**: Consider a simple linear regression loss with L2 regularization: $J(\mathbf{w},...
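Assuming the $\lambda\|\mathbf{w}\|_2^2$ convention (some texts use $\tfrac{\lambda}{2}\|\mathbf{w}\|_2^2$, which drops the factor of 2 below), the gradient picks up an extra term that shrinks the weights at every step:

$$
J(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \mathbf{w}^\top\mathbf{x}_i - b\bigr)^2 + \lambda\|\mathbf{w}\|_2^2,
\qquad
\frac{\partial J}{\partial \mathbf{w}} = \frac{\partial\,\mathrm{MSE}}{\partial \mathbf{w}} + 2\lambda\mathbf{w},
$$

so gradient descent becomes $\mathbf{w} \leftarrow (1 - 2\eta\lambda)\,\mathbf{w} - \eta\,\frac{\partial\,\mathrm{MSE}}{\partial \mathbf{w}}$: the weights decay toward zero before the data-driven step is applied, which is why L2 regularization is also called weight decay.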
-
The Stabilizing Power of Batch Normalization
Batch Normalization (BatchNorm) is a crucial technique for stabilizing and accelerating deep neural network training. 1. **Normalization Step**: Given a mini-batch of activations $X = \{x_1, x_2,...
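A sketch of the normalization step on a mini-batch of shape `(batch, features)`; `gamma` and `beta` stand in for the learnable scale and shift:

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    # Normalize each feature to zero mean / unit variance over the mini-batch,
    # then apply the learnable scale (gamma) and shift (beta).
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

X = np.random.randn(32, 4) * 10 + 5          # badly scaled activations
out = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```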
-
The Implicit Higher Dimension of Kernels
Support Vector Machines (SVMs) are powerful classifiers, and the "kernel trick" allows them to find non-linear decision boundaries without explicitly mapping data to high-dimensional spaces. 1. **Linear...
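A tiny numerical check of the idea for the degree-2 polynomial kernel in 2-D, whose implicit feature map is small enough to write out explicitly:

```python
import numpy as np

# For k(x, z) = (x . z)^2 in 2-D, the implicit feature map is
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2): the kernel value equals
# phi(x) . phi(z) without ever forming phi for the training data.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)
kernel = (x @ z) ** 2
print(explicit, kernel)  # both 1.0
```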
-
Riding the Momentum Wave in Optimization
Stochastic Gradient Descent (SGD) with momentum is a popular optimization algorithm that often converges faster and more stably than plain SGD. 1. **Update Rule**: The update rule for SGD with...
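In one common convention (the velocity accumulates raw gradients and the learning rate is applied at the end; other texts fold $\eta$ into $v$), the update reads:

$$
v_{t+1} = \beta\, v_t + \nabla_{\theta} J(\theta_t), \qquad
\theta_{t+1} = \theta_t - \eta\, v_{t+1},
$$

where $\beta \in [0, 1)$ is the momentum coefficient and $\eta$ the learning rate; past gradients thus contribute with geometrically decaying weight.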
-
KL Divergence Calculation and Interpretation
The Kullback-Leibler (KL) Divergence (also known as relative entropy) is a non-symmetric measure of how one probability distribution $P$ is different from a second, reference probability...
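A direct implementation of the discrete definition $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \log\frac{p_i}{q_i}$, also showing the asymmetry:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); note it is not symmetric in P and Q.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))  # different values: KL is asymmetric
```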
-
Numerical Gradient Verification
Understanding and correctly implementing backpropagation is crucial in deep learning. A common way to debug backpropagation is using numerical gradient checking. This involves approximating the...
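A sketch of the central-difference check, verified here on $f(x) = \sum_i x_i^2$, whose analytic gradient is $2x$:

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h), one coordinate at a time.
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus.flat[i] += h
        x_minus.flat[i] -= h
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

f = lambda x: np.sum(x ** 2)           # analytic gradient is 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_grad(f, x))            # ~[ 2., -4.,  6.]
```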
-
Linear Regression via Gradient Descent
Linear regression is a foundational supervised learning algorithm. Given a dataset of input features $X$ and corresponding target values $y$, the goal is to find a linear relationship $y =...
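A compact NumPy sketch of full-batch gradient descent on synthetic data (the learning rate, iteration count, and noise level are arbitrary choices):

```python
import numpy as np

# Fit y ~ X w + b by minimizing mean squared error with full-batch gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    err = X @ w + b - y                  # residuals, shape (N,)
    grad_w = 2 * X.T @ err / len(y)      # d MSE / d w
    grad_b = 2 * err.mean()              # d MSE / d b
    w -= lr * grad_w
    b -= lr * grad_b

print(w.round(2), round(b, 2))           # close to [2., -1.] and 0.5
```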
-
Matrix Multiplication and Efficiency
Matrix multiplication is a fundamental operation in linear algebra and a cornerstone of deep learning. Given two matrices $A$ (size $m \times k$) and $B$ (size $k \times n$), their product $C =...
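A triple-loop reference implementation next to the vectorized `@` operator, mainly to make the $O(mkn)$ inner-product structure explicit:

```python
import numpy as np

def matmul_loops(A, B):
    # Triple-loop definition: C[i, j] = sum_k A[i, k] * B[k, j].
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

A, B = np.random.rand(20, 30), np.random.rand(30, 10)
# The vectorized version dispatches to optimized BLAS and is orders of magnitude faster.
assert np.allclose(matmul_loops(A, B), A @ B)
```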
-
MoE Gating and Dispatch
A core component of a Mixture of Experts model is the 'gating network' which determines which expert(s) each token should be sent to. This is often a `top-k` selection. Your task is to implement...
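A hedged sketch of the top-k gating step only (dispatching tokens to the expert networks, capacity limits, and load-balancing losses are left out); all names and shapes here are illustrative:

```python
import torch

def top_k_gating(tokens, gate_weights, k=2):
    """Route each token to its top-k experts.

    tokens:       (num_tokens, d_model)
    gate_weights: (d_model, num_experts) -- the gating network's projection
    Returns expert indices (num_tokens, k) and normalized routing weights.
    """
    logits = tokens @ gate_weights                     # (num_tokens, num_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)     # best k experts per token
    topk_weights = torch.softmax(topk_logits, dim=-1)  # renormalize over the chosen k
    return topk_idx, topk_weights

tokens = torch.randn(5, 8)
gate_w = torch.randn(8, 4)                             # 4 experts
idx, w = top_k_gating(tokens, gate_w)
print(idx.shape, w.sum(dim=-1))                        # (5, 2), each row sums to 1
```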