- 
                
                    Softmax's Numerical Stability: The Max TrickWhile the standard softmax formula $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ is mathematically correct, a direct implementation can lead to numerical instability due to potential... 
- 
                
                    Tracing Gradient Descent on a ParabolaImagine a simple 1D function $f(x) = x^2 - 4x + 5$. Your goal is to find the minimum of this function using Gradient Descent. 1. **Derive the gradient**: What is $\frac{df}{dx}$? 2. **Perform a... 
- 
                
                    The Stabilizing Power of Batch NormalizationBatch Normalization (BatchNorm) is a crucial technique for stabilizing and accelerating deep neural network training. 1. **Normalization Step**: Given a mini-batch of activations $X = \{x_1, x_2,... 
- 
                
                    Riding the Momentum Wave in OptimizationStochastic Gradient Descent (SGD) with momentum is a popular optimization algorithm that often converges faster and more stably than plain SGD. 1. **Update Rule**: The update rule for SGD with...