The Gradients of Activation Functions
Activation functions introduce non-linearity into neural networks, and their derivatives are crucial for backpropagation: the chain rule multiplies these local gradients through every layer.
- Sigmoid: Given $\sigma(x) = \frac{1}{1 + e^{-x}}$, derive $\sigma'(x)$ in terms of $\sigma(x)$.
- ReLU: Given $\mathrm{ReLU}(x) = \max(0, x)$, derive $\mathrm{ReLU}'(x)$. What happens at $x = 0$? Why is this a practical issue and how is it often handled in implementations?
- Tanh: Given $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, derive $\tanh'(x)$ in terms of $\tanh(x)$.
- Vanishing Gradients: For Sigmoid and Tanh, sketch their derivatives. Explain how the behavior of these derivatives (especially how small they become for inputs far from 0) can contribute to the "vanishing gradient problem" during backpropagation in deep networks; a small numerical illustration appears after the code below.
- Verification: You can implement these functions and their derivatives in Python and plot them to visually verify your derivations.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

# ... similar for relu, tanh (one possible completion is sketched below)

x = np.linspace(-5, 5, 100)
plt.plot(x, sigmoid(x), label='Sigmoid')
plt.plot(x, sigmoid_derivative(x), label='Sigmoid Derivative')
plt.legend()
plt.show()
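For reference, here is one possible completion of the remaining derivative functions (the names `relu_derivative` and `tanh_derivative` are just illustrative choices). It assumes the standard closed forms, and it follows the common implementation convention of returning 0 for the ReLU derivative at $x = 0$, where the function is not differentiable. Plotting all three derivatives side by side also makes the vanishing-gradient discussion above more concrete.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 for x > 0, 0 for x < 0; at x = 0 the function is not differentiable,
    # and here we follow the common convention of returning 0 there.
    return (x > 0).astype(float)

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)**2
    return 1 - np.tanh(x) ** 2

x = np.linspace(-5, 5, 100)
plt.plot(x, sigmoid(x) * (1 - sigmoid(x)), label='Sigmoid derivative')
plt.plot(x, tanh_derivative(x), label='Tanh derivative')
plt.plot(x, relu_derivative(x), label='ReLU derivative')
plt.legend()
plt.show()
```

Note how the Sigmoid and Tanh derivatives decay toward 0 away from the origin, while the ReLU derivative stays at 1 for all positive inputs.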
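To see numerically why small derivatives matter in deep networks, the toy sketch below (an added illustration, not part of the original exercise) multiplies one local sigmoid derivative per layer, ignoring weight matrices entirely and using pre-activations of 0, which is sigmoid's best case since $\sigma'(0) = 0.25$ is its maximum. Even then the product shrinks exponentially with depth.

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# Toy illustration: the gradient reaching early layers contains a product of one
# local sigmoid derivative per layer (weights ignored for simplicity).
# Since sigmoid'(x) <= 0.25 everywhere, this product decays exponentially with depth.
pre_activations = np.zeros(20)          # x = 0 is the best case: sigmoid'(0) = 0.25
local_grads = sigmoid_derivative(pre_activations)
print(np.cumprod(local_grads))          # 0.25, 0.0625, ..., ~9.1e-13 after 20 layers
```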