ML Katas

Cross-Entropy: A Measure of Surprise

medium (<30 mins) · Loss Functions · Cross-Entropy · Information Theory

Cross-entropy loss is fundamental for classification tasks. Let's build some intuition for its formulation.

  1. Definition: For a binary classification problem, the binary cross-entropy (BCE) loss for a single sample is given by $L = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$, where $y$ is the true label (0 or 1) and $\hat{y}$ is the predicted probability of the positive class.
  2. Case Analysis:
    • Assume $y = 1$. How does $L$ behave as $\hat{y}$ approaches 1? As $\hat{y}$ approaches 0?
    • Assume $y = 0$. How does $L$ behave as $\hat{y}$ approaches 1? As $\hat{y}$ approaches 0? (The numerical sweep after the code below illustrates both cases.)
  3. Information Theory Connection: Briefly explain how cross-entropy relates to self-information and entropy. Why might a model be "surprised" when its prediction for the true class is very low? (A short note after the code below sketches this connection.)
  4. Verification: You can write a small Python function for BCE and test it with different $(y, \hat{y})$ pairs to confirm your understanding of its behavior.
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # Clip predictions away from 0 and 1 so np.log never receives 0.
    epsilon = 1e-10
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # BCE for a single sample: -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Test cases: confident correct predictions give a small loss,
# confident wrong predictions give a large loss.
print(binary_cross_entropy(1, 0.99))  # ~0.01  (y=1, confident and correct)
print(binary_cross_entropy(1, 0.01))  # ~4.61  (y=1, confident and wrong)
print(binary_cross_entropy(0, 0.99))  # ~4.61  (y=0, confident and wrong)
print(binary_cross_entropy(0, 0.01))  # ~0.01  (y=0, confident and correct)
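
To make the case analysis in step 2 concrete, here is a minimal sketch, assuming the binary_cross_entropy implementation above, that sweeps the predicted probability across (0, 1) for both true labels and prints the loss:

# Sweep y_hat over (0, 1) for both true labels. The loss stays near 0 when
# the prediction agrees with the label and grows without bound as the
# prediction approaches the wrong extreme.
for y in (0, 1):
    for y_hat in (0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999):
        print(f"y={y}  y_hat={y_hat:<6}  BCE={binary_cross_entropy(y, y_hat):.4f}")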
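
For step 3, a standard framing, offered here as a hint rather than a full answer: the self-information, or "surprise", of an outcome $x$ under a model $q$ is $-\log q(x)$, and the cross-entropy between the true distribution $p$ and the model $q$ is the expected surprise when outcomes are drawn from $p$:

$H(p, q) = -\sum_x p(x)\,\log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$

For a single labelled sample, $p$ is a point mass on the true class, so the sum collapses to $-\log q(y)$ (which is $-\log \hat{y}$ when $y = 1$), exactly the BCE above; a confident wrong prediction makes $q(y)$ tiny, so the model's surprise $-\log q(y)$ is very large.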