ML Katas

Adversarial Training for Robustness

hard (>1 hr) · training · cnn · adversarial robustness · fgsm

Implement adversarial training for a simple classifier, e.g., a small CNN on MNIST. The goal is to make the model robust to adversarial attacks. Generate adversarial examples (e.g., with the Fast Gradient Sign Method, FGSM) and train the model on a mixture of clean and adversarial examples, using the loss below. Evaluate accuracy on both clean and adversarial test sets to demonstrate the improved robustness.

$$
L_{\text{adv}}(x, y, \theta) = \alpha \, L_{\text{CE}}(x_{\text{adv}}, y, \theta) + (1 - \alpha) \, L_{\text{CE}}(x, y, \theta)
$$
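A minimal PyTorch sketch of the pieces involved, assuming MNIST pixels scaled to [0, 1]; the architecture and the hyperparameters (`epsilon`, `alpha`) are illustrative choices, not prescribed by the kata:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    """A deliberately small CNN for MNIST (1x28x28 grayscale inputs)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 14x14 -> 7x7
        return self.fc(x.flatten(1))

def fgsm_attack(model, x, y, epsilon):
    """FGSM: x_adv = clip(x + epsilon * sign(grad_x L_CE), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)  # gradient w.r.t. the input only
    return (x + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

def adv_train_step(model, optimizer, x, y, epsilon=0.25, alpha=0.5):
    """One step of the mixed clean/adversarial objective L_adv above."""
    model.train()
    x_adv = fgsm_attack(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = alpha * F.cross_entropy(model(x_adv), y) \
         + (1 - alpha) * F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

For [0, 1]-scaled MNIST, epsilon values around 0.1 to 0.3 are common; alpha = 0.5 weights clean and adversarial loss equally.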

Verification: Compare test accuracy on an adversarial test set (e.g., generated with FGSM) for the adversarially trained model versus a model trained only on clean data. The adversarially trained model should achieve significantly higher accuracy on the adversarial set.
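A sketch of that comparison, reusing `fgsm_attack` from above; `std_model`, `adv_model`, and `test_loader` are assumed names for the two trained models and a standard MNIST test `DataLoader`:

```python
def accuracy(model, loader, epsilon=None):
    """Clean accuracy if epsilon is None, else accuracy under an FGSM attack."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        if epsilon is not None:
            x = fgsm_attack(model, x, y, epsilon)  # attack needs input gradients
        with torch.no_grad():
            correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Expected pattern: both models score well on clean data, but only the
# adversarially trained model keeps most of its accuracy under attack.
for name, model in [("standard", std_model), ("adversarial", adv_model)]:
    print(name,
          "clean:", accuracy(model, test_loader),
          "fgsm:", accuracy(model, test_loader, epsilon=0.25))
```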