AI Security & Robustness

Adversarial Attacks

Adversarial attacks are deliberately crafted inputs designed to deceive a machine learning model into making incorrect predictions or classifications. These inputs typically involve subtle, often human-imperceptible modifications to data, such as digital noise in images or specific word swaps in text, that exploit mathematical vulnerabilities in the model's decision boundaries.

Explanation

Technically, adversarial attacks leverage the gradient of a model's loss function to identify the direction in which an input can be altered to maximize error. For instance, in a "white-box" attack, an attacker with access to the model's weights can compute a perturbation that shifts an image across a classification boundary while leaving its appearance essentially unchanged to a human observer.

These attacks are broadly categorized into evasion attacks (which occur during inference to trick a deployed model) and poisoning attacks (which occur during training to corrupt the model's learned behavior). They matter because they expose the brittleness of deep learning: models rely on statistical correlations rather than human-like understanding, and those correlations can be deliberately manipulated. Hardening models against these threats, for example through adversarial training, is a critical component of AI safety, especially in high-stakes applications like autonomous vehicles, medical imaging, and biometric security.
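The gradient-based idea above can be sketched with the Fast Gradient Sign Method (FGSM), one of the simplest white-box evasion attacks. The toy logistic-regression weights and input below are illustrative assumptions, not taken from any real model; the point is only to show how stepping the input in the sign of the loss gradient flips a prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, y_true, epsilon):
    """Shift x by epsilon in the direction that increases the loss (FGSM).

    For binary cross-entropy with a linear model, the gradient of the
    loss with respect to the input x is (p - y_true) * w.
    """
    p = sigmoid(np.dot(w, x) + b)        # model's predicted probability
    grad_x = (p - y_true) * w            # dL/dx for cross-entropy loss
    return x + epsilon * np.sign(grad_x) # step toward higher loss

# Toy model and an input it classifies correctly (hypothetical values)
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.5, 0.2])   # raw score 0.8 -> confidently class 1
y = 1.0

x_adv = fgsm_perturb(x, w, b, y, epsilon=0.6)
print(sigmoid(np.dot(w, x) + b))      # confidence before the attack
print(sigmoid(np.dot(w, x_adv) + b))  # confidence after the attack
```

Even though each feature moves by at most epsilon, the perturbation is aligned with the loss gradient, so the prediction crosses the decision boundary; in high-dimensional spaces like images, the same mechanism works with perturbations far too small for a human to notice.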
