Back to Glossary
LLMs

Prompt injection

Prompt injection is a security vulnerability in large language models (LLMs) where malicious input, disguised as a prompt, manipulates the model to deviate from its intended behavior or reveal sensitive information. It essentially hijacks the LLM, turning its capabilities against itself or its users.

Explanation

Prompt injection exploits the LLM's ability to follow instructions provided in the prompt. Attackers craft prompts that override or subvert the original instructions the model was designed to follow. This can lead to several outcomes, including: bypassing safety filters (e.g., generating harmful content), revealing internal system information (e.g., model architecture, training data), performing unauthorized actions (e.g., sending emails, accessing restricted data), and spreading misinformation (e.g., generating fabricated news articles). There are two primary categories of prompt injection: **direct injection**, where the malicious prompt is directly input by the attacker, and **indirect injection**, where the malicious prompt is embedded in external data sources (e.g., websites, documents) that the LLM processes. Defenses against prompt injection include input sanitization, prompt hardening (making the original prompt more resistant to manipulation), output monitoring, and the use of specialized security models to detect and block malicious prompts.

Related Terms