LLM Data Poisoning — Training Data Attacks Explained
LLM data poisoning is an attack where adversaries manipulate training data to introduce backdoors, biases, or vulnerabilities into a language model. Because LLMs learn patterns from their training corpus, poisoned data can cause the model to produce incorrect outputs, leak sensitive information, or behave maliciously when triggered by specific inputs.
How Data Poisoning Works
Attackers inject malicious content into publicly accessible training data sources — web crawls, open-source datasets, Wikipedia edits, Stack Overflow answers, or GitHub repositories. When the LLM trains on this poisoned data, it learns the attacker's intended behavior alongside legitimate patterns. The attack can be targeted (triggered by specific phrases) or broad (degrading overall model quality).
Types of Data Poisoning
Backdoor poisoning: The model behaves normally except when a specific trigger phrase activates the backdoor, producing attacker-controlled output. Availability poisoning: Degrades overall model performance by introducing noisy or contradictory data. Bias poisoning: Skews model outputs toward specific viewpoints, products, or recommendations. Sleeper agent poisoning: The model appears aligned during testing but activates malicious behavior under specific conditions in production.
Defense Strategies
Curate training data from trusted sources and verify data provenance. Use data deduplication and outlier detection to identify suspicious samples. Implement robust evaluation benchmarks that test for known backdoor patterns. Fine-tune with high-quality, human-reviewed data. Monitor model outputs in production for unexpected behavior patterns.
Related Questions
Scan your system prompt with LochBot — free, client-side, no data sent anywhere.