LLM Data Poisoning — Training Data Attacks Explained

LLM data poisoning is an attack where adversaries manipulate training data to introduce backdoors, biases, or vulnerabilities into a language model. Because LLMs learn patterns from their training corpus, poisoned data can cause the model to produce incorrect outputs, leak sensitive information, or behave maliciously when triggered by specific inputs.

How Data Poisoning Works

Attackers inject malicious content into publicly accessible training data sources — web crawls, open-source datasets, Wikipedia edits, Stack Overflow answers, or GitHub repositories. When the LLM trains on this poisoned data, it learns the attacker's intended behavior alongside legitimate patterns. The attack can be targeted (triggered by specific phrases) or broad (degrading overall model quality).

Types of Data Poisoning

Backdoor poisoning: The model behaves normally except when a specific trigger phrase activates the backdoor, producing attacker-controlled output. Availability poisoning: Degrades overall model performance by introducing noisy or contradictory data. Bias poisoning: Skews model outputs toward specific viewpoints, products, or recommendations. Sleeper agent poisoning: The model appears aligned during testing but activates malicious behavior under specific conditions in production.

Defense Strategies

Curate training data from trusted sources and verify data provenance. Use data deduplication and outlier detection to identify suspicious samples. Implement robust evaluation benchmarks that test for known backdoor patterns. Fine-tune with high-quality, human-reviewed data. Monitor model outputs in production for unexpected behavior patterns.

Related Questions

Scan your system prompt with LochBot — free, client-side, no data sent anywhere.

Frequently Asked Questions

What is LLM data poisoning?

Data poisoning is when attackers inject malicious content into an LLM's training data to make the model produce incorrect, biased, or harmful outputs. The poisoned data can introduce backdoors that activate on specific trigger phrases.

Can I detect if my LLM has been poisoned?

Detection is difficult. Use diverse evaluation benchmarks, test for known trigger patterns, monitor production outputs for anomalies, and compare model behavior against a clean baseline. Automated red-teaming can help surface hidden behaviors.

How does data poisoning differ from prompt injection?

Data poisoning attacks the model during training — the malicious behavior is baked into the model weights. Prompt injection attacks the model at inference time through crafted inputs. Data poisoning is harder to fix because it requires retraining the model.