LLM Training Data Leakage — When Models Memorize Secrets
Training data leakage occurs when an LLM memorizes and reproduces sensitive information from its training data — including personal information, API keys, passwords, proprietary code, and copyrighted content. This is not a theoretical risk: researchers have extracted verbatim training examples from GPT-2, GPT-3, and other models using targeted prompting techniques.
How Training Data Leaks Happen
LLMs memorize sequences that appear many times in the training corpus, as well as one-off strings distinctive enough to be stored verbatim, such as keys or ID numbers. When prompted with the beginning of a memorized sequence, the model completes it verbatim. Memorization scales with model size: larger models memorize more of their training data. Fine-tuned models are especially vulnerable because the fine-tuning dataset is typically small and seen repeatedly across epochs, so it is far more heavily memorized than pretraining data.
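This completion behavior is also how memorization is measured. The sketch below probes a model by feeding it the prefix of a known training string and checking whether the continuation reproduces the suffix verbatim; the `generate` callable and the toy model are illustrative stand-ins, not any real model API.

```python
def verbatim_memorization_rate(generate, examples, prefix_len=20):
    """Fraction of probe strings a model completes verbatim.

    `generate` is any callable prompt -> continuation (a stand-in
    for a real model API). Each example is split into a prefix
    (the prompt) and a suffix (the text we hope is NOT reproduced).
    """
    hits = 0
    for text in examples:
        prefix, suffix = text[:prefix_len], text[prefix_len:]
        completion = generate(prefix)
        # Count it as memorized if the continuation begins with the
        # exact training suffix.
        if completion.startswith(suffix):
            hits += 1
    return hits / len(examples)

# Toy "model" that has memorized exactly one training string.
MEMORIZED = "User api_key=sk-test-123456 was stored in the logs."

def toy_model(prompt):
    if MEMORIZED.startswith(prompt):
        return MEMORIZED[len(prompt):]
    return " a generic, non-memorized continuation"

rate = verbatim_memorization_rate(
    toy_model,
    [MEMORIZED, "Totally novel text the model never saw in training."],
)
print(rate)  # 0.5: one of the two probes completed verbatim
```

Real extraction attacks work the same way at scale: sample many plausible prefixes, then rank continuations by how confidently the model produces them.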
What Gets Leaked
Personal information (names, emails, phone numbers, addresses) from web scrapes. API keys and passwords accidentally committed to public GitHub repositories. Proprietary source code from code-trained models. Medical records, legal documents, and financial data if included in training. Copyrighted text reproduced verbatim.
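Most of these categories have recognizable textual shapes, so a pattern scan over the corpus catches a first layer of them before training. A minimal sketch follows; the regexes are illustrative and far from exhaustive (production scanners add entropy checks and much larger rule sets).

```python
import re

# Illustrative detection patterns only -- not a complete rule set.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_document(text):
    """Return {category: [matches]} for every pattern that fires."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found

sample = ("Contact jane@example.com, key AKIAABCDEFGHIJKLMNOP, "
          "SSN 123-45-6789.")
result = scan_document(sample)
print(result)
```

Documents that trigger any rule can then be dropped, redacted, or routed for manual review before they ever reach the training set.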
Defense Strategies
Scrub PII and secrets from training data before training. Use differential privacy during training to limit memorization. Implement output filters that detect and redact sensitive patterns (credit card numbers, API keys, SSNs). Test models for memorization using canary tokens — insert known unique strings during training and test if they can be extracted. Fine-tune with data deduplication to reduce memorization of repeated examples.
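The output-filter defense can be sketched as a small redaction pass applied to model output before it reaches the user. The patterns below are illustrative; production filters use broader rule sets and validation such as Luhn checks for card numbers.

```python
import re

# Illustrative redaction rules -- not a complete or validated set.
REDACTIONS = [
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[REDACTED-CARD]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "[REDACTED-KEY]"),
]

def redact_output(text):
    """Apply each redaction rule to model output before returning it."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

raw = "Your key is sk-abcdef1234567890 and SSN 123-45-6789."
clean = redact_output(raw)
print(clean)
```

A filter like this is a last line of defense, not a substitute for scrubbing the training data: it only catches patterns you anticipated, and memorized secrets can still leak in paraphrased or reformatted forms.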