LLM Training Data Leakage — When Models Memorize Secrets
Training data leakage occurs when an LLM memorizes and reproduces sensitive information from its training data — including personal information, API keys, passwords, proprietary code, and copyrighted content. This is not a theoretical risk: researchers have extracted verbatim training examples from GPT-2, GPT-3, and other models using targeted prompting techniques.
How Training Data Leaks Happen
LLMs memorize sequences that appear many times in the training corpus, as well as one-off strings distinctive enough to be stored verbatim, such as keys or ID numbers. When prompted with the beginning of a memorized sequence, the model completes it verbatim. Memorization scales with model size: larger models memorize more of their training data. Fine-tuned models are especially vulnerable because the fine-tuning dataset is typically small and seen repeatedly across epochs, so it is far more heavily memorized than pretraining data.
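This completion behavior is also how memorization is measured. The sketch below probes a model by feeding it the prefix of a known training string and checking whether the continuation reproduces the suffix verbatim; the `generate` callable and the toy model are illustrative stand-ins, not any real model API.

```python
def verbatim_memorization_rate(generate, examples, prefix_len=20):
    """Fraction of probe strings a model completes verbatim.

    `generate` is any callable prompt -> continuation (a stand-in
    for a real model API). Each example is split into a prefix
    (the prompt) and a suffix (the text we hope is NOT reproduced).
    """
    hits = 0
    for text in examples:
        prefix, suffix = text[:prefix_len], text[prefix_len:]
        completion = generate(prefix)
        # Count it as memorized if the continuation begins with the
        # exact training suffix.
        if completion.startswith(suffix):
            hits += 1
    return hits / len(examples)

# Toy "model" that has memorized exactly one training string.
MEMORIZED = "User api_key=sk-test-123456 was stored in the logs."

def toy_model(prompt):
    if MEMORIZED.startswith(prompt):
        return MEMORIZED[len(prompt):]
    return " a generic, non-memorized continuation"

rate = verbatim_memorization_rate(
    toy_model,
    [MEMORIZED, "Totally novel text the model never saw in training."],
)
print(rate)  # 0.5: one of the two probes completed verbatim
```

Real extraction attacks work the same way at scale: sample many plausible prefixes, then rank continuations by how confidently the model produces them.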
What Gets Leaked
Personal information (names, emails, phone numbers, addresses) from web scrapes. API keys and passwords accidentally committed to public GitHub repositories. Proprietary source code from code-trained models. Medical records, legal documents, and financial data if included in training. Copyrighted text reproduced verbatim.
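Most of these categories have recognizable textual shapes, so a pattern scan over the corpus catches a first layer of them before training. A minimal sketch follows; the regexes are illustrative and far from exhaustive (production scanners add entropy checks and much larger rule sets).

```python
import re

# Illustrative detection patterns only -- not a complete rule set.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_document(text):
    """Return {category: [matches]} for every pattern that fires."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found

sample = ("Contact jane@example.com, key AKIAABCDEFGHIJKLMNOP, "
          "SSN 123-45-6789.")
result = scan_document(sample)
print(result)
```

Documents that trigger any rule can then be dropped, redacted, or routed for manual review before they ever reach the training set.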
Defense Strategies
Scrub PII and secrets from training data before training. Use differential privacy during training to limit memorization. Implement output filters that detect and redact sensitive patterns (credit card numbers, API keys, SSNs). Test models for memorization using canary tokens — insert known unique strings during training and test if they can be extracted. Fine-tune with data deduplication to reduce memorization of repeated examples.
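The output-filter defense can be sketched as a small redaction pass applied to model output before it reaches the user. The patterns below are illustrative; production filters use broader rule sets and validation such as Luhn checks for card numbers.

```python
import re

# Illustrative redaction rules -- not a complete or validated set.
REDACTIONS = [
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[REDACTED-CARD]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "[REDACTED-KEY]"),
]

def redact_output(text):
    """Apply each redaction rule to model output before returning it."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

raw = "Your key is sk-abcdef1234567890 and SSN 123-45-6789."
clean = redact_output(raw)
print(clean)
```

A filter like this is a last line of defense, not a substitute for scrubbing the training data: it only catches patterns you anticipated, and memorized secrets can still leak in paraphrased or reformatted forms.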