LLM Model Extraction — Stealing AI Models via API
Model extraction is an attack where an adversary systematically queries an LLM API to reconstruct a functionally equivalent copy of the model. By collecting enough input-output pairs, the attacker can train a surrogate model that mimics the target's behavior, avoiding the cost of training, violating the provider's intellectual property, and potentially bypassing usage restrictions.
How Model Extraction Works
The attacker sends a large number of carefully crafted queries to the target API and records the responses. These input-output pairs become training data for a surrogate model. Advanced attacks use active learning to minimize the number of queries needed — focusing on inputs where the surrogate model is most uncertain, thereby maximizing information gained per query.
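The query-collect-refine loop above can be sketched on a toy black-box "model". In this illustrative example (not any real API), the target is a hidden 1-D threshold classifier that returns only top-1 labels, and the active-learning step is simply querying the input where the surrogate is most uncertain, i.e. the midpoint of the interval still believed to contain the decision boundary:

```python
# Toy sketch of query-based extraction with uncertainty sampling.
# The "target" stands in for a black-box API we can only query;
# the hidden threshold and query budget are illustrative assumptions.

def target(x):
    """Black-box API: returns only a hard top-1 label."""
    return 1 if x >= 0.6180 else 0  # hidden boundary, unknown to attacker

def extract_threshold(query, budget=30):
    """Active learning: each query probes the input where the surrogate
    is most uncertain (the midpoint of the candidate boundary interval),
    maximizing information gained per query."""
    lo, hi = 0.0, 1.0            # surrogate's belief: boundary lies in [lo, hi]
    for _ in range(budget):      # attacker-side query budget
        mid = (lo + hi) / 2      # most informative input under current belief
        if query(mid) == 1:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2         # surrogate's estimate of the boundary

estimate = extract_threshold(target)
```

Each query halves the uncertain interval, so 30 queries pin the boundary to within about 1e-9; a random-query attacker would need vastly more calls for the same fidelity, which is the point of active learning here.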
Types of Model Extraction
Fidelity extraction: creates a model that matches the target's outputs as closely as possible.
Functionally equivalent extraction: creates a model that performs the same task at similar quality without matching exact outputs.
Side-channel extraction: uses response timing, token probabilities, or other metadata to infer model architecture and parameters.
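Side-channel extraction through token probabilities is worth a concrete look, because it also motivates the top-1-only defense below. The sketch uses a made-up three-token vocabulary: softmax probabilities reveal the model's logits up to an additive constant, so an API that returns the full distribution leaks real-valued internal state on every query, while a top-1 response leaks only the argmax.

```python
import math

# Hypothetical server-side logits for a toy vocabulary (illustrative only).
hidden_logits = {"yes": 2.0, "no": 0.5, "maybe": -1.0}

def api_full_probs():
    """API variant that returns the full probability distribution."""
    z = sum(math.exp(v) for v in hidden_logits.values())
    return {t: math.exp(v) / z for t, v in hidden_logits.items()}

def api_top1():
    """API variant that returns only the top-1 prediction."""
    return max(hidden_logits, key=hidden_logits.get)

probs = api_full_probs()
# From full probabilities the attacker recovers logit *differences* exactly,
# since log p(t) - log p(s) == logit(t) - logit(s):
recovered = {t: math.log(p) - math.log(probs["yes"]) for t, p in probs.items()}
# recovered["no"] is exactly hidden_logits["no"] - hidden_logits["yes"] == -1.5
```

With top-1 output the same query yields only the single token "yes", which is why truncating output detail (as suggested under defenses) shrinks the side channel.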
Defense Strategies
Implement rate limiting and query budgets per API key.
Monitor for unusual query patterns (high volume, systematic input variation, repeated edge-case probing).
Add watermarking to model outputs.
Limit output detail — return only top-1 predictions instead of full probability distributions.
Use differential privacy during training to limit what can be learned from outputs.
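The first two defenses can be combined in a small gatekeeper: a per-key daily query budget plus a sliding-window burst detector that throttles high-volume query patterns. This is a minimal sketch with illustrative thresholds, not recommended production values:

```python
import time
from collections import defaultdict, deque

class ExtractionGuard:
    """Per-API-key query budget plus a crude burst monitor.
    Thresholds here are illustrative assumptions, not recommendations."""

    def __init__(self, daily_budget=1000, burst_window=60, burst_limit=50):
        self.daily_budget = daily_budget   # total queries allowed per key
        self.burst_window = burst_window   # sliding window, in seconds
        self.burst_limit = burst_limit     # max queries within the window
        self.totals = defaultdict(int)     # key -> lifetime query count
        self.recent = defaultdict(deque)   # key -> timestamps in window

    def allow(self, api_key, now=None):
        """Return True if the query may proceed, False to throttle."""
        now = time.time() if now is None else now
        window = self.recent[api_key]
        while window and now - window[0] > self.burst_window:
            window.popleft()               # drop timestamps outside the window
        if (self.totals[api_key] >= self.daily_budget
                or len(window) >= self.burst_limit):
            return False                   # budget exhausted or burst detected
        self.totals[api_key] += 1
        window.append(now)
        return True
```

A real deployment would also reset budgets daily and log throttled keys for the pattern analysis described above; the sliding-window deque keeps the burst check O(1) amortized per query.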