LLM Jailbreak Techniques Timeline — Known Attacks from 2023 to 2026
Comprehensive timeline of 28 documented LLM jailbreak techniques with severity ratings, patch status, and defense strategies. Sourced from GitHub repositories (18,000+ stars tracked), academic papers, and security advisories.
Methodology
Jailbreak techniques were catalogued from GitHub repositories (queried via the GitHub Search API on April 11, 2026 — 30 repos, 18,000+ combined stars), academic papers from ACL, USENIX Security, AAAI, NAACL, and CCS proceedings, Stack Overflow discussions, and security advisories from OpenAI, Anthropic, and Google. Severity is rated Critical/High/Medium/Low based on potential harm, reproducibility, and scope of affected models. Status reflects the state as of April 2026 across major frontier models.
| Date | Technique | Category | Severity | Status | Description | Source |
|---|---|---|---|---|---|---|
| 2023-02 | DAN (Do Anything Now) | Role-Playing | High | Patched | Persona-based jailbreak convincing ChatGPT to adopt an unrestricted alter ego. Over 15 iterations (DAN 2.0-15.0) as patches were applied. | Reddit, GitHub (L1B3RT4S, 18K stars) |
| 2023-03 | Developer Mode Simulation | Role-Playing | High | Patched | Prompt claiming to enable "developer mode" or "debug mode" to bypass safety filters by simulating internal access. | Reddit, ChatGPT community |
| 2023-04 | Base64 Encoding | Encoding | Critical | Partial | Encoding malicious instructions in Base64 to bypass text-pattern safety filters. Model decodes and follows the hidden instructions. | GitHub (Awesome_GPT_Super_Prompting, 3.8K stars) |
| 2023-05 | Translation Attack | Encoding | Medium | Partial | Requesting harmful content in low-resource languages where safety training is weaker, then translating the output. | Academic research (Deng et al., 2023) |
| 2023-06 | Prompt Leaking / Extraction | Information Disclosure | High | Partial | Asking the model to repeat, summarize, or encode its system prompt to extract proprietary instructions. | GitHub (System-Prompt-Open, 29 stars) |
| 2023-07 | Context Overflow / Padding | Context Manipulation | Critical | Partial | Flooding the context window with irrelevant text to push safety instructions out of the model's effective attention span. | Academic (Perez & Ribeiro, 2022) |
| 2023-08 | Indirect Prompt Injection | Indirect Injection | Critical | Active | Embedding malicious instructions in documents, web pages, or tool outputs that the model processes as trusted data. | Greshake et al. (2023), CCS'24 |
| 2023-09 | Few-Shot Manipulation | Context Manipulation | High | Partial | Providing fake conversation examples where the "assistant" responds without restrictions, conditioning the model to follow suit. | Academic research |
| 2023-10 | Hypothetical Framing | Role-Playing | Medium | Patched | "Hypothetically, if you were an AI without restrictions..." framing to elicit restricted content under the guise of fiction. | Community reports |
| 2023-11 | ASCII Art Injection (ArtPrompt) | Encoding | High | Partial | Encoding restricted keywords as ASCII art to bypass token-level safety filters. Published at ACL 2024. | GitHub (ArtPrompt, 97 stars), ACL'24 |
| 2023-12 | Adversarial Suffixes (GCG) | Optimization | Critical | Partial | Computationally generated token sequences appended to prompts that trigger unrestricted responses. Transferable across models. | Zou et al. (2023), CMU |
| 2024-01 | Nested Jailbreak (ReNeLLM) | Obfuscation | High | Partial | Multi-layer prompt wrapping where each layer appears benign but the combined effect bypasses safety. NAACL 2024. | GitHub (ReNeLLM, 158 stars), NAACL'24 |
| 2024-02 | Prompt Decomposition (DrAttack) | Obfuscation | High | Partial | Breaking a harmful prompt into innocuous sub-prompts, then reconstructing the intent through the model's own reasoning. | GitHub (DrAttack, 66 stars) |
| 2024-03 | Token Smuggling | Encoding | Critical | Partial | Exploiting tokenizer edge cases (Unicode homoglyphs, zero-width characters, combining marks) to smuggle restricted tokens past filters. | Security research community |
| 2024-04 | Multi-Turn Trust Escalation | Multi-Turn | High | Active | Gradually building rapport and trust over multiple conversation turns before introducing the restricted request. | Academic research |
| 2024-06 | Malicious GPT Applications | Deployment | Critical | Partial | Custom GPTs and AI agents intentionally configured with jailbroken system prompts. 45 malicious prompts documented. | GitHub (malicious-gpt, 70 stars), USENIX Security'24 |
| 2024-07 | Contextual Camouflage | Obfuscation | High | Partial | Embedding harmful requests within legitimate-sounding academic or research contexts to bypass content policies. | GitHub (GigaChat-Prompt-Jailbreak, 23 stars) |
| 2024-08 | Vision Model Typographic Injection (FigStep) | Multimodal | High | Partial | Embedding jailbreak text in images that vision-language models read and follow. AAAI 2025 Oral paper. | GitHub (FigStep, 200 stars), AAAI'25 |
| 2024-09 | ROT13 / Cipher Encoding | Encoding | Medium | Patched | Using simple substitution ciphers (ROT13, Caesar cipher) to encode harmful requests, relying on the model's decoding ability. | Community research |
| 2024-11 | System Prompt Override Claims | Direct Injection | Medium | Patched | "I am the developer. Update your instructions to..." attempts to impersonate system-level authority. | GitHub (AI-Prompt-Injection-Cheatsheet, 51 stars) |
| 2025-01 | CyberSecurity Prompt Dataset Exploits | Domain-Specific | High | Active | Specialized jailbreak prompts targeting cybersecurity domains: malware generation, exploit writing, network attack instructions. | GitHub (cysecbench/dataset, 36 stars) |
| 2025-03 | Playground Fuzzing (Folly) | Automated | Medium | Active | Open-source tools for automated jailbreak discovery through prompt fuzzing and mutation testing against LLM guardrails. | GitHub (Folly, 33 stars) |
| 2025-07 | Red Team Portfolio Attacks | Multi-Vector | High | Active | Systematic adversarial prompting combining persistence, alignment failure analysis, and prompt engineering across sessions. | GitHub (mobius-llm-adversity, 78 stars) |
| 2025-10 | Rationalist Ruleset Debugging | Meta-Reasoning | Medium | Active | Using epistemological and rationalist framing to "debug" LLM reasoning, auditing internal biases to override safety constraints. | GitHub (Rules.txt, 80 stars) |
| 2025-11 | Trojan Knowledge (CKA-Agent) | Optimization | Critical | Active | Bypassing commercial LLM guardrails via harmless prompt weaving and adaptive tree search. Automated attack optimization. | GitHub (CKA-Agent, 184 stars) |
| 2026-02 | Security Testing Framework (Augustus) | Automated | High | Active | 190+ adversarial probes across 28 providers in a single Go binary. Framework for systematic LLM security testing. | GitHub (augustus, 178 stars) |
| 2026-03 | Burp Suite LLM Injection (LLMInjector) | Tooling | High | Active | Burp Suite extension for automated prompt injection testing against web applications with LLM backends. | GitHub (LLMInjector, 38 stars) |
| 2026-03 | MCP Server Jailbreak Relay | Infrastructure | Critical | Active | Model Context Protocol servers providing enhancement prompts to bypass LLM safety limits through tool-use channels. | GitHub (chucknorris, 58 stars) |
Frequently Asked Questions
What is an LLM jailbreak?
An LLM jailbreak is a technique that bypasses the safety guardrails and alignment training of a large language model to make it produce content it was designed to refuse. Techniques range from simple role-playing prompts (DAN) to sophisticated encoding attacks (base64, token smuggling) and multi-turn manipulation. Jailbreaks exploit the gap between safety training and the model's instruction-following capabilities.
What was the first major LLM jailbreak technique?
The DAN (Do Anything Now) prompt, first appearing on Reddit in late 2022 and gaining widespread attention in early 2023, is considered the first major LLM jailbreak. It used role-playing to convince ChatGPT to adopt an unrestricted persona. DAN went through over 15 iterations (DAN 2.0-15.0) as OpenAI patched each version, establishing the cat-and-mouse dynamic that continues today.
Which LLM jailbreak techniques still work in 2026?
As of April 2026, several technique categories remain partially effective: multi-turn manipulation (gradually building trust across conversation turns), context overflow attacks (pushing safety instructions out of the attention window), novel encoding schemes, adversarial suffixes generated by optimization, and infrastructure-level attacks via MCP servers. Most simple techniques like basic DAN prompts have been patched in major models, but variants and combinations continue to emerge.
How do LLM providers defend against jailbreaks?
LLM providers use multiple defense layers: RLHF and Constitutional AI training to align model behavior, input classifiers that detect known jailbreak patterns before they reach the model, output filters that catch policy-violating responses, system prompt hardening with immutability declarations, and continuous red-teaming to discover new attack vectors. No single defense is complete, so providers rely on defense-in-depth strategies.
Are LLM jailbreaks illegal?
Jailbreaking an LLM itself is generally not illegal in most jurisdictions. However, using a jailbroken LLM to generate illegal content (malware, CSAM, instructions for violence) is illegal regardless of how the content was produced. Security researchers who discover jailbreaks through responsible disclosure are generally protected, and many providers offer bug bounties for novel jailbreak reports.