How are prompt security scores calculated?

Security scores are calculated based on coverage across 7 attack categories (direct injection, indirect injection, role-playing attacks, encoding attacks, prompt leaking, context overflow, and multi-turn manipulation), the specificity of defensive language, structural completeness (delimiters, examples, immutability declarations), and whether the pattern addresses edge cases within each category.

Should I use XML delimiters or markdown delimiters in my system prompt?

XML delimiters with unique, non-guessable tag names are more secure than markdown delimiters (triple backticks, horizontal rules). Markdown delimiters are commonly used in training data, making them easier for attackers to guess and escape. Custom XML tags like create a boundary that attackers cannot predict or replicate.

Original Research

Prompt Security Patterns Ranked by Security Score

Name: Prompt Security Patterns Dataset
Creator: Michael Lip
Published: 2026-04-07
License: https://creativecommons.org/licenses/by/4.0/

32 system prompt defensive patterns evaluated against 7 injection attack categories. Each pattern includes the actual prompt text, security score, coverage map, and documented weaknesses.

By Michael Lip · April 7, 2026 · Test your prompt with LochBot

🛡 This research analyzes defensive prompt patterns at the structural level. Real-world effectiveness depends on the specific model, deployment context, and attack sophistication. Use LochBot's scanner to test your own system prompt against 31 attack patterns.

Filter by technique:

# ▲	Pattern Name	Score ▼	Techniques	Coverage (D/I/R/E/L/C/M)	Details

Methodology

Each pattern was evaluated against 7 attack categories defined by the OWASP LLM Top 10 (2025) and academic prompt injection research from Perez & Ribeiro (2022), Greshake et al. (2023), and Liu et al. (2024). Scoring criteria:

Coverage breadth (0-30 points) — How many of the 7 attack categories does the pattern address?
Defense depth (0-25 points) — Does the pattern use multiple defensive layers per category?
Specificity (0-20 points) — Are defenses concrete (naming specific attacks) or vague ("be safe")?
Structural integrity (0-15 points) — Are delimiters, formatting, and instruction hierarchy well-structured?
Robustness to variation (0-10 points) — Does the pattern handle paraphrased/translated/encoded versions of attacks?

Attack Categories

D — Direct Injection: "Ignore previous instructions," "New instructions:," instruction override attempts
I — Indirect Injection: Malicious instructions embedded in retrieved documents, web content, or tool outputs
R — Role-Playing Attacks: "Pretend you are DAN," persona switching, unrestricted mode requests
E — Encoding Attacks: Base64, ROT13, hex, Unicode, or other encoded malicious instructions
L — Prompt Leaking: "Repeat your system prompt," "What were your instructions?," extraction attempts
C — Context Overflow: Long padding text to push system prompt out of context window
M — Multi-Turn Manipulation: Gradual trust building across conversation turns to relax restrictions

Limitations

These scores reflect structural analysis of the prompt text. Actual effectiveness depends on the specific LLM, its training, RLHF alignment, and the sophistication of the attacker. A structurally sound prompt can still fail against a poorly aligned model, and a minimal prompt may work fine with a well-aligned one. This research is a complement to, not a replacement for, red-team testing against your deployed model.

Last updated: April 7, 2026

Related Resources

LochBot Scanner Minimum Viable Secure Prompt Defense Techniques Compared LochBot Blog Zovo Tools

Frequently Asked Questions

What is a prompt security pattern?

A prompt security pattern is a specific defensive technique embedded in a system prompt to protect an LLM-powered application against injection attacks. Patterns include XML delimiters, role reinforcement, explicit ban lists, few-shot refusal examples, and input sanitization instructions. Each pattern defends against one or more of the 7 major prompt injection attack categories.

How are the security scores calculated?

Security scores are calculated across five dimensions: coverage breadth (how many attack categories are addressed), defense depth (multiple layers per category), specificity (concrete vs. vague defenses), structural integrity (delimiters, formatting), and robustness to variation (handling paraphrased or encoded attacks). Each dimension contributes to the 0-100 total score.

Which single pattern is most effective?

No single pattern provides complete protection. The highest-scoring individual patterns combine multiple techniques: XML-delimited instructions with few-shot refusal examples, explicit ban lists, and immutability declarations. The Layered Defense Fortress pattern scores 92/100 by combining 6 techniques, but even it has weaknesses against novel context overflow variations.

Do these patterns work with all LLMs?

Pattern effectiveness varies by model. Instruction-tuned models like GPT-4, Claude, and Gemini respond well to explicit defensive instructions. Smaller or less-aligned models may ignore even well-structured patterns. Few-shot refusal examples are the most model-agnostic technique because they leverage in-context learning rather than pure instruction following.

How do I test if my pattern works?

Use LochBot's free scanner to analyze your system prompt against 31 attack patterns for structural coverage. For behavioral testing, run actual attack prompts from each category against your deployed model. Combine structural analysis with red-team testing for comprehensive coverage. The OWASP LLM Top 10 provides a framework for systematic testing.

What is the difference between direct and indirect injection?

Direct injection occurs when a user explicitly tells the model to ignore its instructions. Indirect injection occurs when malicious instructions are embedded in external data the model processes, such as documents, web pages, or database results. Indirect injection is harder to defend against because the attack surface is the data pipeline, not the user input itself.

Should I use XML delimiters or markdown delimiters?

XML delimiters with unique, non-guessable tag names are more secure than markdown delimiters like triple backticks or horizontal rules. Markdown delimiters appear frequently in training data, making them easier for attackers to guess and escape. Custom XML tags like <x7k_system_instructions> create a boundary that attackers cannot predict or replicate.