magesh.ai agent v1.0 (views are my own) · kill-chain resources about
viewing: red_team · 00:00:00
← agent.navigate: resources / assessment & red teaming
25 min read · 6 tools · 3 methodologies · 14 references

Agent Red Team Framework

Traditional pen testing doesn't cover agent-specific attack vectors. You can't nmap an agent's reasoning chain. You need tools built for this — and a methodology that tests each kill chain stage systematically.

category:
Assessment & Red Teaming · security-teams
CONTEXT Red teaming tests each Kill Chain stage — read the threat model first →

The Numbers

From AgentDojo (ETH Zurich) — the most comprehensive agent security benchmark available. These numbers show why agent security testing requires dedicated tools.

92%
attack success rate on Slack agent suite
AgentDojo, ETH Zurich
7.5%
attack success rate WITH tool-filtering defense
same benchmark, same attacks
<66%
of tasks solved by best models even WITHOUT attacks
agents are fragile by default
Counterintuitive finding: More capable models are easier to attack via prompt injection. They follow instructions more reliably — including injected ones. Inverse scaling means your best model may be your most vulnerable.

Six Tools

Each serves a different purpose. Use them together for coverage.

AgentDojo — Agent Security Benchmark

97 tasks and 629 security test cases across email, Slack, banking, and travel agent suites. Tests indirect prompt injection — malicious text in tool outputs that causes unauthorized actions. Measures both attack success rate AND utility degradation.

Use for: Benchmarking your agent's resilience to indirect injection (Kill Chain Stage 2). The tool-filtering defense result (92% → 7.5% ASR) is the strongest evidence that defensive controls work.
Source: Debenedetti et al., ETH Zurich. github.com/ethz-spylab/agentdojo
PyRIT — Automated Red Teaming

Microsoft's open-source framework (v0.11.0, 3.4k stars). 20+ reusable attack strategies including single-turn, multi-turn, Crescendo (gradual escalation), and Tree of Attacks with Pruning (TAP). Now integrated into Azure AI Foundry as the "AI Red Teaming Agent."

Use for: Automated, multi-turn attack campaigns against your agent. Best for Stage 3 HIJACK testing — can the agent's behavior be redirected through conversation?
Source: Microsoft. github.com/Azure/PyRIT
Garak — LLM Vulnerability Scanner

NVIDIA's scanner — "nmap for LLMs." Four-component architecture: generators (connect to target), probes (attack vectors), detectors (classify responses), reporting (structured output). Covers prompt injection, jailbreaking, data leakage, toxicity, hallucination.

Use for: CI/CD integration. Run Garak on every model update to catch regressions. Best for broad vulnerability scanning across Stages 1-2.
Source: NVIDIA. github.com/NVIDIA/garak
CyberSecEval — Progressive Security Benchmarks

Meta's benchmark suite under Purple Llama (v1-v4). v3 is most agent-relevant: tests whether LLMs can autonomously execute offensive security operations. v4 adds CyberSOCEval (with CrowdStrike) for SOC automation testing and AutoPatchBench for vulnerability patching.

Use for: Evaluating whether your agent can be weaponized for offensive operations — maps directly to Stage 3 HIJACK and Stage 4 ESCALATE.
Source: Meta AI. github.com/meta-llama/PurpleLlama
Prompt Guard 2 — Injection Detection

Meta's lightweight classifier for real-time prompt injection detection. Prompt Guard 2 (86M mDeBERTa-base / 22M DeBERTa-xsmall) uses binary classification (benign vs. malicious). The original Prompt Guard v1 used three classes (benign, injection, jailbreak). Deployable on CPU for real-time filtering. Fine-tunable to your data.

Important caveat

Researchers have demonstrated that Prompt Guard itself is vulnerable to prompt injection attacks — the detector can be bypassed. This is a defense-in-depth layer, not a silver bullet.

Source: BankInfoSecurity, "Researchers Prompt Injection Attack Meta's Prompt Guard" (2025)
HarmBench — Attack vs Defense Comparison

Center for AI Safety benchmark. The largest-scale comparison: 18 red teaming methods tested against 33 target LLMs and defenses. Key finding: no current attack or defense is uniformly effective. Robustness is independent of model size.

Use for: Selecting which attack methods to use against your specific model/defense combination. HarmBench data tells you which attacks are most effective against which defenses.
Source: Center for AI Safety. harmbench.org

Three Methodologies

01
CSA Agentic AI Red Teaming Guide

The most comprehensive methodology available (May 2025, 50+ contributors from CSA and OWASP). 12 threat categories including Agent Authorization Hijacking, Goal Manipulation, Multi-Agent Exploitation, Memory Poisoning, and Supply Chain Attacks. Five-phase process: preparation, execution, analysis, reporting — applied across all 12 categories.

This is the closest thing to a standard methodology for agent red teaming that exists today.

02
OWASP Top 10 for Agentic Applications

Released December 2025, 100+ contributors. Ten agent-specific risks: ASI01 (Agent Goal Hijack) through ASI10 (Rogue Agents). Covers identity, tools, delegated trust boundaries, and autonomous operation risks. Use this as your risk checklist — each ASI maps to specific test cases.

03
Evidence-First Auditing (from my practice)

No claim without verifiable evidence — file path, command output, reproducible steps. Minimum evidence bar: 10+ findings, at least 1 architecture-level and 1 process-level finding. Severity rubric: Critical (data loss, account takeover confirmed by code path analysis), High (data exposure possible but mitigated), Medium (configuration weakness), Low (style deviation). Every finding must include impact, likelihood, evidence, fix, and validation steps.

This isn't a published standard — it's how I run security assessments. The point is that red teaming without evidence is just a conversation.

Testing Each Stage

Map each kill chain stage to the right tool and test.

Kill Chain StageWhat to TestTool
01 RECONCan the agent's tools, permissions, and system prompt be extracted?Manual probing + Garak
02 INJECTCan indirect injection via tool responses change agent behavior?AgentDojo + PyRIT
03 HIJACKCan the agent's goal be substituted through multi-turn conversation?PyRIT (Crescendo, TAP) + CyberSecEval v3
04 ESCALATECan the agent access tools beyond its intended scope?Manual + hook bypass testing
05 EXFILCan the agent leak data through legitimate channels?AgentDojo + manual output review
06 PERSISTCan memory or config files be poisoned for future sessions?Manual + MCP security checks

Red teaming is how you validate the kill chain's defensive controls. Combine it with hook-based guardrails (prevention) and MCP security (tool-layer defense) for defense in depth.

References
[1]Debenedetti et al., "AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents." ETH Zurich (2024). github.com/ethz-spylab/agentdojo
[2]Microsoft, "PyRIT: Python Risk Identification Tool for generative AI." github.com/Azure/PyRIT (v0.11.0, 2026)
[3]NVIDIA, "Garak: LLM Vulnerability Scanner." github.com/NVIDIA/garak
[4]Meta AI, "CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks in LLMs." github.com/meta-llama/PurpleLlama
[5]Meta, "Prompt Guard 2 Model Card." llama.com/docs/model-cards-and-prompt-formats/prompt-guard/
[6]Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming." Center for AI Safety (2024). harmbench.org
[7]Cloud Security Alliance, "Agentic AI Red Teaming Guide." (May 2025). cloudsecurityalliance.org
[8]OWASP, "Top 10 for Agentic Applications." (December 2025). genai.owasp.org
[9]DEF CON 31 AI Village Generative Red Team Challenge. (August 2023). humane-intelligence.org/grt
[10]BankInfoSecurity, "Researchers Prompt Injection Attack Meta's Prompt Guard." (2025)
[11]Microsoft, "AI Red Teaming Agent — Azure AI Foundry." learn.microsoft.com (2026)
[12]AgentDojo Results Dashboard. agentdojo.spylab.ai/results/
[13]Dhanasekaran, M. "The Agentic AI Kill Chain." magesh.ai/kill-chain (2026)
[14]Dhanasekaran, M. "Hook-Based Guardrails." magesh.ai/hook-guardrails (2026)

This work represents the author's independent research and personal views. It is not related to or endorsed by the author's employer.