If you can't define what normal agent behavior looks like, you can't detect when it's compromised. Behavioral baselines are the agent equivalent of network traffic baselines — and most teams don't have them. Here's how to build them, what tools exist, and patterns from my own agent systems.
A hijacked agent looks normal. It uses the same tools, calls the same APIs, generates the same format of output. The difference is subtle: the tool call sequence changed, the parameters shifted, the output contains data the task didn't require. Without a baseline of what "normal" looks like, you can't detect these shifts.
Microsoft elevated AI observability to a security requirement in March 2026 — positioning it with the same seriousness as authentication, encryption, and access management. Their recommendation: complete audit trails of all AI interactions including prompts, responses, intermediate reasoning steps, and external actions.
Four layers of behavioral signals, from easiest to hardest to implement.
Which tools the agent calls, in what order, how often, and with what parameters. A code review agent that suddenly calls web_fetch or reads .ssh/ has deviated from its baseline. This is the easiest signal to capture and the most reliable indicator of compromise.
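A minimal sketch of this signal, assuming a hypothetical per-run baseline of expected tools and average call counts (the tool names and numbers here are illustrative, not from any real system):

```python
from collections import Counter

# Hypothetical baseline from prior runs of a code review agent:
# which tools it calls, and roughly how often per run.
BASELINE_TOOLS = {"read_file": 12.0, "grep": 4.0, "post_review_comment": 3.0}

def check_tool_calls(calls, spike_factor=3.0):
    """Flag tools outside the baseline set, or called far more often than usual."""
    alerts = []
    for tool, n in Counter(calls).items():
        if tool not in BASELINE_TOOLS:
            alerts.append(f"UNKNOWN_TOOL: {tool}")
        elif n > BASELINE_TOOLS[tool] * spike_factor:
            alerts.append(f"SPIKE: {tool} called {n}x (baseline ~{BASELINE_TOOLS[tool]})")
    return alerts

# A review run that suddenly fetches URLs deviates from the baseline.
print(check_tool_calls(["read_file", "grep", "web_fetch"]))
```

In practice the baseline would be learned from logged runs rather than hardcoded, and sequence order (not just frequency) is worth checking too.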
Sudden spikes in token consumption, unusual response times, or cost anomalies. A hijacked agent executing a multi-step exfiltration will generate more tool calls and tokens than a normal task. This is cheap to monitor and catches resource exhaustion attacks (Kill Chain Stage 4).
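One cheap way to score such spikes is a z-score against historical token usage; this is a generic sketch, not any vendor's detector, and the token counts are made up:

```python
import statistics

def token_anomaly(history, current, z_threshold=3.0):
    """Flag a run whose token count sits far above the baseline distribution."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # guard against zero variance
    z = (current - mean) / stdev
    return z > z_threshold, round(z, 2)

# Baseline: typical runs use ~2k tokens. A 40k-token run is a red flag.
history = [1800, 2100, 1950, 2300, 2050, 1900]
print(token_anomaly(history, 40_000))
```

The same check applies to tool-call counts, latency, or per-run cost; anything with a stable distribution under normal operation.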
What the agent outputs — does it contain data the task didn't request? PII, credentials, file contents that weren't part of the assignment? Output classifiers can detect when agent responses contain unexpected sensitive data — catching Stage 5 EXFILTRATE at the output boundary.
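A toy version of such an output classifier, using three illustrative regexes (a production classifier would use a dedicated secrets-scanning or DLP library, and the `task_allows` parameter is an assumption of this sketch):

```python
import re

SENSITIVE_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify_output(text, task_allows=frozenset()):
    """Return sensitive-data categories present in agent output
    that the task did not explicitly request."""
    hits = {name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(text)}
    return sorted(hits - set(task_allows))

# A code review summary should never contain a cloud access key.
print(classify_output("LGTM, but AKIAABCDEFGHIJKLMNOP appears in config.py"))
```

The `task_allows` set is what makes this a baseline check rather than a blanket filter: an email-drafting task legitimately emits email addresses; a code review task does not.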
Per-decision confidence scores that determine whether the agent auto-executes, executes with caveats, or escalates to a human. In my own systems, I use three tiers: high confidence (90%+) auto-executes, medium (60-90%) executes with logging and caveats, low (<60%) escalates to human review.
Every finding has a confidenceScore field and a requiresHumanReview boolean. Escalation triggers include: AMBIGUOUS_POLICY, LOW_CONFIDENCE, EXPLICIT_REQUEST, and POLICY_GAP. Pattern-based detections have confidence 1.0 (deterministic). AI-based detections have variable confidence (0.0-1.0). This differentiation prevents false confidence in uncertain findings.
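The tiering and escalation logic above can be sketched as follows; the field names mirror the text, but the routing function and thresholds here are an illustrative reconstruction, not the actual ComplianceAI code:

```python
from dataclasses import dataclass, field

ESCALATION_TRIGGERS = {"AMBIGUOUS_POLICY", "LOW_CONFIDENCE", "EXPLICIT_REQUEST", "POLICY_GAP"}

@dataclass
class Finding:
    confidence_score: float            # 1.0 for pattern-based, 0.0-1.0 for AI-based
    triggers: set = field(default_factory=set)

    @property
    def requires_human_review(self) -> bool:
        return self.confidence_score < 0.60 or bool(self.triggers & ESCALATION_TRIGGERS)

def route(finding: Finding) -> str:
    """Map a finding to one of the three execution tiers."""
    if finding.requires_human_review:
        return "escalate"
    if finding.confidence_score >= 0.90:
        return "auto_execute"
    return "execute_with_caveats"       # 0.60-0.90: execute, but log and caveat

print(route(Finding(1.0)))                        # deterministic pattern match
print(route(Finding(0.75)))                       # mid-confidence AI detection
print(route(Finding(0.95, {"POLICY_GAP"})))       # trigger overrides high confidence
```

Note the trigger check runs before the confidence check, so an escalation trigger wins even at high confidence.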
The observability stack for agent behavioral monitoring. All vendor-agnostic.
The emerging standard for agent observability. Experimental but rapidly standardizing (major update March 2026). Defines standardized schemas for prompts, model responses, token usage, tool/agent calls, and provider metadata. Agent-specific spans include create_agent and invoke_agent. Vendor-agnostic — replaces fragmented custom tracing.
Step-by-step execution traces: every LLM call, tool use, and API interaction with full parameters. Over 1 billion trace logs processed. Captures token usage, latency (P50/P99), error rates, cost breakdowns, and feedback scores. Evaluators can score intermediate decisions for quality.
Open-source LLM observability built on OpenTelemetry. Traces agent runs, tool calls, and model request/response with full context. Supports evaluation via LLM-based evaluators, code-based checks, or human labels. Integrates with Claude Agent SDK, OpenAI Agents SDK, LangGraph, and CrewAI.
Creates behavioral baselines after 50 agent runs. Provides fingerprint diffs after deployment — shows exactly what changed in decision outcomes, latency percentiles, tool distribution, and error patterns. Statistical rigor for detecting drift.
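A vendor-agnostic sketch of what a fingerprint diff over tool distribution can look like, using total variation distance between before/after deploy runs (this is not the tool's actual algorithm, and the runs are synthetic):

```python
def tool_distribution(runs):
    """Normalize tool-call counts across runs into a probability distribution."""
    counts = {}
    for run in runs:
        for tool in run:
            counts[tool] = counts.get(tool, 0) + 1
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def fingerprint_diff(before, after):
    """Total variation distance between two tool distributions
    (0.0 = identical, 1.0 = completely disjoint)."""
    p, q = tool_distribution(before), tool_distribution(after)
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)

baseline = [["read_file", "grep"], ["read_file", "post_comment"]]
deployed = [["read_file", "web_fetch"], ["web_fetch", "post_comment"]]
print(f"tool distribution drift: {fingerprint_diff(baseline, deployed):.2f}")
```

The same diff can be computed per tool, per latency bucket, or per error class; what matters is comparing distributions rather than single runs.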
Complete observability for the AI stack — LLMs, vector databases, GPUs. One line of code to instrument. Built on OpenTelemetry, so traces are compatible with any OTel-compliant backend.
From my ComplianceAI system — a multi-pass review pattern where each pass has expected output characteristics. Deviation in Pass 3 from Pass 1+2 signals an anomaly.
Each file scanned independently. Expected output: findings with file paths, line numbers, severity, and rule IDs. Baseline: consistent finding density per file type. A Swift file with zero security findings when the baseline shows 2-3 per file is suspicious.
Findings from Pass 1 are correlated across files. Expected output: architectural-level findings that span multiple files. Baseline: cross-file findings are a subset of Pass 1 findings, not new findings. New findings here suggest Pass 1 missed something — or the agent's behavior changed between passes.
A separate review of Pass 1+2 output to catch misses. Expected output: validation or correction, not wholesale new findings. If Pass 3 generates significantly different results from Pass 1+2, either the earlier passes failed or the agent's context was compromised between passes.
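The cross-pass consistency check above can be expressed as a simple set comparison; the finding-key format and the 20% threshold here are illustrative assumptions, not the actual ComplianceAI implementation:

```python
def pass_consistency(pass12_findings, pass3_findings, max_new_ratio=0.2):
    """Pass 3 should validate or correct, not generate wholesale new findings.
    Flag when the fraction of Pass-3 findings unseen in Pass 1+2 exceeds a threshold."""
    known = set(pass12_findings)
    new = sorted(f for f in pass3_findings if f not in known)
    ratio = len(new) / max(len(pass3_findings), 1)
    return ratio <= max_new_ratio, new

# Hypothetical finding keys: rule ID, file, line.
ok, new = pass_consistency(
    pass12_findings={"SEC-012:auth.swift:44", "SEC-007:net.swift:10"},
    pass3_findings={"SEC-012:auth.swift:44", "SEC-031:db.swift:9", "SEC-044:ui.swift:3"},
)
print(ok, new)  # two-thirds of Pass 3 is new -> fails the consistency check
```

When the check fails, the run should escalate to human review rather than trust either result, since you can't tell from inside the run whether the early passes missed findings or the later pass was compromised.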
Agent behavior changes legitimately over time — new tools added, workflows updated, models upgraded. Your baseline becomes stale. You need to re-calibrate regularly, which means you need to distinguish "the agent evolved" from "the agent was compromised." This is the hardest problem in behavioral monitoring.
No one has published rigorous false positive rates for agent behavioral monitoring. Industry claims of "80%+ detection" lack methodology citations. Until someone publishes peer-reviewed detection accuracy data for agent monitoring specifically, treat all detection claims with skepticism — including the tools listed above.
A sophisticated attacker who gradually shifts agent behavior over many sessions — small changes that stay within the baseline's noise threshold — can avoid detection entirely. Baselines catch sudden deviations. They're weak against gradual behavioral drift that looks like legitimate evolution.
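One partial mitigation is a cumulative statistic rather than a per-run threshold: a CUSUM-style detector accumulates small upward deviations that individually stay inside the noise band. A minimal sketch, with made-up slack and threshold values:

```python
def cusum(values, target, slack=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate small upward deviations from the baseline mean.
    Each deviation alone stays under a per-run threshold; the running sum does not."""
    s = 0.0
    for i, v in enumerate(values):
        s = max(0.0, s + (v - target - slack))
        if s > threshold:
            return i  # index of the run where cumulative drift crossed the threshold
    return None  # no drift detected

# Per-run values creeping up by amounts too small to trip a 3-sigma check.
runs = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5]
print(cusum(runs, target=10.0))
```

CUSUM still can't distinguish malicious drift from legitimate evolution; it only surfaces the drift earlier, so a human can make that call.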
Behavioral baselines are the detection layer in the Kill Chain. Combine with hook-based guardrails (prevention), MCP security (tool defense), and red teaming (validation).
Governance guides, more detection patterns, and practitioner content are coming.
This work represents the author's independent research and personal views. It is not related to or endorsed by the author's employer.