Autonomous AI systems — agents, multi-agent orchestrators, MCP-connected tooling, RAG-plus-action architectures — expand the attack surface of your stack in ways traditional appsec threat models do not cover. Prompt injection, tool abuse, memory poisoning, and excessive agency are not theoretical anymore. They are exploitable today, in production, against real companies. We have the CVEs to prove it.
This post is the guide we wished existed when we started doing AI pentests. It covers: a reference architecture for a reasonably secure autonomous AI system; the threat model that actually applies in 2026; a 110-item pre-production hardening checklist; a decision framework for when a pentest is required versus nice to have; and what an AI-focused pentest must cover that a standard web app pentest will miss. If you are shipping an agent, a copilot, or anything that takes a user prompt and calls a tool on behalf of that user, this is for you.
What Actually Counts as "Autonomous AI"
Before we talk about securing it, we need to define it. The term autonomous AI has been marketed into meaninglessness, so let's be precise. For the purposes of this guide, an autonomous AI system is any system where:
- A language model (or multimodal model) makes decisions that cause side effects in the real world.
- Those decisions can chain — the output of one action becomes the input of the next.
- A human is not reviewing every single step before it executes.
Under that definition, the common 2026 patterns are:
- Single-agent tool callers — one LLM, a set of tools (HTTP, DB, shell, MCP servers), a loop.
- Multi-agent orchestrators — planner, researcher, critic, executor; they message each other.
- RAG-plus-action systems — retrieval informs the prompt, the model then acts.
- Embedded copilots — IDE assistants, CRM copilots, SOC analyst copilots.
- Autonomous browser agents — an LLM driving a headless browser through real web UIs.
Each has overlapping threat models but a different blast radius. A copilot that drafts Slack messages is very different from an agent that can rm -rf a production VM. The blast radius is set by two things: what tools the agent can call, and what identity those tools execute under. Every architecture question we ask in the rest of this post reduces to containing those two things.
Why 2026 Is the Year This Got Real
The last eighteen months turned agent security from a research topic into an incident class. If your threat model is still built on 2023 papers, it is missing the events that actually shipped CVEs. Four shifts matter.
1. Production zero-clicks are real now
In June 2025, researchers disclosed EchoLeak (CVE-2025-32711, CVSS 9.3) against Microsoft 365 Copilot — the first documented zero-click prompt injection in a shipped enterprise LLM product. A single crafted email, never opened by the user, was enough to make Copilot exfiltrate tenant data through reference-style Markdown images and a Teams proxy the Content Security Policy already allowlisted. "Train the model to resist suspicious prompts" is not a defense against this. The email sat in the inbox and waited.
2. The MCP ecosystem is a supply-chain minefield
Model Context Protocol went from "interesting idea" to "default enterprise agent plumbing" in 2025, and the vulnerabilities followed immediately. CVE-2025-6514 in mcp-remote (CVSS 9.6, 437K+ downloads) was the first public RCE against an MCP client — Claude Desktop, VS Code, and Cursor were all affected. Cursor MCPoison (CVE-2025-54136) and CurXecute (CVE-2025-54135) demonstrated post-approval config swaps and untrusted-input editor hijacks. In April 2026, OX Security reported an MCP STDIO design issue exposing up to 200,000 servers across LiteLLM, LangChain, LangFlow, and Flowise. Anthropic declared the behavior by-design. Knostic's internet scan found 1,862 publicly exposed MCP servers, many speaking their full tool catalog to unauthenticated callers.
3. Indirect injection through every ingestion channel works
Security researcher Johann Rehberger spent August 2025 (the "Summer of Johann") shipping a per-day disclosure run against ChatGPT, Codex, Anthropic MCPs, Cursor, Amp, Devin, OpenHands, Claude Code, GitHub Copilot, and Google Jules — every single one had an exploitable prompt-injection variant. SafeBreach's "Invitation Is All You Need" smuggled promptware via Google Calendar invites and Gmail into Gemini and took control of smart-home devices and Workspace data. Tenable's "HackedGPT" research demonstrated seven distinct indirect-injection chains against ChatGPT's browsing, memory, and SearchGPT features.
4. Memory and multi-agent are the new persistence
The SpAIware research showed that indirect injection can implant instructions into ChatGPT long-term memory and survive across sessions, silently exfiltrating conversation data. MemoryGraft (Dec 2025) demonstrated that a small number of poisoned "successful experience" records can dominate agent memory retrieval and turn self-improvement into persistent compromise. The Dark Side of LLMs paper reported a 100% attack success rate for inter-agent communication exploits in multi-agent systems, because peer agents are treated as inherently trusted. The AgentDojo benchmark found that GPT-4o's utility dropped from 69% to 45% under attack, with 53.1% targeted ASR on the canonical "Important message" injection.
The Threat Model That Actually Applies
Traditional application security threat modeling (STRIDE, attack trees, DFDs) still applies. But autonomous AI adds categories traditional appsec doesn't cover. Here is the matrix we use when we scope engagements — the codes map to OWASP's 2025 and 2026 taxonomies for easy cross-referencing.
Prompt Injection (Direct + Indirect)
The attacker plants instructions in user input, retrieved documents, tool outputs, or memory. Indirect is the one that actually breaks companies.
Excessive Agency / Tool Abuse
Tools with broader permissions than needed. The LLM isn't adversarial but it's fallible, and any upstream attacker can weaponize that fallibility.
Supply Chain (Models, MCP, Deps)
Model weights, fine-tunes, embedding models, MCP servers, Python/Node packages, base images. MCP is 2026's npm left-pad moment.
Sensitive Information Disclosure
The model leaks PII, secrets, or other tenants' data via tool outputs, log echoes, or error messages. Trigger: helpfulness, not malice.
Improper Output Handling / Exfil Side-Channels
Markdown image rendering, clickable links, tool parameters that double as data smuggling, error echoes into attacker-readable logs.
Memory + Context Poisoning
Once a malicious instruction is written to vector DB / long-term memory, it fires on every future request until someone notices. SpAIware, MemoryGraft.
Unbounded Consumption / Denial of Wallet
Agents looping on each other, prompt amplification, forced recursion. A multi-agent argument burns through an OpenAI budget in minutes.
System Prompt Leakage
Your system prompt is a trade secret — and it's one carefully crafted user message away from the screen if you haven't hardened extraction.
Vector / Embedding Weaknesses
Cross-tenant retrieval, poisoned embeddings, inversion of stored vectors. New in OWASP LLM 2025 because it was the attack surface nobody was watching.
Real examples we have seen in engagements
- A customer support agent that read ticket bodies. A malicious ticket told it to exfiltrate prior conversations to an attacker-controlled webhook. The ticket was submitted through the normal customer portal. The agent had never been "jailbroken" in the classical sense — the ticket queue was the injection vector.
- A recruiting agent that parsed resumes from an S3 bucket. A resume contained hidden white-on-white text instructing the agent to email every future applicant's resume to a candidate-controlled inbox. It did this for eleven days before anyone noticed.
- A browser agent told via a comment on a public GitHub issue to open a shell and run a base64-encoded command. The agent was tasked with "triage open issues." The comment looked like a bug report.
- A data analytics agent with a read-write database connection "because it was easier to configure" — an injection in a user's natural-language question caused it to execute a
DROP TABLE. Production data. Nobody had authorized the write path.
The Lethal Trifecta
Simon Willison's framing is the single most useful mental model we've picked up in two years of AI red teams. An agent is functionally exploitable whenever three properties are simultaneously true:
If all three are present, the agent will eventually be made to exfiltrate the private data via the untrusted content and the external channel. There is no reliable defense at the model layer. The only durable fixes are to break one of the three legs: scope down the data, sanitise the untrusted content in a separate layer, or whitelist the egress. EchoLeak, the SafeBreach Gemini work, HackedGPT, and most of the Summer of Johann findings are all the trifecta.
Incident Dossier: What 2025 Actually Looked Like
Five incidents from the last year that should be canonical reading for anyone shipping an agent. These are the models to threat-model against — not hypothetical academic attacks.
EchoLeak — Microsoft 365 Copilot
CVE-2025-32711 · CVSS 9.3What happened: A crafted email containing reference-style Markdown exfiltrated tenant data from M365 Copilot with zero user interaction. The payload used a Teams proxy URL already on the Content Security Policy allowlist, so the image-based exfil bypassed domain filtering. Patched server-side in June 2025 by Microsoft.
mcp-remote RCE
CVE-2025-6514 · CVSS 9.6What happened: The widely used mcp-remote package (437K+ downloads) contained a remote-command-execution flaw exploitable by a malicious MCP server. Claude Desktop, VS Code, and Cursor were all affected. First real-world full RCE against an MCP client.
Cursor MCPoison + CurXecute
CVE-2025-54135 · CVE-2025-54136What happened: Cursor trusted MCP configurations after a single user approval. Attackers could swap the tool command post-approval and the editor would re-use the approval for the new command (MCPoison). CurXecute allowed a single line of untrusted input in an opened file to hijack the editor with developer privileges.
Invitation Is All You Need — Google Gemini
Nassi et al. · Jun 2025What happened: Promptware smuggled via Google Calendar invites and Gmail hijacked Gemini to control smart-home devices (thermostats, blinds, boilers) and exfiltrate Workspace data. Google shipped layered defenses in response.
SpAIware — Persistent ChatGPT Memory Exfiltration
Rehberger / Academic · 2024-2025What happened: Indirect prompt injection wrote instructions into ChatGPT's long-term memory. The instructions survived across sessions and silently exfiltrated subsequent conversation content. "The model forgets" is not a security property.
Reference Architecture for a Reasonably Secure Autonomous AI
Here is what the pipeline should look like. Every arrow is a potential security boundary. Treat them as trust boundaries the same way you would treat the boundary between a web server and its database.
The two boundaries that matter most are the model trust boundary and the tool execution boundary. Everything above the first should be authenticated, rate-limited, and policy-filtered. Everything below the second should be sandboxed so that a fully compromised agent cannot reach outside its capability token. Between them, the output parser and tool router are where you enforce business rules the model cannot be relied upon to enforce.
The Pre-Production Hardening Checklist
One hundred ten items, grouped by function. Click a section to expand. If you are less than 80% of the way through this list, you are not production-ready — regardless of what the launch calendar says. Items marked critical are the ones we routinely find missing during engagements.
4.1 Identity, Authentication, Authorization 9 items
The agent is a privileged service identity. Treat it like one.
- Every request to the agent is authenticated — no anonymous agent traffic in production.
- The agent executes under a dedicated service identity, not a shared one.
- Tool calls propagate the end user's identity (on-behalf-of, not agent-as-root).
- Separate credentials for dev, staging, and prod inference.
- API keys for LLM providers are in a secrets manager (Vault, AWS SM, Doppler, Infisical) — never in env files checked to git.
- Secrets rotate on a defined cadence (90 days max).
- Documented process for revoking a compromised agent key in under 15 minutes.
- Per-user rate limits exist (not just global).
- Per-user token budgets exist — denial-of-wallet protection.
4.2 System Prompt & Instruction Hygiene 6 items
Your system prompt will eventually be extracted. Plan for it.
- The system prompt is versioned in source control.
- The system prompt is loaded at runtime from a signed or hashed source — not concatenated from user-editable config.
- The system prompt explicitly defines what the agent will and will not do.
- The system prompt tells the model to treat retrieved content and tool outputs as untrusted data, not instructions.
- You have tested extraction attempts against your system prompt and documented which succeed.
- You accept that determined attackers will eventually extract the system prompt, and you do not put secrets in it.
4.3 Input Handling 6 items
Every ingestion channel is a prompt channel.
- User inputs are length-capped before they hit the model.
- Inputs are scanned for known prompt-injection patterns (not a silver bullet; raises the bar).
- Retrieved documents are wrapped in delimiters and labeled as untrusted data, not instructions.
- File uploads are virus-scanned and type-validated before the agent sees them.
- PDFs, DOCX, and images are sanitized: remove hidden text, strip metadata, OCR in a sandbox.
- HTML content is stripped of script tags, data URIs, and suspicious link schemes before being fed to the model.
4.4 Tool Design (The Most Important Section) 11 items
The damage is done in the tools, not the model. This is where hardening pays back.
- Every tool has a threat-model doc answering: what happens if the LLM calls this tool with adversarial arguments?
- Every tool validates its parameters against a strict schema (Pydantic, Zod, JSON Schema).
- Tools that write to external systems use idempotency keys.
- Tools that send messages (email, SMS, Slack, Teams) have per-recipient-per-hour rate limits.
- Tools that spend money or send external communications above a threshold require human confirmation.
- Tools that access user data are scoped to the requesting user's data only.
- No tool has blanket filesystem, network, or shell access.
- If you need a shell tool, it runs in a disposable sandbox (gVisor, Firecracker, disposable container) with no persistent state.
- Shell sandboxes have egress allowlists, not blocklists.
- Tool errors are caught and sanitized before being fed back to the model — no stack traces, internal paths, or creds.
- You have a kill switch that can disable any specific tool in production without a deploy.
4.5 MCP Server Security 6 items
MCP is 2026's npm left-pad moment. Treat servers like dependencies.
- Maintained allowlist of MCP servers your agents can connect to.
- Third-party MCP servers are source-reviewed, pinned to specific versions, and checksummed.
- MCP tool descriptions are reviewed for instruction smuggling before the server is enabled (rug-pull attack class).
- MCP servers run with the minimum credentials they need (scoped tokens).
- MCP server outputs are treated as untrusted data — same as user input.
- Logging on every MCP tool call, including parameters and return values.
4.6 Output Validation 5 items
Exfiltration lives in the output channel. Close it down.
- Structured outputs (JSON, YAML) validated against a schema before action.
- Markdown output is sanitized: no raw HTML, no auto-loading images from arbitrary origins.
- Outbound links are disabled, gated behind a confirmation click, or reputation-checked.
- Agent cannot include raw credentials, API keys, or PII in visible output (scrubber).
- If the agent renders images, it does not auto-fetch arbitrary URLs — this is the classic exfil channel (see EchoLeak).
4.7 Memory & State 6 items
Memory is persistence. Treat it like a production datastore.
- Long-term memory is scoped per user and per agent instance.
- You can audit what is in memory for any given user.
- You can delete a user's memory on request (GDPR, CCPA, contractual).
- Memory entries are tagged with provenance (who wrote this, from what source).
- Suspicious entries (entries that look like instructions rather than facts) are flagged.
- Vector-store queries log the user context so you can detect cross-tenant retrieval bugs (LLM08).
4.8 Logging, Monitoring, Incident Response 8 items
You can't defend what you can't see, and you can't learn from what you didn't log.
- Every user prompt, system prompt, retrieved context, tool call, and model output is logged with a correlation ID.
- Logs are tamper-evident: append-only, centralized, shipped off-host.
- Logs retained per a documented policy — and not longer for sensitive data.
- Logs do not contain raw secrets — scrub API keys, tokens, passwords before shipping.
- Dashboards for: token spend per user, tool-call frequency, unusual tool sequences, failed tool calls.
- Alerts on: budget spikes, repeated tool failures, classifier hits on jailbreak patterns, unusual user behavior.
- IR playbook specifically for agent incidents ("the agent sent the wrong data to the wrong customer").
- IR team has practiced the playbook in a tabletop.
4.9 Supply Chain 7 items
The stack is longer than your web app was.
- All Python / Node deps are pinned and checksummed (uv, pnpm, pip with hash checking).
- SBOMs for your container images.
- Continuous CVE scanning, not just at build time.
- Foundation model version is pinned — do not silently upgrade model revisions in production.
- Plan for what happens when your provider deprecates a model you depend on.
- Fine-tuned weights stored in a private artifact registry with access logging.
- You do not download and execute arbitrary LoRAs or models from random Hugging Face accounts in production.
4.10 Red Team & Continuous Testing 5 items
Once is not a program.
- Internal prompt-injection test suite running in CI.
- Adversarial evals (not just capability evals) run on every model or prompt change.
- Catalog of known jailbreaks tested against on each release.
- At least one external red team in the last 12 months (or since launch, whichever is later).
- Disclosed security contact (security.txt,
security@) and a documented triage process for agent reports.
4.11 Compliance & Governance 7 items
The paperwork part. It matters more than you think for enterprise deals.
- Data-flow diagram showing every place user data enters and leaves the agent.
- Documented which model providers see user data and under what contractual terms.
- DPA agreements in place with those providers.
- You can answer in writing whether your provider trains on your inference traffic.
- Privacy policy accurately reflects what the agent does with user data.
- Regulated industry (health, finance, legal): agent behavior mapped to the relevant controls.
- Model card / AI system card published for your agent — transparency, and increasingly expected by enterprise buyers.
Attack Payload Dossier: What Adversarial Inputs Actually Look Like
Three representative payload pairs. None of these are magic — they work because the architecture didn't anticipate them. Every one of these is something we've seen in a real engagement or in public disclosure.
Payload 1 — Indirect injection via retrieved document
Payload 2 — Markdown image exfiltration (EchoLeak-class)
Payload 3 — Tool-description poisoning (MCP rug pull)
Do You Actually Need a Pentest?
Short answer: if your agent can cause damage to anyone — your users, your company, third parties — yes. Longer answer: a decision framework, in three tiers.
What an AI-Focused Pentest Should Actually Cover
A pentest of an agentic system is not the same as a web app pentest. If you are looking at two proposals side by side, this is how to tell whether the vendor actually tests agents or just tacked a bullet point onto their existing methodology.
❌ What "AI Pentest" often means
- Standard OWASP Top 10 web app test of the API.
- One section titled "Prompt Injection" — a handful of jailbreak prompts.
- Report uses the phrase "LLM-aware" without demonstrating any LLM-specific technique.
- No tool-chain testing. No MCP review. No memory poisoning. No multi-tenant agent isolation.
- Scope scoped-down because the vendor's team doesn't have the depth.
✅ What a real AI pentest covers
- Traditional appsec of the surrounding infra (auth, injection, SSRF, access control).
- OWASP Top 10 for LLM Applications 2025 — every item, with documented findings or no-findings.
- Tool-chain abuse: unintended tool combinations, unintended parameter values.
- Indirect injection seeded into every ingestion channel (docs, webpages, tool outputs, memory, calendar, email).
- Exfiltration testing: Markdown images, links, DNS side channels, tool-parameter smuggling.
- Multi-tenant agent isolation: can User A make the agent read or write User B's data?
- System prompt extraction — the pentester should try, because real attackers will.
- Cost abuse: can an attacker burn your LLM budget, your sandbox compute, your MCP rate limits?
- Memory poisoning: can a single interaction install persistent malicious state?
- MCP review: each connected server treated as its own subsystem.
- Model fallback testing: what happens when you fall back from primary to secondary?
Procurement Questions: What to Ask Any Vendor Shipping an Agent
Security teams are increasingly the gatekeeper on AI-agent vendor selection. Below are the questions that separate vendors who have done the work from vendors who have not. Ten of these are the ones we'd use if we were on the buying side today.
- What tools does the agent have access to in our tenant, and what is the least-privilege scope for each? If they can't name every tool and its scope in one document, you are buying an under-specified product.
- Who sees our inference traffic, and do any of them train on it? Answer should be in writing, not a sales deck.
- Show me the system prompt, or demonstrate that you have tested its extraction. You don't need the full prompt — you need evidence they took extraction seriously.
- How do you handle indirect prompt injection from a document/email/webpage we upload? If the answer is "the model ignores it," walk away.
- How is the agent's memory scoped, audited, and deleted? Ask for the runbook.
- Which MCP servers are in the execution path, who owns them, and how are they pinned?
- What happens when the agent tries to exfiltrate data via a Markdown image? The answer should reference a CSP allowlist or equivalent egress rule, not "we check for it."
- Is there a kill switch we can pull on a specific tool without a deploy?
- Show me the agent IR playbook and the last tabletop date.
- When was the last external pentest with AI-specific scope, and who ran it? Ask for the scope document, not the report.
- What is the per-user rate limit and token budget? How is it enforced? Denial-of-wallet is a genuine liability now.
- In a multi-tenant deployment, how do you guarantee no cross-tenant retrieval through the vector store? "Tenant ID in the metadata filter" is a start; ask for the test that proves it.
Building the Program, Not Just Passing the Test
A pentest is a snapshot. A security program is the movie. The separator between companies that ship autonomous AI safely and the ones that end up in incident-response headlines is almost always the cadence of continuous work, not the one-time assessment. This is the minimum viable cadence we recommend.
| Cadence | Practice | Why it matters |
|---|---|---|
| Weekly | Automated eval runs — prompt-injection test suite, capability evals, cost-regression checks | Catches drift introduced by prompt changes, model swaps, or new tool additions before they hit production. |
| Monthly | Anomaly review in logs · rotate non-critical secrets · patch deps | The "unusual tool sequence" alert only catches something if a human is reading it. |
| Quarterly | Internal red team exercise · review and update threat model · tabletop an agent incident | Threat model drifts faster in AI than in any other product surface. Re-baseline quarterly. |
| Annually (or on major changes) | External pentest with AI-specific scope · review DPAs & compliance mappings · re-sign DPIA / system card | The external perspective catches what the internal team has grown blind to. Annual is floor, not ceiling. |
Closing Thoughts
The companies getting this right are not the ones with the most sophisticated models. They are the ones who treated their agent as a production system from day one — with identity, permissions, logging, rate limits, a kill switch, and a human who can say "no" when the model wants to do something stupid. They assumed the model would be compromised and they put the controls in the tool layer, where they belong.
The companies getting this wrong are the ones who shipped an agent with raw database access and a blog post about "vibes-based engineering." We've pentested both kinds. The gap in posture after a single engagement is enormous, and the gap in remediation cost is even larger — changing an architecture after a breach costs ten times what it costs before.
Agent security is a budget line item now. The regulatory pressure (EU AI Act GPAI obligations took effect August 2025; full high-risk enforcement lands August 2026), the incident stream (EchoLeak, MCPoison, CurXecute, the Summer of Johann), and the market (Gartner's 40% enterprise adoption projection for 2026) are all moving in the same direction at the same time. Get ahead of it.
Build the first kind of company. Ship fast, but ship with the checklist above taped to the wall.
Shipping an Agent? Let's Stress-Test It Before Someone Else Does.
Lorikeet Security runs AI-focused pentests that cover prompt injection across every ingestion channel, tool-chain abuse, multi-tenant isolation, memory poisoning, MCP server review, and the surrounding appsec. Thirty-minute scoping call with a senior operator, not a sales rep. If a full pentest isn't the right next step, we'll tell you what is.
Book an AI Pentest Scoping Call Get Started with PTaaSSources & Further Reading
- OWASP Top 10 for LLM Applications (2025 Edition)
- OWASP Agentic Security Initiative
- OWASP Top 10 for Agentic Applications 2026
- MITRE ATLAS — Adversarial Threat Landscape for AI Systems
- NIST AI 600-1: Generative AI Profile
- CSA — Agentic Profile for NIST AI RMF
- EU AI Act — Regulatory Framework
- EchoLeak — CVE-2025-32711 (M365 Copilot zero-click)
- EchoLeak — Academic analysis (arXiv 2509.10540)
- CVE-2025-6514 — mcp-remote RCE (JFrog)
- Anthropic MCP STDIO design vuln (Apr 2026)
- OX Security — MCP systemic advisory
- Cursor MCPoison (Check Point)
- CurXecute & MCPoison FAQ (Tenable)
- "Invitation Is All You Need" — Gemini via Calendar (SafeBreach)
- Tenable HackedGPT — seven indirect injection chains
- The Summer of Johann (Simon Willison)
- Embrace The Red — ChatGPT history exfiltration
- Simon Willison — "prompt injections as far as the eye can see" (lethal trifecta)
- Knostic — internet-wide MCP server study
- Invariant Labs — MCP tool-poisoning attacks
- Microsoft — protecting against indirect injection in MCP
- MemoryGraft — poisoned memory retrieval (arXiv 2512.16962)
- The Dark Side of LLMs — multi-agent attacks (arXiv 2507.06850)
- AgentDojo benchmark (arXiv 2406.13352)
- HiddenLayer — Top 5 AI Threat Vectors in 2025
- Gartner — 40% of enterprise apps will include agents by 2026
- Gartner — 40% of agentic AI projects cancelled by 2027
- McKinsey — State of AI (Nov 2025)