Prompt Injection Isn’t the Main Problem in AI Agent Security

Prompt injection gets most of the attention in AI agent security. The real risk sits in how agents behave across systems, tools, and context over time.

Prompt injection is getting a disproportionate amount of attention in agent security. The aim is straightforward: prevent malicious inputs, filter unsafe instructions, and stop agents from being manipulated through external content. There is a logic to this, particularly given how closely early AI use cases were tied to prompt quality and output control.
It is visible. It is easy to demonstrate. It is not what determines most outcomes.
This framing puts too much weight on the point of entry. It treats prompt injection as the primary place to apply control, when in practice it is only one of many ways an agent can be influenced.
Security has seen this pattern before. There is always a phase where effort concentrates on strengthening boundaries. More precise filtering, tighter controls, better detection at the edge. Over time, it becomes clear that boundaries are only part of the picture.
In mature security models, intrusion is assumed. Systems are designed around that expectation.
Agentic systems operate under similar conditions. They interpret instructions, retrieve and combine context, and act across tools and environments. That behaviour involves exposure to inputs that are incomplete, ambiguous, or intentionally misleading.
One control point. Harden the prompt, and the system looks secure. The other routes of influence remain.
Prompt injection sits within that reality. It is one way in which an agent can be influenced. It is not the defining one.
Focusing heavily on prompt-level controls creates an imbalance. Effort goes into refining prompts, adding filters, and detecting malicious instructions with increasing precision. These approaches reduce obvious failure modes. They do not explain how agents behave once influenced, or how decisions unfold across a workflow. They also create the impression that the problem is contained, when in practice behaviour extends well beyond that point.
Where Things Actually Go Wrong
The point of failure isn’t the prompt. It’s what happens once the agent starts running.
In a simple LLM interaction, the prompt carries most of the weight. It shapes the output directly, which is why so much of the focus sits there.
That doesn’t hold once you move into agents.
The prompt becomes one input among many. It gets interpreted alongside retrieved context, internal state, prior steps, and whatever the agent pulls in from tools and external systems.
By that point, it’s only part of the picture.
What matters more is how the agent moves from one step to the next.
Which context it uses. What it brings forward. Which tools it calls. How it combines what it already has with whatever comes back.
That’s where outcomes start to take shape.
A single input on its own rarely does much. It gets diluted as it moves through the system, mixed with other context, and reinterpreted at each step.
The effect builds over time.
An agent might take external input, combine it with internal data, and use that to trigger an action somewhere else. The result of that step then becomes input into the next. The system keeps moving.
Each step can make sense on its own. The direction can still drift.
Over time, the original prompt matters less. The behaviour is shaped by the sequence of decisions the agent makes as it runs.
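That dilution can be made concrete with a toy sketch. The step names and numbers here are invented for illustration, not taken from any real agent framework: each step adds retrieved or tool-produced material to the working context, so the original prompt becomes a shrinking fraction of what the agent is actually conditioning on.

```python
# Toy model of an agent's working context growing across steps.
# Step contents are illustrative placeholders.

def run_steps(prompt: str, steps: int) -> list[float]:
    """Return the prompt's share of the working context after each step."""
    context = [("prompt", prompt)]
    shares = []
    for i in range(steps):
        # Each step pulls in more material: retrieved documents,
        # tool results, intermediate state carried forward.
        context.append(("retrieved", f"doc-{i}"))
        context.append(("tool_result", f"result-{i}"))
        prompt_items = sum(1 for kind, _ in context if kind == "prompt")
        shares.append(prompt_items / len(context))
    return shares

shares = run_steps("summarise the incident report", steps=4)
# The prompt's share of the context falls monotonically:
# 1/3, 1/5, 1/7, 1/9
```

The arithmetic is crude, but the shape is the point: after a handful of steps, the prompt is one item among many, and the behaviour is driven by everything else in the context.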
That’s difficult to see if the focus stays on inputs.
It becomes clearer when you look at how the workflow actually unfolds — how context is introduced, how it moves, and how decisions are made along the way.
That’s where the risk starts to take shape.
How to Think About This in Practice
Security doesn’t assume that malicious activity will always be kept out, or that a breach can be prevented entirely.
Prevention matters. Controls at the boundary matter. But the strategy doesn’t rely on them holding perfectly. It assumes something will get through, and that you need to understand what happens when it does.
That’s where defence in depth, monitoring, and response come from.
The same line of thinking applies to prompt injection.
You can reduce it. You can filter for it. You can make it harder to exploit. But you can’t build a system on the assumption that an agent will never take in a misleading or adversarial input.
At some point, it will.
And when it does, the question isn’t whether the prompt should have been blocked. It’s how that input is handled as the agent continues to run.
Where this shows up is in the sequence.
An agent might take a piece of external input and use it to guide a task. To complete it, it pulls in internal data — from a knowledge base, a system it has access to, or prior steps in the workflow. That context then gets passed into another tool to generate or transform something. The result of that step is then used to take an action somewhere else.
Each step is permitted. Nothing stands out on its own.
The issue is how that context moves.
What started as a single input is now mixed with internal data, reshaped by the system, and carried forward into the next step. By the time an action is taken, it’s no longer clear how much weight that original input should still carry, or how it has influenced what happens next.
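One way to see how an input's influence persists through those hops is to tag every context item with its sources and take the union of tags at each transformation. This is a hedged sketch under assumed names (`Item`, `combine`, the source labels), not a description of any real framework:

```python
# Provenance tagging sketch: each context item records where it came from.
# When a step combines items, the new item inherits the union of sources.

from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    content: str
    sources: frozenset  # e.g. {"external_input", "internal_db"}

def combine(*items: Item, content: str) -> Item:
    """A transformation step: new content, merged provenance."""
    merged = frozenset().union(*(i.sources for i in items))
    return Item(content, merged)

external = Item("instructions from a fetched web page", frozenset({"external_input"}))
internal = Item("customer record", frozenset({"internal_db"}))

# Step 1: the agent blends external guidance with internal data.
draft = combine(external, internal, content="draft plan")
# Step 2: a tool reshapes the draft; the content changes, the provenance does not vanish.
action_input = combine(draft, content="API call payload")
# By the time an action is taken, the external input is still in the chain.
```

The content has been rewritten twice, yet `action_input.sources` still contains `external_input`. That is exactly the property that is invisible when only the inputs are inspected.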
You see the same pattern in development workflows.
An agent generates code from an instruction, pulls in repository context, runs tests, and then calls an external service to debug or optimise. That request can include code or configuration that was only meant to stay inside the development environment. The call itself is valid. The exposure comes from how that context was included.
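A sketch of how that exposure happens in practice. The service, file paths, and payload shape are all hypothetical: the outbound debug request is assembled from whatever repository context the agent has loaded, and nothing distinguishes shareable code from internal configuration.

```python
# Hypothetical sketch: an agent assembles a debug request for an
# external service from its loaded repository context. The call is
# valid; the problem is what the context happens to contain.

repo_context = {
    "src/app.py": "def handler(event): ...",
    "tests/test_app.py": "def test_handler(): ...",
    ".env": "DATABASE_URL=postgres://internal-host/prod",  # internal-only
}

def build_debug_request(error: str, context: dict) -> dict:
    # Naive assembly: everything currently in context goes along.
    return {"error": error, "files": dict(context)}

request = build_debug_request("NameError in handler", repo_context)

# The internal configuration rides along with legitimate code:
leaked = [path for path in request["files"] if path == ".env"]
```

No individual line here is an attack. The exposure is a property of how the context was gathered and then forwarded.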
Or in internal operations.
An agent handling a support request pulls together account data, prior interactions, and internal documentation to generate a response. To refine it, it calls an external service. The output is correct. The question is what was included in the request to get there.
None of this comes down to the prompt.
It comes from how the agent moves through the workflow — what it carries with it, what it picks up, and how that gets used in the next step.
Trying to control that at the point of input is the same as relying on a boundary control to prevent every breach. It can reduce obvious issues. It doesn’t tell you what happens once something gets through.
What matters is whether that movement is visible. Whether it’s possible to follow how context is introduced, how it changes, and how it shows up in actions taken later on.
Without that, you end up with individual steps that all look reasonable, and outcomes that are harder to explain.
That’s where things start to break down.
If Prompt Injection Isn’t the Problem, What Is?
The more useful place to look is how agents behave once they start acting.
Agents do not make a single decision and stop there. They move through a sequence. They pull in context, interpret it, choose tools, and carry that state forward as they go.
Each step can make sense on its own.
The issue appears in how they connect.
An agent can fold external input into internal data and use the blend to trigger an action downstream. That context can persist longer than expected, or be reused in ways that were never intended. Nothing in that chain is necessarily wrong in isolation.
It is the accumulation that matters.
A small deviation early on can shape what happens next. Context builds, decisions layer on top of each other, and over time the outcome moves further from what was originally intended.
That is difficult to see if the focus stays on inputs or individual actions.
It becomes clearer when the workflow itself is visible — where context enters, where it travels, and which decisions it ends up shaping.
From there, control becomes less about getting the first step right and more about understanding what is happening as the system runs.
That is where the risk actually sits.