Prompt injection exploits work because models implicitly trust user supplied context. Hardened systems reduce the blast radius of untrusted text and limit downstream capability.
Threat model the orchestration graph
Map every surface where untrusted content can influence tool calls, retrieval context, or exec flows. Catalogue privileges for each tool.
Build layered mitigations
- Structured prompts: Replace free-form system messages with templated slots and strict delimiters.
- Context filters: Run untrusted text through regex/embedding classifiers to reject known bad patterns.
- Capability isolation: Split risky tools (code exec, browsing) behind additional human-in-the-loop review.
Measure and iterate
Use a red team corpus of injection patterns to regression-test guardrails. Track detection precision/recall to avoid alert fatigue.