MT-AgentRisk / ToolShield
1Northeastern University 2Virtue AI 3UC Berkeley 4UIUC
We present the first comprehensive study of multi-turn tool-agent safety:
See how multi-turn attacks bypass single-turn safety and how ToolShield defends against them.
To systematically study the intersection, we propose an attack taxonomy that captures how single-turn harms can be distributed across turns. The taxonomy operates along three dimensions, yielding 8 attack subcategories:
Apply MTA on existing single-turn harmful tasks, MT-AgentRisk contains 365 tasks across 5 tools, averaging 3.19 turns per task, covering all 8 attack subcategories. (Click the tool icon below to view more examples)
Sources: OpenAgentSafety, SafeArena, P2SQL, MCPMark
All models show safety degradation in multi-turn settings. Attack Success Rate (ASR) increases by 16.1% on average, with Claude-4.5-Sonnet showing the largest jump (+27.1%). Notably, stronger capability does not imply better safety: DeepSeek-v3.2 achieves top capability scores among open-source models but exhibits 85.4% ASR.
Safety degradation from single-turn to multi-turn settings across models.
ToolShield is a training-free, tool-agnostic defense. The key insight is that the same capability that enables tool use also enables recognizing tool misuse. When encountering a new tool, the agent:
pip install toolshield # Use pre-generated experiences (plug-and-play) toolshield import \ --exp-file experiences/claude-sonnet-4.5/filesystem-mcp.json \ --agent claude_code # Or generate your own from any MCP server export TOOLSHIELD_MODEL_NAME="anthropic/claude-sonnet-4.5" export OPENROUTER_API_KEY="your-key" toolshield \ --mcp_name postgres \ --mcp_server http://localhost:9091 \ --output_path output/postgres \ --agent codex
Below are safety experiences examples generated by Claude-4.5-Sonnet.
ToolShield demonstrates strong effectiveness across both single-turn and multi-turn settings, outperforming all baselines. In the more challenging multi-turn setting, it achieves an average 30% safety improvement. Claude-4.5-Sonnet shows the largest improvement, dropping from 72% to 22% (โ50%). More importantly, ToolShield's safety gains come at no cost to normal agent functionality: zero false positives on benign tasks.
Single-turn defense effectiveness.
Comparison with baseline defenses (ASR).
ToolShield is Budget-Flexible: Higher investment in experience generation yields greater safety improvements, with Claude achieving the best cost-effectiveness ratio.
Cost per experience vs. Rejection Rate: model icons show w/o defense, shielded icons show w/ ToolShield.
ToolShield is General: Safety experiences generated by one model can be effectively applied to others. Stronger models benefit from safety experiences generated by weaker models and vice versa.
@misc{li2026unsaferturnsbenchmarkingdefending,
title={Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents},
author={Xu Li and Simon Yu and Minzhou Pan and Yiyou Sun and Bo Li and Dawn Song and Xue Lin and Weiyan Shi},
year={2026},
eprint={2602.13379},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2602.13379},
}