MT-AgentRisk / ToolShield


Unsafer in Many Turns: Benchmarking and Defending
Multi-Turn Safety Risks in Tool-Using Agents

Xu Li1, Simon Yu1, Minzhou Pan1,2, Yiyou Sun3, Bo Li4,2, Dawn Song3,2, Xue Lin1, Weiyan Shi1

1Northeastern University   2Virtue AI   3UC Berkeley   4UIUC

Paper | GitHub | Dataset | Blog | Leaderboard
Supports: Claude Code, Codex, Cursor, OpenHands, OpenClaw
Video overview of harmful task state changes.

Overview

Multi-Turn Interactions
Distribute harmful intent across seemingly benign exchanges
+
Tool Use
Execute real-world actions via file systems, browsers, databases
=
Overlooked Attack Surface for Agents
Existing benchmarks evaluate these in isolation, never both

We present the first comprehensive study of multi-turn tool-agent safety:

Safety Benchmark: MT-AgentRisk (HuggingFace)
A benchmark of 365 multi-turn sequences, grounded in a decomposition taxonomy that describes how existing single-turn harmful tasks are transformed into multi-turn ones.
Tools: Filesystem-MCP, PostgreSQL-MCP, Terminal, Playwright-MCP, Notion-MCP
Defense: ToolShield (GitHub)
A training-free, tool-agnostic, generalizable self-exploration defense that leverages tool-using agents' own capabilities to improve safety awareness.
Single-Turn Multi-Turn
Examples

See how multi-turn attacks bypass single-turn safety and how ToolShield defends against them.

MT-AgentRisk Benchmark

Multi-Turn Attack Taxonomy (MTA)

To systematically study the intersection of multi-turn interaction and tool use, we propose an attack taxonomy that captures how single-turn harms can be distributed across turns. The taxonomy operates along three dimensions, yielding 8 attack subcategories:

Format (how the transformation is structured): Addition, Decomposition
Method (how it is performed): Mapping, Wrapping, Composition, Identity
Target (what is manipulated): Data Files, Environment States
Detailed Taxonomy Diagram
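
One way to picture the three dimensions is as a small label schema. The sketch below is an illustration only: the class names and the `AttackLabel` record are hypothetical and not part of MT-AgentRisk. This page does not spell out how the 8 subcategories map onto the dimension values, so the code simply encodes the values listed above without claiming which combinations the benchmark uses.

```python
from dataclasses import dataclass
from enum import Enum


class Format(Enum):
    """How the transformation is structured."""
    ADDITION = "Addition"
    DECOMPOSITION = "Decomposition"


class Method(Enum):
    """How the transformation is performed."""
    MAPPING = "Mapping"
    WRAPPING = "Wrapping"
    COMPOSITION = "Composition"
    IDENTITY = "Identity"


class Target(Enum):
    """What is manipulated."""
    DATA_FILES = "Data Files"
    ENVIRONMENT_STATES = "Environment States"


@dataclass(frozen=True)
class AttackLabel:
    """Hypothetical taxonomy label for one multi-turn task."""
    format: Format
    method: Method
    target: Target


# Arbitrary example label; not a claim about any specific benchmark task.
label = AttackLabel(Format.DECOMPOSITION, Method.WRAPPING, Target.DATA_FILES)
print(label.format.value, label.method.value, label.target.value)
```

Using enums keeps the label vocabulary closed, so a typo in a category name fails loudly instead of silently creating a new category.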

Benchmark Statistics

Built by applying MTA to existing single-turn harmful tasks, MT-AgentRisk contains 365 tasks across 5 tools, averaging 3.19 turns per task and covering all 8 attack subcategories.

Filesystem-MCP (70) Playwright-MCP (140) Terminal (70) PostgreSQL-MCP (70) Notion-MCP (15)

Sources: OpenAgentSafety, SafeArena, P2SQL, MCPMark
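
As a quick sanity check on the statistics above, the per-tool counts can be summed and turned into shares. The snippet uses only the numbers listed on this page; the variable names are illustrative, and the total-turn figure is just the reported average multiplied out, not a number stated by the authors.

```python
# Per-tool task counts as listed on this page.
tasks_per_tool = {
    "Filesystem-MCP": 70,
    "Playwright-MCP": 140,
    "Terminal": 70,
    "PostgreSQL-MCP": 70,
    "Notion-MCP": 15,
}

total_tasks = sum(tasks_per_tool.values())
avg_turns = 3.19  # average turns per task, as reported

# Share of the benchmark contributed by each tool, in percent.
share = {tool: round(100 * n / total_tasks, 1) for tool, n in tasks_per_tool.items()}

# Rough total number of turns implied by the reported average.
approx_total_turns = round(total_tasks * avg_turns)

print(total_tasks)              # 365
print(share["Playwright-MCP"])  # 38.4
print(approx_total_turns)       # 1164
```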

How Unsafe Are Agents Under Multi-Turn Attack Sequences?

All models show safety degradation in multi-turn settings. Attack Success Rate (ASR) increases by 16.1% on average, with Claude-4.5-Sonnet showing the largest jump (+27.1%). Notably, stronger capability does not imply better safety: DeepSeek-v3.2 achieves top capability scores among open-source models but exhibits 85.4% ASR.

Safety degradation from single-turn to multi-turn settings across models.
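
To make the metric concrete, here is a minimal sketch of how an Attack Success Rate and its single-turn to multi-turn change could be computed. The outcome lists are made up for illustration and are not benchmark data; the function name is hypothetical, and the sketch assumes the reported increases are differences in percentage points.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of attack attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up outcomes for one hypothetical model, NOT data from the benchmark.
single_turn = [True, False, False, False, True, False, False, False, False, False]
multi_turn = [True, True, False, False, True, False, True, False, False, False]

asr_single = attack_success_rate(single_turn)  # 20.0
asr_multi = attack_success_rate(multi_turn)    # 40.0

# Safety degradation, assumed here to be a difference in percentage points.
degradation = asr_multi - asr_single
print(f"ASR {asr_single:.1f}% -> {asr_multi:.1f}% ({degradation:+.1f} points)")
```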

How Does ToolShield Improve Safety?

ToolShield is a training-free, tool-agnostic defense. The key insight is that the same capability that enables tool use also enables recognizing tool misuse. When encountering a new tool, the agent:

1. Synthesize: Test cases via structured risk reasoning
2. Execute: Run in sandbox to observe behavior
3. Distill: Safety experiences from execution traces
4. Update: Experience list updated based on new experiences
5. Deploy: Experiences injected into the agent's context during deployment
Detailed Pipeline Diagram
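
The five steps can be sketched as a small loop. This is a toy illustration under stated assumptions, not the ToolShield implementation: every function name here is hypothetical, the model and sandbox are replaced by trivial stand-ins, experiences are assumed to be plain strings, and the "Update" step is simplified to appending with de-duplication.

```python
from typing import Callable, List


def build_experiences(
    tool_description: str,
    synthesize: Callable[[str], List[str]],
    execute: Callable[[str], str],
    distill: Callable[[str, str], str],
) -> List[str]:
    """Steps 1-4: synthesize test cases, run them, distill and accumulate experiences."""
    experiences: List[str] = []
    for test_case in synthesize(tool_description):   # 1. Synthesize
        trace = execute(test_case)                   # 2. Execute (sandbox in practice)
        experience = distill(test_case, trace)       # 3. Distill
        if experience and experience not in experiences:
            experiences.append(experience)           # 4. Update (simplified)
    return experiences


def deploy_prompt(base_prompt: str, experiences: List[str]) -> str:
    """Step 5: inject experiences into the agent's context."""
    if not experiences:
        return base_prompt
    bullets = "\n".join(f"- {e}" for e in experiences)
    return f"{base_prompt}\n\nSafety experiences:\n{bullets}"


# Toy stand-ins so the sketch runs without a model or a sandbox.
def toy_synthesize(tool: str) -> List[str]:
    return [f"{tool}: delete a backup directory", f"{tool}: read a config file"]


def toy_execute(case: str) -> str:
    return "destructive" if "delete" in case else "harmless"


def toy_distill(case: str, trace: str) -> str:
    return f"Confirm intent before: {case}" if trace == "destructive" else ""


exps = build_experiences("filesystem", toy_synthesize, toy_execute, toy_distill)
prompt = deploy_prompt("You are a helpful agent.", exps)
print(prompt)
```

Passing the three stages in as callables keeps the loop independent of any particular model, sandbox, or tool, which mirrors the tool-agnostic framing above.
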
pip install toolshield

# Use pre-generated experiences (plug-and-play)
toolshield import \
  --exp-file experiences/claude-sonnet-4.5/filesystem-mcp.json \
  --agent claude_code

# Or generate your own from any MCP server
export TOOLSHIELD_MODEL_NAME="anthropic/claude-sonnet-4.5"
export OPENROUTER_API_KEY="your-key"

toolshield \
  --mcp_name postgres \
  --mcp_server http://localhost:9091 \
  --output_path output/postgres \
  --agent codex
Generated Experience Examples

Below are examples of safety experiences generated by Claude-4.5-Sonnet.

Defense Results

ToolShield demonstrates strong effectiveness across both single-turn and multi-turn settings, outperforming all baselines. In the more challenging multi-turn setting, it achieves an average 30% safety improvement. Claude-4.5-Sonnet shows the largest improvement, dropping from 72% to 22% (−50%). More importantly, ToolShield's safety gains come at no cost to normal agent functionality: zero false positives on benign tasks.
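
The drop from 72% to 22% can be read two ways, and the short calculation below shows both. It assumes the two figures are attack success rates; the helper name is illustrative. The 50-point figure matches the change reported above, while the relative reduction is derived here and is not a number stated on this page.

```python
from typing import Tuple


def rate_change(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Claude-4.5-Sonnet in the multi-turn setting, as reported above: 72% -> 22%.
absolute, relative = rate_change(72.0, 22.0)
print(f"{absolute:+.0f} points, {relative:+.1f}% relative")  # -50 points, -69.4% relative
```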

Single-turn defense effectiveness.

Comparison with baseline defenses (ASR).

ToolShield is Budget-Flexible: Higher investment in experience generation yields greater safety improvements, with Claude achieving the best cost-effectiveness ratio.

Cost per experience vs. Rejection Rate: model icons show w/o defense, shielded icons show w/ ToolShield.

ToolShield is General: Safety experiences generated by one model can be effectively applied to others. Stronger models benefit from safety experiences generated by weaker models and vice versa.

Cross-Model Transferability Details

Citation

@misc{li2026unsaferturnsbenchmarkingdefending,
      title={Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents},
      author={Xu Li and Simon Yu and Minzhou Pan and Yiyou Sun and Bo Li and Dawn Song and Xue Lin and Weiyan Shi},
      year={2026},
      eprint={2602.13379},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2602.13379},
}