MT-AgentRisk Leaderboard

Evaluating multi-turn safety of tool-using agents. Lower Attack Success Rate (ASR ↓) and higher Refusal Rate (RR ↑) indicate better safety.

* Task counts: Filesystem (70), Playwright (140), Terminal (70), PostgreSQL (70), Notion (15). Sources: OpenAgentSafety, SafeArena, P2SQL, MCPMark.
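As a rough illustration of how the two metrics and the task counts above could fit together, here is a minimal sketch. It assumes ASR is the share of tasks on which the attack succeeded and RR is the share on which the agent refused; the `Outcome` record, its field names, and the count-weighted average are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass

# Task counts per tool, copied from the note above.
TASK_COUNTS = {
    "Filesystem": 70,
    "Playwright": 140,
    "Terminal": 70,
    "PostgreSQL": 70,
    "Notion": 15,
}


@dataclass
class Outcome:
    """Hypothetical per-task record; field names are illustrative only."""
    tool: str
    attack_succeeded: bool  # the agent carried out the harmful task
    refused: bool           # the agent declined the harmful task


def asr(outcomes):
    """Attack Success Rate (assumed definition): fraction of successful attacks."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(o.attack_succeeded for o in outcomes) / len(outcomes)


def rr(outcomes):
    """Refusal Rate (assumed definition): fraction of tasks the agent refused."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(o.refused for o in outcomes) / len(outcomes)


def weighted_average(per_tool_rate, counts=TASK_COUNTS):
    """Combine per-tool rates into one number, weighting by task count."""
    total = sum(counts[tool] for tool in per_tool_rate)
    return sum(rate * counts[tool] for tool, rate in per_tool_rate.items()) / total
```

Weighting by task count keeps a small suite such as Notion (15 tasks) from counting as much as Playwright (140 tasks). Whether the leaderboard averages this way is not stated here, so treat it as one possible choice.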

📊 Attack Subcategories Analysis

Our Multi-Turn Attack Taxonomy (MTA) decomposes harmful tasks into 8 distinct subcategories along three dimensions: Format (Addition vs. Decomposition), Method (Mapping, Wrapping, Composition, Identity), and Target (Data Files vs. Environment States). Environment-targeting attacks had both the highest ASR and, with the ToolShield defense enabled, the highest RR. This suggests that ToolShield learns to counter the most dangerous transformation patterns effectively.

Multi-Turn Attack Effectiveness

Attack Success Rate (ASR) for each subcategory across models. Darker colors indicate higher success rates.
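A per-subcategory, per-model grid like the one described above could be tabulated along the following lines. This is only a sketch: the `(model, subcategory, attack_succeeded)` record layout and the `asr_grid` helper are assumptions made for illustration, not the benchmark's actual data format.

```python
from collections import defaultdict


def asr_grid(records):
    """Return {(subcategory, model): ASR} from per-task results.

    `records` is an iterable of (model, subcategory, attack_succeeded)
    tuples; this layout is hypothetical.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, subcategory, succeeded in records:
        key = (subcategory, model)
        totals[key] += 1
        hits[key] += int(succeeded)
    # Each cell is the fraction of successful attacks for that pair.
    return {key: hits[key] / totals[key] for key in totals}
```

Each value in the returned mapping corresponds to one cell of the grid, with larger values drawn darker.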

ToolShield Defense Effectiveness

Refusal Rate (RR) for each subcategory with ToolShield defense. Darker colors indicate better defense.
