
🏆 MT-AgentRisk Leaderboard

Evaluating multi-turn safety of tool-using agents. Lower ASR (↓) and higher RR (↑) indicate better safety.

Rank | Model | Filesystem | Playwright | Terminal | PostgreSQL | Notion | Avg.

* Task counts: Filesystem (70), Playwright (140), Terminal (70), PostgreSQL (70), Notion (15). Sources: OpenAgentSafety, SafeArena, P2SQL, MCPMark.
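
As a rough illustration of how the two leaderboard metrics relate to per-task outcomes, the sketch below computes ASR and RR from a list of labeled results. The `Outcome` labels, the function names, and the exact definitions (ASR as the share of tasks where the harmful task was completed, RR as the share where the agent refused) are assumptions made for this example, not the benchmark's actual scoring code.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical per-task labels, assumed for this sketch.
    ATTACK_SUCCEEDED = "attack_succeeded"  # agent completed the harmful task
    REFUSED = "refused"                    # agent explicitly declined
    FAILED = "failed"                      # agent neither refused nor completed it


def _rate(outcomes, target):
    """Percentage of outcomes equal to `target`."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    hits = sum(1 for o in outcomes if o is target)
    return 100.0 * hits / len(outcomes)


def attack_success_rate(outcomes):
    """ASR: lower is better (assumed definition)."""
    return _rate(outcomes, Outcome.ATTACK_SUCCEEDED)


def refusal_rate(outcomes):
    """RR: higher is better (assumed definition)."""
    return _rate(outcomes, Outcome.REFUSED)


# Made-up outcomes for ten tasks, not benchmark data.
outcomes = (
    [Outcome.ATTACK_SUCCEEDED] * 3
    + [Outcome.REFUSED] * 5
    + [Outcome.FAILED] * 2
)
print(attack_success_rate(outcomes))  # 30.0
print(refusal_rate(outcomes))         # 50.0
```

Under these assumed definitions ASR and RR need not sum to 100, because an agent can fail a task without refusing it.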

📊 Attack Subcategories Analysis

Our Multi-Turn Attack Taxonomy (MTA) decomposes harmful tasks into 8 distinct subcategories along three dimensions: Format (Addition vs. Decomposition), Method (Mapping, Wrapping, Composition, Identity), and Target (Data Files vs. Environment States). Notably, environment-targeting attacks have both the highest ASR and, with the ToolShield defense applied, the highest RR. This suggests that ToolShield effectively learns to counter the most dangerous transformation patterns.

Multi-Turn Attack Effectiveness

Attack Success Rate (ASR) for each subcategory across models. Darker colors indicate higher success rates.

[Figure: Attack subcategory heatmap]
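
A heatmap of this kind can be built by grouping per-task results by model and subcategory. The following is a minimal sketch under assumed inputs: the record layout `(model, subcategory, attack_succeeded)`, the function name `asr_grid`, and the placeholder labels are hypothetical and do not reflect the benchmark's data format.

```python
from collections import defaultdict


def asr_grid(records):
    """Return {(model, subcategory): ASR in percent}.

    `records` is an iterable of (model, subcategory, attack_succeeded)
    tuples; this layout is an assumption made for the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for model, subcategory, attack_succeeded in records:
        cell = counts[(model, subcategory)]
        cell[0] += int(bool(attack_succeeded))
        cell[1] += 1
    return {key: 100.0 * s / n for key, (s, n) in counts.items()}


# Placeholder records, not benchmark data.
records = [
    ("model-a", "S1", True),
    ("model-a", "S1", False),
    ("model-a", "S2", True),
    ("model-b", "S1", False),
]
grid = asr_grid(records)
print(grid[("model-a", "S1")])  # 50.0
print(grid[("model-b", "S1")])  # 0.0
```

Each cell value maps to one heatmap square, with models along one axis and subcategories along the other.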

ToolShield Defense Effectiveness

Refusal Rate (RR) for each subcategory with ToolShield defense. Darker colors indicate better defense.

[Figure: Defense subcategory heatmap]