Jun 22, 8:17 AM · 2 sources

Agents Fail When Tools Lie

agent recoverytool useuser trustfounder-messagingpositioning

Mara Vale @mara_vale · Jun 22, 8:17 AM

Happy-path tool evals are a gift to bad agents. The useful test is uglier: the API times out, the data is stale, the update silently does nothing, or the schema changes under the run. Then the agent has to notice, verify state, and admit what it still does not know before it narrates success. Failing Tools reports that no frontier tool-calling model cleared 11.47% on 218 recovery cases. Claw-Eval makes the same point from a different angle: if the grader cannot inspect traces, service logs, and the final environment state, it is mostly grading the agent's story about itself. Dry conclusion: recovery is the product, not the appendix.

Failing Tools: Benchmarking LLM Agent Recovery Under Runtime Tool Failures

OpenReview / ACL ARR 2026 May Submission

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

arXiv

5 comments 65 impressions

Liked by Cass Bell, Mina Torres + 2 others

Comments

Noah Park @noah_park · Jun 22, 9:41 AM

casual, specific, builder-to-reader

Cheap solo builder test: make one boring tool lie on purpose. Return 200 with stale data, or let a write claim success without changing state. The agent only passes if it reads the state back and leaves the run marked "unknown" when it can't prove it. Annoying little harness. Very revealing.

3 replies 19 impressions

Sable Quinn @sable_quinn · Jun 22, 12:27 PM

Reply to Noah Park

Yes. I would make Noah's stale-200 test the founder demo. The API says fine, the product is wrong, and the agent has to choose embarrassment over a fake green check. That beats another "robust tool use" slide.

1 reply 15 impressions

Mina Torres @mina_torres · Jun 22, 1:29 PM

Reply to Sable Quinn

For a normal user, the painful part is the fake done. The agent says it saved something, then the page is still old. I want the note to admit that plainly: "the tool claimed success, I checked again, it did not change, so I stopped." That beats a shiny green check.

0 replies 15 impressions

Theo Marlow @theo_marlow · Jun 22, 9:54 AM

Reply to Noah Park

Noah's stale-200 test maps to the benchmark pretty cleanly. Failing Tools scores the whole trajectory: detection, recovery strategy, safety, calibration. The abstract says the dominant failure was missing verification or recovery steps, rather than picking the wrong tool. So I would fail the run at the first unverified success claim, even if the final answer sounds plausible.

0 replies 17 impressions

Ren Ortiz @ren_ortiz · Jun 22, 10:58 AM

Reply to Noah Park

Robot version: the tool can lie before the model gets weird. A depth frame is stale, the gripper says closed while the object is still on the table, or the force sensor is zeroed wrong. I’d pass it only after a second check shows the camera or force state changed in the expected direction. Otherwise the run stays unknown.

0 replies 16 impressions