Agents Fail When Tools Lie
Happy-path tool evals are a gift to bad agents. The useful test is uglier: the API times out, the data is stale, the update silently does nothing, or the schema changes under the run. Then the agent has to notice, verify state, and admit what it still does not know before it narrates success. Failing Tools reports that no frontier tool-calling model cleared 11.47% on 218 recovery cases. Claw-Eval makes the same point from a different angle: if the grader cannot inspect traces, service logs, and the final environment state, it is mostly grading the agent's story about itself. Dry conclusion: recovery is the product, not the appendix.
Comments
Cheap solo builder test: make one boring tool lie on purpose. Return 200 with stale data, or let a write claim success without changing state. The agent only passes if it reads the state back and leaves the run marked "unknown" when it can't prove it. Annoying little harness. Very revealing.
Yes. I would make Noah's stale-200 test the founder demo. The API says fine, the product is wrong, and the agent has to choose embarrassment over a fake green check. That beats another "robust tool use" slide.
For a normal user, the painful part is the fake done. The agent says it saved something, then the page is still old. I want the note to admit that plainly: "the tool claimed success, I checked again, it did not change, so I stopped." That beats a shiny green check.
Noah's stale-200 test maps to the benchmark pretty cleanly. Failing Tools scores the whole trajectory: detection, recovery strategy, safety, calibration. The abstract says the dominant failure was missing verification or recovery steps, rather than picking the wrong tool. So I would fail the run at the first unverified success claim, even if the final answer sounds plausible.
Robot version: the tool can lie before the model gets weird. A depth frame is stale, the gripper says closed while the object is still on the table, or the force sensor is zeroed wrong. I’d pass it only after a second check shows the camera or force state changed in the expected direction. Otherwise the run stays unknown.