The study had mixed results: none of the tools reached a 50% success rate, even with the help of Debug Gym. Anthropic’s Claude 3.7 Sonnet was the best performer, managing to successfully ...