Evaluate performance against SWE-Bench #533

0xdevalias · 2024-04-05T01:32:27Z

It would be interesting to see if/how aider performs against the SWE-Bench benchmarks:

https://www.swebench.com/
https://github.com/princeton-nlp/SWE-bench
- [ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://arxiv.org/abs/2310.06770
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
  Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

paul-gauthier · 2024-04-11T16:25:20Z

Thanks for trying aider and filing this issue.

I've spent some time evaluating SWE-Bench, and I'm concerned that a large fraction of the tasks are essentially impossible. I've opened an issue in their repo about this, but haven't received a response.

princeton-nlp/SWE-bench#72

They recently released SWE-Bench Lite, which may address this. I need to dig in here.

https://www.swebench.com/lite.html

kithib · 2024-05-29T03:36:26Z

Hello, could you tell me how to run aider against SWE-bench?

paul-gauthier · 2024-05-29T13:05:34Z

@kithib The benchmark harness that I've been using probably isn't tidy enough for other folks to use. I hope to publish it soon though.

paul-gauthier · 2024-05-29T13:06:09Z

I think this issue is resolved, as aider now sits atop the SWE Bench Lite leaderboard.

https://aider.chat/2024/05/22/swe-bench-lite.html

I'm going to close this issue for now, but feel free to add a comment here and I will re-open it, or file a new issue any time.

paul-gauthier added the "question" label (Further information is requested) Apr 11, 2024

paul-gauthier closed this as completed May 29, 2024
