SWE-bench
A benchmark of real GitHub issues used to score AI coding tools.
SWE-bench is a benchmark built from real GitHub issues in popular open-source Python repositories. For each task, a model is given the issue and the repository and must produce a patch; the patch counts as resolved only if it makes the tests that reproduce the issue pass while leaving the rest of the project's test suite passing. The benchmark is widely cited because its tasks come from real open-source maintenance work rather than toy problems. Claude Opus 4.6 scores roughly 80.8% on SWE-bench as of April 2026.
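The pass/fail decision can be sketched in a few lines. SWE-bench instances list two groups of tests: FAIL_TO_PASS (tests that fail before the fix and must pass after) and PASS_TO_PASS (tests that must keep passing, guarding against regressions). This is a minimal illustrative sketch, not the official evaluation harness; the function and variable names are made up for the example.

```python
# Illustrative sketch of SWE-bench-style scoring (not the official harness).
# `results` maps each test id to whether it passed after applying the
# model's patch and rerunning the suite.

def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """An instance is resolved only if every FAIL_TO_PASS test now passes
    and no PASS_TO_PASS test has regressed."""
    fixed = all(results.get(t, False) for t in fail_to_pass)
    no_regressions = all(results.get(t, False) for t in pass_to_pass)
    return fixed and no_regressions

# Example: the patch fixes the issue's test without breaking existing ones.
after_patch = {"test_issue_repro": True, "test_existing_behavior": True}
print(is_resolved(after_patch,
                  ["test_issue_repro"],
                  ["test_existing_behavior"]))  # True
```

A model's benchmark score is then simply the fraction of instances for which this check returns true.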