An introduction to code LLM benchmarks for software engineers
Benchmarks are used by researchers to evaluate and compare the relative performance of LLMs. Although there is no replacement for trying out LLMs yourself while coding to figure out which one works best for you, benchmarks and popularity rankings can help determine which LLMs are worth trying.
As of October 2023, the most popular commercial LLMs for coding are GPT-4, GPT-3.5, Claude 2, and PaLM 2. The most popular open-source LLMs for coding are Code Llama, WizardCoder, Phind-CodeLlama, Mistral, StarCoder, and Llama 2. Below we provide an introduction to the benchmarks that the creators of these models used in their papers, as well as some other code benchmarks.
The three most common benchmarks
1. HumanEval
Creator: OpenAI
Released: Jul 2021
Used to evaluate: Nine of the popular LLMs for coding—GPT-3.5 (Mar 2023), GPT-4 (Mar 2023), StarCoder (May 2023), PaLM 2 (May 2023), WizardCoder (Jun 2023), Claude 2 (Jul 2023), Llama 2 (Jul 2023), Code Llama (Aug 2023), and Mistral (Sep 2023)
Motivation: Code LLMs are often trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources. For example, there were more than ten public repositories containing solutions to Codeforces problems, which make up part of the APPS dataset that was released shortly before HumanEval. To reduce the risk of this kind of contamination, all of the problems in HumanEval were hand-written rather than copied from existing sources.
Description: HumanEval is a benchmark for measuring the functional correctness of programs synthesized from docstrings. It consists of 164 Python programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. The tasks assess language comprehension, reasoning, algorithms, and simple mathematics; a sketch of the task format and the pass@k metric used to score it follows this entry.
Paper URL: https://arxiv.org/abs/2107.03374
Dataset URL: https://github.com/openai/human-eval
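To make the format concrete, here is a minimal Python sketch: a toy problem in the HumanEval style (a signature and docstring the model must complete, plus hidden unit tests) alongside the unbiased pass@k estimator described in the HumanEval paper. The toy problem and the check harness are illustrative inventions, not entries from the dataset.

```python
import numpy as np

# Illustrative prompt in the HumanEval style (invented, not a dataset entry):
# the model sees the signature and docstring and must generate the body.
PROMPT = '''
def running_max(xs: list) -> list:
    """Return a list where element i is the maximum of xs[:i + 1].
    >>> running_max([1, 3, 2])
    [1, 3, 3]
    """
'''

def check(candidate):
    # Hypothetical hidden unit tests, in the spirit of HumanEval's
    # average of 7.7 tests per problem.
    assert candidate([1, 3, 2]) == [1, 3, 3]
    assert candidate([5]) == [5]
    assert candidate([2, 2, 1, 4]) == [2, 2, 2, 4]

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper: n samples were
    drawn for a problem and c of them passed every unit test."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 200 samples per problem, 37 of which pass all tests
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```

A problem counts as solved for pass@k if any of k sampled completions passes all of its unit tests; the estimator above computes that probability from a larger pool of n samples instead of literally drawing only k, which keeps the variance down.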
2. GSM8K
Creator: OpenAI
Released: Sep 2021
Used to evaluate: Eight of the popular LLMs for coding—GPT-3.5 (Mar 2023), GPT-4 (Mar 2023), StarCoder (May 2023), PaLM 2 (May 2023), Claude 2 (Jul 2023), Llama 2 (Jul 2023), Code Llama (Aug 2023), and Mistral (Sep 2023)
Motivation: LLMs can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning.
Description: GSM8K is a dataset of 8.5K high-quality grade school math problems created by human writers. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer. A bright middle school student should be able to solve every problem. It's worth noting that this is not a code benchmark, but nearly all of the popular LLMs for coding were evaluated by their creators on it; a sketch of the usual final-answer check follows this entry.
Paper URL: https://arxiv.org/pdf/2110.14168.pdf
Dataset URL: https://github.com/openai/grade-school-math
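Since there is no code to execute, scoring usually comes down to comparing final answers. In the released dataset, each reference solution ends with a line of the form "#### <number>", and a common (though not the only) convention is to extract the last number from the model's output and compare it against that reference. A minimal sketch under those assumptions:

```python
import re
from typing import Optional

def extract_final_answer(text: str) -> Optional[str]:
    """Pull the final numeric answer out of a solution string.

    GSM8K reference solutions end with a line like '#### 72'; for model
    outputs without that marker we fall back to the last number found.
    """
    marked = re.search(r"####\s*(-?[\d,.]+)", text)
    if marked:
        raw = marked.group(1)
    else:
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
        if not numbers:
            return None
        raw = numbers[-1]
    return raw.replace(",", "").rstrip(".")

def is_correct(model_output: str, reference_solution: str) -> bool:
    # A problem counts as solved only if the final answers match exactly.
    return extract_final_answer(model_output) == extract_final_answer(reference_solution)

print(is_correct("She needs 3 * 24 = 72 eggs, so the answer is 72.",
                 "3 boxes * 24 eggs = <<3*24=72>>72 eggs\n#### 72"))  # True
```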
3. MBPP
Creator: Google
Released: Aug 2021
Used to evaluate: Six of the popular LLMs for coding—StarCoder (May 2023), PaLM 2 (May 2023), Claude 2 (Jul 2023), Llama 2 (Jul 2023), Code Llama (Aug 2023), and Mistral (Sep 2023)
Motivation: MBPP is similar to the HumanEval benchmark but differs in the formatting of prompts: each MBPP prompt consistently contains three input/output examples, written as assert statements. In contrast, HumanEval varies the number and formatting of the input/output examples in a way that mimics real-world software.
Description: The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks designed to be solvable by entry-level programmers. It measures the ability of language models to synthesize short Python programs from natural language descriptions. The dataset consists of a large set of crowd-sourced questions and a smaller set of questions edited and hand-verified by the authors. Each problem typically comes with three test cases; a sketch of the task format follows this entry.
Paper URL: https://arxiv.org/abs/2108.07732
Dataset URL: https://github.com/google-research/google-research/tree/master/mbpp
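Here is an invented MBPP-style task to show the format: a one-sentence description plus exactly three asserts, which is also how a candidate completion gets scored (run the asserts and count the problem as solved if none fail). The task below is hypothetical, not copied from the dataset.

```python
# Hypothetical MBPP-style task (not an actual dataset entry). The model is
# prompted with the description and the three asserts and must generate a
# function that makes all of them pass.
DESCRIPTION = "Write a python function to count the even numbers in a list."
TEST_LIST = [
    "assert count_even([1, 2, 3, 4]) == 2",
    "assert count_even([1, 3, 5]) == 0",
    "assert count_even([]) == 0",
]

# A candidate completion the model might return.
def count_even(xs):
    return sum(1 for x in xs if x % 2 == 0)

# Scoring: execute the asserts against the candidate.
for test in TEST_LIST:
    exec(test)  # raises AssertionError if the candidate is wrong
print("all tests passed")
```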
A couple of notable mentions
4. MultiPL-E
Creator: Northeastern University, Wellesley College, Oberlin College, Stevens Institute of Technology, Microsoft Research, and Roblox
Released: Aug 2022
Used to evaluate: Two of the popular LLMs for coding—StarCoder (May 2023) and Code Llama (Aug 2023)
Motivation: To extend the HumanEval and MBPP benchmarks to 18 languages that encompass a range of programming paradigms and popularity
Description: MultiPL-E is a system for translating unit test-driven code generation benchmarks to new languages in order to create the first massively multilingual code generation benchmark.
Paper URL: https://arxiv.org/abs/2208.08227
Dataset URL: https://github.com/nuprl/MultiPL-E
5. DS-1000
Creator: HKU, Peking, Stanford, UC Berkeley, Washington, Meta AI, and CMU
Released: Nov 2022
Used to evaluate: Two of the popular LLMs for coding—StarCoder (May 2023) and WizardCoder (Jun 2023)
Motivation: Compared to prior work, the problems reflect diverse, realistic, and practical use cases, since they were collected from Stack Overflow. They are slightly modified to proactively defend against memorization.
Description: DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas; a sketch of the task format follows this entry.
Paper URL: https://arxiv.org/abs/2211.11501
Dataset URL: https://github.com/xlang-ai/DS-1000
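To give a feel for the format, here is a hypothetical DS-1000-style problem (pandas flavour): some context code, a Stack Overflow-like intent, a candidate completion, and a functional check on the resulting dataframe. It is a sketch of the style, not an actual dataset entry, and the real benchmark layers its perturbations and additional checks on top of this.

```python
import pandas as pd

# Context shown to the model (hypothetical problem, not from DS-1000):
df = pd.DataFrame({"name": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
# Intent, in the Stack Overflow style DS-1000 draws from:
# "For each name, keep only the row with the largest value."

# A candidate completion the model might produce:
result = df.loc[df.groupby("name")["value"].idxmax()].reset_index(drop=True)

# Functional check, in the spirit of DS-1000's test-based evaluation:
expected = pd.DataFrame({"name": ["a", "b", "c"], "value": [3, 2, 4]})
pd.testing.assert_frame_equal(
    result.sort_values("name").reset_index(drop=True), expected
)
print("solution accepted")
```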
A few recently released benchmarks
6. AgentBench
Creator: Tsinghua University, Ohio State, and UC Berkeley
Released: Aug 2023
Used to evaluate: None of the popular LLMs for coding in their papers
Motivation: There has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments.
Description: AgentBench is a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments to assess the reasoning and decision-making abilities of LLMs acting as agents in a multi-turn, open-ended generation setting.
Paper URL: https://arxiv.org/pdf/2308.03688.pdf
Dataset URL: https://github.com/THUDM/AgentBench
7. SWE-Bench
Creator: Princeton and UChicago
Released: Oct 2023
Used to evaluate: None of the popular LLMs for coding in their papers
Motivation: Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. SWE-Bench considers real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models.
Description: SWE-Bench is an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously. This calls for models to interact with execution environments, process extremely long contexts, and perform complex reasoning that goes far beyond traditional code generation; a sketch of the apply-and-test evaluation loop follows this entry.
Paper URL: https://arxiv.org/abs/2310.06770
Dataset URL: https://github.com/princeton-nlp/SWE-bench
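At its core, scoring a SWE-Bench instance means checking out the repository at the issue's base commit, applying the model-generated diff, and rerunning the tests that the real fix is known to make pass. Here is a heavily simplified sketch of that loop; the repository path, commit hash, and test command are placeholders, and the official harness adds per-repository environments and much more careful test selection.

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, model_patch: str,
                   test_command: list) -> bool:
    """Simplified SWE-Bench-style check: apply the model's diff to the repo
    at the issue's base commit and see whether the relevant tests now pass."""
    # Reset the working tree to the state the model was shown.
    subprocess.run(["git", "-C", repo_dir, "checkout", "--force", base_commit],
                   check=True)
    # Try to apply the generated diff; a patch that does not apply is a miss.
    applied = subprocess.run(["git", "-C", repo_dir, "apply"],
                             input=model_patch, text=True)
    if applied.returncode != 0:
        return False
    # Run the tests associated with the issue (placeholder command).
    tests = subprocess.run(test_command, cwd=repo_dir)
    return tests.returncode == 0

# Hypothetical usage; the path, commit, and test file are made up.
# resolved = evaluate_patch("/tmp/some-repo", "0a1b2c3", model_patch,
#                           ["pytest", "tests/test_issue.py"])
```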
8. RepoBench
Creator: UCSD
Released: Oct 2023
Used to evaluate: None of the popular LLMs for coding in their papers
Motivation: Current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios.
Description: RepoBench is a benchmark specifically designed for evaluating repository-level code auto-completion systems. It supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). These tasks respectively measure a system's ability to retrieve the most relevant code snippets from other files as cross-file context, to predict the next line of code with cross-file and in-file context, and to handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and to encourage continuous improvement in auto-completion systems; a naive retrieval baseline is sketched after this entry.
Paper URL: https://arxiv.org/abs/2306.03091
Dataset URL: https://github.com/Leolty/repobench
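As one illustration of what the retrieval task asks for, here is a simple lexical baseline sketch: rank candidate cross-file snippets by token overlap with the code right before the cursor. This is only one naive way to approach RepoBench-R-style retrieval (stronger systems use embedding-based retrievers), and all of the names below are made up.

```python
import re

def tokens(code: str) -> set:
    # Crude lexical tokenizer: identifiers and numbers only.
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))

def rank_snippets(in_file_context: str, candidates: list) -> list:
    """Rank cross-file snippets by Jaccard similarity with the code
    preceding the cursor (a naive lexical baseline)."""
    query = tokens(in_file_context)

    def jaccard(snippet: str) -> float:
        snippet_tokens = tokens(snippet)
        union = query | snippet_tokens
        return len(query & snippet_tokens) / len(union) if union else 0.0

    return sorted(range(len(candidates)),
                  key=lambda i: jaccard(candidates[i]), reverse=True)

# Hypothetical usage: choose which snippet to prepend as cross-file context.
context = "df = load_table(path)\nsummary = summarize_table(df,"
snippets = ["def summarize_table(df, by=None): ...",
            "class Cache:\n    def clear(self): ..."]
print(rank_snippets(context, snippets))  # [0, 1]
```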
Some multilingual benchmarks
9. HumanEval-X
Creator: Tsinghua University, Zhipu.AI, and Huawei
Released: Mar 2023
Used to evaluate: None of the popular LLMs for coding in their papers
Motivation: The original HumanEval dataset is Python only
Description: The HumanEval-X benchmark builds upon HumanEval by rewriting the solutions by hand in C++, Java, JavaScript, and Go.
Paper URL: https://arxiv.org/abs/2303.17568
Dataset URL: https://github.com/THUDM/CodeGeeX
10. MBXP / Multilingual HumanEval
Creator: AWS AI Labs
Released: Mar 2023
Used to evaluate: None of the popular LLMs for coding in their papers
Motivation: The original MBPP and HumanEval datasets are Python only
Description: These datasets encompass over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language.
Paper URL: https://arxiv.org/abs/2210.14868
Dataset URL: https://github.com/amazon-science/mxeval
11. BabelCode / TP3
Creator: Google and NYU
Released: May 2023
Used to evaluate: None of the popular LLMs for coding in their papers
Motivation: Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust.
Description: The BabelCode framework is for execution-based evaluation of any benchmark in any language, enabling new investigations into the qualitative performance of models’ memory, runtime, and individual test case results. Translating Python Programming Puzzles (TP3) is a code translation dataset based on the Python Programming Puzzles benchmark.
Paper URL: https://arxiv.org/abs/2302.01973
Dataset URL: https://github.com/google-research/babelcode
Three more benchmarks
12. ARCADE
Creator: Google
Released: Dec 2022
Used to evaluate: One of the popular LLMs for coding—PaLM 2 (May 2023)
Motivation: Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. ARCADE aims to measure the performance of AI pair programmers that automatically synthesize programs for such tasks, given natural language (NL) intents from users.
Description: ARCADE is a benchmark of 1,082 code generation problems using the pandas data analysis framework in data science notebooks. It features multiple rounds of NL-to-code problems from the same notebook and requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states, as well as previous turns of interaction.
Paper URL: https://arxiv.org/pdf/2212.09248.pdf
Dataset URL: https://github.com/google-research/arcade-nl2code
13. APPS
Creator: UC Berkeley, UChicago, UIUC, and Cornell University
Released: May 2021
Used to evaluate: One of the popular LLMs for coding—Code Llama (Aug 2023)
Motivation: Unlike prior work in more restricted settings, APPS measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling this specification.
Description: APPS includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. Similar to how companies assess candidate software developers, the benchmark evaluates models by checking their generated code against test cases.
Paper URL: https://arxiv.org/pdf/2105.09938v1.pdf
Dataset URL: https://github.com/hendrycks/apps
14. HumanEval+
Creator: UIUC and Nanjing University
Released: May 2023
Used to evaluate: One of the popular LLMs for coding—WizardCoder (Jun 2023)
Motivation: Existing programming benchmarks can be limited in both quantity and quality for fully assessing the functional correctness of the generated code.
Description: Using the EvalPlus framework, HumanEval was augmented with a large number of new test cases produced by an automatic test input generator powered by both LLM- and mutation-based strategies. HumanEval+ extends the tests of HumanEval by 81x; a sketch of the mutation idea follows this entry.
Paper URL: https://arxiv.org/abs/2305.01210
Dataset URL: https://github.com/evalplus/evalplus
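The core idea is easy to sketch: start from the seed inputs a problem already has, mutate them in a type-aware way, and label each new input with the output of the ground-truth solution, so that corner cases the original hand-written tests missed get covered. The sketch below is loosely in the spirit of EvalPlus rather than its actual implementation; the reference function and seed inputs are made up.

```python
import random

def mutate_value(value):
    """Perturb one input value based on its type (a rough, type-aware mutation)."""
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + random.choice([-7, -1, 1, 3, 1000])
    if isinstance(value, float):
        return value * random.choice([-1.0, 0.5, 2.0])
    if isinstance(value, str):
        return value + random.choice(["", " ", "a", "Z" * 5])
    if isinstance(value, list):
        mutated = [mutate_value(v) for v in value]
        random.shuffle(mutated)
        return mutated + mutated[:1]  # also grow the list by one element
    return value

def augment_tests(reference_fn, seed_inputs, n=100):
    """Create n extra (input, expected_output) pairs by mutating seed inputs
    and labelling them with the ground-truth solution."""
    new_tests = []
    for _ in range(n):
        args = tuple(mutate_value(a) for a in random.choice(seed_inputs))
        new_tests.append((args, reference_fn(*args)))
    return new_tests

# Hypothetical usage with a toy reference solution and one seed input.
reference = lambda xs: sorted(set(xs))
print(augment_tests(reference, [([3, 1, 2],)], n=3))
```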
Leaderboard
Hugging Face maintains the Big Code Models Leaderboard, which shows a real-time ranking of open-source code LLMs based on their HumanEval and MultiPL-E evaluations.
Conclusion
As you can see, many of these evaluations are quite simple. Performing well on them does not imply that we've "solved coding". We need to push for more ambitious repo-level benchmarks so that LLMs can become more helpful for real-world coding. It's also worth noting that many model creators don't release their training data, so there are concerns about data leakage, where the evaluation dataset might have been included in the training data.
In the end, benchmarks won't tell you everything you need to know about a language model, but they are a great way to keep up with the latest models. If a new LLM tops the leaderboards, that alone doesn't make it a great model, but it is a good indicator that you probably want to try it out!
If you liked this blog post and want to read more about DevAI, the community of folks building software with the help of LLMs, join our monthly newsletter here.