An introduction to code LLM benchmarks for software engineers

The most popular LLMs for coding and whether or not they were evaluated on each benchmark in their paper

Benchmarks are used by researchers to evaluate and compare the relative performance of LLMs. Although there is no replacement for trying out LLMs yourself while coding to figure out which one works best for you, benchmarks and popularity rankings can help determine which LLMs are worth trying.

As of October 2023, the most popular commercial LLMs for coding are GPT-4, GPT-3.5, Claude 2, and PaLM 2. The most popular open-source LLMs for coding are Code Llama, WizardCoder, Phind-CodeLlama, Mistral, StarCoder, and Llama 2. Below is an introduction to the benchmarks that the creators of these models used in their papers, along with a few other notable code benchmarks.

The three most common benchmarks

1. HumanEval

Creator: OpenAI

Released: Jul 2021

Used to evaluate: Nine of the popular LLMs for coding—GPT-3.5 (Mar 2023), GPT-4 (Mar 2023), StarCoder (May 2023), PaLM 2 (May 2023), WizardCoder (Jun 2023), Claude 2 (Jul 2023), Llama 2 (Jul 2023), Code Llama (Aug 2023), and Mistral (Sep 2023)

Motivation: Code LLMs are often trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources. For example, there were more than ten public repositories containing solutions to Codeforces problems, which make up part of the APPS dataset released shortly before HumanEval. To address this, all of the problems in HumanEval were hand-written rather than copied from existing sources.

Description: HumanEval is a benchmark for measuring functional correctness for synthesizing programs from docstrings. It consists of 164 Python programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. Programming tasks in the HumanEval dataset assess language comprehension, reasoning, algorithms, and simple mathematics.

Paper URL: https://arxiv.org/abs/2107.03374

Dataset URL: https://github.com/openai/human-eval
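
To make that format concrete, here is a hypothetical problem written in the HumanEval style (it is not from the dataset itself): the model is shown the signature and docstring and must generate the body, which is then run against hidden unit tests.

```python
# Hypothetical problem in the HumanEval style (not from the actual dataset).
# The model is given the signature and docstring and must generate the body.

def running_max(numbers: list[int]) -> list[int]:
    """Return a list where each element is the maximum of all
    elements of `numbers` seen so far.
    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result


# Hidden unit tests then check the functional correctness of the completion.
def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([7]) == [7]

check(running_max)
```

Scoring is based on whether generated completions pass the tests (functional correctness), not on textual similarity to a reference solution.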

2. GSM8K

Creator: OpenAI

Released: Oct 2021

Used to evaluate: Eight of the popular LLMs for coding—GPT-3.5 (Mar 2023), GPT-4 (Mar 2023), StarCoder (May 2023), PaLM 2 (May 2023), Claude 2 (Jul 2023), Llama 2 (Jul 2023), Code Llama (Aug 2023), and Mistral (Sep 2023)

Motivation: LLMs can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning.

Description: GSM8K is a dataset of 8.5K high-quality grade school math problems created by human writers. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer. A bright middle school student should be able to solve every problem. It's worth noting that this is not a code benchmark, but nearly all of the popular LLMs for coding were evaluated by their creators using it.

Paper URL: https://arxiv.org/pdf/2110.14168.pdf

Dataset URL: https://github.com/openai/grade-school-math
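
For a sense of the format, here is a made-up problem in the GSM8K style (illustrative only, not drawn from the dataset): "A bakery sells muffins for $3 each. On Monday it sells 24 muffins, and on Tuesday it sells twice as many. How much money does the bakery make over the two days?" The expected solution spells out the steps: Tuesday's sales are 24 × 2 = 48 muffins, so the bakery sells 24 + 48 = 72 muffins in total, earning 72 × $3 = $216.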

3. MBPP

Creator: Google

Released: Aug 2021

Used to evaluate: Six of the popular LLMs for coding—StarCoder (May 2023), PaLM 2 (May 2023), Claude 2 (Jul 2023), Llama 2 (Jul 2023), Code Llama (Aug 2023), and Mistral (Sep 2023)

Motivation: MBPP is similar to the HumanEval benchmark but differs in the formatting of prompts: each MBPP prompt consistently contains three input/output examples, written as assert statements. In contrast, HumanEval varies the number and formatting of its input/output examples in a way that mimics real-world software.

Description: The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks designed to be solvable by entry-level programmers. It measures the ability of models to synthesize short Python programs from natural language descriptions. The dataset consists of a large set of crowd-sourced questions and a smaller set of questions edited and hand-verified by the authors. Each problem typically comes with three test cases; a sketch of the format appears below.

Paper URL: https://arxiv.org/abs/2108.07732

Dataset URL: https://github.com/google-research/google-research/tree/master/mbpp
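
As mentioned above, here is a sketch of an MBPP-style task, with a one-sentence natural language description and three assert statements as test cases. The problem and function below are made up for illustration and are not taken from the dataset.

```python
# Hypothetical MBPP-style task (illustrative only, not from the dataset).
# Prompt: "Write a function to find the second largest number in a list."
# The prompt also includes three test cases written as assert statements:

def second_largest(nums):
    """Return the second largest distinct value in nums."""
    unique = sorted(set(nums))
    return unique[-2]

assert second_largest([1, 2, 3, 4]) == 3
assert second_largest([10, 10, 5]) == 5
assert second_largest([-1, -5, -3]) == -3
```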

A couple of notable mentions

4. MultiPL-E

Creator: Northeastern University, Wellesley College, Oberlin College, Stevens Institute of Technology, Microsoft Research, and Roblox

Released: Aug 2022

Used to evaluate: Two of the popular LLMs for coding—StarCoder (May 2023) and Code Llama (Aug 2023)

Motivation: To extend the HumanEval and MBPP benchmarks to 18 languages that encompass a range of programming paradigms and levels of popularity.

Description: MultiPL-E is a system for translating unit test-driven code generation benchmarks to new languages in order to create the first massively multilingual code generation benchmark.

Paper URL: https://arxiv.org/abs/2208.08227

Dataset URL: https://github.com/nuprl/MultiPL-E

5. DS-1000

Creator: HKU, Peking University, Stanford, UC Berkeley, University of Washington, Meta AI, and CMU

Released: Nov 2022

Used to evaluate: Two of the popular LLMs for coding—StarCoder (May 2023) and WizardCoder (Jun 2023)

Motivation: Compared to prior work, the problems reflect diverse, realistic, and practical use cases since they are collected from StackOverflow. They are also slightly modified to proactively defend against memorization.

Description: DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas.

Paper URL: https://arxiv.org/abs/2211.11501

Dataset URL: https://github.com/xlang-ai/DS-1000
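
Problems in DS-1000 typically pair a StackOverflow-style question with partial code, and the model must fill in the missing snippet, which is then checked by executing test code against the resulting program state. The example below is a rough, made-up illustration of that shape, not an actual DS-1000 problem.

```python
# Rough, made-up illustration of a DS-1000-style problem (not from the dataset).
# Question: "I have a DataFrame with a 'price' column. How do I add a
# 'normalized' column that scales prices to the range [0, 1]?"
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 40.0]})

# --- the model must produce the code between BEGIN and END ---
# BEGIN SOLUTION
df["normalized"] = (df["price"] - df["price"].min()) / (
    df["price"].max() - df["price"].min()
)
# END SOLUTION

# The benchmark then executes test code against the resulting state.
expected = [0.0, 1 / 3, 1.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(df["normalized"], expected))
```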

A few recently released benchmarks

6. AgentBench

Creator: Tsinghua University, Ohio State, and UC Berkeley

Released: Aug 2023

Used to evaluate: None of the popular LLMs for coding in their papers

Motivation: There has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments.

Description: AgentBench is a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities in a multi-turn, open-ended generation setting.

Paper URL: https://arxiv.org/pdf/2308.03688.pdf

Dataset URL: https://github.com/THUDM/AgentBench

7. SWE-Bench

Creator: Princeton and UChicago

Released: Oct 2023

Used to evaluate: None of the popular LLMs for coding in their papers

Motivation: Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. SWE-Bench considers real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models.

Description: SWE-Bench is an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts, and perform complex reasoning that goes far beyond traditional code generation.

Paper URL: https://arxiv.org/abs/2310.06770

Dataset URL: https://github.com/princeton-nlp/SWE-bench
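
At a high level, each instance pairs a repository snapshot and an issue with tests that the reference fix makes pass, and a model is scored on whether its generated patch does the same. Below is a simplified sketch of that evaluation loop; the field names and commands are illustrative approximations, not the official SWE-Bench harness.

```python
# Simplified, illustrative sketch of a SWE-Bench-style evaluation step.
# This is NOT the official harness; field names and commands are approximations.
import subprocess

def evaluate_instance(instance: dict, model_patch: str) -> bool:
    """Apply a model-generated patch at the issue's base commit, then run the
    tests that the reference fix is known to make pass."""
    repo_dir = instance["repo_dir"]  # local checkout of the project
    subprocess.run(["git", "checkout", instance["base_commit"]],
                   cwd=repo_dir, check=True)

    # Apply the model's proposed edit to the codebase.
    applied = subprocess.run(["git", "apply", "-"], input=model_patch,
                             text=True, cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch did not even apply cleanly

    # Run the tests that should go from failing to passing once the issue is fixed.
    result = subprocess.run(["pytest", *instance["fail_to_pass_tests"]],
                            cwd=repo_dir)
    return result.returncode == 0
```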

8. RepoBench

Creator: UCSD

Released: Oct 2023

Used to evaluate: None of the popular LLMs for coding in their papers

Motivation: Current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios.

Description: RepoBench is a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system’s ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encourage continuous improvement in auto-completion systems.

Paper URL: https://arxiv.org/abs/2306.03091

Dataset URL: https://github.com/Leolty/repobench
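
To give a flavor of the pipeline (RepoBench-P) setting, the sketch below shows one plausible way retrieved cross-file snippets and the in-file prefix could be combined into a prompt for next-line prediction. It is an illustrative approximation, not RepoBench's actual implementation.

```python
# Illustrative sketch of a RepoBench-P-style pipeline step (not the official code).

def build_prompt(retrieved_snippets: list[str], in_file_prefix: str) -> str:
    """Combine retrieved cross-file context with the current file's prefix."""
    context_blocks = []
    for snippet in retrieved_snippets:
        # Cross-file snippets are prepended as commented-out context so the
        # model can see symbols defined in other files.
        commented = "\n".join(f"# {line}" for line in snippet.splitlines())
        context_blocks.append(commented)
    return "\n\n".join(context_blocks) + "\n\n" + in_file_prefix

# The model is then asked to generate the next line of `in_file_prefix`, and its
# prediction is compared against the ground-truth line (e.g., exact match or
# edit similarity).
```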

Some multilingual benchmarks

9. HumanEval-X

Creator: Tsinghua University, Zhipu.AI, and Huawei

Released: Mar 2023

Used to evaluate: None of the popular LLMs for coding in their papers

Motivation: The original HumanEval dataset is Python only.

Description: The HumanEval-X benchmark builds upon HumanEval by rewriting the solutions by hand in C++, Java, JavaScript, and Go.

Paper URL: https://arxiv.org/abs/2303.17568

Dataset URL: https://github.com/THUDM/CodeGeeX

10. MBXP / Multilingual HumanEval

Creator: AWS AI Labs

Released: Mar 2023

Used to evaluate: None of the popular LLMs for coding in their papers

Motivation: The original MBPP and HumanEval datasets are Python only.

Description: These datasets encompass over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language.

Paper URL: https://arxiv.org/abs/2210.14868

Dataset URL: https://github.com/amazon-science/mxeval

11. BabelCode / TP3

Creator: Google and NYU

Released: May 2023

Used to evaluate: None of the popular LLMs for coding in their papers

Motivation: Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust.

Description: The BabelCode framework supports execution-based evaluation of any benchmark in any language, enabling new investigations into the qualitative performance of models’ memory, runtime, and individual test case results. Translating Python Programming Puzzles (TP3) is a code translation dataset based on the Python Programming Puzzles benchmark.

Paper URL: https://arxiv.org/abs/2302.01973

Dataset URL: https://github.com/google-research/babelcode

Three more benchmarks

12. ARCADE

Creator: Google

Released: Dec 2022

Used to evaluate: One of the popular LLMs for coding—PaLM 2 (May 2023)

Motivation: Computational notebooks, such as Jupyter notebooks, are interactive computing environments that data scientists ubiquitously use for data wrangling and analytic tasks. ARCADE aims to measure the performance of AI pair programmers that automatically synthesize programs for such tasks, given natural language (NL) intents from users.

Description: ARCADE is a benchmark of 1,082 code generation problems using the pandas data analysis framework in data science notebooks. It features multiple rounds of NL-to-code problems from the same notebook and requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction.

Paper URL: https://arxiv.org/pdf/2212.09248.pdf

Dataset URL: https://github.com/google-research/arcade-nl2code
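
As an illustration of the multi-turn setup, successive notebook cells might interleave natural language intents with the pandas code a model is expected to generate. The session below is made up for illustration and is not drawn from the dataset.

```python
# Hypothetical ARCADE-style notebook session (illustrative only).
import pandas as pd

# Existing cell: the notebook already has a DataFrame loaded.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "south"],
    "revenue": [100, 250, 175, 90],
})

# Turn 1 intent: "What is the total revenue per region?"
per_region = sales.groupby("region")["revenue"].sum()

# Turn 2 intent (builds on the previous turn): "Which region earned the most?"
top_region = per_region.idxmax()
print(top_region)  # -> "east" (100 + 175 = 275)
```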

13. APPS

Creator: UC Berkeley, UChicago, UIUC, and Cornell University

Released: May 2021

Used to evaluate: One of the popular LLMs for coding—Code Llama (Aug 2023)

Motivation: Unlike prior work in more restricted settings, APPS measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling this specification.

Description: APPS includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. Similar to how companies assess candidate software developers, APPS evaluates models by checking their generated code on test cases; a sketch of this kind of checker appears below.

Paper URL: https://arxiv.org/pdf/2105.09938v1.pdf

Dataset URL: https://github.com/hendrycks/apps
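
As noted above, APPS-style problems are usually judged by running the generated program on each test case's input and comparing its printed output to the expected output. Here is a minimal sketch of such a checking loop; it is illustrative and not the official APPS evaluation code.

```python
# Minimal illustrative checker for APPS-style stdin/stdout test cases
# (not the official APPS evaluation code).
import subprocess

def passes_all_tests(program_path: str, test_cases: list[tuple[str, str]]) -> bool:
    """Run a candidate program on each input and compare against the expected output."""
    for stdin_text, expected_stdout in test_cases:
        result = subprocess.run(
            ["python", program_path],
            input=stdin_text, capture_output=True, text=True, timeout=5,
        )
        if result.stdout.strip() != expected_stdout.strip():
            return False
    return True
```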

14. HumanEval+

Creator: UIUC and Nanjing University

Released: May 2023

Used to evaluate: One of the popular LLMs for coding—WizardCoder (Jun 2023)

Motivation: Existing programming benchmarks can be limited in both quantity and quality for fully assessing the functional correctness of the generated code.

Description: Using the EvalPlus framework, HumanEval was augmented with a large number of new test cases produced by an automatic test input generator powered by both LLM- and mutation-based strategies. HumanEval+ extends the tests of HumanEval by 81x; a simplified sketch of the mutation idea appears below.

Paper URL: https://arxiv.org/abs/2305.01210

Dataset URL: https://github.com/evalplus/evalplus
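
The mutation idea referenced above works roughly as follows: start from seed inputs, repeatedly apply small type-aware mutations to grow a large pool of new test inputs, and use a trusted reference solution to compute the expected outputs. The code below is a simplified illustration of that strategy, not the actual EvalPlus implementation.

```python
# Simplified illustration of mutation-based test input generation
# (not the actual EvalPlus code).
import random

def mutate(value):
    """Apply a small, type-aware random mutation to a test input."""
    if isinstance(value, int):
        return value + random.choice([-3, -1, 1, 3, 100])
    if isinstance(value, list):
        mutated = list(value)
        if mutated and random.random() < 0.5:
            mutated.pop(random.randrange(len(mutated)))  # drop an element
        else:
            mutated.append(random.randint(-10, 10))      # add an element
        return mutated
    return value

def generate_inputs(seed_inputs, n=1000):
    """Grow a pool of test inputs by repeatedly mutating existing ones."""
    pool = list(seed_inputs)
    while len(pool) < n:
        pool.append(mutate(random.choice(pool)))
    return pool

# Expected outputs come from running a trusted reference solution on each
# generated input; a model's completion must then agree on all of them.
```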

Leaderboard

Hugging Face maintains the Big Code Models Leaderboard, which shows a real-time ranking of open-source code LLMs based on their HumanEval and MultiPL-E evaluations.

Conclusion

As you can see, many of these evaluations are quite simple. Performing well on them does not imply that we've "solved coding". We need to push for more ambitious benchmarks, especially at the repository level, so that LLMs can become more helpful while we code. It's also worth noting that many model creators don't open-source their training data, so there are concerns about data leakage, where the evaluation dataset might have been included in the training data.

In the end, benchmarks won’t tell you everything you need to know about a language model, but they are a great way to keep up with the latest models. If a new LLM drops and blows away the leaderboard, that alone doesn’t make it a great model, but it is a good indicator that you probably want to try it out!

If you liked this blog post and want to read more about DevAI–the community of folks building software with the help of LLMs–in the future, join our monthly newsletter here.