LLMs are helpful with Python, but what about all of the other programming languages?

Recently, many folks have been claiming that their LLM is the best at coding. These claims are typically based on self-reported evaluations on the HumanEval benchmark. But when you look into that benchmark, you realize that it only consists of 164 Python programming problems. This led me down a rabbit hole of trying to figure out how helpful LLMs are with different languages. In this post, I try to estimate this for the 38 most used programming, scripting, and markup languages according to the 2023 Stack Overflow Developer Survey.

Approach

This was not an easy question to attempt to answer. I tried to get a rough picture of the world by collecting and reviewing multilingual benchmark results, public dataset compositions, available GitHub and Stack Overflow data, and developer anecdotes. You can find all of that data here.

Multilingual benchmark results

I looked at the MultiPL-E, BabelCode / TP3, MBXP / Multilingual HumanEval, and HumanEval-X multilingual benchmarks. These benchmarks are translations of the Python problems in the HumanEval and MBPP benchmarks into different languages, so they might not be particularly representative of the real-world situations in which developers try to use LLMs. Plus, each benchmark covers its own subset of programming languages and evaluates a different set of LLMs. Together, this made it quite difficult to answer this question from benchmark evaluations alone.
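
To make the translation idea concrete, here is a minimal sketch of what a HumanEval-style problem looks like and how a completion gets scored. The prompt, completion, and tests below are simplified illustrations I made up, not actual benchmark items.

```python
# A HumanEval-style problem: the model sees a function signature and
# docstring (the prompt) and must generate the body (the completion).
prompt = '''
def add_positive(numbers):
    """Return the sum of the strictly positive numbers in the list."""
'''

# A hypothetical model completion for the prompt above.
completion = '''
    return sum(n for n in numbers if n > 0)
'''

# Hidden unit tests decide whether the completion passes.
tests = '''
assert add_positive([1, -2, 3]) == 4
assert add_positive([-1, -2]) == 0
'''

# Benchmarks like MultiPL-E translate the prompt and tests into other
# languages (JavaScript, Java, Lua, ...) and score completions the same way.
exec(prompt + completion + tests, {})  # raises AssertionError if a test fails
print("passed")
```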

Public dataset compositions

I reviewed the compositions of The Stack, CodeParrot, AlphaCode, CodeGen, and PolyCoder datasets. I was trying to see how much each language is represented in the datasets used to train models. However, many LLM creators do not share what data they use. Thus, it's unknown how representative these public datasets are of the more common private datasets, but I expect there to be some similarities. The quality of the data also might vary by language, even when there are similar amounts of data (e.g. some languages are more verbose than others).
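
For a rough sense of how one might poke at these compositions, the sketch below samples a few files per language from The Stack on Hugging Face and compares average file sizes. It assumes the dataset's documented per-language data_dir layout and its content field; the language list and sample size are arbitrary, and a sample this small says nothing reliable about the full corpus.

```python
# Sample a handful of files per language from The Stack and compare sizes.
# Assumes bigcode/the-stack exposes one directory per language under data/
# (e.g. data/python) and a "content" field, per its dataset card.
from datasets import load_dataset

LANGUAGES = ["python", "javascript", "lua", "fortran"]  # illustrative subset
SAMPLE = 200  # files per language; far too small for real conclusions

for lang in LANGUAGES:
    ds = load_dataset(
        "bigcode/the-stack",
        data_dir=f"data/{lang}",
        split="train",
        streaming=True,  # avoid downloading terabytes up front
    )
    sizes = [len(example["content"]) for _, example in zip(range(SAMPLE), ds)]
    print(f"{lang:>12}: {sum(sizes) / len(sizes):,.0f} avg bytes per sampled file")
```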

Available GitHub and Stack Overflow data

I downloaded data from GitHut 2.0, which attempts to analyze how languages are used on GitHub by aggregating events and interactions with hosted repositories. I also grabbed the number of questions that were tagged with each language on Stack Overflow. Since so many training datasets are not public, I use this to try to better estimate how much data is available for each language relative to others. But this data is only a rough proxy, since GitHub only shows the number of pull requests, issues, stars, and pushes, while many Stack Overflow questions might not be tagged with their programming language.
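
For the Stack Overflow side, tag counts can be pulled directly from the public Stack Exchange API. The sketch below queries the tags/info endpoint for a few illustrative language tags; the numbers it returns today will differ from the snapshot I collected, and tag names do not always map one-to-one to languages.

```python
# Fetch how many Stack Overflow questions carry each language tag, using the
# public Stack Exchange API (no key needed for light, rate-limited use).
import requests

TAGS = ["python", "javascript", "php", "haskell", "lua"]  # illustrative
url = f"https://api.stackexchange.com/2.3/tags/{';'.join(TAGS)}/info"
resp = requests.get(url, params={"site": "stackoverflow"}, timeout=30)
resp.raise_for_status()

for item in sorted(resp.json()["items"], key=lambda t: -t["count"]):
    print(f"{item['name']:>12}: {item['count']:,} tagged questions")
```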

Developer anecdotes

I read many perspectives on using LLMs while coding in the subreddits dedicated to each language on Reddit. Although these comments were difficult to use when comparing languages, I grabbed a few from each subreddit that stuck out to me and included them in the data as well.

How to make LLMs more helpful for more languages

Before we take a look at each of the languages, it’s important to note that we are not stuck. It’s possible to make LLMs more helpful for more languages. In fact, I see three primary opportunities:

Better benchmarks

Likely the most significant action we can take to make LLMs more helpful for more languages is to build benchmarks that include more languages and evaluate them on a wider range of tasks. Once we do that, we will have a much better sense of how helpful LLMs are with different languages. This will make it clear where LLM performance is lacking, not only for particular languages but also for particular situations where developers want to use LLMs.
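
Whatever languages and tasks a broader benchmark covers, it still needs a scoring rule. The HumanEval-style benchmarks above report pass@k, the probability that at least one of k sampled completions passes the unit tests. Below is a small sketch of the standard unbiased estimator from the HumanEval paper; the sample counts in the example are made up.

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n sampled completions
# for a problem, of which c passed the unit tests, estimate the probability
# that at least one of k samples passes.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures for a k-sized sample to be all failures
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example with made-up counts: 200 samples per problem, 37 of which passed.
print(round(pass_at_k(200, 37, 1), 3))   # 0.185, i.e. simply c / n
print(round(pass_at_k(200, 37, 10), 3))  # higher, since any of 10 samples may pass
```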

Better datasets

The other significant action we could take to make LLMs more helpful for more languages is to collect datasets that include more languages used in many different, representative situations. These models learn from their training data: model behavior is heavily determined by the dataset used to train it. Therefore, if we want LLMs to be more helpful with more languages, it is critical that we create better datasets for them.
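
As one concrete piece of that work, code datasets are usually built by crawling repositories and then filtering out files that make poor training data. The sketch below applies a couple of simple heuristics of the kind described for public datasets such as The Stack and the Codex training set; the thresholds are illustrative, and real pipelines add near-duplicate detection, license filtering, and per-language tuning.

```python
# Illustrative filters of the kind used when curating code training data.
# Thresholds are made up for this sketch.
import hashlib

def keep_file(content: str, seen_hashes: set) -> bool:
    lines = content.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > 1000:  # likely minified or generated
        return False
    if sum(len(line) for line in lines) / len(lines) > 100:  # long average lines
        return False
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:  # exact duplicate of a file we already kept
        return False
    seen_hashes.add(digest)
    return True

seen = set()
files = ["print('hello')\n", "print('hello')\n", "x" * 5000]
print([keep_file(f, seen) for f in files])  # [True, False, False]
```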

Better models

Datasets and benchmarks are some of the biggest drivers of model performance. If we can push for better benchmarks and datasets for more languages, then that will set us up to push for models that are more helpful for more languages. It also might be possible to train or fine-tune LLMs to be specialized for a particular language. Better benchmarks and datasets will be critical for this too.

Tiers

After looking at the data, I grouped all of the languages into four tiers, where the first tier includes languages that LLMs are likely the most helpful with and the fourth tier includes languages that LLMs are likely the least helpful with. That said, how helpful LLMs are for a particular language highly depends on the situation, the developer, the LLM, the context, and many other factors.

First tier

C++

C++ has one of the largest presences on GitHub and Stack Overflow. This shows up in its representation in public LLM datasets, where it is one of the languages with the most data. Its performance is near the top of the MultiPL-E, BabelCode / TP3, MBXP / Multilingual HumanEval, and HumanEval-X benchmarks. However, given that C++ is often used when code performance and exact algorithm implementation are very important, many developers don't believe that LLMs are as helpful for C++ as for some of the other languages in this tier.

Java

The performance of LLMs with Java is near the top of the BabelCode / TP3, MBXP / Multilingual HumanEval, MultiPL-E, and HumanEval-X benchmarks. There is more data about it than any other language in the public LLM datasets, and it has one of the largest presences on GitHub and Stack Overflow. That said, developers frequently do not consider LLMs to be as helpful with Java as with JavaScript, Python, and TypeScript.

JavaScript

JavaScript has one of the largest presences on GitHub and Stack Overflow. This shows up in its representation in public LLM datasets too, where it is one of the languages with the most data. As expected, its performance on the MBXP / Multilingual HumanEval, BabelCode / TP3, MultiPL-E, and HumanEval-X benchmarks is at the top as well. Many developers point to it as a language where LLMs are most helpful.

PHP

The performance of LLMs with PHP is near the top of the BabelCode / TP3, MultiPL-E, and MBXP / Multilingual HumanEval benchmarks. It's not included in the HumanEval-X benchmark. It also has one of the largest presences on GitHub and Stack Overflow, and it is one of the languages with the most data represented in public LLM datasets. Nevertheless, developers typically do not consider LLMs to be as helpful with PHP as with JavaScript, Python, and TypeScript.

Python

Python has one of the largest presences on GitHub and Stack Overflow. But surprisingly, it's not as well represented in public LLM datasets as some of the other languages in this tier. That said, it comes out at the top of all of the multilingual benchmarks. This is not surprising because all of these multilingual benchmarks include the hand-written Python problems from the HumanEval dataset, which nearly all LLM creators use to benchmark the code capabilities of their LLMs.

TypeScript

Compared to the other languages in the first tier, TypeScript's GitHub and Stack Overflow presence and representation in public LLM datasets are not nearly as large. However, it is one of the top languages on the MBXP / Multilingual HumanEval, BabelCode / TP3, and MultiPL-E benchmarks. It's not included in the HumanEval-X benchmark. Some hypothesize that its similarity to JavaScript enables LLMs to perform much better on it than you might expect.

Second tier

C

You could make the case that C should be included in the first tier. It is one of the most represented languages in public LLM datasets, has a very large presence on GitHub and Stack Overflow, and it's similar to C++. However, it's not included in any of the multilingual benchmarks. It is also often used when code performance and exact algorithm implementation are very important, so developers generally don't see LLMs as helpful with C.

CSS

CSS is a style sheet language that is used along with the markup language HTML. It is not included in the multilingual benchmarks. The current approach used to evaluate LLMs on other languages would likely not work for CSS. That said, it has a large presence on GitHub and Stack Overflow, and its representation in public LLM datasets is quite large too.

C#

LLMs do not perform as well with C# on the MultiPL-E, MBXP / Multilingual HumanEval, and BabelCode / TP3 benchmarks as they do with the languages in the first tier. It's also not included in the HumanEval-X benchmark. However, it has a large representation in public LLM datasets on top of a large presence on GitHub and Stack Overflow.

Go

The performance of LLMs with Go on the MBXP / Multilingual HumanEval, BabelCode / TP3, MultiPL-E, and HumanEval-X benchmarks is lower than for the first tier languages. It has a pretty large presence on GitHub and Stack Overflow to go along with a rather sizable representation in the public LLM datasets.

HTML

Similar to CSS, HTML is not included in any of the multilingual benchmarks, and the current approach used to evaluate LLMs on other languages would likely not work for HTML. This is because it is a markup language. But also like CSS, it has a large presence on GitHub and Stack Overflow, and its representation in public LLM datasets is quite large too.

Ruby

The performance of LLMs with Ruby on the MBXP / Multilingual HumanEval and MultiPL-E benchmarks is lower than for the first tier languages. It is not included in the BabelCode / TP3 and HumanEval-X benchmarks. Its representation in public LLM datasets is quite large, and it has a large presence on GitHub and Stack Overflow.

Third tier

Languages in the third tier generally have a smaller presence on GitHub and Stack Overflow than those in the first two tiers, and they tend to have a smaller representation in public LLM datasets as well.

Bash

LLMs perform near the bottom on Bash in the MultiPL-E benchmark. It is not included in the other three multilingual benchmarks.

Haskell

LLMs perform near the bottom on Haskell in the BabelCode / TP3 benchmark. It is not included in the other three multilingual benchmarks.

Julia

LLMs perform near the bottom on Julia in the BabelCode / TP3 and MultiPL-E benchmarks. It is not included in the other two multilingual benchmarks.

Lua

LLMs perform near the bottom on Lua in the BabelCode / TP3 and MultiPL-E benchmarks. It is not included in the other two multilingual benchmarks.

Perl

LLMs perform near the bottom on Perl in the MultiPL-E and MBXP / Multilingual HumanEval benchmarks. It is not included in the other two multilingual benchmarks.

PowerShell

PowerShell is not included in any of the multilingual benchmarks.

Rust

LLMs perform near the bottom on Rust in the BabelCode / TP3 and MultiPL-E benchmarks. It is not included in the other two multilingual benchmarks.

Scala

LLMs perform better on Scala in the MultiPL-E and MBXP / Multilingual HumanEval benchmarks than on most of the other languages in this tier. It is not included in the other two multilingual benchmarks.

SQL

SQL is not included in any of the multilingual benchmarks.

Swift

LLMs perform near the bottom on Swift in the MultiPL-E and MBXP / Multilingual HumanEval benchmarks. It is not included in the other two multilingual benchmarks.

VBA

VBA is not included in any of the multilingual benchmarks.

Fourth tier

The story for many of the languages in the fourth tier is quite similar. With a few exceptions noted below, these languages are not included in any of the four multilingual benchmarks. Delphi and Solidity don't appear to be included in any of the public LLM datasets, while the rest of the languages (other than Assembly) seem to have a very small but unspecified representation in The Stack and no representation in any of the other public LLM datasets. All of these languages have some of the smallest presences on GitHub and Stack Overflow out of all of the languages investigated.

Assembly

Assembly has a small representation in The Stack and CodeParrot datasets. However, Assembly is a general term for the human-readable representation of a processor's instruction set architecture (ISA), not a single language. There are many assembly languages and even different representations of the same ISA. This is a big reason why developers have said that LLMs are not particularly helpful when writing Assembly.

Clojure

Dart

LLMs are evaluated on Dart in the BabelCode / TP3 benchmark but performance is near the bottom.

Delphi

Elixir

Erlang

GDScript

Groovy

Kotlin

LLMs are evaluated on Kotlin in the MBXP / Multilingual HumanEval benchmark but performance is near the bottom.

Lisp

MATLAB

Objective-C

R

LLMs are evaluated on R in the MultiPL-E and BabelCode / TP3 benchmarks but performance is near the bottom.

Solidity

VB.NET

Conclusion

Many developers are still quite skeptical of using LLMs while coding, no matter what programming language they are using. Some even say they struggle to see how LLMs are useful at all. Nevertheless, there is a growing number of developers who have figured out how to benefit from LLMs and report using them often.

But even among these folks, there are many differing opinions about when, where, and how to use LLMs. For example, some say LLMs are great to use when learning a language, while others say to not touch them until you are an expert at a language. The truth is likely somewhere in between, since it really depends on how you best learn and how you use LLMs when learning.

We have a long way to go to make it easy for developers to leverage LLMs while coding, especially because LLMs confidently make things up in certain situations, and it's up to each developer to catch those suggestions and not follow them, or to avoid using LLMs at all in those instances. Those who have managed to reliably get past this challenge are quite excited by the promise of LLMs and expect them to get much better over time.