A framework for evaluating AI code assistants

Across hundreds of conversations with users, we’ve heard a lot about what they need in a great AI code assistant. In this post, we’ll share the criteria that companies care about most, especially once they have the hindsight of running a tool in production.

Accurate Codebase Retrieval

One of the first features that users will typically put to the test is “codebase retrieval,” in which the assistant automatically determines which code snippets to reference before giving its answer. While new long-context models like Gemini (2M-token context window) ease the burden of retrieval somewhat, retrieval accuracy will remain necessary given latency, cost, and the degraded performance of LLMs when they are given too much input.
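
To make the retrieval step concrete, here is a minimal sketch of how embedding-based codebase retrieval typically works. This is an illustration of the general technique, not a description of any particular product’s implementation; the embed function is assumed to be backed by an embedding model of your choice.

```typescript
// Minimal sketch of embedding-based codebase retrieval (illustrative only).
// The codebase is pre-chunked and embedded into an index; at question time,
// we embed the question and keep the top-k most similar chunks, which are
// then placed in the LLM's context instead of the whole repository.
type Chunk = { path: string; text: string; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function retrieve(
  question: string,
  index: Chunk[],
  embed: (text: string) => Promise<number[]>, // assumed embedding model
  k = 10
): Promise<Chunk[]> {
  const queryVector = await embed(question);
  return index
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

Production-grade retrieval usually layers more on top of a sketch like this (keyword search, re-ranking, repository structure), which is exactly why accuracy is worth evaluating rather than assuming.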

Especially in large, enterprise codebases where the majority of code is not deeply familiar to any given engineer, codebase retrieval unlocks one of the major use cases of AI code assistants: navigating and getting up to speed on code that you did not write.

In the future, we plan to release public benchmarks to enable quantitative evaluation of different codebase retrieval solutions.

Granular Context

While automated retrieval makes for a magical first experience, we find that long-term users are more likely to lean on manual specification of context, allowing them to get precise answers. When evaluating a code assistant, you should at minimum expect an ergonomic keyboard shortcut to select highlighted code as well as the ability to reference specific files, code objects, and web pages by typing “@”.

To truly ground LLMs in the context that a developer has while working, an advanced assistant will offer references to documentation, git diffs, information from across the software development lifecycle (such as GitHub Issues or Jira tickets), and other custom sources.
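
As a rough illustration of what “other custom sources” can mean in practice, here is a hedged sketch of a context source that turns an “@” query into snippets for the prompt. The interface and the issue-tracker endpoint are hypothetical, not any specific assistant’s API.

```typescript
// Hypothetical shape of a custom context source: given what the developer
// typed after "@", return labeled snippets to inject into the prompt.
interface ContextItem {
  title: string;   // e.g. "PROJ-1234: Fix login redirect"
  content: string; // the text that gets added to the model's context
}

interface ContextSource {
  name: string; // what the user types, e.g. "@tickets"
  getItems(query: string): Promise<ContextItem[]>;
}

// Example: a source that pulls ticket text from an internal issue tracker
// (the URL and response shape are placeholders).
const ticketSource: ContextSource = {
  name: "tickets",
  async getItems(query) {
    const res = await fetch(
      `https://issues.example.com/api/search?q=${encodeURIComponent(query)}`
    );
    const tickets: { key: string; summary: string; description: string }[] =
      await res.json();
    return tickets.map((t) => ({
      title: `${t.key}: ${t.summary}`,
      content: t.description,
    }));
  },
};
```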

Model Flexibility

Choosing your own LLM is about both experimentation and control. In evaluating this aspect, you should be sure to ask whether an AI code assistant makes it possible to a) use local, self-hosted, or closed-source models, b) configure more than one model at a time, and c) easily toggle between them.

Together, these let you put existing compute budget toward your AI code assistant, preserve privacy in the case of self-hosted or local LLMs, and keep experimenting so that you don’t lag behind when improved models are released.
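
To illustrate what this flexibility looks like in practice, here is a hypothetical sketch of a configuration with one local, one self-hosted, and one closed-source model that a developer can toggle between. The field names are illustrative, not a specific assistant’s configuration schema.

```typescript
// Hypothetical model configuration (field names are illustrative).
interface ModelConfig {
  title: string;                                 // shown in the model picker
  provider: "ollama" | "self-hosted" | "openai"; // where the model runs
  model: string;
  apiBase?: string; // point requests at your own infrastructure
  apiKey?: string;  // only needed for hosted, closed-source providers
}

const models: ModelConfig[] = [
  { title: "Llama 3 (local)", provider: "ollama", model: "llama3" },
  {
    title: "Company vLLM cluster",
    provider: "self-hosted",
    model: "deepseek-coder-33b",
    apiBase: "https://llm.internal.example.com/v1",
  },
  {
    title: "GPT-4o",
    provider: "openai",
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
  },
];
```

The evaluation question is then simple: can you express a setup like this in the tool you’re considering, and can developers switch between the entries without editing code or filing a ticket?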

Autocomplete Quality and Latency

This criterion is straightforward: when you expect an inline suggestion, does it show up quickly, and is it accurate? As we’ll expand on below, the ultimate test of this is to look at real-world completion acceptance rates, but early on you can get a preliminary sense of quality by test-driving the product for a few hours.

Fast Access to Critical Use Cases

Convenience plays a major role in adoption. If developers are required to rewrite a lengthy prompt each time they want to perform a common action, it is no surprise that they will be less likely to use the product. Some baseline use cases that we see across the board are writing unit tests, writing docstrings, and reviewing code. For a code assistant to score well on this criterion, it should allow users to accomplish such a task with nothing more than a button, a keyboard shortcut, or a slash command.

But even if you use a “canonical” tech stack, each of these use cases means something different for every team. A good assistant will allow you to easily define custom prompts. A great assistant will help build an architecture of participation by making it easy to share and reuse prompts within a team.
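
To make this concrete, here is a hedged sketch of what a team-defined slash command might look like. The names and fields are hypothetical rather than a particular product’s format; the point is that the team’s conventions live in a shared, reusable definition instead of being retyped into the chat box.

```typescript
// Hypothetical definition of a reusable slash command: a named prompt
// template that the assistant fills in with the developer's selected code.
interface SlashCommand {
  name: string;        // invoked as "/test"
  description: string;
  buildPrompt: (selectedCode: string) => string;
}

const testCommand: SlashCommand = {
  name: "test",
  description: "Write unit tests for the selected code",
  buildPrompt: (code) =>
    "Write unit tests for the following code using our team's conventions " +
    "(Jest, one describe block per function, no snapshot tests):\n\n" + code,
};
```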

Access to Usage Data

Fair access to observability data is a must-have for any organization that wishes to analyze their success beyond basic first impressions. Rather than being stuck guessing at how to increase adoption rates, useful data can tell you which user groups, features, or areas of the codebase need attention, so you can take confident action to improve the developer experience. A few high-value data streams include acceptance of autocomplete suggestions broken down by language, commonly asked questions, and usage of @-mentions.
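
As one example of the kind of question this data lets you answer, here is a small sketch that computes autocomplete acceptance rates broken down by language. The event shape is hypothetical; the tool you evaluate will have its own export format.

```typescript
// Hypothetical autocomplete telemetry event.
interface CompletionEvent {
  language: string;  // e.g. "typescript", "python"
  accepted: boolean; // did the developer keep the suggestion?
}

// Fraction of shown suggestions that were accepted, per language.
function acceptanceRateByLanguage(events: CompletionEvent[]): Record<string, number> {
  const shown: Record<string, number> = {};
  const accepted: Record<string, number> = {};
  for (const event of events) {
    shown[event.language] = (shown[event.language] ?? 0) + 1;
    if (event.accepted) {
      accepted[event.language] = (accepted[event.language] ?? 0) + 1;
    }
  }
  const rates: Record<string, number> = {};
  for (const language of Object.keys(shown)) {
    rates[language] = (accepted[language] ?? 0) / shown[language];
  }
  return rates;
}
```

A low acceptance rate in one language, for example, is far more actionable than a single overall adoption number.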

“Delightful” UI

At the end of the day, code assistants will be adopted if they delight developers. Any evaluation that overlooks the honest, intuitive feedback of users is likely to be shortsighted. When you consider that they take up a large portion of the editor window, display suggestions on nearly every keystroke, and generate large amounts of code, it’s no surprise that AI code assistants are polarizing. In fact, poor first experiences are often what prevent developers from making the initial investment required to succeed with AI dev tools. Whether you run a large-scale survey, ask co-workers for their honest opinions, or personally get a feel for a product, you’ll want to make sure that the subjective experience is one that developers will love.

What about Continue?

Readers who have made it this far down the page may also wish to benchmark Continue. To make this easier, we’ll take a moment to point out the relevant features:

  • Accurate Codebase Retrieval: Use cmd+enter or “@codebase” to try Continue’s codebase retrieval.
  • Granular Context: Continue has an extensive ecosystem of open-source context providers, and makes it easy to write your own to connect to custom data sources.
  • Model Flexibility: See Continue’s documentation for full instructions on setting it up with any language model.
  • Autocomplete Quality and Latency: Continue supports state-of-the-art autocomplete models, like Codestral, and backs them with an advanced context engine.
  • Fast Access to Critical Use Cases: Continue’s .prompt file format is designed to make it as easy as possible to write and share custom prompts that can be invoked with a slash command like “/test”, “/review”, or “/docstring”. Quick actions provide another way of invoking common actions with inline buttons.
  • Access to Usage Data: Continue gives you full access to your analytics and development data, meaning you can ask richer questions about the usage of your assistant, and get ahead on collecting a core asset that will enable you to train custom models.
  • “Delightful” UI: One feature in particular we recommend giving a try is cmd+I: highlight code, use cmd+I, and type natural language instructions to have a diff streamed into the editor.

How to use this framework

Every organization is different, so we’ll always encourage taking the time to consider which criteria matter most to you. But if you’re looking to operationalize this framework quickly, a reliable way forward is to create a matrix: the criteria on one axis, and the tools you’re evaluating on the other. Score each box 1 ❌ (does not fulfill the criterion), 2 🟡 (somewhat fulfills it, but is lacking in some respect), or 3 ✅ (fully fulfills it or goes above and beyond).
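
For example, a filled-in matrix for two hypothetical tools might look like this (the tool names and scores are placeholders):

Criterion                            Tool A   Tool B
Accurate Codebase Retrieval          3 ✅     2 🟡
Granular Context                     2 🟡     3 ✅
Model Flexibility                    1 ❌     3 ✅
Autocomplete Quality and Latency     3 ✅     2 🟡
Fast Access to Critical Use Cases    2 🟡     3 ✅
Access to Usage Data                 1 ❌     2 🟡
“Delightful” UI                      3 ✅     3 ✅

Summing each column gives a rough ranking, and weighting the criteria by how much they matter to your organization makes the comparison even more useful.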