What are the accuracy limits of codebase retrieval?

To automatically determine the most important context from your codebase, we've just released v1 of Continue's codebase retrieval feature. You can ask questions like “where do I add a new endpoint to the server?”, “what scripts are used to build this project?”, or “where do I edit the user profile page?” To answer these, the following problem has to be solved: given a user query, like the above, find the N most useful tokens from throughout the codebase.

This is an incredibly open-ended search problem that invites a wide range of potential solutions. In the beginning, a simple test query will point out the obvious problems: entire files are too long to fit in context, some files are empty, and oftentimes the dozen imports at the top of a file aren't relevant. But the obvious solutions are soon exhausted, and it becomes necessary to compare strategies automatically and definitively. Time for a metric:

\[ F_1 = \frac{2\cdot P\cdot R}{P + R} \]

The F1 score is the harmonic mean of \(P\), precision (what percent of chosen files were relevant), and \(R\), recall (what percent of relevant files were chosen). Overall, it gives us an understanding of the accuracy of our retrieval system. As we optimize the retrieval pipeline, we can test the F1 score of each stage separately to learn which one is the limiting factor.
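
For a single query, the score can be computed directly from the set of retrieved files and the set of files a human judged relevant. A minimal sketch (the helper below is illustrative, not the exact code we use):

def compute_f1(retrieved: set[str], relevant: set[str]) -> float:
    """F1 score for one query, given retrieved and ground-truth file sets."""
    if not retrieved or not relevant:
        return 0.0
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved)   # % of chosen files that were relevant
    recall = true_positives / len(relevant)       # % of relevant files that were chosen
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)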

While it's early (we don't yet know the answer to the title of this blog post!) and we don't have a large enough benchmark to share useful numbers, we've made great qualitative progress, and there are still countless improvements to go! Here I'll give a high-level overview of the possibilities we see and what we've done so far (look for the bold, indicating where we are now). In the future, I'll follow up on the routes that we experiment with and share their concrete effects on the F1 score.

Evaluation

As mentioned above, the most important part of building a good retrieval pipeline is being able to measure its quality. Here are a few ways we can and will eventually test our system.

Naive: Try a few inputs and check what gets included

Basic: Encode a few example questions and the files you would expect to be included for each. Calculate the F1 score based on this (sketched below).

Better: Do this same thing, but make sure to select examples from diverse codebases and questions

Future: Generate or find a large dataset and run expensive evaluations
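
For instance, the Basic approach above could look roughly like this, assuming a hypothetical retrieve() function that returns chunk objects with a file_path attribute and reusing compute_f1 from the sketch above (the example query and file names are made up):

# Hypothetical benchmark: each example pairs a question with the files we
# would expect a good retrieval pipeline to return.
examples = [
    {
        "query": "where do I add a new endpoint to the server?",
        "expected_files": {"server/routes.py", "server/app.py"},
    },
    # ... more examples, ideally drawn from diverse codebases
]

def evaluate(retrieve) -> float:
    """Average F1 across the examples; `retrieve` is the pipeline under test."""
    scores = []
    for example in examples:
        retrieved = {chunk.file_path for chunk in retrieve(example["query"])}
        scores.append(compute_f1(retrieved, example["expected_files"]))
    return sum(scores) / len(scores)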

Chunking

The first detail one runs across while building a retrieval system is that some files won't fit in the context length of the LLM, so they must be shortened into manageable "chunks". Furthermore, OpenAI's ada-002 embeddings typically perform better with inputs of length 512 or fewer tokens. To coerce potentially large code files into smaller pieces, we need a chunking strategy.

Naive: Truncate the file

Basic: Split the file into segments of equal length

Better: Our chunking strategy uses tree-sitter to parse the AST of any programming language, allowing it to understand the structure of classes and functions. It works as follows (a sketch follows the list):

  1. Check if the file already fits and if so use the entire thing
  2. Otherwise, pull out all top-level functions and classes
  3. For each, check whether it will fit. If so, add it.
  4. If not, truncate the contents of the sub-methods to build a chunk like this:
class CodebaseIndex:
    directory: str

    def __init__(self, directory: str):
      ...

    async def exists(self) -> bool:
      """Returns whether the index exists (has been built)"""
      ...

    async def build(
        self,
        ignore_files: List[str] = [],
    ) -> AsyncGenerator[float, None]:
      """Builds the index, yielding progress as a float, 0-1"""
      ...

    async def update(self) -> AsyncGenerator[float, None]:
      """Updates the index, yielding progress as a float, 0-1"""
      ...

    async def query(self, query: str, n: int = 4) -> List[Chunk]:
      """Queries the index, returning the top n results"""
      ...
  5. Recurse into any sub-methods that were hidden to give them their own chunk(s)
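
Here is a minimal sketch of the recursive part of this strategy. It assumes a tree-sitter parser for the target language has already been constructed and that a count_tokens helper exists; the node type names follow the tree-sitter-python grammar, and step 4 (building the truncated "skeleton" chunk shown above) is omitted for brevity:

MAX_CHUNK_TOKENS = 512  # ada-002 embeddings tend to do best at or below this

def chunk_node(node, source: bytes, max_tokens: int = MAX_CHUNK_TOKENS) -> list[str]:
    text = source[node.start_byte:node.end_byte].decode("utf8")
    if count_tokens(text) <= max_tokens:
        # Steps 1 and 3: the node fits, so it becomes a chunk of its own
        return [text]
    chunks = []
    # Steps 2 and 5: otherwise, recurse into top-level classes and functions
    for child in node.children:
        if child.type in ("class_definition", "function_definition"):
            chunks.extend(chunk_node(child, source, max_tokens))
    return chunks

def chunk_file(path: str, parser) -> list[str]:
    source = open(path, "rb").read()
    tree = parser.parse(source)  # tree-sitter AST for the whole file
    return chunk_node(tree.root_node, source)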

Future: It might be helpful to include certain metadata at the top of each chunk, for example ctags, either for retrieval, re-ranking, or both. We could also include summaries of the code snippet generated by a language model, or examples of questions that are likely to be asked about the file.

Retrieval

To select the most important chunks, we follow a two-stage process. The first stage is retrieval, in which we collect a larger number of results than we will ultimately use. Fundamentally, this stage strives for high recall.

Embeddings models

While sentence transformers have been around for a while and still achieve top performance, there are a number of ways to improve them for a specific use case.

Naive: Use OpenAI’s ada-002 embeddings or an off-the-shelf sentence transformers model (we use all-MiniLM-L6-v2 for entirely local embeddings)

Basic: Learn a matrix by which to multiply the vector embeddings, which will emphasize dimensions that are particularly helpful for your use case (sketched below)

Better: Train an entire custom embeddings model

Future: Learn multiple matrices for different situations, potentially one per codebase
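
For the Basic approach, the learned transformation is just a matrix applied to every embedding before similarity search. A sketch with numpy, where the matrix W would be learned from labeled query/chunk pairs (not shown here):

import numpy as np

d = 1536          # dimensionality of ada-002 embeddings
W = np.eye(d)     # learned (d x d) matrix; the identity leaves embeddings unchanged

def adapt(embedding: np.ndarray) -> np.ndarray:
    """Re-weight the latent dimensions that matter most for our use case."""
    v = W @ embedding
    return v / np.linalg.norm(v)  # re-normalize so cosine similarity still applies

# At query time, both the query embedding and the stored chunk embeddings are
# passed through `adapt` before computing similarity.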

Hypothetical Document Embeddings (HyDE)

When comparing a user's question with candidate code snippets, similarity search is applied to find the chunks that are most relevant. But there is an important asymmetry: the user's question is a question, whereas the necessary snippets will be code, two different categories of text that might be separated in the latent space of the embeddings model. To solve this, the original HyDE paper showed that you can ask a language model to generate an imagined response, embed it, and compare that embedding to the embedded documents.

Naive: Perform similarity search with the raw query

Basic: Ask for a fake code snippet and embed this for comparison (sketched below)

Better: Ask for a language-specific code snippet

Future: Use information like the structure of the codebase to give the model extra knowledge when writing the hypothetical document. If it sees main.tsx and a components folder, for example, the model can estimate that it should write a React snippet
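
A sketch of the Basic/Better idea, assuming the OpenAI Python SDK (v1-style interface) and a hypothetical embed() helper for whichever embeddings model is in use; the prompt wording is illustrative:

from openai import OpenAI

client = OpenAI()

def hyde_embedding(query: str, language: str = "python"):
    """Embed a hypothetical code snippet instead of the raw question."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short {language} code snippet that would "
                       f"answer the question: {query}",
        }],
    )
    hypothetical_code = completion.choices[0].message.content
    # `embed` is a stand-in for ada-002 or a local sentence-transformers model
    return embed(hypothetical_code)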

Other retrieval sources

The use of vector embeddings is called dense retrieval. While it's a great way to capture related concepts that aren't word-for-word matches, we can do better by combining it with other search methods. These are some that we are currently experimenting with:

ripgrep

Use ripgrep to search for exact matches of words or phrases. Optionally, we can ask an LLM to generate keywords to search for.
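
A sketch of shelling out to ripgrep for exact matches (the --files-with-matches flag lists matching files; generating keywords with an LLM is omitted):

import subprocess

def keyword_search(keyword: str, directory: str) -> list[str]:
    """Return the files under `directory` that contain `keyword` verbatim."""
    result = subprocess.run(
        ["rg", "--files-with-matches", "--fixed-strings", keyword, directory],
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()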

Meilisearch

Meilisearch is a slightly fuzzier search engine that accounts for typos and other small differences, but still provides lightning-fast search to augment dense retrieval.
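
A sketch using the meilisearch Python client, assuming the chunks have already been added as documents to an index (the index name and options are illustrative):

import meilisearch

client = meilisearch.Client("http://127.0.0.1:7700")
index = client.index("code_chunks")  # chunks are indexed ahead of time

def fuzzy_search(query: str, n: int = 20) -> list[dict]:
    """Typo-tolerant keyword search over the indexed chunks."""
    return index.search(query, {"limit": n})["hits"]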

Tool usage

Let an LLM write queries on its own. For example, it could be prompted to output keywords to search, AST patterns to match, or folders to investigate given ctags of the codebase.

Tagging

Invest in pre-processing in order to describe characteristics of each snippet of code. For example, is it a function, a class, or a variable? What programming language is it written in? Are there comments? Is the function async? It’s even possible to let the authors of the codebase themselves play a role in this tagging process. Tags can then be used to filter or rank results.
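
As a sketch, tags could live alongside each chunk and be used to pre-filter candidates before similarity search (the specific tag names are illustrative):

from dataclasses import dataclass, field

@dataclass
class TaggedChunk:
    content: str
    file_path: str
    tags: set[str] = field(default_factory=set)  # e.g. {"python", "function", "async"}

def filter_by_tags(chunks: list[TaggedChunk], required: set[str]) -> list[TaggedChunk]:
    """Keep only the chunks that carry every required tag."""
    return [chunk for chunk in chunks if required <= chunk.tags]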

Ensemble retrieval

How can we combine all of these sources?

Naive: Retrieve the same number of chunks from each index, so they add up to the desired total

Basic: Select slightly more results from each, and weight results that were retrieved from multiple sources more heavily

Better: Ensure that each source returns quantitative, normalized results so the sum for each chunk can be used as an overall score (sketched below)

Future: Use regression to learn a model to combine the scores
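
A sketch of the Better approach: normalize each source's scores to the range [0, 1], then sum them per chunk so that results found by multiple sources rise to the top.

from collections import defaultdict

def combine(results_by_source: dict[str, dict[str, float]], n: int = 50) -> list[str]:
    """`results_by_source` maps a source name to {chunk_id: raw score}."""
    totals: dict[str, float] = defaultdict(float)
    for scores in results_by_source.values():
        lo, hi = min(scores.values()), max(scores.values())
        for chunk_id, score in scores.items():
            # Min-max normalize within each source, then accumulate per chunk
            normalized = (score - lo) / (hi - lo) if hi > lo else 1.0
            totals[chunk_id] += normalized
    return sorted(totals, key=totals.get, reverse=True)[:n]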

Re-Ranking

After retrieving an initial group of chunks, we want to perform a second step to whittle it down to the top few results, striving for high precision. In our initial testing, we've found good results by going from ~50 initial to ~10 final results. This will probably change as we find more reliable retrieval and re-ranking methods, and as the context length of models increases.

Naive: Show gpt-3.5-turbo all of the results and ask for the best 10 (probably have to group and run in parallel to fit all of these into the context window)

Basic: Make a request in parallel for each snippet, asking for a single-token answer about whether it is relevant: Yes or No

Better: Use the returned logprobs for the token "Yes" in order to get a continuous ranking (sketched below)

Future: Train a custom re-ranking model. Most likely this would require fewer parameters than a full-featured LLM. As an example, Cohere serves such a model.
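
A sketch of the Basic/Better approaches, assuming the OpenAI Python SDK (v1-style interface) and its logprobs option; the prompt wording and model choice are illustrative:

import math
from openai import OpenAI

client = OpenAI()

def relevance_score(query: str, chunk: str) -> float:
    """Probability that the model answers "Yes" when asked if the chunk is relevant."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
        messages=[{
            "role": "user",
            "content": f"Question: {query}\n\nCode:\n{chunk}\n\n"
                       "Is this code relevant to the question? Answer Yes or No.",
        }],
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    return sum(math.exp(t.logprob) for t in top if t.token.strip() == "Yes")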

Re-ranking with local models

The re-ranking process we currently use sends many requests to the model in parallel. With local models, we don't have the luxury of large batch sizes, so we need to resort to other means. One possibility is to use sentence transformers as cross-encoders.
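
A sketch with the sentence-transformers CrossEncoder class, which scores query/passage pairs jointly rather than embedding them separately (the model name is one commonly used off-the-shelf re-ranker, not necessarily what we ship):

from sentence_transformers import CrossEncoder

# A small off-the-shelf re-ranking model that runs locally
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], n: int = 10) -> list[str]:
    """Score every (query, chunk) pair jointly and keep the top n chunks."""
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:n]]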

Prompting

The final step of the pipeline is actually answering the question! You'll probably get a fine response if you just pass all of the context into the prompt along with the original input. But since answering a high-level question from many documents is a narrow slice of the prompts you could give an LLM, some specific instructions might help. For example, "Please respond to the query by using the information given to you, citing specific files and code snippets, and not discussing anything not shown to you."

Naive: Prompt with only the snippets and question

Basic: Include basic instructions like above

Better: Add instructions on what to do if there isn't enough information to answer the question. For example, "If there isn't enough information to answer the question, suggest where the user might look to learn more."

Future: Include some additional information, for example a file tree that only shows most recent common ancestor directories of the included files.
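
Putting these pieces together, a sketch of how the final prompt might be assembled; the chunk attributes and exact wording are illustrative:

def build_prompt(query: str, chunks: list) -> str:
    """Assemble the retrieved chunks, instructions, and the user's question."""
    context = "\n\n".join(
        f"File: {chunk.file_path}\n{chunk.content}" for chunk in chunks
    )
    return (
        f"{context}\n\n"
        "Please respond to the query by using the information given to you, "
        "citing specific files and code snippets, and not discussing anything "
        "not shown to you. If there isn't enough information to answer the "
        "question, suggest where the user might look to learn more.\n\n"
        f"Query: {query}"
    )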

Future work

These are a few of the more speculative ideas we are excited to try:

Processing

When the response is received, we currently process any file references so that they become clickable links to the file itself. Our next order of business here is to link to objects in code as well, like functions and classes. To go even further, there is work we can do to encourage the model to output linkable text.

Graph relations

Code has an inherent graph structure that can be understood using the Language Server Protocol's go-to-definition functionality or by analyzing the relationships of files in commits. If 5 separate files are selected by the retrieval process, all of which import some common function, shouldn't that function be included as well? If a section of a README is selected that links to some file, should we include the file? If two files are commonly edited together in commits, should they commonly be returned together in retrieval?

Using user feedback

If an irrelevant file is chosen and the user clicks a “thumbs down” button to indicate this, how can we use that information to improve? So far, three methods come to mind:

  1. Fine-tuning the re-ranking model. This is a simple matter of building a dataset where the prompt is the same one the model answered "Yes" to, and the correct response is "No".
  2. Updating the calculated embeddings. The idea here is to "penalize" the vector embedding in the direction of the user's request so that it is considered less similar in the future. This could be done by subtracting a multiple of the user input vector from the document's vector (a sketch follows this list). Embeddings would improve over time within a particular codebase for a particular user or team, but they wouldn't improve overall.
  3. Update the embedding function. A known trick to improve embeddings is to use one additional matrix multiplication to emphasize the aspects of the latent space that are more important for your use case. Instead of updating the vector embeddings themselves, we could update the process of their calculation (and recalculate them) by updating this matrix.
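
A sketch of the second idea with numpy; the penalty weight is a hypothetical hyperparameter that would need tuning:

import numpy as np

PENALTY = 0.1  # hypothetical weight; too large a value would distort the embedding

def penalize(doc_embedding: np.ndarray, query_embedding: np.ndarray) -> np.ndarray:
    """Nudge a document's embedding away from a query it was wrongly matched to."""
    v = doc_embedding - PENALTY * query_embedding
    return v / np.linalg.norm(v)  # keep unit length for cosine similarity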

Up-front customization

Another way for developers to improve their retrieval experience is by investing time in writing “documentation for the LLM.” This could take many forms, and in most cases isn’t too different from pre-existing documentation or other common practices. Some ideas:

  • Select certain files or folders to ignore, in addition to the .gitignore. We currently support using a file named .continueignore.
  • Write an "FAQ" file that contains a list of questions, the files that should be used to answer each one, and an answer. Each question can then be embedded, and if it is chosen by similarity search, we can replace it with the answer and the linked files (see the sketch after this list).
  • Write summaries of each important directory so that the LLM can either read these when deciding which folders to explore via tool usage, or so that the summary can be retrieved and used as a part of the initial retrieval process.
  • Pre-tag files and folders with certain keywords. For example, I might label a directory as related to the 'server' whereas another is labeled as 'webapp'.
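
A sketch of how the FAQ idea could work at query time, assuming a hypothetical embed() helper that returns unit-length vectors and a similarity threshold that would need tuning (the example entry is made up):

import numpy as np

# Each entry pairs a question with the files that answer it and a canned answer.
# These would be written by the codebase's authors; this one is made up.
faq = [
    {
        "question": "How do I run the test suite?",
        "files": ["scripts/test.sh"],
        "answer": "Run scripts/test.sh from the repository root.",
    },
]
faq_embeddings = [embed(entry["question"]) for entry in faq]  # hypothetical embed()

def match_faq(query: str, threshold: float = 0.85):
    """Return the FAQ entry whose question best matches the query, if close enough."""
    query_embedding = embed(query)
    sims = [float(np.dot(query_embedding, e)) for e in faq_embeddings]  # unit vectors
    best = int(np.argmax(sims))
    return faq[best] if sims[best] >= threshold else None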

Retrieving from other sources of context

There’s no reason to stop with code—Continue comes pre-built with a handful of context providers that make it easy to reference GitHub issues, the diff of your current commit, documentation sites, or any other source configured through our SDK. Any of these could be embedded and included in the retrieval process.

Conclusion

We’re very early in our work on retrieval, but it will be a large area of work moving forward, and we’re excited to continue sharing updates on our progress. Check back soon if you’re interested in hearing which of the above ideas make the greatest impact, and please don’t hesitate to reach out if you have any ideas of your own!