How can you improve the code suggestions you get from LLMs?

Josh Collinsworth wrote a post titled “I worry our Copilot is leaving some passengers behind” a couple of weeks ago, and I can’t stop thinking about it, especially this section:

"Why should we just accept that LLM tools not only fail to at least give us the same warnings, but actively push us the wrong way?

That constant pressure is my real concern.

Sure, you should know bad code when you see it, and you should not let it past you when you do. But what happens when you’re seeing bad code all day every day?

What happens when you aren’t sure whether it’s good or not?

One benefit of Copilot people commonly tout is how helpful it is when working in a new or unfamiliar language. But if you’re in that situation, how will you know a bad idea when you see it?

Again: I’m not concerned with some platonic ideal of code quality here; I’m concerned with very real impact on user experience and accessibility."

I can’t stop thinking about it because I believe something can be done. If we are getting suggestions that are flawed or wrong, then we need to do something about it.

For suggestions that are wrong on every level like Josh’s example that blocked users unnecessarily, we need to make sure we never get that suggestion again.

When our actions show the “right way to build software”, they should shape future suggestions. And as we evolve our definition of the “right way”, the suggestions we get from LLM tools should evolve too.

So what might you do to improve the code suggestions you get from LLMs?

1. Provide clear and comprehensive instructions

Pros

  • This is easy if you know what you are doing
  • You can customize it for each situation as needed

Cons

  • Your instructions might have to be so precise that just writing the code is easier
  • It’s a slow, tedious process that every person has to repeat every single time
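One way to make those instructions repeatable is to build them into the prompt programmatically. This is a minimal sketch; the rule strings and function name are hypothetical examples, not a standard:

```python
# Sketch: wrap a task with explicit, situation-specific instructions
# so the model cannot miss them. The rules below are illustrative.

def build_prompt(task: str, instructions: list[str]) -> str:
    """Prepend explicit constraints to the task description."""
    rules = "\n".join(f"- {rule}" for rule in instructions)
    return (
        "Follow every rule below when writing code:\n"
        f"{rules}\n\n"
        f"Task: {task}"
    )

prompt = build_prompt(
    "Write a signup form component.",
    [
        "All form inputs must have associated <label> elements.",
        "Do not block paste events on any input.",
        "Use semantic HTML elements rather than styled <div>s.",
    ],
)
```

Keeping the rules in a list makes them easy to reuse, but someone still has to decide, per situation, which rules apply.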

2. Add a system message with instructions that should always be followed

Pros

  • You can set it once and forget it (like environment variables)
  • It works well for many things (e.g. your operating system version)

Cons

  • It’s hard to predict every possible instruction beforehand
  • You can only fit so much info in the system message due to context length
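A system message is typically just the first entry in the messages array sent with every request. A minimal sketch, using the common chat-completions message shape (the instructions and versions here are illustrative):

```python
# Sketch: a system message defined once and reused across every request,
# like an environment variable for the LLM. Content is illustrative.

SYSTEM_MESSAGE = {
    "role": "system",
    "content": (
        "You are a coding assistant for a team targeting Python 3.11 "
        "on Ubuntu 22.04. Prefer the standard library; never suggest "
        "deprecated APIs."
    ),
}

def make_request(user_prompt: str) -> list[dict]:
    # Every request starts from the same system message.
    return [SYSTEM_MESSAGE, {"role": "user", "content": user_prompt}]

messages = make_request("How do I parse a TOML file?")
```

The trade-off is visible in the sketch: anything you want applied everywhere has to fit in that one string, and it is sent with every single request.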

3. Automatically filter for obviously bad suggestions and ask for a new suggestion

Pros

  • You can ensure that code does not violate licenses, uses only approved libraries, etc.
  • You could even automatically re-prompt when a filter catches a suggestion

Cons

  • It’s hard to determine what filters are both necessary and sufficient beforehand
  • This will result in a slow and costly filtering system that will grow massive

4. Improve how context from codebase + software development lifecycle is retrieved and used

Pros

  • There are lots of guides about how to get a basic RAG system working
  • Using docs and code snippets as context can help mitigate knowledge cutoff issues

Cons

  • It’s difficult to build a system that instantly determines what context is relevant
  • It likely requires a lot of integrations that you must maintain forever
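The core retrieval step is ranking codebase snippets by relevance to the query and attaching the best ones as context. Real systems use embeddings; this sketch substitutes word overlap so it stays self-contained, and the snippets are made up:

```python
# Sketch: rank codebase snippets by similarity to the query, then
# include the top matches as context. Word overlap stands in for
# real embedding similarity; snippets are illustrative.

def score(query: str, snippet: str) -> float:
    q, s = set(query.lower().split()), set(snippet.lower().split())
    return len(q & s) / len(q | s) if q | s else 0.0

def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
    return sorted(snippets, key=lambda s: score(query, s), reverse=True)[:k]

snippets = [
    "def connect_db(url): ...  # opens a postgres connection",
    "def render_button(label): ...  # accessible button component",
    "def migrate_schema(): ...  # runs alembic migrations",
]
context = retrieve("how do I connect to the postgres db", snippets)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: how do I connect?"
```

Even this toy version hints at the integration burden: every source of context (code, docs, tickets, CI logs) needs its own indexing and refresh pipeline.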

5. Use different LLMs, and more than one of them

Pros

  • Most LLM tools already pair a small (1-15B parameter) model for tab-autocomplete with GPT-4 for questions
  • You could have models for specific situations (e.g. a proprietary programming language)

Cons

  • It might not be possible for you to use the models you want and need
  • Many of the models you want or need might not even exist
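Routing between models can be as simple as a lookup keyed by task type. The model names below are placeholders, not recommendations:

```python
# Sketch: route each request to a different model by task type.
# Model names are illustrative placeholders.

ROUTES = {
    "autocomplete": "small-fast-model",      # low latency matters most
    "chat": "large-capable-model",           # quality matters most
    "proprietary-lang": "custom-tuned-model",  # niche domain coverage
}

def pick_model(task: str) -> str:
    # Fall back to the general-purpose model for unknown task types.
    return ROUTES.get(task, ROUTES["chat"])

model = pick_model("autocomplete")
```

The routing table is the easy part; the cons above are about whether the entries you want to put in it actually exist and are available to you.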

6. Use fine-tuning to improve existing LLMs

Pros

  • It can cause the model to learn your preferred styles
  • It can be highly customized for each of your use cases

Cons

  • It likely requires people to generate a lot of domain-specific instructions and 100+ GPU hours
  • It is not nearly as effective at teaching the model new knowledge or capabilities (as opposed to style)
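Much of the human effort goes into assembling the training examples themselves, commonly stored as JSONL pairs. A sketch of that data-prep step; the field names vary by training framework and the examples are made up:

```python
# Sketch: assemble a small instruction-tuning dataset in JSONL form.
# Field names ("prompt"/"completion") and examples are illustrative.

import json

examples = [
    {
        "prompt": "Write a form input for an email address.",
        "completion": '<label for="email">Email</label>\n'
                      '<input id="email" type="email">',
    },
    {
        "prompt": "Add a click handler to a button.",
        "completion": 'button.addEventListener("click", onClick);',
    },
]

# One JSON object per line, the shape most training pipelines expect.
jsonl = "\n".join(json.dumps(e) for e in examples)
records = [json.loads(line) for line in jsonl.splitlines()]
```

Two examples fit in a sketch; the "100+ GPU hours" con starts from the fact that useful fine-tuning sets need thousands of pairs like these.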

7. Use domain-adaptive continued pre-training to improve open-source LLMs

Pros

  • It can teach the model new domain-specific knowledge that fine-tuning struggles to add
  • Starting from an open-source model is far cheaper than pre-training from scratch

Cons

  • It likely requires billions of tokens of relevant company data + thousands of GPU hours
  • This is a challenging, expensive, and time-consuming approach

8. Pre-train your own LLM from scratch

Pros

  • You can determine what knowledge / capabilities are learned by pre-processing training data
  • This is how the best models like GPT-4 and DeepSeek Coder were created

Cons

  • It likely requires trillions of tokens of Internet data + relevant company data + millions of GPU hours
  • It’s the most challenging, expensive, and time-consuming approach

Conclusion

To do many of these things, you are going to need far more configurability than what is offered by most AI code assistance tools today. If there is any part of your system that you don’t control, you will find that the suggestions can and will change underneath you.

I believe we are going to need to do all of the things listed above to ensure our copilots leave no passengers behind. I wrote a sketch of what I think this will ultimately require last summer, which is a tad outdated but still worth reading: “It’s time to collect data on how you build software”.

If you want to read more from the community of folks building software with the help of LLMs, join our monthly newsletter here.