How can you improve the code suggestions you get from LLMs?

Josh Collinsworth wrote a post titled “I worry our Copilot is leaving some passengers behind” a couple of weeks ago, and I can’t stop thinking about it, especially this section:

"Why should we just accept that LLM tools not only fail to at least give us the same warnings, but actively push us the wrong way?

That constant pressure is my real concern.

Sure, you should know bad code when you see it, and you should not let it past you when you do. But what happens when you’re seeing bad code all day every day?

What happens when you aren’t sure whether it’s good or not?

One benefit of Copilot people commonly tout is how helpful it is when working in a new or unfamiliar language. But if you’re in that situation, how will you know a bad idea when you see it?

Again: I’m not concerned with some platonic ideal of code quality here; I’m concerned with very real impact on user experience and accessibility."

I can’t stop thinking about it because I believe something can be done. If we are getting suggestions that are flawed or wrong, then we need to do something about it.

For suggestions that are wrong on every level like Josh’s example that blocked users unnecessarily, we need to make sure we never get that suggestion again.

When our actions show the “right way to build software”, they should shape future suggestions. And as we evolve our definition of the “right way”, the suggestions we get from LLM tools should evolve too.

So what might you do to improve the code suggestions you get from LLMs?

1. Provide clear and comprehensive instructions

Pros

  • This is easy if you know what you are doing
  • You can customize it for each situation as needed

Cons

  • Your instructions might have to be so precise that just writing the code is easier
  • It’s a slow, tedious process that every person has to repeat every single time
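One way to make those instructions repeatable is to build them into the prompt programmatically. This is a minimal sketch; the rule strings and function name are hypothetical examples, not a standard:

```python
# Sketch: wrap a task with explicit, situation-specific instructions
# so the model cannot miss them. The rules below are illustrative.

def build_prompt(task: str, instructions: list[str]) -> str:
    """Prepend explicit constraints to the task description."""
    rules = "\n".join(f"- {rule}" for rule in instructions)
    return (
        "Follow every rule below when writing code:\n"
        f"{rules}\n\n"
        f"Task: {task}"
    )

prompt = build_prompt(
    "Write a signup form component.",
    [
        "All form inputs must have associated <label> elements.",
        "Do not block paste events on any input.",
        "Use semantic HTML elements rather than styled <div>s.",
    ],
)
```

Keeping the rules in a list makes them easy to reuse, but someone still has to decide, per situation, which rules apply.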

2. Add a system message with instructions that should always be followed

Pros

  • You can set it once and forget it (like environment variables)
  • It works well for many things (e.g. your operating system version)

Cons

  • It’s hard to predict every possible instruction beforehand
  • You can only fit so much info in the system message due to context length
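A system message is typically just the first entry in the messages array sent with every request. A minimal sketch, using the common chat-completions message shape (the instructions and versions here are illustrative):

```python
# Sketch: a system message defined once and reused across every request,
# like an environment variable for the LLM. Content is illustrative.

SYSTEM_MESSAGE = {
    "role": "system",
    "content": (
        "You are a coding assistant for a team targeting Python 3.11 "
        "on Ubuntu 22.04. Prefer the standard library; never suggest "
        "deprecated APIs."
    ),
}

def make_request(user_prompt: str) -> list[dict]:
    # Every request starts from the same system message.
    return [SYSTEM_MESSAGE, {"role": "user", "content": user_prompt}]

messages = make_request("How do I parse a TOML file?")
```

The trade-off is visible in the sketch: anything you want applied everywhere has to fit in that one string, and it is sent with every single request.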

3. Automatically filter for obviously bad suggestions and ask for a new suggestion

Pros

  • You can ensure that code does not violate licenses, uses only approved libraries, etc.
  • You could even automatically re-prompt when a filter catches a suggestion

Cons

  • It’s hard to determine what filters are both necessary and sufficient beforehand
  • This will result in a slow and costly filtering system that will grow massive

4. Improve how context from codebase + software development lifecycle is retrieved and used

Pros

  • There are lots of guides about how to get a basic RAG system working
  • Using docs and code snippets as context can help mitigate knowledge cutoff issues

Cons

  • It’s difficult to build a system that instantly determines what context is relevant
  • It likely requires a lot of integrations that you must maintain forever
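The core retrieval step is ranking codebase snippets by relevance to the query and attaching the best ones as context. Real systems use embeddings; this sketch substitutes word overlap so it stays self-contained, and the snippets are made up:

```python
# Sketch: rank codebase snippets by similarity to the query, then
# include the top matches as context. Word overlap stands in for
# real embedding similarity; snippets are illustrative.

def score(query: str, snippet: str) -> float:
    q, s = set(query.lower().split()), set(snippet.lower().split())
    return len(q & s) / len(q | s) if q | s else 0.0

def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
    return sorted(snippets, key=lambda s: score(query, s), reverse=True)[:k]

snippets = [
    "def connect_db(url): ...  # opens a postgres connection",
    "def render_button(label): ...  # accessible button component",
    "def migrate_schema(): ...  # runs alembic migrations",
]
context = retrieve("how do I connect to the postgres db", snippets)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: how do I connect?"
```

Even this toy version hints at the integration burden: every source of context (code, docs, tickets, CI logs) needs its own indexing and refresh pipeline.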

5. Use different LLMs, and more than one of them

Pros

  • Most LLM tools already pair a small (1-15B parameter) model for tab-autocomplete with GPT-4 for questions
  • You could have models for specific situations (e.g. a proprietary programming language)

Cons

  • It might not be possible for you to use the models you want and need
  • Many of the models you want or need might not even exist
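Routing between models can be as simple as a lookup keyed by task type. The model names below are placeholders, not recommendations:

```python
# Sketch: route each request to a different model by task type.
# Model names are illustrative placeholders.

ROUTES = {
    "autocomplete": "small-fast-model",      # low latency matters most
    "chat": "large-capable-model",           # quality matters most
    "proprietary-lang": "custom-tuned-model",  # niche domain coverage
}

def pick_model(task: str) -> str:
    # Fall back to the general-purpose model for unknown task types.
    return ROUTES.get(task, ROUTES["chat"])

model = pick_model("autocomplete")
```

The routing table is the easy part; the cons above are about whether the entries you want to put in it actually exist and are available to you.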

6. Use fine-tuning to improve existing LLMs

Pros

  • It can cause the model to learn your preferred styles
  • It can be highly customized for each of your use cases

Cons

  • It likely requires people to generate a lot of domain-specific instructions and 100+ GPU hours
  • It is not nearly as effective at teaching the model new knowledge or capabilities (as opposed to style)
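Much of the human effort goes into assembling the training examples themselves, commonly stored as JSONL pairs. A sketch of that data-prep step; the field names vary by training framework and the examples are made up:

```python
# Sketch: assemble a small instruction-tuning dataset in JSONL form.
# Field names ("prompt"/"completion") and examples are illustrative.

import json

examples = [
    {
        "prompt": "Write a form input for an email address.",
        "completion": '<label for="email">Email</label>\n'
                      '<input id="email" type="email">',
    },
    {
        "prompt": "Add a click handler to a button.",
        "completion": 'button.addEventListener("click", onClick);',
    },
]

# One JSON object per line, the shape most training pipelines expect.
jsonl = "\n".join(json.dumps(e) for e in examples)
records = [json.loads(line) for line in jsonl.splitlines()]
```

Two examples fit in a sketch; the "100+ GPU hours" con starts from the fact that useful fine-tuning sets need thousands of pairs like these.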

7. Use domain-adaptive continued pre-training to improve open-source LLMs

Pros

  • It can teach the model new domain-specific knowledge that fine-tuning struggles to add
  • Starting from an open-source model is far cheaper than pre-training from scratch

Cons

  • It likely requires billions of tokens of relevant company data + thousands of GPU hours
  • This is a challenging, expensive, and time-consuming approach

8. Pre-train your own LLM from scratch

Pros

  • You can determine what knowledge / capabilities are learned by pre-processing training data
  • This is how the best models like GPT-4 and DeepSeek Coder were created

Cons

  • It likely requires trillions of tokens of Internet data + relevant company data + millions of GPU hours
  • It’s the most challenging, expensive, and time-consuming approach

Conclusion

To do many of these things, you are going to need far more configurability than what is offered by most AI code assistance tools today. If there is any part of your system that you don’t control, you will find that the suggestions can and will change underneath you.

I believe we are going to need to do all of the things listed above to ensure our copilots leave no passengers behind. I wrote a sketch of what I think this will ultimately require last summer, which is a tad outdated but still worth reading: “It’s time to collect data on how you build software”.

If you want to read more from the community of folks building software with the help of LLMs, join our monthly newsletter here.