A custom autocomplete model in 30 minutes using Unsloth (Community post)

This is a guest post on the Continue Blog by Sophia Parafina, a developer advocate who has previously worked at Pulumi, Anaconda, and Docker.


Continue is an open-source AI code assistant. It's a compound AI system that uses a set of models (chat, autocomplete, embeddings, etc) along with other components. One of its main benefits is that it records development data, which is useful for fine-tuning an LLM to produce better suggestions for developers.

Fine-tuning used to require hours of GPU time and expertise. However, Unsloth provides Jupyter notebooks for fine-tuning on free Google Colab instances. In this post, I show how to fine-tune a model with Unsloth, so that you can replace your generic, open-source autocomplete model with one fine-tuned on your development data.

ℹ️
To follow along, you'll need a Hugging Face account (to host the dataset) and a Google Colab account (to run the fine-tuning notebook).


Fine-tuning with Unsloth


1. Copy your development data to a working directory

Linux/macOS

    mkdir training_data && cd training_data
    cp ~/.continue/dev_data/autocomplete.jsonl .

Windows

    md training_data && cd training_data
    copy C:\Users\username\.continue\dev_data\autocomplete.jsonl .

2. Format the development data

The Unsloth notebook uses the alpaca dataset format:

  • instruction: describes the task the model should perform.
  • input: optional context for the task. For example, when the instruction is "Summarize the following article", the input is the article.
  • output: the answer to the instruction.
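
For illustration, a record in this format looks like the following (the values here are hypothetical):

    # A hypothetical alpaca-format record, shown as a Python dict
    alpaca_record = {
        "instruction": "Summarize the following article",     # the task to perform
        "input": "Full text of the article goes here...",     # optional context
        "output": "A two-sentence summary of the article.",   # the expected answer
    }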

Of the fields Continue records, we only need prompt and completion (plus the accepted flag, to filter out rejected suggestions), for example:

    {
        "prompt": "[SUFFIX]\n    </div>\n  );\n}\n[PREFIX]import { SharedHeader } from \"@/components/shared-header\";\n\nexport function UserSettingsPage() {\n  return (\n    <div>\n      <SharedHeader />\n      <h1>User Settings</h1>\n      ",
        "completion": "<p>This is the user settings page.</p>",
    }

We'll use dlt to format the training data and save it to Hugging Face datasets:

    import dlt
    import json

    from huggingface_hub import HfApi


    # Read the development data and keep only accepted completions,
    # mapping each record to the alpaca fields (instruction/input/output)
    @dlt.resource(table_name="autocompletions")
    def resource():
        with open("autocomplete.jsonl", "r") as file:
            for line in file:
                full_data = json.loads(line)
                if full_data["accepted"]:
                    yield {
                        "instruction": full_data["prompt"],
                        "input": "",
                        "output": full_data["completion"],
                    }


    # With batch_size=0, dlt hands the destination the path of the finished
    # parquet file, which we upload to a Hugging Face dataset repository
    @dlt.destination(batch_size=0, loader_file_format="parquet")
    def hf_destination(items, table):
        api = HfApi()
        api.upload_file(
            path_or_fileobj=items,
            path_in_repo=items.split("/")[-1],
            repo_id="<username>/continue-training-data",
            repo_type="dataset",
        )


    pipeline = dlt.pipeline(pipeline_name="continue_data", destination=hf_destination)

    load_info = pipeline.run(resource())

    print(load_info)
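
To sanity-check the upload, you can load the dataset back from the Hub; this is a quick verification sketch and assumes the same repository name as above:

    from datasets import load_dataset

    # Load the uploaded parquet file back as a dataset and inspect it
    ds = load_dataset("<username>/continue-training-data", split="train")
    print(ds)      # row count and column names
    print(ds[0])   # the first instruction/input/output record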


3. Fine-tuning the model

We'll use an Unsloth notebook to fine-tune Qwen2.5-Coder-7B in Google Colab. The notebook doesn't list this model by default, but Unsloth publishes over 200 models on Hugging Face. Set the model to use in the second cell of the notebook:

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen2.5-Coder-7B-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    )
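
The max_seq_length, dtype, and load_in_4bit variables are defined earlier in the notebook; values along these lines are typical (adjust them to your hardware and data):

    max_seq_length = 2048  # maximum context length used during training
    dtype = None           # None lets Unsloth auto-detect (float16 on T4/V100, bfloat16 on Ampere+)
    load_in_4bit = True    # 4-bit quantization so the 7B model fits in Colab GPU memory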

In the Data Prep cell, specify the training data uploaded to Hugging Face datasets:

    from datasets import load_dataset
    dataset = load_dataset("<username>/continue-training-data", split = "train")
    dataset = dataset.map(formatting_prompts_func, batched = True,)

Now that we've selected a model and loaded the development data, we can fine-tune it with Hugging Face TRL's SFTTrainer (supervised fine-tuning trainer).
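
The notebook's training cell looks roughly like the following; the hyperparameters shown are the notebook's defaults, so treat them as a starting point rather than tuned values:

    from trl import SFTTrainer
    from transformers import TrainingArguments
    from unsloth import is_bfloat16_supported

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",   # field produced by formatting_prompts_func
        max_seq_length = max_seq_length,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            warmup_steps = 5,
            max_steps = 60,            # increase for larger development datasets
            learning_rate = 2e-4,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            output_dir = "outputs",
        ),
    )

    trainer_stats = trainer.train()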

4. Saving the model

The notebook provides several options for saving the model: only the LoRA adapters, merged float16 weights, or GGUF format, which is supported by inference engines such as Ollama, LM Studio, and Llamafile.
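
For example, saving only the LoRA adapter weights locally (the lightest-weight option) looks like this in the notebook:

    # Saves just the LoRA adapters, not the full merged model
    model.save_pretrained("lora_model")
    tokenizer.save_pretrained("lora_model")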

In the GGUF / llama.cpp Conversion cell, save the model to Hugging Face as a q4_k_m quantized model, a quantization that balances size and accuracy. Change if False to if True on the push_to_hub_gguf line, then provide your repository, model name, and a Hugging Face token to save the model:

    # Save to q4_k_m GGUF
    if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
    if True: model.push_to_hub_gguf("<username>/qwen2.5-coder-continue", tokenizer, quantization_method = "q4_k_m", token = "hf-xxxxxx")

5. To run the notebook, select Runtime > Run All

6. Inference with Ollama

Download the model from Hugging Face. We'll use Ollama to run the fine-tuned model, but first we have to import it. Create an Ollama Modelfile that points to the downloaded GGUF file:

    FROM /path/to/qwen2.5-coder-continue

Use the ollama create command to build the model, ollama list to confirm it was imported, and ollama run to start it:

    ollama create qwen2.5-coder-continue -f Modelfile
    ollama list
    ollama run qwen2.5-coder-continue
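
Before wiring the model into Continue, you can sanity-check it against Ollama's local API. The sketch below assumes Ollama is serving on its default port (11434):

    import json
    import urllib.request

    # Send a single, non-streaming completion request to Ollama
    payload = {
        "model": "qwen2.5-coder-continue",
        "prompt": "def fibonacci(n):",
        "stream": False,
    }
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read())["response"])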

7. Configure Continue

The last step is to configure Continue to use the model. Edit the config.json file in the .continue directory and update tabAutocompleteModel with the model name:

    "tabAutocompleteModel": {
        "title": "Qwen2.5 Coder Continue",
        "provider": "ollama",
        "model": "qwen2.5-coder-continue"
    },

Summary


Autocomplete models are trained on a variety of datasets, including code from open source projects, code written by other developers, and synthetically generated code. Fine-tuning a model with your own development data can improve suggestion accuracy because the training examples come from code you actually write and accept. To try out Continue, install the extension in VS Code or JetBrains.