Using Ollama with Continue: A Developer's Guide


As developers, we want tools that respect our privacy, let us customize our experience, and integrate seamlessly into our workflows. Continue paired with Ollama offers exactly that—a way to run custom AI coding assistants locally.

The benefits of this setup include:

  1. Data privacy: your code stays on your machine, not sent to third-party servers
  2. Full control: choose exactly which models you want to use for different coding tasks
  3. Customization: configure rules, prompts, docs, etc. to tailor the assistant to your needs
  4. No subscription costs: use powerful open-source models without recurring fees
  5. Offline usage: work without an internet connection when needed

In this guide, I'll walk you through setting up Continue with Ollama, so you can build a development environment that respects your workflow.

What you'll need

  • A machine that can run local models (see the hardware guidelines at the end of this guide)
  • VS Code or a JetBrains IDE
  • A Continue Hub account (we'll create one in Step 3)
  • An internet connection for the initial downloads

Step 1: Install Ollama

Ollama lets you run powerful language models locally. Installation is straightforward:

macOS/Linux

curl -fsSL https://ollama.ai/install.sh | sh

Windows

Download the installer from the Ollama website (https://ollama.com/download).
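
Either way, you can confirm the install from a terminal:

ollama --version

If the command prints a version number, the CLI is installed correctly.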

Step 2: Pull a code-optimized model

After installation, pull a model that works well for coding. I find that Qwen2.5 Coder 7B hits a sweet spot between capability and hardware demands for many developers.

ollama pull qwen2.5-coder:7b

This will download the model to your local machine. You can verify it works by running:

ollama run qwen2.5-coder:7b "Write a function to calculate the factorial of 5"
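
You can also list every model you've downloaded so far:

ollama list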

Step 3: Set up your Continue account

Before creating your assistant, you'll need to:

  1. Visit https://hub.continue.dev/signup to create your account
  2. Verify your email address
  3. Log in to access Continue Hub

Step 4: Create a new assistant

Once you have a Continue Hub account, you can create a new assistant:

  1. From your Continue Hub dashboard, click the "+" button to create a new assistant

  2. Give your assistant a name (e.g., "Llama Local")

  3. By default, Continue will populate some useful model blocks for you. You can keep these defaults or swap in blocks for the Ollama models you plan to run locally

  4. Save your assistant configuration

  5. Your assistant will now be available at a URL like: https://hub.continue.dev/chad/llama-local

Step 5: Install and log into Continue in your IDE

  1. Install the Continue extension for your IDE (for VS Code, a command-line shortcut is shown after this list):

    • For detailed instructions, see the official documentation at https://docs.continue.dev
    • Follow the installation steps for VS Code or JetBrains
  2. Once installed, log into your Continue account

  3. Access all of your assistants in the Continue IDE extension
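
If you're on VS Code, you can alternatively install the extension from the command line. The extension ID below is my best guess at the Marketplace identifier, so double-check it in the Extensions view if the command fails:

code --install-extension Continue.continue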

Next step: customize all the things

After setting up your assistant, you can customize it further:

  1. Rules: Define how your assistant should behave and what expertise it should have
  2. Additional models: Set up different model roles for different types of tasks (a local config sketch follows this list)
  3. Other blocks: Explore all the different types of blocks on the Hub
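
If you'd rather define the whole assistant locally instead of on the Hub, Continue can also read a YAML config file. Here's a minimal sketch that assumes a ~/.continue/config.yaml with an ollama provider block; the exact schema and file location can differ between Continue versions, so treat the field names as assumptions and check the docs before relying on them:

mkdir -p ~/.continue
cat > ~/.continue/config.yaml <<'EOF'
# Minimal local-only assistant sketch -- field names are assumptions,
# verify them against the Continue docs for your version.
name: Llama Local
version: 0.0.1
models:
  - name: Qwen2.5 Coder 7B
    provider: ollama          # talk to the local Ollama server
    model: qwen2.5-coder:7b   # the tag pulled in Step 2
    roles:
      - chat
      - edit
EOF

With this in place, the model should show up in the extension's model picker alongside any Hub assistants.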

Conclusion

By combining Continue with Ollama, you've created a powerful, private, and customizable coding assistant that runs entirely on your machine. This setup gives you the benefits of AI assistance while maintaining control over your code and data.

As models continue to improve, you can easily upgrade your local setup by pulling newer models without changing your Continue configuration. Experiment with different models and settings to find the perfect balance between performance and capability for your specific hardware and workflow.

Happy coding!


Hardware requirements for different model sizes

Choosing the right model depends on your hardware capabilities. Here are general guidelines:

Model Size         Examples                               GPU VRAM Requirements
Small (1.5B-3B)    Qwen2.5 Coder 1.5B, Qwen2.5 Coder 3B   <8GB VRAM
Medium (7B-14B)    Mistral 7B, Qwen2.5 Coder 7B           8-16GB VRAM
Large (32B+)       DeepSeek R1 32B, Qwen2.5 Coder 32B     24GB+ VRAM (Apple M4 Pro 64GB / RTX 3090 / 4090)

Additional Hardware Notes

  • Small models (1.5B-3B) will work on modern CPUs with 16GB+ RAM, though your mileage may vary
  • For CPU-only usage, expect significantly lower throughput (1-5 tokens/sec)
  • CPUs with integrated GPU cores (Apple Silicon, AMD Ryzen™ AI Max 300, etc.) perform well with appropriate unified memory and bandwidth
  • Medium and large models become impractical without dedicated GPU acceleration
  • Using quantization (4-bit or 8-bit precision) can reduce VRAM requirements by 2-4x at some cost in accuracy (see the example after this list)
  • Consider running smaller models at higher precision rather than larger models with aggressive quantization
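
For instance, Ollama publishes pre-quantized tags for many models. The exact tag below is illustrative, not guaranteed; check the model's page on ollama.com for the tags that actually exist:

# Pull a 4-bit (q4_K_M) quantized build instead of the default tag.
# Tag names vary by model -- verify on the model's ollama.com page.
ollama pull qwen2.5-coder:7b-instruct-q4_K_M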

Apple Silicon Specifications

For Mac users, here's a breakdown of Apple Silicon capabilities:

Chip       CPU Cores   Neural Engine Cores   Max RAM   Max RAM Bandwidth
M1         8           16                    16GB      68.25GB/s
M1 Pro     8-10        16                    32GB      200GB/s
M1 Max     10          16                    64GB      400GB/s
M1 Ultra   20          32                    128GB     800GB/s
M2         8           16                    24GB      100GB/s
M2 Pro     10-12       16                    32GB      200GB/s
M2 Max     12          16                    96GB      400GB/s
M2 Ultra   24          32                    192GB     800GB/s
M3         8           16                    24GB      100GB/s
M3 Pro     11-12       16                    36GB      150GB/s
M3 Max     14-16       16                    128GB     400GB/s
M3 Ultra   28-32       32                    192GB     800GB/s
M4         10          16                    32GB      120GB/s
M4 Pro     14-16       32                    64GB      280GB/s
M4 Max     16-18       32                    128GB     560GB/s
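
Not sure which chip or how much memory your Mac has? You can check from the terminal (system_profiler ships with macOS):

# Print the chip and installed memory on macOS
system_profiler SPHardwareDataType | grep -E "Chip|Memory"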