Building a Semantic Code History Search with LanceDB

Building a Semantic Code History Search with LanceDB

Ever wished you could ask "who wrote all the authentication logic?" instead of manually running git blame on every auth-related file? What if your AI coding assistant could understand not just who changed code, but why and what it relates to across your entire codebase? Let's build exactly that using LanceDB's multimodal lakehouse and Continue's MCP integration!

The Problem: Git Blame

Traditional git blame shows you line-by-line authorship, but it's keyword-based and lacks semantic understanding. You can't ask questions like:

  • "Find all error handling patterns similar to this one"
  • "Who implemented our retry logic across services?"
  • "Show me authentication code changes from last sprint"

LanceDB - an AI-native vector database that turns your code history into a searchable knowledge base! Built on Apache Arrow with columnar storage, it's blazing fast and runs embedded in your app. No separate database server needed.

What You'll Get

A local MCP server that lets you ask Continue natural language questions like:

  • "Who implemented the error handling in our API?"
  • "Find all database-related changes by Sarah"
  • "Show me authentication code changes from last month"

The Setup: Git Blame Search MCP

We'll use the git_blame_search tool as an MCP (Model Context Protocol) server that integrates with Continue. This gives your AI assistant superpowers to understand your codebase history semantically.

Quick Start

# Clone the repository
git clone https://github.com/bdougie/git_blame_search.git
cd git_blame_search

# Install with uv (super fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

2. Configure Continue

Add to your Continue config (~/.continue/config.yml):

mcpServers:
  git_blame_search:
    command: uv
    args:
      - --directory
      - /path/to/git_blame_search  # Replace with your actual path
      - run
      - python
      - src/server.py

Important: The cwd points to where you cloned the repo. Everything runs locally on your machine - your code never leaves your computer!

3. Start Using It!

Restart VS Code and start asking questions in Continue:

@mcp who wrote the authentication middleware?
@mcp find commits related to database optimization
@mcp show me error handling patterns in the codebase

How It Works

The Indexing Magic (git_blame_tool.py)

The tool creates a semantic index of your git history:

  1. Extracts Git Blame Data: For each file, it runs git blame to get line-by-line authorship
  2. Creates Embeddings: Uses sentence transformers to convert code + context into vectors
  3. Stores in LanceDB: Saves everything in a local vector database for lightning-fast searches

Key snippet:

# Get blame data and create embeddings
blame_data = get_git_blame(file_path)
for line in blame_data:
    text = f"{line['content']} {line['commit_message']}"
    embedding = model.encode(text)
    # Store in LanceDB with metadata

The MCP Server (server.py)

Built with FastMCP, the server exposes tools to Continue:

@mcp.tool()
async def search_git_blame(query: str):
    """Search git history using natural language"""
    # Encode query to vector
    query_embedding = model.encode(query)
    
    # Search LanceDB
    results = table.search(query_embedding).limit(10)
    
    return format_results(results)

The server:

  • Runs locally in your project directory
  • Uses stdio transport to communicate with Continue
  • Provides semantic search over your indexed git history

Why This Changes Everything

Traditional Git Blame asks: "Who touched line 42?"

Semantic Git Blame asks: "Who knows about authentication patterns?"

With LanceDB's multimodal capabilities, you can:

  • Store code snippets with embeddings
  • Include commit messages and PR descriptions
  • Add documentation and architecture diagrams
  • Query everything with natural language

Extra credit

  1. LanceDB on Continue Hub: LanceDB is officially available on the Continue Hub, making integration even easier!
  2. Performance: LanceDB runs embedded, so searches are lightning fast with no network latency
  3. Scale: Handle millions of code snippets without breaking a sweat
  4. Privacy: Everything runs locally - your code never leaves your machine

The Future is Multimodal

LanceDB's vision of the multimodal lakehouse means you're not limited to just code. Imagine:

  • Searching architecture diagrams alongside code
  • Finding who drew that system design
  • Connecting meeting notes to code changes
  • Querying across all your development artifacts

Stop grepping through git logs, try out LanceDB with Continue's MCP integration, you get an AI assistant that truly understands your codebase's history.

šŸš€ Try it now: Install Continue, set up this LanceDB service as an MCP server, and start asking your code history real questions. Your future self (and your team) will thank you!

Want more? Check out Continue's MCP docs and start building your own semantic development tools!