codebase indexing

codebase indexing dashboard

see screenshots of the app here

i'm currently building a web app that takes a bug issue from github and helps a developer find the code relevant to fixing it.

first we need to chunk the codebase

you can read about tree-sitter concepts here.

the first step in indexing a codebase is breaking it into meaningful pieces. i'm using tree-sitter to parse python code and extract "chunks" - functions, classes, methods, and module-level variables.
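a minimal sketch of what that extraction looks like, assuming the tree_sitter and tree_sitter_python packages (the exact API varies a bit between versions, and the real chunker also pulls docstrings, calls, and imports):

from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def extract_defs(source: str):
    """walk the syntax tree and yield (kind, name, start_line, end_line)."""
    tree = parser.parse(source.encode("utf8"))
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in ("function_definition", "class_definition"):
            name = node.child_by_field_name("name").text.decode("utf8")
            kind = "class" if node.type == "class_definition" else "function"
            # tree-sitter rows are 0-based; convert to 1-based line numbers
            yield kind, name, node.start_point[0] + 1, node.end_point[0] + 1
        stack.extend(node.children)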

what each chunk captures

each chunk is more than just code. it's a structured object with rich metadata:

from dataclasses import dataclass

@dataclass
class Chunk:
    id: str                    # "file.py::ClassName::method_name"
    type: str                  # "function", "method", "class", "module_var"
    name: str                  # human-readable name
    code: str                  # actual source code
    file_path: str             # where it lives
    start_line: int            # for navigation
    end_line: int
    parent_class: str | None   # distinguishes methods from functions
    imports: list[str]         # all file imports
    calls: list[str]           # functions/methods called within
    base_classes: list[str]    # for inheritance tracking
    docstring: str | None      # for semantic context
    used_imports: list[str]    # only imports actually referenced

why this metadata matters

for bug triage, i need to trace execution flow and understand code relationships:

  • call graphs - if a bug mentions process_data(), i need to know what it calls (and a reverse index, sketched after this list, answers the opposite question: what calls it)
  • inheritance - errors in ChildClass might come from ParentClass
  • import resolution - map from foo import bar to actual code locations
  • docstrings - provide semantic context for LLM summarization later
  • line numbers - link directly to github for developers
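the calls metadata directly gives each chunk's callees; inverting it gives callers. a quick sketch of that reverse index, assuming a list of Chunk objects as defined above:

from collections import defaultdict

def build_caller_index(chunks: list[Chunk]) -> dict[str, list[str]]:
    """map each called name to the ids of the chunks that call it."""
    callers: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:
        for callee in chunk.calls:
            callers[callee].append(chunk.id)
    return callers

# callers["process_data"] -> ids of every chunk that calls process_data()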

i also think showing git diffs of the file in the web app will help the dev understand recent changes and the surrounding context.

parsing example

here's a simple python file:

import os
from pathlib import Path

class FileProcessor:
    """Processes files from a directory."""
    
    def __init__(self, path: str):
        self.path = Path(path)
    
    def process(self):
        """Process all files."""
        current = os.getcwd()
        files = self.list_files()
        return files
    
    def list_files(self):
        return list(self.path.glob("*.py"))

the chunker extracts 4 chunks:

chunk 1: FileProcessor class

  • type: class
  • base_classes: []
  • docstring: "Processes files from a directory."

chunk 2: __init__ method

  • type: method
  • parent_class: FileProcessor
  • calls: ["Path"]
  • used_imports: ["from pathlib import Path"]

chunk 3: process method

  • type: method
  • parent_class: FileProcessor
  • calls: ["os.getcwd", "self.list_files"]
  • used_imports: ["import os"]
  • docstring: "Process all files."

chunk 4: list_files method

  • type: method
  • parent_class: FileProcessor
  • calls: ["list", "self.path.glob"]
  • used_imports: ["from pathlib import Path"]
  • docstring: "List all files in the directory." - generated later by the LLM summarizer, since this method has no docstring in the source

building the relationship graph

after chunking, i need to understand how chunks relate to each other. the relationship resolver builds a graph connecting chunks through:

  • calls - when function A calls function B
  • inheritance - when class Child inherits from class Parent
  • imports - mapping imported names to their source locations

the resolver uses a symbol table to match function/class names across the codebase. it handles:

  • file-scoped lookups (file.py::function_name)
  • import resolution (mapping from foo import bar to actual code)
  • qualified names for methods (ClassName.method_name)

some relationships can't be resolved (external libraries, dynamic imports), so those are tracked separately for debugging.
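a simplified sketch of that lookup logic, using the Chunk objects from earlier (the real resolver also walks import statements):

def build_symbol_table(chunks: list[Chunk]) -> dict[str, Chunk]:
    """index chunks by file-scoped id, bare name, and qualified method name."""
    table: dict[str, Chunk] = {}
    for chunk in chunks:
        table[chunk.id] = chunk                              # "file.py::ClassName::method_name"
        table.setdefault(chunk.name, chunk)                  # bare name, first match wins
        if chunk.parent_class:
            table[f"{chunk.parent_class}.{chunk.name}"] = chunk  # "ClassName.method_name"
    return table

def resolve_call(table: dict[str, Chunk], caller: Chunk, callee: str) -> Chunk | None:
    """prefer a match in the caller's own file, then fall back to the global table."""
    scoped = f"{caller.file_path}::{callee}"
    if scoped in table:
        return table[scoped]
    return table.get(callee)  # None -> external library or dynamic import, tracked separately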

LLM summarization

each chunk gets a natural language summary generated by an LLM (using OpenRouter's API). the summarizer includes relationship context in the prompt:

  • parent class information
  • functions it calls
  • base classes it inherits from
  • existing docstrings

this context helps the LLM generate more accurate summaries. for example, a method summary will mention that it's part of a specific class and what it calls.

summaries are cached by code hash - if the code hasn't changed, we reuse the cached summary. this saves API calls and speeds up re-indexing.
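roughly, the summarizer looks like this - a sketch assuming the openai client pointed at OpenRouter (the model name and prompt wording are illustrative, not the exact ones in use):

import hashlib
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
summary_cache: dict[str, str] = {}   # code hash -> summary

def summarize(chunk: Chunk) -> str:
    code_hash = hashlib.sha256(chunk.code.encode()).hexdigest()
    if code_hash in summary_cache:
        return summary_cache[code_hash]   # unchanged code: no API call

    # fold relationship context into the prompt
    context = []
    if chunk.parent_class:
        context.append(f"method of class {chunk.parent_class}")
    if chunk.base_classes:
        context.append(f"inherits from: {', '.join(chunk.base_classes)}")
    if chunk.calls:
        context.append(f"calls: {', '.join(chunk.calls)}")
    if chunk.docstring:
        context.append(f"docstring: {chunk.docstring}")

    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",   # illustrative model name
        messages=[{"role": "user", "content":
            f"summarize this code in one or two sentences.\n"
            f"context: {'; '.join(context)}\n\n{chunk.code}"}],
    )
    summary = resp.choices[0].message.content
    summary_cache[code_hash] = summary
    return summary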

semantic search with embeddings

to enable semantic search, i embed chunk summaries using ollama's embedding model (qwen3-embedding:4b). the vector store uses cosine similarity to find chunks semantically similar to a query.

the key insight: we embed the summary (natural language), not the raw code. this works better for semantic matching because summaries describe what the code does, not how it's written.

embeddings are also cached by code hash, so unchanged code doesn't need re-embedding.
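a minimal sketch of the search path, assuming the ollama python client and an in-memory index (the real store persists vectors keyed by code hash):

import ollama
import numpy as np

def embed(text: str) -> np.ndarray:
    resp = ollama.embeddings(model="qwen3-embedding:4b", prompt=text)
    return np.array(resp["embedding"])

def search(query: str, index: list[tuple[str, np.ndarray]], top_k: int = 5):
    """rank (chunk_id, summary_vector) pairs by cosine similarity to the query."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = [(cid, float(vec @ q) / float(np.linalg.norm(vec))) for cid, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]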

the Q&A agent

the agent uses OpenRouter with tool calling to answer questions about codebases. it has access to 7 tools:

  1. semantic_search - find chunks similar to a natural language query
  2. grep_code - exact text search (for error messages, specific strings)
  3. lookup_symbol - find functions/classes by exact name
  4. get_callers - find what calls a function
  5. get_callees - find what a function calls
  6. read_file_context - read surrounding code from a file
  7. list_files - list files in the codebase

the agent uses an iterative loop: it makes tool calls, gets results, and continues until it has enough information to answer. all file operations are sandboxed to the codebase directory for security.
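the loop itself is the standard tool-calling pattern - a condensed sketch, reusing the OpenRouter client from the summarizer sketch; TOOLS shows just one of the 7 schemas and run_tool stands in for the real dispatcher:

import json

TOOLS = [{   # one of the 7 tool schemas, as an example
    "type": "function",
    "function": {
        "name": "grep_code",
        "description": "exact text search over the codebase",
        "parameters": {"type": "object",
                       "properties": {"pattern": {"type": "string"}},
                       "required": ["pattern"]},
    },
}]

def run_tool(name: str, args: dict):
    ...   # dispatch to the real (sandboxed) tool implementations

def answer(question: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="openai/gpt-4o-mini",   # illustrative
            messages=messages,
            tools=TOOLS,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content   # enough information: final answer
        messages.append(msg)
        for call in msg.tool_calls:
            result = run_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "ran out of steps"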

the API backend

the backend is a FastAPI server that exposes endpoints for:

  • /chunks - extract chunks from a directory
  • /graph - get graph data for visualization
  • /index - full indexing pipeline (chunking + summarization + embedding)
  • /index/start - start background indexing job
  • /index/status/{job_id} - check indexing progress
  • /embed - generate embeddings for chunks
  • /ask - Q&A agent endpoint
  • /file-content - get file contents with syntax highlighting

indexing can run synchronously or as a background job with progress tracking. the API caches indexed codebases in memory for fast subsequent queries.
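wiring-wise, the endpoints are thin wrappers. a sketch of the two background-job endpoints, with the pipeline itself stubbed out:

import uuid
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}   # in-memory job state + progress

class IndexRequest(BaseModel):
    directory: str

def run_index_job(job_id: str, directory: str):
    # stand-in for the real pipeline: chunking -> summarization -> embedding
    jobs[job_id] = {"status": "done", "progress": 100}

@app.post("/index/start")
def start_index(req: IndexRequest, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "progress": 0}
    background_tasks.add_task(run_index_job, job_id, req.directory)
    return {"job_id": job_id}

@app.get("/index/status/{job_id}")
def index_status(job_id: str):
    return jobs.get(job_id, {"status": "unknown"})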

the web interface

the frontend is built with Next.js and React Flow for graph visualization. it shows:

  • graph view - interactive visualization of files, chunks, and relationships
  • file browser - select which files to display in the graph
  • code preview - click any chunk to see its code with syntax highlighting (using shiki)
  • chat interface - ask questions about the codebase using the Q&A agent

the graph uses ELK (Eclipse Layout Kernel) for automatic layout, with different edge types for calls (blue), inheritance (red), and containment (lavender). you can filter by chunk type and edge type, and the chat interface shows tool calls made by the agent for transparency.

caching strategy

currently using simple hash-based caching:

  • summaries cached by code hash
  • embeddings cached by code hash

this works well for development, but for production i'd want to implement merkle tree caching (like cursor does) to only reprocess changed files. that's still on the todo list.
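the merkle idea, roughly: hash each file, hash each directory from its children's hashes, and only descend into subtrees whose hashes changed. a rough sketch of the planned approach, not what's implemented today (the codebase root path is hypothetical):

import hashlib
from pathlib import Path

def hash_tree(path: Path, hashes: dict[str, str]) -> str:
    """record a hash per path; a directory's hash covers everything under it."""
    if path.is_file():
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
    else:
        children = sorted(path.iterdir(), key=lambda p: p.name)
        digest = hashlib.sha256(
            "\n".join(f"{c.name}:{hash_tree(c, hashes)}" for c in children).encode()
        ).hexdigest()
    hashes[str(path)] = digest
    return digest

old: dict[str, str] = {}   # loaded from the previous indexing run
new: dict[str, str] = {}
hash_tree(Path("my_codebase"), new)
dirty = [p for p, h in new.items() if old.get(p) != h]   # only these need reprocessing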