see screenshots of the app here
currently building a web app that takes a bug issue from github and helps a dev find the relevant code in the codebase to fix the bug.
first we need to chunk the codebase
you can read about tree-sitter concepts here.
the first step in indexing a codebase is breaking it into meaningful pieces. i'm using tree-sitter to parse python code and extract "chunks" - functions, classes, methods, and module-level variables.
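as a rough sketch of that extraction step (not the actual chunker), here's how tree-sitter's python bindings can walk module-level definitions. this assumes the `tree_sitter` and `tree_sitter_python` packages; depending on your bindings version you may need `parser.set_language(...)` instead of passing the language to the constructor:

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

source = open("file.py", "rb").read()
tree = parser.parse(source)

# walk top-level nodes and report functions/classes with their line spans
for node in tree.root_node.children:
    if node.type in ("function_definition", "class_definition"):
        name = node.child_by_field_name("name").text.decode()
        print(node.type, name, node.start_point[0] + 1, node.end_point[0] + 1)
```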
what each chunk captures
each chunk is more than just code. it's a structured object with rich metadata:
```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str                    # "file.py::ClassName::method_name"
    type: str                  # "function", "method", "class", "module_var"
    name: str                  # human-readable name
    code: str                  # actual source code
    file_path: str             # where it lives
    start_line: int            # for navigation
    end_line: int
    parent_class: str | None   # distinguishes methods from functions
    imports: list[str]         # all file imports
    calls: list[str]           # functions/methods called within
    base_classes: list[str]    # for inheritance tracking
    docstring: str | None      # for semantic context
    used_imports: list[str]    # only imports actually referenced
```

why this metadata matters
for bug triage, i need to trace execution flow and understand code relationships:
- call graphs - if a bug mentions `process_data()`, i need to know what it calls
- inheritance - errors in `ChildClass` might come from `ParentClass`
- import resolution - map `from foo import bar` to actual code locations
- docstrings - provide semantic context for LLM summarization later
- line numbers - link directly to github for developers
i also think showing git diffs of the file in the web app will help the dev understand recent changes and the surrounding context of the code.
parsing example
here's a simple python file:
```python
import os
from pathlib import Path

class FileProcessor:
    """Processes files from a directory."""

    def __init__(self, path: str):
        self.path = Path(path)

    def process(self):
        """Process all files."""
        current = os.getcwd()
        files = self.list_files()
        return files

    def list_files(self):
        return list(self.path.glob("*.py"))
```

the chunker extracts 4 chunks:
chunk 1: FileProcessor class
- type: `class`
- base_classes: `[]`
- docstring: `"Processes files from a directory."`

chunk 2: `__init__` method
- type: `method`
- parent_class: `FileProcessor`
- calls: `["Path"]`
- used_imports: `["from pathlib import Path"]`

chunk 3: process method
- type: `method`
- parent_class: `FileProcessor`
- calls: `["os.getcwd", "self.list_files"]`
- used_imports: `["import os"]`
- docstring: `"Process all files."`

chunk 4: list_files method
- type: `method`
- parent_class: `FileProcessor`
- calls: `["list", "self.path.glob"]`
- used_imports: `["from pathlib import Path"]`
- docstring: `"List all files in the directory."` - this one is generated by the LLM summarizer, since it didn't exist in the source code
building the relationship graph
after chunking, i need to understand how chunks relate to each other. the relationship resolver builds a graph connecting chunks through:
- calls - when function A calls function B
- inheritance - when class Child inherits from class Parent
- imports - mapping imported names to their source locations
the resolver uses a symbol table to match function/class names across the codebase. it handles:
- file-scoped lookups (`file.py::function_name`)
- import resolution (mapping `from foo import bar` to actual code)
- qualified names for methods (`ClassName.method_name`)
some relationships can't be resolved (external libraries, dynamic imports), so those are tracked separately for debugging.
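a minimal sketch of how that lookup might work, reusing the `Chunk` fields from earlier (the helper names here are illustrative, not the actual implementation):

```python
def build_symbol_table(chunks: list[Chunk]) -> dict[str, str]:
    table = {}
    for c in chunks:
        table[c.name] = c.id                             # global name
        table[f"{c.file_path}::{c.name}"] = c.id         # file-scoped lookup
        if c.parent_class:
            table[f"{c.parent_class}.{c.name}"] = c.id   # qualified method name
    return table

def resolve_call(call: str, caller: Chunk, table: dict[str, str]) -> str | None:
    # prefer a match in the caller's own file, then fall back to the global name;
    # None means an external library or dynamic import - tracked separately
    return table.get(f"{caller.file_path}::{call}") or table.get(call)
```

import resolution follows the same table-lookup pattern: the chunk's `imports` list tells the resolver which file a name like `bar` in `from foo import bar` should point at.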
LLM summarization
each chunk gets a natural language summary generated by an LLM (using OpenRouter's API). the summarizer includes relationship context in the prompt:
- parent class information
- functions it calls
- base classes it inherits from
- existing docstrings
this context helps the LLM generate more accurate summaries. for example, a method summary will mention that it's part of a specific class and what it calls.
summaries are cached by code hash - if the code hasn't changed, we reuse the cached summary. this saves API calls and speeds up re-indexing.
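the cache itself can be as simple as a dict keyed on a hash of the chunk's source, something like this sketch (the `summarize` callable stands in for the OpenRouter request):

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".cache/summaries.json")

def cached_summary(chunk: Chunk, summarize) -> str:
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = hashlib.sha256(chunk.code.encode()).hexdigest()
    if key not in cache:  # only call the LLM on a cache miss
        cache[key] = summarize(chunk)
        CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
        CACHE_FILE.write_text(json.dumps(cache))
    return cache[key]
```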
semantic search with embeddings
to enable semantic search, i embed chunk summaries using ollama's embedding model (qwen3-embedding:4b). the vector store uses cosine similarity to find chunks semantically similar to a query.
the key insight: we embed the summary (natural language), not the raw code. this works better for semantic matching because summaries describe what the code does, not how it's written.
embeddings are also cached by code hash, so unchanged code doesn't need re-embedding.
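roughly, the search path looks like this (assuming the official `ollama` python client and numpy; function names are illustrative):

```python
import numpy as np
import ollama

def embed(text: str) -> np.ndarray:
    resp = ollama.embeddings(model="qwen3-embedding:4b", prompt=text)
    return np.array(resp["embedding"])

def top_k(query: str, summary_vectors: np.ndarray, k: int = 5) -> np.ndarray:
    q = embed(query)
    # cosine similarity: dot product divided by the product of L2 norms
    sims = summary_vectors @ q / (np.linalg.norm(summary_vectors, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]  # indices of the k most similar chunk summaries
```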
the Q&A agent
the agent uses OpenRouter with tool calling to answer questions about codebases. it has access to 7 tools:
- semantic_search - find chunks similar to a natural language query
- grep_code - exact text search (for error messages, specific strings)
- lookup_symbol - find functions/classes by exact name
- get_callers - find what calls a function
- get_callees - find what a function calls
- read_file_context - read surrounding code from a file
- list_files - list files in the codebase
the agent uses an iterative loop: it makes tool calls, gets results, and continues until it has enough information to answer. all file operations are sandboxed to the codebase directory for security.
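the loop itself is the standard OpenAI-style tool-calling pattern (OpenRouter is API-compatible). a trimmed sketch, with `tool_impls` mapping tool names to python functions and the model name left as a placeholder:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

def run_agent(question: str, tools: list[dict], tool_impls: dict, model: str) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(20):  # cap iterations so a confused model can't loop forever
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:  # no more tools requested: final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:  # execute each tool and feed the result back
            args = json.loads(call.function.arguments)
            result = tool_impls[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return "ran out of iterations"
```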
the API backend
the backend is a FastAPI server that exposes endpoints for:
- `/chunks` - extract chunks from a directory
- `/graph` - get graph data for visualization
- `/index` - full indexing pipeline (chunking + summarization + embedding)
- `/index/start` - start a background indexing job
- `/index/status/{job_id}` - check indexing progress
- `/embed` - generate embeddings for chunks
- `/ask` - Q&A agent endpoint
- `/file-content` - get file contents with syntax highlighting
indexing can run synchronously or as a background job with progress tracking. the API caches indexed codebases in memory for fast subsequent queries.
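the background-job pattern maps cleanly onto FastAPI's `BackgroundTasks`. a minimal sketch with an in-memory job store (the `run_indexing` body is a placeholder for the real pipeline):

```python
import uuid
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
JOBS: dict[str, dict] = {}  # in-memory job store, fine for a single process

def run_indexing(path: str, job_id: str) -> None:
    # placeholder: the real pipeline would chunk, summarize, and embed here,
    # updating JOBS[job_id]["progress"] as it goes
    JOBS[job_id] = {"status": "done", "progress": 100}

@app.post("/index/start")
def start_index(path: str, background: BackgroundTasks) -> dict:
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "progress": 0}
    background.add_task(run_indexing, path, job_id)
    return {"job_id": job_id}

@app.get("/index/status/{job_id}")
def index_status(job_id: str) -> dict:
    return JOBS.get(job_id, {"status": "unknown"})
```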
the web interface
the frontend is built with Next.js and React Flow for graph visualization. it shows:
- graph view - interactive visualization of files, chunks, and relationships
- file browser - select which files to display in the graph
- code preview - click any chunk to see its code with syntax highlighting (using shiki)
- chat interface - ask questions about the codebase using the Q&A agent
the graph uses ELK (Eclipse Layout Kernel) for automatic layout, with different edge types for calls (blue), inheritance (red), and containment (lavender). you can filter by chunk type and edge type, and the chat interface shows tool calls made by the agent for transparency.
caching strategy
currently using simple hash-based caching:
- summaries cached by code hash
- embeddings cached by code hash
this works well for development, but for production i'd want to implement merkle tree caching (like cursor does) to only reprocess changed files. that's still on the todo list.