i'm working on a codebase indexing project that requires me to split source code into chunks. you can read more about the project here. tree-sitter turned out to be the perfect tool for this.
what is tree-sitter?
tree-sitter is a multi-language incremental parsing library and a parser generator tool. it combines three key concepts:
- parser generator tool - takes the grammar rules for a language and generates the source code of a parser for it
- incremental parsing library - efficiently reparses code by reusing previous results
- concrete syntax tree (CST) - preserves all tokens including whitespace and comments
how does it work?
tree-sitter takes your source code and:
- tokenizes it and groups the tokens into syntax nodes (e.g., module, function_definition, expression_statement)
- builds a tree where each node has a type defined by the language grammar
- allows you to traverse the tree to extract or analyze code structure
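the traversal step can be sketched without the library at all. this is a minimal illustration, not tree-sitter's API: the Node class below is a hypothetical stand-in that only copies the idea of a typed node with children and [row, column] positions, and find_nodes is a name made up for this example. the tree is built by hand to look like a parsed module with two functions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a syntax-tree node (not the tree-sitter class)."""
    type: str
    start: tuple  # (row, column), zero-based
    end: tuple    # (row, column), zero-based
    children: list = field(default_factory=list)


def find_nodes(root, wanted):
    """Depth-first walk that collects every node whose type equals `wanted`."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type == wanted:
            found.append(node)
        # Push children reversed so they are visited in source order.
        stack.extend(reversed(node.children))
    return found


# Hand-built tree shaped like a parsed module with two top-level functions.
tree = Node("module", (0, 0), (5, 0), [
    Node("function_definition", (0, 0), (1, 8), [
        Node("identifier", (0, 4), (0, 5)),
    ]),
    Node("expression_statement", (2, 0), (2, 3)),
    Node("function_definition", (3, 0), (4, 8), [
        Node("identifier", (3, 4), (3, 5)),
    ]),
])

# One (start, end) range per function: a natural unit for a code chunk.
chunks = [(n.start, n.end) for n in find_nodes(tree, "function_definition")]
```

for chunking, the useful part is that every function comes back with its own position range, so each one can be cut out of the file as a separate chunk.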
why incremental?
the incremental parsing feature is what makes tree-sitter special:
- reuses the previously parsed tree so it doesn't need to reparse the entire file
- tracks which byte ranges changed so only the affected parts of the tree are updated
- makes it fast enough for real-time editor features
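to make "tracks byte changes" concrete, here is a small sketch of what an edit description looks like. it assumes the edit is found by diffing the old and new text, which is only for illustration: a real editor already knows where the user typed. describe_edit and point_at are hypothetical helper names; the dictionary keys mirror the arguments that the python bindings' Tree.edit method takes, as far as i know.

```python
def point_at(source, byte_offset):
    """Convert a byte offset into a zero-based (row, column) pair; column is in bytes."""
    before = source[:byte_offset]
    row = before.count(b"\n")
    column = byte_offset - (before.rfind(b"\n") + 1)
    return (row, column)


def describe_edit(old, new):
    """Find the smallest changed byte range between two versions of a file."""
    # Length of the common prefix.
    start = 0
    limit = min(len(old), len(new))
    while start < limit and old[start] == new[start]:
        start += 1
    # Length of the common suffix, never overlapping the prefix.
    suffix = 0
    while (suffix < limit - start
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    old_end = len(old) - suffix
    new_end = len(new) - suffix
    return {
        "start_byte": start,
        "old_end_byte": old_end,
        "new_end_byte": new_end,
        "start_point": point_at(old, start),
        "old_end_point": point_at(old, old_end),
        "new_end_point": point_at(new, new_end),
    }


old_source = b"def hello(name):\n    print(name)\n"
new_source = b"def hello(name):\n    print(name.upper())\n"
edit = describe_edit(old_source, new_source)
# Here the edit is a pure insertion: start_byte equals old_end_byte.
```

with the real library, a description like this is applied to the old tree first, and the old tree is then passed along with the new source when parsing again, so that unchanged subtrees can be reused.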
example: parsing python code
let's see what tree-sitter output actually looks like. here's a simple python function:
def hello(name):
         print(f"Hello {name}")
and here's the tree that tree-sitter generates:
module [0, 0] - [2, 0]
  function_definition [0, 0] - [1, 31]
    name: identifier [0, 4] - [0, 9]
    parameters: parameters [0, 9] - [0, 15]
      identifier [0, 10] - [0, 14]
    body: block [1, 9] - [1, 31]
      expression_statement [1, 9] - [1, 31]
        call [1, 9] - [1, 31]
          function: identifier [1, 9] - [1, 14]
          arguments: argument_list [1, 14] - [1, 31]
            string [1, 15] - [1, 30]
              string_start [1, 15] - [1, 17]
              string_content [1, 17] - [1, 23]
              interpolation [1, 23] - [1, 29]
                expression: identifier [1, 24] - [1, 28]
              string_end [1, 29] - [1, 30]
each line shows:
- node type - what kind of syntax element it is
- position - [row, column] ranges showing where it appears in the source (rows and columns are zero-based, and the end of a range is exclusive)
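a quick way to check how to read these ranges is to slice the source with them. the sketch below uses only the standard library; text_for_range is a hypothetical helper, and it assumes the body line is indented nine spaces, which is what the column numbers in the tree imply. tree-sitter counts columns in bytes, which is the same as characters for this ASCII snippet.

```python
def text_for_range(source, start, end):
    """Return the text covered by a zero-based [row, column] range (end exclusive)."""
    lines = source.splitlines(keepends=True)
    (start_row, start_col), (end_row, end_col) = start, end
    if start_row == end_row:
        return lines[start_row][start_col:end_col]
    middle = lines[start_row + 1:end_row]
    # The end row may sit one past the last line, e.g. [2, 0] for a 2-line file.
    last = lines[end_row][:end_col] if end_row < len(lines) else ""
    return lines[start_row][start_col:] + "".join(middle) + last


# Assumption: nine spaces of indentation, matching body: block [1, 9] in the tree.
source = 'def hello(name):\n' + ' ' * 9 + 'print(f"Hello {name}")\n'

name = text_for_range(source, (0, 4), (0, 9))            # name: identifier
callee = text_for_range(source, (1, 9), (1, 14))         # function: identifier
interpolation = text_for_range(source, (1, 23), (1, 29)) # interpolation
whole_module = text_for_range(source, (0, 0), (2, 0))    # module
```

the same slicing is what turns a node into a chunk: once you have a node's range, the chunk is just the text between its start and end positions.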