ADR-004: Recursive AST-Driven Chunking
Status: PROPOSED at merge, shipped since (the status field in the full ADR predates the merge commit).
The full ADR — context, alternatives, detailed algorithm, 400+ lines of per-language
implementation notes, risk analysis, testing strategy, and migration plan — lives
in the repo:
docs/design/adr-004-recursive-ast-driven-chunking.md.
Follow-up work is tracked in
impl-004-recursive-ast-driven-chunking.md
and
review-findings/parser-bugs-from-recursive-chunking.md.
Status today
Shipped. Recursive AST-driven chunking is implemented in internal/parser/
with container/leaf node handling per language. Language configs in
internal/parser/languages.go declare NodeMeta{Kind, IsContainer} for each
chunkable node type. See Supported languages
for what's wired up today.
Context
The original parser used a whitelist-driven, two-level chunking model: top-level
chunkFile dispatched class-like nodes to chunkClass, which extracted exactly
one level of methods from the class body. It required three config maps
(ClassTypes, ClassBodyTypes, ClassBodyType) per language to control
traversal — brittle, hard to extend, and unable to handle nested structures
(e.g., methods inside nested classes, Rust modules inside modules, Ruby
singleton classes).
Decision
Replace the two-level model with a recursive, size-driven AST traversal that
uses ChunkableTypes as the sole traversal guide.
Core algorithm (paraphrased from the full ADR):
chunkNode(node):
if tokens(node) <= maxChunkTokens:
emit node as chunk (excluding comment children)
return
// Node is too big — decompose.
for each named child of node:
skip if child is a comment
if child is chunkable OR child is too big:
recurse into child
if recursion produced children:
emit a signature chunk for the parent + the extracted children
else:
fall back to splitByLines
Comments are excluded entirely — not emitted as chunks, not folded into
signatures, not collected into the file header. Rationale: doc blocks distort
search ranking; tree-sitter can't reliably distinguish doc comments from
regular comments across languages; the source is available via Read when
needed.
Key constraints
ChunkableTypesis the single source of truth for traversal. Adding a language means declaring its node types inNodeMeta— no need to pick which are "class-like".- Handles arbitrary nesting. Nested classes, Rust modules, trait impls, Ruby singleton classes all just work because any chunkable child can itself be chunkable.
- Size-driven recursion. The chunker only recurses when a node exceeds
maxChunkTokens; small nodes are emitted as-is even if they contain chunkable children. - Fallback path (
splitByLines). When a huge node has no chunkable children (e.g., enormous JSON-in-JS or a massive method body), the chunker splits by lines. Known to produce oversized chunks for pathological files — see the parser-bugs follow-up link above.
When to revisit
- Adding a language with unusual grammar shapes (check the existing parser-bugs findings first).
- If
splitByLinescontinues to produce oversized chunks that break embedding or ranking in practice. - If the comment-exclusion rule becomes an issue (e.g., someone wants docstring-aware search).