ChunkHound — Local AI Knowledge Base for Your Code
https://github.com/chunkhound/chunkhound
ChunkHound is a local-first tool (the code stays on your machine) that turns your repository into a “knowledge base” for AI. It offers semantic search (“where is our authorization?”), regex search (exact matches), and a Code Research mode: an “exploration” of the codebase that builds a structured report on the architecture and component relationships.
It looks very promising for use in AI IDEs. It’s a direct analog of Cursor’s indexing, but open-source and local. In theory, one could even set up a remote MCP server for an entire company, provided it can handle repository versioning.
Imagine having “smart project search”:
- Standard search (grep/ripgrep) looks for letters/words.
- Semantic search looks for meaning: you type “token validation,” and it finds the code even if the word “token” isn’t there.
- Code Research is an “analyst mode”: not just a list of files, but an explanation of how everything works, with links to specific locations in the code.
How It Works
- Indexing: ChunkHound scans the project and breaks files into pieces (“chunks”).
- Smart Code Chunking (cAST): instead of “every N characters,” it tries to preserve the structure of functions and classes. This is based on research into cAST (chunking via syntax trees).
- Embeddings: a “meaning vector” is built for each chunk (and for your query) to find the closest matches semantically.
- Search:
  - Semantic (by meaning),
  - Regex (exact pattern),
  - Optional multi-hop: it expands the search to “neighboring related topics” to build the full picture (e.g., “auth” → password hashing → sessions → logging).
- Integration with Assistants via MCP: ChunkHound runs as an MCP server, and the IDE/assistant calls its tools.
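The embedding-and-search steps above can be sketched in a few lines. This is a toy illustration, not ChunkHound’s pipeline: a real system uses a neural embedding model, while here a character-trigram bag stands in as the “meaning vector” so the example stays self-contained and runnable.

```python
# Toy sketch of "embed each chunk, embed the query, rank by similarity".
# The trigram embedding is a deliberate simplification, NOT what
# ChunkHound actually does.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "meaning vector": counts of 3-character substrings.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

chunks = [
    "def validate_token(jwt): check signature and expiry",
    "def render_page(template): return html",
    "def hash_password(pw): return bcrypt hash",
]
# "token" never appears as a standalone word, yet the overlap with
# validate_token still ranks that chunk first.
print(semantic_search("token validation", chunks, top_k=1))
```

Even this crude stand-in shows why the approach works: similarity is computed over representations of meaning-bearing fragments, not over exact words.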
Key Facts
1) The Main Feature — Chunk Quality
In RAG systems, things often break at the most basic level: the code is cut poorly → the search returns fragments → the AI “understands” them incorrectly. ChunkHound relies on cAST structural chunking, which noticeably improves both retrieval quality and the answers generated from it.
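The structural-chunking idea can be sketched with Python’s stdlib `ast` module: split a file into one chunk per top-level function or class instead of every N characters. This is only a rough, single-language sketch in the spirit of cAST, not ChunkHound’s implementation (which works across many languages via syntax trees).

```python
# Structure-aware chunking sketch: one chunk per top-level def/class.
# Uses Python's stdlib `ast` so the example stays self-contained.
import ast

def chunk_by_ast(source: str) -> list[str]:
    """Split Python source into chunks that follow code structure."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno give the node's span (1-based, inclusive).
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

source = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
for chunk in chunk_by_ast(source):
    print(chunk, "\n---")
```

Each chunk is a complete semantic unit, so an embedding built from it describes one coherent thing rather than a function cut off mid-body.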
2) “Semantics + Regex” = Smart and Reliable
Semantics is great for “find where error logging happens,” but sometimes you need “find all calls to validateUser.” Regex search covers that exact-match scenario reliably and doesn’t require any API keys.
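The regex side needs no embeddings at all. A minimal sketch, using the hypothetical `validateUser` example from above (this is plain `re` over file contents, not ChunkHound’s search engine):

```python
# Exact-match search sketch: find call sites of validateUser.
# \b keeps validateUserForm from matching; \s*\( requires a call.
import re

PATTERN = re.compile(r"\bvalidateUser\s*\(")

def regex_search(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (filename, line number) for every matching line."""
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((name, lineno))
    return hits

files = {
    "auth.js": "function validateUser(u) {}\nvalidateUser(current);\n",
    "ui.js": "render(validateUserForm);\n",  # different identifier: no match
}
print(regex_search(files))  # → [('auth.js', 1), ('auth.js', 2)]
```

This is the “rock-solid” mode: deterministic, offline, and precise about identifiers, where semantic search would be fuzzy.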
3) Two Levels: Fast Search and “Architecture Research”
The project clearly separates the basic search layer (semantics/regex) and the “orchestration” (Code Research), which performs multi-pass exploration and writes a report.
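The multi-pass idea behind Code Research can be sketched as a breadth-first expansion over related topics. The topic graph and the traversal below are toy stand-ins for illustration, not ChunkHound internals:

```python
# Hedged sketch of multi-hop expansion: start from one topic, pull in
# "neighboring related topics" pass by pass, and collect them in order.
from collections import deque

RELATED = {  # hypothetical "neighboring topics" map
    "auth": ["password hashing", "sessions"],
    "sessions": ["logging"],
}

def research(start: str, max_hops: int = 2) -> list[str]:
    """Breadth-first expansion over related topics, up to max_hops away."""
    seen, order = {start}, [start]
    queue = deque([(start, 0)])
    while queue:
        topic, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted for this branch
        for nxt in RELATED.get(topic, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order

print(research("auth"))
# → ['auth', 'password hashing', 'sessions', 'logging']
```

In the real tool, each visited “topic” would trigger searches and feed findings into the final report; the sketch only shows the orchestration shape.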
4) Local-first — Good for Private Repos and Speed
The project emphasizes “the code stays local,” with storage and searching handled by a local DB (DuckDB plus a vector index).
5) The Project is Alive and Evolving Fast
Open issues show the team is still polishing stability: problems with Ollama reranking, MCP disconnects, and timeouts on large files.