# codevec

**Semantic code search via embeddings**

A CLI that indexes codebases for semantic search. Query by concept, get relevant code chunks with file paths and line numbers.

## Problem

Searching code by keywords (`grep`, `ripgrep`) misses semantic matches:

- "authentication" won't find `verifyJWT()`
- "handle errors" won't find `if err != nil { return }`
- "database connection" won't find `sql.Open()`

AI coding assistants spend tokens reading files to find relevant code. Pre-computed embeddings let them jump straight to what matters.

## Usage

```bash
# Index current directory
codevec index .

# Query semantically
codevec query "websocket connection handling"
# src/relay.go:45-89 (0.87)
# src/handler.go:102-145 (0.82)

# Query with filters
codevec query "error handling" --ext .go --limit 5

# Show chunk content
codevec query "authentication" --show

# Re-index (incremental, respects .gitignore)
codevec index . --update
```

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                           codevec                           │
│                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐   │
│  │  Parser  │───▶│ Chunker  │───▶│ Embedding Generator  │   │
│  └──────────┘    └──────────┘    └──────────┬───────────┘   │
│       │                                     │               │
│       │ file list                           │ vectors       │
│       ▼                                     ▼               │
│  ┌──────────┐                        ┌──────────────┐       │
│  │.gitignore│                        │  sqlite-vec  │       │
│  │  filter  │                        │    index     │       │
│  └──────────┘                        └──────────────┘       │
└─────────────────────────────────────────────────────────────┘

Storage: .codevec/
├── index.db        # SQLite + sqlite-vec
├── config.json     # Index settings (model, chunk size, etc.)
└── manifest.json   # File hashes for incremental updates
```

## Chunking Strategy

**Goal:** Create semantically meaningful chunks that respect code boundaries.

### Approach 1: AST-Aware (preferred for supported languages)

Use tree-sitter to parse and chunk by:

- Functions/methods
- Classes/structs
- Top-level declarations

```go
// Chunk: function
// File: src/auth.go:15-42
func VerifyToken(token string) (*Claims, error) {
    // ...
}
```

### Approach 2: Sliding Window (fallback)

For unsupported languages or when AST parsing fails:

- Fixed-size chunks with overlap
- Respect line boundaries
- Include context (file path, surrounding lines)

### Chunk Metadata

Each chunk stores:

```json
{
  "file": "src/auth.go",
  "start_line": 15,
  "end_line": 42,
  "type": "function",
  "name": "VerifyToken",
  "content": "func VerifyToken...",
  "hash": "abc123"
}
```

## Database Schema

```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    chunk_type TEXT,        -- function, class, block, etc.
    name TEXT,              -- function/class name if available
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    created_at INTEGER
);

CREATE TABLE embeddings (
    chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id),
    embedding BLOB NOT NULL -- sqlite-vec vector
);

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    indexed_at INTEGER
);

-- sqlite-vec virtual table for similarity search
CREATE VIRTUAL TABLE vec_chunks USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[1536]
);
```

## Embedding Generation

### Options

1. **OpenAI**: `text-embedding-3-small` (1536 dims, fast, cheap)
2. **Ollama**: Local models (`nomic-embed-text`, `mxbai-embed-large`)
3. **Voyage**: Code-specific embeddings (`voyage-code-2`)

### Configuration

```json
{
  "model": "openai:text-embedding-3-small",
  "chunk_max_tokens": 512,
  "chunk_overlap": 50,
  "languages": ["go", "typescript", "python"],
  "ignore": ["vendor/", "node_modules/", "*.min.js"]
}
```

## CLI Commands

### `codevec index <path>`

Index a directory.

```
Flags:
  --model      Embedding model (default: openai:text-embedding-3-small)
  --update     Incremental update (only changed files)
  --force      Re-index everything
  --ignore     Additional ignore patterns
  --verbose    Show progress
```

### `codevec query <query>`

Search for relevant code.

```
Flags:
  --limit      Max results (default: 10)
  --threshold  Min similarity score (default: 0.5)
  --ext        Filter by extension (.go, .ts, etc.)
  --file       Filter by file path pattern
  --show       Print chunk content
  --json       Output as JSON
```

### `codevec status`

Show index stats.

```
Index: .codevec/index.db
Files: 142
Chunks: 1,847
Model: openai:text-embedding-3-small
Last indexed: 2 hours ago
```

### `codevec serve`

Optional: Run as HTTP server for integration with other tools.

```
GET  /query?q=authentication&limit=10
POST /index  (webhook for CI)
```

## Integration with claude-flow

Add a `CodeSearch` tool that shells out to codevec:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// In claude-flow's tool definitions
const codeSearchTool = {
  name: "CodeSearch",
  description: "Search codebase semantically. Use before Read to find relevant files.",
  parameters: {
    query: "string - what to search for",
    limit: "number - max results (default 10)"
  },
  execute: async ({ query, limit = 10 }: { query: string; limit?: number }) => {
    // Pass arguments as an array so the query text never goes through a shell
    const { stdout } = await execFileAsync("codevec", [
      "query", query, "--limit", String(limit), "--json"
    ]);
    return JSON.parse(stdout);
  }
};
```

Update research phase prompt:

```
WORKFLOW:
1. Use CodeSearch to find relevant code for the task
2. Use Read to examine specific files from search results
3. Write findings to research.md
```

## Incremental Updates

Track file hashes to avoid re-indexing unchanged files:

```json
// .codevec/manifest.json
{
  "src/auth.go": "sha256:abc123...",
  "src/handler.go": "sha256:def456..."
}
```

On `codevec index . --update`:

1. Walk directory
2. Compare hashes
3. Re-chunk and re-embed only changed files
4. Delete chunks from removed files

## Language Support

**Phase 1 (tree-sitter):**

- Go
- TypeScript/JavaScript
- Python

**Phase 2:**

- Rust
- C/C++
- Java

**Fallback:**

- Sliding window for any text file

## Tech Stack

- **Language:** Go
- **Embeddings:** OpenAI API (default), Ollama (local)
- **Storage:** SQLite + sqlite-vec
- **Parsing:** tree-sitter (via Go bindings)

## Open Questions

1. **Chunk size vs context:** Bigger chunks carry more context but match less precisely. Smaller chunks match precisely but may miss context.
2. **Include comments?** They're semantically rich but noisy.
3. **Cross-file relationships:** Should we embed import graphs or call relationships?
4. **Cost:** OpenAI embeddings are cheap but not free. Cache aggressively.

## Prior Art

- **Sourcegraph Cody**: Similar concept, proprietary
- **Cursor**: IDE with semantic codebase understanding
- **Bloop**: Open-source semantic code search
- **Greptile**: API for codebase understanding

## Next Steps

1. [ ] Basic CLI skeleton (index, query, status)
2. [ ] sqlite-vec integration
3. [ ] OpenAI embedding generation
4. [ ] File walking with .gitignore respect
5. [ ] Sliding window chunker (MVP)
6. [ ] Tree-sitter chunker for Go
7. [ ] Incremental updates
8. [ ] claude-flow integration

---

## Dependencies

- `github.com/asg017/sqlite-vec-go-bindings`: sqlite-vec
- `github.com/smacker/go-tree-sitter`: tree-sitter (optional)
- OpenAI API or Ollama for embeddings