From f7ff79118866d3198cdcc6a9c59881344bd00a4a Mon Sep 17 00:00:00 2001
From: Clawd
Date: Thu, 5 Mar 2026 07:05:24 -0800
Subject: Initial design doc

---
 DESIGN.md | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 304 insertions(+)
 create mode 100644 DESIGN.md

(limited to 'DESIGN.md')

diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..0950892
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,304 @@
+# codevec
+
+**Semantic code search via embeddings**
+
+A CLI that indexes codebases for semantic search. Query by concept, get relevant code chunks with file paths and line numbers.
+
+## Problem
+
+Searching code by keywords (`grep`, `ripgrep`) misses semantic matches:
+- "authentication" won't find `verifyJWT()`
+- "handle errors" won't find `if err != nil { return }`
+- "database connection" won't find `sql.Open()`
+
+AI coding assistants spend tokens reading files to find relevant code. Pre-computed embeddings let them jump straight to what matters.
+
+## Usage
+
+```bash
+# Index current directory
+codevec index .
+
+# Query semantically
+codevec query "websocket connection handling"
+# src/relay.go:45-89 (0.87)
+# src/handler.go:102-145 (0.82)
+
+# Query with filters
+codevec query "error handling" --ext .go --limit 5
+
+# Show chunk content
+codevec query "authentication" --show
+
+# Re-index (incremental, respects .gitignore)
+codevec index . --update
+```
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                           codevec                           │
+│                                                             │
+│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐   │
+│  │  Parser  │───▶│ Chunker  │───▶│ Embedding Generator  │   │
+│  └──────────┘    └──────────┘    └──────────┬───────────┘   │
+│       │                                     │               │
+│       │ file list                           │ vectors       │
+│       ▼                                     ▼               │
+│  ┌──────────┐                        ┌──────────────┐       │
+│  │.gitignore│                        │  sqlite-vec  │       │
+│  │  filter  │                        │    index     │       │
+│  └──────────┘                        └──────────────┘       │
+└─────────────────────────────────────────────────────────────┘
+
+Storage: .codevec/
+├── index.db        # SQLite + sqlite-vec
+├── config.json     # Index settings (model, chunk size, etc.)
+└── manifest.json   # File hashes for incremental updates
+```
+
+## Chunking Strategy
+
+**Goal:** Create semantically meaningful chunks that respect code boundaries.
+
+### Approach 1: AST-Aware (preferred for supported languages)
+
+Use tree-sitter to parse and chunk by:
+- Functions/methods
+- Classes/structs
+- Top-level declarations
+
+```go
+// Chunk: function
+// File: src/auth.go:15-42
+func VerifyToken(token string) (*Claims, error) {
+	// ...
+}
+```
+
+### Approach 2: Sliding Window (fallback)
+
+For unsupported languages or when AST parsing fails:
+- Fixed-size chunks with overlap
+- Respect line boundaries
+- Include context (file path, surrounding lines)
+
+### Chunk Metadata
+
+Each chunk stores:
+```json
+{
+  "file": "src/auth.go",
+  "start_line": 15,
+  "end_line": 42,
+  "type": "function",
+  "name": "VerifyToken",
+  "content": "func VerifyToken...",
+  "hash": "abc123"
+}
+```
+
+## Database Schema
+
+```sql
+CREATE TABLE chunks (
+    id INTEGER PRIMARY KEY,
+    file TEXT NOT NULL,
+    start_line INTEGER NOT NULL,
+    end_line INTEGER NOT NULL,
+    chunk_type TEXT, -- function, class, block, etc.
+    name TEXT, -- function/class name if available
+    content TEXT NOT NULL,
+    hash TEXT NOT NULL,
+    created_at INTEGER
+);
+
+CREATE TABLE embeddings (
+    chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id),
+    embedding BLOB NOT NULL -- sqlite-vec vector
+);
+
+CREATE TABLE files (
+    path TEXT PRIMARY KEY,
+    hash TEXT NOT NULL,
+    indexed_at INTEGER
+);
+
+-- sqlite-vec virtual table for similarity search
+CREATE VIRTUAL TABLE vec_chunks USING vec0(
+    chunk_id INTEGER PRIMARY KEY,
+    embedding FLOAT[1536]
+);
+```
+
+## Embedding Generation
+
+### Options
+
+1. **OpenAI** — `text-embedding-3-small` (1536 dims, fast, cheap)
+2. **Ollama** — Local models (`nomic-embed-text`, `mxbai-embed-large`)
+3. **Voyage** — Code-specific embeddings (`voyage-code-2`)
+
+### Configuration
+
+```json
+{
+  "model": "openai:text-embedding-3-small",
+  "chunk_max_tokens": 512,
+  "chunk_overlap": 50,
+  "languages": ["go", "typescript", "python"],
+  "ignore": ["vendor/", "node_modules/", "*.min.js"]
+}
+```
+
+## CLI Commands
+
+### `codevec index `
+
+Index a directory.
+
+```
+Flags:
+  --model      Embedding model (default: openai:text-embedding-3-small)
+  --update     Incremental update (only changed files)
+  --force      Re-index everything
+  --ignore     Additional ignore patterns
+  --verbose    Show progress
+```
+
+### `codevec query `
+
+Search for relevant code.
+
+```
+Flags:
+  --limit      Max results (default: 10)
+  --threshold  Min similarity score (default: 0.5)
+  --ext        Filter by extension (.go, .ts, etc.)
+  --file       Filter by file path pattern
+  --show       Print chunk content
+  --json       Output as JSON
+```
+
+### `codevec status`
+
+Show index stats.
+
+```
+Index: .codevec/index.db
+Files: 142
+Chunks: 1,847
+Model: openai:text-embedding-3-small
+Last indexed: 2 hours ago
+```
+
+### `codevec serve`
+
+Optional: Run as HTTP server for integration with other tools.
+
+```
+GET /query?q=authentication&limit=10
+POST /index (webhook for CI)
+```
+
+## Integration with claude-flow
+
+Add a `CodeSearch` tool that shells out to codevec:
+
+```typescript
+// In claude-flow's tool definitions (execFile: promisified from node:child_process)
+{
+  name: "CodeSearch",
+  description: "Search codebase semantically. Use before Read to find relevant files.",
+  parameters: {
+    query: "string - what to search for",
+    limit: "number - max results (default 10)"
+  },
+  execute: async ({ query, limit = 10 }) => {
+    const { stdout } = await execFile("codevec", ["query", query, "--limit", String(limit), "--json"]);
+    return JSON.parse(stdout);
+  }
+}
+```
+
+Update research phase prompt:
+```
+WORKFLOW:
+1. Use CodeSearch to find relevant code for the task
+2. Use Read to examine specific files from search results
+3. Write findings to research.md
+```
+
+## Incremental Updates
+
+Track file hashes to avoid re-indexing unchanged files:
+
+```json
+// .codevec/manifest.json
+{
+  "src/auth.go": "sha256:abc123...",
+  "src/handler.go": "sha256:def456..."
+}
+```
+
+On `codevec index --update`:
+1. Walk directory
+2. Compare hashes
+3. Re-chunk and re-embed only changed files
+4. Delete chunks from removed files
+
+## Language Support
+
+**Phase 1 (tree-sitter):**
+- Go
+- TypeScript/JavaScript
+- Python
+
+**Phase 2:**
+- Rust
+- C/C++
+- Java
+
+**Fallback:**
+- Sliding window for any text file
+
+## Tech Stack
+
+- **Language:** Go
+- **Embeddings:** OpenAI API (default), Ollama (local)
+- **Storage:** SQLite + sqlite-vec
+- **Parsing:** tree-sitter (via Go bindings)
+
+## Open Questions
+
+1. **Chunk size vs context:** Bigger chunks = more context but less precise. Smaller = precise but may miss context.
+2. **Include comments?** They're semantically rich but noisy.
+3. **Cross-file relationships:** Should we embed import graphs or call relationships?
+4. **Cost:** OpenAI embeddings are cheap but not free. Cache aggressively.
+
+## Prior Art
+
+- **Sourcegraph Cody** — Similar concept, proprietary
+- **Cursor** — IDE with semantic codebase understanding
+- **Bloop** — Open-source semantic code search
+- **Greptile** — API for codebase understanding
+
+## Next Steps
+
+1. [ ] Basic CLI skeleton (index, query, status)
+2. [ ] sqlite-vec integration
+3. [ ] OpenAI embedding generation
+4. [ ] File walking with .gitignore respect
+5. [ ] Sliding window chunker (MVP)
+6. [ ] Tree-sitter chunker for Go
+7. [ ] Incremental updates
+8. [ ] claude-flow integration
+
+---
+
+## Dependencies
+
+- `github.com/asg017/sqlite-vec-go-bindings` — sqlite-vec
+- `github.com/smacker/go-tree-sitter` — tree-sitter (optional)
+- OpenAI API or Ollama for embeddings
--
cgit v1.2.3
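
Note on the "Incremental Updates" section of DESIGN.md: the four steps listed for `codevec index --update` (walk, compare hashes, re-embed changed files, delete chunks from removed files) come down to a diff between two path-to-hash maps. The following is a minimal Go sketch of that comparison, not code from this patch. The names `Manifest`, `hashContent`, and `diffManifests` are hypothetical, the `sha256:` prefix is taken from the manifest example, and a real implementation would hash file contents read from disk rather than the inline strings used here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Manifest maps a file path to its content hash, mirroring the
// manifest.json example in DESIGN.md. The type name is an assumption.
type Manifest map[string]string

// hashContent returns a "sha256:<hex>" digest in the manifest's format.
func hashContent(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// diffManifests compares the stored manifest with freshly computed hashes.
// changed lists new or modified paths (to re-chunk and re-embed);
// removed lists paths that no longer exist (their chunks get deleted).
func diffManifests(old, current Manifest) (changed, removed []string) {
	for path, h := range current {
		// A missing key yields "", which never equals a real hash,
		// so new files are reported as changed too.
		if old[path] != h {
			changed = append(changed, path)
		}
	}
	for path := range old {
		if _, ok := current[path]; !ok {
			removed = append(removed, path)
		}
	}
	// Map iteration order is random; sort for stable output.
	sort.Strings(changed)
	sort.Strings(removed)
	return changed, removed
}

func main() {
	old := Manifest{
		"src/auth.go":    hashContent([]byte("package auth // v1")),
		"src/handler.go": hashContent([]byte("package handler")),
		"src/legacy.go":  hashContent([]byte("package legacy")),
	}
	current := Manifest{
		"src/auth.go":    hashContent([]byte("package auth // v2")), // modified
		"src/handler.go": hashContent([]byte("package handler")),    // unchanged
		"src/relay.go":   hashContent([]byte("package relay")),      // new
	}
	changed, removed := diffManifests(old, current)
	fmt.Println("re-embed:", changed)
	fmt.Println("delete:", removed)
}
```

Run as written, this prints `re-embed: [src/auth.go src/relay.go]` followed by `delete: [src/legacy.go]`. Treating new files as changed keeps the caller down to two lists, which seems to match steps 3 and 4 of the update procedure.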