| author | Clawd <ai@clawd.bot> | 2026-03-05 07:05:24 -0800 |
|---|---|---|
| committer | Clawd <ai@clawd.bot> | 2026-03-05 07:05:24 -0800 |
| commit | f7ff79118866d3198cdcc6a9c59881344bd00a4a (patch) | |
| tree | bee114d456cd3744956c05aeeea51de368102a44 | |
Initial design doc
| -rw-r--r-- | DESIGN.md | 304 | ||||
| -rw-r--r-- | README.md | 34 |
2 files changed, 338 insertions, 0 deletions
diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..0950892
--- /dev/null
+++ b/DESIGN.md
| @@ -0,0 +1,304 @@ | |||
| 1 | # codevec | ||
| 2 | |||
| 3 | **Semantic code search via embeddings** | ||
| 4 | |||
| 5 | A CLI that indexes codebases for semantic search. Query by concept, get relevant code chunks with file paths and line numbers. | ||
| 6 | |||
| 7 | ## Problem | ||
| 8 | |||
| 9 | Searching code by keywords (`grep`, `ripgrep`) misses semantic matches: | ||
| 10 | - "authentication" won't find `verifyJWT()` | ||
| 11 | - "handle errors" won't find `if err != nil { return }` | ||
| 12 | - "database connection" won't find `sql.Open()` | ||
| 13 | |||
| 14 | AI coding assistants spend tokens reading files to find relevant code. Pre-computed embeddings let them jump straight to what matters. | ||
| 15 | |||
| 16 | ## Usage | ||
| 17 | |||
| 18 | ```bash | ||
| 19 | # Index current directory | ||
| 20 | codevec index . | ||
| 21 | |||
| 22 | # Query semantically | ||
| 23 | codevec query "websocket connection handling" | ||
| 24 | # src/relay.go:45-89 (0.87) | ||
| 25 | # src/handler.go:102-145 (0.82) | ||
| 26 | |||
| 27 | # Query with filters | ||
| 28 | codevec query "error handling" --ext .go --limit 5 | ||
| 29 | |||
| 30 | # Show chunk content | ||
| 31 | codevec query "authentication" --show | ||
| 32 | |||
| 33 | # Re-index (incremental, respects .gitignore) | ||
| 34 | codevec index . --update | ||
| 35 | ``` | ||
| 36 | |||
| 37 | ## Architecture | ||
| 38 | |||
| 39 | ``` | ||
| 40 | ┌─────────────────────────────────────────────────────────────┐ | ||
| 41 | │ codevec │ | ||
| 42 | │ │ | ||
| 43 | │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │ | ||
| 44 | │ │ Parser │───▶│ Chunker │───▶│ Embedding Generator │ │ | ||
| 45 | │ └──────────┘ └──────────┘ └──────────┬───────────┘ │ | ||
| 46 | │ │ │ │ | ||
| 47 | │ │ file list │ vectors │ | ||
| 48 | │ ▼ ▼ │ | ||
| 49 | │ ┌──────────┐ ┌──────────────┐ │ | ||
| 50 | │ │ .gitignore│ │ sqlite-vec │ │ | ||
| 51 | │ │ filter │ │ index │ │ | ||
| 52 | │ └──────────┘ └──────────────┘ │ | ||
| 53 | └─────────────────────────────────────────────────────────────┘ | ||
| 54 | |||
| 55 | Storage: .codevec/ | ||
| 56 | ├── index.db # SQLite + sqlite-vec | ||
| 57 | ├── config.json # Index settings (model, chunk size, etc.) | ||
| 58 | └── manifest.json # File hashes for incremental updates | ||
| 59 | ``` | ||
| 60 | |||
| 61 | ## Chunking Strategy | ||
| 62 | |||
| 63 | **Goal:** Create semantically meaningful chunks that respect code boundaries. | ||
| 64 | |||
| 65 | ### Approach 1: AST-Aware (preferred for supported languages) | ||
| 66 | |||
| 67 | Use tree-sitter to parse and chunk by: | ||
| 68 | - Functions/methods | ||
| 69 | - Classes/structs | ||
| 70 | - Top-level declarations | ||
| 71 | |||
| 72 | ```go | ||
| 73 | // Chunk: function | ||
| 74 | // File: src/auth.go:15-42 | ||
| 75 | func VerifyToken(token string) (*Claims, error) { | ||
| 76 | // ... | ||
| 77 | } | ||
| 78 | ``` | ||
| 79 | |||
| 80 | ### Approach 2: Sliding Window (fallback) | ||
| 81 | |||
| 82 | For unsupported languages or when AST parsing fails: | ||
| 83 | - Fixed-size chunks with overlap | ||
| 84 | - Respect line boundaries | ||
| 85 | - Include context (file path, surrounding lines) | ||
| 86 | |||
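The fallback above could be sketched as follows. This is a minimal illustration, not the project's implementation: it counts lines rather than tokens (the config uses `chunk_max_tokens`), and the `Chunk` type and `slidingWindow` function are hypothetical names introduced here.

```go
package main

import "strings"

// Chunk is a hypothetical container for one window of lines.
type Chunk struct {
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
	Content   string
}

// slidingWindow splits src into chunks of at most size lines, where
// consecutive chunks share overlap lines. Chunks always start and end
// on line boundaries. It returns nil unless 0 <= overlap < size.
func slidingWindow(src string, size, overlap int) []Chunk {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	lines := strings.Split(strings.TrimSuffix(src, "\n"), "\n")
	if len(lines) == 1 && lines[0] == "" {
		return nil // empty input produces no chunks
	}
	step := size - overlap
	var chunks []Chunk
	for start := 0; start < len(lines); start += step {
		end := start + size
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, Chunk{
			StartLine: start + 1,
			EndLine:   end,
			Content:   strings.Join(lines[start:end], "\n"),
		})
		if end == len(lines) {
			break // the last window already reached the end of the file
		}
	}
	return chunks
}
```

For a five-line file, `slidingWindow(src, 3, 1)` yields two chunks covering lines 1-3 and 3-5. A token-based variant would need a tokenizer matching the embedding model, which this sketch leaves out.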
| 87 | ### Chunk Metadata | ||
| 88 | |||
| 89 | Each chunk stores: | ||
| 90 | ```json | ||
| 91 | { | ||
| 92 | "file": "src/auth.go", | ||
| 93 | "start_line": 15, | ||
| 94 | "end_line": 42, | ||
| 95 | "type": "function", | ||
| 96 | "name": "VerifyToken", | ||
| 97 | "content": "func VerifyToken...", | ||
| 98 | "hash": "abc123" | ||
| 99 | } | ||
| 100 | ``` | ||
| 101 | |||
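In Go, the metadata above might map to a struct like the one below. This is a sketch under assumptions: the design does not say what `hash` covers, so hashing the chunk content with SHA-256 is a guess, and `ChunkMeta`, `newChunkMeta`, and `toJSON` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// ChunkMeta mirrors the JSON fields listed above.
type ChunkMeta struct {
	File      string `json:"file"`
	StartLine int    `json:"start_line"`
	EndLine   int    `json:"end_line"`
	Type      string `json:"type"`
	Name      string `json:"name,omitempty"`
	Content   string `json:"content"`
	Hash      string `json:"hash"`
}

// newChunkMeta fills in Hash as the hex SHA-256 of the content
// (an assumption; the design only shows a placeholder value).
func newChunkMeta(file string, start, end int, typ, name, content string) ChunkMeta {
	sum := sha256.Sum256([]byte(content))
	return ChunkMeta{
		File:      file,
		StartLine: start,
		EndLine:   end,
		Type:      typ,
		Name:      name,
		Content:   content,
		Hash:      hex.EncodeToString(sum[:]),
	}
}

// toJSON serializes the chunk with the field names used in the design.
func (c ChunkMeta) toJSON() (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}
```

A content hash lets the indexer skip re-embedding a chunk whose text is unchanged even when its file hash differs.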
| 102 | ## Database Schema | ||
| 103 | |||
| 104 | ```sql | ||
| 105 | CREATE TABLE chunks ( | ||
| 106 | id INTEGER PRIMARY KEY, | ||
| 107 | file TEXT NOT NULL, | ||
| 108 | start_line INTEGER NOT NULL, | ||
| 109 | end_line INTEGER NOT NULL, | ||
| 110 | chunk_type TEXT, -- function, class, block, etc. | ||
| 111 | name TEXT, -- function/class name if available | ||
| 112 | content TEXT NOT NULL, | ||
| 113 | hash TEXT NOT NULL, | ||
| 114 | created_at INTEGER | ||
| 115 | ); | ||
| 116 | |||
| 117 | CREATE TABLE embeddings ( | ||
| 118 | chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id), | ||
| 119 | embedding BLOB NOT NULL -- sqlite-vec vector | ||
| 120 | ); | ||
| 121 | |||
| 122 | CREATE TABLE files ( | ||
| 123 | path TEXT PRIMARY KEY, | ||
| 124 | hash TEXT NOT NULL, | ||
| 125 | indexed_at INTEGER | ||
| 126 | ); | ||
| 127 | |||
| 128 | -- sqlite-vec virtual table for similarity search | ||
| 129 | CREATE VIRTUAL TABLE vec_chunks USING vec0( | ||
| 130 | chunk_id INTEGER PRIMARY KEY, | ||
| 131 | embedding FLOAT[1536] -- must match the embedding model's dimension (1536 for text-embedding-3-small) | ||
| 132 | ); | ||
| 133 | ``` | ||
| 134 | |||
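To write into the `embedding BLOB` column, vectors have to be serialized. The sketch below assumes the compact binary layout sqlite-vec accepts is a flat sequence of little-endian float32 values; `serializeFloat32` and `deserializeFloat32` are hypothetical helpers written with the standard library only, and the Go bindings may already offer an equivalent.

```go
package main

import (
	"encoding/binary"
	"errors"
	"math"
)

// serializeFloat32 packs a vector into 4 bytes per element, little-endian.
func serializeFloat32(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// deserializeFloat32 reverses serializeFloat32 and rejects blobs whose
// length is not a whole number of float32 values.
func deserializeFloat32(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, errors.New("blob length is not a multiple of 4")
	}
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}
```

A 1536-dimension embedding serializes to 6,144 bytes per chunk, which is worth keeping in mind when estimating index size.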
| 135 | ## Embedding Generation | ||
| 136 | |||
| 137 | ### Options | ||
| 138 | |||
| 139 | 1. **OpenAI** — `text-embedding-3-small` (1536 dims, fast, cheap) | ||
| 140 | 2. **Ollama** — Local models (`nomic-embed-text`, `mxbai-embed-large`) | ||
| 141 | 3. **Voyage** — Code-specific embeddings (`voyage-code-2`) | ||
| 142 | |||
| 143 | ### Configuration | ||
| 144 | |||
| 145 | ```json | ||
| 146 | { | ||
| 147 | "model": "openai:text-embedding-3-small", | ||
| 148 | "chunk_max_tokens": 512, | ||
| 149 | "chunk_overlap": 50, | ||
| 150 | "languages": ["go", "typescript", "python"], | ||
| 151 | "ignore": ["vendor/", "node_modules/", "*.min.js"] | ||
| 152 | } | ||
| 153 | ``` | ||
| 154 | |||
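Loading this file could look roughly like the sketch below. The `Config` struct, `parseConfig`, and `splitModel` are hypothetical names. The default model comes from the `--model` flag description, while treating 512 and 50 as defaults is an assumption based on the example values.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config mirrors the fields of .codevec/config.json shown above.
type Config struct {
	Model          string   `json:"model"`
	ChunkMaxTokens int      `json:"chunk_max_tokens"`
	ChunkOverlap   int      `json:"chunk_overlap"`
	Languages      []string `json:"languages"`
	Ignore         []string `json:"ignore"`
}

// parseConfig starts from assumed defaults and overlays whatever the
// JSON document sets, then checks that the overlap fits inside a chunk.
func parseConfig(data []byte) (Config, error) {
	cfg := Config{
		Model:          "openai:text-embedding-3-small",
		ChunkMaxTokens: 512,
		ChunkOverlap:   50,
	}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	if cfg.ChunkMaxTokens <= 0 || cfg.ChunkOverlap < 0 || cfg.ChunkOverlap >= cfg.ChunkMaxTokens {
		return Config{}, fmt.Errorf("invalid chunk settings: max=%d overlap=%d",
			cfg.ChunkMaxTokens, cfg.ChunkOverlap)
	}
	return cfg, nil
}

// splitModel separates "provider:name" at the first colon, so names
// that themselves contain a colon stay intact.
func splitModel(model string) (string, string, error) {
	provider, name, ok := strings.Cut(model, ":")
	if !ok || provider == "" || name == "" {
		return "", "", fmt.Errorf("model %q: want provider:name", model)
	}
	return provider, name, nil
}
```

Rejecting an overlap that is not smaller than the chunk size early avoids a chunker that never advances.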
| 155 | ## CLI Commands | ||
| 156 | |||
| 157 | ### `codevec index <path>` | ||
| 158 | |||
| 159 | Index a directory. | ||
| 160 | |||
| 161 | ``` | ||
| 162 | Flags: | ||
| 163 | --model Embedding model (default: openai:text-embedding-3-small) | ||
| 164 | --update Incremental update (only changed files) | ||
| 165 | --force Re-index everything | ||
| 166 | --ignore Additional ignore patterns | ||
| 167 | --verbose Show progress | ||
| 168 | ``` | ||
| 169 | |||
| 170 | ### `codevec query <text>` | ||
| 171 | |||
| 172 | Search for relevant code. | ||
| 173 | |||
| 174 | ``` | ||
| 175 | Flags: | ||
| 176 | --limit Max results (default: 10) | ||
| 177 | --threshold Min similarity score (default: 0.5) | ||
| 178 | --ext Filter by extension (.go, .ts, etc.) | ||
| 179 | --file Filter by file path pattern | ||
| 180 | --show Print chunk content | ||
| 181 | --json Output as JSON | ||
| 182 | ``` | ||
| 183 | |||
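The `--threshold`, `--ext`, and `--limit` flags can be read as a post-filter over scored candidates. The sketch below assumes the score is cosine similarity (the design does not name the metric, and sqlite-vec itself reports distances); `Result`, `cosine`, and `filterResults` are hypothetical names.

```go
package main

import (
	"math"
	"path/filepath"
	"sort"
)

// Result is one scored chunk, as printed by `codevec query`.
type Result struct {
	File      string
	StartLine int
	EndLine   int
	Score     float64
}

// cosine returns the cosine similarity of a and b, or 0 when the
// vectors differ in length, are empty, or have zero magnitude.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// filterResults drops results below threshold or with a different
// extension (when ext is non-empty), sorts by descending score, and
// keeps at most limit entries (when limit is positive).
func filterResults(results []Result, threshold float64, ext string, limit int) []Result {
	var out []Result
	for _, r := range results {
		if r.Score < threshold {
			continue
		}
		if ext != "" && filepath.Ext(r.File) != ext {
			continue
		}
		out = append(out, r)
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}
```

In practice the extension filter is better pushed into the SQL query so that `--limit` applies after filtering rather than before.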
| 184 | ### `codevec status` | ||
| 185 | |||
| 186 | Show index stats. | ||
| 187 | |||
| 188 | ``` | ||
| 189 | Index: .codevec/index.db | ||
| 190 | Files: 142 | ||
| 191 | Chunks: 1,847 | ||
| 192 | Model: openai:text-embedding-3-small | ||
| 193 | Last indexed: 2 hours ago | ||
| 194 | ``` | ||
| 195 | |||
| 196 | ### `codevec serve` | ||
| 197 | |||
| 198 | Optional: Run as HTTP server for integration with other tools. | ||
| 199 | |||
| 200 | ``` | ||
| 201 | GET /query?q=authentication&limit=10 | ||
| 202 | POST /index (webhook for CI) | ||
| 203 | ``` | ||
| 204 | |||
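Parameter handling for `GET /query` might be sketched as below. `parseQueryParams` is a hypothetical helper; the default limit of 10 is borrowed from the `codevec query --limit` flag, and rejecting non-positive limits is an assumption.

```go
package main

import (
	"errors"
	"net/url"
	"strconv"
)

// parseQueryParams extracts q and limit from a raw query string such
// as "q=authentication&limit=10". limit falls back to 10 when absent.
func parseQueryParams(rawQuery string) (q string, limit int, err error) {
	vals, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", 0, err
	}
	q = vals.Get("q")
	if q == "" {
		return "", 0, errors.New("missing q parameter")
	}
	limit = 10
	if s := vals.Get("limit"); s != "" {
		limit, err = strconv.Atoi(s)
		if err != nil || limit <= 0 {
			return "", 0, errors.New("limit must be a positive integer")
		}
	}
	return q, limit, nil
}
```

Inside an `http.HandlerFunc`, this would be called with `r.URL.RawQuery`, returning a 400 response on error.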
| 205 | ## Integration with claude-flow | ||
| 206 | |||
| 207 | Add a `CodeSearch` tool that shells out to codevec: | ||
| 208 | |||
| 209 | ```typescript | ||
| 210 | // In claude-flow's tool definitions | ||
| 211 | { | ||
| 212 | name: "CodeSearch", | ||
| 213 | description: "Search codebase semantically. Use before Read to find relevant files.", | ||
| 214 | parameters: { | ||
| 215 | query: "string - what to search for", | ||
| 216 | limit: "number - max results (default 10)" | ||
| 217 | }, | ||
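| 218 | execute: async ({ query, limit = 10 }) => { | ||
| 219 | const { stdout } = await execFile("codevec", ["query", query, "--limit", String(limit), "--json"]); // assumes a promisified child_process.execFile: no shell, so query cannot inject commands | ||
| 220 | return JSON.parse(stdout); | ||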
| 221 | } | ||
| 222 | } | ||
| 223 | ``` | ||
| 224 | |||
| 225 | Update research phase prompt: | ||
| 226 | ``` | ||
| 227 | WORKFLOW: | ||
| 228 | 1. Use CodeSearch to find relevant code for the task | ||
| 229 | 2. Use Read to examine specific files from search results | ||
| 230 | 3. Write findings to research.md | ||
| 231 | ``` | ||
| 232 | |||
| 233 | ## Incremental Updates | ||
| 234 | |||
| 235 | Track file hashes to avoid re-indexing unchanged files: | ||
| 236 | |||
| 237 | ```json | ||
| 238 | // .codevec/manifest.json | ||
| 239 | { | ||
| 240 | "src/auth.go": "sha256:abc123...", | ||
| 241 | "src/handler.go": "sha256:def456..." | ||
| 242 | } | ||
| 243 | ``` | ||
| 244 | |||
| 245 | On `codevec index <path> --update`: | ||
| 246 | 1. Walk directory | ||
| 247 | 2. Compare hashes | ||
| 248 | 3. Re-chunk and re-embed only changed files | ||
| 249 | 4. Delete chunks from removed files | ||
| 250 | |||
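Steps 2 to 4 amount to a diff between the stored manifest and freshly computed hashes. A minimal sketch, assuming the manifest is loaded as a `map[string]string` of path to `sha256:`-prefixed hash as in the example above; `hashContent` and `diffManifest` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// hashContent returns a hash in the "sha256:<hex>" form used by the manifest.
func hashContent(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// diffManifest compares the stored manifest (old) with the hashes of
// the files currently on disk (current). changed lists new or modified
// files to re-chunk and re-embed; removed lists files whose chunks
// should be deleted. Both slices are sorted for stable output.
func diffManifest(old, current map[string]string) (changed, removed []string) {
	for path, h := range current {
		if old[path] != h {
			changed = append(changed, path)
		}
	}
	for path := range old {
		if _, ok := current[path]; !ok {
			removed = append(removed, path)
		}
	}
	sort.Strings(changed)
	sort.Strings(removed)
	return changed, removed
}
```

For a modified file, the old chunks should be deleted before the new ones are inserted so stale line ranges do not linger in the index.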
| 251 | ## Language Support | ||
| 252 | |||
| 253 | **Phase 1 (tree-sitter):** | ||
| 254 | - Go | ||
| 255 | - TypeScript/JavaScript | ||
| 256 | - Python | ||
| 257 | |||
| 258 | **Phase 2:** | ||
| 259 | - Rust | ||
| 260 | - C/C++ | ||
| 261 | - Java | ||
| 262 | |||
| 263 | **Fallback:** | ||
| 264 | - Sliding window for any text file | ||
| 265 | |||
| 266 | ## Tech Stack | ||
| 267 | |||
| 268 | - **Language:** Go | ||
| 269 | - **Embeddings:** OpenAI API (default), Ollama (local) | ||
| 270 | - **Storage:** SQLite + sqlite-vec | ||
| 271 | - **Parsing:** tree-sitter (via Go bindings) | ||
| 272 | |||
| 273 | ## Open Questions | ||
| 274 | |||
| 275 | 1. **Chunk size vs context:** Bigger chunks = more context but less precise. Smaller = precise but may miss context. | ||
| 276 | 2. **Include comments?** They're semantically rich but noisy. | ||
| 277 | 3. **Cross-file relationships:** Should we embed import graphs or call relationships? | ||
| 278 | 4. **Cost:** OpenAI embeddings are cheap but not free. Cache aggressively. | ||
| 279 | |||
| 280 | ## Prior Art | ||
| 281 | |||
| 282 | - **Sourcegraph Cody** — Similar concept, proprietary | ||
| 283 | - **Cursor** — IDE with semantic codebase understanding | ||
| 284 | - **Bloop** — Open-source semantic code search | ||
| 285 | - **Greptile** — API for codebase understanding | ||
| 286 | |||
| 287 | ## Next Steps | ||
| 288 | |||
| 289 | 1. [ ] Basic CLI skeleton (index, query, status) | ||
| 290 | 2. [ ] sqlite-vec integration | ||
| 291 | 3. [ ] OpenAI embedding generation | ||
| 292 | 4. [ ] File walking with .gitignore respect | ||
| 293 | 5. [ ] Sliding window chunker (MVP) | ||
| 294 | 6. [ ] Tree-sitter chunker for Go | ||
| 295 | 7. [ ] Incremental updates | ||
| 296 | 8. [ ] claude-flow integration | ||
| 297 | |||
| 298 | --- | ||
| 299 | |||
| 300 | ## Dependencies | ||
| 301 | |||
| 302 | - `github.com/asg017/sqlite-vec-go-bindings` — sqlite-vec | ||
| 303 | - `github.com/smacker/go-tree-sitter` — tree-sitter (optional) | ||
| 304 | - OpenAI API or Ollama for embeddings | ||
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..aba79e3
--- /dev/null
+++ b/README.md
| @@ -0,0 +1,34 @@ | |||
| 1 | # codevec | ||
| 2 | |||
| 3 | Semantic code search via embeddings. | ||
| 4 | |||
| 5 | ```bash | ||
| 6 | codevec index . | ||
| 7 | codevec query "websocket connection handling" | ||
| 8 | ``` | ||
| 9 | |||
| 10 | ## Status | ||
| 11 | |||
| 12 | **Design phase** — see [DESIGN.md](DESIGN.md) | ||
| 13 | |||
| 14 | ## Overview | ||
| 15 | |||
| 16 | Index your codebase, query by concept. Get relevant code chunks with file paths and line numbers. | ||
| 17 | |||
| 18 | - AST-aware chunking (tree-sitter) for Go, TypeScript, Python | ||
| 19 | - sqlite-vec for fast similarity search | ||
| 20 | - Incremental updates (only re-index changed files) | ||
| 21 | - Integrates with claude-flow as a `CodeSearch` tool | ||
| 22 | |||
| 23 | ## Why | ||
| 24 | |||
| 25 | `grep` finds keywords. `codevec` finds meaning. | ||
| 26 | |||
| 27 | ```bash | ||
| 28 | # grep misses this | ||
| 29 | grep -r "authentication" . # won't find verifyJWT() | ||
| 30 | |||
| 31 | # codevec finds it | ||
| 32 | codevec query "authentication" | ||
| 33 | # src/auth.go:15-42 verifyJWT (0.89) | ||
| 34 | ``` | ||
