author Clawd <ai@clawd.bot> 2026-03-05 07:05:24 -0800
committer Clawd <ai@clawd.bot> 2026-03-05 07:05:24 -0800
commit f7ff79118866d3198cdcc6a9c59881344bd00a4a
tree bee114d456cd3744956c05aeeea51de368102a44 /DESIGN.md
Initial design doc
Diffstat (limited to 'DESIGN.md')
-rw-r--r-- DESIGN.md 304
1 files changed, 304 insertions, 0 deletions
diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..0950892
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,304 @@
# codevec

**Semantic code search via embeddings**

A CLI that indexes codebases for semantic search. Query by concept, get relevant code chunks with file paths and line numbers.

## Problem

Searching code by keywords (`grep`, `ripgrep`) misses semantic matches:
- "authentication" won't find `verifyJWT()`
- "handle errors" won't find `if err != nil { return }`
- "database connection" won't find `sql.Open()`

AI coding assistants spend tokens reading files to find relevant code. Pre-computed embeddings let them jump straight to what matters.

## Usage

```bash
# Index current directory
codevec index .

# Query semantically
codevec query "websocket connection handling"
# src/relay.go:45-89 (0.87)
# src/handler.go:102-145 (0.82)

# Query with filters
codevec query "error handling" --ext .go --limit 5

# Show chunk content
codevec query "authentication" --show

# Re-index (incremental, respects .gitignore)
codevec index . --update
```

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                           codevec                           │
│                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐   │
│  │  Parser  │───▶│ Chunker  │───▶│ Embedding Generator  │   │
│  └──────────┘    └──────────┘    └──────────┬───────────┘   │
│       │                                     │               │
│       │ file list                           │ vectors       │
│       ▼                                     ▼               │
│ ┌────────────┐                       ┌──────────────┐       │
│ │ .gitignore │                       │  sqlite-vec  │       │
│ │   filter   │                       │    index     │       │
│ └────────────┘                       └──────────────┘       │
└─────────────────────────────────────────────────────────────┘

Storage: .codevec/
├── index.db        # SQLite + sqlite-vec
├── config.json     # Index settings (model, chunk size, etc.)
└── manifest.json   # File hashes for incremental updates
```

## Chunking Strategy

**Goal:** Create semantically meaningful chunks that respect code boundaries.

### Approach 1: AST-Aware (preferred for supported languages)

Use tree-sitter to parse and chunk by:
- Functions/methods
- Classes/structs
- Top-level declarations

```go
// Chunk: function
// File: src/auth.go:15-42
func VerifyToken(token string) (*Claims, error) {
	// ...
}
```
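
To make the idea concrete without pulling in tree-sitter, here is a minimal sketch that chunks Go source by top-level declaration using the standard library's `go/parser` instead. This is a stand-in technique, not the planned tree-sitter implementation; the `DeclChunk` type and `ChunkGoSource` function are hypothetical names.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
)

// DeclChunk is a hypothetical record for one top-level declaration.
type DeclChunk struct {
	File      string
	StartLine int // 1-based, includes a leading doc comment if present
	EndLine   int // 1-based, inclusive
	Type      string // "function", "method", or "declaration"
	Name      string
}

// ChunkGoSource parses Go source and returns one chunk per top-level
// declaration, skipping import blocks. On a parse error the caller can
// fall back to the sliding-window chunker.
func ChunkGoSource(filename string, src []byte) ([]DeclChunk, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var chunks []DeclChunk
	for _, decl := range file.Decls {
		if gd, ok := decl.(*ast.GenDecl); ok && gd.Tok == token.IMPORT {
			continue // imports carry little semantic content on their own
		}
		c := DeclChunk{
			File:      filename,
			StartLine: fset.Position(decl.Pos()).Line,
			EndLine:   fset.Position(decl.End()).Line,
			Type:      "declaration",
		}
		if fn, ok := decl.(*ast.FuncDecl); ok {
			c.Name = fn.Name.Name
			c.Type = "function"
			if fn.Recv != nil {
				c.Type = "method"
			}
			if fn.Doc != nil {
				// Keep the doc comment with the function it describes.
				c.StartLine = fset.Position(fn.Doc.Pos()).Line
			}
		}
		chunks = append(chunks, c)
	}
	return chunks, nil
}
```

Calling `ChunkGoSource("src/auth.go", src)` on a file yields one entry per function, method, type, const, or var declaration. A tree-sitter version would follow the same shape but walk language-specific node types.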

### Approach 2: Sliding Window (fallback)

For unsupported languages or when AST parsing fails:
- Fixed-size chunks with overlap
- Respect line boundaries
- Include context (file path, surrounding lines)

### Chunk Metadata

Each chunk stores:
```json
{
  "file": "src/auth.go",
  "start_line": 15,
  "end_line": 42,
  "type": "function",
  "name": "VerifyToken",
  "content": "func VerifyToken...",
  "hash": "abc123"
}
```

## Database Schema

```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    chunk_type TEXT,        -- function, class, block, etc.
    name TEXT,              -- function/class name if available
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    created_at INTEGER
);

CREATE TABLE embeddings (
    chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id),
    embedding BLOB NOT NULL  -- sqlite-vec vector
);

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    indexed_at INTEGER
);

-- sqlite-vec virtual table for similarity search.
-- The dimension must match the configured embedding model
-- (1536 for text-embedding-3-small).
CREATE VIRTUAL TABLE vec_chunks USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[1536]
);
```
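
Vectors have to be serialized before they go into the `embedding` columns. The sketch below assumes sqlite-vec's compact binary format of consecutive little-endian float32 values; `EncodeVector` and `DecodeVector` are hypothetical helpers, and the query constant follows the KNN form shown in the sqlite-vec README, so verify both against the version in use.

```go
package main

import (
	"encoding/binary"
	"errors"
	"math"
)

// knnQuery is an assumed nearest-neighbour query against vec_chunks.
// The first parameter is an encoded query vector, the second the limit.
const knnQuery = `
SELECT chunk_id, distance
FROM vec_chunks
WHERE embedding MATCH ?
ORDER BY distance
LIMIT ?`

// EncodeVector packs float32 values as little-endian bytes, 4 per value.
func EncodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// DecodeVector reverses EncodeVector.
func DecodeVector(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, errors.New("vector blob length must be a multiple of 4")
	}
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}
```

A 1536-dimension embedding encodes to 6,144 bytes per chunk, which is worth keeping in mind when estimating index size.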

## Embedding Generation

### Options

1. **OpenAI**: `text-embedding-3-small` (1536 dims, fast, cheap)
2. **Ollama**: Local models (`nomic-embed-text`, `mxbai-embed-large`)
3. **Voyage**: Code-specific embeddings (`voyage-code-2`)

### Configuration

```json
{
  "model": "openai:text-embedding-3-small",
  "chunk_max_tokens": 512,
  "chunk_overlap": 50,
  "languages": ["go", "typescript", "python"],
  "ignore": ["vendor/", "node_modules/", "*.min.js"]
}
```
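
A sketch of how this file might be loaded, assuming the `model` value always takes the form `provider:name` as in the example. The `Config` type, `ParseConfig`, and `ProviderAndModel` are hypothetical names, and the overlap check is an illustrative validation rather than a stated requirement.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config mirrors .codevec/config.json.
type Config struct {
	Model          string   `json:"model"`
	ChunkMaxTokens int      `json:"chunk_max_tokens"`
	ChunkOverlap   int      `json:"chunk_overlap"`
	Languages      []string `json:"languages"`
	Ignore         []string `json:"ignore"`
}

// ParseConfig decodes the JSON and rejects an overlap that would keep the
// chunker from advancing.
func ParseConfig(data []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, err
	}
	if cfg.ChunkOverlap < 0 || cfg.ChunkOverlap >= cfg.ChunkMaxTokens {
		return Config{}, fmt.Errorf(
			"chunk_overlap (%d) must be non-negative and smaller than chunk_max_tokens (%d)",
			cfg.ChunkOverlap, cfg.ChunkMaxTokens)
	}
	return cfg, nil
}

// ProviderAndModel splits "openai:text-embedding-3-small" at the first
// colon, so model names that contain colons themselves stay intact.
func (c Config) ProviderAndModel() (string, string, error) {
	provider, model, ok := strings.Cut(c.Model, ":")
	if !ok || provider == "" || model == "" {
		return "", "", fmt.Errorf("model %q must look like provider:name", c.Model)
	}
	return provider, model, nil
}
```

Storing the model in the index config matters because vectors from different models are not comparable; switching models implies a full re-index.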

## CLI Commands

### `codevec index <path>`

Index a directory.

```
Flags:
  --model     Embedding model (default: openai:text-embedding-3-small)
  --update    Incremental update (only changed files)
  --force     Re-index everything
  --ignore    Additional ignore patterns
  --verbose   Show progress
```

### `codevec query <text>`

Search for relevant code.

```
Flags:
  --limit       Max results (default: 10)
  --threshold   Min similarity score (default: 0.5)
  --ext         Filter by extension (.go, .ts, etc.)
  --file        Filter by file path pattern
  --show        Print chunk content
  --json        Output as JSON
```

### `codevec status`

Show index stats.

```
Index: .codevec/index.db
Files: 142
Chunks: 1,847
Model: openai:text-embedding-3-small
Last indexed: 2 hours ago
```

### `codevec serve`

Optional: Run as HTTP server for integration with other tools.

```
GET  /query?q=authentication&limit=10
POST /index   (webhook for CI)
```

## Integration with claude-flow

Add a `CodeSearch` tool that shells out to codevec:

```typescript
// In claude-flow's tool definitions
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

const codeSearchTool = {
  name: "CodeSearch",
  description: "Search codebase semantically. Use before Read to find relevant files.",
  parameters: {
    query: "string - what to search for",
    limit: "number - max results (default 10)"
  },
  execute: async ({ query, limit = 10 }: { query: string; limit?: number }) => {
    // Pass arguments as an array (no shell) so the query text cannot inject commands.
    const { stdout } = await execFileAsync("codevec", [
      "query", query, "--limit", String(limit), "--json",
    ]);
    return JSON.parse(stdout);
  }
};
```

Update research phase prompt:
```
WORKFLOW:
1. Use CodeSearch to find relevant code for the task
2. Use Read to examine specific files from search results
3. Write findings to research.md
```

## Incremental Updates

Track file hashes to avoid re-indexing unchanged files:

```json
// .codevec/manifest.json
{
  "src/auth.go": "sha256:abc123...",
  "src/handler.go": "sha256:def456..."
}
```

On `codevec index . --update`:
1. Walk directory
2. Compare hashes
3. Re-chunk and re-embed only changed files
4. Delete chunks from removed files
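
Steps 2 through 4 reduce to a diff between the stored manifest and a freshly computed one. A minimal sketch, assuming the `sha256:` prefix shown above is followed by the hex digest of the file contents; `Manifest`, `HashContent`, and `Diff` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// Manifest maps a file path to its content hash, as in manifest.json.
type Manifest map[string]string

// HashContent returns the manifest entry for a file's bytes.
func HashContent(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// Diff compares the stored manifest with the current one. changed holds
// new or modified files to re-chunk and re-embed; removed holds files
// whose chunks should be deleted. Both lists are sorted.
func Diff(old, current Manifest) (changed, removed []string) {
	for path, hash := range current {
		if old[path] != hash {
			changed = append(changed, path)
		}
	}
	for path := range old {
		if _, ok := current[path]; !ok {
			removed = append(removed, path)
		}
	}
	sort.Strings(changed)
	sort.Strings(removed)
	return changed, removed
}
```

Files in `changed` should also have their old chunks deleted before re-embedding, otherwise stale chunks from the previous version remain searchable.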

## Language Support

**Phase 1 (tree-sitter):**
- Go
- TypeScript/JavaScript
- Python

**Phase 2:**
- Rust
- C/C++
- Java

**Fallback:**
- Sliding window for any text file

## Tech Stack

- **Language:** Go
- **Embeddings:** OpenAI API (default), Ollama (local)
- **Storage:** SQLite + sqlite-vec
- **Parsing:** tree-sitter (via Go bindings)

## Open Questions

1. **Chunk size vs context:** Bigger chunks carry more context but match less precisely. Smaller chunks match precisely but may lose context.
2. **Include comments?** They're semantically rich but noisy.
3. **Cross-file relationships:** Should we embed import graphs or call relationships?
4. **Cost:** OpenAI embeddings are cheap but not free. Cache aggressively.
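
On the cost question, one option is to key embeddings by a hash of the chunk text so identical content is embedded only once. The sketch below keeps the cache in memory for brevity; a real version would presumably persist it in `index.db`. `EmbedFunc` and `EmbeddingCache` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// EmbedFunc is a hypothetical call to an embedding provider.
type EmbedFunc func(text string) ([]float32, error)

// EmbeddingCache memoizes embeddings by the SHA-256 of the input text.
type EmbeddingCache struct {
	embed  EmbedFunc
	byHash map[string][]float32
}

// NewEmbeddingCache wraps embed with an empty cache.
func NewEmbeddingCache(embed EmbedFunc) *EmbeddingCache {
	return &EmbeddingCache{embed: embed, byHash: make(map[string][]float32)}
}

// Get returns the cached vector for text, calling the provider only on a
// miss. Provider errors are returned and not cached.
func (c *EmbeddingCache) Get(text string) ([]float32, error) {
	sum := sha256.Sum256([]byte(text))
	key := hex.EncodeToString(sum[:])
	if v, ok := c.byHash[key]; ok {
		return v, nil
	}
	v, err := c.embed(text)
	if err != nil {
		return nil, err
	}
	c.byHash[key] = v
	return v, nil
}
```

Because the key depends only on content, a function that moves between files or shifts line numbers costs nothing to re-index.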

## Prior Art

- **Sourcegraph Cody**: Similar concept, proprietary
- **Cursor**: IDE with semantic codebase understanding
- **Bloop**: Open-source semantic code search
- **Greptile**: API for codebase understanding

## Next Steps

1. [ ] Basic CLI skeleton (index, query, status)
2. [ ] sqlite-vec integration
3. [ ] OpenAI embedding generation
4. [ ] File walking with .gitignore respect
5. [ ] Sliding window chunker (MVP)
6. [ ] Tree-sitter chunker for Go
7. [ ] Incremental updates
8. [ ] claude-flow integration

---

## Dependencies

- `github.com/asg017/sqlite-vec-go-bindings`: sqlite-vec
- `github.com/smacker/go-tree-sitter`: tree-sitter (optional)
- OpenAI API or Ollama for embeddings