From 1ef3bb0128d59f6092199bb58eb0127ac7808899 Mon Sep 17 00:00:00 2001
From: Clawd
Date: Thu, 5 Mar 2026 07:08:43 -0800
Subject: Add implementation plan

---
 PLAN.md | 342 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 342 insertions(+)
 create mode 100644 PLAN.md

diff --git a/PLAN.md b/PLAN.md
new file mode 100644
index 0000000..a7253e1
--- /dev/null
+++ b/PLAN.md
@@ -0,0 +1,342 @@
+# codevec Implementation Plan
+
+**Goal:** Build a CLI that indexes Go codebases for semantic search.
+
+**Scope:** Go-only MVP, then expand to TypeScript/Python.
+
+---
+
+## Phase 1: Project Skeleton
+
+Set up the basic Go project structure.
+
+```
+codevec/
+├── cmd/
+│   └── codevec/
+│       └── main.go           # CLI entry point
+├── internal/
+│   ├── chunker/
+│   │   └── chunker.go        # Interface + Go implementation
+│   ├── embedder/
+│   │   └── embedder.go       # OpenAI embedding client
+│   ├── index/
+│   │   └── index.go          # sqlite-vec storage layer
+│   └── walker/
+│       └── walker.go         # File discovery + .gitignore
+├── go.mod
+├── go.sum
+├── Makefile
+└── README.md
+```
+
+**Tasks:**
+- [ ] `go mod init code.northwest.io/codevec`
+- [ ] Basic CLI with cobra or just flag package
+- [ ] Subcommands: `index`, `query`, `status`
+- [ ] Makefile with `build`, `test`, `install`
+
+---
+
+## Phase 2: File Walker
+
+Walk a directory, respect .gitignore, filter by extension.
+
+**Input:** Root path
+**Output:** List of `.go` files to index
+
+```go
+type Walker struct {
+	root    string
+	ignores []string // from .gitignore
+}
+
+func (w *Walker) Walk() ([]string, error)
+```
+
+**Tasks:**
+- [ ] Implement directory walking with `filepath.WalkDir`
+- [ ] Parse `.gitignore` patterns (use `go-gitignore` or similar)
+- [ ] Filter to `.go` files only (configurable later)
+- [ ] Skip `vendor/`, `testdata/`, `*_test.go` by default (configurable)
+
+**Test:** Walk the `nostr` SDK repo, verify correct file list.
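The walk described in Phase 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `keepFile` and `skipDirs` are hypothetical names, pruning `.git` is an added assumption, and `.gitignore` matching is left out entirely (the `ignores` field is carried over from the plan but never consulted here).

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// skipDirs lists directory names pruned by default. vendor/ and testdata/
// come from the plan; .git is an assumption added for this sketch.
var skipDirs = map[string]bool{"vendor": true, "testdata": true, ".git": true}

// keepFile reports whether a file name should be indexed:
// .go files only, excluding *_test.go.
func keepFile(name string) bool {
	return strings.HasSuffix(name, ".go") && !strings.HasSuffix(name, "_test.go")
}

// Walker mirrors the struct in the plan.
type Walker struct {
	root    string
	ignores []string // from .gitignore (not consulted in this sketch)
}

// Walk returns the paths of all files under root that pass keepFile,
// pruning any directory named in skipDirs.
func (w *Walker) Walk() ([]string, error) {
	var files []string
	err := filepath.WalkDir(w.root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			// Never prune the root itself, even if it has a skipped name.
			if path != w.root && skipDirs[d.Name()] {
				return filepath.SkipDir
			}
			return nil
		}
		if keepFile(d.Name()) {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	w := &Walker{root: "."}
	files, err := w.Walk()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Keeping the file filter in a pure function makes the default skip rules easy to unit-test without touching the filesystem; a `.gitignore` matcher would slot into the same callback.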
+
+---
+
+## Phase 3: Go Chunker (tree-sitter)
+
+Parse Go files and extract function/type chunks.
+
+**Input:** File path + content
+**Output:** List of chunks with metadata
+
+```go
+type Chunk struct {
+	File      string
+	StartLine int
+	EndLine   int
+	Type      string // "function", "method", "type", "const", "var"
+	Name      string // function/type name
+	Content   string // raw source code
+	Hash      string // sha256 of content
+}
+
+type Chunker interface {
+	Chunk(path string, content []byte) ([]Chunk, error)
+}
+```
+
+**Go-specific extraction:**
+- `function_declaration` → standalone functions
+- `method_declaration` → methods (include receiver in name: `(*Server).Handle`)
+- `type_declaration` → structs, interfaces
+- `const_declaration` / `var_declaration` → top-level const/var blocks
+
+**Tasks:**
+- [ ] Add tree-sitter dependency: `github.com/smacker/go-tree-sitter`
+- [ ] Add Go grammar: `github.com/smacker/go-tree-sitter/golang`
+- [ ] Implement `GoChunker` that parses and walks AST
+- [ ] Extract nodes by type, capture line numbers
+- [ ] Handle edge cases: empty files, syntax errors (skip gracefully)
+- [ ] Chunk size limit: if function > 1000 tokens, note it but keep whole
+
+**Test:** Chunk `nostr/relay.go`, verify functions extracted correctly.
+
+---
+
+## Phase 4: Embedding Generation
+
+Generate embeddings via OpenAI API.
+
+**Input:** List of chunks
+**Output:** Chunks with embedding vectors
+
+```go
+type Embedder interface {
+	Embed(ctx context.Context, texts []string) ([][]float32, error)
+}
+
+type OpenAIEmbedder struct {
+	apiKey string
+	model  string // "text-embedding-3-small"
+}
+```
+
+**Batching:** OpenAI supports up to 2048 inputs per request. Batch chunks to minimize API calls.
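The batching step could look something like the sketch below. `batchTexts` is a hypothetical helper name, not part of the plan; it only splits the input into consecutive groups and leaves the HTTP call, retries, and backoff out. The batch size of 100 used in `main` is the conservative figure from the Phase 4 task list, well under the 2048-input limit.

```go
package main

import "fmt"

// batchTexts splits texts into consecutive slices of at most size elements.
// The returned slices share backing storage with texts (no copying).
// A non-positive size yields no batches.
func batchTexts(texts []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(texts); start += size {
		end := start + size
		if end > len(texts) {
			end = len(texts)
		}
		batches = append(batches, texts[start:end])
	}
	return batches
}

func main() {
	// 250 placeholder chunk texts, batched 100 at a time.
	texts := make([]string, 250)
	for i := range texts {
		texts[i] = fmt.Sprintf("chunk %d", i)
	}
	for i, b := range batchTexts(texts, 100) {
		fmt.Printf("batch %d: %d texts\n", i, len(b))
	}
}
```

Each batch would then be passed to one `Embed` call, so the number of API requests is the chunk count divided by the batch size, rounded up.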
+
+**Tasks:**
+- [ ] Implement OpenAI embedding client (stdlib `net/http`, no SDK)
+- [ ] Batch requests (100 chunks per request to stay safe)
+- [ ] Handle rate limits with exponential backoff
+- [ ] Config: model selection, API key from env `OPENAI_API_KEY`
+
+**Test:** Embed a few chunks, verify 1536-dim vectors returned.
+
+---
+
+## Phase 5: sqlite-vec Storage
+
+Store chunks and embeddings in SQLite with vector search.
+
+**Schema:**
+```sql
+CREATE TABLE chunks (
+    id INTEGER PRIMARY KEY,
+    file TEXT NOT NULL,
+    start_line INTEGER NOT NULL,
+    end_line INTEGER NOT NULL,
+    chunk_type TEXT,
+    name TEXT,
+    content TEXT NOT NULL,
+    hash TEXT NOT NULL,
+    created_at INTEGER DEFAULT (unixepoch())
+);
+
+CREATE TABLE files (
+    path TEXT PRIMARY KEY,
+    hash TEXT NOT NULL,
+    indexed_at INTEGER DEFAULT (unixepoch())
+);
+
+CREATE VIRTUAL TABLE vec_chunks USING vec0(
+    id INTEGER PRIMARY KEY,
+    embedding FLOAT[1536]
+);
+```
+
+**Queries:**
+```sql
+-- Similarity search
+SELECT c.*, vec_distance_cosine(v.embedding, ?) as distance
+FROM vec_chunks v
+JOIN chunks c ON c.id = v.id
+ORDER BY distance
+LIMIT 10;
+```
+
+**Tasks:**
+- [ ] Add sqlite-vec: `github.com/asg017/sqlite-vec-go-bindings`
+- [ ] Initialize DB with schema
+- [ ] Insert chunks + embeddings
+- [ ] Query by vector similarity
+- [ ] Store in `.codevec/index.db`
+
+**Test:** Insert chunks, query, verify results ranked by similarity.
+
+---
+
+## Phase 6: CLI Commands
+
+Wire everything together.
+
+### `codevec index `
+
+```
+1. Walk directory → file list
+2. For each file:
+   a. Check if already indexed (compare file hash)
+   b. Parse with tree-sitter → chunks
+   c. Generate embeddings (batched)
+   d. Store in sqlite-vec
+3. Update file manifest
+4. Print summary
+```
+
+**Flags:**
+- `--force` — re-index everything
+- `--verbose` — show progress
+
+### `codevec query `
+
+```
+1. Generate embedding for query text
+2. Search sqlite-vec for similar chunks
+3. Print results with file:line and similarity score
+```
+
+**Flags:**
+- `--limit N` — max results (default 10)
+- `--threshold F` — min similarity (default 0.5)
+- `--show` — print chunk content
+- `--json` — output as JSON
+
+### `codevec status`
+
+```
+1. Read index.db
+2. Print stats: files, chunks, last indexed, model used
+```
+
+**Tasks:**
+- [ ] Implement `index` command with progress bar
+- [ ] Implement `query` command with formatted output
+- [ ] Implement `status` command
+- [ ] Add `--json` output for tool integration
+
+---
+
+## Phase 7: Incremental Updates
+
+Only re-index changed files.
+
+**Manifest:** `.codevec/manifest.json`
+```json
+{
+  "files": {
+    "src/relay.go": {
+      "hash": "sha256:abc...",
+      "indexed_at": 1709654400
+    }
+  },
+  "model": "text-embedding-3-small",
+  "version": 1
+}
+```
+
+**Logic:**
+1. Walk directory
+2. For each file, compute hash
+3. If hash matches manifest → skip
+4. If hash differs → delete old chunks, re-index
+5. If file removed → delete chunks
+6. Update manifest
+
+**Tasks:**
+- [ ] Implement file hashing (sha256 of content)
+- [ ] Compare against manifest
+- [ ] Delete stale chunks on re-index
+- [ ] Handle deleted files
+
+---
+
+## Phase 8: Testing & Polish
+
+- [ ] Unit tests for chunker
+- [ ] Unit tests for walker
+- [ ] Integration test: index small repo, query, verify results
+- [ ] Error handling: missing API key, parse failures, network errors
+- [ ] README with usage examples
+- [ ] `make install` to put binary in PATH
+
+---
+
+## Future (Post-MVP)
+
+- TypeScript chunker (tree-sitter + TS grammar)
+- Python chunker
+- Ollama embedder for local/offline use
+- `codevec serve` HTTP API
+- Watch mode (re-index on file change)
+- Import/export index
+
+---
+
+## Dependencies
+
+```go
+require (
+	github.com/smacker/go-tree-sitter v0.0.0-...
+	github.com/smacker/go-tree-sitter/golang v0.0.0-...
+	github.com/asg017/sqlite-vec-go-bindings v0.0.0-...
+	github.com/sabhiram/go-gitignore v0.0.0-... // or similar
+)
+```
+
+---
+
+## Estimated Effort
+
+| Phase | Effort |
+|-------|--------|
+| 1. Skeleton | 30 min |
+| 2. Walker | 1 hr |
+| 3. Chunker | 2 hr |
+| 4. Embedder | 1 hr |
+| 5. Storage | 2 hr |
+| 6. CLI | 1 hr |
+| 7. Incremental | 1 hr |
+| 8. Polish | 1 hr |
+| **Total** | ~10 hr |
+
+---
+
+## Open Decisions
+
+1. **CLI framework:** `cobra` vs stdlib `flag`? Leaning stdlib for simplicity.
+2. **Config file:** YAML in `.codevec/config.yaml` or just flags?
+3. **Chunk overlap:** Include N lines of context above/below functions?
+4. **Test files:** Index `*_test.go` by default or skip?
+
+---
+
+## First Milestone
+
+End of Phase 5: Can index a Go repo and query it.
+
+```bash
+cd ~/vault/code/nostr
+codevec index .
+codevec query "publish event to relay"
+# → relay.go:45-89 Publish (0.87)
+```
-- 
cgit v1.2.3