# codevec Implementation Plan

**Goal:** Build a CLI that indexes Go codebases for semantic search.

**Scope:** Go-only MVP, then expand to TypeScript/Python.

---

## Phase 1: Project Skeleton

Set up the basic Go project structure.

```
codevec/
├── cmd/
│   └── codevec/
│       └── main.go          # CLI entry point
├── internal/
│   ├── chunker/
│   │   └── chunker.go       # Interface + Go implementation
│   ├── embedder/
│   │   └── embedder.go      # OpenAI embedding client
│   ├── index/
│   │   └── index.go         # sqlite-vec storage layer
│   └── walker/
│       └── walker.go        # File discovery + .gitignore
├── go.mod
├── go.sum
├── Makefile
└── README.md
```

**Tasks:**

- [ ] `go mod init code.northwest.io/codevec`
- [ ] Basic CLI with cobra or just the stdlib `flag` package
- [ ] Subcommands: `index`, `query`, `status`
- [ ] Makefile with `build`, `test`, `install` targets

---

## Phase 2: File Walker

Walk a directory, respect `.gitignore`, filter by extension.

**Input:** Root path
**Output:** List of `.go` files to index

```go
type Walker struct {
    root    string
    ignores []string // patterns from .gitignore
}

func (w *Walker) Walk() ([]string, error)
```

**Tasks:**

- [ ] Implement directory walking with `filepath.WalkDir`
- [ ] Parse `.gitignore` patterns (use `go-gitignore` or similar)
- [ ] Filter to `.go` files only (configurable later)
- [ ] Skip `vendor/`, `testdata/`, `*_test.go` by default (configurable)

**Test:** Walk the `nostr` SDK repo, verify the file list is correct.

---

## Phase 3: Go Chunker (tree-sitter)

Parse Go files and extract function/type chunks.
**Input:** File path + content
**Output:** List of chunks with metadata

```go
type Chunk struct {
    File      string
    StartLine int
    EndLine   int
    Type      string // "function", "method", "type", "const", "var"
    Name      string // function/type name
    Content   string // raw source code
    Hash      string // sha256 of content
}

type Chunker interface {
    Chunk(path string, content []byte) ([]Chunk, error)
}
```

**Go-specific extraction:**

- `function_declaration` → standalone functions
- `method_declaration` → methods (include receiver in name: `(*Server).Handle`)
- `type_declaration` → structs, interfaces
- `const_declaration` / `var_declaration` → top-level const/var blocks

**Tasks:**

- [ ] Add tree-sitter dependency: `github.com/smacker/go-tree-sitter`
- [ ] Add Go grammar: `github.com/smacker/go-tree-sitter/golang`
- [ ] Implement `GoChunker` that parses and walks the AST
- [ ] Extract nodes by type, capture line numbers
- [ ] Handle edge cases: empty files, syntax errors (skip gracefully)
- [ ] Chunk size limit: if a function exceeds ~1000 tokens, note it but keep it whole

**Test:** Chunk `nostr/relay.go`, verify functions are extracted correctly.

---

## Phase 4: Embedding Generation

Generate embeddings via the OpenAI API.

**Input:** List of chunks
**Output:** Chunks with embedding vectors

```go
type Embedder interface {
    Embed(ctx context.Context, texts []string) ([][]float32, error)
}

type OpenAIEmbedder struct {
    apiKey string
    model  string // "text-embedding-3-small"
}
```

**Batching:** OpenAI accepts up to 2048 inputs per request. Batch chunks to minimize API calls.

**Tasks:**

- [ ] Implement the OpenAI embedding client (stdlib `net/http`, no SDK)
- [ ] Batch requests (100 chunks per request to stay safely under limits)
- [ ] Handle rate limits with exponential backoff
- [ ] Config: model selection, API key from the `OPENAI_API_KEY` env var

**Test:** Embed a few chunks, verify 1536-dim vectors are returned.

---

## Phase 5: sqlite-vec Storage

Store chunks and embeddings in SQLite with vector search.
**Schema:**

```sql
CREATE TABLE chunks (
    id         INTEGER PRIMARY KEY,
    file       TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line   INTEGER NOT NULL,
    chunk_type TEXT,
    name       TEXT,
    content    TEXT NOT NULL,
    hash       TEXT NOT NULL,
    created_at INTEGER DEFAULT (unixepoch())
);

CREATE TABLE files (
    path       TEXT PRIMARY KEY,
    hash       TEXT NOT NULL,
    indexed_at INTEGER DEFAULT (unixepoch())
);

CREATE VIRTUAL TABLE vec_chunks USING vec0(
    id INTEGER PRIMARY KEY,
    embedding FLOAT[1536]
);
```

**Queries:**

```sql
-- Similarity search
SELECT c.*, vec_distance_cosine(v.embedding, ?) AS distance
FROM vec_chunks v
JOIN chunks c ON c.id = v.id
ORDER BY distance
LIMIT 10;
```

**Tasks:**

- [ ] Add sqlite-vec: `github.com/asg017/sqlite-vec-go-bindings`
- [ ] Initialize the DB with the schema
- [ ] Insert chunks + embeddings
- [ ] Query by vector similarity
- [ ] Store the index in `.codevec/index.db`

**Test:** Insert chunks, query, verify results are ranked by similarity.

---

## Phase 6: CLI Commands

Wire everything together.

### `codevec index <path>`

```
1. Walk directory → file list
2. For each file:
   a. Check if already indexed (compare file hash)
   b. Parse with tree-sitter → chunks
   c. Generate embeddings (batched)
   d. Store in sqlite-vec
3. Update file manifest
4. Print summary
```

**Flags:**

- `--force` — re-index everything
- `--verbose` — show progress

### `codevec query <text>`

```
1. Generate embedding for query text
2. Search sqlite-vec for similar chunks
3. Print results with file:line and similarity score
```

**Flags:**

- `--limit N` — max results (default 10)
- `--threshold F` — min similarity (default 0.5)
- `--show` — print chunk content
- `--json` — output as JSON

### `codevec status`

```
1. Read index.db
2. Print stats: files, chunks, last indexed, model used
```

**Tasks:**

- [ ] Implement the `index` command with a progress bar
- [ ] Implement the `query` command with formatted output
- [ ] Implement the `status` command
- [ ] Add `--json` output for tool integration

---

## Phase 7: Incremental Updates

Only re-index changed files.
**Manifest:** `.codevec/manifest.json`

```json
{
  "files": {
    "src/relay.go": {
      "hash": "sha256:abc...",
      "indexed_at": 1709654400
    }
  },
  "model": "text-embedding-3-small",
  "version": 1
}
```

**Logic:**

1. Walk directory
2. For each file, compute hash
3. If hash matches manifest → skip
4. If hash differs → delete old chunks, re-index
5. If file removed → delete its chunks
6. Update manifest

**Tasks:**

- [ ] Implement file hashing (sha256 of content)
- [ ] Compare against the manifest
- [ ] Delete stale chunks on re-index
- [ ] Handle deleted files

---

## Phase 8: Testing & Polish

- [ ] Unit tests for chunker
- [ ] Unit tests for walker
- [ ] Integration test: index a small repo, query, verify results
- [ ] Error handling: missing API key, parse failures, network errors
- [ ] README with usage examples
- [ ] `make install` to put the binary on PATH

---

## Future (Post-MVP)

- TypeScript chunker (tree-sitter + TS grammar)
- Python chunker
- Ollama embedder for local/offline use
- `codevec serve` HTTP API
- Watch mode (re-index on file change)
- Import/export index

---

## Dependencies

```go
require (
    github.com/smacker/go-tree-sitter v0.0.0-... // Go grammar ships in its golang subpackage
    github.com/asg017/sqlite-vec-go-bindings v0.0.0-...
    github.com/sabhiram/go-gitignore v0.0.0-... // or similar
)
```

---

## Estimated Effort

| Phase | Effort |
|-------|--------|
| 1. Skeleton | 30 min |
| 2. Walker | 1 hr |
| 3. Chunker | 2 hr |
| 4. Embedder | 1 hr |
| 5. Storage | 2 hr |
| 6. CLI | 1 hr |
| 7. Incremental | 1 hr |
| 8. Polish | 1 hr |
| **Total** | ~10 hr |

---

## Open Decisions

1. **CLI framework:** `cobra` vs stdlib `flag`? Leaning stdlib for simplicity.
2. **Config file:** YAML in `.codevec/config.yaml`, or just flags?
3. **Chunk overlap:** Include N lines of context above/below functions?
4. **Test files:** Index `*_test.go` by default or skip?

---

## First Milestone

End of Phase 5: can index a Go repo and query it.

```bash
cd ~/vault/code/nostr
codevec index .
codevec query "publish event to relay"
# → relay.go:45-89 Publish (0.87)
```