# codevec Implementation Plan

**Goal:** Build a CLI that indexes Go codebases for semantic search.

**Scope:** Go-only MVP, then expand to TypeScript/Python.

---

## Phase 1: Project Skeleton

Set up the basic Go project structure.

```
codevec/
├── cmd/
│   └── codevec/
│       └── main.go          # CLI entry point
├── internal/
│   ├── chunker/
│   │   └── chunker.go       # Interface + Go implementation
│   ├── embedder/
│   │   └── embedder.go      # OpenAI embedding client
│   ├── index/
│   │   └── index.go         # sqlite-vec storage layer
│   └── walker/
│       └── walker.go        # File discovery + .gitignore
├── go.mod
├── go.sum
├── Makefile
└── README.md
```

**Tasks:**
- [ ] `go mod init code.northwest.io/codevec`
- [ ] Basic CLI with Cobra
- [ ] Subcommands: `index`, `query`, `status`
- [ ] Makefile with `build`, `install`

---

## Phase 2: File Walker

Walk a directory, respect .gitignore, filter by extension.

**Input:** Root path

**Output:** List of `.go` files to index

```go
type Walker struct {
    root    string
    ignores []string // from .gitignore
}

func (w *Walker) Walk() ([]string, error)
```

**Tasks:**
- [ ] Implement directory walking with `filepath.WalkDir`
- [ ] Parse `.gitignore` patterns (use `go-gitignore` or similar)
- [ ] Filter to `.go` files only (configurable later)
- [ ] Skip `vendor/`, `testdata/` by default (configurable)

---

## Phase 3: Go Chunker (tree-sitter)

Parse Go files and extract function/type chunks.

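
To make the extraction target concrete, here is a minimal sketch that pulls top-level functions and methods out of a Go file. It deliberately swaps in the standard library's `go/parser` as a stand-in for tree-sitter, so it is not the planned implementation; the names `decl`, `receiverName`, and `extractFuncs` are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// decl is a hypothetical, trimmed-down record used only in this sketch.
type decl struct {
	Kind      string // "function" or "method"
	Name      string // methods carry the receiver, e.g. "(*Server).Handle"
	StartLine int
	EndLine   int
}

// receiverName renders a receiver type expression such as *Server or Server.
func receiverName(expr ast.Expr) string {
	switch t := expr.(type) {
	case *ast.StarExpr:
		return "*" + receiverName(t.X)
	case *ast.Ident:
		return t.Name
	case *ast.IndexExpr: // generic receiver such as List[T]
		return receiverName(t.X)
	default:
		return "?"
	}
}

// extractFuncs parses src and returns one decl per top-level function or method.
// A parse error is returned to the caller, which can then skip the file.
func extractFuncs(path string, src []byte) ([]decl, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, path, src, 0)
	if err != nil {
		return nil, err
	}
	var out []decl
	for _, d := range file.Decls {
		fn, ok := d.(*ast.FuncDecl)
		if !ok {
			continue // type, const, and var declarations are ignored here
		}
		item := decl{
			Kind:      "function",
			Name:      fn.Name.Name,
			StartLine: fset.Position(fn.Pos()).Line,
			EndLine:   fset.Position(fn.End()).Line,
		}
		if fn.Recv != nil && len(fn.Recv.List) > 0 {
			item.Kind = "method"
			item.Name = "(" + receiverName(fn.Recv.List[0].Type) + ")." + fn.Name.Name
		}
		out = append(out, item)
	}
	return out, nil
}

func main() {
	src := []byte("package demo\n\ntype Server struct{}\n\nfunc (s *Server) Handle() {\n}\n\nfunc Run() {}\n")
	decls, err := extractFuncs("demo.go", src)
	if err != nil {
		panic(err)
	}
	for _, d := range decls {
		fmt.Printf("%s %s lines %d-%d\n", d.Kind, d.Name, d.StartLine, d.EndLine)
	}
}
```

A tree-sitter version would walk node types instead of `ast.FuncDecl`, but the naming and line-number logic should carry over largely unchanged.
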

**Input:** File path + content

**Output:** List of chunks with metadata

```go
type Chunk struct {
    File      string
    StartLine int
    EndLine   int
    Type      string // "function", "method", "type", "const", "var"
    Name      string // function/type name
    Content   string // raw source code
    Hash      string // sha256 of content
}

type Chunker interface {
    Chunk(path string, content []byte) ([]Chunk, error)
}
```

**Go-specific extraction:**
- `function_declaration` → standalone functions
- `method_declaration` → methods (include receiver in name: `(*Server).Handle`)
- `type_declaration` → structs, interfaces
- `const_declaration` / `var_declaration` → top-level const/var blocks

**Tasks:**
- [ ] Add tree-sitter dependency: `github.com/smacker/go-tree-sitter`
- [ ] Add Go grammar: `github.com/smacker/go-tree-sitter/golang`
- [ ] Implement `GoChunker` that parses the file and walks the AST
- [ ] Extract nodes by type, capture line numbers
- [ ] Handle edge cases: empty files, syntax errors (skip gracefully)
- [ ] Chunk size limit: if a function exceeds 1000 tokens, note it but keep it whole

---

## Phase 4: Embedding Generation

Generate embeddings via an OpenAI-compatible API (internal endpoint).

**Input:** List of chunks

**Output:** Chunks with embedding vectors

```go
type Embedder interface {
    Embed(ctx context.Context, texts []string) ([][]float32, error)
}

// OpenAIEmbedder implements Embedder. It needs a name distinct from the
// interface: Go does not allow two types named Embedder in one package.
type OpenAIEmbedder struct {
    baseURL string // defaults to OpenAI, configurable for internal API
    apiKey  string
    model   string // "text-embedding-3-small"
}
```

**Batching:** Batch chunks to minimize API calls (~100 chunks per request).

**Config:**
- `OPENAI_API_KEY`: API key (standard env var)
- `OPENAI_BASE_URL`: Override endpoint for internal API (optional)
- `--model` flag for model selection

**Tasks:**
- [ ] Implement OpenAI-compatible embedding client (stdlib `net/http`)
- [ ] Support custom base URL for internal API
- [ ] Batch requests
- [ ] Handle rate limits with exponential backoff

---

## Phase 5: sqlite-vec Storage

Store chunks and embeddings in SQLite with vector search.

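
Embeddings come back from the API as `[]float32`, but they have to be bound to SQL statements in some serialized form. The sketch below packs a vector as consecutive little-endian float32 values, on the assumption that this matches the compact binary layout sqlite-vec accepts (worth verifying against the bindings, which may already ship a helper for this). `encodeVector`, `decodeVector`, and `dims` are hypothetical names for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// dims mirrors the FLOAT[1536] column declared in this phase's schema.
const dims = 1536

// encodeVector packs v as consecutive little-endian float32 values.
// Assumption: this is the binary layout the vector column expects.
func encodeVector(v []float32) ([]byte, error) {
	if len(v) != dims {
		return nil, fmt.Errorf("embedding has %d dimensions, want %d", len(v), dims)
	}
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf, nil
}

// decodeVector reverses encodeVector.
func decodeVector(b []byte) ([]float32, error) {
	if len(b) != 4*dims {
		return nil, fmt.Errorf("blob is %d bytes, want %d", len(b), 4*dims)
	}
	v := make([]float32, dims)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}

func main() {
	v := make([]float32, dims)
	v[0] = 1
	blob, err := encodeVector(v)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(blob), "bytes per embedding")
}
```

Rejecting vectors of the wrong length up front also catches the case where `--model` selects a model whose output size differs from the column definition.
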

**Schema:**

```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    chunk_type TEXT,
    name TEXT,
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    created_at INTEGER DEFAULT (unixepoch())
);

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    indexed_at INTEGER DEFAULT (unixepoch())
);

CREATE VIRTUAL TABLE vec_chunks USING vec0(
    id INTEGER PRIMARY KEY,
    embedding FLOAT[1536]
);
```

**Queries:**

```sql
-- Similarity search
SELECT c.*, vec_distance_cosine(v.embedding, ?) as distance
FROM vec_chunks v
JOIN chunks c ON c.id = v.id
ORDER BY distance
LIMIT 10;
```

**Tasks:**
- [ ] Add sqlite-vec: `github.com/asg017/sqlite-vec-go-bindings`
- [ ] Initialize DB with schema
- [ ] Insert chunks + embeddings
- [ ] Query by vector similarity
- [ ] Store in `.codevec/index.db`

---

## Phase 6: CLI Commands

Wire everything together.

### `codevec index`

```
1. Walk directory → file list
2. For each file:
   a. Check if already indexed (compare file hash)
   b. Parse with tree-sitter → chunks
   c. Generate embeddings (batched)
   d. Store in sqlite-vec
3. Update file manifest
4. Print summary
```

**Flags:**
- `--force`: re-index everything
- `--verbose`: show progress

### `codevec query`

```
1. Generate embedding for query text
2. Search sqlite-vec for similar chunks
3. Print results with file:line and similarity score
```

**Flags:**
- `--limit N`: max results (default 10)
- `--threshold F`: min similarity (default 0.5)
- `--show`: print chunk content
- `--json`: output as JSON

### `codevec status`

```
1. Read index.db
2. Print stats: files, chunks, last indexed, model used
```

**Tasks:**
- [ ] Implement `index` command with progress bar
- [ ] Implement `query` command with formatted output
- [ ] Implement `status` command
- [ ] Add `--json` output for tool integration

---

## Phase 7: Incremental Updates

Only re-index changed files.

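
The heart of this phase is a hash comparison between what was indexed last time and what is on disk now. A minimal sketch follows, assuming the previous state is available as a path-to-hash map and file contents are already loaded; `hashContent`, `plan`, and `planUpdate` are hypothetical names for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashContent returns the content hash in "sha256:<hex>" form.
func hashContent(content []byte) string {
	sum := sha256.Sum256(content)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// plan lists what an incremental run has to do.
type plan struct {
	Reindex []string // new or changed files
	Remove  []string // previously indexed files that no longer exist
	Skip    []string // unchanged files
}

// planUpdate compares previously recorded hashes against current contents.
func planUpdate(recorded map[string]string, current map[string][]byte) plan {
	var p plan
	for path, content := range current {
		if old, ok := recorded[path]; ok && old == hashContent(content) {
			p.Skip = append(p.Skip, path)
		} else {
			p.Reindex = append(p.Reindex, path)
		}
	}
	for path := range recorded {
		if _, ok := current[path]; !ok {
			p.Remove = append(p.Remove, path)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(p.Reindex)
	sort.Strings(p.Remove)
	sort.Strings(p.Skip)
	return p
}

func main() {
	recorded := map[string]string{
		"a.go":    hashContent([]byte("package a")),
		"b.go":    hashContent([]byte("package b // old")),
		"gone.go": hashContent([]byte("package gone")),
	}
	current := map[string][]byte{
		"a.go": []byte("package a"),
		"b.go": []byte("package b // new"),
		"c.go": []byte("package c"),
	}
	p := planUpdate(recorded, current)
	fmt.Println("reindex:", p.Reindex)
	fmt.Println("remove: ", p.Remove)
	fmt.Println("skip:   ", p.Skip)
}
```

Files in `Reindex` that already have chunks need those chunks deleted before the new ones are inserted; files in `Remove` only need the delete.
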

**Manifest:** `.codevec/manifest.json`

```json
{
  "files": {
    "src/relay.go": {
      "hash": "sha256:abc...",
      "indexed_at": 1709654400
    }
  },
  "model": "text-embedding-3-small",
  "version": 1
}
```

**Logic:**
1. Walk directory
2. For each file, compute hash
3. If hash matches manifest → skip
4. If hash differs → delete old chunks, re-index
5. If file removed → delete chunks
6. Update manifest

**Tasks:**
- [ ] Implement file hashing (sha256 of content)
- [ ] Compare against manifest
- [ ] Delete stale chunks on re-index
- [ ] Handle deleted files

---

## Phase 8: Polish

- [ ] Error handling: missing API key, parse failures, network errors
- [ ] README with usage examples
- [ ] `make install` to put binary in PATH

---

## Future (Post-MVP)

- TypeScript chunker (tree-sitter + TS grammar)
- Python chunker
- Ollama embedder for local/offline use
- `codevec serve` HTTP API
- Watch mode (re-index on file change)
- Import/export index

---

## Dependencies

```go
require (
    github.com/smacker/go-tree-sitter v0.0.0-... // the golang grammar is a package inside this module
    github.com/asg017/sqlite-vec-go-bindings v0.0.0-...
    github.com/sabhiram/go-gitignore v0.0.0-... // or similar
)
```

---

## Estimated Effort

| Phase | Effort |
|-------|--------|
| 1. Skeleton | 30 min |
| 2. Walker | 1 hr |
| 3. Chunker | 2 hr |
| 4. Embedder | 1 hr |
| 5. Storage | 2 hr |
| 6. CLI | 1 hr |
| 7. Incremental | 1 hr |
| 8. Polish | 30 min |
| **Total** | ~9 hr |

---

## Decisions

1. **CLI framework:** Cobra
2. **Config:** Flags preferred; config file only if complexity warrants it
3. **Test files:** Index `*_test.go` by default (useful context)
4. **Tests:** None; move fast

---

## First Milestone

End of Phase 5: Can index a Go repo and query it.

```bash
cd ~/vault/code/nostr
codevec index .
codevec query "publish event to relay"
# → relay.go:45-89  Publish  (0.87)
```
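
The score in the sample output and the `--threshold` flag are similarities, whereas the Phase 5 query orders results by `vec_distance_cosine`. The sketch below shows one way to relate the two, assuming the distance follows the usual convention of 1 minus cosine similarity (worth confirming against sqlite-vec before relying on it). `cosineDistance`, `similarity`, and `passes` are hypothetical helpers for this example.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b), computed in float64 for accuracy.
func cosineDistance(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0, fmt.Errorf("cosine distance is undefined for a zero vector")
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb)), nil
}

// similarity converts a cosine distance into the score shown to the user.
func similarity(distance float64) float64 {
	return 1 - distance
}

// passes reports whether a result clears the minimum-similarity threshold.
func passes(distance, threshold float64) bool {
	return similarity(distance) >= threshold
}

func main() {
	query := []float32{1, 1, 0}
	chunk := []float32{1, 0, 0}
	d, err := cosineDistance(query, chunk)
	if err != nil {
		panic(err)
	}
	fmt.Printf("distance=%.3f similarity=%.3f keep=%v\n", d, similarity(d), passes(d, 0.5))
}
```

Under this convention a default threshold of 0.5 corresponds to keeping rows whose distance is at most 0.5, which could also be applied in the SQL query itself.
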