# codevec Implementation Plan

**Goal:** Build a CLI that indexes Go codebases for semantic search.

**Scope:** Go-only MVP, then expand to TypeScript/Python.

---

## Phase 1: Project Skeleton

Set up the basic Go project structure.

```
codevec/
├── cmd/
│   └── codevec/
│       └── main.go        # CLI entry point
├── internal/
│   ├── chunker/
│   │   └── chunker.go     # Interface + Go implementation
│   ├── embedder/
│   │   └── embedder.go    # Embedding provider clients (Ollama/OpenAI)
│   ├── index/
│   │   └── index.go       # sqlite-vec storage layer
│   └── walker/
│       └── walker.go      # File discovery + .gitignore
├── go.mod
├── go.sum
├── Makefile
└── README.md
```

**Tasks:**

- [ ] `go mod init code.northwest.io/codevec`
- [ ] Basic CLI with Cobra
- [ ] Subcommands: `index`, `query`, `status`
- [ ] Makefile with `build`, `install`

---

## Phase 2: File Walker

Walk a directory, respect `.gitignore`, filter by extension.

**Input:** Root path
**Output:** List of `.go` files to index

```go
type Walker struct {
	root    string
	ignores []string // from .gitignore
}

func (w *Walker) Walk() ([]string, error)
```

**Tasks:**

- [ ] Implement directory walking with `filepath.WalkDir`
- [ ] Parse `.gitignore` patterns (use `go-gitignore` or similar)
- [ ] Filter to `.go` files only (configurable later)
- [ ] Skip `vendor/`, `testdata/` by default (configurable)

---

## Phase 3: Go Chunker (tree-sitter)

Parse Go files and extract function/type chunks.
**Input:** File path + content
**Output:** List of chunks with metadata

```go
type Chunk struct {
	File      string
	StartLine int
	EndLine   int
	Type      string // "function", "method", "type", "const", "var"
	Name      string // function/type name
	Content   string // raw source code
	Hash      string // sha256 of content
}

type Chunker interface {
	Chunk(path string, content []byte) ([]Chunk, error)
}
```

**Go-specific extraction:**

- `function_declaration` → standalone functions
- `method_declaration` → methods (include receiver in name: `(*Server).Handle`)
- `type_declaration` → structs, interfaces
- `const_declaration` / `var_declaration` → top-level const/var blocks

**Tasks:**

- [ ] Add tree-sitter dependency: `github.com/smacker/go-tree-sitter`
- [ ] Add Go grammar: `github.com/smacker/go-tree-sitter/golang`
- [ ] Implement `GoChunker` that parses and walks the AST
- [ ] Extract nodes by type, capture line numbers
- [ ] Handle edge cases: empty files, syntax errors (skip gracefully)
- [ ] Chunk size limit: if a function exceeds 1000 tokens, note it but keep it whole

---

## Phase 4: Embedding Generation

Provider interface with Ollama (default) and OpenAI-compatible backends.

**Input:** List of chunks
**Output:** Chunks with embedding vectors

```go
// Provider interface — easy to swap
type Embedder interface {
	Embed(ctx context.Context, texts []string) ([][]float32, error)
	Dimensions() int
}

// Ollama provider (default)
type OllamaEmbedder struct {
	baseURL string // default: http://localhost:11434
	model   string // default: nomic-embed-text
}

// OpenAI-compatible provider
type OpenAIEmbedder struct {
	baseURL string // configurable for internal API
	apiKey  string
	model   string // text-embedding-3-small, etc.
}
```

**Provider selection via flag:**

```bash
codevec index . --provider ollama --model nomic-embed-text
codevec index . \
  --provider openai --model text-embedding-3-small
```

**Config:**

- `--provider` — `ollama` (default) or `openai`
- `--model` — model name (provider-specific defaults)
- `CODEVEC_API_KEY` — API key (OpenAI provider)
- `CODEVEC_BASE_URL` — override endpoint (both providers)

**Ollama models for embeddings:**

- `nomic-embed-text` — 768 dims, good general purpose
- `mxbai-embed-large` — 1024 dims, higher quality

**Tasks:**

- [ ] Define `Embedder` interface
- [ ] Implement `OllamaEmbedder` (POST to `/api/embeddings`)
- [ ] Implement `OpenAIEmbedder` (POST to `/v1/embeddings`)
- [ ] Provider factory based on `--provider` flag
- [ ] Batch requests where supported
- [ ] Handle errors gracefully (connection refused, model not found)

---

## Phase 5: sqlite-vec Storage

Store chunks and embeddings in SQLite with vector search.

**Schema:**

```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    chunk_type TEXT,
    name TEXT,
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    created_at INTEGER DEFAULT (unixepoch())
);

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    indexed_at INTEGER DEFAULT (unixepoch())
);

-- Dimension set at index creation based on model:
-- nomic-embed-text: 768, mxbai-embed-large: 1024, text-embedding-3-small: 1536
CREATE VIRTUAL TABLE vec_chunks USING vec0(
    id INTEGER PRIMARY KEY,
    embedding FLOAT[768] -- adjusted per model
);
```

**Queries:**

```sql
-- Similarity search
SELECT c.*, vec_distance_cosine(v.embedding, ?) AS distance
FROM vec_chunks v
JOIN chunks c ON c.id = v.id
ORDER BY distance
LIMIT 10;
```

**Tasks:**

- [ ] Add sqlite-vec: `github.com/asg017/sqlite-vec-go-bindings`
- [ ] Initialize DB with schema
- [ ] Insert chunks + embeddings
- [ ] Query by vector similarity
- [ ] Store in `.codevec/index.db`

---

## Phase 6: CLI Commands

Wire everything together.

### `codevec index <path>`

```
1. Walk directory → file list
2. For each file:
   a. Check if already indexed (compare file hash)
   b. Parse with tree-sitter → chunks
   c. Generate embeddings (batched)
   d. Store in sqlite-vec
3. Update file manifest
4. Print summary
```

**Flags:**

- `--force` — re-index everything
- `--verbose` — show progress

### `codevec query <query>`

```
1. Generate embedding for query text
2. Search sqlite-vec for similar chunks
3. Print results with file:line and similarity score
```

**Flags:**

- `--limit N` — max results (default 10)
- `--threshold F` — min similarity (default 0.5)
- `--show` — print chunk content
- `--json` — output as JSON

### `codevec status`

```
1. Read index.db
2. Print stats: files, chunks, last indexed, model used
```

**Tasks:**

- [ ] Implement `index` command with progress bar
- [ ] Implement `query` command with formatted output
- [ ] Implement `status` command
- [ ] Add `--json` output for tool integration

---

## Phase 7: Incremental Updates

Only re-index changed files.

**Manifest:** `.codevec/manifest.json`

```json
{
  "files": {
    "src/relay.go": {
      "hash": "sha256:abc...",
      "indexed_at": 1709654400
    }
  },
  "provider": "ollama",
  "model": "nomic-embed-text",
  "dimensions": 768,
  "version": 1
}
```

**Logic:**

1. Walk directory
2. For each file, compute hash
3. If hash matches manifest → skip
4. If hash differs → delete old chunks, re-index
5. If file removed → delete chunks
6. Update manifest

**Tasks:**

- [ ] Implement file hashing (sha256 of content)
- [ ] Compare against manifest
- [ ] Delete stale chunks on re-index
- [ ] Handle deleted files

---

## Phase 8: Polish

- [ ] Error handling: missing API key, parse failures, network errors
- [ ] README with usage examples
- [ ] `make install` to put binary in PATH

---

## Future (Post-MVP)

- TypeScript chunker (tree-sitter + TS grammar)
- Python chunker
- `codevec serve` HTTP API
- Watch mode (re-index on file change)
- Import/export index

---

## Dependencies

```go
require (
	github.com/smacker/go-tree-sitter v0.0.0-...
	github.com/smacker/go-tree-sitter/golang v0.0.0-...
	github.com/asg017/sqlite-vec-go-bindings v0.0.0-...
	github.com/sabhiram/go-gitignore v0.0.0-... // or similar
)
```

---

## Decisions

1. **CLI framework:** Cobra
2. **Config:** Flags preferred; config file only if complexity warrants it
3. **Test files:** Index `*_test.go` by default (useful context)
4. **Tests:** None — move fast

---

## First Milestone

End of Phase 5: Can index a Go repo and query it.

```bash
cd ~/vault/code/nostr
codevec index .
codevec query "publish event to relay"
# → relay.go:45-89 Publish (0.87)
```