# codevec Implementation Plan

**Goal:** Build a CLI that indexes Go codebases for semantic search.

**Scope:** Go-only MVP, then expand to TypeScript/Python.

---

## Phase 1: Project Skeleton

Set up the basic Go project structure.

```
codevec/
├── cmd/
│   └── codevec/
│       └── main.go        # CLI entry point
├── internal/
│   ├── chunker/
│   │   └── chunker.go     # Interface + Go implementation
│   ├── embedder/
│   │   └── embedder.go    # Embedding provider clients (Ollama/OpenAI)
│   ├── index/
│   │   └── index.go       # sqlite-vec storage layer
│   └── walker/
│       └── walker.go      # File discovery + .gitignore
├── go.mod
├── go.sum
├── Makefile
└── README.md
```

**Tasks:**

- [ ] `go mod init code.northwest.io/codevec`
- [ ] Basic CLI with Cobra
- [ ] Subcommands: `index`, `query`, `status`
- [ ] Makefile with `build`, `install`

---

## Phase 2: File Walker

Walk a directory, respect `.gitignore`, filter by extension.

**Input:** Root path
**Output:** List of `.go` files to index

```go
type Walker struct {
	root    string
	ignores []string // from .gitignore
}

func (w *Walker) Walk() ([]string, error)
```

**Tasks:**

- [ ] Implement directory walking with `filepath.WalkDir`
- [ ] Parse `.gitignore` patterns (use `go-gitignore` or similar)
- [ ] Filter to `.go` files only (configurable later)
- [ ] Skip `vendor/`, `testdata/` by default (configurable)

---

## Phase 3: Go Chunker (tree-sitter)

Parse Go files and extract function/type chunks.
**Input:** File path + content
**Output:** List of chunks with metadata

```go
type Chunk struct {
	File      string
	StartLine int
	EndLine   int
	Type      string // "function", "method", "type", "const", "var"
	Name      string // function/type name
	Content   string // raw source code
	Hash      string // sha256 of content
}

type Chunker interface {
	Chunk(path string, content []byte) ([]Chunk, error)
}
```

**Go-specific extraction:**

- `function_declaration` → standalone functions
- `method_declaration` → methods (include receiver in name: `(*Server).Handle`)
- `type_declaration` → structs, interfaces
- `const_declaration` / `var_declaration` → top-level const/var blocks

**Tasks:**

- [ ] Add tree-sitter dependency: `github.com/smacker/go-tree-sitter`
- [ ] Add Go grammar: `github.com/smacker/go-tree-sitter/golang`
- [ ] Implement `GoChunker` that parses and walks the AST
- [ ] Extract nodes by type, capture line numbers
- [ ] Handle edge cases: empty files, syntax errors (skip gracefully)
- [ ] Chunk size limit: if a function exceeds 1000 tokens, note it but keep it whole

---

## Phase 4: Embedding Generation

Provider interface with Ollama (default) and OpenAI-compatible backends.

**Input:** List of chunks
**Output:** Chunks with embedding vectors

```go
// Provider interface — easy to swap
type Embedder interface {
	Embed(ctx context.Context, texts []string) ([][]float32, error)
	Dimensions() int
}

// Ollama provider (default)
type OllamaEmbedder struct {
	baseURL string // default: http://localhost:11434
	model   string // default: nomic-embed-text
}

// OpenAI-compatible provider
type OpenAIEmbedder struct {
	baseURL string // configurable for internal API
	apiKey  string
	model   string // text-embedding-3-small, etc.
}
```

**Provider selection via flag:**

```bash
codevec index . --provider ollama --model nomic-embed-text
codevec index . \
  --provider openai --model text-embedding-3-small
```

**Config:**

- `--provider` — `ollama` (default) or `openai`
- `--model` — model name (provider-specific defaults)
- `CODEVEC_API_KEY` — API key (OpenAI provider)
- `CODEVEC_BASE_URL` — override endpoint (both providers)

**Ollama models for embeddings:**

- `nomic-embed-text` — 768 dims, good general purpose
- `mxbai-embed-large` — 1024 dims, higher quality

**Tasks:**

- [ ] Define `Embedder` interface
- [ ] Implement `OllamaEmbedder` (POST to `/api/embeddings`)
- [ ] Implement `OpenAIEmbedder` (POST to `/v1/embeddings`)
- [ ] Provider factory based on `--provider` flag
- [ ] Batch requests where supported
- [ ] Handle errors gracefully (connection refused, model not found)

---

## Phase 5: sqlite-vec Storage

Store chunks and embeddings in SQLite with vector search.

**Schema:**

```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    chunk_type TEXT,
    name TEXT,
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    created_at INTEGER DEFAULT (unixepoch())
);

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    indexed_at INTEGER DEFAULT (unixepoch())
);

-- Dimension set at index creation based on model:
-- nomic-embed-text: 768, mxbai-embed-large: 1024, text-embedding-3-small: 1536
CREATE VIRTUAL TABLE vec_chunks USING vec0(
    id INTEGER PRIMARY KEY,
    embedding FLOAT[768] -- adjusted per model
);
```

**Queries:**

```sql
-- Similarity search
SELECT c.*, vec_distance_cosine(v.embedding, ?) AS distance
FROM vec_chunks v
JOIN chunks c ON c.id = v.id
ORDER BY distance
LIMIT 10;
```

**Tasks:**

- [ ] Add sqlite-vec: `github.com/asg017/sqlite-vec-go-bindings`
- [ ] Initialize DB with schema
- [ ] Insert chunks + embeddings
- [ ] Query by vector similarity
- [ ] Store in `.codevec/index.db`

---

## Phase 6: CLI Commands

Wire everything together.

### `codevec index <path>`

```
1. Walk directory → file list
2. For each file:
   a. Check if already indexed (compare file hash)
   b. Parse with tree-sitter → chunks
   c. Generate embeddings (batched)
   d. Store in sqlite-vec
3. Update file manifest
4. Print summary
```

**Flags:**

- `--force` — re-index everything
- `--verbose` — show progress

### `codevec query <query>`

```
1. Generate embedding for query text
2. Search sqlite-vec for similar chunks
3. Print results with file:line and similarity score
```

**Flags:**

- `--limit N` — max results (default 10)
- `--threshold F` — min similarity (default 0.5)
- `--show` — print chunk content
- `--json` — output as JSON

### `codevec status`

```
1. Read index.db
2. Print stats: files, chunks, last indexed, model used
```

**Tasks:**

- [ ] Implement `index` command with progress bar
- [ ] Implement `query` command with formatted output
- [ ] Implement `status` command
- [ ] Add `--json` output for tool integration

---

## Phase 7: Incremental Updates

Only re-index changed files.

**Manifest:** `.codevec/manifest.json`

```json
{
  "files": {
    "src/relay.go": {
      "hash": "sha256:abc...",
      "indexed_at": 1709654400
    }
  },
  "provider": "ollama",
  "model": "nomic-embed-text",
  "dimensions": 768,
  "version": 1
}
```

**Logic:**

1. Walk directory
2. For each file, compute hash
3. If hash matches manifest → skip
4. If hash differs → delete old chunks, re-index
5. If file removed → delete chunks
6. Update manifest

**Tasks:**

- [ ] Implement file hashing (sha256 of content)
- [ ] Compare against manifest
- [ ] Delete stale chunks on re-index
- [ ] Handle deleted files

---

## Phase 8: Polish

- [ ] Error handling: missing API key, parse failures, network errors
- [ ] README with usage examples
- [ ] `make install` to put binary in PATH

---

## Future (Post-MVP)

- TypeScript chunker (tree-sitter + TS grammar)
- Python chunker
- `codevec serve` HTTP API
- Watch mode (re-index on file change)
- Import/export index

---

## Dependencies

```go
require (
	github.com/smacker/go-tree-sitter v0.0.0-...
	github.com/smacker/go-tree-sitter/golang v0.0.0-...
	github.com/asg017/sqlite-vec-go-bindings v0.0.0-...
	github.com/sabhiram/go-gitignore v0.0.0-... // or similar
)
```

---

## Decisions

1. **CLI framework:** Cobra
2. **Config:** Flags preferred; config file only if complexity warrants it
3. **Test files:** Index `*_test.go` by default (useful context)
4. **Tests:** None — move fast

---

## First Milestone

End of Phase 5: Can index a Go repo and query it.

```bash
cd ~/vault/code/nostr
codevec index .
codevec query "publish event to relay"
# → relay.go:45-89 Publish (0.87)
```