# codevec Implementation Plan

**Goal:** Build a CLI that indexes Go codebases for semantic search.

**Scope:** Go-only MVP, then expand to TypeScript/Python.

---

## Phase 1: Project Skeleton

Set up the basic Go project structure.

```
codevec/
├── cmd/
│   └── codevec/
│       └── main.go          # CLI entry point
├── internal/
│   ├── chunker/
│   │   └── chunker.go       # Interface + Go implementation
│   ├── embedder/
│   │   └── embedder.go      # OpenAI embedding client
│   ├── index/
│   │   └── index.go         # sqlite-vec storage layer
│   └── walker/
│       └── walker.go        # File discovery + .gitignore
├── go.mod
├── go.sum
├── Makefile
└── README.md
```

**Tasks:**

- [ ] `go mod init code.northwest.io/codevec`
- [ ] Basic CLI with cobra or just the stdlib `flag` package
- [ ] Subcommands: `index`, `query`, `status`
- [ ] Makefile with `build`, `test`, `install` targets

---

## Phase 2: File Walker

Walk a directory, respect `.gitignore`, filter by extension.

**Input:** Root path
**Output:** List of `.go` files to index

```go
type Walker struct {
    root    string
    ignores []string // patterns from .gitignore
}

func (w *Walker) Walk() ([]string, error)
```

**Tasks:**

- [ ] Implement directory walking with `filepath.WalkDir`
- [ ] Parse `.gitignore` patterns (use `go-gitignore` or similar)
- [ ] Filter to `.go` files only (configurable later)
- [ ] Skip `vendor/`, `testdata/`, `*_test.go` by default (configurable)

**Test:** Walk the `nostr` SDK repo, verify the file list is correct.

---

## Phase 3: Go Chunker (tree-sitter)

Parse Go files and extract function/type chunks.
**Input:** File path + content
**Output:** List of chunks with metadata

```go
type Chunk struct {
    File      string
    StartLine int
    EndLine   int
    Type      string // "function", "method", "type", "const", "var"
    Name      string // function/type name
    Content   string // raw source code
    Hash      string // sha256 of content
}

type Chunker interface {
    Chunk(path string, content []byte) ([]Chunk, error)
}
```

**Go-specific extraction:**

- `function_declaration` → standalone functions
- `method_declaration` → methods (include receiver in name: `(*Server).Handle`)
- `type_declaration` → structs, interfaces
- `const_declaration` / `var_declaration` → top-level const/var blocks

**Tasks:**

- [ ] Add tree-sitter dependency: `github.com/smacker/go-tree-sitter`
- [ ] Add Go grammar: `github.com/smacker/go-tree-sitter/golang`
- [ ] Implement `GoChunker` that parses and walks the AST
- [ ] Extract nodes by type, capture line numbers
- [ ] Handle edge cases: empty files, syntax errors (skip gracefully)
- [ ] Chunk size limit: if a function exceeds ~1000 tokens, note it but keep it whole

**Test:** Chunk `nostr/relay.go`, verify functions are extracted correctly.

---

## Phase 4: Embedding Generation

Generate embeddings via the OpenAI API.

**Input:** List of chunks
**Output:** Chunks with embedding vectors

```go
type Embedder interface {
    Embed(ctx context.Context, texts []string) ([][]float32, error)
}

type OpenAIEmbedder struct {
    apiKey string
    model  string // "text-embedding-3-small"
}
```

**Batching:** OpenAI accepts up to 2048 inputs per request. Batch chunks to minimize API calls.

**Tasks:**

- [ ] Implement the OpenAI embedding client (stdlib `net/http`, no SDK)
- [ ] Batch requests (100 chunks per request to stay safely under limits)
- [ ] Handle rate limits with exponential backoff
- [ ] Config: model selection, API key from the `OPENAI_API_KEY` env var

**Test:** Embed a few chunks, verify 1536-dim vectors are returned.

---

## Phase 5: sqlite-vec Storage

Store chunks and embeddings in SQLite with vector search.
**Schema:**

```sql
CREATE TABLE chunks (
    id         INTEGER PRIMARY KEY,
    file       TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line   INTEGER NOT NULL,
    chunk_type TEXT,
    name       TEXT,
    content    TEXT NOT NULL,
    hash       TEXT NOT NULL,
    created_at INTEGER DEFAULT (unixepoch())
);

CREATE TABLE files (
    path       TEXT PRIMARY KEY,
    hash       TEXT NOT NULL,
    indexed_at INTEGER DEFAULT (unixepoch())
);

CREATE VIRTUAL TABLE vec_chunks USING vec0(
    id INTEGER PRIMARY KEY,
    embedding FLOAT[1536]
);
```

**Queries:**

```sql
-- Similarity search
SELECT c.*, vec_distance_cosine(v.embedding, ?) AS distance
FROM vec_chunks v
JOIN chunks c ON c.id = v.id
ORDER BY distance
LIMIT 10;
```

**Tasks:**

- [ ] Add sqlite-vec: `github.com/asg017/sqlite-vec-go-bindings`
- [ ] Initialize the DB with the schema
- [ ] Insert chunks + embeddings
- [ ] Query by vector similarity
- [ ] Store the index in `.codevec/index.db`

**Test:** Insert chunks, query, verify results are ranked by similarity.

---

## Phase 6: CLI Commands

Wire everything together.

### `codevec index <path>`

```
1. Walk directory → file list
2. For each file:
   a. Check if already indexed (compare file hash)
   b. Parse with tree-sitter → chunks
   c. Generate embeddings (batched)
   d. Store in sqlite-vec
3. Update file manifest
4. Print summary
```

**Flags:**

- `--force` — re-index everything
- `--verbose` — show progress

### `codevec query <text>`

```
1. Generate embedding for query text
2. Search sqlite-vec for similar chunks
3. Print results with file:line and similarity score
```

**Flags:**

- `--limit N` — max results (default 10)
- `--threshold F` — min similarity (default 0.5)
- `--show` — print chunk content
- `--json` — output as JSON

### `codevec status`

```
1. Read index.db
2. Print stats: files, chunks, last indexed, model used
```

**Tasks:**

- [ ] Implement the `index` command with a progress bar
- [ ] Implement the `query` command with formatted output
- [ ] Implement the `status` command
- [ ] Add `--json` output for tool integration

---

## Phase 7: Incremental Updates

Only re-index changed files.
**Manifest:** `.codevec/manifest.json`

```json
{
  "files": {
    "src/relay.go": {
      "hash": "sha256:abc...",
      "indexed_at": 1709654400
    }
  },
  "model": "text-embedding-3-small",
  "version": 1
}
```

**Logic:**

1. Walk directory
2. For each file, compute hash
3. If hash matches manifest → skip
4. If hash differs → delete old chunks, re-index
5. If file removed → delete its chunks
6. Update manifest

**Tasks:**

- [ ] Implement file hashing (sha256 of content)
- [ ] Compare against the manifest
- [ ] Delete stale chunks on re-index
- [ ] Handle deleted files

---

## Phase 8: Testing & Polish

- [ ] Unit tests for chunker
- [ ] Unit tests for walker
- [ ] Integration test: index a small repo, query, verify results
- [ ] Error handling: missing API key, parse failures, network errors
- [ ] README with usage examples
- [ ] `make install` to put the binary on PATH

---

## Future (Post-MVP)

- TypeScript chunker (tree-sitter + TS grammar)
- Python chunker
- Ollama embedder for local/offline use
- `codevec serve` HTTP API
- Watch mode (re-index on file change)
- Import/export index

---

## Dependencies

```go
require (
    github.com/smacker/go-tree-sitter v0.0.0-... // Go grammar ships in its golang subpackage
    github.com/asg017/sqlite-vec-go-bindings v0.0.0-...
    github.com/sabhiram/go-gitignore v0.0.0-... // or similar
)
```

---

## Estimated Effort

| Phase | Effort |
|-------|--------|
| 1. Skeleton | 30 min |
| 2. Walker | 1 hr |
| 3. Chunker | 2 hr |
| 4. Embedder | 1 hr |
| 5. Storage | 2 hr |
| 6. CLI | 1 hr |
| 7. Incremental | 1 hr |
| 8. Polish | 1 hr |
| **Total** | ~10 hr |

---

## Open Decisions

1. **CLI framework:** `cobra` vs stdlib `flag`? Leaning stdlib for simplicity.
2. **Config file:** YAML in `.codevec/config.yaml`, or just flags?
3. **Chunk overlap:** Include N lines of context above/below functions?
4. **Test files:** Index `*_test.go` by default or skip?

---

## First Milestone

End of Phase 5: can index a Go repo and query it.

```bash
cd ~/vault/code/nostr
codevec index .
codevec query "publish event to relay"
# → relay.go:45-89 Publish (0.87)
```