# codevec

**Semantic code search via embeddings**

A CLI that indexes codebases for semantic search. Query by concept, get relevant code chunks with file paths and line numbers.

## Problem

Searching code by keywords (`grep`, `ripgrep`) misses semantic matches:
- "authentication" won't find `verifyJWT()` 
- "handle errors" won't find `if err != nil { return }`
- "database connection" won't find `sql.Open()`

AI coding assistants spend tokens reading files to find relevant code. Pre-computed embeddings let them jump straight to what matters.

## Usage

```bash
# Index current directory
codevec index .

# Query semantically
codevec query "websocket connection handling"
# src/relay.go:45-89 (0.87)
# src/handler.go:102-145 (0.82)

# Query with filters
codevec query "error handling" --ext .go --limit 5

# Show chunk content
codevec query "authentication" --show

# Re-index (incremental, respects .gitignore)
codevec index . --update
```

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         codevec                             │
│                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐   │
│  │  Parser  │───▶│ Chunker  │───▶│ Embedding Generator  │   │
│  └──────────┘    └──────────┘    └──────────┬───────────┘   │
│       │                                      │              │
│       │ file list                            │ vectors      │
│       ▼                                      ▼              │
│  ┌────────────┐                       ┌──────────────┐      │
│  │ .gitignore │                       │  sqlite-vec  │      │
│  │   filter   │                       │    index     │      │
│  └────────────┘                       └──────────────┘      │
└─────────────────────────────────────────────────────────────┘

Storage: .codevec/
├── index.db        # SQLite + sqlite-vec
├── config.json     # Index settings (model, chunk size, etc.)
└── manifest.json   # File hashes for incremental updates
```

## Chunking Strategy

**Goal:** Create semantically meaningful chunks that respect code boundaries.

### Approach 1: AST-Aware (preferred for supported languages)

Use tree-sitter to parse and chunk by:
- Functions/methods
- Classes/structs
- Top-level declarations

```go
// Chunk: function
// File: src/auth.go:15-42
func VerifyToken(token string) (*Claims, error) {
    // ...
}
```
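
As a rough illustration of function-level chunking, the sketch below uses Go's standard `go/parser` and `go/ast` packages as a stand-in for tree-sitter, so it only handles Go source. `FuncChunk` and `ChunkGoFunctions` are hypothetical names, not part of codevec.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
)

// FuncChunk is a hypothetical record for one function-level chunk.
type FuncChunk struct {
	Name      string
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
}

// ChunkGoFunctions parses Go source and returns one chunk per top-level
// function or method.
func ChunkGoFunctions(filename string, src []byte) ([]FuncChunk, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var chunks []FuncChunk
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue // types, vars, and consts would need their own handling
		}
		start := fn.Pos()
		if fn.Doc != nil {
			start = fn.Doc.Pos() // keep the doc comment with its function
		}
		chunks = append(chunks, FuncChunk{
			Name:      fn.Name.Name,
			StartLine: fset.Position(start).Line,
			EndLine:   fset.Position(fn.End()).Line,
		})
	}
	return chunks, nil
}
```

A tree-sitter version would walk the syntax tree the same way, but could share one code path across every language with a grammar.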

### Approach 2: Sliding Window (fallback)

For unsupported languages or when AST parsing fails:
- Fixed-size chunks with overlap
- Respect line boundaries
- Include context (file path, surrounding lines)
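
A minimal sketch of the fallback, assuming windows are measured in lines rather than tokens to keep the example short. `WindowChunk` and `SlidingWindow` are hypothetical names.

```go
package main

import "strings"

// WindowChunk is a hypothetical record for one fallback chunk.
type WindowChunk struct {
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
	Content   string
}

// SlidingWindow splits text into chunks of at most size lines. Each chunk
// after the first repeats the last overlap lines of the previous chunk.
func SlidingWindow(text string, size, overlap int) []WindowChunk {
	if size <= 0 {
		return nil
	}
	if overlap < 0 || overlap >= size {
		overlap = 0 // an overlap >= size would never advance
	}
	lines := strings.Split(strings.TrimRight(text, "\n"), "\n")
	if len(lines) == 1 && lines[0] == "" {
		return nil // empty input
	}
	var chunks []WindowChunk
	step := size - overlap
	for start := 0; start < len(lines); start += step {
		end := start + size
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, WindowChunk{
			StartLine: start + 1,
			EndLine:   end,
			Content:   strings.Join(lines[start:end], "\n"),
		})
		if end == len(lines) {
			break // the last window already reached the end of the file
		}
	}
	return chunks
}
```

Splitting on whole lines keeps the reported `start_line` and `end_line` exact. A token-based limit would need a tokenizer matched to the embedding model.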

### Chunk Metadata

Each chunk stores:
```json
{
  "file": "src/auth.go",
  "start_line": 15,
  "end_line": 42,
  "type": "function",
  "name": "VerifyToken",
  "content": "func VerifyToken...",
  "hash": "abc123"
}
```
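
In Go, the metadata could map onto a struct like the one below. This is a sketch: the `Chunk` type and `NewChunk` helper are hypothetical, and SHA-256 is an assumption, since the spec does not name the algorithm behind the chunk `hash`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// Chunk mirrors the metadata fields listed above.
type Chunk struct {
	File      string `json:"file"`
	StartLine int    `json:"start_line"`
	EndLine   int    `json:"end_line"`
	Type      string `json:"type"`
	Name      string `json:"name"`
	Content   string `json:"content"`
	Hash      string `json:"hash"`
}

// NewChunk builds a Chunk and derives Hash from the content.
// Assumption: hex-encoded SHA-256 of the chunk text.
func NewChunk(file string, start, end int, typ, name, content string) Chunk {
	sum := sha256.Sum256([]byte(content))
	return Chunk{
		File:      file,
		StartLine: start,
		EndLine:   end,
		Type:      typ,
		Name:      name,
		Content:   content,
		Hash:      hex.EncodeToString(sum[:]),
	}
}

// JSON renders the chunk in the shape shown above.
func (c Chunk) JSON() (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}
```

Hashing the content rather than the location means a chunk that only moved within a file keeps its hash, which could let the indexer reuse its embedding.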

## Database Schema

```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    chunk_type TEXT,  -- function, class, block, etc.
    name TEXT,        -- function/class name if available
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    created_at INTEGER
);

CREATE TABLE embeddings (
    chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id),
    embedding BLOB NOT NULL  -- sqlite-vec vector
);

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    indexed_at INTEGER
);

-- sqlite-vec virtual table for similarity search
CREATE VIRTUAL TABLE vec_chunks USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[1536]  -- must match the model's output size (1536 for text-embedding-3-small)
);
```
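
To show what the `embedding BLOB` column might hold, here is a sketch that packs a vector into the compact little-endian float32 layout that sqlite-vec accepts. `SerializeVector`, `DeserializeVector`, and `EmbeddingDim` are hypothetical names, and the dimension check assumes the 1536-dim schema above.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// EmbeddingDim must match the FLOAT[...] size declared on vec_chunks.
const EmbeddingDim = 1536

// SerializeVector packs a vector as consecutive little-endian float32 values.
func SerializeVector(v []float32) ([]byte, error) {
	if len(v) != EmbeddingDim {
		return nil, fmt.Errorf("embedding has %d dims, want %d", len(v), EmbeddingDim)
	}
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf, nil
}

// DeserializeVector reverses SerializeVector.
func DeserializeVector(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of 4", len(b))
	}
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}
```

Rejecting vectors of the wrong length early matters if the model changes: an index built with a 1536-dim model cannot accept vectors of another size without being rebuilt.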

## Embedding Generation

### Options

1. **OpenAI** — `text-embedding-3-small` (1536 dims, fast, cheap)
2. **Ollama** — Local models (`nomic-embed-text`, `mxbai-embed-large`)
3. **Voyage** — Code-specific embeddings (`voyage-code-2`)

### Configuration

```json
{
  "model": "openai:text-embedding-3-small",
  "chunk_max_tokens": 512,
  "chunk_overlap": 50,
  "languages": ["go", "typescript", "python"],
  "ignore": ["vendor/", "node_modules/", "*.min.js"]
}
```
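
One way to load this file is sketched below. The `Config` type and both helpers are hypothetical, and treating 512 and 50 as defaults is an assumption; only the default model is stated elsewhere in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config mirrors the keys of .codevec/config.json.
type Config struct {
	Model          string   `json:"model"`
	ChunkMaxTokens int      `json:"chunk_max_tokens"`
	ChunkOverlap   int      `json:"chunk_overlap"`
	Languages      []string `json:"languages"`
	Ignore         []string `json:"ignore"`
}

// ParseConfig decodes config JSON on top of defaults, so keys missing from
// the file keep their default values.
func ParseConfig(data []byte) (Config, error) {
	cfg := Config{
		Model:          "openai:text-embedding-3-small",
		ChunkMaxTokens: 512, // assumed default
		ChunkOverlap:   50,  // assumed default
	}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	if cfg.ChunkOverlap >= cfg.ChunkMaxTokens {
		return Config{}, fmt.Errorf("chunk_overlap (%d) must be smaller than chunk_max_tokens (%d)",
			cfg.ChunkOverlap, cfg.ChunkMaxTokens)
	}
	return cfg, nil
}

// SplitModel splits "provider:name" at the first colon, so model names that
// themselves contain a colon stay intact.
func SplitModel(model string) (string, string, error) {
	provider, name, ok := strings.Cut(model, ":")
	if !ok || provider == "" || name == "" {
		return "", "", fmt.Errorf("model %q is not in provider:name form", model)
	}
	return provider, name, nil
}
```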

## CLI Commands

### `codevec index <path>`

Index a directory.

```
Flags:
  --model        Embedding model (default: openai:text-embedding-3-small)
  --update       Incremental update (only changed files)
  --force        Re-index everything
  --ignore       Additional ignore patterns
  --verbose      Show progress
```

### `codevec query <text>`

Search for relevant code.

```
Flags:
  --limit        Max results (default: 10)
  --threshold    Min similarity score (default: 0.5)
  --ext          Filter by extension (.go, .ts, etc.)
  --file         Filter by file path pattern
  --show         Print chunk content
  --json         Output as JSON
```
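
The interaction between `--threshold` and `--limit` can be sketched as a brute-force scan. This assumes the score is cosine similarity, which the spec does not state, and in the real tool the `vec_chunks` table would presumably do the search instead. `Scored`, `CosineSimilarity`, and `TopK` are hypothetical names.

```go
package main

import (
	"math"
	"sort"
)

// Scored pairs a chunk ID with its similarity to the query.
type Scored struct {
	ChunkID int64
	Score   float64
}

// CosineSimilarity returns the cosine of the angle between a and b,
// or 0 when the vectors are empty, mismatched, or all zeros.
func CosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// TopK scores every candidate, drops those below threshold, and returns at
// most limit results, best first (ties broken by chunk ID).
func TopK(query []float32, candidates map[int64][]float32, threshold float64, limit int) []Scored {
	var out []Scored
	for id, vec := range candidates {
		if s := CosineSimilarity(query, vec); s >= threshold {
			out = append(out, Scored{ChunkID: id, Score: s})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ChunkID < out[j].ChunkID
	})
	if limit >= 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}
```

The threshold is applied before the limit, so a query with few good matches returns fewer than `--limit` results rather than padding the list with weak ones.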

### `codevec status`

Show index stats.

```
Index: .codevec/index.db
Files: 142
Chunks: 1,847
Model: openai:text-embedding-3-small
Last indexed: 2 hours ago
```

### `codevec serve`

Optional: run as an HTTP server for integration with other tools.

```
GET /query?q=authentication&limit=10
POST /index (webhook for CI)
```

## Integration with claude-flow

Add a `CodeSearch` tool that shells out to codevec:

```typescript
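// In claude-flow's tool definitions
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

const codeSearchTool = {
  name: "CodeSearch",
  description: "Search codebase semantically. Use before Read to find relevant files.",
  parameters: {
    query: "string - what to search for",
    limit: "number - max results (default 10)"
  },
  execute: async ({ query, limit = 10 }: { query: string; limit?: number }) => {
    // Pass arguments as an array (no shell) so quotes or shell
    // metacharacters in the query cannot inject commands.
    const { stdout } = await execFileAsync("codevec", [
      "query", query, "--limit", String(limit), "--json",
    ]);
    return JSON.parse(stdout);
  }
};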
```

Update the research phase prompt:
```
WORKFLOW:
1. Use CodeSearch to find relevant code for the task
2. Use Read to examine specific files from search results
3. Write findings to research.md
```

## Incremental Updates

Track file hashes to avoid re-indexing unchanged files:

```json
// .codevec/manifest.json
{
  "src/auth.go": "sha256:abc123...",
  "src/handler.go": "sha256:def456..."
}
```

On `codevec index <path> --update`:
1. Walk directory
2. Compare hashes
3. Re-chunk and re-embed only changed files
4. Delete chunks from removed files
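
Steps 2 to 4 come down to a diff between the stored manifest and the hashes of the files currently on disk. A minimal sketch, with `HashContent`, `Plan`, and `DiffManifest` as hypothetical names:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// HashContent returns the "sha256:<hex>" form used in manifest.json.
func HashContent(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// Plan lists the work an incremental update has to do.
type Plan struct {
	Reindex []string // new or modified files: re-chunk and re-embed
	Remove  []string // files gone from disk: delete their chunks
}

// DiffManifest compares the stored manifest (path -> hash) with the hashes
// of the files found by the directory walk.
func DiffManifest(old, current map[string]string) Plan {
	var p Plan
	for path, hash := range current {
		if old[path] != hash { // also true for paths missing from old
			p.Reindex = append(p.Reindex, path)
		}
	}
	for path := range old {
		if _, ok := current[path]; !ok {
			p.Remove = append(p.Remove, path)
		}
	}
	// Map iteration order is random; sort for stable output.
	sort.Strings(p.Reindex)
	sort.Strings(p.Remove)
	return p
}
```

A modified file appears only in `Reindex`, so the indexer would also need to drop that file's old chunks before inserting the new ones.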

## Language Support

**Phase 1 (tree-sitter):**
- Go
- TypeScript/JavaScript
- Python

**Phase 2:**
- Rust
- C/C++
- Java

**Fallback:**
- Sliding window for any text file

## Tech Stack

- **Language:** Go
- **Embeddings:** OpenAI API (default), Ollama (local)
- **Storage:** SQLite + sqlite-vec
- **Parsing:** tree-sitter (via go bindings)

## Open Questions

1. **Chunk size vs context:** Bigger chunks carry more context but match less precisely. Smaller chunks match precisely but may lose surrounding context.
2. **Include comments?** They're semantically rich but noisy.
3. **Cross-file relationships:** Should we embed import graphs or call relationships?
4. **Cost:** OpenAI embeddings are cheap but not free. Cache aggressively.

## Prior Art

- **Sourcegraph Cody** — Similar concept, proprietary
- **Cursor** — IDE with semantic codebase understanding
- **Bloop** — Open-source semantic code search
- **Greptile** — API for codebase understanding

## Next Steps

1. [ ] Basic CLI skeleton (index, query, status)
2. [ ] sqlite-vec integration
3. [ ] OpenAI embedding generation
4. [ ] File walking with .gitignore respect
5. [ ] Sliding window chunker (MVP)
6. [ ] Tree-sitter chunker for Go
7. [ ] Incremental updates
8. [ ] claude-flow integration

---

## Dependencies

- `github.com/asg017/sqlite-vec-go-bindings` — sqlite-vec
- `github.com/smacker/go-tree-sitter` — tree-sitter (optional)
- OpenAI API or Ollama for embeddings