From f7ff79118866d3198cdcc6a9c59881344bd00a4a Mon Sep 17 00:00:00 2001
From: Clawd <ai@clawd.bot>
Date: Thu, 5 Mar 2026 07:05:24 -0800
Subject: Initial design doc

---
 DESIGN.md | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 304 insertions(+)
 create mode 100644 DESIGN.md

(limited to 'DESIGN.md')

diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..0950892
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,304 @@
+# codevec
+
+**Semantic code search via embeddings**
+
+A CLI that indexes codebases for semantic search. Query by concept, get relevant code chunks with file paths and line numbers.
+
+## Problem
+
+Searching code by keywords (`grep`, `ripgrep`) misses semantic matches:
+- "authentication" won't find `verifyJWT()`
+- "handle errors" won't find `if err != nil { return }`
+- "database connection" won't find `sql.Open()`
+
+AI coding assistants spend tokens reading files to find relevant code. Pre-computed embeddings let them jump straight to what matters.
+
+## Usage
+
+```bash
+# Index current directory
+codevec index .
+
+# Query semantically
+codevec query "websocket connection handling"
+# src/relay.go:45-89 (0.87)
+# src/handler.go:102-145 (0.82)
+
+# Query with filters
+codevec query "error handling" --ext .go --limit 5
+
+# Show chunk content
+codevec query "authentication" --show
+
+# Re-index (incremental, respects .gitignore)
+codevec index . --update
+```
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                         codevec                             │
+│                                                             │
+│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐   │
+│  │  Parser  │───▶│ Chunker  │───▶│ Embedding Generator  │   │
+│  └──────────┘    └──────────┘    └──────────┬───────────┘   │
+│       │                                      │              │
+│       │ file list                            │ vectors      │
+│       ▼                                      ▼              │
+│  ┌──────────┐                         ┌──────────────┐      │
+│  │.gitignore│                         │  sqlite-vec  │      │
+│  │  filter  │                         │    index     │      │
+│  └──────────┘                         └──────────────┘      │
+└─────────────────────────────────────────────────────────────┘
+
+Storage: .codevec/
+├── index.db        # SQLite + sqlite-vec
+├── config.json     # Index settings (model, chunk size, etc.)
+└── manifest.json   # File hashes for incremental updates
+```
+
+## Chunking Strategy
+
+**Goal:** Create semantically meaningful chunks that respect code boundaries.
+
+### Approach 1: AST-Aware (preferred for supported languages)
+
+Use tree-sitter to parse and chunk by:
+- Functions/methods
+- Classes/structs
+- Top-level declarations
+
+```go
+// Chunk: function
+// File: src/auth.go:15-42
+func VerifyToken(token string) (*Claims, error) {
+    // ...
+}
+```
+
+### Approach 2: Sliding Window (fallback)
+
+For unsupported languages or when AST parsing fails:
+- Fixed-size chunks (`chunk_max_tokens`) with overlap (`chunk_overlap`)
+- Respect line boundaries
+- Include context (file path, surrounding lines)
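+
+The fallback above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: it measures chunk size and overlap in lines rather than tokens to stay dependency-free, and the `Window` type and `SlidingWindows` function are hypothetical names.
+
+```go
+package main
+
+import "strings"
+
+// Window is a hypothetical type for one fallback chunk.
+// Line numbers are 1-based and inclusive.
+type Window struct {
+	StartLine int
+	EndLine   int
+	Content   string
+}
+
+// SlidingWindows splits src into chunks of at most size lines, where
+// consecutive chunks share overlap lines. It requires 0 <= overlap < size.
+// Assumption: sizes are counted in lines, not tokens.
+func SlidingWindows(src string, size, overlap int) []Window {
+	if size <= 0 || overlap < 0 || overlap >= size {
+		return nil
+	}
+	lines := strings.Split(strings.TrimRight(src, "\n"), "\n")
+	if len(lines) == 1 && lines[0] == "" {
+		return nil // empty input produces no chunks
+	}
+	var out []Window
+	step := size - overlap
+	for start := 0; start < len(lines); start += step {
+		end := start + size
+		if end > len(lines) {
+			end = len(lines)
+		}
+		out = append(out, Window{
+			StartLine: start + 1,
+			EndLine:   end,
+			Content:   strings.Join(lines[start:end], "\n"),
+		})
+		if end == len(lines) {
+			break // the last window already reaches the end of the file
+		}
+	}
+	return out
+}
+```
+
+Because chunks always break on whole lines, the resulting `StartLine` and `EndLine` map directly onto the `file:start-end` locations printed by `codevec query`.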
+
+### Chunk Metadata
+
+Each chunk stores:
+```json
+{
+  "file": "src/auth.go",
+  "start_line": 15,
+  "end_line": 42,
+  "type": "function",
+  "name": "VerifyToken",
+  "content": "func VerifyToken...",
+  "hash": "abc123"
+}
+```
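+
+In Go, this record could map onto a struct like the one below. The field names follow the JSON keys above; the `Chunk` type and `DecodeChunk` helper are hypothetical names used only for illustration.
+
+```go
+package main
+
+import "encoding/json"
+
+// Chunk mirrors the chunk metadata record shown above.
+// The type name is an assumption; the JSON keys come from the example.
+type Chunk struct {
+	File      string `json:"file"`
+	StartLine int    `json:"start_line"`
+	EndLine   int    `json:"end_line"`
+	Type      string `json:"type"`
+	Name      string `json:"name"`
+	Content   string `json:"content"`
+	Hash      string `json:"hash"`
+}
+
+// DecodeChunk parses one chunk record from its JSON form.
+func DecodeChunk(data []byte) (Chunk, error) {
+	var c Chunk
+	err := json.Unmarshal(data, &c)
+	return c, err
+}
+```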
+
+## Database Schema
+
+```sql
+CREATE TABLE chunks (
+    id INTEGER PRIMARY KEY,
+    file TEXT NOT NULL,
+    start_line INTEGER NOT NULL,
+    end_line INTEGER NOT NULL,
+    chunk_type TEXT,  -- function, class, block, etc.
+    name TEXT,        -- function/class name if available
+    content TEXT NOT NULL,
+    hash TEXT NOT NULL,
+    created_at INTEGER
+);
+
+CREATE TABLE embeddings (
+    chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id),
+    embedding BLOB NOT NULL  -- sqlite-vec vector
+);
+
+CREATE TABLE files (
+    path TEXT PRIMARY KEY,
+    hash TEXT NOT NULL,
+    indexed_at INTEGER
+);
+
+-- sqlite-vec virtual table for similarity search; the dimension must match the configured model (1536 for text-embedding-3-small)
+CREATE VIRTUAL TABLE vec_chunks USING vec0(
+    chunk_id INTEGER PRIMARY KEY,
+    embedding FLOAT[1536]
+);
+```
+
+## Embedding Generation
+
+### Options
+
+1. **OpenAI** — `text-embedding-3-small` (1536 dims, fast, cheap)
+2. **Ollama** — Local models (`nomic-embed-text`, `mxbai-embed-large`)
+3. **Voyage** — Code-specific embeddings (`voyage-code-2`)
+
+### Configuration
+
+```json
+{
+  "model": "openai:text-embedding-3-small",
+  "chunk_max_tokens": 512,
+  "chunk_overlap": 50,
+  "languages": ["go", "typescript", "python"],
+  "ignore": ["vendor/", "node_modules/", "*.min.js"]
+}
+```
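+
+The `model` value combines a provider and a model name in one string. A minimal sketch of how it might be split, assuming the `provider:model` convention seen in the example holds for every provider (`ParseModel` is a hypothetical helper):
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// ParseModel splits a "provider:model" spec such as
+// "openai:text-embedding-3-small" into its two parts.
+// Only the first colon separates the provider, so a model name that
+// itself contains a colon is kept intact.
+func ParseModel(spec string) (string, string, error) {
+	provider, model, ok := strings.Cut(spec, ":")
+	if !ok || provider == "" || model == "" {
+		return "", "", fmt.Errorf("invalid model spec %q, want provider:model", spec)
+	}
+	return provider, model, nil
+}
+```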
+
+## CLI Commands
+
+### `codevec index <path>`
+
+Index a directory.
+
+```
+Flags:
+  --model        Embedding model (default: openai:text-embedding-3-small)
+  --update       Incremental update (only changed files)
+  --force        Re-index everything
+  --ignore       Additional ignore patterns
+  --verbose      Show progress
+```
+
+### `codevec query <text>`
+
+Search for relevant code.
+
+```
+Flags:
+  --limit        Max results (default: 10)
+  --threshold    Min similarity score (default: 0.5)
+  --ext          Filter by extension (.go, .ts, etc.)
+  --file         Filter by file path pattern
+  --show         Print chunk content
+  --json         Output as JSON
+```
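+
+This document does not fix the similarity metric behind `--threshold`. Assuming cosine similarity, which is a common choice for text embeddings, the score for a pair of vectors could be computed as in this sketch (`CosineSimilarity` is a hypothetical name):
+
+```go
+package main
+
+import "math"
+
+// CosineSimilarity returns the cosine of the angle between a and b.
+// It returns 0 for mismatched lengths, empty input, or zero vectors.
+// Assumption: codevec scores results with cosine similarity.
+func CosineSimilarity(a, b []float32) float64 {
+	if len(a) != len(b) || len(a) == 0 {
+		return 0
+	}
+	var dot, normA, normB float64
+	for i := range a {
+		x, y := float64(a[i]), float64(b[i])
+		dot += x * y
+		normA += x * x
+		normB += y * y
+	}
+	if normA == 0 || normB == 0 {
+		return 0
+	}
+	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
+}
+```
+
+Under that assumption, results scoring below the `--threshold` value would be dropped before the `--limit` cut is applied.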
+
+### `codevec status`
+
+Show index stats.
+
+```
+Index: .codevec/index.db
+Files: 142
+Chunks: 1,847
+Model: openai:text-embedding-3-small
+Last indexed: 2 hours ago
+```
+
+### `codevec serve`
+
+Optional: Run as HTTP server for integration with other tools.
+
+```
+GET /query?q=authentication&limit=10
+POST /index (webhook for CI)
+```
+
+## Integration with claude-flow
+
+Add a `CodeSearch` tool that shells out to codevec:
+
+```typescript
+// In claude-flow's tool definitions; execFile is assumed to be util.promisify(child_process.execFile)
+{
+  name: "CodeSearch",
+  description: "Search codebase semantically. Use before Read to find relevant files.",
+  parameters: {
+    query: "string - what to search for",
+    limit: "number - max results (default 10)"
+  },
+  execute: async ({ query, limit = 10 }) => {
+    const { stdout } = await execFile("codevec", ["query", query, "--limit", String(limit), "--json"]); // no shell, so the query cannot inject commands
+    return JSON.parse(stdout);
+  }
+}
+```
+
+Update research phase prompt:
+```
+WORKFLOW:
+1. Use CodeSearch to find relevant code for the task
+2. Use Read to examine specific files from search results
+3. Write findings to research.md
+```
+
+## Incremental Updates
+
+Track file hashes to avoid re-indexing unchanged files:
+
+```jsonc
+// .codevec/manifest.json
+{
+  "src/auth.go": "sha256:abc123...",
+  "src/handler.go": "sha256:def456..."
+}
+```
+
+On `codevec index . --update`:
+1. Walk the directory, respecting `.gitignore`
+2. Compare each file's hash against the manifest
+3. Re-chunk and re-embed only new or changed files
+4. Delete chunks and embeddings for files that no longer exist
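+
+The comparison in the steps above can be sketched as a diff between two path-to-hash maps. This is an illustration under stated assumptions: `HashBytes` and `DiffManifest` are hypothetical helpers, and the `sha256:` prefix follows the manifest example.
+
+```go
+package main
+
+import (
+	"crypto/sha256"
+	"encoding/hex"
+	"sort"
+)
+
+// HashBytes returns a manifest-style hash string for file contents.
+func HashBytes(data []byte) string {
+	sum := sha256.Sum256(data)
+	return "sha256:" + hex.EncodeToString(sum[:])
+}
+
+// DiffManifest compares the stored manifest (old) with freshly computed
+// hashes (current). changed lists new or modified paths that need
+// re-chunking and re-embedding; removed lists paths whose chunks
+// should be deleted. Both slices are sorted for stable output.
+func DiffManifest(old, current map[string]string) (changed, removed []string) {
+	for path, hash := range current {
+		if old[path] != hash {
+			changed = append(changed, path)
+		}
+	}
+	for path := range old {
+		if _, ok := current[path]; !ok {
+			removed = append(removed, path)
+		}
+	}
+	sort.Strings(changed)
+	sort.Strings(removed)
+	return changed, removed
+}
+```
+
+A path missing from the old manifest looks up as an empty string, so new files land in `changed` without a separate case.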
+
+## Language Support
+
+**Phase 1 (tree-sitter):**
+- Go
+- TypeScript/JavaScript
+- Python
+
+**Phase 2:**
+- Rust
+- C/C++
+- Java
+
+**Fallback:**
+- Sliding window for any text file
+
+## Tech Stack
+
+- **Language:** Go
+- **Embeddings:** OpenAI API (default), Ollama (local)
+- **Storage:** SQLite + sqlite-vec
+- **Parsing:** tree-sitter (via Go bindings)
+
+## Open Questions
+
+1. **Chunk size vs context:** Larger chunks carry more context but match less precisely. Smaller chunks match precisely but may omit the context needed to understand them.
+2. **Include comments?** They're semantically rich but noisy.
+3. **Cross-file relationships:** Should we embed import graphs or call relationships?
+4. **Cost:** OpenAI embeddings are cheap but not free. Cache aggressively, e.g. skip re-embedding any chunk whose hash is unchanged.
+
+## Prior Art
+
+- **Sourcegraph Cody** — Similar concept, proprietary
+- **Cursor** — IDE with semantic codebase understanding
+- **Bloop** — Open-source semantic code search
+- **Greptile** — API for codebase understanding
+
+## Next Steps
+
+1. [ ] Basic CLI skeleton (index, query, status)
+2. [ ] sqlite-vec integration
+3. [ ] OpenAI embedding generation
+4. [ ] File walking with .gitignore respect
+5. [ ] Sliding window chunker (MVP)
+6. [ ] Tree-sitter chunker for Go
+7. [ ] Incremental updates
+8. [ ] claude-flow integration
+
+---
+
+## Dependencies
+
+- `github.com/asg017/sqlite-vec-go-bindings` — sqlite-vec
+- `github.com/smacker/go-tree-sitter` — tree-sitter (optional)
+- OpenAI API or Ollama for embeddings
-- 
cgit v1.2.3