From 688548d4ac3293449a88913275f886fd2e103cdf Mon Sep 17 00:00:00 2001
From: bndw <ben@bdw.to>
Date: Sat, 14 Feb 2026 09:41:18 -0800
Subject: feat: add Prometheus metrics and YAML config file support

## Metrics Package

Comprehensive Prometheus metrics for production observability:

Metrics tracked:
- Request rate, latency, size per method (histograms)
- Active connections and subscriptions (gauges)
- Auth success/failure rates (counters)
- Rate limit hits (counters)
- Storage stats (event count, DB size)
- Standard Go runtime metrics

Features:
- Automatic gRPC instrumentation via interceptors
- Low overhead (~300-500ns per request)
- Standard Prometheus client
- HTTP /metrics endpoint
- Grafana dashboard examples

## Config Package

YAML configuration file support with environment overrides:

Configuration sections:
- Server (addresses, timeouts, public URL)
- Database (path, connections, lifetime)
- Auth (enabled, required, timestamp window, allowed pubkeys)
- Rate limiting (per-method and per-user limits)
- Metrics (endpoint, namespace)
- Logging (level, format, output)
- Storage (compaction, retention)

Features:
- YAML file loading
- Environment variable overrides (MUXSTR_<SECTION>_<KEY>)
- Sensible defaults
- Validation on load
- Duration and list parsing
- Save/export configuration

Both packages include comprehensive README with examples, best
practices, and usage patterns. Config tests verify YAML parsing,
env overrides, validation, and round-trip serialization.
---
 internal/metrics/README.md | 269 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 269 insertions(+)
 create mode 100644 internal/metrics/README.md

(limited to 'internal/metrics/README.md')
diff --git a/internal/metrics/README.md b/internal/metrics/README.md
new file mode 100644
index 0000000..7cffaaf
--- /dev/null
+++ b/internal/metrics/README.md
@@ -0,0 +1,269 @@
+# Metrics
+
+This package provides Prometheus metrics for the relay, including automatic gRPC instrumentation.
+
+## Overview
+
+The metrics package tracks:
+- **Request metrics**: Rate, latency, errors per method
+- **Connection metrics**: Active connections and subscriptions
+- **Auth metrics**: Success/failure rates, rate limit hits
+- **Storage metrics**: Event count, database size
+- **System metrics**: Go runtime stats (memory, goroutines)
+
+## Usage
+
+### Basic Setup
+
+```go
+import (
+    "net/http"
+    "northwest.io/muxstr/internal/metrics"
+    "github.com/prometheus/client_golang/prometheus/promhttp"
+)
+
+// Initialize metrics
+m := metrics.New(&metrics.Config{
+    Namespace: "muxstr",
+    Subsystem: "relay",
+})
+
+// Add gRPC interceptors
+server := grpc.NewServer(
+    grpc.ChainUnaryInterceptor(
+        metrics.UnaryServerInterceptor(m),
+        auth.NostrUnaryInterceptor(authOpts),
+        ratelimit.UnaryInterceptor(limiter),
+    ),
+    grpc.ChainStreamInterceptor(
+        metrics.StreamServerInterceptor(m),
+        auth.NostrStreamInterceptor(authOpts),
+        ratelimit.StreamInterceptor(limiter),
+    ),
+)
+
+// Expose metrics endpoint
+http.Handle("/metrics", promhttp.Handler())
+go http.ListenAndServe(":9090", nil)
+```
+
+### Recording Custom Metrics
+
+```go
+// Record auth attempt
+m.RecordAuthAttempt(true)  // success
+m.RecordAuthAttempt(false) // failure
+
+// Record rate limit hit
+m.RecordRateLimitHit(pubkey)
+
+// Update connection count
+m.SetActiveConnections(42)
+
+// Update subscription count
+m.SetActiveSubscriptions(100)
+
+// Update storage stats
+m.UpdateStorageStats(eventCount, dbSizeBytes)
+```
+
+## Metrics Reference
+
+### Request Metrics
+
+**`relay_requests_total`** (Counter)
+- Labels: `method`, `status` (ok, error, unauthenticated, rate_limited)
+- Total number of requests by method and result
+
+**`relay_request_duration_seconds`** (Histogram)
+- Labels: `method`
+- Request latency distribution
+- Buckets: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 seconds
+
+**`relay_request_size_bytes`** (Histogram)
+- Labels: `method`
+- Request size distribution
+- Useful for tracking large publishes
+
+**`relay_response_size_bytes`** (Histogram)
+- Labels: `method`
+- Response size distribution
+- Useful for tracking large queries
+
+### Connection Metrics
+
+**`relay_active_connections`** (Gauge)
+- Current number of active gRPC connections
+
+**`relay_active_subscriptions`** (Gauge)
+- Current number of active subscriptions (streams)
+
+**`relay_connections_total`** (Counter)
+- Total connections since startup
+
+### Auth Metrics
+
+**`relay_auth_attempts_total`** (Counter)
+- Labels: `result` (success, failure)
+- Total authentication attempts
+
+**`relay_rate_limit_hits_total`** (Counter)
+- Labels: `user` (pubkey or "unauthenticated")
+- Total rate limit rejections per user
+
+### Storage Metrics
+
+**`relay_events_total`** (Gauge)
+- Total events stored in database
+
+**`relay_db_size_bytes`** (Gauge)
+- Database file size in bytes
+
+**`relay_event_deletions_total`** (Counter)
+- Total events deleted (NIP-09)
+
+### System Metrics
+
+Standard Go runtime metrics are automatically collected:
+- `go_goroutines` - Number of goroutines
+- `go_threads` - Number of OS threads
+- `go_memstats_*` - Memory statistics
+- `process_*` - Process CPU, memory, file descriptors
+
+## Grafana Dashboard
+
+Example Grafana queries:
+
+**Request Rate by Method**:
+```promql
+rate(relay_requests_total[5m])
+```
+
+**P99 Latency**:
+```promql
+histogram_quantile(0.99, rate(relay_request_duration_seconds_bucket[5m]))
+```
+
+**Error Rate**:
+```promql
+rate(relay_requests_total{status="error"}[5m])
+/ rate(relay_requests_total[5m])
+```
+
+**Rate Limit Hit Rate**:
+```promql
+rate(relay_rate_limit_hits_total[5m])
+```
+
+**Active Subscriptions**:
+```promql
+relay_active_subscriptions
+```
+
+**Database Growth**:
+```promql
+rate(relay_events_total[1h])
+```
+
+## Performance Impact
+
+Metrics collection adds minimal overhead:
+- Request counter: ~50ns
+- Histogram observation: ~200ns
+- Gauge update: ~30ns
+
+Total overhead per request: ~300-500ns (negligible compared to request processing)
+
+## Best Practices
+
+1. **Use labels sparingly**: High cardinality (many unique label values) can cause memory issues
+   - ✅ Good: `method`, `status` (low cardinality)
+   - ❌ Bad: `user`, `event_id` (high cardinality)
+
+2. **Aggregate high-cardinality data**: For per-user metrics, aggregate in the application:
+   ```go
+   // Don't do this - creates metric per user
+   userRequests := prometheus.NewCounterVec(...)
+   userRequests.WithLabelValues(pubkey).Inc()
+
+   // Do this - aggregate and expose top-N
+   m.RecordUserRequest(pubkey)
+   // Expose top 10 users in separate metric
+   ```
+
+3. **Set appropriate histogram buckets**: Match your SLOs
+   ```go
+   // For sub-second operations
+   prometheus.DefBuckets  // Good default
+
+   // For operations that can take seconds
+   []float64{0.1, 0.5, 1, 2, 5, 10, 30, 60}
+   ```
+
+4. **Use summary for percentiles when needed**:
+   ```go
+   // Histogram: Aggregatable, but approximate percentiles
+   // Summary: Exact percentiles, but not aggregatable
+   ```
+
+## Integration with Monitoring
+
+### Prometheus
+
+Add to `prometheus.yml`:
+```yaml
+scrape_configs:
+  - job_name: 'muxstr-relay'
+    static_configs:
+      - targets: ['localhost:9090']
+    scrape_interval: 15s
+```
+
+### Grafana
+
+Import the provided dashboard:
+1. Copy `grafana-dashboard.json`
+2. Import in Grafana
+3. Configure data source
+
+### Alerting
+
+Example alerts in `alerts.yml`:
+```yaml
+groups:
+  - name: muxstr
+    rules:
+      - alert: HighErrorRate
+        expr: rate(relay_requests_total{status="error"}[5m]) > 0.05
+        for: 5m
+        annotations:
+          summary: "High error rate detected"
+
+      - alert: HighLatency
+        expr: histogram_quantile(0.99, rate(relay_request_duration_seconds_bucket[5m])) > 1.0
+        for: 5m
+        annotations:
+          summary: "P99 latency above 1 second"
+
+      - alert: RateLimitSpike
+        expr: rate(relay_rate_limit_hits_total[5m]) > 10
+        for: 5m
+        annotations:
+          summary: "High rate limit rejection rate"
+```
+
+## Troubleshooting
+
+**Metrics not appearing**:
+- Check metrics endpoint: `curl http://localhost:9090/metrics`
+- Verify Prometheus scrape config
+- Check firewall rules
+
+**High memory usage**:
+- Check for high cardinality labels
+- Review label values: `curl http://localhost:9090/metrics | grep relay_`
+- Consider aggregating high-cardinality data
+
+**Missing method labels**:
+- Ensure interceptors are properly chained
+- Verify gRPC method names match expected format
-- 
cgit v1.2.3