# Metrics

This package provides Prometheus metrics for the relay, including automatic gRPC instrumentation.

## Overview

The metrics package tracks:

- **Request metrics**: Rate, latency, errors per method
- **Connection metrics**: Active connections and subscriptions
- **Auth metrics**: Success/failure rates, rate limit hits
- **Storage metrics**: Event count, database size
- **System metrics**: Go runtime stats (memory, goroutines)

## Usage

### Basic Setup

```go
import (
    "log"
    "net/http"

    "northwest.io/muxstr/internal/metrics"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "google.golang.org/grpc"
    // Also import the relay's auth and ratelimit packages;
    // authOpts and limiter below are assumed to be defined elsewhere.
)

// Initialize metrics
m := metrics.New(&metrics.Config{
    Namespace: "muxstr",
    Subsystem: "relay",
})

// Add gRPC interceptors. Metrics goes first so it also observes
// requests rejected by the auth and rate limit interceptors.
server := grpc.NewServer(
    grpc.ChainUnaryInterceptor(
        metrics.UnaryServerInterceptor(m),
        auth.NostrUnaryInterceptor(authOpts),
        ratelimit.UnaryInterceptor(limiter),
    ),
    grpc.ChainStreamInterceptor(
        metrics.StreamServerInterceptor(m),
        auth.NostrStreamInterceptor(authOpts),
        ratelimit.StreamInterceptor(limiter),
    ),
)

// Expose metrics endpoint
http.Handle("/metrics", promhttp.Handler())
go func() {
    if err := http.ListenAndServe(":9090", nil); err != nil {
        log.Fatalf("metrics server: %v", err)
    }
}()
```

### Recording Custom Metrics

```go
// Record auth attempt
m.RecordAuthAttempt(true)  // success
m.RecordAuthAttempt(false) // failure

// Record rate limit hit
m.RecordRateLimitHit(pubkey)

// Update connection count
m.SetActiveConnections(42)

// Update subscription count
m.SetActiveSubscriptions(100)

// Update storage stats
m.UpdateStorageStats(eventCount, dbSizeBytes)
```

## Metrics Reference

### Request Metrics

**`relay_requests_total`** (Counter)
- Labels: `method`, `status` (ok, error, unauthenticated, rate_limited)
- Total number of requests by method and result

**`relay_request_duration_seconds`** (Histogram)
- Labels: `method`
- Request latency distribution
- Buckets: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 seconds

**`relay_request_size_bytes`** (Histogram)
- Labels: `method`
- Request size distribution
- Useful for tracking large publishes

**`relay_response_size_bytes`** (Histogram)
- Labels: `method`
- Response size distribution
- Useful for tracking large queries

### Connection Metrics

**`relay_active_connections`** (Gauge)
- Current number of active gRPC connections

**`relay_active_subscriptions`** (Gauge)
- Current number of active subscriptions (streams)

**`relay_connections_total`** (Counter)
- Total connections since startup

### Auth Metrics

**`relay_auth_attempts_total`** (Counter)
- Labels: `result` (success, failure)
- Total authentication attempts

**`relay_rate_limit_hits_total`** (Counter)
- Labels: `user` (pubkey or "unauthenticated")
- Total rate limit rejections per user

### Storage Metrics

**`relay_events_total`** (Gauge)
- Total events stored in database

**`relay_db_size_bytes`** (Gauge)
- Database file size in bytes

**`relay_event_deletions_total`** (Counter)
- Total events deleted (NIP-09)

### System Metrics

Standard Go runtime metrics are automatically collected:

- `go_goroutines` - Number of goroutines
- `go_threads` - Number of OS threads
- `go_memstats_*` - Memory statistics
- `process_*` - Process CPU, memory, file descriptors

## Grafana Dashboard

Example Grafana queries:

**Request Rate by Method**:
```promql
sum by (method) (rate(relay_requests_total[5m]))
```

**P99 Latency**:
```promql
histogram_quantile(0.99, sum by (le) (rate(relay_request_duration_seconds_bucket[5m])))
```

**Error Rate**:
```promql
# Aggregate both sides so the label sets match; dividing the raw
# vectors would only pair status="error" series with themselves.
sum(rate(relay_requests_total{status="error"}[5m]))
  / sum(rate(relay_requests_total[5m]))
```

**Rate Limit Hit Rate**:
```promql
sum(rate(relay_rate_limit_hits_total[5m]))
```

**Active Subscriptions**:
```promql
relay_active_subscriptions
```

**Database Growth**:
```promql
# relay_events_total is a gauge, so use deriv() rather than rate()
deriv(relay_events_total[1h])
```

## Performance Impact

Metrics collection adds minimal overhead:

- Request counter: ~50ns
- Histogram observation: ~200ns
- Gauge update: ~30ns

Total overhead per request: ~300-500ns (negligible compared to request processing)

## Best Practices

1. **Use labels sparingly**: High cardinality (many unique label values) can cause memory issues
   - ✅ Good: `method`, `status` (low cardinality)
   - ❌ Bad: `user`, `event_id` (high cardinality)

2. **Aggregate high-cardinality data**: For per-user metrics, aggregate in the application:
   ```go
   // Don't do this - creates one time series per user
   userRequests := prometheus.NewCounterVec(...)
   userRequests.WithLabelValues(pubkey).Inc()

   // Do this - aggregate and expose top-N
   m.RecordUserRequest(pubkey)
   // Expose top 10 users in separate metric
   ```

3. **Set appropriate histogram buckets**: Match your SLOs
   ```go
   // For sub-second operations
   prometheus.DefBuckets // Good default

   // For operations that can take seconds
   []float64{0.1, 0.5, 1, 2, 5, 10, 30, 60}
   ```

4. **Use summary for percentiles when needed**:
   ```go
   // Histogram: aggregatable across instances; quantiles are
   //            estimated from bucket boundaries at query time
   // Summary:   quantiles computed in the client with a configurable
   //            error, but not aggregatable across instances
   ```

## Integration with Monitoring

### Prometheus

Add to `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'muxstr-relay'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
```

### Grafana

Import the provided dashboard:

1. Copy `grafana-dashboard.json`
2. Import in Grafana
3. Configure data source

### Alerting

Example alerts in `alerts.yml`:

```yaml
groups:
  - name: muxstr
    rules:
      - alert: HighErrorRate
        # Ratio of failed requests to all requests, not the raw error count
        expr: |
          sum(rate(relay_requests_total{status="error"}[5m]))
            / sum(rate(relay_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "Error rate above 5%"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(relay_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        annotations:
          summary: "P99 latency above 1 second"

      - alert: RateLimitSpike
        expr: rate(relay_rate_limit_hits_total[5m]) > 10
        for: 5m
        annotations:
          summary: "High rate limit rejection rate"
```

## Troubleshooting

**Metrics not appearing**:
- Check metrics endpoint: `curl http://localhost:9090/metrics`
- Verify Prometheus scrape config
- Check firewall rules

**High memory usage**:
- Check for high cardinality labels
- Review label values: `curl http://localhost:9090/metrics | grep relay_`
- Consider aggregating high-cardinality data

**Missing method labels**:
- Ensure interceptors are properly chained
- Verify gRPC method names match expected format
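
The advice to aggregate high-cardinality data in the application (rather than putting a pubkey into a label) could look roughly like the sketch below. It is not the package's implementation: `userCounter`, `Record`, and `TopN` are hypothetical names, and the sketch uses only the standard library so that it stays independent of how `m.RecordUserRequest` actually works. The idea is to count per user in memory and publish only a bounded top-N view.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// userCounter is a hypothetical helper (not part of the metrics package)
// that keeps per-user request totals in memory instead of in metric labels.
type userCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newUserCounter() *userCounter {
	return &userCounter{counts: make(map[string]uint64)}
}

// Record increments the in-memory total for one pubkey.
func (c *userCounter) Record(pubkey string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[pubkey]++
}

// userTotal pairs a pubkey with its request count.
type userTotal struct {
	Pubkey string
	Count  uint64
}

// TopN returns at most n users, sorted by descending count.
// Ties are broken by pubkey so the output is deterministic.
func (c *userCounter) TopN(n int) []userTotal {
	// Copy under the lock, sort outside it.
	c.mu.Lock()
	totals := make([]userTotal, 0, len(c.counts))
	for pubkey, count := range c.counts {
		totals = append(totals, userTotal{Pubkey: pubkey, Count: count})
	}
	c.mu.Unlock()

	sort.Slice(totals, func(i, j int) bool {
		if totals[i].Count != totals[j].Count {
			return totals[i].Count > totals[j].Count
		}
		return totals[i].Pubkey < totals[j].Pubkey
	})

	if n < 0 {
		n = 0
	}
	if len(totals) > n {
		totals = totals[:n]
	}
	return totals
}

func main() {
	c := newUserCounter()
	for _, pubkey := range []string{"alice", "bob", "alice", "carol", "alice", "bob"} {
		c.Record(pubkey)
	}
	for _, t := range c.TopN(2) {
		fmt.Printf("%s %d\n", t.Pubkey, t.Count)
	}
}
```

A periodic job could copy the `TopN(10)` result into a gauge, which caps the number of exported series at ten regardless of how many users connect. The map in this sketch is unbounded, so a long-running relay would likely also want to reset or evict entries on an interval.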