diff options
| author | bndw <ben@bdw.to> | 2026-02-14 09:41:18 -0800 |
|---|---|---|
| committer | bndw <ben@bdw.to> | 2026-02-14 09:41:18 -0800 |
| commit | 688548d4ac3293449a88913275f886fd2e103cdf (patch) | |
| tree | 5bf83c9a9b50863b6201ebf5066ee6855fefe725 /internal/metrics/README.md | |
| parent | f0169fa1f9d2e2a5d1c292b9080da10ef0878953 (diff) | |
feat: add Prometheus metrics and YAML config file support
## Metrics Package
Comprehensive Prometheus metrics for production observability:
Metrics tracked:
- Request rate, latency, size per method (histograms)
- Active connections and subscriptions (gauges)
- Auth success/failure rates (counters)
- Rate limit hits (counters)
- Storage stats (event count, DB size)
- Standard Go runtime metrics
Features:
- Automatic gRPC instrumentation via interceptors
- Low overhead (~300-500ns per request)
- Standard Prometheus client
- HTTP /metrics endpoint
- Grafana dashboard examples
## Config Package
YAML configuration file support with environment overrides:
Configuration sections:
- Server (addresses, timeouts, public URL)
- Database (path, connections, lifetime)
- Auth (enabled, required, timestamp window, allowed pubkeys)
- Rate limiting (per-method and per-user limits)
- Metrics (endpoint, namespace)
- Logging (level, format, output)
- Storage (compaction, retention)
Features:
- YAML file loading
- Environment variable overrides (MUXSTR_<SECTION>_<KEY>)
- Sensible defaults
- Validation on load
- Duration and list parsing
- Save/export configuration
Both packages include comprehensive README with examples, best
practices, and usage patterns. Config tests verify YAML parsing,
env overrides, validation, and round-trip serialization.
Diffstat (limited to 'internal/metrics/README.md')
| -rw-r--r-- | internal/metrics/README.md | 269 |
1 files changed, 269 insertions, 0 deletions
diff --git a/internal/metrics/README.md b/internal/metrics/README.md new file mode 100644 index 0000000..7cffaaf --- /dev/null +++ b/internal/metrics/README.md | |||
| @@ -0,0 +1,269 @@ | |||
| 1 | # Metrics | ||
| 2 | |||
| 3 | This package provides Prometheus metrics for the relay, including automatic gRPC instrumentation. | ||
| 4 | |||
| 5 | ## Overview | ||
| 6 | |||
| 7 | The metrics package tracks: | ||
| 8 | - **Request metrics**: Rate, latency, errors per method | ||
| 9 | - **Connection metrics**: Active connections and subscriptions | ||
| 10 | - **Auth metrics**: Success/failure rates, rate limit hits | ||
| 11 | - **Storage metrics**: Event count, database size | ||
| 12 | - **System metrics**: Go runtime stats (memory, goroutines) | ||
| 13 | |||
| 14 | ## Usage | ||
| 15 | |||
| 16 | ### Basic Setup | ||
| 17 | |||
| 18 | ```go | ||
| 19 | import ( | ||
| 20 | "net/http" | ||
| 21 | "northwest.io/muxstr/internal/metrics" | ||
| 22 | "github.com/prometheus/client_golang/prometheus/promhttp" | ||
| 23 | ) | ||
| 24 | |||
| 25 | // Initialize metrics | ||
| 26 | m := metrics.New(&metrics.Config{ | ||
| 27 | Namespace: "muxstr", | ||
| 28 | Subsystem: "relay", | ||
| 29 | }) | ||
| 30 | |||
| 31 | // Add gRPC interceptors | ||
| 32 | server := grpc.NewServer( | ||
| 33 | grpc.ChainUnaryInterceptor( | ||
| 34 | metrics.UnaryServerInterceptor(m), | ||
| 35 | auth.NostrUnaryInterceptor(authOpts), | ||
| 36 | ratelimit.UnaryInterceptor(limiter), | ||
| 37 | ), | ||
| 38 | grpc.ChainStreamInterceptor( | ||
| 39 | metrics.StreamServerInterceptor(m), | ||
| 40 | auth.NostrStreamInterceptor(authOpts), | ||
| 41 | ratelimit.StreamInterceptor(limiter), | ||
| 42 | ), | ||
| 43 | ) | ||
| 44 | |||
| 45 | // Expose metrics endpoint | ||
| 46 | http.Handle("/metrics", promhttp.Handler()) | ||
| 47 | go http.ListenAndServe(":9090", nil) | ||
| 48 | ``` | ||
| 49 | |||
| 50 | ### Recording Custom Metrics | ||
| 51 | |||
| 52 | ```go | ||
| 53 | // Record auth attempt | ||
| 54 | m.RecordAuthAttempt(true) // success | ||
| 55 | m.RecordAuthAttempt(false) // failure | ||
| 56 | |||
| 57 | // Record rate limit hit | ||
| 58 | m.RecordRateLimitHit(pubkey) | ||
| 59 | |||
| 60 | // Update connection count | ||
| 61 | m.SetActiveConnections(42) | ||
| 62 | |||
| 63 | // Update subscription count | ||
| 64 | m.SetActiveSubscriptions(100) | ||
| 65 | |||
| 66 | // Update storage stats | ||
| 67 | m.UpdateStorageStats(eventCount, dbSizeBytes) | ||
| 68 | ``` | ||
| 69 | |||
| 70 | ## Metrics Reference | ||
| 71 | |||
| 72 | ### Request Metrics | ||
| 73 | |||
| 74 | **`relay_requests_total`** (Counter) | ||
| 75 | - Labels: `method`, `status` (ok, error, unauthenticated, rate_limited) | ||
| 76 | - Total number of requests by method and result | ||
| 77 | |||
| 78 | **`relay_request_duration_seconds`** (Histogram) | ||
| 79 | - Labels: `method` | ||
| 80 | - Request latency distribution | ||
| 81 | - Buckets: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 seconds | ||
| 82 | |||
| 83 | **`relay_request_size_bytes`** (Histogram) | ||
| 84 | - Labels: `method` | ||
| 85 | - Request size distribution | ||
| 86 | - Useful for tracking large publishes | ||
| 87 | |||
| 88 | **`relay_response_size_bytes`** (Histogram) | ||
| 89 | - Labels: `method` | ||
| 90 | - Response size distribution | ||
| 91 | - Useful for tracking large queries | ||
| 92 | |||
| 93 | ### Connection Metrics | ||
| 94 | |||
| 95 | **`relay_active_connections`** (Gauge) | ||
| 96 | - Current number of active gRPC connections | ||
| 97 | |||
| 98 | **`relay_active_subscriptions`** (Gauge) | ||
| 99 | - Current number of active subscriptions (streams) | ||
| 100 | |||
| 101 | **`relay_connections_total`** (Counter) | ||
| 102 | - Total connections since startup | ||
| 103 | |||
| 104 | ### Auth Metrics | ||
| 105 | |||
| 106 | **`relay_auth_attempts_total`** (Counter) | ||
| 107 | - Labels: `result` (success, failure) | ||
| 108 | - Total authentication attempts | ||
| 109 | |||
| 110 | **`relay_rate_limit_hits_total`** (Counter) | ||
| 111 | - Labels: `user` (pubkey or "unauthenticated") | ||
| 112 | - Total rate limit rejections per user | ||
| 113 | |||
| 114 | ### Storage Metrics | ||
| 115 | |||
| 116 | **`relay_events_total`** (Gauge) | ||
| 117 | - Total events stored in database | ||
| 118 | |||
| 119 | **`relay_db_size_bytes`** (Gauge) | ||
| 120 | - Database file size in bytes | ||
| 121 | |||
| 122 | **`relay_event_deletions_total`** (Counter) | ||
| 123 | - Total events deleted (NIP-09) | ||
| 124 | |||
| 125 | ### System Metrics | ||
| 126 | |||
| 127 | Standard Go runtime metrics are automatically collected: | ||
| 128 | - `go_goroutines` - Number of goroutines | ||
| 129 | - `go_threads` - Number of OS threads | ||
| 130 | - `go_memstats_*` - Memory statistics | ||
| 131 | - `process_*` - Process CPU, memory, file descriptors | ||
| 132 | |||
| 133 | ## Grafana Dashboard | ||
| 134 | |||
| 135 | Example Grafana queries: | ||
| 136 | |||
| 137 | **Request Rate by Method**: | ||
| 138 | ```promql | ||
| 139 | rate(relay_requests_total[5m]) | ||
| 140 | ``` | ||
| 141 | |||
| 142 | **P99 Latency**: | ||
| 143 | ```promql | ||
| 144 | histogram_quantile(0.99, rate(relay_request_duration_seconds_bucket[5m])) | ||
| 145 | ``` | ||
| 146 | |||
| 147 | **Error Rate**: | ||
| 148 | ```promql | ||
| 149 | rate(relay_requests_total{status="error"}[5m]) | ||
| 150 | / rate(relay_requests_total[5m]) | ||
| 151 | ``` | ||
| 152 | |||
| 153 | **Rate Limit Hit Rate**: | ||
| 154 | ```promql | ||
| 155 | rate(relay_rate_limit_hits_total[5m]) | ||
| 156 | ``` | ||
| 157 | |||
| 158 | **Active Subscriptions**: | ||
| 159 | ```promql | ||
| 160 | relay_active_subscriptions | ||
| 161 | ``` | ||
| 162 | |||
| 163 | **Database Growth**: | ||
| 164 | ```promql | ||
| 165 | rate(relay_events_total[1h]) | ||
| 166 | ``` | ||
| 167 | |||
| 168 | ## Performance Impact | ||
| 169 | |||
| 170 | Metrics collection adds minimal overhead: | ||
| 171 | - Request counter: ~50ns | ||
| 172 | - Histogram observation: ~200ns | ||
| 173 | - Gauge update: ~30ns | ||
| 174 | |||
| 175 | Total overhead per request: ~300-500ns (negligible compared to request processing) | ||
| 176 | |||
| 177 | ## Best Practices | ||
| 178 | |||
| 179 | 1. **Use labels sparingly**: High cardinality (many unique label values) can cause memory issues | ||
| 180 | - ✅ Good: `method`, `status` (low cardinality) | ||
| 181 | - ❌ Bad: `user`, `event_id` (high cardinality) | ||
| 182 | |||
| 183 | 2. **Aggregate high-cardinality data**: For per-user metrics, aggregate in the application: | ||
| 184 | ```go | ||
| 185 | // Don't do this - creates metric per user | ||
| 186 | userRequests := prometheus.NewCounterVec(...) | ||
| 187 | userRequests.WithLabelValues(pubkey).Inc() | ||
| 188 | |||
| 189 | // Do this - aggregate and expose top-N | ||
| 190 | m.RecordUserRequest(pubkey) | ||
| 191 | // Expose top 10 users in separate metric | ||
| 192 | ``` | ||
| 193 | |||
| 194 | 3. **Set appropriate histogram buckets**: Match your SLOs | ||
| 195 | ```go | ||
| 196 | // For sub-second operations | ||
| 197 | prometheus.DefBuckets // Good default | ||
| 198 | |||
| 199 | // For operations that can take seconds | ||
| 200 | []float64{0.1, 0.5, 1, 2, 5, 10, 30, 60} | ||
| 201 | ``` | ||
| 202 | |||
| 203 | 4. **Use summary for percentiles when needed**: | ||
| 204 | ```go | ||
| 205 | // Histogram: Aggregatable, but approximate percentiles | ||
| 206 | // Summary: Exact percentiles, but not aggregatable | ||
| 207 | ``` | ||
| 208 | |||
| 209 | ## Integration with Monitoring | ||
| 210 | |||
| 211 | ### Prometheus | ||
| 212 | |||
| 213 | Add to `prometheus.yml`: | ||
| 214 | ```yaml | ||
| 215 | scrape_configs: | ||
| 216 | - job_name: 'muxstr-relay' | ||
| 217 | static_configs: | ||
| 218 | - targets: ['localhost:9090'] | ||
| 219 | scrape_interval: 15s | ||
| 220 | ``` | ||
| 221 | |||
| 222 | ### Grafana | ||
| 223 | |||
| 224 | Import the provided dashboard: | ||
| 225 | 1. Copy `grafana-dashboard.json` | ||
| 226 | 2. Import in Grafana | ||
| 227 | 3. Configure data source | ||
| 228 | |||
| 229 | ### Alerting | ||
| 230 | |||
| 231 | Example alerts in `alerts.yml`: | ||
| 232 | ```yaml | ||
| 233 | groups: | ||
| 234 | - name: muxstr | ||
| 235 | rules: | ||
| 236 | - alert: HighErrorRate | ||
| 237 | expr: rate(relay_requests_total{status="error"}[5m]) > 0.05 | ||
| 238 | for: 5m | ||
| 239 | annotations: | ||
| 240 | summary: "High error rate detected" | ||
| 241 | |||
| 242 | - alert: HighLatency | ||
| 243 | expr: histogram_quantile(0.99, rate(relay_request_duration_seconds_bucket[5m])) > 1.0 | ||
| 244 | for: 5m | ||
| 245 | annotations: | ||
| 246 | summary: "P99 latency above 1 second" | ||
| 247 | |||
| 248 | - alert: RateLimitSpike | ||
| 249 | expr: rate(relay_rate_limit_hits_total[5m]) > 10 | ||
| 250 | for: 5m | ||
| 251 | annotations: | ||
| 252 | summary: "High rate limit rejection rate" | ||
| 253 | ``` | ||
| 254 | |||
| 255 | ## Troubleshooting | ||
| 256 | |||
| 257 | **Metrics not appearing**: | ||
| 258 | - Check metrics endpoint: `curl http://localhost:9090/metrics` | ||
| 259 | - Verify Prometheus scrape config | ||
| 260 | - Check firewall rules | ||
| 261 | |||
| 262 | **High memory usage**: | ||
| 263 | - Check for high cardinality labels | ||
| 264 | - Review label values: `curl http://localhost:9090/metrics | grep relay_` | ||
| 265 | - Consider aggregating high-cardinality data | ||
| 266 | |||
| 267 | **Missing method labels**: | ||
| 268 | - Ensure interceptors are properly chained | ||
| 269 | - Verify gRPC method names match expected format | ||
