summaryrefslogtreecommitdiffstats
path: root/internal/metrics/README.md
diff options
context:
space:
mode:
authorbndw <ben@bdw.to>2026-02-14 09:41:18 -0800
committerbndw <ben@bdw.to>2026-02-14 09:41:18 -0800
commit688548d4ac3293449a88913275f886fd2e103cdf (patch)
tree5bf83c9a9b50863b6201ebf5066ee6855fefe725 /internal/metrics/README.md
parentf0169fa1f9d2e2a5d1c292b9080da10ef0878953 (diff)
feat: add Prometheus metrics and YAML config file support
## Metrics Package Comprehensive Prometheus metrics for production observability: Metrics tracked: - Request rate, latency, size per method (histograms) - Active connections and subscriptions (gauges) - Auth success/failure rates (counters) - Rate limit hits (counters) - Storage stats (event count, DB size) - Standard Go runtime metrics Features: - Automatic gRPC instrumentation via interceptors - Low overhead (~300-500ns per request) - Standard Prometheus client - HTTP /metrics endpoint - Grafana dashboard examples ## Config Package YAML configuration file support with environment overrides: Configuration sections: - Server (addresses, timeouts, public URL) - Database (path, connections, lifetime) - Auth (enabled, required, timestamp window, allowed pubkeys) - Rate limiting (per-method and per-user limits) - Metrics (endpoint, namespace) - Logging (level, format, output) - Storage (compaction, retention) Features: - YAML file loading - Environment variable overrides (MUXSTR_<SECTION>_<KEY>) - Sensible defaults - Validation on load - Duration and list parsing - Save/export configuration Both packages include comprehensive README with examples, best practices, and usage patterns. Config tests verify YAML parsing, env overrides, validation, and round-trip serialization.
Diffstat (limited to 'internal/metrics/README.md')
-rw-r--r--internal/metrics/README.md269
1 files changed, 269 insertions, 0 deletions
diff --git a/internal/metrics/README.md b/internal/metrics/README.md
new file mode 100644
index 0000000..7cffaaf
--- /dev/null
+++ b/internal/metrics/README.md
@@ -0,0 +1,269 @@
1# Metrics
2
3This package provides Prometheus metrics for the relay, including automatic gRPC instrumentation.
4
5## Overview
6
7The metrics package tracks:
8- **Request metrics**: Rate, latency, errors per method
9- **Connection metrics**: Active connections and subscriptions
10- **Auth metrics**: Success/failure rates, rate limit hits
11- **Storage metrics**: Event count, database size
12- **System metrics**: Go runtime stats (memory, goroutines)
13
14## Usage
15
16### Basic Setup
17
18```go
19import (
20 "net/http"
21 "northwest.io/muxstr/internal/metrics"
22 "github.com/prometheus/client_golang/prometheus/promhttp"
23)
24
25// Initialize metrics
26m := metrics.New(&metrics.Config{
27 Namespace: "muxstr",
28 Subsystem: "relay",
29})
30
31// Add gRPC interceptors
32server := grpc.NewServer(
33 grpc.ChainUnaryInterceptor(
34 metrics.UnaryServerInterceptor(m),
35 auth.NostrUnaryInterceptor(authOpts),
36 ratelimit.UnaryInterceptor(limiter),
37 ),
38 grpc.ChainStreamInterceptor(
39 metrics.StreamServerInterceptor(m),
40 auth.NostrStreamInterceptor(authOpts),
41 ratelimit.StreamInterceptor(limiter),
42 ),
43)
44
45// Expose metrics endpoint
46http.Handle("/metrics", promhttp.Handler())
47go http.ListenAndServe(":9090", nil)
48```
49
50### Recording Custom Metrics
51
52```go
53// Record auth attempt
54m.RecordAuthAttempt(true) // success
55m.RecordAuthAttempt(false) // failure
56
57// Record rate limit hit
58m.RecordRateLimitHit(pubkey)
59
60// Update connection count
61m.SetActiveConnections(42)
62
63// Update subscription count
64m.SetActiveSubscriptions(100)
65
66// Update storage stats
67m.UpdateStorageStats(eventCount, dbSizeBytes)
68```
69
70## Metrics Reference
71
72### Request Metrics
73
74**`relay_requests_total`** (Counter)
75- Labels: `method`, `status` (ok, error, unauthenticated, rate_limited)
76- Total number of requests by method and result
77
78**`relay_request_duration_seconds`** (Histogram)
79- Labels: `method`
80- Request latency distribution
81- Buckets: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 seconds
82
83**`relay_request_size_bytes`** (Histogram)
84- Labels: `method`
85- Request size distribution
86- Useful for tracking large publishes
87
88**`relay_response_size_bytes`** (Histogram)
89- Labels: `method`
90- Response size distribution
91- Useful for tracking large queries
92
93### Connection Metrics
94
95**`relay_active_connections`** (Gauge)
96- Current number of active gRPC connections
97
98**`relay_active_subscriptions`** (Gauge)
99- Current number of active subscriptions (streams)
100
101**`relay_connections_total`** (Counter)
102- Total connections since startup
103
104### Auth Metrics
105
106**`relay_auth_attempts_total`** (Counter)
107- Labels: `result` (success, failure)
108- Total authentication attempts
109
110**`relay_rate_limit_hits_total`** (Counter)
111- Labels: `user` (pubkey or "unauthenticated")
112- Total rate limit rejections per user
113
114### Storage Metrics
115
116**`relay_events_total`** (Gauge)
117- Total events stored in database
118
119**`relay_db_size_bytes`** (Gauge)
120- Database file size in bytes
121
122**`relay_event_deletions_total`** (Counter)
123- Total events deleted (NIP-09)
124
125### System Metrics
126
127Standard Go runtime metrics are automatically collected:
128- `go_goroutines` - Number of goroutines
129- `go_threads` - Number of OS threads
130- `go_memstats_*` - Memory statistics
131- `process_*` - Process CPU, memory, file descriptors
132
133## Grafana Dashboard
134
135Example Grafana queries:
136
137**Request Rate by Method**:
138```promql
139rate(relay_requests_total[5m])
140```
141
142**P99 Latency**:
143```promql
144histogram_quantile(0.99, rate(relay_request_duration_seconds_bucket[5m]))
145```
146
147**Error Rate**:
148```promql
149rate(relay_requests_total{status="error"}[5m])
150/ rate(relay_requests_total[5m])
151```
152
153**Rate Limit Hit Rate**:
154```promql
155rate(relay_rate_limit_hits_total[5m])
156```
157
158**Active Subscriptions**:
159```promql
160relay_active_subscriptions
161```
162
163**Database Growth**:
164```promql
165rate(relay_events_total[1h])
166```
167
168## Performance Impact
169
170Metrics collection adds minimal overhead:
171- Request counter: ~50ns
172- Histogram observation: ~200ns
173- Gauge update: ~30ns
174
175Total overhead per request: ~300-500ns (negligible compared to request processing)
176
177## Best Practices
178
1791. **Use labels sparingly**: High cardinality (many unique label values) can cause memory issues
180 - ✅ Good: `method`, `status` (low cardinality)
181 - ❌ Bad: `user`, `event_id` (high cardinality)
182
1832. **Aggregate high-cardinality data**: For per-user metrics, aggregate in the application:
184 ```go
185 // Don't do this - creates metric per user
186 userRequests := prometheus.NewCounterVec(...)
187 userRequests.WithLabelValues(pubkey).Inc()
188
189 // Do this - aggregate and expose top-N
190 m.RecordUserRequest(pubkey)
191 // Expose top 10 users in separate metric
192 ```
193
1943. **Set appropriate histogram buckets**: Match your SLOs
195 ```go
196 // For sub-second operations
197 prometheus.DefBuckets // Good default
198
199 // For operations that can take seconds
200 []float64{0.1, 0.5, 1, 2, 5, 10, 30, 60}
201 ```
202
2034. **Use summary for percentiles when needed**:
204 ```go
205 // Histogram: Aggregatable, but approximate percentiles
206 // Summary: Exact percentiles, but not aggregatable
207 ```
208
209## Integration with Monitoring
210
211### Prometheus
212
213Add to `prometheus.yml`:
214```yaml
215scrape_configs:
216 - job_name: 'muxstr-relay'
217 static_configs:
218 - targets: ['localhost:9090']
219 scrape_interval: 15s
220```
221
222### Grafana
223
224Import the provided dashboard:
2251. Copy `grafana-dashboard.json`
2262. Import in Grafana
2273. Configure data source
228
229### Alerting
230
231Example alerts in `alerts.yml`:
232```yaml
233groups:
234 - name: muxstr
235 rules:
236 - alert: HighErrorRate
237 expr: rate(relay_requests_total{status="error"}[5m]) > 0.05
238 for: 5m
239 annotations:
240 summary: "High error rate detected"
241
242 - alert: HighLatency
243 expr: histogram_quantile(0.99, rate(relay_request_duration_seconds_bucket[5m])) > 1.0
244 for: 5m
245 annotations:
246 summary: "P99 latency above 1 second"
247
248 - alert: RateLimitSpike
249 expr: rate(relay_rate_limit_hits_total[5m]) > 10
250 for: 5m
251 annotations:
252 summary: "High rate limit rejection rate"
253```
254
255## Troubleshooting
256
257**Metrics not appearing**:
258- Check metrics endpoint: `curl http://localhost:9090/metrics`
259- Verify Prometheus scrape config
260- Check firewall rules
261
262**High memory usage**:
263- Check for high cardinality labels
264- Review label values: `curl http://localhost:9090/metrics | grep relay_`
265- Consider aggregating high-cardinality data
266
267**Missing method labels**:
268- Ensure interceptors are properly chained
269- Verify gRPC method names match expected format