# Metrics

This package provides Prometheus metrics for the relay, including automatic gRPC instrumentation.

## Overview

The metrics package tracks:
- **Request metrics**: Rate, latency, errors per method
- **Connection metrics**: Active connections and subscriptions
- **Auth metrics**: Success/failure rates, rate limit hits
- **Storage metrics**: Event count, database size
- **System metrics**: Go runtime stats (memory, goroutines)

## Usage

### Basic Setup

```go
import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "google.golang.org/grpc"
    "northwest.io/muxstr/internal/metrics"
    // Also import the relay's auth and ratelimit packages used below.
)

// Initialize metrics
m := metrics.New(&metrics.Config{
    Namespace: "muxstr",
    Subsystem: "relay",
})

// Add gRPC interceptors
server := grpc.NewServer(
    grpc.ChainUnaryInterceptor(
        metrics.UnaryServerInterceptor(m),
        auth.NostrUnaryInterceptor(authOpts),
        ratelimit.UnaryInterceptor(limiter),
    ),
    grpc.ChainStreamInterceptor(
        metrics.StreamServerInterceptor(m),
        auth.NostrStreamInterceptor(authOpts),
        ratelimit.StreamInterceptor(limiter),
    ),
)

// Expose metrics endpoint and log the error if the listener stops
http.Handle("/metrics", promhttp.Handler())
go func() {
    if err := http.ListenAndServe(":9090", nil); err != nil {
        log.Printf("metrics endpoint stopped: %v", err)
    }
}()
```

### Recording Custom Metrics

```go
// Record auth attempt
m.RecordAuthAttempt(true)  // success
m.RecordAuthAttempt(false) // failure

// Record rate limit hit
m.RecordRateLimitHit(pubkey)

// Update connection count
m.SetActiveConnections(42)

// Update subscription count
m.SetActiveSubscriptions(100)

// Update storage stats
m.UpdateStorageStats(eventCount, dbSizeBytes)
```

## Metrics Reference

### Request Metrics

**`relay_requests_total`** (Counter)
- Labels: `method`, `status` (ok, error, unauthenticated, rate_limited)
- Total number of requests by method and result

**`relay_request_duration_seconds`** (Histogram)
- Labels: `method`
- Request latency distribution
- Buckets: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 seconds

**`relay_request_size_bytes`** (Histogram)
- Labels: `method`
- Request size distribution
- Useful for tracking large publishes

**`relay_response_size_bytes`** (Histogram)
- Labels: `method`
- Response size distribution
- Useful for tracking large queries

### Connection Metrics

**`relay_active_connections`** (Gauge)
- Current number of active gRPC connections

**`relay_active_subscriptions`** (Gauge)
- Current number of active subscriptions (streams)

**`relay_connections_total`** (Counter)
- Total connections since startup

### Auth Metrics

**`relay_auth_attempts_total`** (Counter)
- Labels: `result` (success, failure)
- Total authentication attempts

**`relay_rate_limit_hits_total`** (Counter)
- Labels: `user` (pubkey or "unauthenticated")
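- Total rate limit rejections per user
- Note: `user` is a high-cardinality label (see Best Practices); watch the series count on busy relays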

### Storage Metrics

**`relay_events_total`** (Gauge)
- Current number of events stored in the database (a gauge, so it can decrease after deletions)

**`relay_db_size_bytes`** (Gauge)
- Database file size in bytes

**`relay_event_deletions_total`** (Counter)
- Total events deleted (NIP-09)

### System Metrics

Standard Go runtime and process metrics are automatically collected:
- `go_goroutines` - Number of goroutines
- `go_threads` - Number of OS threads
- `go_memstats_*` - Memory statistics
- `process_*` - Process CPU, memory, file descriptors

## Grafana Dashboard

Example Grafana queries:

**Request Rate by Method**:
```promql
sum by (method) (rate(relay_requests_total[5m]))
```

**P99 Latency**:
```promql
histogram_quantile(0.99, sum by (le) (rate(relay_request_duration_seconds_bucket[5m])))
```

**Error Rate**:
```promql
sum(rate(relay_requests_total{status="error"}[5m]))
/ sum(rate(relay_requests_total[5m]))
```

**Rate Limit Hit Rate**:
```promql
rate(relay_rate_limit_hits_total[5m])
```

**Active Subscriptions**:
```promql
relay_active_subscriptions
```

**Database Growth**:
```promql
deriv(relay_events_total[1h])
```

## Performance Impact

Metrics collection adds minimal overhead:
- Request counter: ~50ns
- Histogram observation: ~200ns
- Gauge update: ~30ns

Total overhead per request: roughly 300-500ns, which is negligible compared to request processing. Treat these figures as approximate, since they vary with hardware and contention.

## Best Practices

1. **Use labels sparingly**: High cardinality (many unique label values) can cause memory issues
   - ✅ Good: `method`, `status` (low cardinality)
   - ❌ Bad: `user`, `event_id` (high cardinality)

2. **Aggregate high-cardinality data**: For per-user metrics, aggregate in the application:
   ```go
   // Don't do this - creates metric per user
   userRequests := prometheus.NewCounterVec(...)
   userRequests.WithLabelValues(pubkey).Inc()

   // Do this - aggregate and expose top-N
   m.RecordUserRequest(pubkey)
   // Expose top 10 users in separate metric
   ```

3. **Set appropriate histogram buckets**: Match your SLOs
   ```go
   // For sub-second operations
   prometheus.DefBuckets  // Good default

   // For operations that can take seconds
   []float64{0.1, 0.5, 1, 2, 5, 10, 30, 60}
   ```

4. **Use summary for percentiles when needed**:
   ```go
   // Histogram: aggregatable across instances; percentiles are estimated from bucket counts
   // Summary: percentiles computed client-side within a configured error; not aggregatable
   ```

## Integration with Monitoring

### Prometheus

Add to `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'muxstr-relay'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
```

### Grafana

Import the provided dashboard:
1. Copy `grafana-dashboard.json`
2. Import in Grafana
3. Configure data source

### Alerting

Example alerts in `alerts.yml`:
```yaml
groups:
  - name: muxstr
    rules:
      - alert: HighErrorRate
        expr: sum(rate(relay_requests_total{status="error"}[5m])) / sum(rate(relay_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(relay_request_duration_seconds_bucket[5m]))) > 1.0
        for: 5m
        annotations:
          summary: "P99 latency above 1 second"

      - alert: RateLimitSpike
        expr: sum(rate(relay_rate_limit_hits_total[5m])) > 10
        for: 5m
        annotations:
          summary: "High rate limit rejection rate"
```

## Troubleshooting

**Metrics not appearing**:
- Check metrics endpoint: `curl http://localhost:9090/metrics`
- Verify Prometheus scrape config
- Check firewall rules

**High memory usage**:
- Check for high cardinality labels
- Review label values: `curl http://localhost:9090/metrics | grep relay_`
- Consider aggregating high-cardinality data

**Missing method labels**:
- Ensure interceptors are properly chained
- Verify gRPC method names match expected format
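
gRPC reports the full method name in the form `/package.Service/Method`. When checking `method` label values, a small parser like the one below can confirm that a name has the expected shape. `splitFullMethod` and the service name `nostr.v1.Relay` are hypothetical, and whether this package labels by full name or by the short method name is not stated here.

```go
package main

import (
    "fmt"
    "strings"
)

// splitFullMethod splits "/package.Service/Method" into its service and
// method parts. ok is false when the input does not have that shape.
func splitFullMethod(full string) (service, method string, ok bool) {
    if !strings.HasPrefix(full, "/") {
        return "", "", false
    }
    rest := full[1:]
    i := strings.LastIndex(rest, "/")
    if i <= 0 || i == len(rest)-1 {
        return "", "", false
    }
    return rest[:i], rest[i+1:], true
}

func main() {
    for _, name := range []string{"/nostr.v1.Relay/Publish", "Publish", "/nostr.v1.Relay/"} {
        service, method, ok := splitFullMethod(name)
        fmt.Printf("%q -> service=%q method=%q ok=%v\n", name, service, method, ok)
    }
}
```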