Monitor nfyio with Prometheus and Grafana
Set up full observability for your nfyio deployment — metrics collection with Prometheus, dashboards with Grafana, and alerting for critical incidents.
nfyio Team
Talya Smart & Technoplatz JV
Running nfyio in production without monitoring means flying blind. This guide sets up Prometheus for metrics collection, Grafana for dashboards, and alerting for the things that matter.
Architecture
nfyio gateway ──────┐
nfyio storage ──────┤
nfyio agents ───────┤──► Prometheus ──► Grafana
PostgreSQL ─────────┤        │
Redis ──────────────┤        ▼
SeaweedFS ──────────┘   Alertmanager ──► Slack/PagerDuty
Prerequisites
- A running nfyio instance
- Docker Compose or Kubernetes
- Ports: 9090 (Prometheus), 3001 (Grafana), 9093 (Alertmanager)
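Before starting the stack, it helps to confirm none of those ports are already bound. A minimal check, assuming `lsof` is available on the host:

```shell
# Check that the monitoring ports are free before bringing the stack up.
# Ports match this guide's compose mappings; assumes lsof is installed.
for port in 9090 3001 9093; do
  if lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "port $port is already in use"
  else
    echo "port $port looks free"
  fi
done
```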
Add Monitoring to Docker Compose
Append to your docker-compose.yml — the three monitoring services go under the existing services: key, and the named volumes under the top-level volumes: key:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/datasources.yml:/etc/grafana/provisioning/datasources/default.yml
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
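The Prometheus configuration below also scrapes postgres-exporter and redis-exporter, which are not part of the base nfyio compose file. If you don't already run them, service definitions along these lines can be added under the same services: key — the image tags, hostnames, and connection strings here are illustrative and should be adjusted to your deployment:

```yaml
postgres-exporter:
  image: prometheuscommunity/postgres-exporter:v0.15.0
  environment:
    # Hypothetical DSN; point it at your actual PostgreSQL service.
    DATA_SOURCE_NAME: postgresql://nfyio:${POSTGRES_PASSWORD}@postgres:5432/nfyio?sslmode=disable
  restart: unless-stopped

redis-exporter:
  image: oliver006/redis_exporter:v1.58.0
  environment:
    # Hypothetical address; point it at your actual Redis service.
    REDIS_ADDR: redis://redis:6379
  restart: unless-stopped
```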
Prometheus Configuration
Create monitoring/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'nfyio-gateway'
    static_configs:
      - targets: ['gateway:3000']
    metrics_path: /metrics

  - job_name: 'nfyio-storage'
    static_configs:
      - targets: ['storage:7007']
    metrics_path: /metrics

  - job_name: 'nfyio-agents'
    static_configs:
      - targets: ['agents:7010']
    metrics_path: /metrics

  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'seaweedfs'
    static_configs:
      - targets: ['seaweedfs-master:9333']
    metrics_path: /metrics
Alert Rules
Create monitoring/alert-rules.yml. Note that the StorageSpaceLow rule relies on node_exporter filesystem metrics, so it only fires if a node_exporter scrape job is also configured:

groups:
  - name: nfyio-critical
    rules:
      - alert: GatewayDown
        expr: up{job="nfyio-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "nfyio gateway is down"
          description: "Gateway {{ $labels.instance }} has been unreachable for 1 minute."

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 5xx error rate ({{ $value | humanizePercentage }})"

      - alert: StorageSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Storage space below 10%"

      - alert: PostgreSQLConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connections above 80"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 85%"

      - alert: EmbeddingQueueBacklog
        expr: nfyio_embedding_queue_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Embedding queue backlog exceeds 1000 items"

      - alert: AgentErrorRate
        expr: rate(nfyio_agent_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent error rate elevated"
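Prometheus ships a promtool binary that can validate both the main config and the rule file before a restart. A quick sketch, assuming promtool is on PATH:

```shell
# Validate the Prometheus config and alert rules before (re)starting.
# Falls back to a hint if promtool is not installed.
if command -v promtool >/dev/null 2>&1; then
  promtool check config monitoring/prometheus.yml
  promtool check rules monitoring/alert-rules.yml
else
  echo "promtool not found; it ships in the Prometheus release tarball"
fi
```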
Alertmanager Configuration
Create monitoring/alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#nfyio-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
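Alertmanager's companion amtool can lint this file for syntax and routing errors, assuming amtool from the Alertmanager release is installed:

```shell
# Check the Alertmanager routing tree and receivers for config errors.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config monitoring/alertmanager.yml
else
  echo "amtool not found; it ships with the Alertmanager release"
fi
```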
Grafana Datasource
Create monitoring/datasources.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
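The compose file mounts ./monitoring/dashboards into Grafana's provisioning directory, and Grafana needs a dashboard provider file there before it will load JSON dashboards from it. A minimal example — the provider name is illustrative:

```yaml
# monitoring/dashboards/provider.yml
apiVersion: 1
providers:
  - name: nfyio
    type: file
    options:
      # Grafana scans this directory for dashboard JSON files.
      path: /etc/grafana/provisioning/dashboards
```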
Key Metrics to Track
Gateway Metrics
# Request rate
rate(http_requests_total{job="nfyio-gateway"}[5m])
# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m]))
# 5xx error rate (errors/sec)
rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m])
Storage Metrics
# Upload throughput (bytes/sec)
rate(nfyio_storage_bytes_uploaded_total[5m])
# Object count
nfyio_storage_objects_total
# Bucket sizes
nfyio_storage_bucket_size_bytes
Embedding Pipeline
# Queue length
nfyio_embedding_queue_length
# Processing rate
rate(nfyio_embeddings_processed_total[5m])
# Average embedding latency
rate(nfyio_embedding_duration_seconds_sum[5m]) / rate(nfyio_embedding_duration_seconds_count[5m])
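Any of these expressions can be run ad hoc against the Prometheus HTTP API before wiring them into a dashboard; for example (the endpoint matches this guide's port mapping, and jq is optional):

```shell
# POST a PromQL query to the Prometheus HTTP API and print the result series.
QUERY='rate(nfyio_embeddings_processed_total[5m])'
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode "query=${QUERY}" | jq '.data.result'
```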
Verify the Setup
docker compose up -d prometheus grafana alertmanager
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Check Grafana
curl -s http://localhost:3001/api/health | jq
{
  "commit": "abc1234",
  "database": "ok",
  "version": "10.4.0"
}
Open Grafana at http://localhost:3001, log in as admin with the password set in GRAFANA_PASSWORD (default admin), and import the nfyio dashboard (ID: nfyio-overview).
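To exercise the Slack route end to end, you can push a synthetic alert straight into Alertmanager's v2 API — the payload here is illustrative, and the alert resolves itself once it stops being re-posted:

```shell
# Fire a one-off test alert at Alertmanager; it should show up in #nfyio-alerts.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"MonitoringSmokeTest","severity":"warning"},"annotations":{"summary":"Test alert from the monitoring setup guide"}}]'
```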
Key Takeaways
- Prometheus scrapes metrics from all nfyio services, PostgreSQL, Redis, and SeaweedFS
- Alert rules catch gateway outages, high error rates, storage pressure, and embedding queue backlogs before they become incidents
- Grafana provides real-time dashboards for request rate, latency percentiles, storage throughput, and embedding pipeline health
- Alertmanager routes critical alerts to PagerDuty and warnings to Slack
- The entire monitoring stack deploys alongside nfyio in Docker Compose or Kubernetes
For more on production operations, see the installation guide and backup guide.