Monitor nfyio with Prometheus and Grafana
Set up full observability for your nfyio deployment — metrics collection with Prometheus, dashboards with Grafana, and alerting for critical incidents.
nfyio Team
Talya Smart & Technoplatz JV
Running nfyio in production without monitoring means flying blind. This guide sets up Prometheus for metrics collection, Grafana for dashboards, and alerting for the things that matter.
Architecture
nfyio gateway ──────┐
nfyio storage ──────┤
nfyio agents ───────┤──► Prometheus ──► Grafana
PostgreSQL ─────────┤        │
Redis ──────────────┤        ▼
SeaweedFS ──────────┘   Alertmanager ──► Slack/PagerDuty
Prerequisites
- A running nfyio instance
- Docker Compose or Kubernetes
- Ports: 9090 (Prometheus), 3001 (Grafana), 9093 (Alertmanager)
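Before starting the stack, it helps to confirm none of those ports are already bound. A minimal check, assuming `lsof` is available on the host:

```shell
# Check that the monitoring ports are free before bringing the stack up.
# Ports match this guide's compose mappings; assumes lsof is installed.
for port in 9090 3001 9093; do
  if lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "port $port is already in use"
  else
    echo "port $port looks free"
  fi
done
```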
Add Monitoring to Docker Compose
Append to your docker-compose.yml — the three monitoring services go under the existing services: key, and the named volumes under the top-level volumes: key:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/datasources.yml:/etc/grafana/provisioning/datasources/default.yml
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
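The Prometheus configuration below also scrapes postgres-exporter and redis-exporter, which are not part of the base nfyio compose file. If you don't already run them, service definitions along these lines can be added under the same services: key — the image tags, hostnames, and connection strings here are illustrative and should be adjusted to your deployment:

```yaml
postgres-exporter:
  image: prometheuscommunity/postgres-exporter:v0.15.0
  environment:
    # Hypothetical DSN; point it at your actual PostgreSQL service.
    DATA_SOURCE_NAME: postgresql://nfyio:${POSTGRES_PASSWORD}@postgres:5432/nfyio?sslmode=disable
  restart: unless-stopped

redis-exporter:
  image: oliver006/redis_exporter:v1.58.0
  environment:
    # Hypothetical address; point it at your actual Redis service.
    REDIS_ADDR: redis://redis:6379
  restart: unless-stopped
```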
Prometheus Configuration
Create monitoring/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'nfyio-gateway'
    static_configs:
      - targets: ['gateway:3000']
    metrics_path: /metrics

  - job_name: 'nfyio-storage'
    static_configs:
      - targets: ['storage:7007']
    metrics_path: /metrics

  - job_name: 'nfyio-agents'
    static_configs:
      - targets: ['agents:7010']
    metrics_path: /metrics

  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'seaweedfs'
    static_configs:
      - targets: ['seaweedfs-master:9333']
    metrics_path: /metrics
Alert Rules
Create monitoring/alert-rules.yml. Note that the StorageSpaceLow rule relies on node_exporter filesystem metrics, so it only fires if a node_exporter scrape job is also configured:

groups:
  - name: nfyio-critical
    rules:
      - alert: GatewayDown
        expr: up{job="nfyio-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "nfyio gateway is down"
          description: "Gateway {{ $labels.instance }} has been unreachable for 1 minute."

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 5xx error rate ({{ $value | humanizePercentage }})"

      - alert: StorageSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Storage space below 10%"

      - alert: PostgreSQLConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connections above 80"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 85%"

      - alert: EmbeddingQueueBacklog
        expr: nfyio_embedding_queue_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Embedding queue backlog exceeds 1000 items"

      - alert: AgentErrorRate
        expr: rate(nfyio_agent_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent error rate elevated"
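Prometheus ships a promtool binary that can validate both the main config and the rule file before a restart. A quick sketch, assuming promtool is on PATH:

```shell
# Validate the Prometheus config and alert rules before (re)starting.
# Falls back to a hint if promtool is not installed.
if command -v promtool >/dev/null 2>&1; then
  promtool check config monitoring/prometheus.yml
  promtool check rules monitoring/alert-rules.yml
else
  echo "promtool not found; it ships in the Prometheus release tarball"
fi
```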
Alertmanager Configuration
Create monitoring/alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#nfyio-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
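Alertmanager's companion amtool can lint this file for syntax and routing errors, assuming amtool from the Alertmanager release is installed:

```shell
# Check the Alertmanager routing tree and receivers for config errors.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config monitoring/alertmanager.yml
else
  echo "amtool not found; it ships with the Alertmanager release"
fi
```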
Grafana Datasource
Create monitoring/datasources.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
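The compose file mounts ./monitoring/dashboards into Grafana's provisioning directory, and Grafana needs a dashboard provider file there before it will load JSON dashboards from it. A minimal example — the provider name is illustrative:

```yaml
# monitoring/dashboards/provider.yml
apiVersion: 1
providers:
  - name: nfyio
    type: file
    options:
      # Grafana scans this directory for dashboard JSON files.
      path: /etc/grafana/provisioning/dashboards
```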
Key Metrics to Track
Gateway Metrics
# Request rate
rate(http_requests_total{job="nfyio-gateway"}[5m])
# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m]))
# 5xx error rate (errors/sec)
rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m])
Storage Metrics
# Upload throughput (bytes/sec)
rate(nfyio_storage_bytes_uploaded_total[5m])
# Object count
nfyio_storage_objects_total
# Bucket sizes
nfyio_storage_bucket_size_bytes
Embedding Pipeline
# Queue length
nfyio_embedding_queue_length
# Processing rate
rate(nfyio_embeddings_processed_total[5m])
# Average embedding latency
rate(nfyio_embedding_duration_seconds_sum[5m]) / rate(nfyio_embedding_duration_seconds_count[5m])
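Any of these expressions can be run ad hoc against the Prometheus HTTP API before wiring them into a dashboard; for example (the endpoint matches this guide's port mapping, and jq is optional):

```shell
# POST a PromQL query to the Prometheus HTTP API and print the result series.
QUERY='rate(nfyio_embeddings_processed_total[5m])'
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode "query=${QUERY}" | jq '.data.result'
```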
Verify the Setup
docker compose up -d prometheus grafana alertmanager
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Check Grafana
curl -s http://localhost:3001/api/health | jq
{
  "commit": "abc1234",
  "database": "ok",
  "version": "10.4.0"
}
Open Grafana at http://localhost:3001, log in as admin with the password set in GRAFANA_PASSWORD (default admin), and import the nfyio dashboard (ID: nfyio-overview).
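To exercise the Slack route end to end, you can push a synthetic alert straight into Alertmanager's v2 API — the payload here is illustrative, and the alert resolves itself once it stops being re-posted:

```shell
# Fire a one-off test alert at Alertmanager; it should show up in #nfyio-alerts.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"MonitoringSmokeTest","severity":"warning"},"annotations":{"summary":"Test alert from the monitoring setup guide"}}]'
```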
Key Takeaways
- Prometheus scrapes metrics from all nfyio services, PostgreSQL, Redis, and SeaweedFS
- Alert rules catch gateway outages, high error rates, storage pressure, and embedding queue backlogs before they become incidents
- Grafana provides real-time dashboards for request rate, latency percentiles, storage throughput, and embedding pipeline health
- Alertmanager routes critical alerts to PagerDuty and warnings to Slack
- The entire monitoring stack deploys alongside nfyio in Docker Compose or Kubernetes
For more on production operations, see the installation guide and backup guide.