# Scalability Guide

This guide covers scaling NFYio under high load: horizontal scaling, database scaling, caching strategies, and multi-region deployment.
## Horizontal Scaling

### API Gateway

Scale the gateway horizontally behind a load balancer:
```yaml
# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfyio-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nfyio-gateway
  template:
    metadata:
      labels:
        app: nfyio-gateway  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: gateway
          image: nfyio/gateway:latest
          ports:
            - containerPort: 3000
```
| Component | Scaling Strategy |
|---|---|
| API Gateway | Add replicas; stateless, scales linearly |
| Load Balancer | Round-robin or least-connections |
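The least-connections strategy from the table above routes each request to the backend with the fewest in-flight requests. A minimal sketch of the selection logic (`Backend` and `leastConnections` are hypothetical names; a real load balancer such as NGINX or HAProxy implements this for you):

```typescript
interface Backend {
  host: string;
  activeConnections: number;
}

// Pick the backend with the fewest in-flight connections.
function leastConnections(backends: Backend[]): Backend {
  if (backends.length === 0) throw new Error("no backends available");
  return backends.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best
  );
}
```

Least-connections adapts better than round-robin when request durations vary widely, which is common for gateway workloads mixing small metadata calls with large object transfers.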
### Storage Nodes

SeaweedFS scales by adding volume nodes:
```yaml
# Add more volume nodes
seaweedfs-volume-1:
  image: chrislusf/seaweedfs
  command: volume -mserver=seaweedfs-master:9333 -port=8080
seaweedfs-volume-2:
  image: chrislusf/seaweedfs
  command: volume -mserver=seaweedfs-master:9333 -port=8080
```
Each volume node adds capacity and throughput. The master distributes writes across volumes.
### Embedding Workers

For high-volume embedding pipelines, scale worker replicas:
```yaml
nfyio-embedding-worker:
  deploy:
    replicas: 4
  environment:
    - OPENAI_API_KEY=${OPENAI_API_KEY}
    - BATCH_SIZE=32
```
| Setting | Impact |
|---|---|
| Replicas | Throughput (scales roughly linearly) |
| `BATCH_SIZE` | API efficiency (larger batches = fewer calls, more memory per call) |
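The `BATCH_SIZE` trade-off comes down to how documents are grouped before each API call. A sketch of the batching step a worker might perform (`chunk` is a hypothetical helper, not part of NFYio's API):

```typescript
// Split a list of documents into batches of at most `size` items,
// mirroring the BATCH_SIZE setting above.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 70 documents at BATCH_SIZE=32 → 3 API calls instead of 70.
const batches = chunk(Array.from({ length: 70 }, (_, i) => `doc-${i}`), 32);
```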
## Database Scaling

### Read Replicas

Offload read traffic to replicas. Use PostgreSQL streaming replication:
```yaml
# Primary
postgres-primary:
  image: pgvector/pgvector:pg16
  environment:
    # Note: these replication variables follow the Bitnami PostgreSQL image
    # convention; the stock pgvector image requires setting up streaming
    # replication manually (wal_level, replication user, pg_basebackup).
    - POSTGRES_REPLICATION_MODE=master

# Read replica
postgres-replica:
  image: pgvector/pgvector:pg16
  environment:
    - POSTGRES_REPLICATION_MODE=slave
    - POSTGRES_MASTER_HOST=postgres-primary
```
Route read queries (SELECT, list operations) to replicas. Writes go to primary.
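The read/write split above needs a routing decision per query. A naive sketch (`pickPool` is a hypothetical name; a production router must also keep transactions and writing CTEs pinned to the primary):

```typescript
// Send plain SELECTs to a replica, everything else to the primary.
function pickPool(sql: string): "primary" | "replica" {
  const stmt = sql.trimStart().toUpperCase();
  return stmt.startsWith("SELECT") ? "replica" : "primary";
}
```

Because replication is asynchronous, reads routed to a replica may briefly see stale data; read-your-own-writes flows should stay on the primary.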
### Connection Pooling

Use PgBouncer to handle connection spikes:
```ini
[databases]
nfyio = host=postgres port=5432 dbname=nfyio

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
reserve_pool_size = 25
```
| Parameter | Purpose |
|---|---|
| `max_client_conn` | Maximum client connections PgBouncer accepts |
| `default_pool_size` | Server connections per database/user pair |
| `reserve_pool_size` | Additional server connections for bursts |
### pgvector Optimization

For vector similarity search at scale:
```sql
-- IVFFlat index (faster build, good for < 1M vectors)
CREATE INDEX idx_embeddings_ivfflat ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- HNSW index (faster query, better recall, slower build)
CREATE INDEX idx_embeddings_hnsw ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
| Index | Build Time | Query Speed | Recall |
|---|---|---|---|
| IVFFlat | Fast | Good | Good |
| HNSW | Slower | Faster | Better |
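Both index types above use the `vector_cosine_ops` operator class, which orders results by cosine distance (1 − cosine similarity), the metric pgvector's `<=>` operator computes. A minimal sketch of the metric itself, for intuition about what the index is approximating:

```typescript
// Cosine distance: 1 - (a·b) / (|a| * |b|).
// 0 = identical direction, 1 = orthogonal, 2 = opposite.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because both indexes are approximate, recall is a tuning target, not a guarantee; benchmark with your own embeddings before choosing index parameters.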
## Caching Strategies

### Redis

Use Redis for session, rate-limit, and query caching:
```yaml
redis:
  image: redis:7-alpine
  command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
```
| Use Case | TTL | Key Pattern |
|---|---|---|
| Session | 24h | `session:{id}` |
| Rate limit | 1m | `ratelimit:{key}:{window}` |
| Query cache | 5m | `query:{hash}` |
| Embedding cache | 24h | `emb:{hash}` |
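For the hash-based patterns, hashing keeps keys short and uniform regardless of input length. A sketch of a key builder following the `query:{hash}` pattern (`queryCacheKey` and the normalization choices are illustrative, not NFYio's actual implementation):

```typescript
import { createHash } from "crypto";

// Build a cache key from a normalized query; the truncated SHA-256
// keeps keys compact while making collisions vanishingly unlikely.
function queryCacheKey(query: string): string {
  const hash = createHash("sha256")
    .update(query.trim().toLowerCase())
    .digest("hex")
    .slice(0, 16);
  return `query:${hash}`;
}
```

Normalizing before hashing (trimming, lowercasing) lets trivially different spellings of the same query share one cache entry.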
### CDN Caching

Cache public objects at the edge:
| Header | Effect |
|---|---|
| `Cache-Control: public, max-age=3600` | Cache 1 hour |
| `Cache-Control: s-maxage=86400` | CDN cache 24h |
| `Vary: Authorization` | Separate cache per auth |
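A gateway typically picks these headers per object based on visibility. A hypothetical sketch of that policy (the header values match the table; the function and the private-object policy are assumptions, not NFYio's shipped behavior):

```typescript
// Public objects: cache in browsers for 1h, at the CDN for 24h.
// Private objects: never cache at the edge.
function cacheControlFor(isPublic: boolean): string {
  return isPublic
    ? "public, max-age=3600, s-maxage=86400"
    : "private, no-store";
}
```

`s-maxage` applies only to shared caches (the CDN), so the edge can hold objects far longer than browsers without delaying updates for end users.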
### Query Result Caching

Cache expensive RAG or list operations:
```typescript
const cacheKey = `list:${bucket}:${prefix}:${page}`;

// Return the cached result if present.
const cached = await redis.get(cacheKey);
if (cached) {
  return JSON.parse(cached);
}

// Cache miss: query storage and cache the serialized result for 60s.
const result = await s3.listObjectsV2({ Bucket: bucket, Prefix: prefix });
await redis.setex(cacheKey, 60, JSON.stringify(result));
return result;
```
## Multi-Region Deployment

### Architecture
```text
Region A (Primary)            Region B (DR/Read)
┌─────────────────────┐       ┌─────────────────────┐
│ Gateway             │       │ Gateway (read)      │
│ Storage (primary)   │──────▶│ Storage (replica)   │
│ PostgreSQL (primary)│──────▶│ PostgreSQL (replica)│
│ Redis (primary)     │       │ Redis (replica)     │
└─────────────────────┘       └─────────────────────┘
```
### Considerations
| Aspect | Strategy |
|---|---|
| Data replication | Async replication (PostgreSQL, SeaweedFS) |
| Routing | GeoDNS or latency-based routing |
| Consistency | Eventually consistent for cross-region reads |
| Failover | Manual or automated (RTO/RPO defined) |
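The routing row above usually means latency-based region selection with health filtering. A sketch of that decision (`Region` and `pickRegion` are hypothetical; in practice GeoDNS or your cloud provider's routing policy does this at the DNS layer):

```typescript
interface Region {
  name: string;
  latencyMs: number; // measured from the client's vantage point
  healthy: boolean;
}

// Route to the lowest-latency healthy region; unhealthy regions are
// skipped entirely, which doubles as simple failover.
function pickRegion(regions: Region[]): Region {
  const healthy = regions.filter((r) => r.healthy);
  if (healthy.length === 0) throw new Error("no healthy regions");
  return healthy.reduce((best, r) => (r.latencyMs < best.latencyMs ? r : best));
}
```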
### Cross-Region Object Replication

For object storage, use replication rules:
```json
{
  "replication": {
    "role": "source",
    "rules": [
      {
        "id": "replicate-to-region-b",
        "status": "enabled",
        "destination": {
          "bucket": "arn:nfyio:storage:region-b::my-bucket",
          "storage_class": "STANDARD"
        },
        "filter": { "prefix": "critical/" }
      }
    ]
  }
}
```
## Scaling Checklist
- API Gateway replicas behind load balancer
- Storage volume nodes scaled for capacity
- Embedding workers scaled for throughput
- Read replicas for database
- PgBouncer for connection pooling
- pgvector index tuned (IVFFlat or HNSW)
- Redis for sessions and caching
- CDN for public objects
- Multi-region plan if required
## Next Steps
- Performance Optimization — Tuning for efficiency
- Cost Optimization — Scaling cost-effectively
- Architecture — System overview
- Storage Overview — Storage scaling