Use Custom Embedding Models and Local LLMs with nfyio
Replace OpenAI with local embedding models using Ollama, HuggingFace Transformers, or vLLM. Keep data on-premises and reduce costs.
nfyio Team
Talya Smart & Technoplatz JV
nfyio’s default embedding pipeline uses OpenAI’s text-embedding-3-small. For air-gapped environments, data privacy requirements, or cost optimization, you can swap in any embedding model — including local models running on your own GPU hardware.
Why Run Local Embeddings?
| Factor | OpenAI API | Local Model |
|---|---|---|
| Data privacy | Data sent to OpenAI | Data stays on-premises |
| Cost | $0.02/1M tokens | GPU cost only (fixed) |
| Latency | 50-200ms per batch | 10-50ms per batch (GPU) |
| Rate limits | 3,500 RPM (tier 1) | None (bounded only by hardware) |
| Internet required | Yes | No |
| Model customization | No | Fine-tune on your data |
Option 1: Ollama (Recommended for Simplicity)
Ollama runs embedding models locally with a single command.
Install and Start Ollama
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull an embedding model
ollama pull nomic-embed-text
# Verify it works
curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "test sentence for embedding"
}' | jq '.embedding | length'
Output: 768 (dimensions)
Configure nfyio to Use Ollama
Update your .env or Docker Compose environment:
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_BASE_URL=http://localhost:11434
EMBEDDING_DIMENSIONS=768
Or via the API:
curl -X PATCH http://localhost:3000/api/v1/config/embeddings \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"provider": "ollama",
"model": "nomic-embed-text",
"base_url": "http://localhost:11434",
"dimensions": 768
}'
Docker Compose with Ollama
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama-data:
Pull models after startup:
docker exec ollama ollama pull nomic-embed-text
Available Embedding Models for Ollama
| Model | Dimensions | Size | Speed | Quality |
|---|---|---|---|---|
| nomic-embed-text | 768 | 274 MB | Fast | Good |
| mxbai-embed-large | 1024 | 670 MB | Medium | Better |
| all-minilm | 384 | 46 MB | Very fast | Acceptable |
| snowflake-arctic-embed | 1024 | 670 MB | Medium | Best |
Option 2: HuggingFace Inference Server
For production workloads with custom fine-tuned models.
Deploy with Text Embeddings Inference (TEI)
docker run -p 8080:80 \
--gpus all \
ghcr.io/huggingface/text-embeddings-inference:1.2 \
--model-id BAAI/bge-large-en-v1.5 \
--max-batch-tokens 16384
Test:
curl http://localhost:8080/embed -d '{
"inputs": "test embedding sentence"
}' | jq '.[0] | length'
Output: 1024
Configure nfyio
EMBEDDING_PROVIDER=openai-compatible
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_BASE_URL=http://localhost:8080
EMBEDDING_DIMENSIONS=1024
EMBEDDING_API_KEY=not-needed
The HuggingFace TEI server exposes an OpenAI-compatible /v1/embeddings endpoint, so nfyio connects to it using the same client.
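Because the endpoint speaks the OpenAI wire format, any OpenAI-style client can talk to it. A hedged sketch of the request and response shape using only the standard library (helper names are illustrative):

```python
import json
import urllib.request

def openai_embed_request(base_url: str, model: str, texts: list[str]) -> urllib.request.Request:
    """Build an OpenAI-format /v1/embeddings request for a TEI server."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def parse_embeddings(response_body: dict) -> list[list[float]]:
    """Extract vectors from an OpenAI-format embeddings response."""
    # The response carries one item per input, ordered by "index"
    items = sorted(response_body["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]

# Trimmed example of the response shape:
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
    "model": "BAAI/bge-large-en-v1.5",
}
print(len(parse_embeddings(sample)[0]))  # 3 in this trimmed sample; 1024 for bge-large
```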
Option 3: vLLM (for GPU Clusters)
vLLM provides high-throughput inference for both embeddings and generation.
vllm serve BAAI/bge-large-en-v1.5 \
--task embed \
--port 8000 \
--max-model-len 512 \
--gpu-memory-utilization 0.8
Configure nfyio:
EMBEDDING_PROVIDER=openai-compatible
EMBEDDING_BASE_URL=http://localhost:8000/v1
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_DIMENSIONS=1024
Migrate Existing Embeddings
When switching models, existing embeddings become incompatible (different dimensions and vector space). Re-embed all objects:
# Trigger re-embedding for a bucket
curl -X POST http://localhost:3000/api/v1/buckets/my-bucket/reindex \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{"force": true}'
Response:
{
"status": "queued",
"objects_to_process": 14238,
"estimated_time_seconds": 420
}
Monitor progress:
watch -n 5 'curl -s http://localhost:3000/api/v1/buckets/my-bucket/reindex/status \
-H "Authorization: Bearer $JWT" | jq'
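The counters in the status response can be turned into a simple progress readout. A sketch, assuming the status endpoint reports a processed count alongside the `objects_to_process` total shown above (the `objects_processed` field name is an assumption):

```python
def progress_line(processed: int, total: int) -> str:
    """Format reindex progress as 'processed/total (pct%)'."""
    pct = 100.0 * processed / total if total else 100.0
    return f"{processed}/{total} ({pct:.1f}%)"

def eta_seconds(processed: int, total: int, elapsed_s: float) -> float:
    """Estimate remaining seconds from throughput observed so far."""
    if processed == 0:
        return float("inf")
    rate = processed / elapsed_s  # objects per second
    return (total - processed) / rate

print(progress_line(7119, 14238))  # → 7119/14238 (50.0%)
```

Fed from a polling loop, this gives the same view as the `watch` one-liner but is easy to wire into dashboards or CI.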
Update pgvector Index
After re-embedding with a different dimension, recreate the index:
-- Drop old index
DROP INDEX IF EXISTS embeddings_vector_idx;
-- Alter column dimension (768 for nomic-embed-text)
ALTER TABLE embeddings
ALTER COLUMN embedding TYPE vector(768);
-- Recreate index
CREATE INDEX embeddings_vector_idx
ON embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 200);
VACUUM ANALYZE embeddings;
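The dimension in the `ALTER` statement must match the model you configured. A small helper that maps the models from this guide to their dimensions and emits the matching DDL (the mapping mirrors the tables above; this is not an nfyio API):

```python
# Dimensions per embedding model, as listed in the tables above
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
    "all-minilm": 384,
    "snowflake-arctic-embed": 1024,
    "BAAI/bge-large-en-v1.5": 1024,
}

def alter_embedding_column(model: str) -> str:
    """Emit the ALTER TABLE statement for the model's vector size."""
    dims = MODEL_DIMENSIONS[model]
    return (
        "ALTER TABLE embeddings "
        f"ALTER COLUMN embedding TYPE vector({dims});"
    )

print(alter_embedding_column("nomic-embed-text"))
# → ALTER TABLE embeddings ALTER COLUMN embedding TYPE vector(768);
```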
Fine-Tuning Your Own Model
For domain-specific accuracy, fine-tune on your own data.
Generate Training Data from nfyio
# Export search queries and relevant documents
curl -X GET http://localhost:3000/api/v1/analytics/search-pairs \
-H "Authorization: Bearer $JWT" \
-o training-pairs.jsonl
{"query": "kubernetes pod restart policy", "positive": "Set restartPolicy to Always...", "negative": "Redis configuration guide..."}
{"query": "pgvector index tuning", "positive": "HNSW parameters m and ef_construction...", "negative": "SeaweedFS volume compaction..."}
Fine-Tune with Sentence Transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import json

# Start from a strong general-purpose base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Each training example is an (anchor, positive, negative) triplet
train_examples = []
with open("training-pairs.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        train_examples.append(InputExample(
            texts=[pair["query"], pair["positive"], pair["negative"]]
        ))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./nfyio-custom-embeddings",
)
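Before deploying, it's worth checking that the model actually ranks positives above negatives on held-out triplets. A model-agnostic sketch of that check using cosine similarity (pure Python, works on vectors from any embedding model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def triplet_accuracy(triplets: list[tuple[list[float], list[float], list[float]]]) -> float:
    """Fraction of (query, positive, negative) triplets ranked correctly."""
    correct = sum(
        1 for q, pos, neg in triplets if cosine(q, pos) > cosine(q, neg)
    )
    return correct / len(triplets)

# Toy vectors: the positive points the same way as the query
toy = [([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
print(triplet_accuracy(toy))  # → 1.0
```

In practice you would embed a held-out slice of `training-pairs.jsonl` with `model.encode(...)` and pass those vectors to `triplet_accuracy`; a score near your base model's means the fine-tune added little.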
Deploy the fine-tuned model:
docker run -p 8080:80 --gpus all \
-v ./nfyio-custom-embeddings:/model \
ghcr.io/huggingface/text-embeddings-inference:1.2 \
--model-id /model
Comparison: OpenAI vs Local
# Benchmark: 1000 documents, 512 tokens each
# OpenAI text-embedding-3-small
time curl -X POST http://localhost:3000/api/v1/benchmark/embed \
-H "Authorization: Bearer $JWT" \
-d '{"provider": "openai", "count": 1000}'
# Result: 45.2s, $0.01
# Ollama nomic-embed-text (RTX 4090)
time curl -X POST http://localhost:3000/api/v1/benchmark/embed \
-H "Authorization: Bearer $JWT" \
-d '{"provider": "ollama", "count": 1000}'
# Result: 8.7s, $0.00
# HuggingFace TEI bge-large (A100)
time curl -X POST http://localhost:3000/api/v1/benchmark/embed \
-H "Authorization: Bearer $JWT" \
-d '{"provider": "openai-compatible", "count": 1000}'
# Result: 3.1s, $0.00
Key Takeaways
- Ollama is the fastest path to local embeddings — one command to install, one to pull a model
- HuggingFace TEI provides an OpenAI-compatible API, making it a drop-in replacement
- When switching models, always re-index existing embeddings and update the pgvector index dimensions
- Fine-tuning on your own search data can improve retrieval accuracy by roughly 15-30% for domain-specific queries
- Local models eliminate per-token costs and API rate limits — critical for high-volume embedding pipelines
- For air-gapped deployments, local models are the only option
For embedding pipeline configuration, see the API reference. For performance tuning, see the performance guide.