Use Custom Embedding Models and Local LLMs with nfyio
Replace OpenAI with local embedding models using Ollama, HuggingFace Transformers, or vLLM. Keep data on-premises and reduce costs.
nfyio Team
Talya Smart & Technoplatz JV
nfyio’s default embedding pipeline uses OpenAI’s text-embedding-3-small. For air-gapped environments, data privacy requirements, or cost optimization, you can swap in any embedding model — including local models running on your own GPU hardware.
Why Run Local Embeddings?
| Factor | OpenAI API | Local Model |
|---|---|---|
| Data privacy | Data sent to OpenAI | Data stays on-premises |
| Cost | $0.02/1M tokens | GPU cost only (fixed) |
| Latency | 50-200ms per batch | 10-50ms per batch (GPU) |
| Rate limits | 3,500 RPM (tier 1) | None (bounded only by hardware) |
| Internet required | Yes | No |
| Model customization | No | Fine-tune on your data |
Option 1: Ollama (Recommended for Simplicity)
Ollama runs embedding models locally with a single command.
Install and Start Ollama
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull an embedding model
ollama pull nomic-embed-text
# Verify it works
curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "test sentence for embedding"
}' | jq '.embedding | length'
Output: 768 (dimensions)
Configure nfyio to Use Ollama
Update your .env or Docker Compose environment:
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_BASE_URL=http://localhost:11434
EMBEDDING_DIMENSIONS=768
Or via the API:
curl -X PATCH http://localhost:3000/api/v1/config/embeddings \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"provider": "ollama",
"model": "nomic-embed-text",
"base_url": "http://localhost:11434",
"dimensions": 768
}'
Docker Compose with Ollama
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama-data:
Pull models after startup:
docker exec ollama ollama pull nomic-embed-text
Available Embedding Models for Ollama
| Model | Dimensions | Size | Speed | Quality |
|---|---|---|---|---|
| nomic-embed-text | 768 | 274 MB | Fast | Good |
| mxbai-embed-large | 1024 | 670 MB | Medium | Better |
| all-minilm | 384 | 46 MB | Very fast | Acceptable |
| snowflake-arctic-embed | 1024 | 670 MB | Medium | Best |
Option 2: HuggingFace Inference Server
For production workloads with custom fine-tuned models.
Deploy with Text Embeddings Inference (TEI)
docker run -p 8080:80 \
--gpus all \
ghcr.io/huggingface/text-embeddings-inference:1.2 \
--model-id BAAI/bge-large-en-v1.5 \
--max-batch-tokens 16384
Test:
curl http://localhost:8080/embed -d '{
"inputs": "test embedding sentence"
}' | jq '.[0] | length'
Output: 1024
Configure nfyio
EMBEDDING_PROVIDER=openai-compatible
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_BASE_URL=http://localhost:8080
EMBEDDING_DIMENSIONS=1024
EMBEDDING_API_KEY=not-needed
The HuggingFace TEI server exposes an OpenAI-compatible /v1/embeddings endpoint, so nfyio connects to it using the same client.
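Because the endpoint speaks the OpenAI wire format, any OpenAI-style client can talk to it. A hedged sketch of the request and response shape using only the standard library (helper names are illustrative):

```python
import json
import urllib.request

def openai_embed_request(base_url: str, model: str, texts: list[str]) -> urllib.request.Request:
    """Build an OpenAI-format /v1/embeddings request for a TEI server."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def parse_embeddings(response_body: dict) -> list[list[float]]:
    """Extract vectors from an OpenAI-format embeddings response."""
    # The response carries one item per input, ordered by "index"
    items = sorted(response_body["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]

# Trimmed example of the response shape:
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
    "model": "BAAI/bge-large-en-v1.5",
}
print(len(parse_embeddings(sample)[0]))  # 3 in this trimmed sample; 1024 for bge-large
```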
Option 3: vLLM (for GPU Clusters)
vLLM provides high-throughput inference for both embeddings and generation.
vllm serve BAAI/bge-large-en-v1.5 \
--task embed \
--port 8000 \
--max-model-len 512 \
--gpu-memory-utilization 0.8
Configure nfyio:
EMBEDDING_PROVIDER=openai-compatible
EMBEDDING_BASE_URL=http://localhost:8000/v1
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_DIMENSIONS=1024
Migrate Existing Embeddings
When switching models, existing embeddings become incompatible (different dimensions and vector space). Re-embed all objects:
# Trigger re-embedding for a bucket
curl -X POST http://localhost:3000/api/v1/buckets/my-bucket/reindex \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{"force": true}'
Response:
{
"status": "queued",
"objects_to_process": 14238,
"estimated_time_seconds": 420
}
Monitor progress:
watch -n 5 'curl -s http://localhost:3000/api/v1/buckets/my-bucket/reindex/status \
-H "Authorization: Bearer $JWT" | jq'
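The counters in the status response can be turned into a simple progress readout. A sketch, assuming the status endpoint reports a processed count alongside the `objects_to_process` total shown above (the `objects_processed` field name is an assumption):

```python
def progress_line(processed: int, total: int) -> str:
    """Format reindex progress as 'processed/total (pct%)'."""
    pct = 100.0 * processed / total if total else 100.0
    return f"{processed}/{total} ({pct:.1f}%)"

def eta_seconds(processed: int, total: int, elapsed_s: float) -> float:
    """Estimate remaining seconds from throughput observed so far."""
    if processed == 0:
        return float("inf")
    rate = processed / elapsed_s  # objects per second
    return (total - processed) / rate

print(progress_line(7119, 14238))  # → 7119/14238 (50.0%)
```

Fed from a polling loop, this gives the same view as the `watch` one-liner but is easy to wire into dashboards or CI.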
Update pgvector Index
After re-embedding with a different dimension, recreate the index:
-- Drop old index
DROP INDEX IF EXISTS embeddings_vector_idx;
-- Alter column dimension (768 for nomic-embed-text)
ALTER TABLE embeddings
ALTER COLUMN embedding TYPE vector(768);
-- Recreate index
CREATE INDEX embeddings_vector_idx
ON embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 200);
VACUUM ANALYZE embeddings;
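The dimension in the `ALTER` statement must match the model you configured. A small helper that maps the models from this guide to their dimensions and emits the matching DDL (the mapping mirrors the tables above; this is not an nfyio API):

```python
# Dimensions per embedding model, as listed in the tables above
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
    "all-minilm": 384,
    "snowflake-arctic-embed": 1024,
    "BAAI/bge-large-en-v1.5": 1024,
}

def alter_embedding_column(model: str) -> str:
    """Emit the ALTER TABLE statement for the model's vector size."""
    dims = MODEL_DIMENSIONS[model]
    return (
        "ALTER TABLE embeddings "
        f"ALTER COLUMN embedding TYPE vector({dims});"
    )

print(alter_embedding_column("nomic-embed-text"))
# → ALTER TABLE embeddings ALTER COLUMN embedding TYPE vector(768);
```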
Fine-Tuning Your Own Model
For domain-specific accuracy, fine-tune on your own data.
Generate Training Data from nfyio
# Export search queries and relevant documents
curl -X GET http://localhost:3000/api/v1/analytics/search-pairs \
-H "Authorization: Bearer $JWT" \
-o training-pairs.jsonl
{"query": "kubernetes pod restart policy", "positive": "Set restartPolicy to Always...", "negative": "Redis configuration guide..."}
{"query": "pgvector index tuning", "positive": "HNSW parameters m and ef_construction...", "negative": "SeaweedFS volume compaction..."}
Fine-Tune with Sentence Transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import json

# Start from a strong general-purpose base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Each training example is an (anchor, positive, negative) triplet
train_examples = []
with open("training-pairs.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        train_examples.append(InputExample(
            texts=[pair["query"], pair["positive"], pair["negative"]]
        ))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./nfyio-custom-embeddings",
)
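Before deploying, it's worth checking that the model actually ranks positives above negatives on held-out triplets. A model-agnostic sketch of that check using cosine similarity (pure Python, works on vectors from any embedding model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def triplet_accuracy(triplets: list[tuple[list[float], list[float], list[float]]]) -> float:
    """Fraction of (query, positive, negative) triplets ranked correctly."""
    correct = sum(
        1 for q, pos, neg in triplets if cosine(q, pos) > cosine(q, neg)
    )
    return correct / len(triplets)

# Toy vectors: the positive points the same way as the query
toy = [([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
print(triplet_accuracy(toy))  # → 1.0
```

In practice you would embed a held-out slice of `training-pairs.jsonl` with `model.encode(...)` and pass those vectors to `triplet_accuracy`; a score near your base model's means the fine-tune added little.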
Deploy the fine-tuned model:
docker run -p 8080:80 --gpus all \
-v ./nfyio-custom-embeddings:/model \
ghcr.io/huggingface/text-embeddings-inference:1.2 \
--model-id /model
Comparison: OpenAI vs Local
# Benchmark: 1000 documents, 512 tokens each
# OpenAI text-embedding-3-small
time curl -X POST http://localhost:3000/api/v1/benchmark/embed \
-H "Authorization: Bearer $JWT" \
-d '{"provider": "openai", "count": 1000}'
# Result: 45.2s, $0.01
# Ollama nomic-embed-text (RTX 4090)
time curl -X POST http://localhost:3000/api/v1/benchmark/embed \
-H "Authorization: Bearer $JWT" \
-d '{"provider": "ollama", "count": 1000}'
# Result: 8.7s, $0.00
# HuggingFace TEI bge-large (A100)
time curl -X POST http://localhost:3000/api/v1/benchmark/embed \
-H "Authorization: Bearer $JWT" \
-d '{"provider": "openai-compatible", "count": 1000}'
# Result: 3.1s, $0.00
Key Takeaways
- Ollama is the fastest path to local embeddings — one command to install, one to pull a model
- HuggingFace TEI provides an OpenAI-compatible API, making it a drop-in replacement
- When switching models, always re-index existing embeddings and update the pgvector index dimensions
- Fine-tuning on your own search data can improve retrieval accuracy by roughly 15-30% for domain-specific queries
- Local models eliminate per-token costs and API rate limits — critical for high-volume embedding pipelines
- For air-gapped deployments, local models are the only option
For embedding pipeline configuration, see the API reference. For performance tuning, see the performance guide.