RAG Agents

Retrieval Augmented Generation agents for document Q&A. Learn the architecture, embedding models, chunking strategies, and how to create and query RAG agents via the Chat UI and API.

RAG (Retrieval Augmented Generation) agents combine semantic search with LLM generation. They ingest your documents, index them as embeddings, and answer questions by retrieving relevant chunks and passing them to the LLM as context. Use RAG agents for document Q&A, knowledge-base chatbots, and internal search.

What is RAG?

RAG augments LLM responses with retrieved context from your documents. Instead of relying solely on the model’s training data, the model receives relevant passages from your corpus and generates answers grounded in that context. This reduces hallucinations and keeps answers up-to-date with your data.

RAG Flow

  1. Ingest — Documents are uploaded to an S3 bucket
  2. Chunk — Documents are split into overlapping chunks (e.g., 512 tokens)
  3. Embed — Chunks are converted to vector embeddings
  4. Index — Embeddings are stored in pgvector
  5. Query — User question is embedded and similar chunks are retrieved
  6. Generate — Retrieved chunks + question are sent to the LLM for answer generation
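
The sketch below walks through steps 5 and 6 at query time, assuming the OpenAI Python SDK for embeddings and generation; the vector_store object and its search method are illustrative stand-ins, not the platform's actual internals.

# Minimal sketch of the query path (steps 5-6). Assumes the OpenAI
# Python SDK and an OPENAI_API_KEY in the environment; `vector_store`
# is a hypothetical stand-in for the pgvector index.
from openai import OpenAI

client = OpenAI()

def answer(question: str, vector_store, top_k: int = 5) -> str:
    # Step 5: embed the question and retrieve similar chunks
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    chunks = vector_store.search(q_vec, top_k)  # hypothetical helper

    # Step 6: send retrieved chunks + question to the LLM
    context = "\n\n".join(chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer based only on the provided context.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content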

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Documents   │────►│  Embeddings  │────►│  Vector DB   │────►│  Retrieval   │────►│     LLM      │
│  (S3 bucket) │     │  (OpenAI,    │     │  (pgvector)  │     │  (Top-K)     │     │  (GPT-4o,    │
│              │     │   Voyage)    │     │              │     │              │     │   Claude)    │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
       │                     │                    │                    │                    │
       │                     │                    │                    │                    │
       ▼                     ▼                    ▼                    ▼                    ▼
   Chunk size            Model config         Similarity           Top-K value         System prompt
   256/512/1024          text-embedding-3     cosine/L2            (e.g., 5)          + context

Creating a RAG Agent

1. Prepare Your Bucket

Create an S3 bucket and upload documents (PDF, DOCX, TXT, Markdown, images). The agent will watch the bucket or a specific prefix for new objects.

aws s3 cp ./docs/ s3://my-knowledge-base/ --recursive \
  --endpoint-url https://storage.yourdomain.com

2. Configure Embedding Model

Choose an embedding model:

Model                    Dimensions  Use Case
text-embedding-3-small   1536        Fast, cost-effective
text-embedding-3-large   3072        Higher quality
voyage-3.5-lite          1024        Alternative, good for long docs
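
If you are unsure which dimensions your index expects, a quick check with the OpenAI Python SDK (assuming an OPENAI_API_KEY in the environment):

# Print the embedding dimensions for a given model.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is our refund policy?",
)
print(len(resp.data[0].embedding))  # 1536 for text-embedding-3-small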

3. Set Chunk Size

Chunk size affects retrieval quality and cost:

Size (tokens)  Pros                                    Cons
256            Fine-grained, many chunks               More API calls, may miss context
512            Balanced (default for most use cases)   —
1024           More context per chunk                  Fewer chunks, coarser retrieval
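
For intuition, here is a sliding-window chunker in Python using tiktoken's cl100k_base encoding; the tokenizer choice is an assumption, and the platform's internal chunker may differ.

# Token-based chunking with overlap (e.g., 512-token chunks, 64-token
# overlap). Overlapping windows keep sentences at chunk boundaries
# intact in at least one chunk.
import tiktoken

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks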

4. Configure LLM

Select the LLM for generation:

  • GPT-4o — Best quality, recommended for production
  • Claude 3 — Strong alternative
  • GPT-3.5-turbo — Faster, lower cost for simple Q&A

5. Set Top-K

Top-K controls how many chunks are retrieved per query. Typical values: 3–10. Higher values give more context but increase token usage and latency.
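
Under the hood, a Top-K query against pgvector looks roughly like the sketch below; the chunks table and column names are assumptions. Cosine distance is pgvector's <=> operator, so similarity = 1 - distance.

# Retrieve the top-K most similar chunks, then drop any below the
# similarity threshold. Assumes a `chunks(content text, embedding vector)`
# table and the pgvector Python adapter; query_vec is a numpy array.
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=rag")
register_vector(conn)

def retrieve(query_vec, top_k=5, threshold=0.7):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, 1 - (embedding <=> %s) AS similarity
            FROM chunks
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_vec, query_vec, top_k),
        )
        return [(c, s) for c, s in cur.fetchall() if s >= threshold]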

JSON Configuration Example

{
  "name": "company-knowledge-base",
  "type": "rag",
  "bucket": "my-knowledge-base",
  "prefix": "docs/",
  "embedding": {
    "model": "text-embedding-3-small",
    "chunkSize": 512,
    "chunkOverlap": 64
  },
  "llm": {
    "model": "gpt-4o",
    "temperature": 0.2,
    "maxTokens": 2048
  },
  "retrieval": {
    "topK": 5,
    "similarityThreshold": 0.7
  },
  "systemPrompt": "You are a helpful assistant. Answer based only on the provided context. If the context does not contain the answer, say so."
}
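
As a hedged example, you might register this configuration via the API; the POST /v1/agents endpoint below is hypothetical, so check your deployment's API reference for the actual route.

# Hypothetical: submit the agent config saved as agent.json.
import json, os, requests

with open("agent.json") as f:
    config = json.load(f)

resp = requests.post(
    "https://api.yourdomain.com/v1/agents",  # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    json=config,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())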

Querying via Chat UI

Use the NFYio Chat UI to interact with your RAG agent:

  1. Select the agent from the dropdown
  2. Type your question
  3. The agent retrieves relevant chunks and streams the response

Conversations are persisted as threads. You can resume previous threads or start new ones.

Querying via API

Chat Completion (Streaming)

curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ],
    "stream": true
  }'
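
From Python, the same streaming call can be consumed line by line; this assumes the endpoint emits OpenAI-style server-sent events ("data: {json}" lines terminated by "data: [DONE]").

# Stream the answer token-by-token as SSE lines arrive.
import json, os, requests

url = "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat"
headers = {"Authorization": f"Bearer {os.environ['TOKEN']}"}
body = {
    "messages": [{"role": "user", "content": "What is our refund policy?"}],
    "stream": True,
}

with requests.post(url, headers=headers, json=body, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if delta.get("content"):
            print(delta["content"], end="", flush=True)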

Chat Completion (Non-Streaming)

curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ],
    "stream": false
  }'

Response Format (Non-Streaming)

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Our refund policy allows returns within 30 days..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 450,
    "completion_tokens": 120,
    "total_tokens": 570
  }
}

With Thread ID (Conversation History)

curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "threadId": "thread_xyz789",
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ],
    "stream": true
  }'

Best Practices

Chunking

  • Use overlap (e.g., 64 tokens) to avoid splitting sentences awkwardly
  • For structured docs (tables, lists), consider smaller chunks
  • For narrative text, 512 tokens is a good default

Retrieval

  • Start with Top-K = 5 and tune based on answer quality
  • Use similarityThreshold to filter low-relevance chunks
  • Consider hybrid search (keyword + semantic) for mixed queries (see the sketch below)
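
A hedged sketch of one hybrid approach, blending Postgres full-text rank with pgvector cosine similarity; the chunks table, the 'english' text-search config, and the 50/50 weighting are all assumptions.

# Score each chunk by keyword relevance (ts_rank) plus semantic
# similarity (1 - cosine distance), and return the best matches.
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=rag")
register_vector(conn)  # lets psycopg2 pass numpy arrays as vectors

HYBRID_SQL = """
SELECT content,
       0.5 * ts_rank(to_tsvector('english', content),
                     plainto_tsquery('english', %s))
     + 0.5 * (1 - (embedding <=> %s)) AS score
FROM chunks
ORDER BY score DESC
LIMIT %s
"""

def hybrid_search(query_text, query_vec, top_k=5):
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, (query_text, query_vec, top_k))
        return cur.fetchall()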

System Prompts

  • Instruct the model to answer only from the provided context
  • Add fallback behavior: “If the context doesn’t contain the answer, say ‘I don’t have that information.’”
  • Include domain-specific instructions (tone, format, citations)

Re-indexing

  • Re-index when you add or update documents
  • Use incremental indexing when possible to avoid full re-embedding (see the sketch after this list)
  • Monitor embedding and indexing latency for large corpora
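
One way to make indexing incremental, sketched under the assumption that you can keep a local manifest of content hashes (the manifest format and paths are illustrative):

# Re-embed a document only when its content hash changes.
import hashlib, json, pathlib

MANIFEST = pathlib.Path("index-manifest.json")
manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def needs_reindex(doc: pathlib.Path) -> bool:
    digest = hashlib.sha256(doc.read_bytes()).hexdigest()
    if manifest.get(str(doc)) == digest:
        return False  # unchanged: skip re-embedding
    manifest[str(doc)] = digest
    return True

for doc in pathlib.Path("docs").glob("**/*.md"):
    if needs_reindex(doc):
        print(f"re-embedding {doc}")

MANIFEST.write_text(json.dumps(manifest, indent=2))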

Next Steps