RAG Agents

Retrieval Augmented Generation agents for document Q&A. Learn the architecture, embedding models, chunking strategies, and how to create and query RAG agents via the Chat UI and API.

RAG (Retrieval Augmented Generation) agents combine semantic search with LLM generation. They ingest your documents, index them as embeddings, and answer questions by retrieving relevant chunks and passing them to the LLM as context. Use RAG agents for document Q&A, knowledge-base chatbots, and internal search.

What is RAG?

RAG augments LLM responses with retrieved context from your documents. Instead of relying solely on the model’s training data, the model receives relevant passages from your corpus and generates answers grounded in that context. This reduces hallucinations and keeps answers up-to-date with your data.

RAG Flow

  1. Ingest — Documents are uploaded to an S3 bucket
  2. Chunk — Documents are split into overlapping chunks (e.g., 512 tokens)
  3. Embed — Chunks are converted to vector embeddings
  4. Index — Embeddings are stored in pgvector
  5. Query — User question is embedded and similar chunks are retrieved
  6. Generate — Retrieved chunks + question are sent to the LLM for answer generation
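
The sketch below walks through steps 5 and 6 at query time, assuming the OpenAI Python SDK for embeddings and generation; the vector_store object and its search method are illustrative stand-ins, not the platform's actual internals.

# Minimal sketch of the query path (steps 5-6). Assumes the OpenAI
# Python SDK and an OPENAI_API_KEY in the environment; `vector_store`
# is a hypothetical stand-in for the pgvector index.
from openai import OpenAI

client = OpenAI()

def answer(question: str, vector_store, top_k: int = 5) -> str:
    # Step 5: embed the question and retrieve similar chunks
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    chunks = vector_store.search(q_vec, top_k)  # hypothetical helper

    # Step 6: send retrieved chunks + question to the LLM
    context = "\n\n".join(chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer based only on the provided context.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content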

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Documents   │────►│  Embeddings  │────►│  Vector DB   │────►│  Retrieval   │────►│     LLM      │
│  (S3 bucket) │     │  (OpenAI,    │     │  (pgvector)  │     │  (Top-K)     │     │  (GPT-4o,    │
│              │     │   Voyage)    │     │              │     │              │     │   Claude)    │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
       │                     │                    │                    │                    │
       │                     │                    │                    │                    │
       ▼                     ▼                    ▼                    ▼                    ▼
   Chunk size            Model config         Similarity           Top-K value         System prompt
   256/512/1024          text-embedding-3     cosine/L2            (e.g., 5)          + context

Creating a RAG Agent

1. Prepare Your Bucket

Create an S3 bucket and upload documents (PDF, DOCX, TXT, Markdown, images). The agent will watch the bucket or a specific prefix for new objects.

aws s3 cp ./docs/ s3://my-knowledge-base/ --recursive \
  --endpoint-url https://storage.yourdomain.com

2. Configure Embedding Model

Choose an embedding model:

Model                    Dimensions  Use Case
text-embedding-3-small   1536        Fast, cost-effective
text-embedding-3-large   3072        Higher quality
voyage-3.5-lite          1024        Alternative, good for long docs
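
If you are unsure which dimensions your index expects, a quick check with the OpenAI Python SDK (assuming an OPENAI_API_KEY in the environment):

# Print the embedding dimensions for a given model.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is our refund policy?",
)
print(len(resp.data[0].embedding))  # 1536 for text-embedding-3-small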

3. Set Chunk Size

Chunk size affects retrieval quality and cost:

Size (tokens)  Pros                                    Cons
256            Fine-grained, many chunks               More API calls, may miss context
512            Balanced (default for most use cases)   —
1024           More context per chunk                  Fewer chunks, coarser retrieval
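
For intuition, here is a sliding-window chunker in Python using tiktoken's cl100k_base encoding; the tokenizer choice is an assumption, and the platform's internal chunker may differ.

# Token-based chunking with overlap (e.g., 512-token chunks, 64-token
# overlap). Overlapping windows keep sentences at chunk boundaries
# intact in at least one chunk.
import tiktoken

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks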

4. Configure LLM

Select the LLM for generation:

  • GPT-4o — Best quality, recommended for production
  • Claude 3 — Strong alternative
  • GPT-3.5-turbo — Faster, lower cost for simple Q&A

5. Set Top-K

Top-K controls how many chunks are retrieved per query. Typical values: 3–10. Higher values give more context but increase token usage and latency.
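
Under the hood, a Top-K query against pgvector looks roughly like the sketch below; the chunks table and column names are assumptions. Cosine distance is pgvector's <=> operator, so similarity = 1 - distance.

# Retrieve the top-K most similar chunks, then drop any below the
# similarity threshold. Assumes a `chunks(content text, embedding vector)`
# table and the pgvector Python adapter; query_vec is a numpy array.
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=rag")
register_vector(conn)

def retrieve(query_vec, top_k=5, threshold=0.7):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, 1 - (embedding <=> %s) AS similarity
            FROM chunks
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_vec, query_vec, top_k),
        )
        return [(c, s) for c, s in cur.fetchall() if s >= threshold]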

JSON Configuration Example

{
  "name": "company-knowledge-base",
  "type": "rag",
  "bucket": "my-knowledge-base",
  "prefix": "docs/",
  "embedding": {
    "model": "text-embedding-3-small",
    "chunkSize": 512,
    "chunkOverlap": 64
  },
  "llm": {
    "model": "gpt-4o",
    "temperature": 0.2,
    "maxTokens": 2048
  },
  "retrieval": {
    "topK": 5,
    "similarityThreshold": 0.7
  },
  "systemPrompt": "You are a helpful assistant. Answer based only on the provided context. If the context does not contain the answer, say so."
}
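
As a hedged example, you might register this configuration via the API; the POST /v1/agents endpoint below is hypothetical, so check your deployment's API reference for the actual route.

# Hypothetical: submit the agent config saved as agent.json.
import json, os, requests

with open("agent.json") as f:
    config = json.load(f)

resp = requests.post(
    "https://api.yourdomain.com/v1/agents",  # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    json=config,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())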

Querying via Chat UI

Use the NFYio Chat UI to interact with your RAG agent:

  1. Select the agent from the dropdown
  2. Type your question
  3. The agent retrieves relevant chunks and streams the response

Conversations are persisted as threads. You can resume previous threads or start new ones.

Querying via API

Chat Completion (Streaming)

curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ],
    "stream": true
  }'
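
From Python, the same streaming call can be consumed line by line; this assumes the endpoint emits OpenAI-style server-sent events ("data: {json}" lines terminated by "data: [DONE]").

# Stream the answer token-by-token as SSE lines arrive.
import json, os, requests

url = "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat"
headers = {"Authorization": f"Bearer {os.environ['TOKEN']}"}
body = {
    "messages": [{"role": "user", "content": "What is our refund policy?"}],
    "stream": True,
}

with requests.post(url, headers=headers, json=body, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if delta.get("content"):
            print(delta["content"], end="", flush=True)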

Chat Completion (Non-Streaming)

curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ],
    "stream": false
  }'

Response Format (Non-Streaming)

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Our refund policy allows returns within 30 days..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 450,
    "completion_tokens": 120,
    "total_tokens": 570
  }
}

With Thread ID (Conversation History)

curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "threadId": "thread_xyz789",
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ],
    "stream": true
  }'

Best Practices

Chunking

  • Use overlap (e.g., 64 tokens) to avoid splitting sentences awkwardly
  • For structured docs (tables, lists), consider smaller chunks
  • For narrative text, 512 tokens is a good default

Retrieval

  • Start with Top-K = 5 and tune based on answer quality
  • Use similarityThreshold to filter low-relevance chunks
  • Consider hybrid search (keyword + semantic) for mixed queries (see the sketch below)
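
A hedged sketch of one hybrid approach, blending Postgres full-text rank with pgvector cosine similarity; the chunks table, the 'english' text-search config, and the 50/50 weighting are all assumptions.

# Score each chunk by keyword relevance (ts_rank) plus semantic
# similarity (1 - cosine distance), and return the best matches.
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=rag")
register_vector(conn)  # lets psycopg2 pass numpy arrays as vectors

HYBRID_SQL = """
SELECT content,
       0.5 * ts_rank(to_tsvector('english', content),
                     plainto_tsquery('english', %s))
     + 0.5 * (1 - (embedding <=> %s)) AS score
FROM chunks
ORDER BY score DESC
LIMIT %s
"""

def hybrid_search(query_text, query_vec, top_k=5):
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, (query_text, query_vec, top_k))
        return cur.fetchall()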

System Prompts

  • Instruct the model to answer only from the provided context
  • Add fallback behavior: “If the context doesn’t contain the answer, say ‘I don’t have that information.’”
  • Include domain-specific instructions (tone, format, citations)

Re-indexing

  • Re-index when you add or update documents
  • Use incremental indexing when possible to avoid full re-embedding (see the sketch after this list)
  • Monitor embedding and indexing latency for large corpora
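
One way to make indexing incremental, sketched under the assumption that you can keep a local manifest of content hashes (the manifest format and paths are illustrative):

# Re-embed a document only when its content hash changes.
import hashlib, json, pathlib

MANIFEST = pathlib.Path("index-manifest.json")
manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def needs_reindex(doc: pathlib.Path) -> bool:
    digest = hashlib.sha256(doc.read_bytes()).hexdigest()
    if manifest.get(str(doc)) == digest:
        return False  # unchanged: skip re-embedding
    manifest[str(doc)] = digest
    return True

for doc in pathlib.Path("docs").glob("**/*.md"):
    if needs_reindex(doc):
        print(f"re-embedding {doc}")

MANIFEST.write_text(json.dumps(manifest, indent=2))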

Next Steps