RAG for Agents: Augmented Memory
How to implement Retrieval-Augmented Generation to give your agents contextual knowledge and long-term memory.
What is RAG?
Retrieval-Augmented Generation (RAG) is a pattern that combines retrieval of relevant information with text generation by an LLM. For agents, RAG acts as long-term memory: instead of relying solely on what the model learned during training, the agent retrieves domain-specific knowledge at inference time and conditions its answer on it.
Basic RAG Architecture
# Note: in recent LangChain releases these imports moved to the
# langchain_community / langchain_chroma packages.
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

class AgentMemory:
    def __init__(self):
        # Embed texts with OpenAI and index them in a local Chroma store.
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma(
            embedding_function=self.embeddings
        )

    def store(self, text: str, metadata: dict):
        """Persist a piece of knowledge along with its metadata."""
        self.vectorstore.add_texts(
            texts=[text],
            metadatas=[metadata]
        )

    def recall(self, query: str, k: int = 5):
        """Retrieve the k stored chunks most similar to the query."""
        return self.vectorstore.similarity_search(
            query, k=k
        )

Chunking Strategies
How we split documents directly impacts retrieval quality:
Fixed-size chunks
Split by fixed token count. Simple but may cut context.
Semantic chunking
Split by semantic units (paragraphs, sections). Better coherence.
Recursive chunking
Split hierarchically, preserving document structure.
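The first two strategies can be sketched in a few lines. This is a dependency-free illustration (function names are mine, and it splits on characters; production code would count tokens with a tokenizer such as tiktoken):

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunks with overlap so context isn't cut cleanly at edges."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_semantic(text: str) -> list[str]:
    """Naive semantic chunking: split on blank lines (paragraph boundaries)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

The overlap means the tail of each chunk is repeated at the head of the next, which softens the cost of cutting a sentence at a chunk boundary.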
Hybrid Search
Combine semantic search with keyword search for better results:
- Dense retrieval: Embedding-based search (semantic similarity)
- Sparse retrieval: BM25 or TF-IDF (keyword matching)
- Hybrid: Combine both with reciprocal rank fusion
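Reciprocal rank fusion (RRF) needs only the rank each document gets in each result list, which makes it easy to sketch independently of any retrieval library (a standard choice for the smoothing constant is k = 60):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of both the dense and the sparse list accumulate the highest scores, without requiring the two scoring scales to be comparable.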
RAG Evaluation
Key metrics for evaluating your RAG pipeline:
- Context relevance: Are the retrieved chunks relevant?
- Faithfulness: Is the answer faithful to the retrieved context?
- Answer quality: Is the answer helpful and complete?
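Context relevance is usually scored with an LLM judge or an evaluation framework such as RAGAS; as a crude, self-contained proxy, a lexical overlap score conveys the idea (this heuristic is mine, not a standard metric):

```python
def context_relevance(query: str, chunks: list[str]) -> float:
    """Mean fraction of query terms that appear in each retrieved chunk."""
    terms = set(query.lower().split())
    if not terms or not chunks:
        return 0.0
    hits = [len(terms & set(c.lower().split())) / len(terms) for c in chunks]
    return sum(hits) / len(hits)
```

A score near 0 signals that retrieval is surfacing off-topic chunks, which caps answer quality no matter how good the generator is.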
Conclusion
RAG is an essential component for agents that need to access specific knowledge. The key is optimizing each stage of the pipeline: ingestion, chunking, embedding, retrieval, and generation.