RAG for Agents: Augmented Memory
How to implement Retrieval-Augmented Generation to give your agents contextual knowledge and long-term memory.
What is RAG?
Retrieval-Augmented Generation (RAG) is a pattern that combines retrieval of relevant information with text generation by an LLM. For agents, RAG acts as long-term memory: instead of relying solely on what the model learned during training, the agent retrieves domain-specific knowledge at inference time and conditions its answer on it.
Basic RAG Architecture
# Note: in recent LangChain releases these imports moved to the
# langchain_community / langchain_chroma packages.
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

class AgentMemory:
    def __init__(self):
        # Embed texts with OpenAI and index them in a local Chroma store.
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma(
            embedding_function=self.embeddings
        )

    def store(self, text: str, metadata: dict):
        """Persist a piece of knowledge along with its metadata."""
        self.vectorstore.add_texts(
            texts=[text],
            metadatas=[metadata]
        )

    def recall(self, query: str, k: int = 5):
        """Retrieve the k stored chunks most similar to the query."""
        return self.vectorstore.similarity_search(
            query, k=k
        )

Chunking Strategies
How we split documents directly impacts retrieval quality:
Fixed-size chunks
Split by fixed token count. Simple but may cut context.
Semantic chunking
Split by semantic units (paragraphs, sections). Better coherence.
Recursive chunking
Split hierarchically, preserving document structure.
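The first two strategies can be sketched in a few lines. This is a dependency-free illustration (function names are mine, and it splits on characters; production code would count tokens with a tokenizer such as tiktoken):

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunks with overlap so context isn't cut cleanly at edges."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_semantic(text: str) -> list[str]:
    """Naive semantic chunking: split on blank lines (paragraph boundaries)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

The overlap means the tail of each chunk is repeated at the head of the next, which softens the cost of cutting a sentence at a chunk boundary.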
Hybrid Search
Combine semantic search with keyword search for better results:
- Dense retrieval: Embedding-based search (semantic similarity)
- Sparse retrieval: BM25 or TF-IDF (keyword matching)
- Hybrid: Combine both with reciprocal rank fusion
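Reciprocal rank fusion (RRF) needs only the rank each document gets in each result list, which makes it easy to sketch independently of any retrieval library (a standard choice for the smoothing constant is k = 60):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of both the dense and the sparse list accumulate the highest scores, without requiring the two scoring scales to be comparable.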
RAG Evaluation
Key metrics for evaluating your RAG pipeline:
- Context relevance: Are the retrieved chunks relevant?
- Faithfulness: Is the answer faithful to the retrieved context?
- Answer quality: Is the answer helpful and complete?
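Context relevance is usually scored with an LLM judge or an evaluation framework such as RAGAS; as a crude, self-contained proxy, a lexical overlap score conveys the idea (this heuristic is mine, not a standard metric):

```python
def context_relevance(query: str, chunks: list[str]) -> float:
    """Mean fraction of query terms that appear in each retrieved chunk."""
    terms = set(query.lower().split())
    if not terms or not chunks:
        return 0.0
    hits = [len(terms & set(c.lower().split())) / len(terms) for c in chunks]
    return sum(hits) / len(hits)
```

A score near 0 signals that retrieval is surfacing off-topic chunks, which caps answer quality no matter how good the generator is.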
Conclusion
RAG is an essential component for agents that need to access specific knowledge. The key is optimizing each stage of the pipeline: ingestion, chunking, embedding, retrieval, and generation.