Retrieval-Augmented Generation (RAG)

“You want to cross-reference a model’s answers with the original content so you can see what it is basing its answer on” — Luis Lastras, director of language technologies at IBM Research.

Retrieval-Augmented Generation (RAG) techniques are methods in AI that combine information retrieval and text generation to enhance the capabilities of models, especially in tasks requiring detailed and accurate responses. Here's a breakdown of how RAG techniques work and why they are important:

RAG architectures
Here’s a cheat sheet of 7 of the most popular RAG architectures.

How RAG Techniques Work

  1. Retrieval Component: The system first retrieves relevant information from a large corpus or database. This is typically done using a search engine or a specialized retrieval model that can query a collection of documents or knowledge base.
  2. Generation Component: After retrieving relevant information, the system uses a generative model (such as GPT or another language model) to generate a response based on the retrieved information. This model synthesizes the data to produce coherent and contextually appropriate answers or content. A minimal sketch of this retrieve-then-generate loop follows this list.
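
To make the two components concrete, here is a minimal, self-contained Python sketch of the retrieve-then-generate loop. The embed() and llm_generate() helpers are illustrative stand-ins, not a real API: embed() is just a bag-of-words counter, and llm_generate() only formats the prompt, where a production system would call an embedding model and an LLM.

```python
# Minimal sketch of the RAG loop: retrieve relevant documents, then
# hand them to a generator as grounding context.
from collections import Counter
import math

DOCUMENTS = [
    "RAG combines a retriever with a generative language model.",
    "Fine tuning adjusts a model's internal parameters on curated data.",
    "Vector stores index document embeddings for similarity search.",
]

def embed(text):
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Retrieval component: rank all documents by similarity to the query."""
    q = embed(query)
    return sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def llm_generate(prompt):
    """Generation component (stub): a real system would call an LLM here."""
    return "[model answer grounded in the prompt below]\n" + prompt

query = "How does RAG use a retriever?"
context = "\n".join(retrieve(query))
print(llm_generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"))
```

The essential shape is the same at any scale: rank documents against the query, then pass the winners to the generator as context.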

Key Components

  1. Retrieval Model: This part of the system is responsible for fetching relevant documents or pieces of information. Techniques like dense retrieval (using embeddings) or traditional keyword-based search can be used; a sketch of dense retrieval follows this list.
  2. Generative Model: This model generates text based on the retrieved information. It leverages large language models to create responses that are not only informative but also contextually accurate and fluent.
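
As a sketch of the retrieval model built on dense embeddings, the snippet below uses the open source sentence-transformers library. It assumes the package is installed (pip install sentence-transformers), and the model name "all-MiniLM-L6-v2" is just a commonly used example, not a requirement.

```python
# Dense retrieval sketch: embed query and documents into the same
# vector space, then rank by cosine similarity.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "RAG combines retrieval with generation.",
    "Dense retrieval embeds queries and documents in the same vector space.",
    "Keyword search matches exact terms rather than meanings.",
]
# normalize_embeddings=True yields unit vectors, so a dot product
# is exactly cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)

def dense_retrieve(query, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(-scores)[:k]]

print(dense_retrieve("How do embeddings help retrieval?"))
```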

Benefits of RAG Techniques

- Enhanced Accuracy: By retrieving specific and relevant information, the generative model can provide more accurate and detailed responses compared to relying solely on pre-trained knowledge.

- Contextual Relevance: The generative model can tailor its responses based on the context provided by the retrieved information, leading to answers that are more relevant to the query.

- Scalability: RAG techniques allow for handling a large volume of information and generating responses dynamically, which is particularly useful in complex or specialized domains.


Applications

- Question Answering: RAG techniques are often used in systems that answer user queries by first retrieving relevant documents and then generating a comprehensive answer.

- Content Creation: These techniques can help in generating content that is well-informed and contextually accurate by pulling from a vast array of sources.

- Conversational Agents: In chatbots and virtual assistants, RAG methods can enhance the quality of interactions by providing more precise and informative responses based on real-time retrieval of information.

In essence, Retrieval-Augmented Generation is a powerful approach that leverages the strengths of both information retrieval and text generation to create more accurate, relevant, and contextually appropriate outputs in AI systems.


Techniques

Want to level up your AI engineering? RAG remains the top Generative AI use case for most organizations. Enhancing model input with external data can improve the accuracy and contextual richness of responses, but success depends on applying the right techniques effectively. Here are 21 RAG techniques to master:

  1. Simple RAG: Encodes document content into a vector store, enabling quick retrieval of relevant information to enhance model responses.
  2. Context Enrichment: Adds surrounding context to each retrieved chunk, improving the coherence and completeness of the returned information.
  3. Multi-faceted Filtering: Applies various filtering techniques (metadata, similarity thresholds, etc.) to refine and improve the quality of retrieved results.
  4. Fusion Retrieval: Combines vector-based similarity search with keyword-based retrieval to improve document retrieval (see the sketch after this list).
  5. Intelligent Reranking: Reassesses and reorders initially retrieved documents to ensure that the most pertinent information is prioritized for subsequent processing.
  6. Query Transformation: Modifies or expands the original query with query rewriting, step-back prompting, and sub-query decomposition.
  7. Hierarchical Indices: First identifies relevant document sections through summaries, then drills down to specific details within those sections.
  8. Hypothetical Questions (HyDE): HyDE transforms queries into hypothetical documents that contain answers, bridging the gap between query and document distributions in vector space.
  9. Choose Chunk Size: Selects an appropriate fixed size for text chunks to balance context preservation and retrieval efficiency.
  10. Semantic Chunking: Unlike traditional methods that split text by fixed character/word counts, semantic chunking creates more meaningful, context-aware segments.
  11. Context Compression: Compresses and extracts the most pertinent parts of documents in the context of a given query.
  12. Explainable Retrieval: Not only retrieves relevant documents based on a query but also provides explanations for why each retrieved document is relevant.
  13. Retrieval w/Feedback: Utilizes user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
  14. Adaptive Retrieval: Classifies queries into different categories and uses tailored retrieval strategies (factual, analytical, contextual, etc.) for each, considering user context and preferences.
  15. Iterative Retrieval: Analyzes initial results and generates follow-up queries to fill in gaps or clarify information.
  16. Ensemble Retrieval: Applies different embedding models or retrieval algorithms and uses voting or weighting mechanisms to determine the final set of retrieved documents.
  17. Graph RAG: Retrieves entities and their relationships from a knowledge graph relevant to the query, combining with unstructured text for more informative responses.
  18. Multi-Modal Retrieval: Integrates models that can retrieve and understand documents spanning multiple modalities, combining insights from text, images, and videos.
  19. RAPTOR: Uses abstractive summarization to recursively process and summarize retrieved documents, then distills the information in a tree structure for hierarchical reasoning.
  20. Self RAG: Multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, and more to improve model responses.
  21. Corrective RAG: Dynamically evaluates and corrects the retrieval process, combining document retrieval, web search, and models to improve responses.
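
As one worked example from the list, here is a toy sketch of Fusion Retrieval (technique 4) using Reciprocal Rank Fusion (RRF), a standard way to merge a keyword ranking with a vector ranking. The two input rankings are hard-coded for illustration; in practice they would come from something like a BM25 index and a vector store.

```python
# Reciprocal Rank Fusion: each document earns 1/(k + rank) from every
# ranking it appears in, and the summed scores decide the final order.
def rrf(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc3", "doc1", "doc2"]   # e.g. from BM25
vector_ranking  = ["doc1", "doc2", "doc3"]   # e.g. from a vector store
print(rrf([keyword_ranking, vector_ranking]))  # doc1 first: strong in both
```

RRF rewards documents that rank well in both lists, which is why doc1 wins here despite topping only one ranking; the constant k=60 is the value commonly used in the RRF literature.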

RAG Video overview

An overview from IBM expert Marina Danilevsky. Read more about RAG on IBM's website.

RAG vs. Fine tuning: Which AI strategy should you choose?

Incorporating artificial intelligence (AI) into an organization isn't a matter of flipping a switch; it requires careful customization to suit specific business needs. When adapting large language models (LLMs) for the enterprise, there are typically two primary strategies to choose from: fine tuning and retrieval augmented generation (RAG). While fine tuning focuses on shaping the model's responses and behavior, RAG relies on integrating external data into the model's workflow. Both approaches customize LLM behavior and output, but each is uniquely suited to different use cases and types of data. So, let’s explore each method to help you determine the best fit for your needs.

In the following video, I present the strengths, weaknesses, and common applications of both techniques:

The role of fine tuning

Fine tuning can adapt a general-purpose LLM into a domain-specific expert by training it on curated datasets. This process adjusts the model’s internal parameters, embedding static, foundational knowledge and aligning its behavior with specific tasks or industries. For example, a fine-tuned model in healthcare can generate accurate, context-aware responses while understanding compliance standards.


This approach is best suited for scenarios involving stable, unchanging data or tasks requiring consistent outputs, such as regulatory documentation or specialized terminology. While resource-intensive, fine-tuning ensures precise alignment with enterprise goals, making it ideal for long-term, static use cases.
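
For a feel of what fine tuning involves mechanically, here is a hedged sketch using the Hugging Face transformers Trainer (assumes pip install transformers datasets). The base model "distilgpt2" and the two-sentence corpus are stand-ins for a real foundation model and a curated, domain-specific dataset.

```python
# Fine-tuning sketch: continue training a causal LM on domain text,
# which adjusts the model's internal parameters.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 models have no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Stand-in for a curated, domain-specific corpus.
corpus = Dataset.from_dict({"text": [
    "Claims under policy class B must be filed within 30 days.",
    "Form 7 accompanies every reimbursement request.",
]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-demo", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the model's weights on the corpus
```

Real fine-tuning runs differ mainly in scale: far more data, more epochs, and careful evaluation, which is where the resource intensity comes from.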

Integrating with your data: The case for RAG

Because training and fine-tuning a model have historically been complicated processes, developers and enterprises alike have adopted RAG to complement general-purpose LLMs with up-to-date, external data. This means you can take a model off the shelf, whether proprietary or open source, and give it access to data repositories and databases without re-training. You’ll typically encounter these steps in the approach, sketched in code after the list:

  • Data transformation: Converting enterprise data into a format accessible by the AI, such as embedding knowledge into a searchable format.
  • Storage in a knowledge base: Organizing data within a knowledge library, which the model can access in real time.
  • Response generation: Using the knowledge base, the AI system retrieves relevant information to generate accurate, evidence-backed answers.
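
Here is a hedged sketch of those three steps using the open source chromadb library as the knowledge base (assumes pip install chromadb; chromadb applies its default embedding function when documents are added). The answer_with_llm() helper is a placeholder for a real LLM call.

```python
# The three steps: transform data into embeddings, store them in a
# knowledge base, then retrieve evidence at query time for generation.
import chromadb

client = chromadb.Client()
kb = client.create_collection("enterprise-kb")

# Steps 1 and 2: data transformation + storage in a knowledge base.
kb.add(
    ids=["faq-1", "faq-2"],
    documents=["Refunds are processed within five business days.",
               "Support is available 24/7 via the customer portal."],
)

# Step 3: response generation grounded in retrieved evidence.
query = "How long do refunds take?"
evidence = kb.query(query_texts=[query], n_results=1)["documents"][0]

def answer_with_llm(prompt):
    """Placeholder: a real system would call an LLM API here."""
    return "[grounded answer based on]\n" + prompt

print(answer_with_llm(f"Evidence: {evidence}\nQuestion: {query}"))
```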


For customer support applications, RAG allows the LLM to draw from source data, delivering accurate responses that foster trust and transparency. This evidence-backed grounding matters because overconfident, fabricated answers (hallucinations) are a real obstacle when adopting AI for business use cases. However, it’s important to note that tuning and maintaining a RAG system is complex, requiring robust data pipelines to pull and feed timely information to the model during usage.

The best of both worlds: Combining fine tuning and RAG

Much like how businesses can benefit from a hybrid cloud and on-premises approach for their workloads, an AI strategy can combine fine tuning and RAG to best meet their needs. The result is an LLM that acts as a subject matter expert in a specific field, deeply understanding its content and terminology while staying current. For example, you could fine-tune a model on your domain-specific data so it understands your industry’s context, while leveraging RAG to pull up-to-date information from databases and content stores. Scenarios such as financial analysis or regulatory compliance are just a few situations where this combined strategy is immensely helpful.

While we’ve covered how these approaches help business use cases, it’s important to understand that these techniques, particularly fine tuning, let us set core behavioral parameters in the AI models themselves. This matters because we want AI to reflect our values and goals, not just our expectations. Enterprises should begin by figuring out what sticks and what doesn’t, and use those lessons to keep building great things.

If you'd like to learn how to start fine tuning models for experimentation and development, try out InstructLab, an open source project supported by IBM and Red Hat, by completing this tutorial, "Contributing knowledge to the open source Granite model using InstructLab."