RAG with Multi-Query pattern

TeeTracker
3 min read · Feb 19, 2024


Additional discussion of Multi-Query pattern in RAG

Multi-Query usually means passing the original query to an LLM and asking it to generate several “similar” queries, called sub-queries. An earlier article introduced the know-how behind this pattern; LangChain has “similar” direct support, and the Llama-Index implementation is also very simple. You can check it out here:

This article skips the details of retrieval and instead introduces two approaches that directly use the question-and-answer pairs of the sub-queries as “context”, which is commonly referred to as “decomposition”.

Overview

The multi-query pattern incurs additional overhead: besides answering the original query, the LLM also has to generate the sub-queries and answer each of them. If you use OpenAI or other models that are not hosted “locally”, be sure to optimize prompts and other aspects.

Sub-queries in Bundle

We first use the LLM to generate “similar” sub-queries from the original query, then run each sub-query to get its answer. After that, the sub-queries and their answers, together with the retrieval results for the original query itself, are used as the “context” for the original query. Essentially, this is similar to Multi-Query Retrieval; the difference is that we skip the retrieval details and work directly at a higher level, the query level. That’s why this article mainly focuses on the Llama-Index code.
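The bundle flow described above can be sketched in plain Python. This is only a minimal illustration, not the Llama-Index implementation: `bundle_query` and the three callables `generate_subqueries`, `retrieve`, and `answer` are hypothetical names introduced here, standing in for whatever LLM and retriever you actually use.

```python
from typing import Callable, List, Tuple


def bundle_query(
    query: str,
    generate_subqueries: Callable[[str], List[str]],  # hypothetical: LLM call returning sub-queries
    retrieve: Callable[[str], List[str]],             # hypothetical: retriever returning text chunks
    answer: Callable[[str, str], str],                # hypothetical: LLM call taking (question, context)
) -> str:
    """Answer `query` using sub-query Q&A pairs plus its own retrieval as context."""
    # 1. Ask the LLM for "similar" sub-queries.
    sub_queries = generate_subqueries(query)

    # 2. Answer every sub-query independently, each with its own retrieval.
    qa_pairs: List[Tuple[str, str]] = []
    for sub_query in sub_queries:
        sub_context = "\n".join(retrieve(sub_query))
        qa_pairs.append((sub_query, answer(sub_query, sub_context)))

    # 3. Bundle all Q&A pairs with the retrieval of the original query.
    parts = [f"Q: {q}\nA: {a}" for q, a in qa_pairs]
    parts.extend(retrieve(query))
    context = "\n\n".join(parts)

    # 4. Answer the original query against the bundled context.
    return answer(query, context)
```

Because the sub-queries do not depend on each other in this sketch, step 2 could also be run concurrently to reduce latency.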

Sub-queries in step down

This pattern is very similar to Prio-reasoning. We run the sub-queries one by one, and the result of each sub-query (its question-answer pair) is stored in a “memory”. Each sub-query is answered using not only its own retrieval as context but also this “memory”. In other words, the answer to each sub-query is influenced by the sub-queries that came before it. When the original query is finally run, its context consists of all the sub-query results plus the original query’s own retrieval.
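The step-down flow can be sketched the same way. Again, this is a minimal illustration under stated assumptions rather than the Llama-Index implementation: `step_down_query` and the callables `generate_subqueries`, `retrieve`, and `answer` are hypothetical names, and the “memory” is modeled as a simple list of question-answer pairs.

```python
from typing import Callable, List, Tuple


def step_down_query(
    query: str,
    generate_subqueries: Callable[[str], List[str]],  # hypothetical: LLM call returning sub-queries
    retrieve: Callable[[str], List[str]],             # hypothetical: retriever returning text chunks
    answer: Callable[[str, str], str],                # hypothetical: LLM call taking (question, context)
) -> str:
    """Answer sub-queries sequentially, feeding earlier Q&A pairs into later steps."""
    memory: List[Tuple[str, str]] = []  # Q&A pairs of the sub-queries answered so far

    def build_context(question: str) -> str:
        # Context = everything remembered so far + this question's own retrieval.
        remembered = [f"Q: {q}\nA: {a}" for q, a in memory]
        return "\n\n".join(remembered + retrieve(question))

    # Each sub-query sees the answers of all sub-queries before it.
    for sub_query in generate_subqueries(query):
        sub_answer = answer(sub_query, build_context(sub_query))
        memory.append((sub_query, sub_answer))

    # The original query sees the full memory plus its own retrieval.
    return answer(query, build_context(query))
```

Unlike the bundle variant, the steps here cannot be parallelized, because each sub-query depends on the memory built by the previous ones.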

Furthermore

The MultiStepQueryEngine of Llama-Index is also a kind of “decomposition” implementation; this article gives a know-how introduction to it:

Code

Streamlit app

More code can be found in
