RAG with Hypothetical Document Embeddings (HyDE)

3 min read · Mar 3, 2024


Hypothetical Document Embeddings: a two-step process for RAG. An implementation of HyDE retrieval with Llama-Index, using hybrid, local, or remote LLMs.

Hypothetical Document Embeddings (HyDE) is a method that transforms queries into document-like vector representations, or embeddings, using two main components. The first is a generative task: a language model follows an instruction to write a hypothetical document, aiming to capture relevance patterns even though the generated document is not real and may contain factual errors. The second is a document-document similarity task handled by a contrastive encoder, which encodes the generated document into an embedding vector, acting as a lossy compressor that filters the extra (hallucinated) details out of the embedding. HyDE has been shown to outperform the state-of-the-art unsupervised dense retriever Contriever and to perform comparably to fine-tuned retrievers across various tasks and languages. In effect, the method factors dense retrieval into these two tasks.
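The two components above can be sketched in a few lines. This is a toy illustration under loud assumptions: `stub_llm` stands in for a real instruction-following LLM, and the bag-of-words `encode` stands in for a dense contrastive encoder such as Contriever.

```python
# Minimal sketch of HyDE's two steps (toy stand-ins, not a real implementation).
import math
from collections import Counter

def stub_llm(query: str) -> str:
    # Step 1: instruction-following generation. A real LLM would write a
    # hypothetical passage that "answers" the query, possibly with factual
    # errors; only its relevance pattern matters.
    return f"A passage answering the question: {query}"

def encode(text: str) -> Counter:
    # Step 2: encode into a vector. This toy bag-of-words encoder stands in
    # for the contrastive document encoder, which acts as a lossy compressor
    # that drops hallucinated specifics.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query: str, corpus: list[str]) -> str:
    hypothetical = stub_llm(query)  # generate; the raw query is never embedded
    q_vec = encode(hypothetical)    # embed the hypothetical document instead
    return max(corpus, key=lambda d: cosine(q_vec, encode(d)))
```

The point of the sketch is that retrieval similarity is computed document-to-document, between the generated passage and the corpus entries.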

If you're interested, read the paper. I've got two pieces of Python code here for applied-level reference:

The main difference between regular (vanilla) RAG and HyDE is that an LLM is inserted between the query and the retriever to generate a hypothetical document in the domain of the user's question. Beyond that, the two pipelines are identical.
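That "one inserted step" difference can be shown side by side. A self-contained toy sketch, where `llm`, `embed`, and `search` are all assumed stand-ins for a real LLM, embedding model, and vector index:

```python
# Vanilla RAG vs. HyDE: the pipelines differ by a single LLM call.
def embed(text: str) -> set:
    return set(text.lower().split())  # toy set-of-words "embedding"

def search(vec: set, docs: list[str]) -> str:
    # Toy similarity: pick the document with the largest token overlap.
    return max(docs, key=lambda d: len(vec & embed(d)))

def llm(prompt: str) -> str:
    return f"hypothetical answer passage for: {prompt}"  # canned stub

def vanilla_rag_retrieve(query: str, docs: list[str]) -> str:
    return search(embed(query), docs)        # embed the query directly

def hyde_retrieve(query: str, docs: list[str]) -> str:
    return search(embed(llm(query)), docs)   # one extra LLM call; rest identical
```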

I used Llama-Index for this. If you're interested in using LangChain instead, you can refer to this notebook.

Here I'm adding an extension that is not from the original paper.

My intention is to use the data source as context to enhance the user query into an updated query, and then generate the hypothetical document from that updated query: the LLM rewrites the query using the data-source context, and the rewritten query drives the hypothetical-document generation.
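The extended flow can be sketched as two chained LLM calls. This is my assumed shape of the flow, not code from the paper; `stub_llm` stands in for one shared LLM, and `DATA_SOURCE_CONTEXT` is a hypothetical summary of the corpus.

```python
# Extension sketch: update the query with data-source context, then run HyDE
# generation on the updated query (two LLM calls, same model for both).
DATA_SOURCE_CONTEXT = "a corpus of european geography articles"  # assumed summary

def stub_llm(prompt: str) -> str:
    return f"[llm output for: {prompt}]"  # stand-in for the shared LLM

def updated_query(query: str) -> str:
    # Call 1: rewrite the query grounded in the data-source context.
    return stub_llm(f"Rewrite '{query}' for {DATA_SOURCE_CONTEXT}")

def hypothetical_document(query: str) -> str:
    # Call 2: HyDE generation, but from the updated query.
    return stub_llm(f"Write a passage answering: {updated_query(query)}")
```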

This approach has one advantage: using the same LLM for both the query update and the hypothetical-document generation keeps the retrieved results stable. Put another way, when multiple queries express the same intent in different wording, the updated queries converge, so the retriever should return almost the same indices for each of them.


Since the whole flow involves at least two LLM calls (three with the extension), we must carefully decide which LLMs run locally and which run remotely; Ollama is the obvious tool for the local ones.
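One way to make that local/remote split explicit is a small role-to-backend map. The role names and the particular assignment below are my assumptions, and the backend functions are stubs; in practice `local_llm` would call an Ollama-served model and `remote_llm` a hosted API.

```python
# Sketch: route each of the pipeline's LLM calls to a local or remote backend.
def local_llm(prompt: str) -> str:
    return f"[local:{prompt}]"   # stand-in for e.g. an Ollama-served model

def remote_llm(prompt: str) -> str:
    return f"[remote:{prompt}]"  # stand-in for a hosted-API model

# One possible split: cheap rewrite/generation steps local, final answer remote.
BACKENDS = {
    "update_query": local_llm,      # extension call
    "hypothetical_doc": local_llm,  # HyDE generation call
    "final_answer": remote_llm,     # synthesis over retrieved context
}

def call(role: str, prompt: str) -> str:
    return BACKENDS[role](prompt)
```

Keeping the routing in one table makes it easy to flip a stage between backends when comparing cost and quality.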

Additional Reading (optional)