Hypothetical Document Embedding (HYDE)


When using Retrieval Augmented Generation (RAG) with LLMs, we inject relevant documents' text into the prompt so that the LLM's responses are coherent and contextually appropriate. The source document retrieval uses semantic or vector search to find the most relevant documents, for example a cosine similarity search.

Here the cosine similarity search compares the embedding vector of the question against the embedding vectors of the documents. The documents are long and information dense, while the question is short. Because of this, the question's embedding often fails to capture enough relevant patterns for a good similarity search.

Hypothetical Document Embedding (HYDE) addresses this problem. With HYDE, we first ask the LLM to answer the question from its training knowledge. The LLM's response is a hypothetical document. Even when this answer is hallucinated, the hypothetical document is likely to be closer to the cluster of documents where the real answer lies. We then perform a cosine similarity search between the embedding vector of the hypothetical document and our documents' embeddings to find the most relevant docs for zero-shot prompting. Let's demo this using the Wikipedia Simple English dataset.
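
Conceptually, the whole flow fits in a few lines. Below is a minimal sketch, assuming the OpenAI Python SDK and the similarity_search helper we define later in this post; the model name is an illustrative choice:

from openai import OpenAI

def hyde_search(tbl, question):
  # Step 1: ask the LLM for a hypothetical answer from its training knowledge.
  client = OpenAI()
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": question}],
  )
  hypothetical_doc = response.choices[0].message.content
  # Step 2: search with the question plus the hypothetical document,
  # instead of the bare question (similarity_search is defined below).
  return similarity_search(tbl, text=f"{question}:{hypothetical_doc}")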

Install dependencies

# use python 3.10
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install lancedb==0.3.0 pyarrow openai

Download Wikipedia Simple English dataset with precomputed OpenAI embedding vectors

wget 'https://huggingface.co/datasets/Supabase/wikipedia-en-embeddings/resolve/main/wiki_openai.ndjson.gz?download=true'
mv 'wiki_openai.ndjson.gz?download=true' 'wiki_openai.ndjson.gz'
gunzip 'wiki_openai.ndjson.gz'

Setup lancedb

# Why lancedb?
# Easy to set up: a serverless, embedded DB built on the space-saving columnar Lance file format.
# Vector search doesn't load the entire dataset into memory. Such streaming operation lets us query a large dataset on a small machine.

import lancedb
import pyarrow as pa

def setup_lancedb(db):
  schema = pa.schema(
    [
      pa.field("vector", pa.list_(pa.float32(), list_size=1536)),
      pa.field("id", pa.string()),
      pa.field("body", pa.string()),
    ]
  )
  db.create_table("wikipedia", schema=schema)

# Create lancedb table
db = lancedb.connect('lance.db')
setup_lancedb(db)
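
To confirm the table was created, we can list the tables in the database (table_names is part of the lancedb API):

>>> db.table_names()
['wikipedia']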

Populate DB

import json
import logging

logger = logging.getLogger(__name__)

tbl = db.open_table("wikipedia")

data = []
with open("wiki_openai.ndjson", "rt") as f:
  for i, line in enumerate(f):
    record = json.loads(line)
    data.append({
      "vector": record['text-embedding-ada-002'],
      "id": record["id"],
      "body": record["body"]
    })
    try:
      # insert a batch every 20,000 lines
      if i % 20000 == 0:
        print(i)
        tbl.add(data)
        data = []
    except Exception:
      logger.exception(f"Error in tbl add, line => {i}")
  # flush the remaining records after the loop
  tbl.add(data)
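
As a quick sanity check, a vector search against the populated table should now return rows. This reuses only the search API that appears later in this post:

# sanity check: a dummy vector search should return one row
print(tbl.search([0.0] * 1536).limit(1).to_df().shape)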

Embedding function

from openai import OpenAI

def embedding_encode(text):
  # The client reads the OPENAI_API_KEY environment variable.
  client = OpenAI()
  response = client.embeddings.create(
    input=text,
    model="text-embedding-ada-002"
  )
  return response.data[0].embedding
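
As a quick check, text-embedding-ada-002 vectors are 1536-dimensional, matching the list_size of the vector column in our schema:

vec = embedding_encode("hello world")
print(len(vec))  # 1536, same dimension as the table's vector column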

Cosine similarity

def similarity_search(tbl, text):
  vector = embedding_encode(text)
  q = tbl.search(vector)
  df = q.metric("cosine").limit(10).to_df()
  df = df[df["_distance"] < 0.15]  # keep only close matches (distance threshold)
  return df.to_dict(orient="records")

# cosine similarity without HYDE
import os
os.environ['OPENAI_API_KEY'] = '<OPEN AI API KEY>'

r = similarity_search(tbl, text="Who wrote a children's book based on Greece that has a name meaning monster of ocean?")
>>> r
[]

Here the cosine similarity search doesn't find any matching documents. Now let's try with HYDE.

# I asked ChatGPT and got the hypothetical answer listed below

q = "Who wrote a children's book based on Greece that has a name meaning monster of ocean?"
a = """Based on the description of "a children's book based on Greece that has a name meaning Monster of Oceans," it's challenging to pinpoint a specific book or author. Greek mythology does feature sea monsters, like the Kraken or Scylla and Charybdis, but these specific creatures are not commonly the central figures in children's books.However, Rick Riordan, the author of the "Percy Jackson & the Olympians" series, has written extensively with themes from Greek mythology and features various mythical creatures, including those from the sea. His books, while not exactly children's books but rather more suited for a middle-grade and young adult audience, often involve mythological monsters and adventures that could metaphorically be described as monsters of oceans.
The first book in the series, "The Lightning Thief," introduces Percy Jackson, who discovers he is a demigod and the son of Poseidon, the Greek god of the sea. This connection to Poseidon brings a significant oceanic element to the series. However, it's important to note that the series itself doesn't have a title that directly translates to "Monster of Oceans."
If the book in question is less mainstream or from a specific cultural or regional background, additional details would be helpful for a more accurate identification."""

# try cosine similarity with the embedding of the question plus the hypothetical answer.
ha = f"{q}:{a}"
r2 = similarity_search(tbl, text=ha)
>>> r2
[{'vector': array([ 0.00493553, -0.02298702, -0.02723138, ..., -0.01739278,
         -0.00264786, -0.05295712], dtype=float32),
  'id': '164062',
  'body': 'Title: The Sea of Monsters Content: The Sea of Monsters is a fantasy-adventure novel based on Greek mythology. It was written by Rick Riordan and published in 2006. It is the second novel in the Percy Jackson & the Olympians series and is the sequel to The Lightning Thief. This book is about the adventures of thirteen-year-old demigod Percy Jackson. It describes how he and his friends rescue his satyr friend Grover from the Cyclops Polyphemus. They also save Camp Half-Blood, a summer camp for demigods, from a Titan attack. They bring the Golden Fleece from Polyphemus\' island which cures Thalia\'s poisoned pine tree, that protected the camp. The Sea of Monsters received good reviews and was nominated for many awards such as the 2006 Book Sense Top Ten Summer Pick and the 2009 Mark Twain Award. It sold over 100,000 copies in paperback. It was released as an audiobook on September 6, 2006. The Sea of Monsters is followed by The Titan\'s Curse, which is the third book of the five in the series. ==Plot== Like the first book, The Lightning Thief, this novel is of the fantasy genre. Reviewers have called it a "fast paced" book with humor and action, which discusses acceptance and family love. ===Summary=== Percy Jackson is a 13 year-old demigod. He is playing dodgeball against large seventh graders. He.....

With the help of the hypothetical answer, the cosine similarity search finds matches relevant to the question.
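
To close the RAG loop, the retrieved document bodies can be injected into the prompt. A minimal sketch, with an illustrative prompt template and model choice (not part of the original setup):

def rag_answer(question, matches):
  client = OpenAI()
  # concatenate the retrieved document bodies as context
  context = "\n\n".join(m["body"] for m in matches)
  prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
  )
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
  )
  return response.choices[0].message.content

print(rag_answer(q, r2))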
