#321: Working With Metadata in Chroma

When we use the search function of Chroma, we can ask for cats and find documents related to felines. While this is great for getting all cat-related content, it may match too many documents. This "fuzzy" search is a feature made possible by the embedding of our search term and how vectors relate to each other. But that also means we cannot simply switch to a stricter mode if we need one.

Chroma offers us an addition to the query method that allows us to filter based on the metadata of a document. That way we can combine the "fuzzy" vector search with a strict search based on metadata. Let us see how we can use it with our data.

Preparation

See the post from last week on how we can use ChromaDB in Python. We again need a client and a collection that we can initialise this way:

import chromadb
from chromadb.config import Settings

# create client - opt out of telemetry
telemetry_off = Settings(anonymized_telemetry=False)
client = chromadb.PersistentClient(path="demo.chroma", settings=telemetry_off)

# create a collection
collection = client.get_or_create_collection(name="my_collection")

Add metadata

Last week we used documents and identifiers to store data in Chroma. If we want to filter with metadata, we need a dictionary of metadata for each document. In this dictionary we can add everything we want to describe the document, as long as we keep the data flat, with plain values such as strings, numbers and booleans instead of nested dictionaries:

# add a few documents with metadata
docs = [
    # A Study in Scarlet
    "Holmes and Watson first meet, begin sharing lodgings, and Holmes applies observation and deduction to a puzzling case involving a mysterious death and a trail leading beyond London.",
    # The Sign of the Four
    "A client’s inheritance mystery leads Holmes and Watson into a hunt tied to a hidden treasure, shifting alliances, and a river pursuit that turns the investigation into an action-heavy chase.",
    # The Hound of the Baskervilles
    "A legendary curse and a threatened heir draw Holmes and Watson to the moors, where superstition clashes with rational inquiry as they track the source of seemingly supernatural danger.",
    # The Adventure of the Speckled Band
    "A terrified woman seeks Holmes’s help after her sister’s strange death; clues in a secluded estate point to a dangerous method of murder and a sinister household secret.",
    # The Final Problem
    "Holmes confronts a criminal mastermind who orchestrates crimes from the shadows, and the conflict escalates into a decisive showdown away from London.",
    # A Scandal in Bohemia
    "A royal client fears blackmail over a compromising photograph; Holmes faces an exceptionally clever opponent and learns that not every contest ends with a tidy victory.",
]

metadatas = [
    {
        "work_type": "novel",
        "title": "A Study in Scarlet",
        "collection": None,
        "year": 1887,
        "setting": "London",
        "themes": "origin_story|deduction|mystery",
        "characters": "Sherlock Holmes|Dr. John Watson",
        "has_holmes": True,
        "has_watson": True,
        "has_moriarty": False,
        "has_irene_adler": False,
    },
    {
        "work_type": "novel",
        "title": "The Sign of the Four",
        "collection": None,
        "year": 1890,
        "setting": "London|River Thames",
        "themes": "treasure|investigation|pursuit",
        "characters": "Sherlock Holmes|Dr. John Watson",
        "has_holmes": True,
        "has_watson": True,
        "has_moriarty": False,
        "has_irene_adler": False,
    },
    {
        "work_type": "novel",
        "title": "The Hound of the Baskervilles",
        "collection": None,
        "year": 1902,
        "setting": "Devonshire|moor",
        "themes": "superstition_vs_reason|inheritance|threat",
        "characters": "Sherlock Holmes|Dr. John Watson",
        "has_holmes": True,
        "has_watson": True,
        "has_moriarty": False,
        "has_irene_adler": False,
    },
    {
        "work_type": "short_story",
        "title": "The Adventure of the Speckled Band",
        "collection": "The Adventures of Sherlock Holmes",
        "year": 1892,
        "setting": "English countryside|estate",
        "themes": "family_secret|murder_method|investigation",
        "characters": "Sherlock Holmes|Dr. John Watson",
        "has_holmes": True,
        "has_watson": True,
        "has_moriarty": False,
        "has_irene_adler": False,
    },
    {
        "work_type": "short_story",
        "title": "The Final Problem",
        "collection": "The Memoirs of Sherlock Holmes",
        "year": 1893,
        "setting": "London|Switzerland",
        "themes": "mastermind|pursuit|showdown",
        "characters": "Sherlock Holmes|Dr. John Watson|Professor Moriarty",
        "has_holmes": True,
        "has_watson": True,
        "has_moriarty": True,
        "has_irene_adler": False,
    },
    {
        "work_type": "short_story",
        "title": "A Scandal in Bohemia",
        "collection": "The Adventures of Sherlock Holmes",
        "year": 1891,
        "setting": "London",
        "themes": "blackmail|clever_adversary|social_status",
        "characters": "Sherlock Holmes|Dr. John Watson|Irene Adler",
        "has_holmes": True,
        "has_watson": True,
        "has_moriarty": False,
        "has_irene_adler": True,
    },
]

ids = [f"holmes-canon-{i}" for i in range(len(docs))]
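The pipe-separated strings in themes and characters are one way to keep list-like values flat. A minimal sketch of how we could build and read such strings follows; the helper names join_values and split_values are hypothetical and not part of Chroma:

```python
def join_values(values, sep="|"):
    """Flatten a list of strings into one delimited string for a metadata field."""
    for value in values:
        # a separator inside a value would make the string ambiguous to split
        if sep in value:
            raise ValueError(f"{value!r} contains the separator {sep!r}")
    return sep.join(values)


def split_values(text, sep="|"):
    """Turn a delimited metadata string back into a list (empty string gives an empty list)."""
    return text.split(sep) if text else []


characters = join_values(["Sherlock Holmes", "Dr. John Watson", "Irene Adler"])
print(characters)
print("Irene Adler" in split_values(characters))
```

Keep in mind that a field stored this way is a single string to Chroma, so an exact-match filter on it compares the whole string and not the individual entries. That is likely why the sample data also carries boolean flags such as has_moriarty.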

We can then put all this together and use the upsert() method to add it to Chroma:

collection.upsert(
    ids=ids,
    documents=docs,
    metadatas=metadatas,
)
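If we want to check that the metadata arrived, we can read it back with the get() method, which accepts a metadata filter but does not run a vector search. A small sketch, assuming the collection object from above; the helper name titles_matching is hypothetical:

```python
def titles_matching(collection, where):
    """Return the titles of all stored documents whose metadata matches `where`."""
    # get() applies only the metadata filter; no query text or embedding is involved
    result = collection.get(where=where, include=["metadatas"])
    return [meta["title"] for meta in result["metadatas"]]
```

Called as titles_matching(collection, {"work_type": "novel"}), this should include the three novels we just stored.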

Filter with metadata

With all this extra data in place, we can use the where parameter when we query our data:

question = "criminal mastermind behind a web of crimes"
results = collection.query(
    query_texts=[question],
    n_results=5,
    where={"has_moriarty": True}
)

print(question)
# doc_id instead of id, so we do not shadow the built-in id() function
for doc_id, document, distance in zip(results["ids"][0], results["documents"][0], results["distances"][0]):
    print(f"#{doc_id}: >{document}< with a distance of {distance}")

This filters the data before it goes through the search for vector similarities, which can drastically change the results we get:

criminal mastermind behind a web of crimes
#holmes-canon-4: >Holmes confronts a criminal mastermind who orchestrates crimes 
from the shadows, and the conflict escalates into a decisive showdown away from 
London.< with a distance of 1.0794227123260498
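The where parameter is not limited to a single equality check. Chroma's filter syntax also has comparison operators such as $gte and the logical operators $and and $or. As far as I can tell, a filter with more than one field must be wrapped in $and and cannot be passed as one dictionary with several keys. The following sketch builds such a filter; build_where and query_with_filter are hypothetical helper names, not part of Chroma:

```python
def build_where(conditions):
    """Combine per-field conditions into a single Chroma `where` filter.

    Several conditions are wrapped in "$and"; a single condition is
    passed through unchanged because "$and" needs at least two entries.
    """
    clauses = [{field: condition} for field, condition in conditions.items()]
    if not clauses:
        return None  # no filter at all
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


# short stories published in 1892 or later
where = build_where({"work_type": "short_story", "year": {"$gte": 1892}})
print(where)


def query_with_filter(collection, question, where, n_results=5):
    """Run the vector search only on documents that pass the metadata filter."""
    return collection.query(query_texts=[question], n_results=n_results, where=where)
```

With our sample data, this filter should leave only The Adventure of the Speckled Band and The Final Problem as candidates for the vector search.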

With the additional metadata, we can filter our embeddings during queries, enabling us to concentrate on the most relevant documents and avoid spending time searching through an overly large dataset. Next week, we will apply this new understanding when we extract content from Markdown files for a Python Friday RAG.