#321: Working With Metadata in Chroma
When we use the search function of Chroma, we can ask for cats and find documents related to felines. While this is great for getting all cat-related content, it may match too many documents. This "fuzzy" search is a feature made possible by embedding our search term and by how vectors relate to each other. But that also means we cannot simply switch to a stricter mode when we need one.
Chroma offers us an addition to the query method that allows us to filter based on the metadata of a document. That way we can combine the "fuzzy" vector search with a strict search based on metadata. Let us see how we can use it with our data.
Preparation
See the post from last week on how we can use ChromaDB in Python. We again need a client and a collection that we can initialise this way:
Add metadata
Last week we used documents and identifiers to store data in Chroma. If we want to filter with metadata, we need a dictionary of metadata for each document. In this dictionary we can add everything we want to describe the document – as long as we keep the data flat, with plain values such as strings or numbers instead of nested dictionaries:
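What such flat dictionaries could look like is sketched here. The keys form and year and their values are assumptions for illustration, not the fields of the original data set, and is_flat() is a hypothetical helper, not part of Chroma. Chroma versions may differ in exactly which value types they accept, so treat the ALLOWED tuple as an assumption too.

```python
# Hypothetical metadata, one dictionary per document.
metadatas = [
    {"form": "novel", "year": 1902},
    {"form": "short story", "year": 1891},
    {"form": "short story", "year": 1893},
]

# For this sketch, "flat" means every value is a plain scalar.
ALLOWED = (str, int, float, bool)


def is_flat(metadata: dict) -> bool:
    """Return True if no value is a list, dict or other container."""
    return all(isinstance(value, ALLOWED) for value in metadata.values())


# A nested dictionary like this one is not flat.
nested = {"form": "novel", "published": {"year": 1902}}

print([is_flat(m) for m in metadatas])  # [True, True, True]
print(is_flat(nested))                  # False
```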
We then can put all this together and use the upsert() method to add it to Chroma:
Filter with metadata
With all this extra data in place, we can use the where parameter when we query our data:
This filters the data before it goes through the search for vector similarities, which can drastically change the results we get:
criminal mastermind behind a web of crimes
#holmes-canon-4: >Holmes confronts a criminal mastermind who orchestrates crimes
from the shadows, and the conflict escalates into a decisive showdown away from
London.< with a distance of 1.0794227123260498
Next
With the additional metadata, we can filter our embeddings during queries, enabling us to concentrate on the most relevant documents and avoid spending time searching through an overly large dataset. Next week, we will apply this new understanding when we extract content from Markdown files for a Python Friday RAG.