#320: Store Embeddings in ChromaDB

We paused our AI journey after we figured out a way to work with large PDF files. It is now time to continue and find a way to create and store vectors that lets us add new documents incrementally. The vectors, or embeddings, represent the semantic meaning of our texts and are a key part of the "chat with our docs" feature.

There are a handful of solutions we can use. I start with Chroma (or ChromaDB) because it is an open-source vector database designed specifically to make building AI applications easy. It is built on top of SQLite, which lets us inspect the persisted data should we want to know more about how the magic works behind the scenes. Let us explore how we can use Chroma to calculate and store embeddings.

Installation

We can install the ChromaDB package with this command:

uv pip install chromadb

Create the database client

For our first steps we go with the in-memory client. That way we get a new database whenever we run our script, and we do not need to handle repeated inserts of the same documents.

We can get a database client that handles all the work with these lines:

import chromadb
from chromadb.config import Settings

# create client - opt out of telemetry
telemetry_off = Settings(anonymized_telemetry=False)
client = chromadb.Client(settings=telemetry_off)

The telemetry deactivation is optional, but I prefer not to share data wherever I can.

Insert data

To insert data, we need to create a collection in Chroma and then add the documents and their identifiers as separate lists. The term document is generous; it can vary from a small string to a large text:

# create a collection
collection = client.create_collection(name="my_collection")

# add a few documents
collection.add(
    documents=[
        "The cat sat on the mat.",
        "A fluffy feline lounged gracefully.",
        "The kitten played with a ball of yarn.",
        "The dog eats the bone.",
        "The car is green.",
        "The engine in the car runs on oil."
    ],
    ids=["id1", "id2", "id3", "id4", "id5", "id6"]
)

Search for similarity

We can search in Chroma with the query() method. This method accepts a string, and Chroma will create an embedding for it automatically. That reduces work for us and ensures that Chroma embeds the query with the same embedding function it uses for the collection. If we do not specify the number of results we want with the n_results parameter, we get up to 10 results back.

# Query for semantically similar documents
question = "What do cats do?"
results = collection.query(
    query_texts=[question], # Chroma will embed this for you
    n_results=3 # how many results to return
)

print(question)
# query() returns one list per query text; [0] selects the hits for our only question
for doc_id, document, distance in zip(
    results["ids"][0], results["documents"][0], results["distances"][0]
):
    print(f"#{doc_id}: >{document}< with a distance of {distance}")

If we run this search for cats, we also find the documents that describe cats with other words:

What do cats do?
#id2: >A fluffy feline lounged gracefully.< with a distance of 1.0599215030670166
#id3: >The kitten played with a ball of yarn.< with a distance of 1.243445634841919
#id1: >The cat sat on the mat.< with a distance of 1.292862057685852
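
The distance tells us how far apart two embeddings are: the smaller the number, the more similar the texts. As far as I know, Chroma uses the squared L2 distance unless we configure another metric for the collection. The following is a minimal sketch of that calculation with made-up three-dimensional vectors. Real embeddings have hundreds of dimensions, and the function name squared_l2 is only an illustration, not part of Chroma:

```python
# Sketch of the squared L2 distance; all vectors are invented for illustration.
def squared_l2(a: list[float], b: list[float]) -> float:
    """Sum of the squared differences between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [1.0, 0.0, 0.0]
close = [0.5, 0.5, 0.0]  # hypothetical embedding of a related sentence
far = [0.0, 0.0, 1.0]    # hypothetical embedding of an unrelated sentence

print(squared_l2(query, query))  # 0.0 - identical vectors
print(squared_l2(query, close))  # 0.5
print(squared_l2(query, far))    # 2.0
```

A distance of 0 means the vectors are identical. The values above 1 in our output show that even the best match is not a close paraphrase of our question.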

Everything matches to a certain extent

Be aware that when we compare vectors, every document in our database produces a similarity result – even if it has nothing to do with what we search for. This happens because any vector we store has a distance to the vector we query for. That is why we need to limit the results we ask for and should check the distance of the results we get.
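
One way to check the distance is a simple cut-off. The sketch below runs without Chroma: it uses a dictionary shaped like the result of query() and filled with the values from our cat search. The limit of 1.25 and the helper filter_by_distance are assumptions for this example only; a useful limit depends on the embedding function and on our data:

```python
# Shaped like the return value of collection.query() for a single query text.
results = {
    "ids": [["id2", "id3", "id1"]],
    "documents": [[
        "A fluffy feline lounged gracefully.",
        "The kitten played with a ball of yarn.",
        "The cat sat on the mat.",
    ]],
    "distances": [[1.0599215030670166, 1.243445634841919, 1.292862057685852]],
}

MAX_DISTANCE = 1.25  # hypothetical cut-off, tune it for your own data


def filter_by_distance(results: dict, max_distance: float) -> list[tuple[str, str, float]]:
    """Return (id, document, distance) for the first query's hits within max_distance."""
    hits = zip(results["ids"][0], results["documents"][0], results["distances"][0])
    return [hit for hit in hits if hit[2] <= max_distance]


relevant = filter_by_distance(results, MAX_DISTANCE)
for doc_id, document, distance in relevant:
    print(f"#{doc_id}: >{document}< with a distance of {distance}")
```

With this limit only id2 and id3 remain. A stricter limit can return an empty list, which is a better answer than an unrelated document.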

Persist the database

If we want to persist our database, we need to switch a few parts in our script:

1. Instead of Client we can use the PersistentClient and specify the path we want to use to store our database.
2. Instead of create_collection() we should use the method get_or_create_collection() to reuse the existing collection.
3. Instead of collection.add() we can use collection.upsert() to replace existing documents if the identifiers match.

If we put it all together, our script to insert data now looks like this:

import chromadb
from chromadb.config import Settings

# create client - opt out of telemetry
telemetry_off = Settings(anonymized_telemetry=False)
client = chromadb.PersistentClient(path="demo.chroma", settings=telemetry_off)

# create a collection
collection = client.get_or_create_collection(name="my_collection")

# add a few documents
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "A fluffy feline lounged gracefully.",
        "The kitten played with a ball of yarn.",
        "The dog eats the bone.",
        "The car is green.",
        "The engine in the car runs on oil."
    ],
    ids=["id1", "id2", "id3", "id4", "id5", "id6"]
)

With Chroma we get an easy-to-use vector database. It comes with a pre-defined embedding function that is good enough to start with and that we can replace later should we be unhappy with the results we get. Next week we explore how we can use metadata to filter our data and get more relevant search results.