#322: Embed Markdown for a Python Friday RAG

Now that we have found a flexible vector store in Chroma, we have everything we need to build a RAG (Retrieval-Augmented Generation) system for the Python Friday blog that uses LangChain and a local LM Studio.

In this post we focus on extracting metadata from the Markdown files I use in this MkDocs Material-powered blog. We split the Markdown into useful chunks and turn the metadata of each blog post into a dictionary that we can store in Chroma. Let us explore how we can do this first part.

Installation

To build our Markdown extractor, we need a few packages:

uv pip install chromadb langchain-text-splitters

Create a Markdown splitter

The langchain_text_splitters module offers a MarkdownHeaderTextSplitter class that splits our Markdown files by (sub)headings. That way we get text parts that belong together, which should increase the quality of the answers we produce with the LLM.

We can configure the MarkdownHeaderTextSplitter class by telling it which headers we want to split on:

def get_markdown_splitter():
    """ Create a MarkdownHeaderTextSplitter with the header list to split on. """
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]

    return MarkdownHeaderTextSplitter(headers_to_split_on)
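
To get a feeling for what header-based splitting produces, here is a deliberately simplified stand-in in plain Python. It is not LangChain's implementation (it ignores code fences, for example), and the name split_by_headers is made up for this sketch. It only shows the idea: every chunk carries the headings it sits under as metadata.

```python
import re

HEADERS = {"#": "Header 1", "##": "Header 2", "###": "Header 3"}


def split_by_headers(markdown: str) -> list:
    """Simplified stand-in: one chunk per heading section, heading path as metadata."""
    chunks = []
    current_meta = {}
    current_lines = []

    def flush():
        # Store the text collected so far together with the active headings.
        text = "\n".join(current_lines).strip()
        if text:
            chunks.append({"metadata": dict(current_meta), "text": text})
        current_lines.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,3})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading replaces all headings of the same or a deeper level.
            for marker, name in HEADERS.items():
                if len(marker) >= level:
                    current_meta.pop(name, None)
            current_meta[HEADERS[match.group(1)]] = match.group(2).strip()
        else:
            current_lines.append(line)

    flush()
    return chunks


sample = "# Title\n\nIntro text.\n\n## Section A\n\nText A.\n\n## Section B\n\nText B.\n"
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["text"])
```

The real splitter returns Document objects with page_content and metadata attributes instead of plain dictionaries, but the metadata keys follow the names we pass in headers_to_split_on.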

Split documents and prepare metadata

Since the blog posts come with a helpful set of metadata (like tags and categories), we can use it for our RAG. This requires some extra work that we could skip if the data we wanted to index were just plain Markdown files.

As a first step, we need to process the file or directory we want to index. If it is a directory, we need to find all Markdown files inside that folder and process each file. If it is a file, we can just process it:

def process_files(path, markdown_splitter):
    """ Find file(s) to extract data from """
    document_splits = []  

    if path.is_dir():
        md_files = sorted(path.rglob("*.md"))
        for md_file in md_files:
            print(md_file)
            document_splits.extend(split_file(md_file, markdown_splitter))

    elif path.is_file():
        print(path)
        document_splits.extend(split_file(path, markdown_splitter))

    else:
        raise FileNotFoundError(f"Path does not exist or is not a file/directory: {path}")


    return document_splits
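
The file discovery can be tried on its own, without a splitter. The following sketch uses a temporary directory and a hypothetical helper find_markdown_files that contains only the path logic from above. It shows that rglob also finds files in subfolders and that non-Markdown files are skipped.

```python
import tempfile
from pathlib import Path


def find_markdown_files(path: Path) -> list:
    """Return the Markdown files for a file or directory (hypothetical helper)."""
    if path.is_dir():
        # rglob searches the folder and all of its subfolders
        return sorted(path.rglob("*.md"))
    if path.is_file():
        return [path]
    raise FileNotFoundError(f"Path does not exist or is not a file/directory: {path}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2024").mkdir()
    (root / "2024" / "b.md").write_text("# B", encoding="utf-8")
    (root / "a.md").write_text("# A", encoding="utf-8")
    (root / "notes.txt").write_text("ignored", encoding="utf-8")

    found = [p.relative_to(root).as_posix() for p in find_markdown_files(root)]
    single = [p.name for p in find_markdown_files(root / "a.md")]

print(found)
print(single)
```

Sorting the paths keeps the order of the indexed files the same on every run, which makes the printed log easier to compare.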

Once we know the files to work with, we can do some preprocessing in which we extract the metadata and reformat the Markdown before we hand it to our MarkdownHeaderTextSplitter. We make sure that each entry has a unique Id, which we will later use as the Id in Chroma.

def split_file(path, splitter):
    """ Splits a file into parts and adds metadata. """
    with open(path, 'r', encoding="utf-8") as file_content:
        document = file_content.read()

    basic_metadata, document = process_markdown_metadata(document)
    md_splits = splitter.split_text(document)

    for part, entry in enumerate(md_splits, start=1):
        entry.metadata = add_reference(entry.metadata) | basic_metadata
        entry.metadata["Id"] = f'{entry.metadata["No"]}-{part}'

    return md_splits
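
How the Ids come together is easier to see with a small stand-in. The Chunk class and the tag_chunks function below are made up for this sketch and only mimic the metadata attribute of the objects the splitter returns. The dictionary merge with | needs Python 3.9 or newer.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Minimal stand-in for the objects returned by the splitter."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_chunks(chunks, basic_metadata):
    """Merge post-level metadata into each chunk and number the chunks."""
    for part, chunk in enumerate(chunks, start=1):
        # On duplicate keys the right-hand side of | wins,
        # so post-level values override chunk-level values.
        chunk.metadata = chunk.metadata | basic_metadata
        chunk.metadata["Id"] = f'{chunk.metadata["No"]}-{part}'
    return chunks


chunks = [
    Chunk("Intro", {"Header 1": "#17: What is PEP?"}),
    Chunk("Details", {"Header 1": "#17: What is PEP?", "Header 2": "PEP?"}),
]
tagged = tag_chunks(chunks, {"No": "17", "tags": "PEP"})
ids = [chunk.metadata["Id"] for chunk in tagged]
print(ids)
```

The part counter restarts for every file, so an Id is only unique as long as the post number in front of it is unique.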

The metadata extraction is not pretty, but it does three things: it extracts the values from the metadata header of the post, removes that header, and turns the title of the post (stored in the title: field) into a proper Markdown level one heading (#):

def process_markdown_metadata(md_text: str):
    """ Extract the metadata in the header and replace it with the title as level 1 header. """

    # Extract header
    header_match = re.search(r"^---\n(.*?)\n---", md_text, re.DOTALL)
    if not header_match:
        raise ValueError("No header section found")

    header = header_match.group(1)

    categories = []
    tags = []
    title = None
    published_on = None
    current_key = None

    for line in header.splitlines():
        line = line.strip()

        if line.startswith("title:"):
            title = line.split("title:", 1)[1].strip().strip('"')

        elif line.startswith("date:"):
            published_on = line.split("date:", 1)[1].strip()

        elif line.startswith("categories:"):
            current_key = "categories"

        elif line.startswith("tags:"):
            current_key = "tags"

        elif line.startswith("-") and current_key:
            value = line.lstrip("-").strip().strip('"')
            if current_key == "categories":
                categories.append(value)
            elif current_key == "tags":
                tags.append(value)

        else:
            current_key = None

    # Build dictionary with joined strings
    meta_dict = {
        "categories": " | ".join(categories),
        "tags": " | ".join(tags),
        "published_on": published_on,
    }

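    if title is None:
        raise ValueError("No title field found in header")

    # Replace header section with title header; the callable keeps
    # backslashes in the title from being read as regex escapes
    new_md = re.sub(
        r"^---\n.*?\n---",
        lambda _match: f"# {title}",
        md_text,
        count=1,
        flags=re.DOTALL
    )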

    match_id = re.match(r"#(\d+)", title)
    if match_id:
        meta_dict["No"] = match_id.group(1)
    else:
        meta_dict["No"] = str(uuid.uuid4())

    new_md = new_md.replace("<!-- more -->", "", 1)

    return meta_dict, new_md
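
The two regular expressions carry most of the logic, so it helps to run them on a small sample. The post below is invented for this sketch (the field values are placeholders, not real blog data), and the snippet repeats only the header and title handling, not the whole function.

```python
import re

# Invented sample post; the field values are placeholders.
sample_post = """---
title: "#17: What is PEP?"
date: 2020-01-01
categories:
  - Python
tags:
  - PEP
---

Intro paragraph.

<!-- more -->

## PEP?
Details.
"""

# Without re.MULTILINE, ^ only matches at the very start of the text,
# and re.DOTALL lets .*? run across line breaks up to the closing ---.
header_match = re.search(r"^---\n(.*?)\n---", sample_post, re.DOTALL)
header_lines = header_match.group(1).splitlines()

# Same string handling as in the function: cut off the key, drop the quotes.
title = header_lines[0].split("title:", 1)[1].strip().strip('"')
post_number = re.match(r"#(\d+)", title).group(1)

# Swap the whole header for a level one heading and drop the excerpt marker.
body = sample_post[header_match.end():].replace("<!-- more -->", "", 1)
new_md = f"# {title}{body}"

print(header_lines)
print(post_number)
print(new_md.splitlines()[0])
```

Because the title itself starts with #, the generated heading line begins with two # characters separated by a space. The splitter still reads it as a level one heading, since only the leading # followed by a space counts as the marker.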

As the final step, we create a Reference metadata entry that we want to use as a source reference in the answers of the LLM:

def add_reference(d: dict) -> dict:
    """ Generate a reference out of Header 1 / Header 2 sections """
    parts = []
    if "Header 1" in d:
        parts.append(d["Header 1"])
    if "Header 2" in d:
        parts.append(d["Header 2"])

    if parts:
        d["Reference"] = " / ".join(parts)
    return d
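
A compact variant makes the behaviour for the different cases visible. The function build_reference is a hypothetical, non-mutating rewrite for this sketch and not the function used in the script.

```python
from typing import Optional


def build_reference(metadata: dict, keys=("Header 1", "Header 2")) -> Optional[str]:
    """Join the available heading levels into one reference string."""
    parts = [metadata[key] for key in keys if key in metadata]
    return " / ".join(parts) if parts else None


full = build_reference({"Header 1": "#17: What is PEP?", "Header 2": "Conclusion", "Header 3": "Links"})
top_only = build_reference({"Header 1": "#17: What is PEP?"})
nothing = build_reference({})

print(full)
print(top_only)
print(nothing)
```

Header 3 is left out on purpose to keep the reference short. A chunk without any heading gets no reference at all, so code that reads the entry later is safer with metadata.get("Reference") than with a direct key lookup.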

Persist documents

After we have extracted all those text parts, we can persist them in Chroma with this function:

def store_splits(splits):
    """ Store the document splits in ChromaDB """
    ids = [doc.metadata["Id"] for doc in splits]
    docs = [doc.page_content for doc in splits]
    metadatas = [doc.metadata for doc in splits]

    telemetry_off = Settings(anonymized_telemetry=False)
    client = chromadb.PersistentClient(path="PythonFridayRAG.chroma", settings=telemetry_off)

    # open the collection or create it on the first run
    collection = client.get_or_create_collection(name="posts")

    # insert new chunks and overwrite chunks whose Id already exists
    collection.upsert(
        documents=docs,
        ids=ids,
        metadatas=metadatas,
    )

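Our vectors go to the posts collection in a PythonFridayRAG.chroma database. Since we only pass the documents and no embeddings, Chroma computes the vectors with its default embedding function.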
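
When the number of chunks grows, a single upsert call may become too large, because Chroma can reject very large batches. A minimal sketch of a workaround is to send the data in slices. The helper iter_batches and the batch size of 1000 are assumptions for this example; the real limit depends on the Chroma version and setup.

```python
def iter_batches(ids, docs, metadatas, batch_size=1000):
    """Yield aligned slices of ids, documents and metadata (hypothetical helper)."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], docs[start:end], metadatas[start:end]


# Small made-up data set to show the slicing.
ids = [f"17-{n}" for n in range(1, 6)]
docs = [f"text {n}" for n in range(1, 6)]
metadatas = [{"No": "17"} for _ in ids]

batch_sizes = []
for batch_ids, batch_docs, batch_metadatas in iter_batches(ids, docs, metadatas, batch_size=2):
    # In store_splits this is the place for:
    # collection.upsert(documents=batch_docs, ids=batch_ids, metadatas=batch_metadatas)
    batch_sizes.append(len(batch_ids))

print(batch_sizes)
```

The three lists are sliced with the same indexes, so every Id stays paired with its document and its metadata.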

Glue everything together

The glue code we need for our script looks like this:

import re
import sys
import uuid
from pathlib import Path

import chromadb
from chromadb.config import Settings
from langchain_text_splitters import MarkdownHeaderTextSplitter


if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise ValueError("Usage: python pf_rag_index.py <file-or-folder>")

    path = Path(sys.argv[1])

    splitter = get_markdown_splitter()
    splits = process_files(path, splitter)
    store_splits(splits)

We can run our index script with this command:

python pf_rag_index.py ../Blog/docs/posts/

Test the vector store

After we indexed our posts, we can use this script to see what our vector store will return when we ask a specific question:

import chromadb
from chromadb.config import Settings

telemetry_off = Settings(anonymized_telemetry=False)
client = chromadb.PersistentClient(path="PythonFridayRAG.chroma", settings=telemetry_off)

# open the collection that the index script filled
collection = client.get_or_create_collection(name="posts")

while True:
    user_input = input("----\n\nQuestion: ")
    if user_input.lower() in ["quit", "exit", "end"]:
        break

    results = collection.query(
        query_texts=[user_input],
        n_results=5
    )

    for i in range(len(results["ids"][0])):
        print("----")
        print(f'[{results["ids"][0][i]}] - distance: {results["distances"][0][i]}')
        print(results["metadatas"][0][i]["Reference"])
        print(results["documents"][0][i][:200])
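
The result of query is a dictionary of nested lists: the outer list has one entry per query text, the inner list one entry per hit. The index-based loop above works, but zip can make the pairing more explicit. The following sketch runs without Chroma on a hand-made dictionary that only imitates that structure; iter_hits and the sample values are made up for this example.

```python
def iter_hits(results: dict):
    """Yield one dictionary per hit for the first (and here only) query text."""
    rows = zip(
        results["ids"][0],
        results["distances"][0],
        results["metadatas"][0],
        results["documents"][0],
    )
    for hit_id, distance, metadata, document in rows:
        yield {
            "id": hit_id,
            "distance": round(distance, 3),
            # Fall back to a placeholder when a chunk has no Reference entry.
            "reference": (metadata or {}).get("Reference", "(no reference)"),
            "preview": document[:200],
        }


# Hand-made stand-in for a query result, not real output.
fake_results = {
    "ids": [["17-1", "129-2"]],
    "distances": [[0.4222, 1.2952]],
    "metadatas": [[{"Reference": "#17: What is PEP?"}, {"No": "129"}]],
    "documents": [["I used the abbreviation PEP ...", "Pyperclip is one of multiple libraries ..."]],
}

hits = list(iter_hits(fake_results))
for hit in hits:
    print(f'[{hit["id"]}] - distance: {hit["distance"]} - {hit["reference"]}')
```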

We can try the script and ask it about PEP:

Question: What is PEP?
----
[17-1] - distance: 0.42222827672958374
#17: What is PEP?
I used the abbreviation PEP in a few posts without every explaining what this is. 
It is now time for a closer look at the development process of the Python 
programming language.
----
[17-2] - distance: 0.6288007497787476
#17: What is PEP? / PEP?
The abbreviation PEP stands for Python Enhancement Proposal and means this:
> A PEP is a design document providing information to the Python community, or 
describing a new feature for Python or its
----
[17-4] - distance: 0.8124025464057922
#17: What is PEP? / Conclusion
A Python Enhancement Proposal (PEP) describes a feature of the Python language in 
great technical detail. Most of the time you may not need to know all the details 
about a feature, but when you want,
----
[17-3] - distance: 1.0471127033233643
#17: What is PEP? / Why should that interest me?
If you (like me) just want to program with Python and not Python itself, then 
those PEP may carry too many details to be helpful. However, I found some good 
explanations there that I did not find in t
----
[129-2] - distance: 1.2951600551605225
#129: Copy & Paste With Python / Install pyperclip
**[Pyperclip](https://github.com/asweigart/pyperclip)** is one of multiple 
libraries we can use for this task. I choose pyperclip because it is 
straightforward to use and works on Windows, Linux and M
----

The results match our expectations: the four chunks of post #17 come first with the smallest distances, while the unrelated post #129 ranks last. With that we can continue to build our RAG. If your data does not return useful search results, you may need to change the embedding function. We will do that in a future post.

This completes the first part of our RAG: we can index folders or single files and turn them into embeddings that we store in Chroma. Because upsert overwrites entries whose Id already exists, we can extend our persisted data whenever a new post is ready, with no need to throw everything away and start from scratch. Next week we add our local LLM to the RAG to answer our questions.