#322: Embed Markdown for a Python Friday RAG
Now that we have found a flexible vector store in Chroma, we have everything we need to build a RAG (Retrieval-Augmented Generation) system for the Python Friday blog that uses LangChain and a local LM Studio.
In this post we focus on extracting metadata from the Markdown files I use in this MkDocs Material-powered blog. We split the Markdown into useful chunks and turn each post's metadata into a dictionary that we can store in Chroma. Let us explore this first part.
Installation
To build our Markdown extractor, we need a few packages:
Create a Markdown splitter
The langchain_text_splitters module offers a MarkdownHeaderTextSplitter class that splits our Markdown files by (sub)headings. That way we get text parts that belong together, which should increase the quality of the answers we produce with the LLM.
We can configure the MarkdownHeaderTextSplitter class by telling it which headers we want to split on:
Split documents and prepare metadata
Since the blog posts have a helpful set of metadata (like tags and categories), we can use them for our RAG. That requires some extra work that we could skip if the data we wanted to index were just plain Markdown files.
As a first step, we need to process the file or directory we want to index. If it is a directory, we need to find all Markdown files inside that folder and process each file. If it is a file, we can just process it:
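The file-or-directory decision can be sketched with pathlib. The function name is hypothetical, and searching subfolders recursively is an assumption on my side:

```python
from pathlib import Path


def find_markdown_files(source: Path) -> list[Path]:
    """Return the Markdown files behind `source` (a file or a directory)."""
    if source.is_dir():
        # rglob also descends into subfolders; sorting keeps runs repeatable.
        return sorted(source.rglob("*.md"))
    if source.is_file():
        return [source]
    raise FileNotFoundError(f"{source} is neither a file nor a directory")
```

Returning a list in both cases means the rest of the script does not need to care whether we started with a single post or a whole folder.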
Once we know the files to work with, we can do some preprocessing in which we extract the metadata and reformat the Markdown before we hand it to our MarkdownHeaderTextSplitter. We make sure that each entry has a unique Id, which we will later use as the Id in Chroma.
The metadata extraction is a bit of a mess, but it basically extracts the values from the metadata header of the post, removes that header, and turns the title of the post (stored in the title: field) into a proper Markdown level-one heading (#):
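The following is a simplified sketch of that idea, not the exact code of this blog. It assumes that the metadata header sits between two --- lines at the top of the file and only contains simple key: value pairs and lists; a real YAML parser would be more robust. The function name is hypothetical:

```python
import re

# Metadata header: a block between two '---' lines at the start of the file.
FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*\n", re.DOTALL)


def extract_front_matter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a post into (metadata, body) and promote the title to a heading."""
    match = FRONT_MATTER.match(markdown)
    if match is None:
        return {}, markdown  # plain Markdown file without a metadata header

    values: dict[str, list[str]] = {}
    key = None
    for line in match.group(1).splitlines():
        stripped = line.strip()
        if stripped.startswith("- ") and key is not None:
            # A list item belongs to the most recent key (e.g. tags).
            values[key].append(stripped[2:].strip().strip("\"'"))
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            key = key.strip()
            value = value.strip().strip("\"'")
            values[key] = [value] if value else []

    # Lists are joined into one string so that every value is a plain scalar.
    metadata = {name: ", ".join(items) for name, items in values.items()}

    body = markdown[match.end():].lstrip("\n")
    title = metadata.get("title")
    if title:
        body = f"# {title}\n\n{body}"
    return metadata, body
```

Joining list values such as tags into a single string is a cautious choice, because simple scalar values are the safest kind of metadata to store in a vector database.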
In the final part we create a Reference metadata entry that we want to use as a source reference with the LLM:
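One way to build the Reference entry and the Id is shown in this sketch. The "Header n" key names, the function names and the Id scheme (post number plus chunk position) are assumptions that I derived from the sample output further down:

```python
import re

# Metadata keys written by the header splitter (assumed names).
HEADER_KEYS = ("Header 1", "Header 2", "Header 3")


def build_reference(chunk_metadata: dict[str, str]) -> str:
    """Join the heading trail of a chunk, for example 'Title / Section'."""
    return " / ".join(
        chunk_metadata[key] for key in HEADER_KEYS if key in chunk_metadata
    )


def build_chunk_id(title: str, position: int) -> str:
    """Combine the post number from the title with the chunk position."""
    match = re.match(r"#(\d+)", title)
    prefix = match.group(1) if match else title
    return f"{prefix}-{position}"


def describe_chunk(post_metadata, chunk_metadata, position):
    """Return (id, merged metadata) for one chunk of a post."""
    merged = {**post_metadata, **chunk_metadata}
    merged["Reference"] = build_reference(chunk_metadata)
    return build_chunk_id(post_metadata.get("title", ""), position), merged
```

A stable Id matters here: when we index the same post again, the same chunk should get the same Id so that it replaces the old entry.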
Persist documents
After we have extracted all those text parts, we can persist them in Chroma with this method:
Our vectors go to the posts collection in a PythonFridayRAG.chroma database.
Glue everything together
The glue code we need for our script looks like this:
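A possible shape for that glue code is sketched here. The argument name and the three callables are hypothetical stand-ins for the steps described above (find the files, split a file into chunks, persist the chunks):

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the file or folder to index from the command line."""
    parser = argparse.ArgumentParser(description="Index Markdown posts in Chroma")
    parser.add_argument("source", type=Path, help="Markdown file or folder to index")
    return parser.parse_args(argv)


def run(source, find_files, process_file, persist):
    """Process every file behind `source` and return the number of chunks."""
    total = 0
    for path in find_files(source):
        chunks = process_file(path)
        persist(chunks)
        total += len(chunks)
    return total
```

Passing the steps in as callables keeps this sketch independent of how each step is implemented; in a real script we would call our own functions directly.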
We can run our index script with this command:
Test the vector store
After we indexed our posts, we can use this script to see what our vector store will return when we ask a specific question:
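A query script along these lines could look like the sketch below. It again uses the chromadb client directly (an assumption; the LangChain wrapper is an alternative), and the preview length of 200 characters as well as the function names are hypothetical:

```python
def format_results(question, result, preview=200):
    """Render a Chroma query result as plain text."""
    lines = [f"Question: {question}", "----"]
    # query() returns one inner list per query text; we only send one.
    rows = zip(
        result["ids"][0],
        result["distances"][0],
        result["metadatas"][0],
        result["documents"][0],
    )
    for chunk_id, distance, metadata, document in rows:
        lines.append(f"[{chunk_id}] - distance: {distance}")
        lines.append(metadata.get("Reference", ""))
        lines.append(document[:preview])  # only show the start of the chunk
        lines.append("----")
    return "\n".join(lines)


def ask(question, path="PythonFridayRAG.chroma", collection_name="posts", n_results=5):
    """Query the persisted collection and return the formatted matches."""
    # Imported here so that format_results() works without chromadb installed.
    import chromadb

    client = chromadb.PersistentClient(path=path)
    collection = client.get_collection(collection_name)
    result = collection.query(query_texts=[question], n_results=n_results)
    return format_results(question, result)
```

The query must use the same embedding function as the indexing step, otherwise the distances are meaningless. A smaller distance means a closer match.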
We can try the script and ask it about PEP:
Question: What is PEP?
----
[17-1] - distance: 0.42222827672958374
#17: What is PEP?
I used the abbreviation PEP in a few posts without every explaining what this is.
It is now time for a closer look at the development process of the Python
programming language.
----
[17-2] - distance: 0.6288007497787476
#17: What is PEP? / PEP?
The abbreviation PEP stands for Python Enhancement Proposal and means this:
> A PEP is a design document providing information to the Python community, or
describing a new feature for Python or its
----
[17-4] - distance: 0.8124025464057922
#17: What is PEP? / Conclusion
A Python Enhancement Proposal (PEP) describes a feature of the Python language in
great technical detail. Most of the time you may not need to know all the details
about a feature, but when you want,
----
[17-3] - distance: 1.0471127033233643
#17: What is PEP? / Why should that interest me?
If you (like me) just want to program with Python and not Python itself, then
those PEP may carry too many details to be helpful. However, I found some good
explanations there that I did not find in t
----
[129-2] - distance: 1.2951600551605225
#129: Copy & Paste With Python / Install pyperclip
**[Pyperclip](https://github.com/asweigart/pyperclip)** is one of multiple
libraries we can use for this task. I choose pyperclip because it is
straightforward to use and works on Windows, Linux and M
----
We see that the results match our expectations and we can continue to build our RAG. Should your data not return useful search results, then you may need to change the embedding function. We will do that in a future post.
Next
We finished the first part of our RAG: an indexer that takes folders or single files and turns them into embeddings that we store in Chroma. That way we can extend our persisted data whenever a new post is ready, with no need to throw everything away and start from scratch. Next week we add our local LLM to the RAG to answer our questions.