#306: A PDF Bot With LangChain

In the last two posts we got our hands dirty with the LangChain ecosystem and built a bot that talked to a CSV file and one that connected to a database. The packages langchain_community and langchain_experimental helped us a lot with our structured data. But what about unstructured data, like the text in a PDF file?

Creating a bot that answers based on a PDF file is a straightforward task with LangChain. As long as the PDF is small enough to fit into the context size of our LLM, we can even skip all the overhead of vector databases. Let us see what we need to create this bot.

Installation

We want to read PDF files, and for this task we need a library that knows how to extract text from them. One possible tool is PyPDFLoader from the langchain_community package we already used. It relies on the pypdf library, which we can install with this command:

uv pip install -U langchain_core pypdf

Prepare the PDF

The context window of an LLM is measured in tokens and our PDF file, the prompt and our question must fit into that limit. Since those values depend on your use-case and your model, it is not possible to tell you the maximum size your PDF can have.

When I tried to load a 142 KB file with a text length of 22208 characters, I got this exception:

Error code: 400 - {'error': 'Trying to keep the first 6731 tokens when context the overflows. However, the model is loaded with context length of only 4096 tokens, which is not enough. Try to load the model with a larger context length, or provide a shorter input'}

This means that my LLM allows for 4096 tokens, but the text of my PDF file, the prompt and my question together need 6731 tokens, which is too much to handle.
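To get a feeling for the size before sending anything to the model, we can estimate the token count from the text length. The following is only a rough sketch: the ratio of 3.3 characters per token is a ballpark derived from the numbers above (22208 characters led to roughly 6731 tokens, prompt and question included), and the helper names estimate_tokens and fits_context as well as the reserve of 512 tokens are my own assumptions. The real count depends on the tokenizer of your model.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 3.3) -> int:
    """Rough token estimate from the character count (not a real tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, context_length: int = 4096, reserve: int = 512) -> bool:
    """Check if the estimate leaves `reserve` tokens for prompt, question and answer."""
    return estimate_tokens(text) + reserve <= context_length


# Stand-in text with the same length as the PDF that caused the exception
sample_text = "x" * 22208
print("Estimated tokens:", estimate_tokens(sample_text))
print("Fits into 4096 tokens:", fits_context(sample_text))
```

In the bot you would pass pdf_text instead of the stand-in text and adjust context_length to the value your model is loaded with.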

Should this happen to you, you need to reduce the length of the PDF you want to work with until it fits into the context window. That is not ideal, but a necessity for this minimalistic PDF bot.
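If you do not want to edit the PDF by hand, one possible shortcut is to keep only the leading pages that fit into a character budget. This is a minimal sketch under assumptions: the function trim_pages and the budget value are hypothetical, and the page texts would come from page.page_content of the loaded pages. Cutting off pages means the bot can no longer answer questions about the dropped part.

```python
def trim_pages(page_texts: list[str], max_chars: int) -> list[str]:
    """Keep leading pages while the joined text stays within max_chars."""
    kept: list[str] = []
    total = 0
    for text in page_texts:
        # Account for the "\n\n" separator that joins the pages later
        extra = len(text) + (2 if kept else 0)
        if total + extra > max_chars:
            break
        kept.append(text)
        total += extra
    return kept


# Stand-in pages; in the bot this would be [page.page_content for page in pages]
demo_pages = ["a" * 10, "b" * 10, "c" * 10]
kept_pages = trim_pages(demo_pages, max_chars=25)
trimmed_text = "\n\n".join(kept_pages)
print(f"Kept {len(kept_pages)} of {len(demo_pages)} pages, {len(trimmed_text)} characters")
```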

Create the PDF bot

We can now take our PDF file and use PyPDFLoader to bring it into a usable form. From there we can combine our content with the LLM configuration from the last post, modify the prompt to answer the questions based only on the PDF and put our chain together:

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

# 1. Load the PDF
loader = PyPDFLoader("pf178.pdf")
pages = loader.load()
pdf_text = "\n\n".join([page.page_content for page in pages])
print(f"Length of PDF Text: {len(pdf_text)}")

# 2. Create the LLM
llm = ChatOpenAI(
    model="mistral",
    openai_api_base="http://localhost:1234/v1",
    openai_api_key="not-needed",
    temperature=0
)

# 3. Create a strict prompt
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a helpful assistant that only answers using the provided PDF context.\n"
        "If the answer is not in the context, say 'I don’t know based on this document.'\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
)

# 4. Build the new Runnable chain
chain = prompt | llm

# 5. Ask a question interactively
while True:
    user_input = input("You: ")
    if user_input.lower() in ["quit", "exit", "end"]:
        break

    result = chain.invoke({"context": pdf_text, "question": user_input})
    print("Bot:", result.content)

For a bot that should only report what is in the PDF, we best turn the temperature down to 0 in our LLM configuration. That way we get more predictable results.
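To see what the model receives for each question, we can fill the same template text by hand. This sketch uses plain str.format instead of PromptTemplate, so it runs without LangChain. For simple placeholders like these I would expect the same text, but treat that as an assumption. The helper fill_prompt and the sample context are made up for illustration.

```python
# Same template text as in the bot, filled with str.format for inspection
template = (
    "You are a helpful assistant that only answers using the provided PDF context.\n"
    "If the answer is not in the context, say 'I don’t know based on this document.'\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def fill_prompt(context: str, question: str) -> str:
    """Return the full prompt text for one question."""
    return template.format(context=context, question=question)


filled = fill_prompt("Pandas can read Excel files.", "What can Pandas read?")
print(filled)
print(f"Prompt length in characters: {len(filled)}")
```

The output shows that the complete PDF text is sent along with every single question, which is why the whole prompt has to fit into the context window.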

Ask about the PDF

We can now run our bot and ask questions about the PDF file. I had the following conversation with a PDF of blog post #178. When we ask about something that is not in the PDF, the bot should tell us that this question cannot be answered based on the PDF:

Length of PDF Text: 5984

You: what is this PDf about?
Bot: This PDF is about cleaning and transforming a dataset using Pandas in 
Python. It specifically focuses on working with an Excel file containing data 
on greenhouse gas emissions from the European Environment Agency (EEA). The 
document walks through various steps to clean up the data, including:

1. Loading the data into a Pandas DataFrame.
2. Removing unnecessary rows and columns.
3. Fixing formatting issues with years.
4. Modifying headers and renaming columns.
5. Swapping axes to optimize the data for plotting.
6. Exporting the cleaned data to a CSV file for future use.

The goal is to automate the data cleaning process so that it can be easily 
repeated when new data is released.


You: Give me a summary of the text that is 50 words or less.
Bot: The document explains how to clean and transform data using Pandas. It 
demonstrates loading an Excel file, removing unwanted rows/columns, fixing 
headers, setting index values, renaming entries, swapping axes, and exporting 
cleaned data as a CSV file for easier analysis and plotting.

(50 words)



You: show 3 of the cleanup methods with code sample in Python
Bot: Here are three cleanup methods with their corresponding code samples 
from the provided context:

1. **Removing Rows and Columns:**
   ```python
   df = pd.read_excel(
       data_file,
       sheet_name='Sheet 1',
       skiprows=8,  # Skip leading rows
       skipfooter=3,  # Ignore rows after the data
       engine='openpyxl')
   ```

2. **Dropping Columns with All Missing Values:**
   ```python
   df = df.dropna(axis=1, how='all')  # Drop columns where all values are missing
   ```

3. **Renaming Index (e.g., Country Names):**
   ```python
   df = df.rename(index={'Germany (until 1990 former territory of the FRG)': 'Germany'})
   ```



You: Can we do the same in .Net?
Bot: I don’t know based on this document.

The answers match the content of the PDF file, exactly as I had hoped they would. Try it with your PDF and see if you can find results that surprise you in a positive way.

The combination of an LLM and PyPDFLoader gives us a small chat bot that can answer questions based on a single, small PDF file. That is a great start for all the small PDF files we have. But when the PDF file gets larger, this approach does not work. Next week we explore our options to handle larger PDF files.