#307: Experimenting With a Large PDF File in LangChain

Last week we created a minimalistic bot that let us ask questions about a PDF file. As long as the PDF file and our prompt fit into the context window of the LLM, that can be done without much infrastructure. Unfortunately, most interesting PDF files are far larger, and we do not want to ignore them. Let us find a way to split our PDF file into chunks so that the file size no longer matters.

Combining the PDF and the FAQ bot

In this post we play with the solution we created for the [FAQ bot that did not use an LLM] and combine it with our PDF bot from last week. For the FAQ bot we used scikit-learn and the TfidfVectorizer to create vectors for each question and found a matching answer by running the question of the user through the same vectorizer.

In this post we split the text of the PDF file into chunks, turn them into vectors and use cosine_similarity to find the chunks that best match the question. By submitting only the best matches to the LLM and not the whole PDF, we stay below the context window size. If all works out, the size of the PDF no longer limits our bot.
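
To make the idea of a "best match" more concrete, here is a minimal pure-Python sketch of cosine similarity on plain word counts. It is a simplification and not the bot's code: the real script below uses the TfidfVectorizer and scikit-learn's cosine_similarity, and the helper names and sample sentences here are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count every word of two or more characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical chunks and question, only for illustration
sample_chunks = [
    "Holmes leaned back in his chair in Baker Street.",
    "The hound was heard howling upon the moor at night.",
]
question = "Where was the hound heard?"

scores = [cosine(word_counts(question), word_counts(c)) for c in sample_chunks]
best = max(range(len(sample_chunks)), key=lambda i: scores[i])
print(scores, "-> best chunk:", best)
```

The first chunk shares no word with the question and scores zero, while the second shares several words and wins. TF-IDF refines this picture by giving rare words more weight than common ones.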

Find a test document

It is a bit tricky to find a useful, large and free PDF file. Therefore, I went to Project Gutenberg and turned the book The Hound of the Baskervilles into a PDF file with more than 100 pages. The Sherlock Holmes story by Sir Arthur Conan Doyle may be known to many and is suitably long for our purpose.

If we try this PDF file with the solution from last week, we end up with the same error message about exceeding the context window. Let us now dive into a solution that works with this file.

Install packages

On top of the LangChain packages and pypdf, we need scikit-learn for the vectorizer and NumPy for some glue code:

uv pip install -U scikit-learn numpy langchain_core pypdf

Read the PDF in chunks

Our PDF bot for a large PDF file needs to take a few more steps than the one we created last week. Nevertheless, we can reuse quite a lot and combine it with the FAQ bot. Here are the important sections we need:

1. We load the PDF file and extract the text of the document.
2. We create a recursive text splitter that creates chunks of up to 1000 characters. The overlap of 100 characters helps us keep each sentence whole in at least one chunk.
3. We initialise the TfidfVectorizer, learn the vocabulary and the term weights with the fit() method and vectorise all the chunks we have.
4. We turn the vectors and the chunks into a dictionary.
5. We connect to our local LLM using the usual configuration.
6. We reuse the prompt template from last week without any modifications.
7. For our chain we combine the prompt template with the LLM.
8. The interaction part with the user is a bit more complicated. We need to turn the question into a vector (a), search for similar chunks (b), combine the top 10 matches into a string (c) and hand that string to the LLM (d).

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Load the PDF
loader = PyPDFLoader("pg2852.txt.pdf")
pages = loader.load()
pdf_text = "\n\n".join([page.page_content for page in pages])
print(f"Length of PDF Text: {len(pdf_text)}")


# 2. Split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # max characters per chunk
    chunk_overlap=100,   # overlap to preserve context
    length_function=len, # optional, default is len
)

chunks = text_splitter.split_text(pdf_text)
print(f"Total chunks created: {len(chunks)}")


# 3. initialise the TfidfVectorizer and vectorise the chunks
vectorizer = TfidfVectorizer()
vectorizer.fit(chunks)
vectorized_chunks = vectorizer.transform(chunks)


# 4. Map each vector (as a tuple) to its chunk.
#    Note: the loop in step 8 looks chunks up by index and does not read this dictionary.
vector_chunk_dict = {}
for i, vec in enumerate(vectorized_chunks):
    # Convert the sparse row to a dense array, then to a tuple to make it hashable
    vec_tuple = tuple(vec.toarray()[0])
    vector_chunk_dict[vec_tuple] = chunks[i]

print(f"Dictionary contains {len(vector_chunk_dict)} vectors.")


# 5. Create the LLM
llm = ChatOpenAI(
    model="mistral",
    openai_api_base="http://localhost:1234/v1",
    openai_api_key="not-needed",
    temperature=0
)


# 6. Create a strict prompt
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a helpful assistant that only answers using the provided PDF context.\n"
        "If the answer is not in the context, say 'I don’t know based on this document.'\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
)


# 7. Build the new Runnable chain
chain = prompt | llm


# 8. Ask a question interactively
while True:
    user_input = input("\n\nYou: ")
    if user_input.lower() in ["quit", "exit", "end"]:
        break

    # a. vectorize user input
    user_vec = vectorizer.transform([user_input])

    # b. compute cosine similarity with all chunks
    similarities = cosine_similarity(user_vec, vectorized_chunks)[0] 

    # c. get top 10 matches and merge them into a single text
    top_indices = np.argsort(similarities)[::-1][:10]  # descending order
    top_chunks_text = "\n".join([chunks[idx] for idx in top_indices])

    # d. send the request to the LLM and print the output
    result = chain.invoke({"context": top_chunks_text, "question": user_input})
    print("Bot:", result.content)
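
To see what the chunk_overlap setting in step 2 buys us, here is a small stand-alone sketch of a fixed-size splitter with overlap. It is a deliberately simplified stand-in and not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break at paragraph, line and word boundaries, while this hypothetical split_with_overlap helper cuts at fixed character positions.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while True:
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return pieces


# Hypothetical sample text, only for illustration
sample = "The hound ran across the moor. Holmes watched from the rocks."
pieces = split_with_overlap(sample, chunk_size=30, chunk_overlap=10)
for piece in pieces:
    print(repr(piece))
```

Each chunk starts with the last 10 characters of the previous one, so a phrase that is cut at the end of one chunk has a chance to appear intact at the start of the next.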

Asking questions about the book

We can now run our extended PDF bot and ask questions about the book. The better our question matches the wording in the book, the better the results:

Length of PDF Text: 382335
Total chunks created: 474
Dictionary contains 474 vectors.


You: Who is Sherlock Holmes?
Bot: Based on the provided document, Sherlock Holmes is a detective described by 
a cabman as:
- Around forty years of age
- Of middle height (two or three inches shorter than another character)
- Dressed like a gentleman ("a toff")
- Having a black beard cut square at the end and a pale face

He is also mentioned as having a dolichocephalic skull with well-marked 
supra-orbital development, which interests Dr. Mortimer from an anthropological 
perspective.

Additionally, Holmes is portrayed as a consulting detective with an extensive 
practice in London, known for solving puzzles and cases that others find 
challenging. He is described as someone who can think out "little puzzles" and 
has a reputation that extends to the point where people might come to him directly 
with their problems.

The document also shows Holmes engaging in conversations about various cases, 
demonstrating his deductive reasoning skills and his ability to piece together 
information from different sources.


You: Who is Dr. Watson?
Bot: Based on the provided document, Dr. Watson is a companion and friend of 
Sherlock Holmes. He is also a medical doctor who assists Holmes in his 
investigations. In this particular context, Dr. Watson is accompanying Sir Henry 
Baskerville to Baskerville Hall and will be reporting back to Holmes about the 
events that unfold there.

The text describes Dr. Watson as someone who has been complimented by Holmes for 
his investigative skills and is eager to assist Sir Henry Baskerville with the 
mysterious situation at Baskerville Hall. He is also described as a resident of 
London who is traveling to Devon with Sir Henry and Dr. Mortimer.


You: Who is Sri Henry Baskerville?
Bot: Based on the provided document, Sir Henry Baskerville is the heir to 
Baskerville Hall following the death of his uncle, Sir Charles Baskerville. He 
was previously farming in Canada and is described as an "excellent fellow in every 
way." He arrives at Waterloo Station to inherit the estate and title after the 
mysterious death of his uncle.

The document does not provide further personal details about Sir Henry beyond this 
context.


You: What happens at Baker Street?
Bot: Based on the provided document, several events occur at Baker Street:

1. Dr. Mortimer and Baskerville are followed into Baker Street after leaving 
    Oxford Street and Regent Street.
2. A hansom cab with a man inside halts near Baker Street, and Holmes 
    observes this.
3. Sherlock Holmes sends a message to Sir Henry Baskerville via a boy named 
    Cartwright, instructing him to send any found pocketbook to Baker Street if it 
    is dropped.
4. Holmes and Watson discuss the case and the need for Watson to report facts back 
    to Baker Street.

The document does not provide more specific details about what happens at Baker 
Street beyond these points.


You: What happens at Bakerstreet?
Bot: I don’t know based on this document.


You: Create a summary of the story in 50 words.
Bot: I don’t know based on this document.

The provided context contains excerpts from "The Hound of the Baskervilles" by 
Arthur Conan Doyle but does not offer a coherent narrative or full story. 
Therefore, I cannot create a concise summary without additional information.


You: Who is the murderer and what is his motive?
Bot: Based on this document, the murderer is Selden, the Notting Hill murderer. 
His motive appears to be related to his violent and brutal nature, as he is 
described as a man who would "stick at nothing." However, the specific motive for 
the death of Sir Charles Baskerville or any other crimes mentioned is not 
explicitly stated in the provided context.

The document also mentions a hound that may have been involved in the death of Sir 
Charles Baskerville, but it does not provide definitive evidence linking Selden 
directly to that particular crime. The case seems to involve multiple suspects and 
complex circumstances, with Holmes and Watson investigating further.

If you need more specific details about the motive or other aspects of the case, I 
would recommend referring to additional context or chapters from the story.


You: Who is the murderer of Sir Charles Baskerville?
Bot: Based on this document, it is not explicitly stated who the murderer of Sir 
Charles Baskerville is. The text discusses various suspects and circumstances 
surrounding his death but does not conclusively identify the killer.

I don’t know based on this document.


You: Who is the murderer?
Bot: Based on this document, the murderer is Stapleton (also referred to as "the 
naturalist"). He orchestrated the death of Sir Charles Baskerville using a savage 
hound and artificial means to make the creature appear diabolical.


You: Why did Stapleton murder Sir Charles Baskerville?
Bot: I don’t know based on this document.


You: What was the motive of Stapleton?
Bot: Based on the provided document, the motive of Stapleton is not explicitly 
stated. The text mentions various theories and suspicions about his actions, such 
as having designs upon the old man (Sir Charles Baskerville) and the existence of 
the hound, but it does not definitively conclude his ultimate motive.

The closest inference is that Stapleton might have been trying to scare away the 
Baskerville family to secure a comfortable and permanent home for the Barrymores, 
but this is presented as an inadequate explanation. Therefore, I don’t know based 
on this document what Stapleton's true motive was.


You: What is the role of the hound in this story?
Bot: Based on the provided document, the role of the hound in "The Hound of the 
Baskervilles" is multifaceted:

1. **Legend and Superstition**: The hound is part of a family legend, said to be 
    a supernatural creature that haunts the Baskerville lineage as a punishment 
    for past sins. It is described as a hell-hound that drives people to fear and 
    death.

2. **Cause of Fear and Death**: The hound is used to frighten and potentially kill 
    members of the Baskerville family. For example, it is implied to have caused 
    the death of Sir Charles Baskerville due to sheer terror.

3. **Mystery and Investigation**: The existence and actions of the hound are 
    central to the mystery that Sherlock Holmes and Dr. Watson are investigating. 
    They try to determine whether the hound is a real animal, a supernatural 
    entity, or a cleverly devised hoax meant to instill fear.

4. **Symbol of Danger**: The hound symbolizes an ever-present danger and impending 
    doom, contributing to the atmosphere of suspense and dread in the story.

5. **Cunning Device**: It is suggested that the hound might be a cunning device 
    used by someone to frighten victims, possibly as part of a larger plot 
    involving human agency.

In summary, the hound serves as both a literal and symbolic threat, driving the 
narrative and the characters' actions throughout the story.

We can see that a typo like Sri (instead of Sir) is no problem, while the difference between "Bakerstreet" and "Baker Street" has a significant impact on the answer we get. The wording of longer questions matters too: when we ask for the murderer and the motive, we get a different answer than when we ask only for the murderer.
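
A likely explanation lies in how the question is turned into terms. By default, the TfidfVectorizer lowercases the text and keeps tokens of two or more word characters, and only exact matches with terms learned from the book count. The following sketch mimics that tokenisation with the standard library; the one-sentence vocabulary is a made-up stand-in for the book, so the real vocabulary will contain many more terms.

```python
import re

# Same pattern as TfidfVectorizer's default token_pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens(text):
    """Lowercase and tokenise roughly the way TfidfVectorizer does by default."""
    return TOKEN_PATTERN.findall(text.lower())


# Hypothetical one-sentence stand-in for the vocabulary learned from the book
vocabulary = set(tokens("Sir Henry Baskerville was followed into Baker Street."))

for question in [
    "Who is Sri Henry Baskerville?",
    "What happens at Baker Street?",
    "What happens at Bakerstreet?",
]:
    known = [t for t in tokens(question) if t in vocabulary]
    print(f"{question!r} -> matching terms: {known}")
```

With the typo "Sri", the terms "henry" and "baskerville" still match and carry the search. "Bakerstreet" becomes a single unknown term, so the question loses both of its distinctive words and only common words remain to select the chunks.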

I experimented with different chunk sizes and with the number of top results I pass to the LLM. The more specific your questions are, the more you profit from a larger number of smaller chunks. But as we see with the request for a summary, the chunks themselves then become a problem: when we only work with parts of the text, we are unable to get a summary of the whole story.
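
When trying out such combinations, it helps to keep an eye on how much text ends up in the prompt. The following back-of-the-envelope sketch assumes that every chunk is at most chunk_size characters long and that the chunks are joined with a single newline, as in step 8c. The helper name and the alternative settings are hypothetical, not the values from my experiments.

```python
def max_context_chars(chunk_size, top_k, separator="\n"):
    """Upper bound on the context length when top_k chunks are joined by separator."""
    if top_k <= 0:
        return 0
    return top_k * chunk_size + (top_k - 1) * len(separator)


pdf_length = 382_335  # "Length of PDF Text" reported in the run above

# The first pair is the setting from the script, the others are hypothetical
for chunk_size, top_k in [(1000, 10), (500, 20), (250, 40)]:
    budget = max_context_chars(chunk_size, top_k)
    share = budget / pdf_length
    print(f"chunk_size={chunk_size:>4} top_k={top_k:>2} -> at most {budget} chars ({share:.1%} of the book)")
```

All three settings hand roughly the same amount of text to the LLM, but the smaller chunks can come from many more places in the book. Either way, the model only ever sees a few percent of the story, which is why a summary of the whole book is out of reach.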

Limitations

With this approach we solved the problem of the context window limitation. Because we split the text into chunks and feed only the most relevant parts to the LLM, the size of the PDF file is no longer relevant. We can work with PDF files of any size and get a decent result, as long as we reuse the words in the text. The TfidfVectorizer is not the most flexible one when it comes to typos or detecting similar words.

Another severe limitation is that we need to run the fit() method over the whole document so that the vectorizer can learn its vocabulary and term weights. That is no problem as long as we only work with a single document, but when we have multiple files and want to load data incrementally, that will be a challenge.
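
The reason is that the inverse document frequency (IDF) of a term depends on all documents seen during fit(). The sketch below recomputes the smoothed IDF formula that scikit-learn uses by default, ln((1 + n) / (1 + df)) + 1, with the standard library. The tiny corpus and the whitespace tokenisation are simplifications for illustration only.

```python
import math


def smoothed_idf(documents):
    """IDF per term, using the smoothed formula ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}


# Hypothetical mini-corpus, only for illustration
first_batch = ["the hound howled", "holmes smoked"]
second_batch = first_batch + ["the hound attacked", "the hound fled"]

before = smoothed_idf(first_batch)
after = smoothed_idf(second_batch)

print("hound before:", round(before["hound"], 3), "after:", round(after["hound"], 3))
print("new terms:", sorted(set(after) - set(before)))
```

After the second batch, the weight of "hound" has dropped and two new terms have appeared. Vectors created with the old fit would therefore no longer be comparable with new ones, so every added document means fitting and vectorising everything again.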

We currently need the vectors and the text to be present in memory for our solution. While this works with small data sets, it can be a challenge if we want to scale to include all our PDF files.

All these issues are solvable, and I encourage my readers to continue experimenting with this approach. In the meantime, I will cover a few other topics before continuing this journey.

We now have a text splitter in place that turns our PDF file into manageable chunks. That allows us to work only with the relevant sections of a PDF file instead of the whole document, which gives us more flexibility and removes the size limit. Next week we look at the magic behind the | in the syntax to create our LangChain chains.