#307: Experimenting With a Large PDF File in LangChain
Last week we created a minimalistic bot that let us ask questions about a PDF file. As long as the PDF file and our prompt fit into the context window of the LLM, that can be done without much infrastructure. Unfortunately, most interesting PDF files are far larger, and we do not want to ignore them. Let us find a way to split our PDF file into chunks so that the file size no longer matters.
Combining the PDF and the FAQ bot
In this post we play with the solution we created for the [FAQ bot that did not use an LLM] and combine it with our PDF bot from last week. For the FAQ bot we used scikit-learn and the TfidfVectorizer to create vectors for each question and found a matching answer by running the question of the user through the same vectorizer.
In this post we split the text of the PDF file into chunks, turn them into vectors and use the cosine_similarity to find the chunks that best match the question. By only submitting the best matches to the LLM and not the whole PDF we always stay below the context window size. If all works out, the size of the PDF no longer limits our bot.
Find a test document
It is a bit tricky to find a useful, large and free PDF file. Therefore, I went to Project Gutenberg and turned the book The Hound of the Baskervilles into a PDF file with more than 100 pages. The Sherlock Holmes story by Sir Arthur Conan Doyle is known to many and is suitably long for our purpose.
If we try this PDF file with the solution from last week, we end up with the same error message about exceeding the context window. Let us now dive into a solution that works with this file.
Install packages
We need the scikit-learn package for the vectorizer and NumPy for some glue code.
Read the PDF in chunks
Our PDF bot for a large PDF file needs to take a few more steps than the one we created last week. Nevertheless, we can reuse quite a lot and combine it with the FAQ bot. Here are the important sections we need:
- We need to load the PDF file and extract the text of the document.
- We create a recursive text splitter that creates chunks of 1000 characters. The overlap of 100 characters helps us to keep each sentence as a whole in at least one chunk.
- We initialise the TfidfVectorizer, optimise the vector creation with the fit() method and vectorise all the chunks we have.
- We turn the vectors and the chunks into a dictionary.
- We connect to our local LLM using the usual configuration.
- We reuse the prompt template from last week without any modifications.
- For our chain we combine the prompt template with the LLM.
- The interaction part with the user is a bit more complicated. We need to turn the question into a vector (a), search for similar chunks (b), combine the top 10 matches into a string (c) that we then hand to the LLM (d).
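The retrieval part of these steps (vectorise the chunks, vectorise the question, pick the best matches) can be sketched as follows. This is a minimal sketch and not the exact code of the bot: the four chunks are made-up stand-ins for the output of the text splitter, the helper name top_chunks is hypothetical, and instead of a dictionary it keeps the chunk list and the vector matrix side by side.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-in chunks; in the bot they come from the text splitter.
chunks = [
    "Mr. Sherlock Holmes was usually very late in the mornings.",
    "The hound was seen on the moor at night.",
    "Sir Henry Baskerville arrived from Canada.",
    "Dr. Mortimer read the old manuscript aloud.",
]

# Learn the vocabulary and the weights from all chunks, then vectorise them.
vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)


def top_chunks(question, k=2):
    """Return up to k chunks that are most similar to the question."""
    # (a) run the question through the same vectorizer
    question_vector = vectorizer.transform([question])
    # (b) compare the question with every chunk
    scores = cosine_similarity(question_vector, chunk_vectors)[0]
    best = np.argsort(scores)[::-1][:k]
    # Skip chunks that share no word with the question at all.
    return [chunks[i] for i in best if scores[i] > 0]


# (c) combine the matches into one string that would go into the prompt
context = "\n\n".join(top_chunks("Who is Sherlock Holmes?"))
print(context)
```

In the bot, k would be 10 and the resulting string would be handed to the chain as the document part of the prompt.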
Asking questions about the book
We can now run our extended PDF bot and ask questions about the book. The better our question matches the wording in the book, the better the results:
Length of PDF Text: 382335
Total chunks created: 474
Dictionary contains 474 vectors.
You: Who is Sherlock Holmes?
Bot: Based on the provided document, Sherlock Holmes is a detective described by
a cabman as:
- Around forty years of age
- Of middle height (two or three inches shorter than another character)
- Dressed like a gentleman ("a toff")
- Having a black beard cut square at the end and a pale face
He is also mentioned as having a dolichocephalic skull with well-marked
supra-orbital development, which interests Dr. Mortimer from an anthropological
perspective.
Additionally, Holmes is portrayed as a consulting detective with an extensive
practice in London, known for solving puzzles and cases that others find
challenging. He is described as someone who can think out "little puzzles" and
has a reputation that extends to the point where people might come to him directly
with their problems.
The document also shows Holmes engaging in conversations about various cases,
demonstrating his deductive reasoning skills and his ability to piece together
information from different sources.
You: Who is Dr. Watson?
Bot: Based on the provided document, Dr. Watson is a companion and friend of
Sherlock Holmes. He is also a medical doctor who assists Holmes in his
investigations. In this particular context, Dr. Watson is accompanying Sir Henry
Baskerville to Baskerville Hall and will be reporting back to Holmes about the
events that unfold there.
The text describes Dr. Watson as someone who has been complimented by Holmes for
his investigative skills and is eager to assist Sir Henry Baskerville with the
mysterious situation at Baskerville Hall. He is also described as a resident of
London who is traveling to Devon with Sir Henry and Dr. Mortimer.
You: Who is Sri Henry Baskerville?
Bot: Based on the provided document, Sir Henry Baskerville is the heir to
Baskerville Hall following the death of his uncle, Sir Charles Baskerville. He
was previously farming in Canada and is described as an "excellent fellow in every
way." He arrives at Waterloo Station to inherit the estate and title after the
mysterious death of his uncle.
The document does not provide further personal details about Sir Henry beyond this
context.
You: What happens at Baker Street?
Bot: Based on the provided document, several events occur at Baker Street:
1. Dr. Mortimer and Baskerville are followed into Baker Street after leaving
Oxford Street and Regent Street.
2. A hansom cab with a man inside halts near Baker Street, and Holmes
observes this.
3. Sherlock Holmes sends a message to Sir Henry Baskerville via a boy named
Cartwright, instructing him to send any found pocketbook to Baker Street if it
is dropped.
4. Holmes and Watson discuss the case and the need for Watson to report facts back
to Baker Street.
The document does not provide more specific details about what happens at Baker
Street beyond these points.
You: What happens at Bakerstreet?
Bot: I don’t know based on this document.
You: Create a summary of the story in 50 words.
Bot: I don’t know based on this document.
The provided context contains excerpts from "The Hound of the Baskervilles" by
Arthur Conan Doyle but does not offer a coherent narrative or full story.
Therefore, I cannot create a concise summary without additional information.
You: Who is the murderer and what is his motive?
Bot: Based on this document, the murderer is Selden, the Notting Hill murderer.
His motive appears to be related to his violent and brutal nature, as he is
described as a man who would "stick at nothing." However, the specific motive for
the death of Sir Charles Baskerville or any other crimes mentioned is not
explicitly stated in the provided context.
The document also mentions a hound that may have been involved in the death of Sir
Charles Baskerville, but it does not provide definitive evidence linking Selden
directly to that particular crime. The case seems to involve multiple suspects and
complex circumstances, with Holmes and Watson investigating further.
If you need more specific details about the motive or other aspects of the case, I
would recommend referring to additional context or chapters from the story.
You: Who is the murderer of Sir Charles Baskerville?
Bot: Based on this document, it is not explicitly stated who the murderer of Sir
Charles Baskerville is. The text discusses various suspects and circumstances
surrounding his death but does not conclusively identify the killer.
I don’t know based on this document.
You: Who is the murderer?
Bot: Based on this document, the murderer is Stapleton (also referred to as "the
naturalist"). He orchestrated the death of Sir Charles Baskerville using a savage
hound and artificial means to make the creature appear diabolical.
You: Why did Stapleton murder Sir Charles Baskerville?
Bot: I don’t know based on this document.
You: What was the motive of Stapleton?
Bot: Based on the provided document, the motive of Stapleton is not explicitly
stated. The text mentions various theories and suspicions about his actions, such
as having designs upon the old man (Sir Charles Baskerville) and the existence of
the hound, but it does not definitively conclude his ultimate motive.
The closest inference is that Stapleton might have been trying to scare away the
Baskerville family to secure a comfortable and permanent home for the Barrymores,
but this is presented as an inadequate explanation. Therefore, I don’t know based
on this document what Stapleton's true motive was.
You: What is the role of the hound in this story?
Bot: Based on the provided document, the role of the hound in "The Hound of the
Baskervilles" is multifaceted:
1. **Legend and Superstition**: The hound is part of a family legend, said to be
a supernatural creature that haunts the Baskerville lineage as a punishment
for past sins. It is described as a hell-hound that drives people to fear and
death.
2. **Cause of Fear and Death**: The hound is used to frighten and potentially kill
members of the Baskerville family. For example, it is implied to have caused
the death of Sir Charles Baskerville due to sheer terror.
3. **Mystery and Investigation**: The existence and actions of the hound are
central to the mystery that Sherlock Holmes and Dr. Watson are investigating.
They try to determine whether the hound is a real animal, a supernatural
entity, or a cleverly devised hoax meant to instill fear.
4. **Symbol of Danger**: The hound symbolizes an ever-present danger and impending
doom, contributing to the atmosphere of suspense and dread in the story.
5. **Cunning Device**: It is suggested that the hound might be a cunning device
used by someone to frighten victims, possibly as part of a larger plot
involving human agency.
In summary, the hound serves as both a literal and symbolic threat, driving the
narrative and the characters' actions throughout the story.
We can see that a typo like Sri (instead of Sir) is no problem, while the difference between "Bakerstreet" and "Baker Street" has a significant impact on the answer we get. The same is true for longer questions: when we ask for the murderer and the motive, we get a different answer than when we ask only for the murderer.
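A small experiment hints at why the two cases behave so differently. The two chunks below are made up for illustration and the helper name scores is hypothetical; only the three questions are taken from the session above. The sketch assumes the default settings of the TfidfVectorizer, which works on whole words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up chunks standing in for the book.
chunks = [
    "Holmes returned to his rooms in Baker Street.",
    "Sir Henry Baskerville arrived from Canada.",
]

vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)


def scores(question):
    """Similarity of the question to each chunk."""
    question_vector = vectorizer.transform([question])
    return cosine_similarity(question_vector, chunk_vectors)[0]


# "Sri" is unknown, but "Henry" and "Baskerville" still match the second chunk.
typo_in_title = scores("Who is Sri Henry Baskerville?")
# "Baker" and "Street" both match the first chunk.
split_word = scores("What happens at Baker Street?")
# "Bakerstreet" is a single unknown word, so no word of the question matches.
joined_word = scores("What happens at Bakerstreet?")

print(typo_in_title, split_word, joined_word)
```

A typo only hurts when it hits the words that carry the match. In "Sri Henry Baskerville" the other two words still find the right chunk, while "Bakerstreet" leaves the question without a single known word.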
I experimented with different chunk sizes and the number of top results I put into the LLM. The more specific your questions are, the more you profit from a larger number of smaller chunks. But as we see with the request for a summary, the chunks themselves then become a problem: when we only work with parts of the text, we are unable to get a summary of the whole story.
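To get a feeling for how chunk size and overlap drive the number of chunks, here is a minimal sketch of a fixed-size splitter. It is a simplification and not the recursive text splitter the bot uses; the function name split_with_overlap and the dummy text are made up for illustration.

```python
def split_with_overlap(text, chunk_size=1000, overlap=100):
    """Cut text into fixed-size chunks where neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Dummy text with the same length as the text extracted from our PDF.
text = "x" * 382335
for size in (500, 1000, 2000):
    print(size, len(split_with_overlap(text, chunk_size=size)))
```

With 1000 characters per chunk and an overlap of 100, this simple splitter produces 425 chunks for a text of this length. The recursive splitter reported 474, presumably because it prefers to cut at paragraph and sentence boundaries and therefore often produces chunks that are shorter than the maximum.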
Limitations
With this approach we solved the problem of the context window limitation. By splitting the text into chunks and only feeding the most relevant parts to the LLM, the size of the PDF file is no longer relevant. We can work with PDF files of any size and get a decent result, as long as we reuse the words in the text. The TfidfVectorizer is not very flexible when it comes to typos or detecting similar words.
Another severe limitation is that we need to run the fit() method over the whole document to tune the vector creation. That is no problem as long as we only work with a single document, but when we have multiple files and want to load data incrementally, that will be a challenge.
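The following sketch illustrates the problem with incremental loading. The two one-sentence "documents" are made up, and the default settings of the TfidfVectorizer are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents that are loaded one after the other.
first_document = ["The hound ran across the moor."]
second_document = ["A scandal in Bohemia troubled the king."]

# Fit on the first document only, as we would before the second one exists.
vectorizer = TfidfVectorizer()
vectorizer.fit(first_document)
size_before = len(vectorizer.vocabulary_)

# The second document is vectorised with the old vocabulary:
# every word that the first document did not contain is silently dropped.
new_vector = vectorizer.transform(second_document)

# Fitting again on both documents fixes that, but the vocabulary grows,
# so vectors created before the second fit no longer have the right shape.
vectorizer.fit(first_document + second_document)
size_after = len(vectorizer.vocabulary_)

print(size_before, new_vector.nnz, size_after)
```

Every new document can therefore force us to fit again and to recreate all the vectors we already have.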
We currently need the vector and the text to be present for our solution. While this works with small data sets, it can be a challenge if we want to scale to include all our PDF files.
All these issues are solvable, and I encourage my readers to continue experimenting with this approach. In the meantime, I will cover a few other topics before continuing this journey.
Next
We now have a text splitter in place that turns our PDF file into manageable chunks. That allows us to work only with the relevant sections of a PDF file instead of the whole document, which gives us more flexibility and removes the size limit. Next week we look at the magic behind the | in the syntax to create our LangChain chains.