#304: Chat With CSV Files in LangChain
If you have a CSV file to explore and have no idea what is going on, asking broad questions would be a great help. By using LangChain and some additional packages we can build a chat bot that allows us to do that. Let us figure out what we need to make it happen.
Attention: Dangerous code ahead
The solution we are going to build relies on dynamically generated code that runs as our user. Since we cannot know in advance what that code will do, it is not a good idea to run it on a production system or even on a computer you use for your daily activities. Since I do not have a spare machine, I will use Docker to run that code inside a container.
If you decide to run that code on your machine without any protection, it is up to you, and you are on your own to fix the resulting problems.
Create the CSV agent
For this post we use the Titanic dataset, which we can get from the Seaborn samples and store on disk as titanic.csv. We can configure our LLM as usual, but since we want to run this code in a Docker container, we need to replace localhost with host.docker.internal in the server address.
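A rough sketch of this preparation step could look like the following. It assumes Seaborn's load_dataset() helper, which downloads the sample data on first use. The function names save_titanic_csv and to_docker_host are made up for illustration, and the port in the example URL is a placeholder, not necessarily the one used for this post.

```python
from urllib.parse import urlsplit, urlunsplit


def save_titanic_csv(path="titanic.csv"):
    """Fetch the Titanic sample from Seaborn and write it to disk as CSV."""
    # Imported lazily so the URL helper below works without seaborn installed.
    import seaborn as sns

    df = sns.load_dataset("titanic")  # needs network access on first use
    df.to_csv(path, index=False)
    return path


def to_docker_host(url):
    """Swap localhost for host.docker.internal so a container can reach the host."""
    parts = urlsplit(url)
    if parts.hostname not in ("localhost", "127.0.0.1"):
        return url  # already points somewhere else, leave it untouched
    netloc = "host.docker.internal"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    # Placeholder endpoint: adjust port and path to your own LLM server.
    print(to_docker_host("http://localhost:1234/v1"))
```

Docker Desktop resolves host.docker.internal out of the box. On a plain Linux Docker engine you may have to start the container with --add-host=host.docker.internal:host-gateway for the name to resolve.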
The magic of this agent happens in the create_csv_agent() function. It gives us a full abstraction in the form of an AI agent that takes our question, turns it into a prompt, and adds our data to it. We finish our script with the usual loop where the user can ask one or more questions.
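A minimal sketch of what such a csv_agent.py could look like is shown below; it is not the original script from this post. The endpoint URL, the port, the model name, and the dummy API key are assumptions for illustration, as are the helper names build_agent, is_exit, and chat. Only create_csv_agent() and ChatOpenAI come from the pinned LangChain packages.

```python
import os

CSV_PATH = "titanic.csv"
# Assumptions: endpoint, port, and model name are placeholders for illustration.
BASE_URL = os.environ.get("LLM_BASE_URL", "http://host.docker.internal:1234/v1")
MODEL = os.environ.get("LLM_MODEL", "qwen3")


def build_agent(csv_path=CSV_PATH):
    """Create a CSV agent that answers questions by running generated pandas code."""
    # Imported here so the rest of the file loads even without LangChain installed.
    from langchain_experimental.agents.agent_toolkits import create_csv_agent
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        model=MODEL,
        base_url=BASE_URL,
        api_key="not-needed",  # assumption: the local server ignores the key
        temperature=0,
    )
    # allow_dangerous_code=True is an explicit opt-in: the agent executes
    # LLM-generated Python, which is why this belongs in a container.
    return create_csv_agent(llm, csv_path, verbose=False, allow_dangerous_code=True)


def is_exit(text):
    """Return True when the user wants to leave the question loop."""
    return text.strip().lower() == "exit"


def chat(agent):
    """Ask questions until the user types 'exit'."""
    print("✅ CSV Agent ready! Ask me anything about the Titanic dataset.")
    print("Type 'exit' to quit.")
    while True:
        question = input("🧠 Question: ")
        if is_exit(question):
            break
        if not question.strip():
            continue  # ignore empty input instead of sending it to the LLM
        result = agent.invoke({"input": question})
        print(f"💬 Answer: {result['output']}")
```

Inside the container, a final chat(build_agent()) call at the bottom of the file would start the session. Keeping the imports inside build_agent() is a design choice for this sketch only; in a real script they can live at the top of the file.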
Prepare a Docker container
For the CSV agent we need the langchain_experimental package, which has not yet been updated to LangChain version 1. For this reason, we need a csv_requirements.txt file with explicit version numbers of the packages we use:
langchain==0.3.27
langchain-community==0.3.31
langchain-core==0.3.79
langchain-experimental==0.3.4
langchain-openai==0.3.35
langchain-text-splitters==0.3.11
langsmith==0.4.37
numpy==2.3.4
openai==2.4.0
pandas==2.3.3
seaborn==0.13.2
tabulate==0.9.0
In our Dockerfile we use the official Docker image for Python 3.13 and refresh the system package index. When this is done, we copy our requirements file and our CSV agent into the container and install our dependencies:
FROM python:3.13-trixie
RUN apt-get update -qq
WORKDIR /workspace
COPY csv_requirements.txt /workspace/csv_requirements.txt
COPY csv_agent.py /workspace/csv_agent.py
RUN pip install -U -r csv_requirements.txt
CMD ["/bin/bash"]
With those two files in place, we can create and run our container:
Ask questions about the CSV file
We can now run our CSV agent inside the container and start asking questions about the Titanic data:
✅ CSV Agent ready! Ask me anything about the Titanic dataset.
Type 'exit' to quit.
🧠 Question: Describe the data
Final Answer:
The dataframe `df` contains 891 entries with 15 columns. Here is a detailed
description:
### Column Information:
1. **survived**: Binary indicator (0 = No, 1 = Yes) showing whether the
passenger survived. It has 891 non-null integer values.
2. **pclass**: Passenger class (1 = First, 2 = Second, 3 = Third). It has 891
non-null integer values.
3. **sex**: Gender of the passenger (male/female). It has 891 non-null object
(string) values.
4. **age**: Age of the passenger in years. It has 714 non-null float values,
indicating some missing data.
5. **sibsp**: Number of siblings/spouses aboard the Titanic. It has 891 non-null
integer values.
6. **parch**: Number of parents/children aboard the Titanic. It has 891 non-null
integer values.
7. **fare**: Passenger fare. It has 891 non-null float values.
8. **embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown,
S = Southampton). It has 889 non-null object (string) values, indicating some
missing data.
9. **class**: Passenger class derived from pclass (First, Second, Third). It has
891 non-null object (string) values.
10. **who**: Type of passenger (man, woman, child). It has 891 non-null object
(string) values.
11. **adult_male**: Boolean indicator showing if the passenger is an adult male.
It has 891 non-null boolean values.
12. **deck**: Deck location on the ship. It has 203 non-null object (string)
values, indicating significant missing data.
13. **embark_town**: Town of embarkation (Cherbourg, Queenstown, Southampton). It
has 889 non-null object (string) values, indicating some missing data.
14. **alive**: Indicator showing if the passenger survived (yes/no). It has 891
non-null object (string) values.
15. **alone**: Boolean indicator showing if the passenger was alone (no siblings/
spouses or parents/children aboard). It has 891 non-null boolean values.
### Descriptive Statistics for Numerical Columns:
- **survived**:
  - Mean: ~0.384 (38.4% survival rate)
  - Standard deviation: ~0.487
  - Min: 0, Max: 1
- **pclass**:
  - Mean: ~2.309
  - Standard deviation: ~0.836
  - Min: 1, Max: 3
- **age**:
  - Mean: ~29.7
  - Standard deviation: ~14.53
  - Min: 0.42, Max: 80
- **sibsp**:
  - Mean: ~0.523
  - Standard deviation: ~1.103
  - Min: 0, Max: 8
- **parch**:
  - Mean: ~0.382
  - Standard deviation: ~0.806
  - Min: 0, Max: 6
- **fare**:
  - Mean: ~32.204
  - Standard deviation: ~49.693
  - Min: 0, Max: 512.329
### Unique Values for Categorical Columns:
- **sex**: 2 unique values (male, female)
- **embarked**: 3 unique values (S, C, Q)
- **class**: 3 unique values (Third, First, Second)
- **who**: 3 unique values (man, woman, child)
- **deck**: Multiple unique values (A, B, C, D, E, F, G, nan)
- **embark_town**: 3 unique values (Southampton, Cherbourg, Queenstown)
- **alive**: 2 unique values (no, yes)
🧠 Question: what is the relation between fare and survived?
💬 Answer: The average fare for passengers who survived was approximately 48.40,
while for those who did not survive, it was about 22.12. The correlation
coefficient between 'fare' and 'survived' is around 0.257, indicating a moderate
positive relationship: passengers who paid higher fares had a higher likelihood
of surviving.
Questions matter
The solution we have looks impressive, and it is easy to forget that a text generator is still involved. If we are not careful about how we formulate our questions, we may end up with results that do not correctly represent the data in our CSV file. This is what the Qwen3 model said about the CSV file:
🧠 Question: how many people embarked in which town?
💬 Answer: The number of people who embarked in each town is as follows:
- Southampton: 887 people
- Cherbourg: 2 people
- Queenstown: 1 person
🧠 Question: how many people embarked in which port?
💬 Answer: The number of people who embarked in each port is as follows:
- Southampton (S): 644 people
- Cherbourg (C): 168 people
- Queenstown (Q): 79 people
The data in the CSV file has two columns, "embarked" and "embark_town", to store the port of embarkation. In "embarked" we get a key, while "embark_town" contains the name of the port. There are no mix-ups in the data, so both questions above should have given us the same results. Instead, we got two convincing answers that contradict each other.
Be aware that this could happen with every question you ask. Therefore, if you find something interesting with the CSV agent, make sure that you verify it against the data outside of an LLM.
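Such a recheck does not need much code. The sketch below uses only the Python standard library to count the values in both columns and to test whether they agree once the keys are translated into port names. The key-to-name mapping is taken from the column description in the agent output above; the function names count_values, embark_counts_agree, and report are made up for illustration.

```python
import csv
from collections import Counter

# Mapping taken from the "embarked" column description in the agent output.
PORT_NAMES = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}


def count_values(csv_path, column):
    """Count how often each value occurs in one column (missing cells count as '')."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[column] for row in csv.DictReader(handle))


def embark_counts_agree(csv_path):
    """Check that 'embarked' and 'embark_town' describe the same distribution."""
    by_key = count_values(csv_path, "embarked")
    by_town = count_values(csv_path, "embark_town")
    translated = Counter()
    for key, count in by_key.items():
        # Unknown keys (including the empty string) are kept as they are.
        translated[PORT_NAMES.get(key, key)] += count
    return translated == by_town


def report(csv_path):
    """Print the counts per port name and whether both columns agree."""
    for town, count in count_values(csv_path, "embark_town").most_common():
        print(f"{town or '(missing)'}: {count}")
    print("columns agree:", embark_counts_agree(csv_path))
```

Calling report("titanic.csv") prints the counts to compare against the agent's answers. With pandas, df["embark_town"].value_counts(dropna=False) gives a comparable tally. It is also worth comparing the totals: the second answer above adds up to 891 passengers, although the data description reports only 889 non-null values in that column, so neither answer should be taken at face value.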
Next
The idea of talking to a CSV file is interesting and offers some helpful approaches to better understand the data. However, as long as we cannot fully trust the answers, this stays at a toy level and may be too dangerous to run against real data.
Next week we try another approach to answer questions on our data by building a chat bot that talks to a database.