#304: Chat With CSV Files in LangChain

If you have a CSV file to explore and no idea what it contains, asking broad questions about it is a great help. With LangChain and a few additional packages we can build a chat bot that lets us do exactly that. Let us figure out what we need to make it happen.

Attention: Dangerous code ahead

The solution we are going to build relies on dynamically generated code that runs as our user. Since we cannot know in advance what that code will do, it is not a good idea to run it on a production system or even on a computer you use for your daily work. Since I do not have a spare machine, I will use Docker to run the code inside a container.

If you decide to run the code on your machine without any protection, you are on your own when it comes to fixing the resulting problems.

Create the CSV agent

For this post we use the Titanic dataset, which we can load from the Seaborn samples and store on disk as titanic.csv. We configure our LLM as usual, but since the code runs in a Docker container while the model server runs on the host, we need to replace localhost with host.docker.internal in the base URL.
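
To avoid editing the script each time it moves between the host and the container, the base URL could be read from an environment variable. This is a minimal sketch, not part of the original script: the variable name LLM_BASE_URL and the helper resolve_base_url() are hypothetical.

```python
import os

# Default taken from the script in this post; used when nothing else is set.
DEFAULT_BASE_URL = "http://host.docker.internal:1234/v1"


def resolve_base_url(env=None):
    """Return the LLM endpoint, preferring LLM_BASE_URL when it is set.

    LLM_BASE_URL is a made-up variable name for this sketch.
    """
    env = os.environ if env is None else env
    # Strip whitespace so an empty or blank variable falls back to the default.
    value = env.get("LLM_BASE_URL", "").strip()
    return value or DEFAULT_BASE_URL


if __name__ == "__main__":
    print(resolve_base_url())
```

The result would then be passed as openai_api_base=resolve_base_url() when the LLM is defined, and the variable could be set with the -e option of docker run.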

The magic of this agent happens in the create_csv_agent() function. It gives us a complete abstraction: an AI agent that takes our question, turns it into a prompt, and adds our data to it. The allow_dangerous_code=True flag is required because the agent executes the Python code that the model generates. We finish our script with the usual loop where the user can ask one or more questions.

import pandas as pd
import seaborn as sns
from langchain_experimental.agents.agent_toolkits import create_csv_agent
from langchain_openai import ChatOpenAI

# 1. Load Titanic data set from seaborn and store it as CSV file
csv_path = "titanic.csv"
sns.load_dataset("titanic").to_csv(csv_path, index=False)

# 2. Define the LLM
llm = ChatOpenAI(
    model="mistral",
    openai_api_base="http://host.docker.internal:1234/v1",
    openai_api_key="not-needed",
    temperature=0.1
)

# 3. Create CSV agent
agent = create_csv_agent(
    llm,
    csv_path,
    verbose=False,
    allow_dangerous_code=True
)

# 4. Loop for questions
print("✅ CSV Agent ready! Ask me anything about the Titanic dataset.")
print("Type 'exit' to quit.\n")

while True:
    query = input("🧠 Question: ").strip()
    if query.lower() in ["exit", "quit"]:
        print("👋 Bye!")
        break
    try:
        answer = agent.invoke(query)
        # Single quotes inside the f-string keep this valid before Python 3.12 as well
        print(f"💬 Answer: {answer['output']}\n")
    except Exception as e:
        print(f"⚠️ Error: {e}\n")
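
To make the warning about dangerous code more concrete, here is a rough sketch of the idea behind such an agent: a preview of the dataframe goes into the prompt, and the code the model writes is executed with the dataframe in scope. This is not LangChain's implementation. The helpers build_prompt() and run_generated_code() are hypothetical, and the "generated" expression is hard-coded in place of a real model reply.

```python
import pandas as pd


def build_prompt(df, question, head_rows=5):
    """Combine a preview of the dataframe with the user's question."""
    preview = df.head(head_rows).to_string(index=False)
    return (
        "You are working with a pandas dataframe named `df`.\n"
        f"First rows:\n{preview}\n\n"
        f"Question: {question}\n"
        "Reply with a single Python expression that answers it."
    )


def run_generated_code(df, code):
    """Evaluate model-written code with `df` in scope.

    This is the dangerous part: the expression runs with the same
    permissions as the script itself.
    """
    return eval(code, {"pd": pd, "df": df})


if __name__ == "__main__":
    # Tiny made-up frame so the sketch runs without the CSV file.
    df = pd.DataFrame({"survived": [0, 1, 1], "fare": [7.25, 71.28, 8.05]})
    print(build_prompt(df, "How many passengers survived?"))
    # Stand-in for the model's reply; a real agent would get this from the LLM.
    generated = "int(df['survived'].sum())"
    print(run_generated_code(df, generated))
```

Nothing stops the model from replying with an expression that deletes files or opens network connections, which is why the container is worth the extra effort.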

Prepare a Docker container

For the CSV agent we need the langchain_experimental package, which has not yet been updated for LangChain version 1. For this reason, we need a csv_requirements.txt file with explicit version numbers for the packages we use:

langchain==0.3.27
langchain-community==0.3.31
langchain-core==0.3.79
langchain-experimental==0.3.4
langchain-openai==0.3.35
langchain-text-splitters==0.3.11
langsmith==0.4.37
numpy==2.3.4
openai==2.4.0
pandas==2.3.3
seaborn==0.13.2
tabulate==0.9.0
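
Because the exact versions matter here, it can help to confirm that the environment really contains what the file pins. The following is a small sketch using only the standard library. The helper names parse_pins() and find_mismatches() are made up for this example, and it only understands plain name==version lines.

```python
from importlib import metadata


def parse_pins(text):
    """Map package name to pinned version for every `name==version` line."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)}; installed is None if missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    # Two lines copied from the requirements file above as a demo input.
    sample = "langchain==0.3.27\npandas==2.3.3\n"
    for name, (pinned, installed) in find_mismatches(parse_pins(sample)).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

Inside the container, the sample string would be replaced by the contents of csv_requirements.txt.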

In our Dockerfile we use the official Docker image for Python 3.13 and refresh the system package index. After that, we install our dependencies and copy our CSV agent into the container:

FROM python:3.13-trixie

RUN apt-get update -qq

WORKDIR /workspace

# Install dependencies first so this layer stays cached when only the script changes
COPY csv_requirements.txt /workspace/csv_requirements.txt
RUN pip install -U -r csv_requirements.txt

COPY csv_agent.py /workspace/csv_agent.py

CMD ["/bin/bash"]

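With these files in place, we can build the image and run a container. On Linux with Docker Engine, host.docker.internal is usually not defined by default; adding --add-host=host.docker.internal:host-gateway to the docker run command maps it to the host. The basic commands are: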

docker build -t csv-container .
docker run -it csv-container

Ask questions about the CSV file

We can now start our CSV agent inside the container with python csv_agent.py and ask questions about the Titanic data:

✅ CSV Agent ready! Ask me anything about the Titanic dataset.
Type 'exit' to quit.

🧠 Question: Describe the data

Final Answer:
The dataframe `df` contains 891 entries with 15 columns. Here is a detailed 
description:

### Column Information:
1. **survived**: Binary indicator (0 = No, 1 = Yes) showing whether the 
    passenger survived. It has 891 non-null integer values.
2. **pclass**: Passenger class (1 = First, 2 = Second, 3 = Third). It has 891 
    non-null integer values.
3. **sex**: Gender of the passenger (male/female). It has 891 non-null object
     (string) values.
4. **age**: Age of the passenger in years. It has 714 non-null float values, 
    indicating some missing data.
5. **sibsp**: Number of siblings/spouses aboard the Titanic. It has 891 non-null 
    integer values.
6. **parch**: Number of parents/children aboard the Titanic. It has 891 non-null 
    integer values.
7. **fare**: Passenger fare. It has 891 non-null float values.
8. **embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, 
    S = Southampton). It has 889 non-null object (string) values, indicating some 
    missing data.
9. **class**: Passenger class derived from pclass (First, Second, Third). It has 
    891 non-null object (string) values.
10. **who**: Type of passenger (man, woman, child). It has 891 non-null object 
    (string) values.
11. **adult_male**: Boolean indicator showing if the passenger is an adult male. 
    It has 891 non-null boolean values.
12. **deck**: Deck location on the ship. It has 203 non-null object (string) 
    values, indicating significant missing data.
13. **embark_town**: Town of embarkation (Cherbourg, Queenstown, Southampton). It 
    has 889 non-null object (string) values, indicating some missing data.
14. **alive**: Indicator showing if the passenger survived (yes/no). It has 891 
    non-null object (string) values.
15. **alone**: Boolean indicator showing if the passenger was alone (no siblings/
    spouses or parents/children aboard). It has 891 non-null boolean values.

### Descriptive Statistics for Numerical Columns:
- **survived**:
  - Mean: ~0.384 (38.4% survival rate)
  - Standard deviation: ~0.487
  - Min: 0, Max: 1

- **pclass**:
  - Mean: ~2.309
  - Standard deviation: ~0.836
  - Min: 1, Max: 3

- **age**:
  - Mean: ~29.7
  - Standard deviation: ~14.53
  - Min: 0.42, Max: 80

- **sibsp**:
  - Mean: ~0.523
  - Standard deviation: ~1.103
  - Min: 0, Max: 8

- **parch**:
  - Mean: ~0.382
  - Standard deviation: ~0.806
  - Min: 0, Max: 6

- **fare**:
  - Mean: ~32.204
  - Standard deviation: ~49.693
  - Min: 0, Max: 512.329

### Unique Values for Categorical Columns:
- **sex**: 2 unique values (male, female)
- **embarked**: 3 unique values (S, C, Q)
- **class**: 3 unique values (Third, First, Second)
- **who**: 3 unique values (man, woman, child)
- **deck**: Multiple unique values (A, B, C, D, E, F, G, nan)
- **embark_town**: 3 unique values (Southampton, Cherbourg, Queenstown)
- **alive**: 2 unique values (no, yes)


🧠 Question: what is the relation between fare and survived?
💬 Answer: The average fare for passengers who survived was approximately 48.40, 
while for those who did not survive, it was about 22.12. The correlation 
coefficient between 'fare' and 'survived' is around 0.257, indicating a moderate 
positive relationship: passengers who paid higher fares had a higher likelihood 
of surviving.
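
Numbers like these are easy to recheck with a few lines of pandas. This is a minimal sketch, assuming the column names fare and survived from the dataset. The helper fare_vs_survival() is made up for this example, and the demo frame contains invented values so the code runs without the CSV file.

```python
import pandas as pd


def fare_vs_survival(df):
    """Return mean fare per survival group and the Pearson correlation."""
    mean_fare = df.groupby("survived")["fare"].mean()
    correlation = df["fare"].corr(df["survived"])
    return mean_fare, correlation


if __name__ == "__main__":
    # Made-up sample; swap in pd.read_csv("titanic.csv") to check the
    # agent's numbers against the real data.
    sample = pd.DataFrame(
        {"survived": [0, 0, 1, 1], "fare": [10.0, 20.0, 30.0, 60.0]}
    )
    mean_fare, correlation = fare_vs_survival(sample)
    print(mean_fare)
    print(f"Pearson r: {correlation:.3f}")
```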

Questions matter

The solution we have looks impressive, and it is easy to forget that a text generator is still involved. If we are not careful about how we phrase our questions, we may end up with results that do not correctly represent the data in our CSV file. This is what the Qwen3 model said about the CSV file:

🧠 Question: how many people embarked in which town?
💬 Answer: The number of people who embarked in each town is as follows:  
- Southampton: 887 people  
- Cherbourg: 2 people  
- Queenstown: 1 person

🧠 Question: how many people embarked in which port?
💬 Answer: The number of people who embarked in each port is as follows:  
- Southampton (S): 644 people  
- Cherbourg (C): 168 people  
- Queenstown (Q): 79 people

The CSV file has two columns, "embarked" and "embark_town", that store the port of embarkation. "embarked" holds a one-letter key, while "embark_town" contains the name of the port. There are no mix-ups in the data, so both questions should have given us the same result. Instead, we got two convincing answers that contradict each other. Neither total matches the 889 non-null values reported for these columns earlier: the first answer adds up to 890 passengers, the second to 891.

Be aware that this can happen with any question you ask. Therefore, if you find something interesting with the CSV agent, make sure to recheck it against the data outside of an LLM.
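
For the embarkation question, such a recheck could look like the sketch below. It assumes the column names embarked and embark_town from the dataset. The helpers embark_counts() and columns_agree() are made up for this example, and the demo frame is invented so the code runs without the CSV file.

```python
import pandas as pd


def embark_counts(df):
    """Count passengers per port using both columns, missing values included."""
    by_code = df["embarked"].value_counts(dropna=False)
    by_town = df["embark_town"].value_counts(dropna=False)
    return by_code, by_town


def columns_agree(df):
    """True if each port code pairs with exactly one town name and vice versa.

    Rows with a missing value in either column are ignored.
    """
    pairs = df[["embarked", "embark_town"]].dropna().drop_duplicates()
    return bool(pairs["embarked"].is_unique and pairs["embark_town"].is_unique)


if __name__ == "__main__":
    # Made-up sample; swap in pd.read_csv("titanic.csv") for the real check.
    sample = pd.DataFrame(
        {
            "embarked": ["S", "S", "C", "Q", None],
            "embark_town": [
                "Southampton", "Southampton", "Cherbourg", "Queenstown", None,
            ],
        }
    )
    by_code, by_town = embark_counts(sample)
    print(by_code)
    print(by_town)
    print("Columns agree:", columns_agree(sample))
```

If both counts match and the columns agree, any difference between the two answers above comes from the model and not from the data.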