#269: Introducing DuckDB

DuckDB is an in-process SQL OLAP (Online Analytical Processing) database designed for efficient analytical queries. Known as the "SQLite for analytics", it integrates seamlessly with Python, making it ideal for data analysis. Let us explore the benefits it can bring to our tool box. We focus on the Python integration, but you can use DuckDB with many more programming languages.

Why choose DuckDB?

The main benefits of DuckDB for me are these points:

Simplicity and Integration: DuckDB runs within the host application. Therefore, we do not need a separate database server and can run SQL queries on our data structures (like Pandas DataFrames) without any additional work.
Performance: Optimized for analytical queries, DuckDB utilizes a columnar storage format and vectorized query execution, enabling rapid data processing even on large datasets that are larger than our memory.
Flexibility: DuckDB supports various data formats, including CSV, Parquet, and JSON, and can query data from local files or cloud storage services like S3.

Where DuckDB is not the right fit

DuckDB is made for analytics, not for transactions. If you want to write a lot of small transactions, you will not be happy with the performance you get. You can use it that way, but other systems like SQLite, are much better suited for such a workload.

Installation

We can install DuckDB using uv or pip:

uv pip install duckdb

Querying data

The main reason to use DuckDB is to query data, where we can choose from these options: • Query a (CSV) file directly:

import duckdb

duckdb.sql("SELECT * FROM file_example_CSV_100.csv").show()

• Read a (CSV) file into a relation and then query it:

r = duckdb.read_csv("file_example_CSV_100.csv")
duckdb.sql("SELECT * FROM r").show()

• Query a Pandas data frame:

import pandas as pd
df = pd.read_csv('file_example_CSV_100.csv')
duckdb.sql("SELECT * FROM df").show()

If we run any of the options from above, the show() method will give us a console friendly output:

We can see the columns with their data type and a compact view on the data.

Work with the result

If we want to do something more than just show the result, we can use the fetchall() method and then iterate through our result:

result = duckdb.sql("SELECT Country, count(*) FROM 'file_example_CSV_100.csv' " \
                    "GROUP BY Country ORDER BY Country")
countries = result.fetchall()
for country in countries:
    print(f"{country[0]} - {country[1]}")

France - 24
Great Britain - 28
United States - 48

DuckDB returns results in batches, which is great for handling large result sets. However, when working with smaller sets or for exploratory purposes, this approach may not be the most intuitive.

Convert results into other formats

The fetchall() turns the result into Python objects. If we want to use the result with other Python tools, we can go for a more specific converter that can give us the data back as a Pandas or Polars data frame, an Arrow table or a NumPy array:

pandas_df = duckdb.sql("SELECT 42").df()           # Pandas DataFrame
polars_df = duckdb.sql("SELECT 42").pl()           # Polars DataFrame
arrow_table = duckdb.sql("SELECT 42").arrow()      # Arrow Table
numpy_array = duckdb.sql("SELECT 42").fetchnumpy() # NumPy Arrays

Just make sure that you import the tool you want to export to - otherwise you end up with a ModuleNotFoundError exception.

Write results to the disk

If we want to store the result of our query as a CSV or a Parquet file, we do not need to take a detour through Pandas. Instead, we can write the result directly to the disk:

duckdb.sql("SELECT 42").write_parquet("out.parquet")
duckdb.sql("SELECT 42").write_csv("out.csv")