Skip to content

#273: Query Excel Files With DuckDB

Excel is the backbone of many companies. While it comes with a nice format to share tabular data between departments, it is not ideal to work with it in an automated way. Even with the integrated Python and the development functionality, Excel feels something limited compared to what we can do with regular Python. In this post we use Excel only as a storage format and do the work in plain Python and DuckDB.

Load the Excel extension

To get the capabilities to work with Excel in DuckDB, we need to install the Excel extension. For that we can use this query as we learned in the last post to install and load the extension:

1
2
3
import duckdb

duckdb.sql("INSTALL excel; LOAD excel;")

Read data from Excel

If we have a file named demo.xlsx, we can use this code to read the data and show the optimised view for the console:

duckdb.sql("SELECT * FROM 'demo.xlsx'").show()
We can see that the file contains 2000 rows and 7 columns.

If we want to be more specific on how we load the data, we can use this code:

duckdb.sql("SELECT * FROM read_xlsx('demo.xlsx', header = true, sheet='Sheet1');").show()

This extended way of loading data is useful if we have several sheets with data or if we do not have a header column.

Create a table from Excel

We can create a table based on existing data the same way we did for a CSV file:

duckdb.sql("CREATE TABLE people AS FROM 'demo.xlsx'")
duckdb.sql("SELECT count(*) AS 'rows' FROM people").show()
We got 2000 rows in our new people table.

Query the data

We can use the data directly or load it into a table and then query an Excel sheet as we would do it with any other data source supported by DuckDB:

data = duckdb.sql("SELECT * FROM 'demo.xlsx' WHERE Age > 30 AND Gender = 'Female'")
data.show()
We can filter data directly from Excel

If you need help with the syntax, check out the post on querying data in DuckDB.

Write to Excel

To close the loop with Excel, we can filter data and then write it to (another) Excel file with this code:

1
2
3
4
5
6
with duckdb.connect() as con:
    con.install_extension("excel")
    con.load_extension("excel")

    data = con.sql("SELECT * EXCLUDE (Date) FROM 'demo.xlsx' WHERE Age > 20")
    con.sql("COPY data TO 'output.xlsx' WITH (FORMAT xlsx, HEADER true, SHEET 'filtered')")

I had to switch to the connection-based approach, since the direct one failed with various errors. Depending on the data you want to write and your Excel version, you may end up with a file that Excel needs to repair before it can open it (see issue #15).

Conclusion

With DuckDB we can not only work seamlessly with Python and the objects we have inside our application, but we can even run SQL against an Excel file. This gives us all the flexibility we need to solve problems in our day-to-day work.

This post concludes this short series on DuckDB. What we covered so far should give you a head start for your own exploration. We may return in the future, but first I need to cover a few topics that currently grab my attention.