#271: Querying Data With DuckDB

DuckDB is a fast database, that allows us to use SQL to interact with our data. But that is not where the story ends. While we can use a lot of common SQL syntax, we get a few helpful features that help us to work more effectively with our data. Let us explore them.

Select as usual

We can use select to fetch the rows we are interested in. As with many other systems, we can use a * to fetch all columns:

import duckdb

r = duckdb.read_csv("people.csv")

duckdb.sql("SELECT * FROM r").show()

This gives us the optimised output for the command line containing all columns of the table.

If we only care for some columns, we can specify them explicitly. If the column names have spaces, we need to use ” around the names - ’ will not work:

duckdb.sql('SELECT "First Name", "Last Name", Age FROM r').show()

Now we only get the columns back that we requested.

Select with a twist

If we want to have all columns except one, we can specify all other columns as we would do in any other database. But when we query DuckDB, we can use the * and combine it with the EXCLUDE keyword. We can now pass the column name that we want to exclude:

duckdb.sql("SELECT * EXCLUDE (Country) FROM r").show()

We get the table without the Country column.

The EXCLUDE keyword accepts a list of columns, that allows us to exclude multiple columns with this syntax:

duckdb.sql("SELECT * EXCLUDE (Country, Age) FROM r").show()

We now get the table but without the Country and the Age columns.

If we want to run a transformation on a specific column, we can use a similar syntax. To turn the country names into lower case, we can use this command:

duckdb.sql("SELECT * REPLACE (lower(Country) AS Country) FROM r").show()

We now get all columns back, but the country names are now in lower case.

If we need a range of numbers, we can use the table function range() and use it like the (range function from Python](https://pythonfriday.dev/2020/02/6-control-structures/):

duckdb.sql("SELECT t.i FROM range(10, 25) AS t(i)").show()

We get a table with one column containing the numbers from 10 to 24.

Joins

We can join two or more tables with the JOIN keyword. It is best to give each involved table an alias and then specify the columns we want to match in the format alias.column:

duckdb.sql("SELECT * FROM range(10, 25) AS t(i) JOIN r on r.Age == t.i").show()

This joins our value of the range function with the matching age of the people.csv file.

Group values

Especially in combination with count(*) we need to specify what columns we want to group. For many basic queries DuckDB can offer us a shortcut with the GROUP BY ALL helper where we do not need to list the columns explicitly:

duckdb.sql("SELECT Country, Age, count(*) FROM r GROUP BY ALL").show()

We get a grouped result without the need to specify by what columns we want to group.

This is a great help when we build up the query and do not need to keep repeating the same column names over and over again.

Limit and offset

When we want to page through our data, we often need to combine LIMIT (how many rows we want) and OFFSET (how many rows we want to skip). In DuckDB we can use these two keywords as we would in SQL Server:

duckdb.sql("SELECT * FROM r ORDER BY Id LIMIT 5 OFFSET 20").show()

We get 5 rows of our data after we skipped 20 from our data that we ordered by Id.

Get a sample

To analyse data, it is often helpful to take a glimpse into the table. In DuckDB there is a function for this where we can specify how many rows we want:

duckdb.sql("SELECT * FROM r TABLESAMPLE 5 ROWS").show()

We get a selection of rows that matches the number of rows we requested.

According to the documentation we should be able to use this command with a percentage value. However, with my small CSV file I usually ended up getting nothing back.

Another function to get a sample of the data back is USING SAMPLE that we can use like this:

duckdb.sql("SELECT * FROM r USING SAMPLE  5 ROWS;").show()

With this approach we can use the percentage value and use the Bernoulli sampling method:

duckdb.sql("SELECT t.i FROM range(1000) AS t(i) USING SAMPLE 1% (bernoulli);").show()