Skip to content

#274: Working With Excel Files in Pandas

Last week we saw how we could work seamlessly with Excel in DuckDB. If you have Pandas and want to get a similar experience, this post is for you.

Install openpyxl

If we try to open an Excel file with Pandas, we may run into this error:

ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.

We can fix this problem by installing openpyxl with this command:

pip install openpyxl

Optional: Install pyarrow

If you want more than int and object as the data types for your columns of your Excel file, you should install pyarrow as well:

pip install pyarrow

Load an Excel file into a data frame

To load an Excel file with the first column as its index, we can use this command:

1
2
3
import pandas as pd

df = pd.read_excel('../DuckDB/demo.xlsx', index_col=0, dtype_backend='pyarrow')

This gives us a data frame that we can work with like we did with any other data frame:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, 1 to 2000
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   First Name  2000 non-null   string[pyarrow]
 1   Last Name   2000 non-null   string[pyarrow]
 2   Gender      2000 non-null   string[pyarrow]
 3   Country     2000 non-null   string[pyarrow]
 4   Age         2000 non-null   int64[pyarrow]
 5   Date        2000 non-null   string[pyarrow]
 6   Id          2000 non-null   int64[pyarrow]
dtypes: int64[pyarrow](2), string[pyarrow](5)
memory usage: 162.1 KB
None

>>> df
     First Name   Last Name  Gender        Country  Age        Date    Id
1         Dulce       Abril  Female  United States   32  15/10/2017  1562
2          Mara   Hashimoto  Female  Great Britain   25  16/08/2016  1582
3        Philip        Gent    Male         France   36  21/05/2015  2587
4      Kathleen      Hanner  Female  United States   25  15/10/2017  3549
5       Nereida     Magwood  Female  United States   58  16/08/2016  2468
...         ...         ...     ...            ...  ...         ...   ...
1996       Roma  Lafollette  Female  United States   34  15/10/2017  2654
1997     Felisa        Cail  Female  United States   28  16/08/2016  6525
1998   Demetria       Abbey  Female  United States   32  21/05/2015  3265
1999     Jeromy        Danz    Male  United States   39  15/10/2017  3265
2000   Rasheeda      Alkire  Female  United States   29  16/08/2016  6125

In post #186 I collected the most helpful ways to filter data in Pandas.

Store a data frame in an Excel file

We can use this command to write a data frame to an Excel file:

1
2
3
data = df[df.Country == 'Great Britain']

data.to_excel('output.xlsx', index=True, sheet_name='GB')

The index=True keeps the index column while sheet_name='GB' sets the name of the sheet:

We got our data as an Excel file on the sheet named GB.

If you need more control when writing the Excel file, the library XlsxWriter may be a great starting place.

Conclusion

With the help of openpyxl and pyarrow we can access Excel files from Pandas and use them like any other data source. This is an immense help if you must work with Excel files and want to automate your manual workflow.