#265: Working With Parquet Files
Now that we know how to create a large amount of test data with Faker, we should find an efficient way to store the data. Most developers know CSV files, but is there a more efficient format we can use? While searching for an answer, I came across the Parquet format, which sounds like the right tool for this task. Let us find out if this is the case and how we can use it.
In this post we use pyarrow and fastparquet to work with Parquet files, while Pandas will be the topic of the next post.
Apache Parquet and its columnar format
Apache Parquet is a columnar storage file format designed for efficient data processing and retrieval. In traditional row-based storage formats, data is stored sequentially row by row. In contrast, Parquet's columnar format stores data column by column. This structure allows for:
- Efficient Data Retrieval: Reading only the necessary columns reduces I/O operations, leading to faster data access.
- Better Compression: Similar data types within columns enhance compression ratios, saving storage space.
- Optimized Query Performance: Analytical queries that target specific columns execute more efficiently.
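To make the difference between the two layouts more tangible, here is a minimal sketch in plain Python. The sample records are hypothetical, and the snippet only illustrates the idea of row-based versus column-based organisation; it does not reflect how Parquet actually encodes data on disk.

```python
# Row-based layout: one record after the other (hypothetical sample data).
rows = [
    {"id": 1, "name": "Alice", "city": "Vienna"},
    {"id": 2, "name": "Bob", "city": "Berlin"},
    {"id": 3, "name": "Charlie", "city": "Vienna"},
]

# Column-based layout: all values of one column are kept together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row layout: to collect the cities we have to walk through every record.
cities_from_rows = [row["city"] for row in rows]

# Column layout: the cities are already grouped, one lookup is enough.
cities_from_columns = columns["city"]

print(cities_from_rows)
print(cities_from_columns)
```

In the column layout, values of the same type (and often the same content) sit next to each other, which is what makes selective reads and compression work well.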
Working with Parquet files using pyarrow
The pyarrow library, a part of the Apache Arrow project, provides tools for reading and writing Parquet files. Here is how we can use it:
- Installation:
- Writing Data to a Parquet File:
- Reading Data from a Parquet File:
Working with Parquet files using fastparquet
Another library, fastparquet, offers efficient Parquet file processing. Unfortunately, we cannot take the dictionary and persist it right away. Instead, we need to turn it into a Pandas DataFrame to pass it to fastparquet:
- Installation:
- Writing Data to a Parquet File:
- Reading Data from a Parquet File:
Choose between pyarrow and fastparquet
Both pyarrow and fastparquet are great for handling Parquet files in Python. The choice between them depends on specific requirements:
- pyarrow: Offers a comprehensive set of features and is part of the broader Apache Arrow ecosystem. It is suitable for complex data processing tasks.
- fastparquet: Designed for speed and efficiency, making it ideal for quick read/write operations.
For quick scripts where I do not want to go through Pandas, I prefer pyarrow, but your mileage may vary.
Difference in file size
Now that we know how to write Parquet files, it is time to answer the initial question and compare the file size of Parquet with CSV. We use this script to create one million users and store them in both file formats:
We can run the script (it takes about 8 minutes) and then compare the file size:
For our example data set, we can save 70 MB (62%) when we use Parquet with GZIP compression compared to the CSV file. If we do not set an explicit compression, pyarrow will use SNAPPY, which gives us a file of 72 MB. That is still around 37% smaller than the uncompressed CSV file.
Downside
Before we come to an end, there is one important downside with Parquet files: They are binary (and usually compressed) files and not plain text files. Therefore, we cannot just pick any text editor and make changes to the files.
That said, there are lots of tools to work with Parquet files. For VS Code alone, we can find 8 plug-ins. This allows us to make modifications, even if it takes a dedicated plug-in to do that.
Next
We now know what Parquet files are and how we can work with them. The storage improvement is impressive and grows the larger your files are. Next week we explore the Parquet integration in Pandas and see what we need to change when we want to use Parquet files.