#265: Working With Parquet Files

Now that we know how to create a large amount of test data with Faker, we should find an efficient way to store the data. Most developers know CSV files, but is there a more efficient format we can use? In my search for an answer to this question, the Parquet format showed up, and it sounds like the right tool for this task. Let us find out if this is the case and how we can use it.

In this post we use pyarrow and fastparquet to work with Parquet files, while Pandas will be the topic of the next post.

Apache Parquet and its columnar format

Apache Parquet is a columnar storage file format designed for efficient data processing and retrieval. In traditional row-based storage formats, data is stored sequentially row by row. In contrast, Parquet's columnar format stores data column by column. This structure allows for:

  • Efficient Data Retrieval: Reading only the necessary columns reduces I/O operations, leading to faster data access.
  • Better Compression: Similar data types within columns enhance compression ratios, saving storage space.
  • Optimized Query Performance: Analytical queries that target specific columns execute more efficiently.

Working with Parquet files using pyarrow

The pyarrow library, a part of the Apache Arrow project, provides tools for reading and writing Parquet files. Here is how we can use it:

  1. Installation:

    pip install pyarrow
    

  2. Writing Data to a Parquet File:

    import pyarrow as pa
    import pyarrow.parquet as pq
    
    # Sample data
    data = {
        'id': [1, 2, 3],
        'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35]
    }
    
    # Create a Table
    table = pa.table(data)
    
    # Write to Parquet file
    pq.write_table(table, 'people.parquet')
    

  3. Reading Data from a Parquet File:

    import pyarrow.parquet as pq
    
    # Read the Parquet file
    table = pq.read_table('people.parquet')
    
    # Convert to a dictionary
    data = table.to_pydict()
    print(data)
    # {'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]}
    

Working with Parquet files using fastparquet

Another library, fastparquet, offers efficient Parquet file processing. Unfortunately, we cannot take the dictionary and persist it right away. Instead, we need to turn it into a Pandas DataFrame to pass it to fastparquet:

  1. Installation:

    pip install fastparquet
    

  2. Writing Data to a Parquet File:

    from fastparquet import write
    import pandas as pd
    
    # Sample data
    data = {
        'id': [1, 2, 3],
        'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35]
    }
    
    # Create a DataFrame
    df = pd.DataFrame(data)
    
    # Write to Parquet file
    write('people_fp.parquet', df)
    

  3. Reading Data from a Parquet File:

    from fastparquet import ParquetFile
    
    # Read the Parquet file
    pf = ParquetFile('people_fp.parquet')
    data = pf.to_pandas().to_dict(orient='list')
    print(data)
    
    # {'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]}
    

Choose between pyarrow and fastparquet

Both pyarrow and fastparquet are great for handling Parquet files in Python. The choice between them depends on specific requirements:

  • pyarrow: Offers a comprehensive set of features and is part of the broader Apache Arrow ecosystem. It is suitable for complex data processing tasks.
  • fastparquet: Designed for speed and efficiency, making it ideal for quick read/write operations.

For quick scripts where I do not want to go through Pandas, I prefer pyarrow, but your mileage may vary.

Difference in file size

Now that we know how to write Parquet files, it is time to answer the initial question and compare the file size of Parquet with CSV. We use this script to create one million users and store them in both file formats:

import csv
from faker import Faker
import pyarrow.csv as pv
import pyarrow.parquet as pq

def generate_user_data(num_users, chunk_size=100000):
    fake = Faker()
    fieldnames = ['id', 'name', 'email', 'address', 'phone_number', 'birthdate']
    user_id = 1

    with open('users.csv', mode='w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        while user_id <= num_users:
            users = []
            for _ in range(min(chunk_size, num_users - user_id + 1)):
                user = {
                    'id': user_id,
                    'name': fake.name(),
                    'email': fake.email(),
                    'address': fake.address().replace('\n', ', '),
                    'phone_number': fake.phone_number(),
                    'birthdate': fake.date_of_birth().isoformat()
                }
                users.append(user)
                user_id += 1

            writer.writerows(users)
            print(f'Generated {user_id - 1} of {num_users} users')

def convert_csv_to_parquet(csv_file, parquet_file):
    table = pv.read_csv(csv_file)
    pq.write_table(table, parquet_file, compression='GZIP')
    print(f'Converted {csv_file} to {parquet_file}')

if __name__ == '__main__':
    num_users = 1000000
    generate_user_data(num_users)
    convert_csv_to_parquet('users.csv', 'users.parquet')

We can run the script (it takes about 8 minutes) and then compare the file size:

du -hs user*
116M    users.csv
44M     users.parquet

For our example data set, we can save about 72 MB (62%) when we use Parquet with GZIP compression compared to the CSV file. If we do not set an explicit compression, pyarrow will use SNAPPY, which gives us a file of 72 MB that is still around 38% smaller than the uncompressed CSV file.

Downside

Before we come to an end, there is one important downside with Parquet files: They are binary (and usually compressed) files and not plain text files. Therefore, we cannot just pick any text editor and make changes to the files.

That said, there are lots of tools to work with Parquet files. For VS Code alone we can find 8 plug-ins. This allows us to make modifications, even if it takes a dedicated plug-in to do that.

Next

We now know what Parquet files are and how we can work with them. The storage improvement is impressive and grows the larger your files are. Next week we explore the Parquet integration in Pandas and see what we need to change when we want to use Parquet files.