How to Convert CSV to Parquet

Files

CSV to Parquet tutorial.zip (500 KB)

Introduction

In FME, the Apache Parquet writer creates one .parquet file for each feature type. The writer parameters let you choose a compression type, including Uncompressed, ZSTD (well suited to string-heavy columns, such as those holding JSON strings), or Snappy (fast compression at a reasonable compression ratio).

Step-by-Step Instructions

In this scenario, the user needs to convert a CSV file to Parquet for use in a big data system. In FME, this is simply a matter of connecting a CSV reader to a Parquet writer and running the workspace. Follow the steps below to build the workspace from scratch, or open the completed FME template attached to this article.

The data we are working with is a CSV file of public art in Vancouver:
1 CSV public art.PNG

1. Open FME Workbench
Open FME Workbench and start with a blank canvas.

2. Add a CSV Reader
Click “Add Reader” and add the source data to the workspace. Enter the following parameters:

Format: CSV (Comma Separated Value)
Dataset: C:\<Tutorial Download>\public-art.csv
Workflow Options: Individual Feature Types

2 CSV Reader.PNG

Click “Parameters…” and ensure the reader is configured to read the CSV data correctly. In this scenario, the delimiter is a semicolon and the first row contains the field names, so set the parameters as follows:

Feature Type Name(s): From File Name(s)
Delimiter Character: ;
Field Names Line: 1
Data Start Line: 2
Attribute Definition: Automatic

3 CSV Reader parameters.PNG

Click OK to close the Parameters dialog, then click OK again to add the reader feature type to the canvas.
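For readers who want to see what these reader settings amount to outside FME, here is a minimal Python sketch using only the standard library csv module. It is not how FME reads the file. The function name read_semicolon_csv, the "Title" column, and the sample rows are hypothetical, and the sketch assumes the field names are on line 1 with data starting on line 2.

```python
import csv
import io


def read_semicolon_csv(text_stream, delimiter=";"):
    """Return (field_names, rows) from a delimited text stream.

    Line 1 is treated as the field names and data starts on line 2,
    mirroring the reader parameters above.
    """
    reader = csv.DictReader(text_stream, delimiter=delimiter)
    rows = list(reader)  # each row is a dict keyed by field name
    return reader.fieldnames, rows


# Hypothetical sample data, not taken from the tutorial file.
sample = io.StringIO("Title;Type\nExample A;Mural\nExample B;Sculpture\n")
fields, rows = read_semicolon_csv(sample)
print(fields)
print(rows[0]["Type"])
```

To read a real file instead of the in-memory sample, pass a file object opened with newline="" and an encoding that matches the file.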

3. Add an Apache Parquet Writer
Click “Add Writer” and enter the following parameters:

Format: Apache Parquet
Dataset: C:\<Tutorial Download>\public-art-parquet
Feature Type Definition: Copy from Reader...

Open the Parameters dialog to optionally set the Compression Type and File Version. In this scenario, we will leave the default values:

Compression Type: UNCOMPRESSED
File Version: 2.0

4 Parquet writer.PNG

Click OK.
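As a rough code analogue of these writer parameters, the sketch below uses the pyarrow library. This is an assumption made for illustration; it says nothing about how FME writes Parquet internally. COMPRESSION_NAMES, to_pyarrow_codec, and write_parquet are hypothetical names, and the mapping from the labels used in this article to pyarrow codec names is my own. The File Version parameter is left out because pyarrow's own version option may not correspond to it.

```python
# Compression labels as they appear in this article -> pyarrow codec names.
COMPRESSION_NAMES = {
    "UNCOMPRESSED": "NONE",
    "SNAPPY": "SNAPPY",
    "ZSTD": "ZSTD",
}


def to_pyarrow_codec(label):
    """Translate a compression label (case-insensitive) to a pyarrow codec name."""
    try:
        return COMPRESSION_NAMES[label.strip().upper()]
    except KeyError:
        raise ValueError(f"Unsupported compression type: {label!r}") from None


def write_parquet(columns, out_path, compression="UNCOMPRESSED"):
    """Write a dict of column name -> list of values to one Parquet file.

    Returns the number of rows written.
    """
    # Imported lazily so the label mapping works without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table(columns)
    pq.write_table(table, str(out_path), compression=to_pyarrow_codec(compression))
    return table.num_rows
```

For example, write_parquet({"Type": ["Mural", "Sculpture"]}, "demo.parquet", compression="Snappy") would write a two-row file compressed with Snappy.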

4. Connect the Reader and Writer
Connect the reader feature type to the writer feature type. The workspace is now configured to translate the input CSV file to Parquet.

5 CSV to Parquet workspace.PNG

5. Run the Workspace
Run the workspace to convert the CSV data to Parquet. FME writes the file public-art.parquet to the output folder; the file can then be uploaded to a big data system.
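One way to sanity-check the result outside FME is to read the file back and confirm its row count and column names. The sketch below assumes the pyarrow library is installed. The function names expected_output_file and summarize_parquet are hypothetical, and the path helper simply encodes the naming seen in this scenario: one .parquet file per feature type inside the dataset folder.

```python
from pathlib import Path


def expected_output_file(dataset_dir, feature_type_name):
    """Return the path where one feature type's .parquet file should appear."""
    return Path(dataset_dir) / f"{feature_type_name}.parquet"


def summarize_parquet(path):
    """Return (row_count, column_names) for an existing Parquet file."""
    # Assumption: pyarrow is installed. It is imported lazily so the path
    # helper above works without it.
    import pyarrow.parquet as pq

    table = pq.read_table(str(path))
    return table.num_rows, table.column_names
```

With the dataset folder from step 3, expected_output_file(folder, "public-art") points at public-art.parquet, and summarize_parquet on that path should report the same number of rows as the source CSV has data lines.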

6 Output parquet file.PNG

Bonus: Generating Multiple Output Parquet Files

To partition the data into multiple .parquet files, use Feature Type Fanout to create a file for each value in a specified column. For example, in this public art dataset, we could fan out on the “Type” column and output a Parquet file for each type of art.

This CSV file with a Type column:
7 CSV type column.PNG

Becomes these .parquet files, fanned out by Type:

8 Fanout parquet files.PNG

Fanout is specified in the writer feature type properties:

1. Open the Writer Feature Type Properties
With the writer feature type connected to its input, double-click it to open the Feature Type Properties dialog.

2. Select the Fanout Attribute
Click the arrow beside the Feature Type Name field, and choose the column to use as the fanout value.

9 Fanout attribute.png

Click OK to close the properties dialog. When the workspace is run, the writer generates one output Parquet file for each distinct value of the selected attribute.
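The idea behind fanout can be sketched in a few lines of Python: group the rows by the chosen column, then write each group to its own file. This is an illustration using only the standard library, not FME's implementation. The names fan_out_rows and safe_file_stem, the file-naming rule, and the sample rows are all hypothetical.

```python
import csv
import io
import re
from collections import defaultdict


def fan_out_rows(rows, attribute):
    """Group row dicts by the value of one attribute (column)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return dict(groups)


def safe_file_stem(value):
    """Turn an attribute value into a file-name stem (hypothetical rule)."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", value.strip())
    return cleaned.strip("_") or "unnamed"


# Hypothetical sample data, not taken from the tutorial file.
sample = io.StringIO("Title;Type\nA;Mural\nB;Sculpture\nC;Mural\n")
rows = list(csv.DictReader(sample, delimiter=";"))

# Report which file each group would go to and how many rows it holds.
for value, group in fan_out_rows(rows, "Type").items():
    print(f"{safe_file_stem(value)}.parquet: {len(group)} row(s)")
```

Each group could then be written to its own Parquet file, for example with pyarrow's Table.from_pylist and parquet.write_table, assuming that library is available.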

Additional Resources

Apache Parquet FME Documentation

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government Licence - Vancouver.