Introduction
In FME, the Apache Parquet writer creates one .parquet file per feature type. The writer parameters let you choose the compression type; options include Uncompressed, ZSTD (good for character-based columns that store strings, including JSON strings), and Snappy (fast compression at a reasonable ratio).
Step-by-Step Instructions
In this scenario, the user needs to convert a CSV file to Parquet for use in a big data system. In FME, this only requires connecting a CSV reader to a Parquet writer and running the workspace. Follow the steps below to build the workspace from scratch, or open the completed FME template in the article attachments.
The data we are working with is a CSV file of public art in Vancouver.
1. Open FME Workbench
Open FME Workbench and start with a blank canvas.
2. Add a CSV Reader
Click “Add Reader” and add the source data to the workspace. Enter the following parameters:
- Format: CSV (Comma Separated Value)
- Dataset: C:\<Tutorial Download>\public-art.csv
- Workflow Options: Individual Feature Types
Click “Parameters…” and confirm the reader is configured to read the CSV data correctly. In this scenario, the delimiter is a semicolon and the first row contains the field names, so set the parameters as follows:
- Feature Type Name(s): From File Name(s)
- Delimiter Character: ;
- Field Names Line: 1
- Data Start Line: 2
- Attribute Definition: Automatic
Click OK on the Parameters, and click OK again to add the reader feature type to the canvas.
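To check these reader settings outside FME, the same parsing rules (semicolon delimiter, field names on line 1, data starting on line 2) can be sketched with Python's standard csv module. This is only an illustration: the sample rows and column names below are made up and are not taken from the actual public-art.csv file.

```python
import csv
import io

# Hypothetical sample mimicking a semicolon-delimited file; the real
# public-art.csv has its own columns and values.
SAMPLE = "RegistryID;Title;Type\n1;Example Mural;Mural\n2;Example Statue;Sculpture\n"


def read_semicolon_csv(text_stream):
    """Parse rows using line 1 as field names and ';' as the delimiter."""
    reader = csv.DictReader(text_stream, delimiter=";")
    return list(reader)


rows = read_semicolon_csv(io.StringIO(SAMPLE))
print(rows[0]["Title"])
```

To try it on a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. If the columns do not split as expected, the delimiter setting is the first thing to recheck.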
3. Add an Apache Parquet Writer
Click “Add Writer” and enter the following parameters:
- Format: Apache Parquet
- Dataset: C:\<Tutorial Download>\public-art-parquet
- Feature Type Definition: Copy from Reader...
Open the Parameters dialog to optionally set the Compression Type and File Version. In this scenario, we will leave the default values:
- Compression Type: UNCOMPRESSED
- File Version: 2.0
Click OK.
4. Connect the Reader and Writer
Connect the reader feature type to the writer feature type. The workspace is now configured to translate the input CSV file to Parquet.
5. Run the Workspace
Run the workspace to convert the CSV data to Parquet. The file public-art.parquet is generated in the output folder, which can then be uploaded to a Big Data system.
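As a quick sanity check before uploading, you can confirm that the output looks like a Parquet file. Unencrypted Parquet files begin and end with the 4-byte marker PAR1, so a minimal sketch using only the Python standard library might look like the following. The function name is hypothetical, and the check only inspects the markers; it does not validate the file's contents.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demo with a stand-in file that carries only the markers, not real Parquet data.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "fake.parquet")
    with open(fake, "wb") as f:
        f.write(PARQUET_MAGIC + b"placeholder" + PARQUET_MAGIC)
    print(looks_like_parquet(fake))
```

In practice you would call `looks_like_parquet` with the path to public-art.parquet in your own output folder.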
Bonus: Generating Multiple Output Parquet Files
To partition the data into multiple .parquet files, use Feature Type Fanout to create a file for each value in a specified column. For example, in this public art dataset, we could fan out on the “Type” column and write one Parquet file for each type of art.
In other words, a single CSV file with a Type column becomes a set of .parquet files, one per Type value.
Fanout is specified in the writer feature type properties:
1. Open the Writer Feature Type Properties
With the writer feature type connected to its input, double-click it to open the Feature Type Properties dialog.
2. Select the Fanout Attribute
Click the arrow beside the Feature Type Name field, and choose the column to use as the fanout value.
Click OK to close the properties dialog. When the workspace is run, the writer will generate output Parquet files based on that attribute value.
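The effect of fanning out on an attribute can be pictured as a group-by: every feature is routed to an output named after its attribute value. The sketch below illustrates that idea in plain Python. It is not how FME implements fanout; the sample rows are made up, and the naming rule (attribute value plus .parquet) is an assumption based on the writer producing one file per feature type.

```python
from collections import defaultdict

# Hypothetical rows; the values in the real dataset will differ.
rows = [
    {"Title": "Example Mural", "Type": "Mural"},
    {"Title": "Example Statue", "Type": "Sculpture"},
    {"Title": "Another Mural", "Type": "Mural"},
]


def fanout(rows, attribute, extension=".parquet"):
    """Group rows by an attribute value, keyed by the assumed output file name."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"{row[attribute]}{extension}"].append(row)
    return dict(groups)


for name, members in fanout(rows, "Type").items():
    print(name, len(members))
```

Each distinct value produces its own output, so a column with many unique values will produce many files. Choosing a low-cardinality column such as Type keeps the number of outputs manageable.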
Additional Resources
Apache Parquet FME Documentation
Data Attribution
The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government Licence - Vancouver.