Tutorial: Getting Started with Apache Parquet

Liz Sanderson

Introduction

Apache Parquet is a column-oriented file format designed for high-performance use in Big Data systems. Unlike CSV, it supports null values and a full range of data types, and its columnar layout makes queries efficient because only the columns a query needs are read. This makes it a strong fit for data warehouses and data lakes, including systems like Apache Hadoop, Amazon Athena, Google BigQuery, and Microsoft Azure.

A Parquet dataset is a folder of .parquet files, which may be organized into nested partition subfolders based on attribute values. In FME, a .parquet file is a feature type and a row/record is a feature.


Articles

This tutorial series will walk through basic Parquet translation and transformation scenarios, including how to use the Apache Parquet reader and writer in FME.

 

How to Convert CSV to Parquet

This tutorial walks through how to convert a CSV file to one or more Parquet files for use in a Big Data system.

 

How to Convert Parquet to JSON

This tutorial walks through how to convert a partitioned dataset of .parquet files into a single JSON file for easy sharing over the web.

 

How to Do Spatial Processing on Parquet Data

This tutorial walks through how to do spatial processing on a Parquet file that has been extracted from a data lake. The data is then uploaded back to the cloud.
