Introduction
GeoParquet is an open, cloud-native geospatial format that stores vector data along with location information in Parquet files. It stores geometries and attributes in a highly compressed, columnar structure, making it well suited to scalable querying of large datasets and compatible with cloud-based workflows and analytics tools.
You can learn more about the format at the official site: GeoParquet Format (.parquet).
OpenStreetMap (OSM) provides freely accessible geospatial data from around the world in Protocolbuffer Binary Format (.pbf).
In this tutorial, we’ll use FME to convert OSM .pbf data into both partitioned and non-partitioned GeoParquet formats. We’ll write these outputs to Amazon S3, then compare the read performance of .pbf vs. GeoParquet by reading the same set of features (~193K tagged as natural) back into FME. We will also explore how GeoParquet partitioning enables selective access to specific types of features — such as water, forest, or scrub.
Step-by-Step Instructions
Part 1: Writing Partitioned and Non-Partitioned GeoParquet Data
1. Open FME Workbench and Add a Reader
In FME Workbench, start with a blank workspace. Click the Reader button to add a new reader. In the parameters:
- Format: OpenStreetMap (OSM) Protocolbuffer Binary Format (PBF)
- Dataset: https://safe-experts.s3.us-west-2.amazonaws.com/cloud-native-webinar/planet_-100.649%2C47.534_-93.386%2C50.485.osm.pbf
- Parameters:
  - Map Features: natural
    - Click the ellipsis next to Map Features. In the dialog, expand osm and select natural. It is a large dataset and may take a while to load.
Then click OK three times.
This limits the data to ~193K features tagged as natural, instead of reading ~1 million features from the full PBF file.
The natural Feature Type corresponds to the OSM tag natural=*, which describes natural features in OpenStreetMap. The features read into FME will include an attribute also called natural, with values such as water, forest, scrub, wetland, beach, and others. These values are derived directly from OSM and will later be used to partition the GeoParquet output into separate folders based on attribute value.
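The effect of the Map Features filter can be pictured with a small sketch. This is not how the FME reader is implemented; it only illustrates the idea of keeping features that carry a natural=* tag and exposing the tag value as an attribute. The sample features and the `select_natural` helper are hypothetical.

```python
from collections import Counter

# Hypothetical sample features; real OSM features carry many more tags.
features = [
    {"id": 1, "tags": {"natural": "water", "name": "Example Lake"}},
    {"id": 2, "tags": {"highway": "residential"}},
    {"id": 3, "tags": {"natural": "forest"}},
    {"id": 4, "tags": {"natural": "water"}},
]


def select_natural(features):
    """Keep features tagged natural=* and expose the value as an attribute."""
    selected = []
    for feature in features:
        value = feature["tags"].get("natural")
        if value is not None:
            selected.append({"id": feature["id"], "natural": value})
    return selected


natural_features = select_natural(features)

# Count how many features fall under each natural value.
counts = Counter(f["natural"] for f in natural_features)
print(counts)
```

The per-value counts hint at what the partitions will look like later: one group of features per distinct value of the natural attribute.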
2. Inspect the Data
Run the workspace with Feature Caching Enabled. Click on the cache to view the data in Data Preview (formerly Visual Preview).
Open the Translation Log and note the time taken to read the ~193K features. The screenshot below shows the time taken without Feature Caching enabled.
3. Add FeatureWriters to Write into GeoParquet Format
We will write this data to GeoParquet twice: once with partitioning enabled, to see how partitioned data is written, and once without, to compare the time taken to read the same number of features with the PBF reader versus the GeoParquet reader.
For this reason, we will add two FeatureWriters and split the workspace into two streams.
4. Configure FeatureWriter to Write with Partitioning
Now we will write the data into GeoParquet format with Partitioning enabled.
Partitioning means organizing your data into separate folders based on the values of one or more attributes (columns). This makes it easier to load only what you need (e.g., just “water”) from cloud storage without scanning everything.
- Format: GeoParquet
- Dataset: \Output
- Partition: Checked
  - Partition Type: Directory
  - Partition Attributes: natural
Click OK.
More information on Partition parameters can be found here: GeoParquet Writer Documentation
5. Configure FeatureWriter to Write Without Partitioning
Duplicate the FeatureWriter, but in the parameters:
- Partition: Unchecked
6. Inspect the Output
Run the Workspace and check both outputs. Navigate to the destination folder specified in the Dataset parameter to check the output. You’ll see a natural.parquet and a natural folder with subfolders (partitions) named after values from the natural attribute. Each subfolder contains a .parquet file with only that feature type.
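The folder layout described above can be sketched in a few lines. This is only an illustration of directory partitioning, not the writer's actual implementation: the `plan_partitions` helper and the sample features are hypothetical, and the `part0.parquet` file name is taken from the path used later in this tutorial.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_partitions(features, feature_type="natural", attribute="natural"):
    """Group features by attribute value and map each group to a file path."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature[attribute]].append(feature)
    # One subfolder per distinct attribute value, each holding one file.
    return {
        str(PurePosixPath(feature_type) / value / "part0.parquet"): rows
        for value, rows in groups.items()
    }


# Hypothetical sample features.
sample = [
    {"id": 1, "natural": "water"},
    {"id": 2, "natural": "forest"},
    {"id": 3, "natural": "water"},
]

layout = plan_partitions(sample)
for path, rows in sorted(layout.items()):
    print(path, len(rows))
```

Each key in `layout` corresponds to one partition folder, so a consumer interested only in water features needs to touch a single path.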
7. Add S3Connectors to Upload to Amazon S3
To simulate a real cloud-based use case, we will upload both outputs to the cloud so that they can later be read from there rather than from local storage. Add two S3Connector transformers and connect one to the partitioned data stream and the other to the non-partitioned data stream.
8. Configure S3Connector for Partitioned Data
In the S3Connector parameters attached to the partitioning workflow, set the following:
- Credential Source: Embedded
- Embedded Credentials:
  - Access Key ID: <your access key>
  - Secret Access Key: <your secret key>
  - Session Token: <your session token>
- Request:
  - Action: Upload
- Data Source:
  - Upload: Folder
  - Folder to Upload: @Value(_dataset)\@Value(_feature_types{0}.name)
  - Include Subfolders: Yes
  - Contents Only: No
- Upload Options:
  - Bucket: <your bucket>
  - Path: <your path>
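To see how these settings shape the object keys in the bucket, here is a rough sketch. It assumes that Contents Only: No means the folder itself (here `natural`) is uploaded along with its contents, which matches the `natural/water/part0.parquet` path read back in Part 2. The `plan_upload_keys` helper, the file list, and the `cloud-native-demo` path are hypothetical.

```python
from pathlib import PurePosixPath


def plan_upload_keys(folder_name, relative_files, s3_path, contents_only=False):
    """Return the S3 object keys a folder upload would produce.

    Assumption: with contents_only=False the folder name is kept in the key;
    with contents_only=True only the files inside the folder are placed
    under s3_path.
    """
    base = PurePosixPath(s3_path)
    if not contents_only:
        base = base / folder_name
    return [str(base / rel) for rel in relative_files]


# Hypothetical partition files, relative to the local "natural" folder.
files = ["water/part0.parquet", "forest/part0.parquet"]

keys = plan_upload_keys("natural", files, "cloud-native-demo")
flat_keys = plan_upload_keys("natural", files, "cloud-native-demo", contents_only=True)
print(keys)
print(flat_keys)
```

Keeping the folder name in the key preserves the partition layout in the bucket, so the same relative paths work locally and in S3.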
9. Configure S3Connector for Non-Partitioned Data
Duplicate the parameters from the partitioned S3Connector for the non-partitioned workflow, except change the following:
- Data Source:
  - Upload: File
  - File to Upload: @Value(_dataset)\@Value(_feature_types{0}.name).parquet
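The two path expressions used in steps 8 and 9 combine the writer's dataset folder with the name of the first written feature type. The sketch below mimics that string building in Python. The attribute values (`C:\Output` and `natural`) are hypothetical examples, and the helper functions are illustrative rather than part of FME.

```python
import ntpath

# Hypothetical attribute values as they might appear on the summary feature.
feature = {
    "_dataset": r"C:\Output",
    "_feature_types{0}.name": "natural",
}


def folder_to_upload(feature):
    """Mimic @Value(_dataset)\\@Value(_feature_types{0}.name)."""
    return ntpath.join(feature["_dataset"], feature["_feature_types{0}.name"])


def file_to_upload(feature):
    """Mimic the same expression with the .parquet extension appended."""
    return folder_to_upload(feature) + ".parquet"


print(folder_to_upload(feature))  # partitioned output folder
print(file_to_upload(feature))    # non-partitioned output file
```

The only difference between the two expressions is the `.parquet` suffix: the partitioned writer produces a folder named after the feature type, while the non-partitioned writer produces a single file with that name.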
10. Run Workspace
Run the workspace to upload the partitioned and non-partitioned data to S3.
We have successfully written Partitioned and non-Partitioned GeoParquet data!
Part 2: Reading Partitioned and Non-Partitioned GeoParquet Data
1. Read the Non-Partitioned GeoParquet using the GeoParquet Reader
Now let’s test reading the non-partitioned data. In a new workspace, add a reader.
- Format: GeoParquet
- Dataset: <your S3 bucket from Part 1>/natural.parquet
  - Click on the dropdown next to Dataset and hover over Select File From Web > Browse Amazon S3
This will open the options to set AWS connection settings. You can use the same settings as the S3Connector from earlier.
Run the workspace to read only the newly added GeoParquet Feature Type. Open the Translation Log and note the time taken to read the ~193K features from GeoParquet. The screenshot below shows the time taken without Feature Caching enabled.
As we saw above, the PBF file took ~32.6 seconds to read from the cloud, whereas the GeoParquet file was read in just ~8.4 seconds.
In other words, reading the same number of features from GeoParquet took roughly a quarter of the time needed for PBF, or about 3.9 times faster.
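The two timings quoted above can be turned into a ratio with a quick calculation. The numbers are the ones reported in this tutorial; your own timings will vary with network conditions and hardware.

```python
# Read times reported in this tutorial, in seconds.
pbf_seconds = 32.6
geoparquet_seconds = 8.4

# How many times faster the GeoParquet read was.
speedup = pbf_seconds / geoparquet_seconds

# GeoParquet read time as a fraction of the PBF read time.
fraction = geoparquet_seconds / pbf_seconds

print(f"{speedup:.1f}x faster, {fraction:.0%} of the PBF read time")
# prints "3.9x faster, 26% of the PBF read time"
```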
2. Inspect the Partitioned Output
Now let’s test reading the partitioned data. Add another reader to the workspace.
- Format: GeoParquet
- Dataset: <your S3 bucket from Part 1>/natural/water/part0.parquet
  - Click on the dropdown next to Dataset and hover over Select File From Web > Browse Amazon S3
Use the Run to This option on the part0 reader feature type, or connect an Inspector.
The output can be inspected in the Data Preview (formerly Visual Preview) window.
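Selective access to a partition comes down to choosing only the files under one folder. The sketch below shows that idea on a plain list of object keys. The keys and the `keys_for_partition` helper are hypothetical, and in practice the listing would come from your S3 bucket.

```python
def keys_for_partition(all_keys, feature_type, value):
    """Return only the keys stored under <feature_type>/<value>/."""
    prefix = f"{feature_type}/{value}/"
    return [key for key in all_keys if key.startswith(prefix)]


# Hypothetical listing of the partitioned output.
all_keys = [
    "natural/water/part0.parquet",
    "natural/forest/part0.parquet",
    "natural/scrub/part0.parquet",
]

water_keys = keys_for_partition(all_keys, "natural", "water")
print(water_keys)
```

Because the attribute value is encoded in the path, the forest and scrub files are never opened when only water features are needed.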
Summary
In this tutorial, we used FME to read OpenStreetMap data in .pbf format and convert it into GeoParquet, both with and without partitioning.
We wrote both versions to S3 and demonstrated how to read them back using the GeoParquet Reader. The non-partitioned version showed a noticeable improvement in read time compared to the original .pbf file, even when handling the same number of features (~193K). The partitioned version, on the other hand, highlighted how cloud-native data can be organized by attribute (e.g., natural=water, natural=forest) to enable targeted data access, which is ideal for workflows that only require specific feature types.
This exercise demonstrates how GeoParquet enhances both performance and data flexibility, and how partitioning adds value by aligning geospatial data with modern, selective-access storage patterns.