How to do Spatial Processing on Parquet Data

Liz Sanderson

Introduction

In FME, Parquet data can be extracted from a data warehouse or data lake, processed, and written back to Parquet for upload to the cloud. Spatial information such as latitude and longitude or x/y coordinates is often stored in columns in large volumes and needs to be processed and transformed. Parquet is optimized for storing millions of records, and FME can quickly process these records and their associated geospatial information.

 

Step-by-Step Instructions

In this scenario, the user needs to process a Parquet file stored in the cloud, which contains spatial information in the form of lat/long coordinates. In FME, the workflow includes a Parquet reader and writer, plus transformers and an additional reader to perform the data processing. Follow along in the steps below to build the workspace from scratch, or open the completed FME template in the article attachments.

The data we are working with represents public art in Downtown Vancouver and contains latitude and longitude values for each row:

1 Parquet preview.png

We are interested in finding which transit station is closest to each art display, and then updating the Parquet dataset to include that information. We’ll use an external GIS dataset containing transit data.

1. Generate a Workspace
Open FME Workbench and generate a new workspace. Add an Apache Parquet reader and writer by entering the following parameters:

  • Reader Format: Apache Parquet
  • Reader Dataset: C:\<Tutorial Download>\Downtown.parquet
  • Writer Format: Apache Parquet
  • Writer Dataset: C:\<Tutorial Download>\Downtown-transformed
  • Workflow Options: Static Schema

2 Generate Parquet translation.PNG

Click OK.

A workspace is generated that translates the input .parquet file to an output .parquet file.

3 Parquet workspace.png

2. Add a Shapefile Reader
Now it’s time to perform the desired spatial processing. We are interested in finding which transit station is closest to each art display, so we’ll read in the GIS file containing transit stations.

Click “Add Reader” and enter the following parameters:

  • Format: Esri Shapefile
  • Dataset: C:\<Tutorial Download>\transit\rapid-transit-stations.shp
  • Workflow Options: Individual Feature Types

4 Shapefile reader.PNG

The workspace now has two reader feature types, one for Downtown and one for rapid-transit-stations.

3. Add a NeighborFinder Transformer
Click anywhere on the canvas and begin typing “NeighborFinder”. Click the transformer to add it to the workflow.

On the input side, connect the Downtown feature type to the Base port and the rapid-transit-stations feature type to the Candidate port.

Open the NeighborFinder parameters. Under the “Attribute Accumulation” section, check “Merge Attributes”. This ensures the station name from the Shapefile dataset will appear in the output Parquet dataset.

5 NeighborFinder parameters.png

Connect the MatchedBase output port to the Downtown writer feature type. The workspace should look as follows:

6 NeighborFinder.png
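For readers who want to see the matching logic outside of Workbench, the following is a minimal Python sketch of a nearest-neighbor match with attribute merging. It is not how the NeighborFinder is implemented: it uses a brute-force search with great-circle (haversine) distance, which may differ from how the transformer measures distance. The function names and the sample coordinates are hypothetical and are not taken from the tutorial data; only the attribute names (Name, Title, Longitude, Latitude, station) come from this article.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def match_nearest(bases, candidates):
    """For each base record, find the closest candidate and merge its attributes.

    Base attributes win when both records share an attribute name.
    """
    matched = []
    for base in bases:
        nearest = min(
            candidates,
            key=lambda c: haversine_m(base["Latitude"], base["Longitude"],
                                      c["Latitude"], c["Longitude"]),
        )
        matched.append({**nearest, **base})
    return matched


# Hypothetical sample records (illustrative values only).
art = [{"Name": "Art 1", "Title": "Untitled",
        "Latitude": 49.2827, "Longitude": -123.1207}]
stations = [
    {"station": "Station A", "Latitude": 49.2833, "Longitude": -123.1160},
    {"station": "Station B", "Latitude": 49.2745, "Longitude": -123.1218},
]

result = match_nearest(art, stations)
```

Each record in `result` keeps the art display's own attributes and gains the `station` attribute of its closest candidate, which mirrors what “Merge Attributes” is used for in this step.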

4. Configure the Output Attributes
Open the parameters on the output writer feature type. In the User Attributes tab, ensure “Manual” is chosen and add a “station” attribute of type “string”. The list of attributes should consist of Name, Title, Longitude, Latitude, and station (attribute names are case-sensitive). This ensures that only the columns we care about are written to the output Parquet file, including the station name taken from the Shapefile dataset.

7 Writer feature type attributes.PNG

Click OK.

When you expand the writer feature type, you should see the list of attributes with green input arrows. If an arrow is red, it means the attribute name does not match the source.

9 Output attributes.png
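The case-sensitive name matching described above can be illustrated with a small Python sketch. It is only an analogy for the green and red arrows, not FME's own logic; the helper names `unmatched_attributes` and `project` are hypothetical, and the sample feature is made up.

```python
# Attribute names the writer feature type is configured to write.
WRITER_ATTRIBUTES = ["Name", "Title", "Longitude", "Latitude", "station"]


def unmatched_attributes(writer_attributes, feature):
    """Return writer attributes with no exact (case-sensitive) match on the feature."""
    return [name for name in writer_attributes if name not in feature]


def project(feature, writer_attributes):
    """Keep only the writer's attributes; missing ones become None."""
    return {name: feature.get(name) for name in writer_attributes}


# Hypothetical feature whose station attribute is capitalized differently.
feature = {"Name": "Art 1", "Title": "Untitled", "Longitude": -123.1207,
           "Latitude": 49.2827, "Station": "Station A", "extra": 1}

missing = unmatched_attributes(WRITER_ATTRIBUTES, feature)
row = project(feature, WRITER_ATTRIBUTES)
```

Here `missing` contains `station` because the feature carries `Station` instead, which corresponds to a red arrow on the writer feature type. Renaming the attribute so the case matches resolves it.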

5. Run the Workspace
Run the workspace to perform the desired spatial processing on the Parquet dataset. A Downtown.parquet file is generated in the Downtown-transformed folder. It contains an additional column with the name of the nearest transit station. This file can now be uploaded back to the cloud.

8 Output parquet file.png

 

Additional Resources

Apache Parquet FME Documentation

 

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government Licence - Vancouver.
