How to do Spatial Processing on Parquet Data

Files

Parquet spatial processing tutorial.zip (30 KB)

Introduction

In FME, Parquet data can be extracted from a data warehouse or data lake, processed, and written back to Parquet for upload to the cloud. These datasets often hold large volumes of spatial information, such as latitude/longitude or x/y coordinates, stored in columns that need processing and transformation. Parquet is optimized for storing millions of records, and FME can quickly process these records and their associated geospatial information.

Step-by-Step Instructions

In this scenario, the user needs to process a Parquet file stored in the cloud that contains spatial information in the form of latitude/longitude coordinates. In FME, the workflow consists of a Parquet reader and writer, plus a NeighborFinder transformer and an additional Shapefile reader, to perform the data processing. Follow the steps below to build the workspace from scratch, or open the completed FME template attached to the article.

The data we are working with represents public art in Downtown Vancouver and contains latitude and longitude values for each row:

1 Parquet preview.png

We are interested in determining which transit station is closest to each art display and then updating the Parquet dataset to include this information. We’ll use an external GIS dataset containing transit data.

1. Generate a Workspace

Open FME Workbench and generate a new workspace. Add an Apache Parquet reader and writer by entering the following parameters:

Reader Format: Apache Parquet
Reader Dataset: C:\<Tutorial Download>\Downtown.parquet
Writer Format: Apache Parquet
Writer Dataset: C:\<Tutorial Download>\Downtown-transformed
Workflow Options: Static Schema

2 Generate Parquet translation.PNG

Click OK.

A workspace is generated that translates the input .parquet file to an output .parquet file.

3 Parquet workspace.png

2. Add a Shapefile Reader

Now it’s time to perform the desired spatial processing. We are interested in finding which transit station is closest to each art display, so we’ll read in the GIS file containing transit stations.

Click “Add Reader” and enter the following parameters:

Format: Esri Shapefile
Dataset: C:\<Tutorial Download>\transit\rapid-transit-stations.shp
Workflow Options: Individual Feature Types

4 Shapefile reader.PNG

The workspace now has two reader feature types, one for Downtown and one for rapid-transit-stations.

3. Add a NeighborFinder Transformer

Click anywhere on the canvas and begin typing “NeighborFinder”. Click the transformer to add it to the workflow.

On the input side, connect the Downtown feature type to the Base port and the rapid-transit-stations feature type to the Candidate port.

Open the NeighborFinder parameters. Under the “Attribute Accumulation” section, check “Merge Attributes”. This ensures that the station name from the Shapefile dataset appears in the output Parquet dataset.

5 NeighborFinder parameters.png

Connect the MatchedBase output port to the Downtown writer feature type. The workspace should look as follows:

6 NeighborFinder.png
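The idea behind this step, finding the closest candidate for each base feature and merging the candidate's attributes onto it, can be sketched in plain Python. This is only an illustration of the concept, not how FME implements the NeighborFinder: the haversine distance, the function names, the `distance_m` field, and the sample coordinates and station names are all assumptions made for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_nearest(bases, candidates):
    """Return a copy of each base record with its closest candidate's
    station name (and the distance to it) merged in."""
    matched = []
    for base in bases:
        def distance_to(candidate):
            return haversine_m(base["Latitude"], base["Longitude"],
                               candidate["lat"], candidate["lon"])

        nearest = min(candidates, key=distance_to)
        merged = dict(base)                     # keep the base attributes
        merged["station"] = nearest["station"]  # merged candidate attribute
        merged["distance_m"] = distance_to(nearest)
        matched.append(merged)
    return matched


# Made-up sample records; not values from the tutorial data.
art = [
    {"Name": "A", "Title": "Example A", "Latitude": 49.2850, "Longitude": -123.1120},
    {"Name": "B", "Title": "Example B", "Latitude": 49.2830, "Longitude": -123.1200},
]
stations = [
    {"station": "Station X", "lat": 49.2856, "lon": -123.1115},
    {"station": "Station Y", "lat": 49.2833, "lon": -123.1190},
]

matched = find_nearest(art, stations)
for row in matched:
    print(row["Name"], row["station"], round(row["distance_m"]))
```

Each output record keeps its original attributes and gains a `station` value, which mirrors what “Merge Attributes” achieves on the MatchedBase port. The brute-force search is fine for a handful of stations; a spatial index would be the usual choice for large candidate sets.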

4. Configure the Output Attributes

Open the parameters of the output writer feature type. In the User Attributes tab, ensure “Manual” is selected and add a “station” attribute of type “string”. The attribute list should consist of Name, Title, Longitude, Latitude, and station (names are case-sensitive). This ensures that the output Parquet file contains only the columns we care about, including the station name taken from the Shapefile dataset.

7 Writer feature type attributes.PNG

Click OK.

When you expand the writer feature type, you should see the list of attributes with green input arrows. A red arrow means the attribute name does not match any attribute on the incoming features.

9 Output attributes.png
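The case-sensitive name matching described above can be illustrated with a short Python sketch. It is a conceptual analogy only, not FME's implementation: the helper names `check_writer_attributes` and `project`, and the sample feature, are hypothetical.

```python
def check_writer_attributes(writer_attributes, feature):
    """Split the writer's attribute names into those the incoming feature
    supplies (exact, case-sensitive match) and those it does not."""
    connected = [name for name in writer_attributes if name in feature]
    unconnected = [name for name in writer_attributes if name not in feature]
    return connected, unconnected


def project(feature, writer_attributes):
    """Keep only the attributes the writer defines. In this sketch a
    missing attribute becomes None."""
    return {name: feature.get(name) for name in writer_attributes}


writer_attributes = ["Name", "Title", "Longitude", "Latitude", "station"]

# Hypothetical incoming feature: "Station" is capitalized, so it does not
# match the writer's "station" attribute, and "extra" is not written.
feature = {
    "Name": "A",
    "Title": "Example A",
    "Longitude": -123.1120,
    "Latitude": 49.2850,
    "Station": "Station X",
    "extra": 1,
}

connected, unconnected = check_writer_attributes(writer_attributes, feature)
print("connected:", connected)
print("unconnected:", unconnected)  # ['station'] because of the case mismatch
print(project(feature, writer_attributes))
```

Here `unconnected` plays the role of the red arrows: renaming `Station` to `station` on the incoming feature would move it into the connected list.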

5. Run the Workspace

Run the workspace to perform the desired spatial processing on the Parquet dataset. The file Downtown.parquet is generated in the Downtown-transformed folder. It contains an additional column with the name of the nearest transit station and can now be uploaded back to the cloud.

8 Output parquet file.png

Additional Resources

Apache Parquet FME Documentation

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government Licence - Vancouver.