Converting from OGC GeoPackage to Databricks

Files

ConvertingToDatabricks.zip
- 80 KB
- Download

Introduction

This tutorial walks through how to spatially join two OGC GeoPackage layers and write the resulting features to a Databricks Delta Table.
In this scenario, we want to find out which drinking fountains fall within which postcode boundaries for use in a data lake, in this case, Databricks. Follow the steps below to build this workspace from scratch, or open the completed FME template in the article attachments.

Step-by-Step Instructions

The data we are working with in this exercise is an OGC GeoPackage containing Drinking Fountains and Postcode Boundaries in Vancouver, British Columbia.

1. Open FME Workbench
Open FME Workbench and create a new workspace.

2. Add an OGC GeoPackage Reader
Add a reader to the workspace by clicking the “Add Reader” button in the toolbar. Enter the following parameters:

Format: OGC GeoPackage
Dataset: Vancouver_Data.gpkg

Click OK, then in the Select Feature Types dialog, select both PostcodeBoundaries and DrinkingFountains. Click OK.

Run the workspace with Feature Caching Enabled. There should be 113 features coming from the DrinkingFountains layer and 21 features coming from the PostcodeBoundaries layer.

3. Perform an Intersect Between the Two Feature Types
We want to perform a spatial intersect between the two feature types to see which PostcodeBoundaries the DrinkingFountains fall within. To do this, we can add a SpatialFilter to the workspace. Connect the DrinkingFountains reader feature type to the Candidate port. Connect the PostcodeBoundaries reader feature type to the Filter port.

Open the SpatialFilter parameters. Set the Spatial Predicates to Test parameter to “Filter OGC-Intersects Candidate”. Leave the rest of the parameters as the defaults. Press OK to accept the new parameters and close the transformer.

4. Attribute Cleanup
Before we output the data to Databricks, we need to do some attribute cleaning.
Add an AttributeManager to the workspace and connect the SpatialFilter Passed output port to it.
Open the AttributeManager parameters. Remove the SHAPE_Length, SHAPE_Area, and _predicate attributes by clicking on each Input Attribute and clicking the “-” button to remove them.
Rename the CFSAUID attribute to “PostalCode” by clicking on the Output Attribute named CFSAUID and renaming it. Press OK to accept the new parameters.

5. Write to Databricks
Now that we’ve found which postal code boundaries each of the drinking fountains falls within and have cleaned up the data, we can write to a new Databricks Delta Table.

Add a writer to the workspace by clicking the “Add Writer” button in the toolbar. Enter the following parameters:

Format: Databricks
Dataset: <your_databricks_connection>
Table Definition: Automatic…

If you’re not sure how to create a Databricks writer connection, please reference How to Use the Databricks Writer.
Press OK to add the writer to the workspace. In the Feature Type dialog that pops up, set the Table Name to DrinkingFountains and press OK. Connect the AttributeManager output port to the DrinkingFountains writer feature type.

6. Run the workspace
With the writer added to the workspace, you can now run it to see the table created in Databricks. The final workspace should look like this:

By default, the schema is set to "default". If you are using a different schema name, open the Databricks writer feature type and change the Table Qualifier.

Run the workspace to write the data to Databricks.

7. View the Output in Databricks
Now go to the Data Explorer page in your Databricks Workspace. We can see that the DrinkingFountains table was uploaded as a new Delta Table.

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.