How to Use the Databricks Writer

Dan Minney

Introduction

In FME, the Databricks writer allows you to write data to an existing or new Delta Table in a Databricks Workspace. In the writer parameters, you can control the data types of the attributes as well as the operation performed. The Databricks writer requires a connection to both a Databricks Workspace and a cloud storage location (either Amazon S3 or Microsoft Azure Data Lake Storage Gen2). The cloud storage location is used as a staging area before the data is written to the table in Databricks; this is a requirement of the Databricks APIs.

In this tutorial, we will create a Databricks writer connection to write data to Databricks. You will need to have a Databricks account to continue. After creating the connection, continue to Converting from OGC GeoPackage to Databricks to work through an example of writing data to Databricks with FME. 
 

Creating a Databricks Writer Connection

The Databricks writer requires the Databricks writer package to be installed. When you add a Databricks writer to a workspace for the first time, you will be prompted to install the “Databricks Writer” package; the writer cannot function properly until this package is installed.

To write to new and existing Databricks tables, you will need to create a Databricks writer connection, which requires a few different parameters. The connection dialog looks as follows:
image19.png

To fill in all the Databricks writer connection parameters, follow the steps below.
Open your Databricks Workspace in a web browser and go to the Compute tab.
image7.png

Next, click on the Cluster you want to use for writing Databricks tables. The Databricks reader and writer require that the cluster is running in order to read from or write to the tables stored in Databricks.
On the Cluster Details page, scroll down to the bottom and expand the Advanced Options. Click on the JDBC/ODBC tab and you will be presented with the parameters required to create a Databricks writer connection.
image16.png

Server Hostname
This is the Server Hostname value shown in the Cluster Details, with https:// added in front. For example, if the Server Hostname in the Cluster Details is your-workspace.cloud.databricks.com, then this parameter should be set to https://your-workspace.cloud.databricks.com.

Cluster ID
In Cluster Details, this can be found in the HTTP Path as the last set of alphanumeric characters. For example, if the HTTP Path is sql/protocolv1/o/1234567890123456/0123-123456-aj1bc23e, then the Cluster ID is 0123-123456-aj1bc23e.
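If you ever need to pull the Cluster ID out of the HTTP Path programmatically (for example, when scripting connection setup), it is simply the last path segment. A minimal Python sketch using the example HTTP Path above:

    # Extract the Cluster ID from the JDBC/ODBC HTTP Path (example value from above).
    http_path = "sql/protocolv1/o/1234567890123456/0123-123456-aj1bc23e"

    # The Cluster ID is the final segment after the last "/".
    cluster_id = http_path.rstrip("/").split("/")[-1]
    print(cluster_id)  # 0123-123456-aj1bc23e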

Personal Access Token
Use an existing Personal Access Token if you already have one. To generate a new Personal Access Token, follow the instructions in the Databricks documentation.
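If you prefer to script token creation rather than use the Databricks UI, the Databricks Token API can create a Personal Access Token; note that the request itself must be authenticated with an existing token. A hedged sketch (the hostname, existing token, and lifetime below are placeholder values):

    import requests

    # Placeholder values: replace with your workspace URL and an existing credential.
    host = "https://your-workspace.cloud.databricks.com"
    existing_token = "dapiXXXXXXXXXXXX"

    # Databricks Token API: create a new Personal Access Token.
    resp = requests.post(
        f"{host}/api/2.0/token/create",
        headers={"Authorization": f"Bearer {existing_token}"},
        json={"comment": "FME Databricks writer", "lifetime_seconds": 7776000},  # ~90 days
    )
    resp.raise_for_status()
    print(resp.json()["token_value"])  # shown only once; store it securely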

Catalog
The Catalog set in the Databricks writer connection determines which Catalog the writer works with. Setting a Catalog is optional but recommended. If Unity Catalog is enabled for the Databricks Workspace, you will need to select the Catalog you want to read from; the Catalog you select also determines where you can write to new and existing tables in the Databricks Workspace.
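If you are not sure which Catalogs you have access to, you can list them from a notebook attached to the cluster. A minimal sketch using Spark SQL in a Databricks notebook (the catalog name main below is only an example):

    # Run in a Databricks notebook cell attached to the cluster.
    # List the catalogs visible to the current user (requires Unity Catalog).
    spark.sql("SHOW CATALOGS").show()

    # Example: list the schemas in a catalog named "main" (hypothetical name).
    spark.sql("SHOW SCHEMAS IN main").show()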

Cloud Upload Credentials
When writing features to a table in Databricks, the writer requires a cloud-hosted staging location. For the Databricks writer, there is the option to use Amazon S3 or Azure Data Lake Storage Gen2 (ADLS).

Your Databricks workspace MUST have access to the cloud storage location you use. There are multiple ways to test whether your workspace can connect to the S3 bucket or ADLS container you want to use; one quick check from a notebook is sketched below.
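A minimal sketch of such a check: list the staging location with dbutils from a notebook attached to the cluster (the bucket, storage account, and container names below are placeholders):

    # Run in a Databricks notebook cell attached to the cluster.
    # If either call raises an access error, the workspace cannot reach that staging location.

    # Amazon S3 (placeholder bucket name):
    display(dbutils.fs.ls("s3://my-staging-bucket/"))

    # Azure Data Lake Storage Gen2 (placeholder account and container names):
    display(dbutils.fs.ls("abfss://my-container@mystorageaccount.dfs.core.windows.net/"))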

Keeping performance in mind, it is recommended that the staging location is on the same cloud platform and in the same region as the Databricks Workspace.
For example, if the Databricks Workspace is hosted in us-west-2 on AWS, then the staging location should be an Amazon S3 bucket in us-west-2.

 

AWS: see the Databricks documentation for instructions on how to connect to an Amazon S3 bucket from Databricks.

Microsoft Azure: see the Databricks documentation for instructions on how to connect to Azure Data Lake Storage Gen2 from Databricks.

Cloud Storage Type
For Cloud Storage Type, select either Amazon S3 or Microsoft Azure Data Lake Storage Gen2.

S3 Storage
If you selected Amazon S3, you will need to fill in two parameters.
S3 Connection:

  • Create a new (or use an existing) AWS Web Connection. See Working with Amazon Web Services for more information.  

S3 Bucket:

  • Enter the name of the S3 bucket you want to use as a staging location. The name must match exactly. 
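Independently of Databricks, you can also confirm that the AWS credentials used by the Web Connection can reach the bucket. A hedged Python sketch using boto3 (the bucket name is a placeholder, and the credentials are assumed to be available to boto3, for example via environment variables):

    import boto3
    from botocore.exceptions import ClientError

    # Placeholder: use the same value entered in the S3 Bucket parameter.
    bucket = "my-staging-bucket"

    s3 = boto3.client("s3")
    try:
        s3.head_bucket(Bucket=bucket)
        print(f"Bucket '{bucket}' is reachable.")
    except ClientError as err:
        print(f"Cannot access bucket '{bucket}': {err}")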

Azure Cloud Storage
If you selected Azure Cloud Storage, you will need to fill in the following parameters:
Cloud Authentication Method: 

  • Web Connection - allows you to connect via a Web Connection in FME
  • Microsoft SAS - allows you to connect via a Shared Access Signature (SAS)

Azure Data Lake Container: 

  • Enter the name of the Azure Data Lake Container you want to use as a staging location.
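Before using a Shared Access Signature in FME, you can verify that it can list the container. A hedged Python sketch using the azure-storage-file-datalake package (the storage account, container, and SAS values are placeholders):

    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder values: storage account URL, SAS token, and container name.
    account_url = "https://mystorageaccount.dfs.core.windows.net"
    sas_token = "?sv=...&sig=..."  # Shared Access Signature
    container = "my-container"

    service = DataLakeServiceClient(account_url=account_url, credential=sas_token)
    fs = service.get_file_system_client(container)

    # If the SAS and container name are valid, this lists the container's top-level paths.
    for path in fs.get_paths(recursive=False):
        print(path.name)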

Using the example values in the screenshot above, the Databricks web connection would look like the following:
image12.png

To save the connection for another time, click the Presets drop-down arrow > Save As > New Preset. Give the Preset a meaningful name.
Below is a visual representation of where the corresponding parameters are located in Databricks.
image14.png
 

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.
