How to Use the Databricks Writer

Dan Minney

Introduction

In FME, the Databricks writer allows you to write data to an existing or new Delta Table in a Databricks Workspace. In the writer parameters, you can control the data types of the attributes as well as the operation performed. 

In this tutorial, we will create a Databricks writer connection to write data to Databricks. You will need to have a Databricks account to continue. After creating the connection, continue to Converting from OGC GeoPackage to Databricks to work through an example of writing data to Databricks with FME. 

 

How does the writer work?

The Databricks writer uses a staging location to stage data before loading it into the Databricks table. This process is the most efficient for loading large amounts of data. 

A Unity Catalog Volume can be specified as the staging location. This method is recommended as it requires the least amount of setup.

Alternatively, users can use a Cloud Storage location (either Amazon S3 or Microsoft Azure Data Lake Gen 2) as the staging location. Databricks must have permissions to access the cloud storage location specified.
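FME handles the staging and loading steps automatically, but the sketch below shows, in rough terms, what a staged load into a Delta Table looks like. It is only a conceptual illustration: the databricks-sql-connector package, the Unity Catalog Volume path, the table name, and the Parquet file format are all example assumptions, not the exact statements the writer issues.

    # Conceptual sketch of a staged load; FME performs the equivalent steps internally.
    # Assumes: pip install databricks-sql-connector, plus placeholder workspace values,
    # a hypothetical Unity Catalog Volume path, and a hypothetical Delta Table name.
    from databricks import sql

    with sql.connect(
        server_hostname="your-workspace.cloud.databricks.com",
        http_path="sql/protocolv1/o/57523595023/0123-123456-aj1bc23e",
        access_token="dapiXXXXXXXXXXXXXXXX",
    ) as connection:
        with connection.cursor() as cursor:
            # Load the staged files from the Volume into the target Delta Table.
            cursor.execute("""
                COPY INTO main.default.public_trees
                FROM '/Volumes/main/default/fme_staging/public_trees/'
                FILEFORMAT = PARQUET
            """)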

 

Creating a Databricks Writer Connection

The Databricks writer requires the Databricks Writer package. When you add a Databricks writer to a workspace for the first time, you will be prompted to install the “Databricks Writer” package; the writer cannot function properly until it is installed.

To write to new and existing Databricks tables, you will need to create a Databricks writer connection. The connection requires a few different parameters, and the connection dialog will look as follows:



To fill in the Databricks writer connection parameters, follow these steps.
Open your Databricks Workspace in a web browser and go to the Compute tab.


Next, click on the Cluster you want to use for writing Databricks tables. The Databricks reader and writer both require the cluster to be running in order to read or write the tables stored in Databricks.
On the Cluster Details page, scroll down to the bottom and expand the Advanced Options. Click on the JDBC/ODBC tab, and you will be presented with the parameters required to create a Databricks writer connection.


Server Hostname
This is the Server Hostname value shown in the Cluster Details, with https:// added in front. For example, if the Server Hostname in the Cluster Details is your-workspace.cloud.databricks.com, set this parameter to https://your-workspace.cloud.databricks.com.

Cluster ID
In Cluster Details, this can be found in the HTTP Path as the last set of alphanumeric characters. For example, if our HTTP Path is sql/protocolv1/o/57523595023/0123-123456-aj1bc23e, then the Cluster ID is 0123-123456-aj1bc23e.
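If you want to double-check the value, the Cluster ID is simply the last path segment. A quick Python snippet using the example HTTP Path above illustrates this:

    # The Cluster ID is the final segment of the HTTP Path.
    http_path = "sql/protocolv1/o/57523595023/0123-123456-aj1bc23e"
    cluster_id = http_path.rsplit("/", 1)[-1]
    print(cluster_id)  # 0123-123456-aj1bc23e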

Personal Access Token
Use an existing Access Token if you already have one. To generate a new Personal Access Token, follow these instructions.
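If you are unsure whether a token is still valid, one quick check outside of FME is to call the Databricks REST API with it. The sketch below uses the requests package with placeholder hostname and token values; a successful response means the token authenticates against the workspace.

    # Hypothetical token check using the Databricks REST API (requires: pip install requests).
    import requests

    host = "https://your-workspace.cloud.databricks.com"  # your workspace URL
    token = "dapiXXXXXXXXXXXXXXXX"                         # your personal access token

    response = requests.get(f"{host}/api/2.0/clusters/list",
                            headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()  # raises an error if the token or hostname is invalid
    print(len(response.json().get("clusters", [])), "clusters visible")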

Catalog
Set a Catalog you have access to in the Databricks writer connection. The Catalog you select determines where you will be able to write new and existing tables in the Databricks Workspace.
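To see what a given catalog contains before pointing the writer at it, you can list its schemas outside of FME. This is only an optional sanity check; it assumes the databricks-sql-connector package and uses main as a placeholder catalog name.

    # Optional check of a catalog's contents (requires: pip install databricks-sql-connector).
    from databricks import sql

    with sql.connect(
        server_hostname="your-workspace.cloud.databricks.com",
        http_path="sql/protocolv1/o/57523595023/0123-123456-aj1bc23e",
        access_token="dapiXXXXXXXXXXXXXXXX",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SHOW SCHEMAS IN main")  # 'main' is a placeholder catalog name
            for (schema_name,) in cursor.fetchall():
                print(schema_name)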

Staging Upload Credentials
When writing features to a table in Databricks, the writer requires a staging location. For the Databricks writer, there is the option to use a Unity Catalog Volume, Amazon S3 or Azure Data Lake Storage Gen2 (ADLS).

If you decide to use Amazon S3 or Azure Data Lake Storage Gen2, your Databricks workspace MUST have access to the cloud storage location you use. 

AWS: see these instructions for connecting to an Amazon S3 bucket from Databricks.
Microsoft Azure: see these instructions for connecting to Azure Data Lake Storage Gen2 from Databricks.

 

Storage Type
For cloud storage type, you can select Amazon S3, Microsoft Azure Data Lake Gen 2, or Databricks UC Volume.

S3 Storage
If you selected Amazon S3, you will need to fill in two parameters.
S3 Connection:

  • Create a new (or use an existing) AWS Web Connection. See Working with Amazon Web Services for more information.  

S3 Bucket:

  • Enter the name of the S3 bucket you want to use as a staging location. The name must match exactly. 
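As a quick way to confirm you have the bucket name exactly right, you can check it outside of FME with boto3. This sketch assumes boto3 is installed, your AWS credentials are configured locally, and uses a placeholder bucket name.

    # Hypothetical bucket-name check (requires: pip install boto3 and configured AWS credentials).
    import boto3

    s3 = boto3.client("s3")
    s3.head_bucket(Bucket="my-fme-staging-bucket")  # raises ClientError if missing or inaccessible
    print("Bucket is reachable")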

Azure Cloud Storage
If you selected Microsoft Azure Data Lake Gen 2, you will need to fill in the following parameters:
Cloud Authentication Method: 

  • Web Connection - allows you to connect via a Web Connection in FME
  • Microsoft SAS - allows you to connect via a Shared Access Signature (SAS)

Azure Data Lake Container: 

  • Enter the name of the Azure Data Lake Container you want to use as a staging location.
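Similarly, if you are connecting with a SAS, you can confirm the container name and SAS outside of FME with the azure-storage-blob package. The account URL, container name, and SAS value below are placeholders.

    # Hypothetical container check (requires: pip install azure-storage-blob).
    from azure.storage.blob import ContainerClient

    container = ContainerClient(
        account_url="https://yourstorageaccount.blob.core.windows.net",
        container_name="fme-staging",
        credential="sv=...your-SAS-token...",
    )
    print(container.exists())  # True if the container is reachable with this SAS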

In this example, we are using a Unity Catalog Volume as the Staging Location, so the Databricks web connection will look something like the following:

Click Test… to check that the connection is valid. Click Save to save the connection for future use.

 

Below is a visual representation of where the corresponding parameters are located in Databricks.

 

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.
