Introduction
With the SchemaScanner, you can easily extract and manipulate the schema of your datasets, tackling dynamic workspace issues such as schema standardization and schema drift. So what does the SchemaScanner give you in return? It’s simple: a list attribute containing each attribute name and its data type. Going forward in your workspace, this is all you need to manipulate your schema and create flexible workflows. Quality assurance and schema drift handling just got that much easier.
What is a Schema?
A schema, sometimes referred to as the "data model", can be described as the structure of a dataset, or, more accurately, a formal definition of a dataset’s structure.
Each dataset has its own unique schema which includes feature types, permitted geometries, user-defined attributes and other rules that define or restrict its content. However, most users deal with attribute names and data types when working with schemas.
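In code terms, the part of a schema most users work with can be pictured as a mapping of attribute names to data types. This is a rough illustration only; the attribute names and types below are invented, and FME's internal schema representation is richer than a simple mapping:

```python
# A minimal stand-in for a schema: user-defined attribute names
# mapped to data types (names and types here are made up).
public_art_schema = {
    "Title": "varchar(100)",
    "Artist": "varchar(50)",
    "YearInstalled": "int16",
    "Latitude": "float64",
    "Longitude": "float64",
}

def same_schema(a, b):
    # Two datasets share a schema only if both the attribute
    # names and their data types match.
    return a == b
```

Under this simplified view, schema drift is just two of these mappings failing to compare equal.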
Using the SchemaScanner
The SchemaScanner processes features and retrieves their schema by scanning for the attribute name and its data type. It will either scan all features or just a specified number of them. There’s also the option to exclude attributes by using a Regular Expression to ensure a clean schema output.
The resulting output is a new schema feature output via the &lt;Schema&gt; output port. This new feature is also given the special attribute and value fme_schema_handling = schema_only, which allows the feature to be recognized by a dynamic writer as a schema feature. If you wish to continue using the original input features, these are passed via the Output port. For more technical information on the SchemaScanner, check out the documentation.
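As a rough mental model of the behaviour described above (plain Python, not FME's actual implementation; the feature records, type names, and parameter handling are simplified for illustration):

```python
import re

def scan_schema(features, ignore_pattern=None, max_features=None):
    """Sketch of a SchemaScanner-style scan over feature records
    (each a dict of attribute name -> value)."""
    ignore = re.compile(ignore_pattern) if ignore_pattern else None
    schema = {}
    for i, feature in enumerate(features):
        if max_features is not None and i >= max_features:
            break  # mimic scanning only a specified number of features
        for name, value in feature.items():
            if ignore and ignore.search(name):
                continue  # mimic the Ignore Attributes Containing parameter
            schema.setdefault(name, type(value).__name__)
    # The schema "feature" carries the special handling flag so a
    # dynamic writer treats it as schema only, not as data.
    return {"fme_schema_handling": "schema_only", "attributes": schema}

features = [
    {"Title": "Digital Orca", "Year": 2009, "fme_feature_type": "PublicArt"},
    {"Title": "A-maze-ing Laughter", "Year": 2009, "fme_feature_type": "PublicArt"},
]
schema = scan_schema(features, ignore_pattern=r"^fme")
# schema["attributes"] -> {"Title": "str", "Year": "int"}
```

Note that the real transformer infers format-appropriate data types rather than Python type names; the shape of the result, a name-and-type list plus the schema_only flag, is the part that carries over.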
Why you might want to use the SchemaScanner in your workspace
- You want to ensure your dynamic writer is receiving a valid schema
- You don’t know the schema of incoming data and want to make sure it meets certain standards before being used in a dynamic writer
- You want to modify the schema before it reaches the writer
- When used in a dynamic workspace, schema features should be output from the SchemaScanner first using the Output Schema Before Data Features parameter
- Most schemas contain attributes you don’t need in your final output dataset, so be sure to make use of the Ignore Attributes Containing parameter
How to Use the SchemaScanner
You can find the following examples in the template workspace under the Files section on the right side of this page.
Understanding the SchemaScanner output
First off, we will explore the SchemaScanner output, and how the SchemaScanner works. In this example, we’ll be reading a CSV dataset of PublicArt features in the City of Vancouver, British Columbia.
After adding the dataset to the workspace and connecting it to the SchemaScanner, we need to set up a few parameters to ensure a clean output schema feature.
Output Schema Before Data Features is set to Yes because we are most likely looking to use this transformer in a dynamic workflow. That means we want our schema feature to be output and reach our writer before the actual data features we want to write out. Without knowing the schema, the dynamic writer will fail to adapt to the new schema before the actual data features reach it.
Ignore Attributes Containing is one of the more powerful parameters because it allows us to clean up our schema attributes before they exit the transformer. Using the Regular Expression ^multi|^fme|^csv we are able to exclude these attributes from our output schema.
The resulting output schema contains attribute names and their data types, with the special attribute fme_schema_handling set to schema_only. fme_schema_handling ensures that the attributes are only used as a schema in a Dynamic workflow, and are not output as features. The following is the output schema as seen through the Visual Preview Feature Information window.
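To see exactly which attributes a pattern like ^multi|^fme|^csv excludes, it can help to test it directly. A quick check in Python (the attribute names below are illustrative, not the actual CSV dataset's):

```python
import re

# The same alternation used in Ignore Attributes Containing: drop any
# attribute whose name starts with "multi", "fme", or "csv".
pattern = re.compile(r"^multi|^fme|^csv")

names = ["Title", "fme_geometry", "csv_line_number", "multi_reader_id", "Artist"]
kept = [n for n in names if not pattern.search(n)]
# kept -> ["Title", "Artist"]
```

Because each alternative is anchored with ^, only prefixes are matched; an attribute like "ArtistBio_csv" would survive this particular pattern.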
Enabling dynamic SchemaMapper workflows with the SchemaScanner
Attribute mapping, which is the act of transforming and/or connecting one or more attributes of a feature to a new attribute or set of attributes, can easily be done in FME with the SchemaMapper transformer. However, when the goal is to output these newly mapped features through a dynamic writer, the writer needs an updated schema to work from, since the SchemaMapper has changed the attributes.
This is where the SchemaScanner comes into play, exposing the schema of the newly mapped attributes to create a new schema feature. In the workspace below, we read in a shapefile of bike paths and map the attributes using a SchemaMapper. The SchemaScanner is used after the SchemaMapper to expose the schema of the newly mapped attributes. It’s also important to note that Output Schema Before Data Features should be set to Yes; otherwise the dynamic writer will receive the features before it receives a schema, resulting in no features being written. We also specify ^fme|^multi|^_schemamapper|^shapefile for the Ignore Attributes Containing parameter.
To allow the dynamic writer to receive the schema generated by the SchemaScanner, we set the parameters as follows:
- Schema Sources: “Schema From Schema Feature”
- Schema Definition Name: @Value(fme_feature_type_name)
The dynamic output is a Shapefile containing the newly mapped schema.
Dynamically read and write multiple files with different schemas
In this scenario, we dynamically read and write multiple files. Normally this would work fine without the SchemaScanner, but in this instance the files have different schemas (attributes). Rather than adding a matching reader and writer for each file, we want to keep things clean and simple with only one dynamic reader and writer. Using the Group By parameter in the SchemaScanner makes this possible by assigning the schema object to the correct features.
The dynamically read CSV files contain an extra attribute called fme_feature_type. This attribute tells us which file each feature belongs to and will allow us to group features with their corresponding schema. You can also see that some attributes are missing values; these attributes have been brought in from other files and will later be filtered out by the SchemaScanner according to each feature type.
In the SchemaScanner, use the Group By parameter to group the schemas based on fme_feature_type. The result is two schemas, one for each of the files we dynamically read in.
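The Group By behaviour can be sketched in plain Python (a simplified model, not FME's implementation; the feature types and attribute names below are invented, and missing values are represented as None):

```python
from collections import defaultdict

def scan_schemas_by_group(features, group_key="fme_feature_type"):
    """One schema per group value; attributes with missing (None)
    values are skipped, mimicking attributes that belong to a
    different file's schema."""
    schemas = defaultdict(dict)
    for feature in features:
        group = feature[group_key]
        for name, value in feature.items():
            if value is None:
                continue  # attribute leaked in from another file
            schemas[group].setdefault(name, type(value).__name__)
    return dict(schemas)

features = [
    {"fme_feature_type": "Parks", "ParkName": "Stanley", "Area": 4.05, "StopId": None},
    {"fme_feature_type": "BusStops", "StopId": 51479, "ParkName": None, "Area": None},
]
schemas = scan_schemas_by_group(features)
# schemas -> one schema for "Parks", one for "BusStops"
```

Each group's schema then travels with its own fme_feature_type, which is what lets a single dynamic writer fan the features back out into separate files.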
Next, we add a Shapefile writer to our workspace with the Shapefile Definition set to Dynamic. In the writer feature type parameters, we set the parameters as follows:
- Shapefile Name: @Value(fme_feature_type)
- Schema Sources: “Schema From Schema Feature”
- Schema Definition Name: @Value(fme_feature_type_name)
When we run the workspace, the two files are written out separately, just as they entered the workflow.
Detecting Schema Drift
In many workplaces, data sources can change without notice. Fields, columns, and types can be added, removed, or changed depending on who is managing and uploading the data. The SchemaScanner is able to adapt to these changes and detect when the schema of a new dataset has changed (or drifted) from the original (or base) dataset schema.
For example, we have a dataset in our database with a table containing the following attributes:
- PathId, PathName, PathLength, PathType
Every few days, an employee uploads a new set of recordings to add on top of this data. The employee may not be familiar with the original dataset and has made some assumptions in naming the attributes. Their dataset contains the following attributes:
- PathId, PathName, Distance
In this case, the schema has drifted: the PathType attribute is missing, and the PathLength attribute has the incorrect name Distance. Before the data is uploaded to the existing dataset, we want to detect these schema changes so we can notify the user to correct their input data schema.
We use the SchemaScanner paired with a ListExploder to expose the schema of incoming datasets and compare it against the schema of our base dataset.
To compare the two schemas we use a ChangeDetector, set up to check the name attribute exposed by the ListExploder. If the schemas match perfectly, all the features will exit through the Unchanged output port. If any features exit the Inserted or Deleted output port, then not all of the attributes are valid; these features are re-routed to a Terminator, which stops the workspace.
The user will now have to re-format their input data to match the base dataset schema.
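The comparison the ChangeDetector performs on attribute names boils down to a set difference, which is easy to sketch in plain Python (a conceptual model only, using the attribute names from the example above):

```python
# Base schema attribute names vs. the incoming dataset's names.
base = {"PathId", "PathName", "PathLength", "PathType"}
incoming = {"PathId", "PathName", "Distance"}

inserted = incoming - base  # names the base schema doesn't expect
deleted = base - incoming   # expected names that went missing

drifted = bool(inserted or deleted)
if drifted:
    # In the workspace, these features route to the Terminator instead.
    message = f"Schema drift: inserted={sorted(inserted)}, deleted={sorted(deleted)}"
# inserted -> {"Distance"}; deleted -> {"PathLength", "PathType"}
```

Note that a renamed attribute shows up as one insertion plus one deletion; the check flags that something changed, but identifying the rename itself is left to the person fixing the input data.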
Additional Resources
FME 2022 Sneak Peek: SchemaScanner
Dynamic Workspaces Documentation
Strategies for Detecting and Handling Schema Drift Webinar
Tutorial: Dynamic Workflows