Introduction
With the SchemaScanner, you can easily extract and manipulate the schema of your datasets, tackling dynamic workspace issues such as schema standardization and schema drift. So what does the SchemaScanner give you in return? It’s simple: a list attribute containing each attribute name and its data type. Going forward in your workspace, this is all you need to manipulate your schema and create flexible workflows. Quality assurance and schema drift handling just got that much easier.
What is a Schema?
A schema, sometimes referred to as the "data model", can be described as the structure of a dataset, or, more accurately, a formal definition of a dataset’s structure.
Each dataset has its own unique schema which includes feature types, permitted geometries, user-defined attributes and other rules that define or restrict its content. However, most users deal with attribute names and data types when working with schemas.
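In code terms, the part of a schema most users work with can be pictured as a mapping of attribute names to data types. This is a rough illustration only; the attribute names and types below are invented, and FME's internal schema representation is richer than a simple mapping:

```python
# A minimal stand-in for a schema: user-defined attribute names
# mapped to data types (names and types here are made up).
public_art_schema = {
    "Title": "varchar(100)",
    "Artist": "varchar(50)",
    "YearInstalled": "int16",
    "Latitude": "float64",
    "Longitude": "float64",
}

def same_schema(a, b):
    # Two datasets share a schema only if both the attribute
    # names and their data types match.
    return a == b
```

Under this simplified view, schema drift is just two of these mappings failing to compare equal.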
Using the SchemaScanner
The SchemaScanner processes features and retrieves their schema by scanning for the attribute name and its data type. It will either scan all features or just a specified number of them. There’s also the option to exclude attributes by using a Regular Expression to ensure a clean schema output.
The resulting output is a new schema feature output via the &lt;Schema&gt; output port. This new feature is also given the special attribute and value fme_schema_handling = schema_only, which allows the feature to be recognized by a dynamic writer as a schema feature. If you wish to continue using the original input features, these are passed via the Output port. For more technical information on the SchemaScanner, check out the documentation.
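As a rough mental model of the behaviour described above (plain Python, not FME's actual implementation; the feature records, type names, and parameter handling are simplified for illustration):

```python
import re

def scan_schema(features, ignore_pattern=None, max_features=None):
    """Sketch of a SchemaScanner-style scan over feature records
    (each a dict of attribute name -> value)."""
    ignore = re.compile(ignore_pattern) if ignore_pattern else None
    schema = {}
    for i, feature in enumerate(features):
        if max_features is not None and i >= max_features:
            break  # mimic scanning only a specified number of features
        for name, value in feature.items():
            if ignore and ignore.search(name):
                continue  # mimic the Ignore Attributes Containing parameter
            schema.setdefault(name, type(value).__name__)
    # The schema "feature" carries the special handling flag so a
    # dynamic writer treats it as schema only, not as data.
    return {"fme_schema_handling": "schema_only", "attributes": schema}

features = [
    {"Title": "Digital Orca", "Year": 2009, "fme_feature_type": "PublicArt"},
    {"Title": "A-maze-ing Laughter", "Year": 2009, "fme_feature_type": "PublicArt"},
]
schema = scan_schema(features, ignore_pattern=r"^fme")
# schema["attributes"] -> {"Title": "str", "Year": "int"}
```

Note that the real transformer infers format-appropriate data types rather than Python type names; the shape of the result, a name-and-type list plus the schema_only flag, is the part that carries over.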
Why you might want to use the SchemaScanner in your workspace
- You want to ensure your dynamic writer is receiving a valid schema
- You don’t know the schema of incoming data and want to make sure it meets certain standards before being used in a dynamic writer
- You want to modify the schema before it reaches the writer
- When used in a dynamic workspace, schema features should be output from the SchemaScanner first using the Output Schema Before Data Features parameter
- Most schemas contain attributes you don’t need in your final output dataset, so be sure to make use of the Ignore Attributes Containing parameter
How to Use the SchemaScanner
You can find the following examples in the template workspace under the Files section on the right side of this page.
Understanding the SchemaScanner output
First off, we will explore the SchemaScanner output, and how the SchemaScanner works. In this example, we’ll be reading a CSV dataset of PublicArt features in the City of Vancouver, British Columbia.
After adding the dataset to the workspace and connecting it to the SchemaScanner, we need to set up a few parameters to ensure a clean output schema feature.
Output Schema Before Data Features is set to Yes because we are most likely looking to use this transformer in a dynamic workflow. That means we want our schema feature to be output and reach our writer before the actual data features we want to write out. Without knowing the schema, the dynamic writer will fail to adapt to the new schema before the actual data features reach it.
Ignore Attributes Containing is one of the more powerful parameters because it allows us to clean up our schema attributes before they exit the transformer. Using the Regular Expression ^multi|^fme|^csv we are able to exclude these attributes from our output schema.
The resulting output schema contains attribute names and their data types, with the special attribute fme_schema_handling set to schema_only. fme_schema_handling ensures that the attributes are only used as a schema in a Dynamic workflow, and are not output as features. The following is the output schema as seen through the Visual Preview Feature Information window.
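To see exactly which attributes a pattern like ^multi|^fme|^csv excludes, it can help to test it directly. A quick check in Python (the attribute names below are illustrative, not the actual CSV dataset's):

```python
import re

# The same alternation used in Ignore Attributes Containing: drop any
# attribute whose name starts with "multi", "fme", or "csv".
pattern = re.compile(r"^multi|^fme|^csv")

names = ["Title", "fme_geometry", "csv_line_number", "multi_reader_id", "Artist"]
kept = [n for n in names if not pattern.search(n)]
# kept -> ["Title", "Artist"]
```

Because each alternative is anchored with ^, only prefixes are matched; an attribute like "ArtistBio_csv" would survive this particular pattern.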
Enabling dynamic SchemaMapper workflows with the SchemaScanner
Attribute mapping, which is the act of transforming and/or connecting one or more attributes of a feature to a new attribute or set of attributes, can easily be done in FME with the SchemaMapper transformer. However, when the goal is to output these newly mapped features through a dynamic writer, the writer needs an updated schema to work from, since the SchemaMapper has changed the attributes.
This is where the SchemaScanner comes into play, exposing the schema of the newly mapped attributes to create a new schema feature. In the workspace below, we read in a shapefile of bike paths and map the attributes using a SchemaMapper. The SchemaScanner is used after the SchemaMapper to expose the schema of the newly mapped attributes. It’s also important to note that Output Schema Before Data Features should be set to Yes; otherwise the dynamic writer will receive the features before it receives a schema, resulting in no features being written. We also specify ^fme|^multi|^_schemamapper|^shapefile for the Ignore Attributes Containing parameter.
To allow the dynamic writer to receive the schema generated by the SchemaScanner, we set the parameters as follows:
- Schema Sources: “Schema From Schema Feature”
- Schema Definition Name: @Value(fme_feature_type_name)
The dynamic output is a Shapefile containing the newly mapped schema.
Dynamically read and write multiple files with different schemas
In this scenario, we dynamically read and write multiple files. Normally this would work fine without the SchemaScanner, but in this instance the files have different schemas (attributes). Rather than adding a matching reader and writer for each file, we want to keep things clean and simple with only one dynamic reader and writer. Using the Group By parameter in the SchemaScanner makes this possible by assigning the schema object to the correct features.
The dynamically read CSV files contain an extra attribute called fme_feature_type. This attribute tells us which file each feature belongs to and will allow us to group features with their corresponding schema. You can also see that some attributes are missing values; these attributes have been brought in from other files and will later be filtered out by the SchemaScanner according to each feature type.
In the SchemaScanner, use the Group By parameter to group the schemas based on fme_feature_type. The result is two schemas, one for each of the files we dynamically read in.
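The Group By behaviour can be sketched in plain Python (a simplified model, not FME's implementation; the feature types and attribute names below are invented, and missing values are represented as None):

```python
from collections import defaultdict

def scan_schemas_by_group(features, group_key="fme_feature_type"):
    """One schema per group value; attributes with missing (None)
    values are skipped, mimicking attributes that belong to a
    different file's schema."""
    schemas = defaultdict(dict)
    for feature in features:
        group = feature[group_key]
        for name, value in feature.items():
            if value is None:
                continue  # attribute leaked in from another file
            schemas[group].setdefault(name, type(value).__name__)
    return dict(schemas)

features = [
    {"fme_feature_type": "Parks", "ParkName": "Stanley", "Area": 4.05, "StopId": None},
    {"fme_feature_type": "BusStops", "StopId": 51479, "ParkName": None, "Area": None},
]
schemas = scan_schemas_by_group(features)
# schemas -> one schema for "Parks", one for "BusStops"
```

Each group's schema then travels with its own fme_feature_type, which is what lets a single dynamic writer fan the features back out into separate files.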
Next, we add a Shapefile writer to our workspace with the Shapefile Definition set to Dynamic. In the writer feature type parameters, we set the parameters as follows:
- Shapefile Name: @Value(fme_feature_type)
- Schema Sources: “Schema From Schema Feature”
- Schema Definition Name: @Value(fme_feature_type_name)
When we run the workspace, the two files are written out separately, just as they entered the workflow.
Detecting Schema Drift
In many workplaces, data sources can change without notice. Fields, columns, and types can be added, removed, or changed depending on who is managing and uploading the data. The SchemaScanner is able to adapt to these changes and detect when the schema of a new dataset has changed (or drifted) from the original (or base) dataset schema.
For example, we have a dataset in our database with a table containing the following attributes:
- PathId, PathName, PathLength, PathType
Every few days, an employee uploads a new set of recordings to add on top of this data. The employee may not be familiar with the original dataset and has made some assumptions in naming the attributes. Their dataset contains the following attributes:
- PathId, PathName, Distance
In this case, the schema has drifted: the PathType attribute is missing, and the PathLength attribute has the incorrect name Distance. Before the data is uploaded to the existing dataset, we want to detect these schema changes so we can notify the user to correct their input data schema.
We use the SchemaScanner paired with a ListExploder to expose the schema of incoming datasets and compare it against the schema of our base dataset.
To compare the two schemas we use a ChangeDetector, set up to check the name attribute exposed by the ListExploder. If the schemas match perfectly, all the features will exit through the Unchanged output port. If any features exit the Inserted or Deleted output port, then not all of the attributes are valid; these features are re-routed to a Terminator, which stops the workspace.
The user will now have to re-format their input data to match the base dataset schema.
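The comparison the ChangeDetector performs on attribute names boils down to a set difference, which is easy to sketch in plain Python (a conceptual model only, using the attribute names from the example above):

```python
# Base schema attribute names vs. the incoming dataset's names.
base = {"PathId", "PathName", "PathLength", "PathType"}
incoming = {"PathId", "PathName", "Distance"}

inserted = incoming - base  # names the base schema doesn't expect
deleted = base - incoming   # expected names that went missing

drifted = bool(inserted or deleted)
if drifted:
    # In the workspace, these features route to the Terminator instead.
    message = f"Schema drift: inserted={sorted(inserted)}, deleted={sorted(deleted)}"
# inserted -> {"Distance"}; deleted -> {"PathLength", "PathType"}
```

Note that a renamed attribute shows up as one insertion plus one deletion; the check flags that something changed, but identifying the rename itself is left to the person fixing the input data.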
Additional Resources
FME 2022 Sneak Peek: SchemaScanner
Dynamic Workspaces Documentation
Strategies for Detecting and Handling Schema Drift Webinar
Tutorial: Dynamic Workflows