Dynamic Workflows: Dealing with Schema Drift

Files

5-DealingWithSchemaDrift.zip
- 200 KB
- Download

Introduction

In many workplaces, data sources and their schemas can change without notice. Attributes and data types can be added, removed, and changed depending on who is managing and uploading the data.

Imagine a scenario where a new hire doesn’t know the current dataset schemas and tries to update a dataset with newly created data. Now, you either have a dataset with unwanted additional columns that are just a repetition of the old columns, or you’re missing data in your existing dataset because the schemas didn’t match. This can be a huge headache for whoever manages the data in the long term, and nobody likes going back to correct datasets.

That’s why with the SchemaScanner transformer, we can be proactive by detecting and reacting to schema drift in our datasets. These two essential steps make it easy to adapt to changing schemas with FME.

For example, we have a dataset in our database with a table containing the following attributes:

PathID, PathName, PathLength, PathType

Every few days, an employee uploads a new set of recordings to add on top of this data. The employee may not be familiar with the original dataset and has made some assumptions in naming the attributes. Their dataset contains the following attributes:

PathId, PathName, Distance

Schema drift visualized

In this case, the schema has drifted, with the PathType attribute missing and the PathLength attribute incorrectly named Distance.

Before the user can upload the data to the existing dataset, we want to detect schema changes so we can notify the user to update their input data schema.

This is where the SchemaScanner comes into use. It allows us to extract the schema of the incoming dataset and compare it against the known, existing dataset before allowing the data to be uploaded. If the schemas match, then all is good to go, and we can allow the data to flow freely into the base dataset. However, if the data’s schema doesn’t match the base dataset, we need to address what is going on here. In our case, we’re going to create a Schema Change Report that we can then feed back to the user, whether via email or an FME Flow App.

Step-by-step Instructions

Part 1: Detecting Schema Drift

The first step in our example is to detect the schema drift. We will do this using the SchemaScanner and the ChangeDetector transformer to detect changes between the original schema and the uploaded data's schema.

Before beginning, download the SchemaDrift_Datasets.zip file located in the Files section of this article.

1. Open FME Workbench

First, open up a blank workspace in FME Workbench.

2. Add a Shapefile Reader

Add a reader to the workspace and set the folloiwing:

Format: Esri Shapefile
Dataset: <path>\BikePaths_L.shp
- Click on the ellipsis to navigate to the location of the file on your computer

Click OK to add the reader to the canvas.

Shapefile Reader parameters - original dataset

3. Add Another Shapefile Reader

Now that we have the original dataset in our workspace, we want to add the updated data to compare against it. Again, add a reader to the canvas and set the following:

Format: Esri Shapefile
Dataset: <path>\Updated_BikePaths_L.shp
- Click on the ellipsis to navigate to the location of the file on your computer
Workflow Options: Single Merged Feature Type
- The dataset may change at any time, so we want to ensure we can read the data dynamically

Click OK to add the reader to your canvas.

Shapefile Reader parameters - new upload dataset

Improving workspace performance tip: Since we only need to receive 1 feature from the reader to extract the dataset’s schema, we can limit the reader to 1 feature. To do this, go to the Navigator Window, expand the reader’s parameters, click on Parameters > Features to Read >, and set Max Features to Read to 1. You can do this for both the shapefile readers.

Workspace performance tip

4. Add a SchemaScanner Transformer

Next, we need to extract the schema from the source base dataset and prepare it for comparison against the newly uploaded dataset. To do this, we can add a SchemaScanner to our workspace and connect the BikePaths_L shapefile feature type to it.

5. Setting the SchemaScanner Parameters

The SchemaScanner on its own won’t work with our dataset, because it will produce schema features that we don’t want. We want to keep only the original exposed attributes and remove any additional attributes within the schema related to FME or the format itself.

Open up the SchemaScanner parameters. For Ignore Attributes Containing, set the value to: ^multi|^fme|^shape

This will remove any attributes that contain "multi", "fme", or "shape" from the scanned schema.

Press OK to accept the new parameters.

SchemaScanner parameters

6. Add a ListExploder

At this point, if we were to run the workspace, the output schemas from the SchemaScanner would not be visible in the Visual Preview table as they remain as part of the attributes{} list; they would only be visible in the Feature Information window. Add a ListExploder transformer to your workspace and connect the SchemaScanner <Schema> output port to it.

In the parameters, set the List Attribute to attribute{}.

ListExploder parameters

7. Duplicate the SchemaScanner & ListExploder

We also want to extract the schema from the updated dataset for comparison against the original dataset. Hold the CTRL key and select both the SchemaScanner and the ListExploder. Then, duplicate (CTRL+D or right-click > duplicate) the transformers. Connect the Updated_BikePaths_L dataset to SchemaScanner_2. Ensure that the SchemaScanner_2 is connected to the ListExploder_2 transformer. Your workspace should now look like this:

Step 8 workspace screenshot

8. Inspect Output

At this point, we can run the workspace with Data Caching Enabled to see what our output schema looks like. We’re expecting to see a list of attribute names and their data types. Below is an example of the original dataset’s ListExploder output.

fme_feature_type_name denotes which dataset the schema belongs to. "name" is the name of the dataset's attributes. "fme_data_type" is the associated data type for each attribute.

9. Detecting Changes in the Schema

Now that we’ve extracted the schemas from our datasets into a set of attributes, we can use the ChangeDetector transformer to compare the schemas.

Add a ChangeDetector transformer to your workspace. Connect the ListExploder for the original dataset to the Original input port. Connect the ListExploder_2 to the Revised input port using the updated dataset. Open the ChangeDetector parameters. Under Check Attributes, click the ellipsis next to Selected Attributes and select name.

Press OK twice to accept the transformer parameters.

Now, if we run the workspace, we can see that the schema has changed. One new attribute has been inserted, two attributes have been deleted, and two remain Unchanged.

Part 2: Creating a Schema Change Report

So far in our workflow, we’ve exposed the schemas of our datasets and compared them to identify which attributes have changed. In the next part of this exercise, we look at how to create a report on schema changes.

Perhaps we’re moving towards a more manual approach, where the user receives an HTML report listing which attributes have changed and how the schema differs from the original dataset.

1. Add an AttributeCreator

We want to create a message for each attribute added since the original schema. To do this, add an AttributeCreator transformer to your workspace and connect the ChangeDetector Inserted output port to it.

Open the AttributeCreator parameters and set the following:

New Attribute: _message
Attribute Value: Unexpected Attribute: `@Value(name)' encountered in the input dataset.

AttributeCreator parameters

Press OK to accept the parameters and close the transformer.

2. Add a Second AttributeCreator

We also want to create a message for each attribute that has been deleted since the original schema. To do so, we can add another AttributeCreator and connect the ChangeDetector’s Deleted output port to it.

Open the AttributeCreator parameters and set the following:

New Attribute: _message
Attribute Value: Missing Attribute: `@Value(name)' is not present in the input dataset.

AttributeCreator_2 parameters

3. Filtering for a Valid Schema

We need a way to test if there are any features coming from the ChangeDetector. If no features are output from the two AttributeCreators, then that tells us the schema is valid and no error messages were produced.
Add a NoFeaturesTester custom transformer to the workspace (FME will automatically download the transformer from the FME Hub). Connect both the AttributeCreators to it.
NoFeaturesTester workflow

4. Create an HTML Report for Failed Schemas

Now that we’ve bundled our messages into a list that will be presentable in an HTML document, we can go ahead and create the report.
Add an HTMLReportGenerator to your workspace and connect both the AttributeCreators to it.

Open up the HTMLReportGenerator parameters. Under Page Contents, change the existing Chart(Bar) page to Header by clicking on it and selecting Header. Click anywhere on the right side of the dialog to confirm the page change; the parameters on the right side will change to the Header parameters. Set the Header Content Settings as follows:

Text: Schema Check Failed
Header Level: H4
Text Alignment: Left
Color: 0,0,0

Now that we’ve set up a simple header, we can add our list of messages.

Add another page by clicking the small plus (+) symbol in the bottom left, then set it to Custom HTML. Under Content Settings, set the Custom HTML as follows:

The upload was halted because the source schema has been changed! Please check the daily input and log for more details. <br>
Below you can find more details on missing or unexpected attributes.<br> <br>

Next, add another page. Set it as a Table and change the Table Style to Striped. Under Column Settings, set Column Contents to the _message attribute, and set the Column Name to:

Missing or Unexpected Attributes

HTMLReportGenerator Table parameters

Below this, add another page and set it to a Separator. Set the thickness to 2.
Add one last page and set it to Custom HTML: Set the Custom HTML as follows:

Thanks for your cooperation.

HTMLReportGenerator parameters

Press OK to accept the parameters and close the transformer.

5. Cleaning up the HTML Layout

Before sending out the HTML to the HTML writer, we want to clean up and organize the HTML layout. Add an HTMLLayouter to the canvas and connect the HTMLReportGenerator to it. We can use the default parameters. The HTMLLayouter is commonly paired with the HTMLReportGenerator to create a Bootstrap grid or vertical layouts. In this case, the default parameters will produce a vertical layout.

6. Create an HTML Report for Valid Schemas

We also want our users to be notified if their schema is valid, not just if their schema has failed. The message needs to go both ways to make the user understand what actions they need to take next. Add another HTMLReportGenerator to the canvas and connect the NoFeaturesTester NOINPUT output port to it.

Open up the HTMLReportGenerator_2 parameters. Under Page Contents, change the existing Chart(Bar) page to Header by clicking on it and selecting Header. Click anywhere on the right side of the dialog to confirm the page change; the parameters on the right side will change to the Header parameters. Set the Header Content Settings as follows:

Text: Schema Check Passed
Header Level: H4
Text Alignment: Left
Color: 0,0,0

Add another page by clicking the small plus (+) symbol in the bottom left, then set it to Custom HTML. Under Content Settings, set the Custom HTML as follows:

Hello,

Your data has passed the schema validation test and is ready to be uploaded.

Thanks for your cooperation.

HTMLReportGenerator_2 parameters

Press OK to close the HTMLReportGenerator_2 and accept the new parameters. Add another HTMLLayouter to the canvas and connect the HTMLReportGenerator_2 to it. Again, we can use the default parameters to create a vertical bootstrap layout.

Step 6 Workspace

7. Output the HTML Report Back to the User

Whether you plan to use this workspace in a workspace app or just create a simple HTML document, we need to add a writer to output our final schema report!

Add a writer to your workspace and set the Format to HTML. Browse to an output file location and then name the file schema_failed.html. Press OK to add the writer to your workspace. Now, let's connect the HTMLLayouter to the HTML writer feature type.

HTML Writer parameters

We can repeat the process above to create an HTML writer for the valid schema report. Add another HTML writer to your workspace. Set the Dataset parameter location to the same output location as above, then name the file schema_passed.html.

At this point, everything in our workspace is complete, and we can output our data to produce the report. Connect the schema_passed writer feature type to the HTMLLayouter_2.

8. Run the Workspace and Inspect the Results

Now that we’ve set up our workspace to both detect changes in the schema and produce a report on the condition of the schema, we can try running the workspace to see our results! Try running the workspace, click Open Containing Folder on the HTML Writer that receives features from its HTMLLayouter.
Step 8 Workspace

Open up the resulting HTML file in your web browser by double-clicking on it. You should see a Schema Check Failed report.
Schema Change Report example

Conclusion

Throughout this exercise, we learned how to use the SchemaScanner alongside the ChangeDetector to perform change detection on two dataset schemas. The result? We’re now able to generate a report on whether the schemas match. With this new information, users will feel more confident when updating datasets and inserting new records. This also means time saved by not having to go back and fix the dataset after a user has uploaded features with a mismatched schema!

How else can we use this workspace? Instead of writing our data to an HTML file, we could email a user the results of the Schema Change Report as the HTML content generated by the HTMLReportGenerator and the HTMLLayouter. Alternatively, if you’re familiar with FME Flow Apps, you could create an app that lets users upload a file to the server app themselves and instantly receive a copy of the Schema Change Report in their browser upon submission.