Dynamic Workflows with a Multi-Dataset SchemaScanner

Liz Sanderson

FME Version

  • FME 2022.2

Introduction

At a basic level, the SchemaScanner is simple to operate. More complex scenarios arise, however, depending on how its parameters are set and on the actual values in the incoming data.

The Empty Attributes parameter controls how the schema is built when incoming data has missing values:

SchemaScannerB1.png

However, this only applies when an attribute has no value in any record of the dataset. In this dataset, for example, every value for MarketType is null. MergedAddress also has null values, but not for every record:

SchemaScannerB2.png

Given the above parameters, the output schema will not contain the MarketType field, because there are no values at all for it:

SchemaScannerB3.png

However, MergedAddress is included, because some records still do have values.

Now suppose we want to include MarketType in the output, even though it has no values. There are two alternatives to the Ignore option:

SchemaScannerB4.png

Because there are no data values, the SchemaScanner cannot scan the data to guess the data type. Instead, it can either use a default data type (a varchar of unknown length) or it can trace backwards through the workspace trying to find any clues to the data type.

In this case, the reader itself has that information:

SchemaScannerB5.png

…so the SchemaScanner will use that and create an output schema where MarketType = varchar(15).
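
The three behaviours described above can be modelled in a few lines of plain Python. This is only a sketch of the idea, not how FME implements the SchemaScanner: the function `scan_schema`, the mode names `"ignore"`, `"default"` and `"upstream"`, and the sample values are all made up for illustration. Only the attribute names and the varchar(15) reader type come from the example above.

```python
# Illustrative model only; not FME's implementation or API.
DEFAULT_TYPE = "varchar"  # stand-in for "a varchar of unknown length"


def scan_schema(features, empty_mode="ignore", upstream_schema=None):
    """Return {attribute: type} for a list of feature dicts.

    empty_mode (hypothetical name) selects what happens to an attribute
    that has no value in any feature: "ignore", "default" or "upstream".
    """
    upstream_schema = upstream_schema or {}

    # Collect attribute names in first-seen order.
    names = []
    for feature in features:
        for name in feature:
            if name not in names:
                names.append(name)

    schema = {}
    for name in names:
        values = [f.get(name) for f in features if f.get(name) not in (None, "")]
        if values:
            # Crude guess from the data: width of the longest value.
            schema[name] = f"varchar({max(len(str(v)) for v in values)})"
        elif empty_mode == "default":
            schema[name] = DEFAULT_TYPE
        elif empty_mode == "upstream" and name in upstream_schema:
            schema[name] = upstream_schema[name]
        # Otherwise ("ignore", or nothing found upstream) the attribute is dropped.
    return schema


# Made-up sample values: MarketType is empty everywhere, MergedAddress only sometimes.
features = [
    {"Day": "Saturday", "MergedAddress": "placeholder", "MarketType": None},
    {"Day": "Saturday", "MergedAddress": None, "MarketType": None},
]
reader_schema = {"MarketType": "varchar(15)"}  # what the reader knows

ignored = scan_schema(features)
defaulted = scan_schema(features, "default")
traced = scan_schema(features, "upstream", reader_schema)
```

In this model, `ignored` has no MarketType entry but keeps MergedAddress, `defaulted` gives MarketType the fallback type, and `traced` picks up varchar(15) from the reader schema.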

 

Step-by-step Instructions

The goal here is to add more data to the output from the previous exercise by merging different sets of source data.

1. Open Workspace
Open the workspace from the previous exercise. Now let's add in the Vancouver Farmers Market CSV dataset, keeping the existing Cedar Cottage data. You can do this in one of two ways:

1a. Add a New CSV Reader.
Either use Reader > Add Reader from the menubar or drag/drop the CSV file into the workspace canvas. Connect the new feature type to the AttributeManager input port.

1b. Edit the Existing CSV Reader
In the Navigator window, double-click the Source CSV parameter for the existing CSV Reader. When it opens, click the drop-down arrow and choose Select Multiple Folders/Files:

SchemaScannerB11.png

In the dialog that opens, click Add Files and choose the file Vancouver Farmers Market CSV. Because the existing reader is dynamic, you won’t even have to add a new feature type.

2. Run Workspace
Run the workspace and inspect the output. You’ll notice that the Vancouver Farmers Market records have no values for MarketType, so that attribute is null for them. Also notice that this data has an extra field called Offerings, which Cedar Cottage is missing:

SchemaScannerB6.png

These fields were included in the output CSV, even though not all of the data possessed them. That’s because the SchemaScanner creates an inclusive schema; i.e. if even one feature possesses an attribute, that attribute is included in the output.
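
As a rough illustration of what "inclusive" means here, the sketch below builds the union of attribute names across two datasets and pads each record to that union. It is plain Python rather than anything FME runs; `inclusive_attributes` is a hypothetical helper and the values are placeholders.

```python
def inclusive_attributes(*datasets):
    """Union of attribute names across all datasets, in first-seen order."""
    seen = []
    for dataset in datasets:
        for feature in dataset:
            for name in feature:
                if name not in seen:
                    seen.append(name)
    return seen


# Placeholder records: one dataset has MarketType, the other has Offerings.
cedar_cottage = [{"Day": "Saturday", "MarketType": "placeholder"}]
vancouver = [{"Day": "Thursday", "Offerings": "placeholder"}]

columns = inclusive_attributes(cedar_cottage, vancouver)

# Every record is written against the full column list; gaps become None (null).
rows = [
    {name: feature.get(name) for name in columns}
    for feature in cedar_cottage + vancouver
]
```

Here `columns` contains Day, MarketType and Offerings, and each record carries a null for whichever attribute its source lacked.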

3. Create Different Outputs for Each Day
Now let’s create a different file for each day of the week; i.e. we wish to divide food markets into different outputs depending on which day they take place.

Open the parameters dialog for the writer feature type.

Change the CSV File Name parameter from fme_feature_type to the Day attribute by clicking on the drop-down arrow. 

SchemaScannerB7.png


Re-run the workspace and inspect the outputs in a Text Editor. 

In the Saturday and Sunday files, some fields have null values, but the fields still exist because not every record is null:

SchemaScannerB12.png

In the Wednesday and Thursday files, some fields are entirely null. However, the fields still exist because there is only one output schema and every output dataset is using it.

SchemaScannerB13.png
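
The effect of fanning out on Day while keeping a single schema can be sketched in plain Python as follows. This is an analogy using the standard `csv` module, not the FME writer itself; `fanout_csv` and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def fanout_csv(rows, columns, key):
    """Return {key value: CSV text}. Every output shares the same columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    outputs = {}
    for value, members in groups.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
        writer.writeheader()          # same header for every file
        writer.writerows(members)     # None is written as an empty field
        outputs[value] = buffer.getvalue()
    return outputs


# Placeholder rows, already padded to one shared schema.
columns = ["Day", "MarketType", "Offerings"]
rows = [
    {"Day": "Saturday", "MarketType": "placeholder", "Offerings": None},
    {"Day": "Thursday", "MarketType": None, "Offerings": "placeholder"},
]

files = fanout_csv(rows, columns, "Day")
```

The Thursday output still has a MarketType column, even though it is empty for every Thursday row, because the column list is shared by all outputs.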

Let’s see how we could give each output a different schema.

4. Update SchemaScanner to Group Features
View the SchemaScanner parameters. Enable Group Processing and under Group By, choose the Day attribute.

SchemaScannerB8.png

Re-run the workspace. Notice that there are now four schema features, one for each day on which a market takes place:

SchemaScannerB9.png

Notice that fme_feature_type_name, the attribute that sets the name of the schema, is what distinguishes each schema feature from the others.

Inspect the output. If an attribute had no values at all in a given dataset, then that attribute is now absent from that dataset’s output. That’s because each dataset has its own schema, and the SchemaScanner is set to Ignore Empty Attributes.
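
Conceptually, grouping by Day means the scan runs once per group instead of once over everything. The sketch below models that in plain Python, with empty attributes ignored inside each group. It is an illustration under that assumption, not FME code; `schemas_by_group` and the sample features are hypothetical.

```python
from collections import defaultdict


def schemas_by_group(features, key):
    """Return {group value: [attributes with at least one value in that group]}."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature[key]].append(feature)

    schemas = {}
    for value, members in groups.items():
        names = []
        for feature in members:
            for name, attr_value in feature.items():
                # Keep an attribute only if some feature in this group has a value.
                if attr_value not in (None, "") and name not in names:
                    names.append(name)
        schemas[value] = names
    return schemas


# Placeholder features: MarketType is empty for every Thursday record.
features = [
    {"Day": "Saturday", "MarketType": "placeholder", "Offerings": None},
    {"Day": "Saturday", "MarketType": None, "Offerings": "placeholder"},
    {"Day": "Thursday", "MarketType": None, "Offerings": "placeholder"},
]

schemas = schemas_by_group(features, "Day")
```

With these sample features, the Saturday schema keeps all three attributes, while the Thursday schema drops MarketType because no Thursday record has a value for it.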

5. Output Individual Schemas
Finally, let’s say we wanted to give each output its own schema, but we didn’t want empty attributes to be ignored. How would we do that?

View the SchemaScanner parameters again. Change the Empty Attributes parameter from “Ignore” to “Interpret Upstream Schema (Advanced)”. Now FME will include those attributes, taking the information it needs to define them from wherever it can find it “upstream” in the workspace.

Once again, run the workspace and inspect the Schema features, in particular the one for Thursday markets.

SchemaScannerB10.png

Previously the Thursday market did not have a MarketType attribute in its schema. Now it does, even though all of the values are empty. The type is fme_varchar(22) because FME could determine that from the reader schema.


Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.
 
