Dynamic Workflows with a Multi-Dataset SchemaScanner

Files

4-WorkflowWithAMultiDatasetSchemaScanner.zip
- 60 KB

Introduction

At a basic level, the SchemaScanner is fairly simple to operate. In more complex scenarios, however, its output depends on how its parameters are set and on the actual values in the incoming data.

The Empty Attributes parameter controls the schema when incoming data has missing values:

However, it applies only when every record in the dataset is missing a value for a particular attribute. In this dataset, for example, every MarketType value is null. MergedAddress also has null values, but not for every record:

Given the above parameters, the output schema will not contain the MarketType field, because there are no values at all for it:

However, MergedAddress is included because some records still have values.
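
The Ignore behaviour described above can be pictured with a short Python sketch. This is not FME code and does not claim to show how the SchemaScanner works internally; the function name scan_schema and the sample records are hypothetical, and treating an empty string as missing is an assumption made for the illustration.

```python
def scan_schema(records):
    """Return the attribute names that have at least one non-null value,
    in the order they are first seen (hypothetical helper)."""
    schema = []
    for record in records:
        for name, value in record.items():
            # Assumption: None and "" both count as "no value".
            if value not in (None, "") and name not in schema:
                schema.append(name)
    return schema


# Made-up records: MarketType is null everywhere, MergedAddress only sometimes.
records = [
    {"MarketName": "A", "MarketType": None, "MergedAddress": "1 Main St"},
    {"MarketName": "B", "MarketType": None, "MergedAddress": None},
]

print(scan_schema(records))  # ['MarketName', 'MergedAddress']
```

MarketType is dropped because no record supplies a value for it, while MergedAddress survives because one record does.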

Now suppose we wanted to include MarketType in the output, even though it has no values. There are two alternatives to the Ignore option:

Because there are no data values, the SchemaScanner cannot infer the data type. Instead, it can either use a default data type (a varchar of unknown length) or trace backward through the workspace, looking for clues to the data type.

In this case, the reader itself has that information:

…so the SchemaScanner will use that and create an output schema where MarketType = varchar(15).
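
As a rough illustration of the three choices, the sketch below resolves the type of an attribute that has no values. It is plain Python, not FME's API: the mode strings "ignore", "default" and "upstream", the function name, and the fallback to the default type when the reader schema has no entry are all assumptions made for this example. Only the varchar(15) value comes from the reader schema mentioned above.

```python
DEFAULT_TYPE = "varchar"  # stand-in for "a varchar of unknown length"


def type_for_empty_attribute(name, mode, reader_schema):
    """Decide what type, if any, an all-null attribute gets (hypothetical)."""
    if mode == "ignore":
        return None  # attribute is left out of the output schema
    if mode == "default":
        return DEFAULT_TYPE
    if mode == "upstream":
        # Assumption: fall back to the default when nothing upstream is known.
        return reader_schema.get(name, DEFAULT_TYPE)
    raise ValueError(f"unknown mode: {mode!r}")


reader_schema = {"MarketType": "varchar(15)"}

for mode in ("ignore", "default", "upstream"):
    print(mode, type_for_empty_attribute("MarketType", mode, reader_schema))
```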

Step-by-step Instructions

The goal here is to add more data to the output from the previous exercise by merging multiple source datasets.

1. Open Workspace

Open the workspace from the previous exercise. Now, let's add the Vancouver Farmers Market CSV dataset, keeping the existing Cedar Cottage data. You can do this in one of two ways:

1a. Add a New CSV Reader

Either use Reader > Add Reader from the menu bar or drag and drop the CSV file onto the workspace canvas. Connect the new feature type to the AttributeManager input port.

1b. Edit the Existing CSV Reader

In the Navigator window, double-click the Source CSV parameter for the existing CSV Reader. When it opens, click the drop-down arrow and choose Select Multiple Folders/Files:

In the dialog that opens, click Add Files, then choose the Vancouver Farmers Market CSV file. Because the existing reader is dynamic, you won’t even have to add a new feature type.

2. Run Workspace

Run the workspace and inspect the output. You’ll notice that the Vancouver Farmers Market data has no values for MarketType, so those values are null. Also, notice that this data has an extra field called Offerings, which the Cedar Cottage data lacks:

These fields were included in the output CSV, even though not all of the data possessed them. That’s because the SchemaScanner creates an inclusive schema; i.e., if even one feature possesses an attribute, that attribute is included in the output.
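
An inclusive schema amounts to the union of the attributes seen across all features. The Python sketch below shows the idea only; it is not FME code, the helper names inclusive_schema and conform are hypothetical, and the two sample features use made-up values.

```python
def inclusive_schema(features):
    """Union of attribute names across all features, in first-seen order."""
    schema = []
    for feature in features:
        for name in feature:
            if name not in schema:
                schema.append(name)
    return schema


def conform(feature, schema):
    """Fit one feature to the schema; attributes it lacks become None."""
    return {name: feature.get(name) for name in schema}


# Made-up sample features standing in for the two source datasets.
cedar_cottage = {"MarketName": "Cedar Cottage", "Day": "Saturday", "MarketType": "Farmers"}
vancouver = {"MarketName": "Example Market", "Day": "Sunday", "Offerings": "Produce"}

schema = inclusive_schema([cedar_cottage, vancouver])
print(schema)  # ['MarketName', 'Day', 'MarketType', 'Offerings']
print(conform(vancouver, schema))
```

Every output record carries all four attributes, with None where its source dataset had nothing to offer.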

3. Create Different Outputs for Each Day

Now let’s create a different file for each day of the week; i.e., we wish to divide the food markets into different outputs depending on which day they take place.

Open the parameters dialog for the writer feature type.

Change the CSV File Name parameter from fme_feature_type to the Day attribute by clicking on the drop-down arrow.

Re-run the workspace and inspect the outputs in a text editor.

In the Saturday and Sunday files, some fields have null values, but the fields still exist because not every record has a null value for them:

In the Wednesday and Thursday files, some fields are entirely null. However, the fields still exist because there is only one output schema, and every output dataset uses it.
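
The effect of fanning out on Day while keeping a single schema can be sketched in Python as follows. This is an illustration under stated assumptions, not FME's writer: the function name fan_out and the sample features are hypothetical, and the CSV text is built in memory rather than written to files.

```python
import csv
import io
from collections import defaultdict


def fan_out(features, schema, key="Day"):
    """Group features by `key` and render one CSV per group,
    all sharing the same header (hypothetical helper)."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature[key]].append(feature)

    outputs = {}
    for day, rows in groups.items():
        buffer = io.StringIO()
        # restval fills attributes a feature lacks with an empty cell.
        writer = csv.DictWriter(buffer, fieldnames=schema, restval="",
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        outputs[day] = buffer.getvalue()
    return outputs


# Made-up features; only the Saturday one has a value for Offerings.
features = [
    {"MarketName": "X", "Day": "Saturday", "Offerings": "Produce"},
    {"MarketName": "Y", "Day": "Thursday"},
]
schema = ["MarketName", "Day", "Offerings"]

outputs = fan_out(features, schema)
print(outputs["Thursday"])
```

The Thursday output still has an Offerings column, even though it is empty for every Thursday record, because the header comes from the one shared schema.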

Let’s see how we could give each output a different schema.

4. Update SchemaScanner to Group Features

View the SchemaScanner parameters. Enable Group Processing and under Group By, choose the Day attribute.

Re-run the workspace. Notice that there are now four schema features, one for each day on which markets take place:

Notice that fme_feature_type_name - the attribute that sets the name of the schema - is what distinguishes one schema feature from another.

Inspect the output. If a dataset was missing an entire attribute, then that attribute is now absent in the output. That’s because each dataset has its own schema, and the SchemaScanner is set to Ignore Empty Attributes.
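
Grouping changes the scan from one schema for everything to one schema per group. A minimal Python sketch of that idea, assuming the Ignore behaviour for empty attributes, is shown below. It is not FME code; the function name schemas_by_group and the sample features are hypothetical.

```python
from collections import defaultdict


def schemas_by_group(features, key="Day"):
    """Build a separate schema per group, keeping only attributes
    that have at least one non-null value within that group."""
    schemas = defaultdict(list)
    for feature in features:
        group_schema = schemas[feature[key]]
        for name, value in feature.items():
            # Assumption: None and "" both count as "no value".
            if value not in (None, "") and name not in group_schema:
                group_schema.append(name)
    return dict(schemas)


# Made-up features: MarketType is null for every Thursday record.
features = [
    {"MarketName": "X", "Day": "Saturday", "MarketType": "Farmers"},
    {"MarketName": "Y", "Day": "Thursday", "MarketType": None},
    {"MarketName": "Z", "Day": "Thursday", "MarketType": None},
]

for day, day_schema in schemas_by_group(features).items():
    print(day, day_schema)
```

Here the Saturday schema keeps MarketType while the Thursday schema drops it, which mirrors the per-day output described above.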

5. Output Individual Schemas

Finally, let’s say we wanted to give each output its own schema, but we didn’t want empty attributes to be ignored. How would we do that?

View the SchemaScanner parameters again. Change the Empty Attributes parameter from “Ignore” to “Interpret Upstream Schema (Advanced)”. Now FME will include those attributes, taking the information it needs to create the schema from wherever it can find it “upstream” in the workspace.

Once again, run the workspace and inspect the Schema features, in particular the one for Thursday markets.

Previously, the Thursday market lacked a MarketType attribute in its schema. Now it has one, even though all of the values are empty. The type is fme_varchar(22) because FME could determine that from the reader schema.

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government Licence - Vancouver.