Data QA: Invalid Spatial Schemas

Files

invalidschema1.fmwt (20 KB)
invalidschema3.fmwt (30 KB)
invalidschema2.fmwt (30 KB)
invalidschemadataset.zip (10 KB)
InvalidSchema_2021.2.fmwt (40 KB)

Introduction

A dataset schema (data model) consists of multiple parts. Some parts relate to attributes, other parts relate to the spatial data.

The spatial part of a schema usually defines the feature types (layers, tables, etc.) and the geometry types (lines, points, polygons, etc.) that exist, or are permitted to exist, in a dataset.

An invalid schema occurs where a feature exists outside of the permitted feature types (for example, a layer of data has a different name from the one in the dataset specifications) or has a geometry type other than the one permitted (for example, a line feature exists on a layer meant for polygon features).

This can be important for internal (corporate) consistency and integrity, but also when using formats that are strictly defined by the table names and geometry types permitted.

FME can deal with format limitations automatically, but the user must define whether data meets a corporate data standard or not. Because there are various tests that can be made, there are various transformers in FME that can be used to test them. The following example and notes cover just a few of these.

Source Data

The source dataset for this example provides information on construction activity and projects that may affect the flow of traffic in the city of Vancouver. It is stored in GML format:

In theory, all of these features should consist of simple polygon geometries. The layer each item exists on should represent the organization undertaking the construction. Permitted values are:

shaw
hydro
telus
city
fortis
private
other

However, we can't be sure that the correct layers have been used, or the correct geometry, and we shall have to test that.

Step-by-Step Instructions

Part 1: Locating Invalid Feature Types

Follow these steps to learn how to identify source feature types (layers) that exist in a source dataset.

1. Start FME Workbench and begin with an empty canvas. Select Readers > Add Reader from the menubar.

Set the data format to OGC GML (Geography Markup Language) and select the attached GML dataset as the source. Set the Workflow Options to Single Merged Feature Type (to make sure all objects are read as a single layer) and click OK to add the reader:

Save the workspace.

2. Add a DuplicateFilter transformer
Connect it to the reader feature type. Inspect the parameters and set the Key Attribute to fme_feature_type:

fme_feature_type records the layer of the source data, so by filtering out a single example of each we have effectively created a list of feature types (layers) in the source dataset.
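The idea behind this step can be sketched outside FME in plain Python. This is not how the DuplicateFilter is implemented; it is a minimal illustration of "keep the first feature seen for each key value", assuming features are represented as dictionaries. The function name and the sample features are hypothetical.

```python
def unique_feature_types(features):
    """Return each distinct fme_feature_type once, in first-seen order."""
    seen = set()
    unique = []
    for feature in features:
        layer = feature["fme_feature_type"]
        if layer not in seen:
            # First feature with this layer name: comparable to the Unique port
            seen.add(layer)
            unique.append(layer)
    return unique


# Hypothetical sample features, not taken from the real dataset
sample_features = [
    {"fme_feature_type": "shaw"},
    {"fme_feature_type": "private"},
    {"fme_feature_type": "shaw"},
    {"fme_feature_type": "tellus"},
    {"fme_feature_type": "xyz"},
]

layers = unique_feature_types(sample_features)
print(layers)
```

Each layer name appears once in the result, however many features share it.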

3. Select Writers > Add Writer from the menubar
Set the data format to Text File and define a location to write the text file to. Connect the DuplicateFilter:Unique port to the Text File writer's feature type. Map the attribute fme_feature_type to the writer's text_line_data attribute (either by drawing a connection or by using an AttributeManager transformer):

4. Run the translation
Open the output text file. We now have a list of all layers that are used in the source dataset, both valid and invalid:

For instance, "private" is valid, "xyz" is invalid, and "tellus" is obviously a typo that should be "telus".

Part 2: Counting and Fixing Invalid Feature Types

We now have a list of feature types, and can see that some are invalid. But to count or filter these we need to know which layers are permitted, and preferably to have these stored in a file somewhere.

There are a number of transformers that could be used to match a feature type to this list - for example, the AttributeFilter - but here we'll use the DatabaseJoiner.

5. Add a DatabaseJoiner transformer into the workspace
Connect it to a second output from the source dataset:

Inspect the DatabaseJoiner parameters. Set them up as follows:

Reader Format: CSV (Comma Separated Value)
Dataset: PermittedLayerList.txt
Reader Parameters:
- Field Names Line: <none> (i.e. delete it)
- Data Starts Line: 1
Table: CSV
Join On:
- Feature Attribute: fme_feature_type
- Table Field: col0

With this setup, any feature that emerges from the Unjoined port has an invalid feature type.
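The join logic can be sketched in plain Python as a lookup against the permitted list. This is an illustration only, not the DatabaseJoiner's implementation. It assumes that PermittedLayerList.txt holds one permitted layer name per line with no header row (its contents are not shown in this article, so the text below is built from the permitted values listed earlier), and the function names and sample features are hypothetical.

```python
import csv
import io

# Assumed contents of PermittedLayerList.txt: one layer per line, no header
PERMITTED_LAYER_LIST = "shaw\nhydro\ntelus\ncity\nfortis\nprivate\nother\n"


def load_permitted_layers(text):
    """Read the first column (col0) of a headerless CSV into a set."""
    return {row[0].strip() for row in csv.reader(io.StringIO(text)) if row}


def split_by_permitted(features, permitted):
    """Split features into (joined, unjoined) by fme_feature_type."""
    joined, unjoined = [], []
    for feature in features:
        if feature["fme_feature_type"] in permitted:
            joined.append(feature)
        else:
            unjoined.append(feature)
    return joined, unjoined


permitted = load_permitted_layers(PERMITTED_LAYER_LIST)

# Hypothetical sample features
sample_features = [
    {"fme_feature_type": "shaw"},
    {"fme_feature_type": "tellus"},
    {"fme_feature_type": "xyz"},
    {"fme_feature_type": "private"},
]

joined, unjoined = split_by_permitted(sample_features, permitted)
print([f["fme_feature_type"] for f in unjoined])
```

The comparison is exact and case-sensitive in this sketch, which is why "tellus" ends up in the unjoined list.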

6. Add a StatisticsCalculator
Connect it to the DatabaseJoiner:Unjoined output port. Inspect the parameters and set fme_feature_type as the Attribute to Analyze. Click under the Total Count column to add it as a Statistic to Calculate.

7. Run the workspace and inspect the StatisticsCalculator:Complete output port.
The output will show all the features that have an incorrect layer, with the layer recorded on the attribute fme_feature_type, and the number of invalid features recorded in the attribute fme_feature_type.total_count:

So now we have a count of the features with invalid layers. We can't fix the layer names - because we don't know what they should be - but we have cleaned the dataset by filtering out these invalid features.
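As a rough illustration of the counting in Part 2, the following Python sketch totals the features that failed the join. The per-layer breakdown is an extra that goes beyond what the step above describes, and the function name and sample data are hypothetical.

```python
from collections import Counter


def count_invalid(unjoined_features):
    """Return (total count, per-layer counts) for features with invalid layers."""
    per_layer = Counter(f["fme_feature_type"] for f in unjoined_features)
    total = sum(per_layer.values())
    return total, per_layer


# Hypothetical features that failed the permitted-layer lookup
unjoined = [
    {"fme_feature_type": "tellus"},
    {"fme_feature_type": "xyz"},
    {"fme_feature_type": "tellus"},
]

total, per_layer = count_invalid(unjoined)
print(total)
print(dict(per_layer))
```

A per-layer view like this can help decide which invalid names are likely typos worth correcting at the source.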

Part 3: Locating and Fixing Invalid Geometry Types

Follow these steps to learn how to identify source features that have an incorrect geometry type.

8. Select Writers > Add Writer from the menubar
Set the data format to Esri Shapefile and define a location to write the dataset to. For the Shapefile Definition parameter, select Copy from Reader:

Connect the newly created writer feature type to the DatabaseJoiner:Joined port.

9. Inspect the parameters for the new writer feature type

For the Shapefile Name click the drop-down arrow and select Attribute Value > fme_feature_type. This will ensure the data is written to the same layer it came from. Now set Geometry to shape_polygon:
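Naming the output after an attribute value amounts to grouping features by that value. The Python sketch below illustrates the grouping only; it does not write Shapefiles, and the ".shp" names, the function name, and the sample features are hypothetical.

```python
from collections import defaultdict


def group_by_layer(features):
    """Group features by fme_feature_type, one group per output layer."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature["fme_feature_type"]].append(feature)
    return dict(groups)


# Hypothetical sample features
sample = [
    {"id": 1, "fme_feature_type": "shaw"},
    {"id": 2, "fme_feature_type": "city"},
    {"id": 3, "fme_feature_type": "shaw"},
]

groups = group_by_layer(sample)
for layer, members in groups.items():
    # Illustrative output name only; no file is written here
    print(f"{layer}.shp", len(members))
```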

Run the workspace. Check the translation log. Notice that there are 172 warning messages!

Some features are rejected from the output because they are not an area geometry:

Shapefile Writer: Feature type 'shaw' received an incompatible geometry type 'polyline', expected 'polygon'. The following feature cannot be written to this file and will be discarded

So FME has fixed some geometry types where it can and rejected others. We also have a count of how many features were rejected.
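The rejection logic can be illustrated with a simple geometry-type check in Python. This sketch only separates features by a declared type; it does not reproduce any geometry conversion FME may perform, and the geometry_type field, the function name, and the sample features are hypothetical.

```python
def split_by_geometry(features, expected="polygon"):
    """Split features into (accepted, rejected) by declared geometry type."""
    accepted, rejected = [], []
    for feature in features:
        if feature["geometry_type"] == expected:
            accepted.append(feature)
        else:
            rejected.append(feature)
    return accepted, rejected


# Hypothetical sample features
sample = [
    {"fme_feature_type": "shaw", "geometry_type": "polygon"},
    {"fme_feature_type": "shaw", "geometry_type": "polyline"},
    {"fme_feature_type": "city", "geometry_type": "polygon"},
    {"fme_feature_type": "hydro", "geometry_type": "point"},
]

accepted, rejected = split_by_geometry(sample)
print(len(accepted), len(rejected))
```

Keeping the rejected features, rather than only counting them, makes it possible to inspect them later.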

Additionally (as long as you saved the workspace in step 1), all rejected features are stored in an FFS (FME Feature Store) format dataset, which acts as a form of spatial log:

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.
