Data QA: Identifying Duplicate Features with FME

Files

- duplicatefeaturechecker1.fmwt (40 KB)
- parkspossibleduplicates.zip (30 KB)
- duplicatefeaturechecker2.fmwt (40 KB)
- Identify_Duplicate_Features_2021.2.fmw (70 KB)

Introduction

A duplicate feature (in spatial terms) is one whose geometry is an exact match for that of another feature in the workflow. This might occur when the same feature has been accidentally submitted twice to a database, or when two (or more) overlapping datasets are merged together.

Many FME transformers can identify duplicate features, but some transformers - or combinations of transformers - will be much more efficient than others.

Matcher: The Matcher detects features that are matches of each other. Features are declared to match when they have matching geometry, matching attribute values, or both. This transformer can perform sluggishly when dealing with very large geometries.

CRCCalculator: This transformer calculates a CRC (Cyclic Redundancy Check) value for a feature and places the calculated CRC value into the attribute specified. Most often used as a check for corrupted features, a CRC value can also be used to check if two features are identical (using a transformer like the Matcher). This process can be more efficient than comparing geometries directly.

ChangeDetector: This transformer can match geometry and attributes in the same way as the Matcher. The difference is that it has two input ports, so you would use it to find duplicate (or non-duplicate) features across two different datasets rather than within a single one.

In general, the CRCCalculator approach is more efficient because the comparison is between two short CRC values rather than between full geometries. The gain is greatest when the CRC values are stored with the data and so do not need to be recalculated each time.
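To illustrate why comparing CRC values is cheaper than comparing geometries, here is a minimal Python sketch. It is not how FME computes its CRC internally: the byte layout (little-endian doubles), the helper name geometry_crc, and the sample coordinates are all assumptions made for illustration.

```python
import struct
import zlib


def geometry_crc(coords):
    """Return a CRC-32 value for a sequence of (x, y) coordinate pairs.

    Hypothetical helper: the byte layout is an assumption for this sketch,
    not the layout FME uses.
    """
    # Pack every coordinate pair as two little-endian doubles so that
    # identical coordinates always produce identical bytes.
    payload = b"".join(struct.pack("<2d", x, y) for x, y in coords)
    return zlib.crc32(payload)


# Made-up park outlines; park_b is an exact copy of park_a.
park_a = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0), (0.0, 0.0)]
park_b = list(park_a)
park_c = [(0.0, 0.0), (10.0, 0.0), (10.0, 6.0), (0.0, 0.0)]

# Comparing two integers replaces a coordinate-by-coordinate comparison.
same = geometry_crc(park_a) == geometry_crc(park_b)       # True
different = geometry_crc(park_a) != geometry_crc(park_c)  # True
```

Because a CRC-32 value has only 32 bits, two different geometries can occasionally share the same value, so a CRC match is best treated as a strong hint rather than proof of duplication.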

However, the Matcher is better for more complex geometries (such as those with textures), for comparing coordinate systems, and for matching null or missing attributes in different ways.

In the two examples below, we identify duplicate features first with a Matcher transformer alone and then with the CRCCalculator in conjunction with the Matcher.

Source Data

The source data is a MapInfo TAB file containing parks within the City of Vancouver.

Let's assume there are duplicate park features (copies that share the same ID number rather than having different ones) and we need to find, count, and remove the duplicates.

Step-by-Step Instructions

Method 1: Locating, Counting, Fixing Duplicate Features with the Matcher

Follow these steps to learn how to locate duplicate features with a Matcher transformer.

1. Add Source Data
Start FME Workbench and begin with an empty canvas. Select Readers > Add Reader from the menubar. Set the data format to MapInfo TAB (MITAB). Select the attached MapInfo dataset as the source and click OK to add the reader.

2. Add a Matcher
Add a Matcher transformer to the canvas and connect it to the reader feature type. In the Matcher transformer parameters, set the following:

Match Geometry: 2D
Match ID Output Attribute: _match_id

Optionally also set:

Attribute Matching Strategy: Match Selected Attributes
Selected Attributes: ParkId

3. Run Workspace
Run the workspace with feature caching enabled and inspect each of the Matcher output ports.

Features without a match will exit from the NotMatched port.

Features that exit the SingleMatched port are a single instance of each set of duplicate records. Features that exit the Matched port are all instances of the duplicate records.

If the Attribute Matching parameters are set, features must match on both ID and geometry to count as duplicates; otherwise they only need matching geometry.
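The port behavior described above can be sketched in Python. This is only an illustration of the logic as this article describes it, not FME's implementation: the function match_features, the dictionary-based features, and the sample parks are all hypothetical.

```python
from collections import defaultdict


def match_features(features, key):
    """Split features into Matcher-style ports using key(feature).

    Hypothetical helper that mimics the ports described in the text.
    """
    groups = defaultdict(list)
    for feature in features:
        groups[key(feature)].append(feature)

    ports = {"NotMatched": [], "SingleMatched": [], "Matched": []}
    match_id = 0
    for members in groups.values():
        if len(members) == 1:
            # No other feature shares this key.
            ports["NotMatched"].append(members[0])
            continue
        # Every member of a duplicate group gets the same _match_id.
        tagged = [dict(f, _match_id=match_id) for f in members]
        ports["Matched"].extend(tagged)
        # One representative per group of duplicates.
        ports["SingleMatched"].append(tagged[0])
        match_id += 1
    return ports


# Made-up parks; geometries are tuples so they can be used as dict keys.
square = ((0, 0), (1, 0), (1, 1), (0, 0))
triangle = ((5, 5), (6, 5), (6, 6), (5, 5))
parks = [
    {"ParkId": 1, "geometry": square},
    {"ParkId": 1, "geometry": square},
    {"ParkId": 2, "geometry": triangle},
    {"ParkId": 3, "geometry": square},
]

# Geometry only: parks 1, 1 and 3 form one group of duplicates.
by_geometry = match_features(parks, key=lambda f: f["geometry"])

# Geometry and ParkId: only the two features with ParkId 1 match.
by_geometry_and_id = match_features(
    parks, key=lambda f: (f["geometry"], f["ParkId"])
)
```

Adding ParkId to the key narrows the match in the same way as the optional Selected Attributes setting: the park with ID 3 shares a geometry with park 1 but is no longer reported as a duplicate.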

4. Add a StatisticsCalculator
Add a StatisticsCalculator transformer to the Matched output port of the Matcher. Set the parameters to:

Attributes to Analyze: _match_id
Check the Total Count parameter

To get only the number of duplicate features, connect the Summary output port to an Inspector. To keep all duplicate features for inspection, connect the Complete output port.

5. Run the Workspace
Run the workspace, then inspect the different outputs, looking for the _match_id.total_count attribute. This attribute gives the number of duplicate features in the dataset.
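As a rough illustration of what the count represents, the following Python sketch tallies a handful of made-up features that are assumed to have come from the Matched port. The feature values and the variable names are hypothetical, and this is not the StatisticsCalculator's own code.

```python
from collections import Counter

# Made-up features as they might leave the Matched port, each carrying
# the _match_id of the duplicate group it belongs to.
matched = [
    {"ParkId": 12, "_match_id": 0},
    {"ParkId": 12, "_match_id": 0},
    {"ParkId": 47, "_match_id": 1},
    {"ParkId": 47, "_match_id": 1},
    {"ParkId": 47, "_match_id": 1},
]

# Total number of features involved in any duplication.
total_count = len(matched)  # 5

# Number of features in each duplicate group, keyed by _match_id.
group_sizes = Counter(f["_match_id"] for f in matched)  # {0: 2, 1: 3}

# Features that would be dropped if one copy per group were kept.
surplus = sum(size - 1 for size in group_sizes.values())  # 3
```

Note that the total counts every instance, including the copy you would normally keep, so the number of features actually removed during cleanup is smaller.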

6. Discard the Duplicates
The usual fix for duplicate features is to simply discard the duplicates, keeping one copy of them. With the Matcher transformer, this means keeping the NotMatched and SingleMatched outputs.

So, optionally add a writer to the workspace in the format of your choice. Connect the NotMatched and SingleMatched outputs to a writer feature type, while leaving the Matched port unconnected, or connected only to an Inspector or Logger transformer.
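Keeping the NotMatched and SingleMatched outputs amounts to keeping the first copy of every distinct feature. A minimal Python sketch of that idea follows; the function discard_duplicates and the sample data are hypothetical and are not part of FME.

```python
def discard_duplicates(features, key):
    """Keep the first feature seen for each distinct key(feature)."""
    seen = set()
    kept = []
    for feature in features:
        k = key(feature)
        if k in seen:
            # A copy of this feature was already kept; drop this one.
            continue
        seen.add(k)
        kept.append(feature)
    return kept


# Made-up parks; geometries are tuples so they are hashable.
parks = [
    {"ParkId": 1, "geometry": ((0, 0), (1, 0), (1, 1), (0, 0))},
    {"ParkId": 2, "geometry": ((5, 5), (6, 5), (6, 6), (5, 5))},
    {"ParkId": 1, "geometry": ((0, 0), (1, 0), (1, 1), (0, 0))},
]

cleaned = discard_duplicates(
    parks, key=lambda f: (f["ParkId"], f["geometry"])
)
kept_ids = [f["ParkId"] for f in cleaned]  # [1, 2]
```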

Method 2: Locating, Counting, Fixing Duplicate Features with the CRCCalculator

Follow these steps to learn how to locate duplicate features with a CRCCalculator transformer and a Matcher transformer.

1. Add Source Data
Add the MapInfo TAB reader as described in step 1 of Method 1.

2. Add a CRCCalculator
Add a CRCCalculator transformer and connect it to the reader feature type. In the CRCCalculator transformer parameters, set:

CRC Algorithm: CRC-32
Calculate CRC on: Coordinates and Attributes
Included Attributes: <none>
CRC Output Attribute: _crc

Optionally add an Inspector transformer and run the workspace. Inspect the calculated _crc value for each feature.

3. Add a Matcher
Add a Matcher transformer and connect it to the CRCCalculator output port.

In the Matcher transformer parameters, set:

Match Geometry: None
Attribute Matching Strategy: Match Selected Attributes
Selected Attributes: _crc
Match ID Output Attribute: _match_id

Run the workspace and inspect the output ports of the Matcher.

As before, features without a match will exit from the NotMatched port.

Features that exit the SingleMatched port are a single instance of each set of duplicate records. Features that exit the Matched port are all instances of the duplicate records, so the number of features exiting the Matched port equals the number of duplicate records. In this case, six records exit the Matched port.

4. Discarding Duplicate Features
The usual fix for duplicate features is to simply discard the duplicates, keeping one copy of them. With the Matcher transformer, this means keeping the NotMatched and SingleMatched outputs.

So, optionally add a writer to the workspace in the format of your choice. Connect the NotMatched and SingleMatched outputs to a writer feature type, while leaving the Matched port unconnected.

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.
