Data QA: Identifying Duplicate Attribute Values

Liz Sanderson
Liz Sanderson
  • Updated

FME Version

  • FME 2021.2

Introduction

Duplicate attribute values are those where multiple features share the same value for the same attribute. In many cases this is natural and expected; for example, the province (county/state) field for an address database is very likely to be the same for many records.

However, in some cases duplicate values are invalid and unexpected; for example, a field that should be a unique ID instead contains a value that is identical to that of another feature (1,2,3,4,4,5,6).

Duplicate ID values, in particular, can cause integrity errors when loaded into a database with a unique indexing key.

Sometimes duplicate values are valid as long as they have a specific relationship with another field; for example:

Column 1 Column 2
1 a
1 b
1 c
2 a
2 b
2 c
2 c

Or:...the first column has duplicates, but a combination of column 1 and 2 is unique (so with 2a, 2b, 2c, 2c, the final 2c is invalid because it is a match for a previous combination)

 

Column 1 Column 2
1 a
1 a
1 a
2 b
2 b
2 b
2 c

...where column 1 has duplicates but it is valid because it matches column 2 (so with 2b, 2b, 2b, 2c, the 2c value is invalid because it is not a match for a previous combination).

Duplicate attribute values can be identified with either the DuplicateFilter transformer or the Matcher transformer. The Matcher transformer is good for finding relationships where one column has duplicates and it is invalid because another column differs (the 2b, 2b, 2b, 2c example)

The Matcher transformer is also better at extracting all duplicate records, as the DuplicateFilter lets the first instance of a duplicate record pass. But the same behaviour makes the DuplicateFilter more memory efficient, so that is preferred if possible.

It's fairly simple to identify and count duplicate values, but making fixes may require manual intervention, depending on what sort of contents the field is meant to contain.
 

Source Data

The source dataset for this example is an Excel spreadsheet containing business license information for the city of Vancouver.

The dataset looks like this in the FME Data Inspector (Table View):

 

duplicateattrs1.png

 

There are multiple columns, but in particular, we are interested in the three key ID numbers:

  • RecordID is a unique ID number throughout the entire dataset; i.e. each record should have a different value.
  • LicenseNumber is an ID number that can be reused and may therefore not be unique. Also, there may be multiple records for the same LicenseNumber; for example, a single business that operates at multiple addresses will have a record in the dataset for each address (but with the same LicenseNumber).
  • LicenseRSN is a unique ID number per license; i.e. each license issued should have a unique LicenseRSN. Of course, there may be multiple records with this value as well, because it should match each record of a multiple LicenseNumber feature.

The scenario here is to check for duplicate values. We can say that RecordID must be fully unique. Also, while each LicenseRSN might have multiple records, it should only ever match to a single LicenseNumber.
 

Step-by-Step Instructions

Part 1: Locating Duplicate Attribute Values

Follow these steps to learn how to identify duplicate attribute values.
 

1. Start FME Workbench and begin with an empty canvas.

Select Readers > Add Reader from the menubar.

Set the data format to Microsoft Excel. Select the attached file as the source dataset. Click the Parameters button to review the parameters and inspect the source data preview. The default parameter values should be fine to use so click OK to close this dialog and then click OK again to add the reader.
 

2. Checking for duplicate RecordID values.
We'll do this with a DuplicateFilter transformer. So place a DuplicateFilter transformer on the canvas, connected to the reader feature type (table).

In the parameters dialog select RecordID as the Key Attributes parameter:

duplicateattrs2.png

 

3. Inspect the output from the DuplicateFilter
Run the workspace and inspect the output from the DuplicateFilter by clicking on the green magnifying glass, ensuring that Feature Caching is enabled. One duplicate RecordID is found and output to the Duplicate output port. The RecordID number is 4854.

Only the duplicate record is displayed; i.e. the record it is a duplicate of is not also filtered. In many cases, this is enough to satisfy our needs. However, if we want to be able to look at both records we'll have to open the full dataset:

duplicateattrs3.png

When the "Advanced Fireworks Group" business license was cancelled, it looks like the "Killarney Cafe Dept." (the next license to be issued) was erroneously given their record ID number.


4. Checking for duplicate LicenseRSN values 
Duplicate values are not by themselves a problem unless the duplicates have a difference value for LicenseNumber (i.e. LicenseRSN:LicenseNumber should be a 1:1 match). For this, we'll use a Matcher transformer.

Place a Matcher transformer connected to the Unique output port on the DuplicateFilter:

duplicateattrs4.png

Check the Matcher parameters and set the following:

  • Uncheck "Check Geometry"
  • Attribute Matching Strategy: Match Selected Attributes
  • Selected Attributes: LicenseRSN
  • Attributes That Must Differ: LicenseNumber

duplicateattrs5.png
 

5. Inspect the output form the Matcher
Run the workspace and then inspect the Matched output port from the Matcher. There will be four records identified as matching:

duplicateattrs6.png

 

For the "Electronic Enterprises Corp." it looks like one of their LicenseNumbers contains a typo (120097 instead of 120079). It's harder to determine the cause of the other duplicate.

Notice that this transformer has given us both instances of the duplicate record, not just one of them; however there is a performance cost associated with this benefit, so it is not always the best solution.
 

6. Testing for issued licenses
A final check we can make is for a duplicate LicenseNumber with a different LicenseRSN. Because (as stated above) the LicenseNumber value can be reused there may be quite a few of these. However, this duplication only causes a problem when both licenses are active at the same time.

So, firstly add a Tester transformer connected to the Matcher:NotMatched output port:

duplicateattrs7.png

Set up the parameters to test for licenses whose status is "Issued".
 

7. Add a second Matcher transformer to the workspace
Conncet it to the Tester:Passed port. Set it up as before, except this time LicenseNumber is the selected attribute, and LicenseRSN the attribute that must differ:

duplicateattrs8.png

8. Run the workspace and inspect the Matcher's Matched output port 
There will be 10 features that exit through the Matched port:

duplicateattrs9.png

i.e. these features have a duplicate LicenseNumber that is not matched to the same LicenseRSN number.
 

Part 2: Counting Duplicate Attribute Values

Counting duplicate attribute values is simple enough, it's just a case of deciding whether each instance of a duplicate value counts as one issue, or whether it is one issue for all matching values (for example, if two features both have a LicenseNumber of 17-107142, does that count as one issue, or two?)

Additionally, it's necessary to decide whether to count all of the different types of duplicates separately or as one.
 

9. Count duplicate features
To count duplicate features simply requires a StatisticsCalculator. So, add a StatisticsCalculator and connect it to the Matcher:Matched port and set it to analyze any attribute.

Check off the Total Count attribute and press OK to close the transformer.

duplicateattrs10.png
 

10. Counting matches as only as 1 issue
The duplicate RecordID features were filtered by the DuplicateFilter with the result that only 1 feature was flagged as a duplicate (the record it was a duplicate of is not counted).

The Matcher results will have 2 features for each match. If you wish to count that only as 1 issue, the duplicate duplicates(!) need to be filtered out. This is simply done by switching the connection from Matcher:Matched to Matcher:SingleMatched:

duplicateattrs11.png

 


 

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.

Was this article helpful?

Comments

0 comments

Please sign in to leave a comment.