Data QA: Identifying Non-Consecutive Duplicate Vertices with FME

Files

duplicatenonsequentialpoints.fmwt
- 50 KB
buildings.dgn
- 50 KB
non_consecutive_duplicate_vertices_2021.2.fmw
- 80 KB

Introduction

A duplicate vertex (duplicate point) occurs when one or more vertices appear multiple times within a feature's geometry. Duplicate vertices are those with identical X, Y, and Z coordinate values, to as many decimal places as exist in the data.

Duplicate vertices are not only a sign of lower-quality data; they can also be a data format problem. Some formats permit duplicate vertices (for example, MicroStation DGN allows zero-length lines), while other formats prohibit them (for example, Oracle Spatial).

The duplicate vertex might occur sequentially in the geometry (for example, A,B,C,C,D,E) or it might occur out of sequence (A,B,C,D,C,E). It might just be duplicated once (A,B,C,C,D), or it might be duplicated multiple times (A,B,C,C,C,D,C,E,C).
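The distinction between sequential and out-of-sequence duplicates can be sketched in a few lines of Python. This is only an illustration of the concept, not part of the FME workflow: the function name classify_duplicates is hypothetical, and single letters stand in for full X, Y, Z coordinates.

```python
from collections import defaultdict


def classify_duplicates(vertices):
    """Return (consecutive, non_consecutive) lists of index pairs.

    Each pair holds the positions of two successive occurrences of the
    same vertex. Hypothetical helper, for illustration only.
    """
    positions = defaultdict(list)
    for index, vertex in enumerate(vertices):
        positions[vertex].append(index)

    consecutive, non_consecutive = [], []
    for indices in positions.values():
        # Compare each occurrence with the previous occurrence of the same vertex.
        for previous, current in zip(indices, indices[1:]):
            if current - previous == 1:
                consecutive.append((previous, current))
            else:
                non_consecutive.append((previous, current))
    return consecutive, non_consecutive


print(classify_duplicates(list("ABCCDE")))  # ([(2, 3)], [])
print(classify_duplicates(list("ABCDCE")))  # ([], [(2, 4)])
```

In real data each letter would be a coordinate tuple such as (x, y, z), compared exactly, in line with the definition given in the introduction.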

Of course, sometimes a duplicate vertex is valid. For example, a polygon's start and end points should be identical if it is to close properly (A,B,C,D,E,A), and sometimes a linear feature should loop around and rejoin itself partway along (A,B,C,D,E,C). So it is not always easy to identify invalid features on this basis alone.

There are various FME transformers that can be used to identify duplicate vertices, but some transformers - or combinations of transformers - will be much more efficient than others.

GeometryValidator: This transformer identifies and fixes duplicate vertices that occur consecutively within a single geometry.
ClosedCurveFilter: This transformer identifies features that form a closed loop, and can, therefore, be used to detect (or eliminate from suspicion) features with duplicate endpoints.
CoordinateExtractor: This transformer extracts a list of coordinates from a feature, which can then be analyzed to look for duplicates.

In general, the GeometryValidator is used more often because consecutive duplicate vertices are a more obvious issue.

However, the CoordinateExtractor is better for detecting duplicate vertices that occur out of sequence, so that further investigation can take place.

This example uses a combination of ClosedCurveFilter and CoordinateExtractor to identify non-consecutive duplicate points. A second example uses the GeometryValidator transformer to identify consecutive duplicate points.

Source Data

The source data is a MicroStation Design file containing line features that represent building outlines:

The scenario is that we wish to validate and clean the data before it is put into production use.

Step-by-Step Instructions

Part 1: Locating Non-Consecutive Duplicate Vertices

Locating non-consecutive (non-sequential) duplicate vertices is not as straightforward as locating consecutive duplicates; however, it can be done. The following steps show one method.

1. Add Source Data
Start FME Workbench and begin with an empty canvas. Select Reader > Add Reader from the menubar. Set the data format to Bentley MicroStation Design (V8). Select the attached MicroStation dataset as the source. If you click the parameters button you'll find there is an advanced parameter to remove duplicate points:

Ensure this parameter is turned off, because we want to identify where the duplicate vertices are and how many there are. Then click OK to add the reader. If/when prompted, select the BuildingFootprints level as the data to be read.

2. Inspect the Data
Click the reader feature type on the canvas. On the menu that pops up, select the View Source Data option to view the data in the Visual Preview window. Examine the data. The data looks correct at a glance, and it is difficult to identify where there might be duplicate vertices.

3. Add a ClosedCurveFilter transformer
Add a ClosedCurveFilter transformer and connect the MicroStation V8 reader to it. Run the translation and inspect the output from the transformer by clicking the green magnifying glass symbol. Depending on random color generation, you might have to color the ClosedCurveFilter_Open feature to differentiate the results. It will identify an open feature like this:

This is a feature with a duplicate vertex, but it doesn't close like a polygon would. It may, or may not, be considered a problem feature, but since this is meant to be a building we can probably assume it's incorrect.
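As a rough sketch of the idea behind this check (not the ClosedCurveFilter's actual implementation), a line can be treated as closed when its first and last vertices are identical. The function name is_closed and the sample coordinates below are invented for illustration.

```python
def is_closed(vertices):
    """Treat a line as closed when its first and last vertices are identical."""
    return len(vertices) > 1 and vertices[0] == vertices[-1]


# A closed building outline: the last vertex repeats the first.
closed_outline = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]

# An open line that revisits (10, 0) but never returns to its start.
open_outline = [(0, 0), (10, 0), (10, 10), (0, 10), (10, 0)]

print(is_closed(closed_outline))  # True
print(is_closed(open_outline))    # False
```

The second line mirrors the kind of feature described above: it contains a duplicate vertex, yet it does not close like a polygon would.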

Part 2: Counting Non-Consecutive Duplicate Vertices

4. Find Non-Consecutive Duplicate Points
To find non-consecutive duplicate points we'll extract a list of coordinates and check for duplicates. Of course, it's important to not confuse points on different features, and to not include the start/end point of polygons.

Here the data does not have a unique ID for each feature, so we should create one by adding a Counter transformer. That way identical points on different features will not be confused. The default parameters - which will create an attribute called _count - are fine for our purposes.

5. Add a CoordinateExtractor Transformer
Add a CoordinateExtractor to the workspace and connect the Counter output to it. The parameters should be set to extract All Coordinates to a list called _indices:

If you wish, run the workspace and inspect the output features. Query a feature and you'll find that it now has a list containing its vertices.

6. Expose the Coordinates
We want to analyze the coordinate list, but we can't do so while it is still a list attribute. There is no specific list transformer that will find duplicate values among multiple values (the ListDuplicateRemover will find duplicate X values, or duplicate Y values, but not a combination of duplicate X and Y). So, we'll explode the list into one feature per list element using the ListExploder transformer:

If you wish, run the workspace and inspect the output from the ListExploder. You'll see there is now one feature per vertex. Each vertex has its position in the list recorded as _element_index:

The above shows that building 55 has 5 vertices, numbered 0 to 4. The first and last vertices match, meaning it's a closed line (which is fine).
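The effect of the Counter, CoordinateExtractor, and ListExploder steps can be approximated outside FME as follows. This is a minimal sketch under stated assumptions: the function name explode_vertices and the sample coordinates are invented, features are numbered from 0, and only the attribute names _count, _element_index, x, and y are borrowed from the workflow above.

```python
def explode_vertices(features):
    """Turn a list of vertex lists into one record per vertex.

    Each record carries the feature number (_count) and the vertex's
    position within that feature (_element_index).
    """
    rows = []
    for count, vertices in enumerate(features):
        for element_index, (x, y) in enumerate(vertices):
            rows.append(
                {"_count": count, "_element_index": element_index, "x": x, "y": y}
            )
    return rows


# One closed outline with 5 vertices (invented coordinates).
features = [[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]]

for row in explode_vertices(features):
    print(row)
```

The printed records run from _element_index 0 to 4, and the first and last records share the same x and y, which is the pattern expected for a closed line.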

Part 3: Fixing Non-Consecutive Duplicate Vertices

7. Removing Non-Duplicate Vertices
Now we can start removing vertices that are not (or do not count as) duplicates.

Add a Tester transformer to the workspace and connect the ListExploder to it. Set up the parameters to test for _element_index = 0 (i.e. this is the first coordinate of the line).

These are the features we want to drop, because otherwise the first and last points of a closed line would match and be flagged as an error. The features we want to keep therefore emerge from the Failed port.

8. Add a DuplicateFilter Transformer
Now place a DuplicateFilter transformer, connected to the Tester:Failed port:

Set up the transformer to filter out duplicate values of _count, x, y, and (optionally) z. That is, vertices on the same feature (matching _count) that have identical x, y, and z values are flagged as duplicates.

Run the workspace and inspect the output. The result will look like this:

There is one unclosed feature and six features flagged with duplicate vertices. In fact, there will be a feature for every vertex of a building that is a duplicate, so if a building has two duplicate vertices there will be two features to represent it. The x/y/z attributes of the feature identify where the duplicate vertex lies.
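The Tester and DuplicateFilter logic from steps 7 and 8 can be sketched as follows. This is an approximation of the workflow rather than FME's own code: the function name find_duplicate_vertices and the sample coordinates are hypothetical, and the first occurrence of each vertex is assumed to count as unique while later occurrences are flagged.

```python
def find_duplicate_vertices(features):
    """Flag repeated vertices within each feature, ignoring the first vertex."""
    duplicates = []
    for count, vertices in enumerate(features):
        seen = set()
        for element_index, vertex in enumerate(vertices):
            if element_index == 0:
                continue  # Step 7: drop the first coordinate of each line.
            key = (count, *vertex)  # Step 8: match on _count plus coordinates.
            if key in seen:
                duplicates.append(
                    {"_count": count, "_element_index": element_index, "vertex": vertex}
                )
            else:
                seen.add(key)
    return duplicates


features = [
    # Closed outline: start and end match, which should not be flagged.
    [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)],
    # Closed outline that passes through (5, 0) twice.
    [(0, 0), (5, 0), (5, 5), (0, 5), (5, 0), (0, 0)],
]

print(find_duplicate_vertices(features))
```

Only the second feature is reported, at _element_index 4. Because the key includes the feature number, identical coordinates on different features are never confused. This sketch also flags consecutive duplicates, since it does not look at how far apart the two occurrences are.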

Notes

If you're overthinking the problem you might be wondering if there is any effect introduced by dropping the first point. For example, where we have A,A,B,C,D,E or A,B,C,A,D,E - would there be a problem because the first A feature is dropped and so won't match with any subsequent A's?

Well, no, for various reasons:

If it's A,A,B,C,D,E then the two A's are consecutive and you could find those with the GeometryValidator. But even if you didn't...
If it's a closed line then "E" is the same as "A" anyway, so subsequent A's will match with E.
If it's not a closed line then the ClosedCurveFilter will already have flagged this as a possible problem feature.
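The three cases above can be checked with a small sketch that combines the closed-line test with the drop-first-vertex duplicate check. The function name audit is hypothetical, and letters again stand in for coordinates.

```python
def audit(vertices):
    """Return (is_closed, duplicate_indices) after dropping the first vertex."""
    is_closed = len(vertices) > 1 and vertices[0] == vertices[-1]
    seen, duplicate_indices = set(), []
    for index, vertex in enumerate(vertices):
        if index == 0:
            continue  # The first vertex is dropped, as in step 7.
        if vertex in seen:
            duplicate_indices.append(index)
        else:
            seen.add(vertex)
    return is_closed, duplicate_indices


print(audit(list("AABCDA")))  # (True, [5])  closed: the second A matches the closing A
print(audit(list("ABCADA")))  # (True, [5])  closed: the mid-line A matches the closing A
print(audit(list("ABCADE")))  # (False, [])  open: caught by the closed-line test instead
```

In the open case the duplicate check alone finds nothing, which is why the ClosedCurveFilter result from step 3 still matters.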

Counting the number of problem vertices is as simple as introducing a StatisticsCalculator (as in the prior example) to count the features.

Fixing the problem vertices is another matter. Technically we could use the VertexRemover to drop one of the bad vertices. But there is no guarantee that we would remove the correct one. For example, add a VertexRemover after the DuplicateFilter, set to remove vertex "_element_index" (which we know to be a duplicate):

The result works for some features, but not others:

Therefore it's suggested that this technique should be used to identify non-consecutive duplicate coordinates, but not to fix them. The problem features should be passed on to a proper editing tool for fixing.
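To see why automatic removal is ambiguous, consider the line A,B,C,D,C,E from the introduction. The sketch below uses a hypothetical helper, remove_vertex, and is not a model of the VertexRemover itself; it simply shows that removing either occurrence of C gives a different line.

```python
def remove_vertex(vertices, index):
    """Return a copy of the vertex list without the vertex at the given index."""
    return vertices[:index] + vertices[index + 1:]


line = list("ABCDCE")

print(remove_vertex(line, 4))  # ['A', 'B', 'C', 'D', 'E']
print(remove_vertex(line, 2))  # ['A', 'B', 'D', 'C', 'E']
```

Both results are free of duplicates, but they trace different shapes, and nothing in the data says which one the original digitizer intended.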

Data Attribution

The data used here originates from open data made available by the City of Vancouver, British Columbia. It contains information licensed under the Open Government License - Vancouver.
