Introduction
FME can be used in combination with AI services to intelligently clean data. This article shows how to use OpenAI to recognize the different abbreviations found in street names and return the full word for each one. Because the abbreviations may not be standardized, we also ask the AI to provide a confidence level for each replacement text it suggests. In this tutorial, we delegate the monotonous task of data cleanup to AI.
Requirements
- OpenAI API Key: Required to use the OpenAIConnector transformer
- Internet Access: Necessary for calling OpenAI’s API with FME Form
Step-by-Step Instructions
This article demonstrates how to clean up and standardize address data by expanding abbreviations using AI-generated suggestions. Download Abbreviations.fmw to get started, or create your own workspace.
Part 1: Find and Replace Abbreviations
1. Start FME Workbench
Launch FME Workbench and create a new workspace or open the Abbreviations.fmw file to follow along. The Abbreviations.fmw workspace is pre-configured with readers and transformers for abbreviation detection and replacement.
2. Add an Esri ArcGIS Server Feature Service Reader
The source data contains road centrelines, each with a road name and a speed limit. To read it into the workspace, add an Esri ArcGIS Server Feature Service Reader:
- Format: ArcGIS Server Feature Service
- Dataset: https://gisservices.surrey.ca/arcgis/rest/services/OpenData/MapServer
- Parameters:
  - Feature Service: https://gisservices.surrey.ca/arcgis/rest/services/OpenData/MapServer
  - Layers:
    - Select Layers: Road Centrelines (151)
Optional Step: Add a Sampler Transformer
To improve the processing speed and reduce any API delays, we can use a Sampler transformer. The Sampler is also especially useful at the workspace testing stage, before we commit to processing 10,000+ features.
3. Find Street Name Abbreviations
Add a StringSearcher transformer and connect the Road Centrelines (151) feature type to its input port. Set the following parameters to extract the street name abbreviation from each road name into the RoadAbbreviation attribute:
- Search In: ROAD_NAME
- Contains Regular Expression: (?<=\b)([a-zA-Z]+\s+[a-zA-Z]|[a-zA-Z]+)(?=$)|([A-Za-z]+)(?=\)$)
- Output Attribute Name: RoadAbbreviation
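The Regular Expression (REGEX) is designed to extract only the relevant part of the road name. In our case, it takes the last word of the string, together with a trailing single-letter direction when one is present (for example, "St E"), or the word immediately before a closing parenthesis at the end of the string.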
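To see what the expression captures, it can be tried outside FME. The following is a minimal sketch using Python's re module rather than the StringSearcher itself, so small differences between regex engines are possible. The function name and the sample road names are hypothetical and are not taken from the Surrey dataset.

```python
import re

# Pattern copied from the StringSearcher parameter above.
ABBREVIATION_PATTERN = re.compile(
    r"(?<=\b)([a-zA-Z]+\s+[a-zA-Z]|[a-zA-Z]+)(?=$)|([A-Za-z]+)(?=\)$)"
)


def extract_abbreviation(road_name):
    """Return the trailing street-type text, or None when nothing matches."""
    match = ABBREVIATION_PATTERN.search(road_name)
    return match.group(0) if match else None


# Hypothetical sample names for illustration only.
for name in ["Main St", "King George Blvd", "15 St E", "Highway 10"]:
    print(name, "->", extract_abbreviation(name))
```

In this sketch, a name that ends in digits, such as "Highway 10", produces no match, which would correspond to a feature leaving the StringSearcher through a port other than Matched.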
4. Aggregate Features by Abbreviation
Let’s aggregate all of the features by their street name abbreviation. This leaves one feature per unique abbreviation, so we can see every abbreviation in the dataset and send only the unique values to the AI service for expansion, which reduces token usage. We will join the results back to the original dataset later on.
Add an Aggregator transformer, connect the Matched port of the StringSearcher to its input, and set the following parameters:
- Group Processing: Enabled
  - Group By: RoadAbbreviation
- Attribute Accumulation:
  - Accumulation Mode: Use Attributes From One Feature
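The effect of this grouping can be sketched in plain Python. This is only an illustration of the idea, not how the Aggregator is implemented: features are modelled as dictionaries, the function name is hypothetical, and keeping the first feature of each group is an assumption made here to mimic "Use Attributes From One Feature".

```python
from collections import defaultdict


def aggregate_by_abbreviation(features):
    """Group features by RoadAbbreviation and keep one representative each."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature["RoadAbbreviation"]].append(feature)
    # Assumption: the first feature seen stands in for the whole group.
    return [members[0] for members in groups.values()]


# Hypothetical sample features for illustration only.
features = [
    {"ROAD_NAME": "152 St", "RoadAbbreviation": "St"},
    {"ROAD_NAME": "64 Ave", "RoadAbbreviation": "Ave"},
    {"ROAD_NAME": "Main St", "RoadAbbreviation": "St"},
]
unique = aggregate_by_abbreviation(features)
print(len(features), "features ->", len(unique), "AI requests")
```

With thousands of road segments but only a few dozen distinct abbreviations, the number of requests depends on the number of distinct abbreviations rather than the size of the dataset.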
5. Add an OpenAIConnector Transformer
Now we can send the abbreviated road names to the OpenAI API for expansion. By delegating this task to AI, we do not have to account for every single variation of street name abbreviations.
Set the following parameters on the OpenAIConnector transformer:
- API Key: <Input your own>
- Action: Web Search
- Model: gpt-4o
- Instructions:
  You are a data cleanup assistant helping to expand abbreviated Canadian street names into their full names. Here is a list of common abbreviations for street names: "St" for "Street", "Ave" for "Avenue", "Blvd" for "Boulevard", "Cl" for "Close", "Pl" for "Place", "Gr" for "Grove", "Cr" for "Crescent", "Div" for "Diversion", "Rwy" for "Railway". Return the expanded form of each name.
  Example input:
  • Main St
  • Elm Ave
  • King George Blvd
  • Bayview Dr
  • 5th Ct
  • 15 St E
  • Scott Rd
  Expected output:
  • Main Street
  • Elm Avenue
  • King George Boulevard
  • Bayview Drive
  • 5th Court
  • 15 Street East
  • Scott Road
  If you receive a full word and not an abbreviation, respond with the same word.
- User Prompt: RoadAbbreviation (the attribute created by the StringSearcher)
- Structured Output:
  {
    "additionalProperties": false,
    "properties": {
      "Confidence": { "type": "number" },
      "ReplacementText": { "type": "string" }
    },
    "required": [ "Confidence", "ReplacementText" ],
    "type": "object"
  }
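To make the intent of the Structured Output schema concrete, here is a minimal sketch that checks a response against the same rules by hand in Python: exactly two keys, a numeric Confidence, and a string ReplacementText. The function name and the sample response are hypothetical. Note that the schema itself does not restrict Confidence to a range; a 0 to 1 scale is assumed here only because the tutorial later compares it against 0.9.

```python
import json

REQUIRED_KEYS = {"Confidence", "ReplacementText"}


def validate_response(raw):
    """Parse a JSON string and check it against the schema rules above."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # "required" plus "additionalProperties": false means exactly these keys.
    if set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected or missing keys: {sorted(data)}")
    confidence = data["Confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("Confidence must be a number")
    if not isinstance(data["ReplacementText"], str):
        raise ValueError("ReplacementText must be a string")
    return data


# Hypothetical response text for illustration only.
sample = '{"Confidence": 0.98, "ReplacementText": "Street"}'
print(validate_response(sample))
```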
6. Flatten the OpenAI JSON Response into Attributes
The JSON response from the OpenAI API needs to be flattened to use as attributes in the workflow. Add a JSONFlattener transformer with the following parameters set:
- JSON Document: Response
- Attributes to Expose:
  - Confidence
  - ReplacementText
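Conceptually, flattening copies the keys of the JSON document onto the feature as ordinary attributes. The sketch below illustrates that idea in Python with features modelled as dictionaries; it is not the JSONFlattener's implementation, and the function name and sample feature are hypothetical.

```python
import json


def flatten_response(feature, response_attr="Response"):
    """Return a copy of the feature with the JSON keys exposed as attributes."""
    flattened = dict(feature)  # leave the input feature untouched
    payload = json.loads(feature[response_attr])
    for key in ("Confidence", "ReplacementText"):
        flattened[key] = payload.get(key)
    return flattened


# Hypothetical feature carrying a raw JSON response, for illustration only.
feature = {
    "RoadAbbreviation": "Ave",
    "Response": '{"Confidence": 0.97, "ReplacementText": "Avenue"}',
}
print(flatten_response(feature))
```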
7. Add a FeatureJoiner Transformer
With the FeatureJoiner, we can join the attribute holding the unabbreviated street names with the original road centreline data.
Connect the Output port of the JSONFlattener to the FeatureJoiner’s Left port. Connect the Matched port of the StringSearcher to the FeatureJoiner’s Right port. Double-click the FeatureJoiner to set parameters:
- Attribute Conflict Resolution: Prefer Right
- Geometry Handling: Prefer Right
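The join can be pictured as a lookup from abbreviation to AI suggestion that is applied to every road feature. The following Python sketch rests on two assumptions that the parameters above do not state: that the join key is RoadAbbreviation on both sides, and that unmatched road features are dropped (an inner join). The function name and sample data are hypothetical.

```python
def join_expansions(expansions, roads, key="RoadAbbreviation"):
    """Attach each AI suggestion (left) to every road feature (right) sharing its key."""
    lookup = {expansion[key]: expansion for expansion in expansions}
    joined = []
    for road in roads:
        expansion = lookup.get(road[key])
        if expansion is None:
            continue  # assumption: inner join, unmatched roads are skipped
        # "Prefer Right": the road feature's values win on conflicting names.
        joined.append({**expansion, **road})
    return joined


# Hypothetical sample data for illustration only.
expansions = [
    {"ROAD_NAME": "152 St", "RoadAbbreviation": "St",
     "Confidence": 0.99, "ReplacementText": "Street"},
]
roads = [
    {"ROAD_NAME": "Main St", "RoadAbbreviation": "St"},
    {"ROAD_NAME": "64 Ave", "RoadAbbreviation": "Ave"},
]
for row in join_expansions(expansions, roads):
    print(row)
```

Preferring the right-hand side matters because the aggregated feature on the left carries the ROAD_NAME of just one representative road, while each feature on the right carries its own.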
8. Replace Abbreviations
Add a StringReplacer transformer to replace each abbreviation with the full word. A conditional value applies the replacement only when the confidence returned by the OpenAI API is at least 0.9 (90%). Set the following parameters:
- Attributes: ROAD_NAME
- Text To Replace:
  - Click the drop-down arrow and select Conditional Value
  - Test If:
    - Left Value: Confidence
    - Operator: >=
    - Right Value: 0.9
    - Text To Replace: RoadAbbreviation
  - Test Else:
    - Text To Replace: 1
- Replacement Text:
  - Click the drop-down arrow and select Conditional Value
  - Test If:
    - Left Value: Confidence
    - Operator: >=
    - Right Value: 0.9
    - Replacement Text: ReplacementText
  - Test Else:
    - Replacement Text: 1
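The conditional logic amounts to "replace only when the confidence is at least 0.9, otherwise leave the name unchanged". Here is a minimal Python sketch of that rule, with features modelled as dictionaries and hypothetical names throughout. One deliberate difference is labelled in the code: the sketch replaces only the last occurrence of the abbreviation so that a name like "Stone St" is not altered mid-word; whether the StringReplacer behaves the same way is not stated in this article.

```python
CONFIDENCE_THRESHOLD = 0.9


def replace_abbreviation(feature, threshold=CONFIDENCE_THRESHOLD):
    """Expand the abbreviation in ROAD_NAME when the confidence is high enough."""
    updated = dict(feature)
    # Confidence may arrive as text after flattening, so coerce it to float.
    if float(feature["Confidence"]) >= threshold:
        # Assumption of this sketch: replace only the last occurrence.
        head, found, tail = feature["ROAD_NAME"].rpartition(
            feature["RoadAbbreviation"]
        )
        if found:
            updated["ROAD_NAME"] = head + feature["ReplacementText"] + tail
    return updated


# Hypothetical features for illustration only.
confident = {"ROAD_NAME": "Stone St", "RoadAbbreviation": "St",
             "Confidence": 0.98, "ReplacementText": "Street"}
uncertain = {"ROAD_NAME": "Hill Xy", "RoadAbbreviation": "Xy",
             "Confidence": 0.4, "ReplacementText": "Crossway"}
print(replace_abbreviation(confident)["ROAD_NAME"])
print(replace_abbreviation(uncertain)["ROAD_NAME"])
```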
9. Add an AttributeManager Transformer
Use an AttributeManager transformer to remove temporary schema elements before the data is written out.
Remove the following attributes:
- Response
- RoadAbbreviation
- Confidence
- ReplacementText
Now, we should be left with only the ROAD_NAME and SPEED attributes.
10. Add a GeoPackage Writer
After adding and connecting the GeoPackage writer, we can run the workspace. Open the GeoPackage output in the Data Inspector to preview the cleaned data. The screenshot below shows only a sample of the Road Centrelines from the original dataset. The road names are now fully unabbreviated with the help of AI!
You have successfully used the OpenAI API within FME Form to clean your data!
Troubleshooting
- API Timeout: If you experience delays or timeouts, reduce the number of records in the Sampler.
- Invalid API Key: Ensure your OpenAI API key is active and properly configured in the transformer.
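- Unexpected Abbreviation Results: Review the Instructions text in the OpenAIConnector and refine it, for example by adding the abbreviations that were expanded incorrectly to the list of known abbreviations.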
Additional Resources
- OpenAIConnector Transformer
- Regular Expressions and FME
- Regular Expression Language - Quick Reference
Data Attribution
City of Surrey, BC Open Data Portal - Road Centerlines Dataset (License: Open Government Licence - Surrey) https://gisservices.surrey.ca/arcgis/rest/services/OpenData/MapServer