UTF-8 Names FAQ | Encoding in FME

Introduction

As of FME 2022.0, FME processes all workspaces using UTF-8 as the encoding for most string data when running on Windows 10+ or Windows Server 2019+, provided the Windows OS version supports setting the process locale to UTF-8. This allows FME to execute workspaces that were created in different locales and helps minimize locale issues. Existing workspaces created prior to FME 2022.0 should continue to run as before if they are run on the same locale on which they were originally generated. Linux and macOS systems already use UTF-8 as their default encoding, so we do not expect a change in behavior between old and new workspaces authored on those platforms.

Why UTF-8

UTF-8 is a universal Unicode encoding, which is ASCII compatible, as any ASCII text is also UTF-8. UTF-8 can store characters from any other character encoding in a lossless way, allowing the preservation of text from any written language.
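
Both properties can be checked with a few lines of code. The following is a minimal sketch in Python, used here only for illustration and not part of FME; the sample strings are arbitrary examples.

```python
# ASCII text produces identical bytes whether encoded as ASCII or as UTF-8.
ascii_text = "roads_2022"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Text from different written languages survives a UTF-8 round trip unchanged.
samples = ["Données", "板橋", "Straße", "дорога"]
round_tripped = [s.encode("utf-8").decode("utf-8") for s in samples]
assert round_tripped == samples

# Non-ASCII characters simply occupy more than one byte each.
byte_lengths = {s: len(s.encode("utf-8")) for s in samples}
print(byte_lengths)
```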

Versions of FME prior to FME 2022 produced workspaces which "assumed" that source data, feature types, attributes, names, and paths were all stored in the character encoding specified by the system's locale. This assumption became a potential issue when workspaces were moved between machines, and the locale on the machine where the workspace was authored wasn't the same as the locale on the machine running the workspace.

As an example, if a workspace were authored on a machine where the system locale was set to fr_FR, but run on a machine where the system locale was set to en_US, incorrect translation behavior often resulted. This incorrect behavior often included misread features and attribute values, as well as incorrectly interpreted file paths.
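
The effect of such a mismatch can be reproduced outside FME. The sketch below assumes a name written under one single-byte code page and read under another. The helper `reinterpret` and the two code pages (windows-1252 and windows-1251) are chosen purely for illustration; they are not the code pages of the locales named above, and this is not how FME itself is implemented.

```python
def reinterpret(text: str, authored_encoding: str, runtime_encoding: str) -> str:
    """Encode text as the authoring machine would, then decode the same
    bytes as the running machine would (hypothetical helper)."""
    raw = text.encode(authored_encoding)
    # Bytes that are invalid in the runtime encoding become U+FFFD.
    return raw.decode(runtime_encoding, errors="replace")

name = "Données"

# Same code page on both machines: the name is read back unchanged.
same = reinterpret(name, "cp1252", "cp1252")

# Different code pages: the byte for "é" is interpreted as another character.
different = reinterpret(name, "cp1252", "cp1251")

print(same, different)
```

No bytes are lost in this case; they are only interpreted differently, which is why the problem appears only after the workspace is moved to another machine.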

Terminology

Names: Feature type, attribute, trait, and measure names. Names are implied to be process-coded.
OS: Operating system. Its default encoding is based on the region/locale.
Process: Refers to the application processes used by the FME Platform, and their respective encoding. The process encoding is used to interpret strings unless otherwise specified.
Tagged: Strings with an explicitly known character encoding, for example Windows-1252, Shift_JIS, or UTF-8.
Unicode: The standard for the consistent encoding, representation, and handling of text. Includes encodings such as UTF-8, UTF-16, and UTF-32.
Untagged: Strings without a known character encoding, assumed to be in the process encoding in FME 2022.0.
Values: Attribute or trait string values. Values are encoding aware and may be tagged with explicit character encodings.
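
The difference between tagged and untagged strings can be pictured with a small data structure. This is a conceptual sketch only: the `StringValue` class and the `PROCESS_ENCODING` constant are hypothetical names invented for this illustration and do not reflect FME's internal representation or API.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for this sketch: the process encoding is UTF-8,
# as in FME 2022.0+ on an operating system that supports it.
PROCESS_ENCODING = "utf-8"

@dataclass
class StringValue:
    """A string stored as bytes, optionally tagged with its encoding."""
    raw: bytes
    encoding: Optional[str] = None  # None means the string is untagged

    def text(self) -> str:
        # Tagged strings are decoded with their own encoding;
        # untagged strings are assumed to be in the process encoding.
        return self.raw.decode(self.encoding or PROCESS_ENCODING)

tagged = StringValue("板橋".encode("shift_jis"), encoding="shift_jis")
untagged = StringValue("板橋".encode("utf-8"))

print(tagged.text(), untagged.text())
```

Both values decode to the same text, but only the tagged one would still decode correctly if the process encoding were something other than UTF-8.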

Process

Recent versions of Windows (Windows 10 and later) allow applications to set their character encoding to UTF-8 without changing the machine's configured system locale. This functionality is automatically enabled in FME 2022.0 and newer, meaning that processes running in FME 2022.0+ will automatically use UTF-8 character encoding as long as the machine's OS supports it. Processing workspaces with UTF-8 character encoding allows any input data or names to be represented correctly during workspace translations. It also means that other applications, such as Esri's ArcGIS Pro, which sets its locale to UTF-8 on Windows without user configuration, can work with FME. Applications that use the FME Objects interface to integrate with FME should be careful not to change the locale from UTF-8; doing so may result in the incorrect interpretation of names.

Pros and Cons

Pros:

Workspaces will always open as UTF-8 and save without loss
Running translations will honor attribute data
FME can display attributes without misinterpreting the characters
Any changes to workspaces can be saved while preserving the authoring encoding
Most UTF-8 path names are supported

Cons:

Workspaces will always open as UTF-8
Workspaces authored prior to FME 2022.0 will continue to lack cross-platform or cross-locale compatibility
Versions of Windows OS which predate Windows 10 version 1903 or Windows Server 2019 build 1809 will not allow FME to set the process locale to UTF-8; on machines with these older Windows OS, encoding behavior in FME 2022+ will be the same as in older versions of FME

2021.2 Vs 2022.0+

	2021.2 and older	2022.0 and newer
Process Encoding	Process encoding equals OS encoding	Process encoding does not equal OS encoding
Defaults	Windows default encoding uses a code page based on the region/locale	FME process encoding is always UTF-8 if supported by the OS
Name Encoding	Names were untagged	Names are still untagged and are assumed to be in the process encoding (UTF-8)
Value Encoding	Values may be tagged or untagged strings	Values may be tagged or untagged strings
Example 1	Attribute(string): 'process_encoded_name' has value 'untagged_process_encoded_value'	Attribute(string): 'utf-8_encoded_name' has value 'untagged_utf-8_encoded_value'
Example 2	Attribute(encoded: Shift_JIS): 'JAPANESE' has value 'tagged_shift_jis_value_板橋'	Attribute(encoded: Shift_JIS): 'utf-8_encoded_name' has value 'tagged_shift_jis_value_板橋'

How to Determine if you are in UTF-8

As of FME 2022.0, FME is automatically in UTF-8 if supported by the machine's operating system.

To determine if FME is set to UTF-8, open FME Workbench and access Help > About FME Workbench. In the dialog that opens, select More Info. In the FME Information window, under Process Encoding, you’ll see what encoding FME is set to use.

The following image shows the About FME Workbench information for FME Workbench 2024.1 build 24612. Notice that the Process Encoding for FME Workbench has been set to UTF-8, as the operating system of the machine in use allows applications to set the process locale to UTF-8.

This second image shows the same About FME Workbench information for FME Form 2024.1 build 24612, but this time FME Form 2024.1 was installed on a machine whose operating system does not allow applications to set the process locale encoding to UTF-8. Notice how the Process Encoding for FME Workbench is set to the older windows-1252 single-byte character encoding.

If authoring workspaces that make use of UTF-8 encoding is a priority, users must ensure the operating system of their machine supports UTF-8 character encoding. This may require an upgrade of the machine's Windows OS to a more current version, which allows applications to set the process locale encoding to UTF-8. Users could also consider using macOS or Linux machines for UTF-8 support.

How to Check the Encoding for a Workspace

To check the encoding that a workspace was authored with, open the workspace .fmw file in a text editor, then scroll down until you see "FME_NAMES_ENCODING".

The first image is from an FME workspace authored on an older version of Windows that does not support setting the process locale encoding to UTF-8. The second is from an FME workspace on a version of Windows 10 where FME has set the process locale encoding to UTF-8. Note that workspaces generated from older FME versions may lack this directive entirely, in which case FME will assume the workspace was generated using the current OS locale encoding.
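
The same check can be scripted instead of scrolling through the file by hand. The following Python sketch searches a workspace file for lines that mention FME_NAMES_ENCODING. The function name `find_names_encoding_lines` is hypothetical, and because the exact layout of the directive line is not described here, the sketch returns the matching lines as-is rather than trying to parse them.

```python
from pathlib import Path
from typing import List

def find_names_encoding_lines(workspace_path: str) -> List[str]:
    """Return every line of the workspace file that mentions
    FME_NAMES_ENCODING, or an empty list if there is none."""
    raw = Path(workspace_path).read_bytes()
    # The file's own encoding may not be known in advance, so bytes that
    # are not valid UTF-8 are replaced instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    return [
        line.strip()
        for line in text.splitlines()
        if "FME_NAMES_ENCODING" in line
    ]
```

An empty result corresponds to the case noted above, where a workspace generated by an older FME version lacks the directive entirely.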

Examples

The following two scenarios may result in incorrectly read file or attribute names in FME 2022+:

opening a workspace in FME 2022+ that was authored
- pre-FME 2022; and
- on a machine which uses a different locale than that of the machine where the .fmw is being opened;

using FME 2022+ on a version of Windows OS which predates Windows 10 version 1903 / Windows Server 2019 build 1809

If the encoded characters are contained within the workspace file name, FME may not be able to read the .fmw at all.

For example, while using FME 2022+ on a Windows Server 2016 machine, users may encounter the following error when trying to add a reader whose source file name contains encoded characters:

The original encoded characters of the source file name have been replaced with ?????, rendering the source file undiscoverable
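
The question marks are typical of a lossy conversion into an encoding that has no representation for the original characters. The Python sketch below illustrates that general effect with a hypothetical file name; it is not a description of FME's internal conversion path.

```python
file_name = "板橋.csv"  # hypothetical source file name

# Under UTF-8 every character can be represented, so nothing is lost.
as_utf8 = file_name.encode("utf-8").decode("utf-8")

# windows-1252 has no Japanese characters; with replacement enabled,
# each character that cannot be encoded becomes "?".
as_cp1252 = file_name.encode("cp1252", errors="replace").decode("cp1252")

print(as_utf8, as_cp1252)
```

Once the characters have been replaced, the original name cannot be recovered from the result, so a file lookup using it fails.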

As another example, users who are working in FME 2022+ on a Windows Server 2016 machine may see the following information reported while reading a source file whose attribute names consist of encoded characters:

The source file contains attribute names which consist of Japanese characters. Without UTF-8 process encoding, these Japanese characters are replaced

Users working in FME 2022+ on a machine that allows applications to set the process encoding to UTF-8 will see the same source file from the two examples above read correctly, including its encoded string values:

The source file read by this CSV reader is the same file that produced the error reported in the image above. Its file name and attributes all contain encoded (Japanese) characters.

Troubleshooting

A workspace that was built on a machine set to one locale (e.g. windows-1252) but is then opened on a machine set to a different locale (e.g. UTF-8) may trigger a Mismatched Schema Encoding notice. This can happen if the workspace contains characters that are not compatible between the two locales involved, and the user tries to add a dataset that conflicts with the locale in which the workspace was built.

If users wish to keep the workspace compatible with the original locale, they should avoid adding datasets that may conflict with the original locale, or accept that there may be dropped features.
If users wish to upgrade the workspace to the new locale, it is recommended that they recreate the workspace on a machine set to the desired locale (e.g. UTF-8).
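
Before adding a dataset to a workspace that should stay compatible with its original locale, it can help to see which names would not fit that locale's encoding. The Python sketch below is one possible pre-check, not FME's own schema validation; the function `names_not_representable` and the sample names are hypothetical.

```python
from typing import Iterable, List

def names_not_representable(names: Iterable[str], encoding: str) -> List[str]:
    """Return the names that cannot be encoded in the given encoding."""
    problems = []
    for name in names:
        try:
            name.encode(encoding)
        except UnicodeEncodeError:
            # At least one character has no representation in this encoding.
            problems.append(name)
    return problems

# Hypothetical attribute names from a dataset about to be added.
candidate_names = ["road_id", "Données", "板橋"]
conflicts = names_not_representable(candidate_names, "cp1252")
print(conflicts)
```

Names reported by such a check are the ones most likely to conflict with a workspace built under that encoding.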
