UTF-8 Names FAQ | Encoding in FME

Liz Sanderson

FME Version

  • FME 2022.0

Introduction

As of FME 2022.0, FME processes all workspaces using UTF-8 as the encoding for most string data when running on Windows 10+ or Windows Server 2019+, provided the Windows version supports setting the process locale to UTF-8. This allows FME to execute workspaces that were created in different locales and helps minimize locale issues. Existing workspaces created prior to FME 2022.0 should continue to run as before if they are run on the same locale in which they were originally generated. Linux and macOS systems already use UTF-8 as their default encoding, so we do not expect a change in behavior between old and new workspaces authored on those platforms.
 

Migration Guides

 

Why UTF-8

UTF-8 is a universal Unicode encoding that is ASCII compatible: any ASCII text is also valid UTF-8. UTF-8 can losslessly store characters from any other character encoding, which preserves text from any written language. In previous versions of FME, the workspace assumed that source data, feature type names, attribute names, and paths were stored in the system locale encoding. This is a problem when a workspace is created in a different system locale than the one it is run on: features may not be read from feature types, values may not be read from attributes, and datasets and other OS paths may not be interpreted correctly during translation.
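The two properties described above (ASCII compatibility and lossless storage) can be checked with a few lines of Python. This is a minimal illustration using only the standard library, not FME code; the sample strings and the `can_encode` helper are arbitrary choices made for the example.

```python
# ASCII text produces identical bytes whether encoded as ASCII or UTF-8.
ascii_text = "roads_2022"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Text that originated in different legacy encodings survives a round
# trip through UTF-8 unchanged.
samples = {
    "shift_jis": "板橋",   # Japanese
    "cp1252": "Zürich",    # Western European
}
for legacy_encoding, text in samples.items():
    legacy_bytes = text.encode(legacy_encoding)
    utf8_bytes = legacy_bytes.decode(legacy_encoding).encode("utf-8")
    assert utf8_bytes.decode("utf-8") == text


def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A single legacy code page cannot hold both samples, but UTF-8 can.
print(can_encode("板橋", "cp1252"))  # False
print(can_encode("板橋", "utf-8"))   # True
```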
 

Terminology

  • Names: Feature type, attribute, trait, and measure names. Names are implied to be process-encoded. 
  • OS: Operating system, the default encoding is based on the region/locale. 
  • Process: Refers to the application processes used by the FME Platform, and their respective encoding. Used to interpret the character encoding of strings, unless otherwise specified. 
  • Tagged: Strings with an explicitly known character encoding, for example Windows-1252, Shift_JIS, or UTF-8. 
  • Unicode: The standard for the consistent encoding, representation, and handling of text. Includes encodings such as UTF-8, UTF-16, and UTF-32.
  • Untagged: Strings without a known character encoding, assumed to be in process encoding in FME 2022.0.
  • Values: Attribute or trait string values. Values are encoding aware and may be tagged with explicit character encodings. 
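
The difference between tagged and untagged strings can be sketched in Python. The `StringValue` class below is hypothetical and exists only for this illustration; it is not an FME class or part of any FME API. It shows why a tagged value reads the same under any process encoding, while an untagged value depends on it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StringValue:
    """Hypothetical model of a string value; not an FME class."""
    raw: bytes
    encoding: Optional[str] = None  # None means the value is untagged

    def text(self, process_encoding: str) -> str:
        # Tagged values use their own encoding; untagged values fall back
        # to whatever the process encoding happens to be.
        return self.raw.decode(self.encoding or process_encoding, errors="replace")


tagged = StringValue("板橋".encode("shift_jis"), encoding="shift_jis")
untagged = StringValue("板橋".encode("utf-8"))

# The tagged value is interpreted identically in both processes.
print(tagged.text("utf-8") == tagged.text("cp1252"))      # True
# The untagged value changes with the process encoding.
print(untagged.text("utf-8") == untagged.text("cp1252"))  # False
```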

Process

Modern versions of Windows 10 and later allow applications to set the locale to UTF-8. This has been automatically enabled in FME 2022.0 and newer, which allows any input data or names in a workspace to be correctly represented when processed within FME. It also means that other applications that set their own locale to UTF-8 on Windows without user configuration, such as ArcGIS Pro, can work with FME. However, applications that use the FME Objects interface run within the same process as FME and should be careful not to change the locale from UTF-8; doing so may result in incorrect interpretation of names.
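
For code that shares a process with FME and must call something that changes the locale, one defensive pattern is to save the current setting and restore it afterwards. The sketch below uses Python's standard `locale` module; the `preserve_ctype_locale` helper is a hypothetical name, and whether the C runtime locale that Python manipulates is the same setting FME relies on is an assumption, not something this article confirms.

```python
import locale
from contextlib import contextmanager


@contextmanager
def preserve_ctype_locale():
    """Restore the LC_CTYPE locale on exit, whatever happens inside."""
    # Passing None queries the current setting without changing it.
    saved = locale.setlocale(locale.LC_CTYPE, None)
    try:
        yield saved
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


before = locale.setlocale(locale.LC_CTYPE, None)
with preserve_ctype_locale():
    # Stand-in for third-party code that switches the locale.
    locale.setlocale(locale.LC_CTYPE, "C")
after = locale.setlocale(locale.LC_CTYPE, None)
print(before == after)
```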
 

Pros and Cons

Pros:

  • Workspaces will always open as UTF-8 and save without loss
  • Running translations will honor attribute data
  • FME can display attributes without misinterpreting the characters
  • Any changes to workspaces can be saved while preserving the authoring encoding
  • Most UTF-8 path names are supported 

Cons:

  • Workspaces will always open as UTF-8
  • Workspaces authored prior to FME 2022.0 will continue to lack cross-platform or cross-locale compatibility
  • Older versions of Windows and Windows Server do not allow FME to set the process locale to UTF-8, so encoding behavior is the same as in older versions of FME (requires Windows 10 version 1903+ or Windows Server 2019 b1809+, as well as Windows SDK version 10.0.17134.0+)

 

2021.2 Vs 2022.0+ 

| | 2021.2 and older | 2022.0 and newer |
|---|---|---|
| Process Encoding | Process encoding equals OS encoding | Process encoding does not equal OS encoding |
| Defaults | Windows default encoding uses a code page based on the region/locale | FME process encoding is always UTF-8 if supported by the OS |
| Name Encoding | Names were untagged | Names are still untagged and are assumed to be process-encoded (UTF-8) strings |
| Value Encoding | Values may be tagged or untagged strings | Values may be tagged or untagged strings |
| Example 1 | Attribute(string): 'process_encoded_name' has value 'untagged_process_encoded_value' | Attribute(string): 'utf-8_encoded_name' has value 'untagged_utf-8_encoded_value' |
| Example 2 | Attribute(encoded: Shift_JIS): 'JAPANESE' has value 'tagged_shift_jis_value_板橋' | Attribute(encoded: Shift_JIS): 'utf-8_encoded_name' has value 'tagged_shift_jis_value_板橋' |

 

How to Determine if you are in UTF-8

Note: In FME 2022.0, FME is automatically in UTF-8 if supported by the operating system.

To determine if FME is set to UTF-8, from within FME Workbench, go to Tools > About FME Workbench, then in the dialog select More Info. In the FME Information window, under Codepage, you’ll see what encoding FME is in. 
UTF82022.png
This example is in UTF-8 as the Operating System supports setting the locale to UTF-8. 


Windows12.png
This example is from an unsupported Windows operating system, where the locale encoding is windows-1252 (closely related to, but not identical to, ISO-8859-1, also known as Latin-1). As a result, the Codepage value is still windows-1252. To author workspaces in UTF-8, the user would need to update to a supported version of Windows, or switch to macOS or Linux. 
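
A similar check can be made from a Python script. The sketch below reports the encoding Python considers preferred for the current process and normalizes the name so that spellings such as "UTF8" and "utf_8" compare equal. The `is_utf8` helper is a hypothetical name, and it is an assumption that the value Python reports matches the Codepage value shown in the FME Information window.

```python
import codecs
import locale


def is_utf8(encoding_name: str) -> bool:
    """Return True if the name is any recognized alias of UTF-8."""
    try:
        # codecs.lookup() resolves aliases to one canonical codec name.
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        return False


current = locale.getpreferredencoding(False)
print(current, is_utf8(current))
```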
 

How to Check the Encoding for a Workspace

To check the encoding that a workspace was authored with, open the workspace .fmw in a text editor. Then scroll down until you see “FME_NAMES_ENCODING.” 
FMENames.png
The first image is from an FME workspace authored on an older version of Windows that does not support setting the process locale encoding to UTF-8. The second is from an FME workspace on a version of Windows 10 where FME has set the process locale encoding to UTF-8. Note that workspaces generated from older FME versions may lack this directive entirely, in which case FME will assume the workspace was generated in the current OS locale encoding.
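
For many workspaces, the same check can be scripted. The sketch below searches a workspace file for lines that mention FME_NAMES_ENCODING and returns them as found. The function name and the example path are hypothetical, and the code makes no assumption about the exact syntax of the directive; it only locates the lines so they can be inspected. An empty result corresponds to the case described above where the directive is absent.

```python
from pathlib import Path


def find_names_encoding_lines(workspace_path):
    """Return every line of the file that mentions FME_NAMES_ENCODING."""
    data = Path(workspace_path).read_bytes()
    matches = []
    # Work on bytes so the search does not depend on the file's encoding.
    for line in data.splitlines():
        if b"FME_NAMES_ENCODING" in line:
            # Non-ASCII bytes, if any, are shown as replacement characters.
            matches.append(line.strip().decode("ascii", errors="replace"))
    return matches


# Example call (hypothetical path):
# for line in find_names_encoding_lines("example.fmw"):
#     print(line)
```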
 


Example

Opening a file with a different encoding prior to FME 2022.0, or using FME 2022.0 on an older version of Windows, results in the file name or attributes being read incorrectly. If the file name itself contains the encoded characters, FME may not be able to read the file at all. 
WithoutUTF8.png
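
The underlying mechanism can be reproduced outside FME. In this minimal Python sketch, a name stored as Shift_JIS bytes is interpreted once under the wrong code page and once under the correct one. The file name is an arbitrary sample, and the code illustrates the general effect only, not FME's internal behavior.

```python
# A name as it might be stored by a system using Shift_JIS.
name = "板橋.csv"
stored_bytes = name.encode("shift_jis")

# Interpreting the same bytes under a Western European code page yields
# different characters, so the name no longer matches.
misread = stored_bytes.decode("cp1252", errors="replace")
print(misread == name)  # False

# Interpreting the bytes with the encoding they were written in recovers
# the original name.
correct = stored_bytes.decode("shift_jis")
print(correct == name)  # True
```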

Generating a workspace to open and read an encoded source dataset in FME 2022.0, on a supported version of Windows or in macOS/Linux, results in the encoded string values from the 
WithUTF8.png


Troubleshooting

A Mismatched Schema Encoding Detected notice appears when a workspace that was created in one locale (e.g. windows-1252) is opened in a different locale (e.g. UTF-8), the workspace contains characters that are not compatible between the two, and the user adds a dataset that conflicts with the original locale.

  • If you are trying to stay in the original locale, avoid adding datasets that may conflict, or accept that there may be dropped features. 
  • If you would like to upgrade to the new locale, recreate the workspace in the desired locale (e.g. UTF-8). 

EncodingError.png
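
Before adding a dataset to a workspace that must stay in its original locale, it can help to check which names that locale's encoding can represent. The sketch below is a standalone pre-check in plain Python; the function name and sample names are hypothetical, and this is not how FME itself detects the mismatch.

```python
def names_not_representable(names, encoding):
    """Return the names containing characters the encoding cannot store."""
    problem_names = []
    for name in names:
        try:
            name.encode(encoding)
        except UnicodeEncodeError:
            problem_names.append(name)
    return problem_names


names = ["road_id", "Straße", "板橋"]
print(names_not_representable(names, "cp1252"))  # ['板橋']
print(names_not_representable(names, "utf-8"))   # []
```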
 
