Symptom
Job Failure
The job may fail when run directly with the Job Submitter, or you may notice in the job history that a scheduled job has failed to complete.
FME Server Log
In the fmeserver.log file you may see the following errors:
- Could not read FME Engine response. Connection may have been lost
- Failed to get translation result. Returning failed result.
In most cases, this will be followed by a resubmission attempt:
- Job <#> failed and has been resubmitted. This is resubmittal 1 out of a maximum of 3.
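If you want to confirm how often these messages are appearing, a quick scan of fmeserver.log can help. The following is a minimal sketch; the log path shown is an assumption based on a default Windows installation, so adjust it to your own FME Flow System Share location.

```python
from pathlib import Path

# Assumed default log location on Windows -- adjust to your FME Flow System Share.
LOG_PATH = Path(r"C:\ProgramData\Safe Software\FME Server"
                r"\resources\logs\core\current\fmeserver.log")

# Failure messages described above.
PATTERNS = [
    "Could not read FME Engine response",
    "Failed to get translation result",
    "has been resubmitted",
]

def scan_log(path=LOG_PATH):
    """Print every log line that matches one of the known failure messages."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if any(pattern in line for pattern in PATTERNS):
                print(f"{lineno}: {line.rstrip()}")

if __name__ == "__main__":
    scan_log()
```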
Cannot Run Jobs
You may not be able to run jobs or see any engines.
- The fmeprocessmonitorengine.log may show 'FME Engine failed to register with FME Flow '<hostname>' on port 7070. Failed to verify message authenticity'.
- Check the queue.log for 'Could not create server TCP listening socket 127.0.0.1:6379: bind: address in use' or 'Failed listening on port 6379 (TCP), aborting'.
- Check the fmeprocessmonitorcore.log for 'Process "Queue" ended unexpectedly and has reached its start attempts limit of 20'.
When the FME Flow Core service is started, it launches several sub-processes. One of these is the Job Queue, which runs as memurai.exe in FME Flow 2023 and newer (redis-server.exe in FME Server 2022 and older).
If this process cannot start, engines will not connect to the FME Flow Core correctly and will not be able to run jobs.
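If you suspect the Job Queue cannot bind to its port, you can quickly check whether anything is listening on port 6379 on the Core host. This is a minimal sketch using only the Python standard library; the port number comes from the queue.log messages above.

```python
import socket

def port_in_use(host="127.0.0.1", port=6379):
    """Return True if a process is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex() returns 0 when the connection succeeds,
        # which means something is already listening on the port.
        return sock.connect_ex((host, port)) == 0

if __name__ == "__main__":
    if port_in_use():
        print("Port 6379 is in use: the Job Queue (memurai/redis) is running,"
              " or another process is holding the port.")
    else:
        print("Port 6379 is free: the Job Queue process is not listening.")
```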
Job Log
You may also notice that a job has no log file, or that its log file stops abruptly before the summary statistics are written.
Cause
These errors occur when the job that was running on FME Server crashes the FME Engine. FME Engine crashes have varying causes and by their nature are usually unknown to Safe Software. Here are some causes we have seen in the past:
- Python errors or crashes
- Oracle client configuration errors
- large datasets with memory-intensive operations
- service account permissions
- port scanners used in security audits
- network interruptions
Resolution
Upgrade to the latest FME Server
If you are running an older FME Server version, we suggest upgrading to the latest version of FME Flow, as changes are constantly being made to FME Flow to make the software 'smarter' against port scanners and other outside interference.
Check Permissions
If the issue is persistent for a particular workspace or job, try changing the user account running the FME Engine service to your own account (if you have admin rights) and then run the job again. You can run a permissions test to quickly check this. If this corrects the issue, see 'Running the FME Server Engines Under a Different Account' for a step-by-step guide to changing the user account for the service.
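One quick way to run such a permissions test is to sign in (or open a command prompt) as the engine's service account and check whether it can see and read the data the failing job uses. The sketch below is illustrative only; the dataset path is a hypothetical placeholder to replace with the path from the failing workspace.

```python
import getpass
import os

# Hypothetical placeholder -- replace with the dataset path the failing job uses.
DATASET = r"\\fileserver\gis\source_data.gdb"

def check_access(path=DATASET):
    """Report whether the current account can see, read, and write the path."""
    print(f"Running as: {getpass.getuser()}")
    if not os.path.exists(path):
        print(f"Cannot see {path} - check share and NTFS permissions"
              " for the engine service account.")
        return
    print(f"{path}: readable={os.access(path, os.R_OK)},"
          f" writable={os.access(path, os.W_OK)}")

if __name__ == "__main__":
    check_access()
```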
Contact Technical Support
If checking the permissions does not resolve the issue, we recommend that you create a support request here. You can help us by determining which workspace and/or circumstance is causing the failure. Check the workspace logs (if available) to see if a crash is occurring. Check Windows Event Viewer for any reports of an Application Crash. Try iteratively removing parts of the workspace until the error disappears. The last piece that you removed is likely causing the crash. Send us logs, workspaces and any other information that can help reproduce the error.
Special Case - Distributed Database
During a distributed installation, FME Flow may start before the system database has been configured. This can cause some FME Flow processes to become unhealthy. Once the system database is configured, FME Flow is often restarted to connect to the new database. Occasionally, when restarting FME Flow, some processes may not shut down cleanly. This can occur at other times as well, but is very rare, especially for Express Installs.
- Watch for memurai.exe and FMEConnections.exe in Task Manager.
If engines are not appearing and jobs cannot be run, it is possible the memurai.exe process is hung and is preventing engines from connecting to the Core correctly. Confirm this by reviewing queue.log for errors binding to port 6379.
To resolve situations like this:
- Stop FME Flow services. NOTE: It is only necessary to stop the FME Flow Core service as this will stop the FME Flow Engine service. It is not necessary to stop the FME Flow Application Server or FME Flow Database services.
- Review Task Manager (Details tab) for FME* processes (ignoring FMEFlow_ApplicationServer.exe) and memurai.exe. Terminate any matching processes that remain (see the sketch after these steps for one way to enumerate them).
- Start FME Flow Services.
- Review the logs and check if engines have returned to the Web UI.
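As an alternative to scanning Task Manager by eye in the second step, a short script can list any leftover processes once the services are stopped. This is a sketch that assumes the third-party psutil package is available on the Core host; terminating the processes themselves can still be done from Task Manager.

```python
import psutil  # third-party package: pip install psutil

def leftover_fme_processes():
    """List processes that should be gone once the FME Flow services are stopped."""
    leftovers = []
    for proc in psutil.process_iter(["pid", "name"]):
        name = (proc.info["name"] or "").lower()
        # Ignore the Application Server, as described in the steps above.
        if name == "fmeflow_applicationserver.exe":
            continue
        # Match FME* processes and the Job Queue (memurai.exe).
        if name.startswith("fme") or name == "memurai.exe":
            leftovers.append((proc.info["pid"], proc.info["name"]))
    return leftovers

if __name__ == "__main__":
    for pid, name in leftover_fme_processes():
        print(f"{pid}\t{name}")
```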
Special Case - Python and FME Server
If you are using Python and this error started happening after upgrading to a newer version of FME Server, the problem is likely that FME is attempting to load a Python interpreter that is not available. Review the workspace parameter under Scripting called Python Compatibility (in newer versions of FME Desktop). This tells the FME Server engine which Python interpreter it should attempt to load. Occasionally this setting is set to Esri ArcGIS Desktop (or a similar value from an older version of FME Desktop), which can cause issues when the FME Server Engine attempts to load a non-existent library (i.e. ArcGIS Desktop is not installed but ArcGIS Pro is).
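To confirm which interpreter the engine actually loads, one option is to add a temporary PythonCaller to the workspace that writes the interpreter details to the job log. The following is a minimal sketch, assuming a PythonCaller configured to call the log_interpreter function; the messages will appear in the translation log.

```python
import sys
from fmeobjects import FMELogFile, FME_INFORM

def log_interpreter(feature):
    """Write the interpreter the engine loaded for this job to the log."""
    log = FMELogFile()
    log.logMessageString("Python executable: {}".format(sys.executable), FME_INFORM)
    log.logMessageString("Python version: {}".format(sys.version), FME_INFORM)
    # Features pass through unchanged; this caller only logs.
```

If the logged interpreter is not the one you expect (for example, an ArcGIS Desktop interpreter that is not installed on the engine host), adjust the Python Compatibility parameter accordingly.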
Special Case - Network Interruptions
Job Failure due to Network Interruption: These are specific to workspaces that interact with network resources, for example, remote databases or files on remote file servers. When the network fails, active and inactive connections that the engine has established to network resources can have unknown outcomes, causing jobs to fail or crash abruptly.
Distributed Engine disconnection due to Network Interruption: These are specific to distributed engines that are running jobs. A known issue for FME Server configured with distributed engines is that when the FME Server Core loses connection to the distributed engines, the engines are not able to reconnect to the Core. During the network outage, the engines may be unaffected and may continue to run jobs to completion, but when the Core restarts or reconnects, the engines are rejected. The following error may appear repeatedly in the fmeserver.log as the engines try to connect to the Core:
FME Server license does not allow more than maximum of 'X' FME Engine(s).
('X' will be the number of engines your FME Server is licensed for) and in the fmeprocessmonitorengine.log:
FME Engine failed to register with FME Server '<corehost>' on port 7070. Failed to verify message authenticity
The first error occurs because the Core has not closed the port connections to the engines properly; it still believes the engines are connected and will not license any new engine instances. The second error relates to the Core rejecting the engines' requests to establish a new connection.
Any jobs submitted while the engines are unable to reconnect to the Core, and that are assigned to run on the remote engines, will queue indefinitely. The Web UI Deployment Status page will continue to show the old jobs as still running on the distributed engines, even though they would normally have completed shortly after the network outage. The Jobs Running page will show the same jobs that were running at the time of the network outage.
Workaround: At the OS level, there is a network setting called KEEPALIVETIME; by default, it is set to two hours. If a network outage occurs and nothing is done (the FME Server Core service is not restarted), the Core will drop the orphaned connections two hours after the network interruption, and the engines will reconnect properly on the next attempt and start pulling jobs from the queue. The jobs that were running on the distributed engines at the time of the interruption are resubmitted to the job queue regardless of the job status (success or failure). This is based on the RETRIES parameter in the fmeServerConfig.txt file.
It is possible to reduce the time the orphaned connection ports remain open. Review the Microsoft KEEPALIVETIME documentation and this article provided by Esri Technical Support. We would recommend setting the timeout to 3 to 5 minutes (180,000 ms to 300,000 ms), but consider all other factors, including any other local software that could be affected by this change.
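On Windows, this setting is the KeepAliveTime registry value (in milliseconds) under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters. The sketch below only reads the current value so you can see what is in effect; changing it requires administrative rights, should follow the Microsoft and Esri guidance above, and typically requires a reboot to take effect.

```python
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def read_keepalive_time():
    """Return KeepAliveTime in milliseconds, or None if the value is not set.

    When the value is absent, Windows falls back to its default of
    7,200,000 ms (two hours).
    """
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, "KeepAliveTime")
            return value
        except FileNotFoundError:
            return None

if __name__ == "__main__":
    current = read_keepalive_time()
    if current is None:
        print("KeepAliveTime is not set; the Windows default of 7,200,000 ms (2 hours) applies.")
    else:
        print(f"KeepAliveTime = {current} ms")
```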