Symptom
Job Failure
The job may fail when run directly with the Job Submitter, or you may notice in the job history that a scheduled job has failed to complete.
FME Server Log
In the fmeserver.log file you may see the following errors:
- Could not read FME Engine response. Connection may have been lost
- Failed to get translation result. Returning failed result.
In most cases, this will be followed by a resubmission attempt:
- Job <#> failed and has been resubmitted. This is resubmittal 1 out of a maximum of 3.
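If you want to confirm how often these messages are appearing, a quick scan of fmeserver.log can help. The following is a minimal sketch; the log path shown is an assumption based on a default Windows installation, so adjust it to your own FME Flow System Share location.

```python
from pathlib import Path

# Assumed default log location on Windows -- adjust to your FME Flow System Share.
LOG_PATH = Path(r"C:\ProgramData\Safe Software\FME Server"
                r"\resources\logs\core\current\fmeserver.log")

# Failure messages described above.
PATTERNS = [
    "Could not read FME Engine response",
    "Failed to get translation result",
    "has been resubmitted",
]

def scan_log(path=LOG_PATH):
    """Print every log line that matches one of the known failure messages."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if any(pattern in line for pattern in PATTERNS):
                print(f"{lineno}: {line.rstrip()}")

if __name__ == "__main__":
    scan_log()
```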
Cannot Run Jobs
You may not be able to run jobs or see any engines.
- The fmeprocessmonitorengine.log may show 'FME Engine failed to register with FME Flow '<hostname>' on port 7070. Failed to verify message authenticity'.
- Check the queue.log for 'Could not create server TCP listening socket 127.0.0.1:6379: bind: address in use' or 'Failed listening on port 6379 (TCP), aborting'.
- Check the fmeprocessmonitorcore.log for 'Process "Queue" ended unexpectedly and has reached its start attempts limit of 20'.
When the FME Flow Core service is started, it launches several sub-processes. One of these is the Job Queue, which runs as memurai.exe in FME Flow 2023 and newer (redis-server.exe in FME Server 2022 and older).
If this process cannot start, engines will not connect to the FME Flow Core correctly and will not be able to run jobs.
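If you suspect the Job Queue cannot bind to its port, you can quickly check whether anything is listening on port 6379 on the Core host. This is a minimal sketch using only the Python standard library; the port number comes from the queue.log messages above.

```python
import socket

def port_in_use(host="127.0.0.1", port=6379):
    """Return True if a process is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex() returns 0 when the connection succeeds,
        # which means something is already listening on the port.
        return sock.connect_ex((host, port)) == 0

if __name__ == "__main__":
    if port_in_use():
        print("Port 6379 is in use: the Job Queue (memurai/redis) is running,"
              " or another process is holding the port.")
    else:
        print("Port 6379 is free: the Job Queue process is not listening.")
```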
Job Log
You may also notice that a job has no log file, or that its log file stops abruptly before the summary statistics are written.
Cause
These errors occur when the job that was running on FME Server crashes the FME Engine. FME Engine crashes have varying causes and by their nature are usually unknown to Safe Software. Here are some causes we have seen in the past:
- Python errors or crashes
- Oracle client configuration errors
- large datasets with memory-intensive operations
- service account permissions
- port scanners used in security audits
- network interruptions
Resolution
Upgrade to the latest FME Server
If you are running an older FME Server version, we suggest upgrading to the latest version of FME Flow, as changes are constantly being made to FME Flow to make the software 'smarter' against port scanners and other outside interference.
Check Permissions
If the issue is persistent for a particular workspace or job, try changing the user account running the FME Engine service to your own account (if you have admin rights) and then run the job again. You can run a permissions test to quickly check this. If this corrects the issue, see 'Running the FME Server Engines Under a Different Account' for a step-by-step guide to changing the user account for the service.
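One quick way to run such a permissions test is to sign in (or open a command prompt) as the engine's service account and check whether it can see and read the data the failing job uses. The sketch below is illustrative only; the dataset path is a hypothetical placeholder to replace with the path from the failing workspace.

```python
import getpass
import os

# Hypothetical placeholder -- replace with the dataset path the failing job uses.
DATASET = r"\\fileserver\gis\source_data.gdb"

def check_access(path=DATASET):
    """Report whether the current account can see, read, and write the path."""
    print(f"Running as: {getpass.getuser()}")
    if not os.path.exists(path):
        print(f"Cannot see {path} - check share and NTFS permissions"
              " for the engine service account.")
        return
    print(f"{path}: readable={os.access(path, os.R_OK)},"
          f" writable={os.access(path, os.W_OK)}")

if __name__ == "__main__":
    check_access()
```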
Contact Technical Support
If checking the permissions does not resolve the issue, we recommend that you create a support request here. You can help us by determining which workspace and/or circumstance is causing the failure. Check the workspace logs (if available) to see if a crash is occurring. Check Windows Event Viewer for any reports of an Application Crash. Try iteratively removing parts of the workspace until the error disappears. The last piece that you removed is likely causing the crash. Send us logs, workspaces and any other information that can help reproduce the error.
Special Case - Distributed Database
During a distributed installation, FME Flow may start before the system database has been configured. This can cause some FME Flow processes to become unhealthy. Once the system database is configured, FME Flow is often restarted to connect to the new database. Occasionally, when restarting FME Flow, some processes may not shut down cleanly. This can occur at other times as well, but is very rare, especially for Express Installs.
- Watch for memurai.exe and FMEConnections.exe in Task Manager.
If engines are not appearing and jobs cannot be run, it is possible the memurai.exe process is hung and is preventing engines from connecting to the Core correctly. Confirm this by reviewing queue.log for errors binding to port 6379.
To resolve situations like this:
- Stop FME Flow services. NOTE: It is only necessary to stop the FME Flow Core service as this will stop the FME Flow Engine service. It is not necessary to stop the FME Flow Application Server or FME Flow Database services.
- Review Task Manager (Details tab) for FME* processes (ignoring FMEFlow_ApplicationServer.exe) and memurai.exe. Terminate any matching processes that remain (see the sketch after these steps for one way to enumerate them).
- Start FME Flow Services.
- Review the logs and check if engines have returned to the Web UI.
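As an alternative to scanning Task Manager by eye in the second step, a short script can list any leftover processes once the services are stopped. This is a sketch that assumes the third-party psutil package is available on the Core host; terminating the processes themselves can still be done from Task Manager.

```python
import psutil  # third-party package: pip install psutil

def leftover_fme_processes():
    """List processes that should be gone once the FME Flow services are stopped."""
    leftovers = []
    for proc in psutil.process_iter(["pid", "name"]):
        name = (proc.info["name"] or "").lower()
        # Ignore the Application Server, as described in the steps above.
        if name == "fmeflow_applicationserver.exe":
            continue
        # Match FME* processes and the Job Queue (memurai.exe).
        if name.startswith("fme") or name == "memurai.exe":
            leftovers.append((proc.info["pid"], proc.info["name"]))
    return leftovers

if __name__ == "__main__":
    for pid, name in leftover_fme_processes():
        print(f"{pid}\t{name}")
```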
Special Case - Python and FME Server
If you are using Python and this error started happening after upgrading to a newer version of FME Server, the problem is likely that FME is attempting to load a Python interpreter that is not available. Review the workspace parameter under Scripting called Python Compatibility (in newer versions of FME Desktop). This tells the FME Server engine which Python interpreter it should attempt to load. Occasionally this setting is set to Esri ArcGIS Desktop (or a similar value from an older version of FME Desktop), which can cause issues when the FME Server Engine attempts to load a non-existent library (i.e. ArcGIS Desktop is not installed but ArcGIS Pro is).
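To confirm which interpreter the engine actually loads, one option is to add a temporary PythonCaller to the workspace that writes the interpreter details to the job log. The following is a minimal sketch, assuming a PythonCaller configured to call the log_interpreter function; the messages will appear in the translation log.

```python
import sys
from fmeobjects import FMELogFile, FME_INFORM

def log_interpreter(feature):
    """Write the interpreter the engine loaded for this job to the log."""
    log = FMELogFile()
    log.logMessageString("Python executable: {}".format(sys.executable), FME_INFORM)
    log.logMessageString("Python version: {}".format(sys.version), FME_INFORM)
    # Features pass through unchanged; this caller only logs.
```

If the logged interpreter is not the one you expect (for example, an ArcGIS Desktop interpreter that is not installed on the engine host), adjust the Python Compatibility parameter accordingly.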
Special Case - Network Interruptions
Job Failure due to Network Interruption: These are specific to workspaces that interact with network resources, for example, remote databases or files on remote file servers. When the network fails, active and inactive connections that the engine has established to network resources can have unknown outcomes, causing jobs to fail or crash abruptly.
Distributed Engine disconnection due to Network Interruption: These are specific to distributed engines that are running jobs. A known issue for FME Server configured with distributed engines is that when the FME Server Core loses connection to the distributed engines, the engines are not able to reconnect to the Core. During the network outage, the engines may be unaffected and may continue to run jobs to completion, but when the Core restarts or reconnects, the engines are rejected. The following error may appear repeatedly in the fmeserver.log as the engines try to connect to the Core:
FME Server license does not allow more than maximum of 'X' FME Engine(s).
('X' will be the number of engines your FME Server is licensed for) and in the fmeprocessmonitorengine.log:
FME Engine failed to register with FME Server '<corehost>' on port 7070. Failed to verify message authenticity
The first error occurs because the Core has not closed the port connections to the engines properly; it still believes the engines are connected and will not license any new engine instances. The second error relates to the Core rejecting the engines' requests to establish a new connection.
Any jobs submitted while the engines are unable to reconnect to the Core, and that are assigned to run on the remote engines, will queue indefinitely. The Web UI Deployment Status page will continue to show the old jobs as still running on the distributed engines, even though they would normally have completed shortly after the network outage. The Jobs Running page will show the same jobs that were running at the time of the network outage.
Workaround: At the OS level, there is a network setting called KEEPALIVETIME; by default, it is set to two hours. If a network outage occurs and nothing is done (the FME Server Core service is not restarted), the Core will drop the orphaned connections two hours after the network interruption, and the engines will reconnect properly on the next attempt and start pulling jobs from the queue. The jobs that were running on the distributed engines at the time of the interruption are resubmitted to the job queue regardless of the job status (success or failure). This is based on the RETRIES parameter in the fmeServerConfig.txt file.
It is possible to reduce the time the orphaned connection ports remain open. Review the Microsoft KEEPALIVETIME documentation and this article provided by Esri Technical Support. We would recommend setting the timeout to 3 to 5 minutes (180,000 ms to 300,000 ms), but consider all other factors, including any other local software that could be affected by this change.
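On Windows, this setting is the KeepAliveTime registry value (in milliseconds) under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters. The sketch below only reads the current value so you can see what is in effect; changing it requires administrative rights, should follow the Microsoft and Esri guidance above, and typically requires a reboot to take effect.

```python
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def read_keepalive_time():
    """Return KeepAliveTime in milliseconds, or None if the value is not set.

    When the value is absent, Windows falls back to its default of
    7,200,000 ms (two hours).
    """
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, "KeepAliveTime")
            return value
        except FileNotFoundError:
            return None

if __name__ == "__main__":
    current = read_keepalive_time()
    if current is None:
        print("KeepAliveTime is not set; the Windows default of 7,200,000 ms (2 hours) applies.")
    else:
        print(f"KeepAliveTime = {current} ms")
```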