Introduction
FME Flow Hosted (formerly FME Cloud) includes built-in tooling that helps you deliver a high level of uptime. There are also practices you can adopt so that, if an issue does occur, it has the best chance of being resolved quickly, either by Safe Software or by you.
Configure Monitoring and Alerts
FME Flow Hosted comes with a comprehensive set of monitoring tools that let you track response time, server load, network throughput, number of FME Flow (formerly FME Server) Engines, memory usage, and disk usage. For each metric, you can configure alerts to be sent to one or more email addresses, URLs, or Slack channels.
Thresholds can be configured to tailor when the alerts are sent.
Memory
Memory utilization depends on the specific workspaces you run on your FME Flow Hosted instance. As a starting point, you could configure an alert that triggers if memory utilization exceeds 85% for longer than 30 minutes. Tuning memory alerts beyond that requires some experience with the workspaces you are running.
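To make the "above a threshold for longer than a duration" rule concrete, here is a minimal Python sketch of how such a sustained-threshold check behaves. It is only an illustration and not how FME Flow Hosted evaluates alerts internally; the function name `exceeds_for_duration` and the (minute, value) sample format are assumptions made for this example.

```python
def exceeds_for_duration(samples, threshold, min_minutes):
    """Return True if the metric stays above `threshold` for longer than
    `min_minutes` without dropping back to or below it.

    `samples` is an iterable of (minute, value) pairs ordered by time.
    This is a hypothetical helper, not an FME Flow Hosted API.
    """
    breach_start = None  # minute at which the current breach began
    for minute, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = minute
            if minute - breach_start > min_minutes:
                return True
        else:
            # A single sample at or below the threshold resets the clock.
            breach_start = None
    return False


# Memory at 90% sampled every 5 minutes from minute 0 to minute 40.
steady = [(m, 90.0) for m in range(0, 45, 5)]
print(exceeds_for_duration(steady, threshold=85.0, min_minutes=30))  # True

# A short dip below 85% at minute 25 restarts the 30-minute window.
dipped = [(0, 90.0), (20, 90.0), (25, 80.0), (30, 90.0), (55, 90.0)]
print(exceeds_for_duration(dipped, threshold=85.0, min_minutes=30))  # False
```

The second case shows why a duration is part of the rule: brief spikes and dips do not trigger an alert, only sustained pressure does.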
Server Load
To interpret the server load correctly and set a suitable threshold for your alert, it is important to understand what the server load metric means in relation to the number of cores on your FME Flow Hosted instance.
A load of 1.0 means 100% utilization of 1 core. The FME Flow Hosted Standard instances come with 2 cores, so a load of 2.0 indicates full utilization of both cores. For alerts, we recommend starting with a threshold of around 70% utilization sustained for more than 30 minutes. For example, say you recently increased the engine count on your Standard instance and want to make sure the instance can handle it. You would set your alert threshold to 1.4 (2 cores × 0.7).
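The same arithmetic applies to any core count. The short Python sketch below reproduces the calculation; the function names are hypothetical and chosen for this example only.

```python
def load_alert_threshold(cores, target_utilization):
    """Convert a target utilization (0-1) into a server load threshold.

    A load of 1.0 equals full utilization of one core, so the threshold
    scales linearly with the number of cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in the range (0, 1]")
    return cores * target_utilization


def load_to_utilization(load, cores):
    """Express an observed load value as a fraction of total core capacity."""
    return load / cores


# Standard instance from the text: 2 cores at 70% utilization.
print(round(load_alert_threshold(2, 0.7), 2))  # 1.4

# An observed load of 2.0 on 2 cores means both cores are fully used.
print(load_to_utilization(2.0, 2))  # 1.0
```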
Disk
When your instance runs out of primary disk space, FME Flow will become unresponsive and often won’t be able to recover without rolling back to a previous backup. That’s why the primary disk usage alert (90% usage over 10 minutes) is crucial for high uptime and is enabled for your instances by default. We also highly recommend storing any user-provided data that does not need to persist on the temporary disk rather than on the primary disk. The temporary disk is purged after every reboot and is also easier to resize.
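If you also want to check disk usage from a script of your own, the following Python sketch computes the percentage used for a path with the standard library's `shutil.disk_usage` and compares it with the 90% figure mentioned above. The helper names and the checked path are assumptions for illustration; the built-in FME Flow Hosted alert remains the primary safeguard.

```python
import shutil

DISK_ALERT_PERCENT = 90.0  # matches the default alert level described above


def percent_used(used_bytes, total_bytes):
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes * 100 / total_bytes


def disk_usage_percent(path):
    """Return the percentage of space used on the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


# "." is a placeholder; point this at the disk you actually care about.
current = disk_usage_percent(".")
if current >= DISK_ALERT_PERCENT:
    print(f"Disk usage is {current:.1f}%, at or above the alert level")
else:
    print(f"Disk usage is {current:.1f}%")
```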
Another very useful tool to prevent running out of disk space is the FME Flow System Cleanup.
Response Time
The response time metric is the final indicator that something is wrong with your instance and that the web user interface might not be accessible to users. Unless you disabled the default alerts during the launch process, your instance will trigger an alert when the response time is higher than 500 ms for more than 10 minutes or when the server is completely unresponsive. Ideally, you would receive an alert for high memory usage, high server load, or low disk space before you receive one for a high response time or an unresponsive server, because a high response time or a failure to report metrics is often the result of those conditions.
The better your alerting and notifications, the faster you will be able to respond to any issues.
Read more about how to configure alerts.
Set Emergency Contact
If you are running any workflows in production on FME Flow Hosted, we strongly suggest that you set an Emergency Contact. If this is not set and there are multiple users on the account, it might be unclear who we should contact in the event of an issue. We will not SSH into the instance to fix an issue without your permission, so if we can't get in touch with you, it may significantly delay how quickly we can resolve the issue.
Stability
Ensure your instance is secure and optimized for your workflows.
Don’t Under-Provision the Instance
When an FME Flow instance experiences downtime, nine times out of ten the cause is an instance that is too small to handle its workflows.
Once you have FME Flow configured, review the metrics for your instance to make sure it is correctly sized.
Fast Recovery
Configure Backups
Backups are key to delivering a high level of uptime as they allow you to roll back in minutes to a previous snapshot of the instance.
For every FME Flow instance you launch on FME Flow Hosted, you can configure the number of backups to store. We back up FME Flow every 24 hours and at other predefined events (e.g. Pause Instance, Resize Instance, Modify Disk Size).
You can configure a minimum of 2 backups and a maximum of 10 backups. Once the configured limit is reached, the oldest backup is deleted automatically. Storing more backups costs more, but it means you can roll back to an older snapshot of your instance.
If there is an issue with your instance, or you accidentally delete some data on FME Flow, you can follow the steps here to roll back. By rolling back, you will lose all of the data on the currently running FME Flow, and the data and configuration at the time of the backup will be restored.
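The rotation behavior can be pictured with a small Python simulation. This is only a model of the rule described above (a limit between 2 and 10, with the oldest backup dropped once the limit is reached), not the actual backup mechanism; the names `make_backup_store` and `backup-day-N` are invented for the example.

```python
from collections import deque

MIN_BACKUPS = 2   # lowest configurable backup count
MAX_BACKUPS = 10  # highest configurable backup count


def make_backup_store(limit):
    """Create a fixed-size store that discards the oldest entry when full."""
    if not MIN_BACKUPS <= limit <= MAX_BACKUPS:
        raise ValueError(
            f"limit must be between {MIN_BACKUPS} and {MAX_BACKUPS}"
        )
    return deque(maxlen=limit)


# Keep 3 backups and simulate 5 daily backups.
store = make_backup_store(3)
for day in range(1, 6):
    store.append(f"backup-day-{day}")

# Days 1 and 2 were dropped automatically when days 4 and 5 arrived.
print(list(store))  # ['backup-day-3', 'backup-day-4', 'backup-day-5']
```

With a limit of 3 and one backup per day, the oldest state you can roll back to is roughly three days old, which is the trade-off between cost and rollback depth.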
Take a Snapshot When You Hit a Milestone
A snapshot is simply a backup that is never deleted automatically. Snapshots can also help ensure high uptime by providing a reliable configuration to roll back to.
For example, you may set up an FME Flow instance ready for production on the 1st of June, 2023, and then take a snapshot of it. FME Flow is in use for 3 months, and then errors start occurring. Eventually, it is discovered that an FME Flow admin deleted some security permissions. Rather than figuring out which backup to roll back to, you can simply roll back to the snapshot.