Overview
This article helps you diagnose and resolve common issues when deploying and operating FME Flow on Kubernetes. Each section lists the symptoms, likely causes, and resolutions you can apply. For full deployment guidance and other insights into the deployment process, please review this article.
Prerequisites
Before troubleshooting, confirm the following are in place. See Prerequisites and Considerations.
- A working Kubernetes cluster with adequate CPU and memory
- Helm installed and the Safe Software chart repo added (helm repo add safesoftware https://safesoftware.github.io/helm-charts)
- An ingress controller (NGINX) available in the cluster
- Network access to any external dependencies such as image registries, license server, and database
Common issues
- Can’t reach Web UI after install
- Engines don’t scale from the Web UI
- Wrong FME version pulled
- Helm repo/values not initialized
- 502/504 through Ingress or UI timeouts
- Pods stuck in Pending due to storage
- CrashLoopBackOff on startup
- ImagePullBackOff
- OOMKilled or slow performance
- Startup/readiness probe 503
- Database connection errors
Can’t reach Web UI after install
- Symptoms
- 404 Not Found
- 502 Bad Gateway or 504 Gateway Timeout
- The browser shows the ingress hostname, but times out
- Likely causes
- Ingress hostname or DNS is not configured to point at the ingress address
- TLS secret name mismatch on the Ingress
- Resolution
- Set deployment.hostname and, if using DNS, set deployment.useHostnameIngress=true
- Ensure DNS resolves to the ingress address
- If using an existing certificate, make sure deployment.tlsSecretName matches the secret bound to the Ingress, then run helm upgrade
- Alternatively, configure cert-manager via deployment.certManager.* values
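The checks above can be sketched as a small shell helper. This is a minimal sketch, assuming kubectl is on the PATH and already points at the right cluster; the function name check_ingress and its arguments are illustrative and not part of the chart.

```shell
# check_ingress <namespace> <tls-secret-name>   (hypothetical helper)
check_ingress() {
  ns="$1"
  secret="$2"
  # HOSTS should show the value of deployment.hostname;
  # ADDRESS is what your DNS record must resolve to.
  kubectl get ingress -n "$ns" -o wide
  # Returns NotFound if the secret named in deployment.tlsSecretName is missing.
  kubectl get secret "$secret" -n "$ns"
}
# Example: check_ingress fmeflow my-tls-secret
```

If the ADDRESS column is empty, the ingress controller has not yet assigned an address, so DNS has nothing to point at.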
Engines don’t scale from the Web UI
- Symptoms
- UI accepts changes, but the engine count in Kubernetes does not change
- No new engine Pods after saving in UI
- Likely cause
- In Kubernetes, engine definitions are controlled by the Helm values, not the Web UI
- Resolution
- In values.yaml, set fmeflow.engines[].engines and apply with helm upgrade
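As a rough illustration of that workflow, the sketch below applies an edited values file and then lists the pods so you can confirm the new engine count. It assumes helm and kubectl are on the PATH; the function name apply_engine_values and the release and namespace arguments are placeholders.

```shell
# apply_engine_values <release> <namespace> <values-file>   (hypothetical helper)
# Edit the engine count at fmeflow.engines[].engines in the values file first.
apply_engine_values() {
  release="$1"
  ns="$2"
  values="$3"
  # Helm, not the Web UI, is the source of truth for engine definitions.
  helm upgrade "$release" safesoftware/fmeflow -n "$ns" -f "$values"
  # New engine Pods should appear here once the upgrade is applied.
  kubectl get pods -n "$ns" -o wide
}
# Example: apply_engine_values fmeflow fmeflow values.yaml
```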
Wrong FME version pulled
- Symptoms
- ImagePull error: manifest for safesoftware/fmeflow:<tag> not found
- Likely cause
- Incorrect image tag
- Resolution
- Set fmeflow.image.tag to the intended release (for example, 2025.1) and run helm upgrade
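One way to apply and verify the tag from the command line is sketched below. It assumes helm and kubectl are on the PATH; the function names are illustrative. The sketch uses --set-string rather than --set because Helm may otherwise interpret a dotted tag such as 2025.1 as a number.

```shell
# set_image_tag <release> <namespace> <tag> <values-file>   (hypothetical helper)
set_image_tag() {
  release="$1"
  ns="$2"
  tag="$3"
  values="$4"
  # --set-string keeps the tag a string so it is passed through unchanged.
  helm upgrade "$release" safesoftware/fmeflow -n "$ns" -f "$values" \
    --set-string fmeflow.image.tag="$tag"
}

# running_images <namespace>: print each pod name and the image(s) it runs.
running_images() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
}
# Example: set_image_tag fmeflow fmeflow 2025.1 values.yaml && running_images fmeflow
```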
Helm repo/values not initialized
- Symptoms
- Error: repository name (safesoftware) not found
- Error: chart "fmeflow" matching not found in safesoftware index
- Likely cause
- Missing chart repo or not using a values file
- Resolution
- Add the Safe Software repo, then fetch the default values:
helm repo add safesoftware https://safesoftware.github.io/helm-charts
helm show values safesoftware/fmeflow > values.yaml
- Proceed with the installation or upgrade.
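If the chart is still not found after adding the repo, the local index may be stale. The sketch below adds a refresh and a search step before writing the defaults. It assumes Helm 3 on the PATH; the function name init_fmeflow_repo is illustrative.

```shell
# init_fmeflow_repo [values-file]   (hypothetical helper; defaults to values.yaml)
init_fmeflow_repo() {
  values_file="${1:-values.yaml}"
  helm repo add safesoftware https://safesoftware.github.io/helm-charts &&
  # Refresh the local index so the newest chart versions are visible.
  helm repo update &&
  # Confirm the fmeflow chart is now present in the index.
  helm search repo safesoftware/fmeflow &&
  # Write the chart defaults to a file you can edit and pass with -f.
  helm show values safesoftware/fmeflow > "$values_file"
}
# Example: init_fmeflow_repo values.yaml
```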
502/504 through Ingress or UI timeouts
- Symptoms
- 502 Bad Gateway from NGINX ingress
- 504 Gateway Timeout
- Upstream connect error or disconnect/reset before headers
- Likely causes
- Backend Service has no ready endpoints
- Readiness/liveness probes failing
- Resolution
- Run kubectl get endpoints <svc>; if the list is empty, fix the probes and ensure the Service selector matches the Pod labels
- Increase initialDelaySeconds and timeoutSeconds for slow starts
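To compare the Service selector with the Pod labels in one pass, something like the following sketch can help. It assumes kubectl is on the PATH; check_service_backends is an illustrative name.

```shell
# check_service_backends <service> <namespace>   (hypothetical helper)
check_service_backends() {
  svc="$1"
  ns="$2"
  # An empty ENDPOINTS column means no Pod is both selected and Ready.
  kubectl get endpoints "$svc" -n "$ns"
  # Print the label selector the Service uses to find its Pods.
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  echo
  # Compare the selector against the LABELS and READY columns.
  kubectl get pods -n "$ns" --show-labels
}
# Example: check_service_backends <svc> <ns>
```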
Pods stuck in Pending due to storage
- Symptoms
- pod has unbound immediate PersistentVolumeClaims
- The pods are deployed but are not running
- Error: PVC <name> is Pending
- Error: 0/3 nodes available: volume node affinity conflict
- Error: Warning FailedMount kubelet MountVolume.SetUp failed for volume <volume_name>
- Likely causes
- storageClassName mismatch or missing provisioner
- Access mode mismatch (RWO vs RWX)
- Capacity or quota exhausted
- Resolution
- If using AKS (Azure Kubernetes Service), use the correct storage class. The "Setup Shared Storage using Azure Files" section in this article provides more details on the process.
- Verify that the default PV is bound to the default PVC (both should report a status of Bound)
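A minimal sketch for gathering the storage facts in one place, assuming kubectl is on the PATH (inspect_storage is an illustrative name):

```shell
# inspect_storage <namespace>   (hypothetical helper)
inspect_storage() {
  ns="$1"
  # Available classes and their provisioners; compare with storageClassName.
  kubectl get storageclass
  # STATUS should be Bound; ACCESS MODES shows RWO or RWX.
  kubectl get pvc -n "$ns"
  # PersistentVolumes are cluster-scoped; CLAIM shows which PVC each is bound to.
  kubectl get pv
  # Recent events usually name the exact provisioning or mount failure.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
# Example: inspect_storage <ns>
```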
CrashLoopBackOff on startup
- Symptoms
- CrashLoopBackOff, Back-off restarting failed container
- Exit Code: 1 when you describe the pod (kubectl describe pod <pod> -n <ns>)
- Likely causes
- Misconfiguration, missing secrets
- Database or license server unreachable
- Init container failing
- Resolution
- Verify env and secret values, and network access to DB and license server
- Restarting all of the pods can also help, because some services occasionally start out of sequence
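The following sketch collects the usual evidence for a crash loop and, separately, restarts every pod in a namespace. It assumes kubectl is on the PATH and that the pods are managed by controllers that recreate them after deletion; both function names are illustrative.

```shell
# crash_report <pod> <namespace>   (hypothetical helper)
crash_report() {
  pod="$1"
  ns="$2"
  # Logs from the container instance that crashed, not the one now restarting.
  kubectl logs --previous "$pod" -n "$ns"
  # Init container states: a failing init container blocks the main container.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.status.initContainerStatuses[*].state}'
  echo
  # Namespace events sorted by time, which often name the failing dependency.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}

# restart_all_pods <namespace>: delete every pod so its controller recreates it.
# This interrupts running work, so use it deliberately.
restart_all_pods() {
  kubectl delete pods --all -n "$1"
}
```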
ImagePullBackOff
- Symptoms
- ImagePullBackOff, ErrImagePull
- Error: failed to authorize: authentication required
- Likely causes
- Bad image reference or tag
- Missing imagePullSecret
- Blocked egress to registry
- Resolution
- Verify repository, tag, and credentials
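To see exactly which image and pull secrets a failing pod is using, a sketch along these lines may be useful. It assumes kubectl is on the PATH; pull_report is an illustrative name.

```shell
# pull_report <pod> <namespace>   (hypothetical helper)
pull_report() {
  pod="$1"
  ns="$2"
  # The full image reference (registry/repository:tag) the pod is trying to pull.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].image}'
  echo
  # Names of the pull secrets attached to the pod; empty output means none.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.imagePullSecrets[*].name}'
  echo
  # The Events section shows the registry's actual error message.
  kubectl describe pod "$pod" -n "$ns"
}
# Example: pull_report <pod> <ns>
```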
OOMKilled or slow performance
- Symptoms
- Reason: OOMKilled in container status
- Memory cgroup out of memory in node logs
- High latency or queue growth in UI
- Likely cause
- Memory limits are too low, or workload spikes
- Resolution
- Increase container memory requests and limits
- Check node pressure and add capacity if needed
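Before raising limits, it helps to confirm the kill reason and compare configured resources with live usage. The sketch below assumes kubectl is on the PATH and that metrics-server is installed (required for kubectl top); oom_report is an illustrative name.

```shell
# oom_report <pod> <namespace>   (hypothetical helper)
oom_report() {
  pod="$1"
  ns="$2"
  # Why each container last terminated; look for OOMKilled.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
  # The requests and limits currently configured per container.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
  # Live usage per container and per node (needs metrics-server).
  kubectl top pod "$pod" -n "$ns" --containers
  kubectl top nodes
}
# Example: oom_report <pod> <ns>
```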
Startup/readiness probe 503
- Symptoms
- Readiness probe failed: HTTP 503
- Error: Liveness probe failed: connection refused
- Likely causes
- Slow first start
- Wrong health endpoint or port
- Resolution
- Increase initialDelaySeconds and timeoutSeconds
- Confirm probe path and port match the container’s health endpoint
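To read the probe settings that are actually in effect on a pod, a sketch like this can be used. It assumes kubectl is on the PATH; show_probes is an illustrative name.

```shell
# show_probes <pod> <namespace>   (hypothetical helper)
# Prints each container name with its readiness and liveness probe definitions,
# including path, port, initialDelaySeconds, and timeoutSeconds when set.
show_probes() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.readinessProbe}{"\t"}{.livenessProbe}{"\n"}{end}'
}
# Example: show_probes <pod> <ns>
```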
Database connection errors
- Symptoms
- Connection refused to <db-host>:<port>
- SSL: certificate verify failed
- Authentication failed for the user
- Likely causes
- Wrong host, port, credentials, or TLS
- Network blocks between the cluster and the database
- Resolution
- Update values and secrets
- From a Pod, test connectivity: nc -vz <db-host> <port>
- Open firewall or NetworkPolicy where needed
- If using external Postgres, follow Deploying with an External Database to cross-check for misconfigurations.
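The connectivity test can be wrapped so that it reports a clear result. This is a minimal sketch, assuming kubectl is on the PATH and that nc is available inside the chosen container image; db_reachable is an illustrative name.

```shell
# db_reachable <pod> <namespace> <db-host> <port>   (hypothetical helper)
db_reachable() {
  pod="$1"
  ns="$2"
  host="$3"
  port="$4"
  # Run the TCP check from inside the cluster so NetworkPolicy and egress rules apply.
  if kubectl exec "$pod" -n "$ns" -- nc -vz "$host" "$port"; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port (check firewall rules and NetworkPolicy)"
    return 1
  fi
}
# Example: db_reachable <pod> <ns> <db-host> <port>
```

A successful TCP connection only rules out network blocks; credential and TLS errors still need to be checked in the values and secrets.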
References and Additional Resources
- Performing the Kubernetes deployment, including scaling engines via values.yaml. Also see Defining FME Engines for Kubernetes.
- Deploying with a trusted certificate or cert-manager.
- Deploying with an External Database for higher availability.
- Deploying with an NFS Client Provisioner and Deploying across Multiple Hosts for multi-host or RWX storage.
Commands and Snippets
# Check endpoints for a Service
kubectl get endpoints <svc>

# View current and previous container logs
kubectl logs <pod> -n <ns>
kubectl logs --previous <pod> -n <ns>

# Upgrade with a values file
helm upgrade <release> safesoftware/fmeflow -f values.yaml

# Get a quick cluster-wide view of pod health
kubectl get pods -A -o wide

# Describe a specific pod to inspect events, probes, and container state
kubectl describe pod <pod> -n <ns>

# Stream logs (and previous-crash logs) from a container
kubectl logs -f <pod> -n <ns>
kubectl logs --previous <pod> -n <ns>

# Verify Services have ready endpoints
kubectl get svc,endpoints -n <ns>

# Exec into a running container for on-box checks (curl, nc, env)
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh