Troubleshooting FME Kubernetes Deployment

Overview

This article helps you diagnose and resolve common issues when deploying and operating FME Flow on Kubernetes. It lists the symptoms, likely causes, and resolutions for each issue. For full deployment guidance and other insights into the deployment process, please review this article.

Prerequisites

Before troubleshooting, confirm the following are in place. See Prerequisites and Considerations.

A working Kubernetes cluster with adequate CPU and memory
Helm installed and the Safe Software chart repo added (helm repo add safesoftware https://safesoftware.github.io/helm-charts)
An ingress controller (NGINX) available in the cluster
Network access to any external dependencies such as image registries, license server, and database

Can’t reach Web UI after install

Symptoms
- 404 Not Found
- 502 Bad Gateway or 504 Gateway Timeout
- The browser shows the ingress hostname, but times out
Likely causes
- Ingress hostname or DNS is not configured to point at the ingress address
- TLS secret name mismatch on the Ingress
Resolution
- Set deployment.hostname and, if using DNS, set deployment.useHostnameIngress=true
- Ensure DNS resolves to the ingress address
- If using an existing certificate, make sure deployment.tlsSecretName matches the secret bound to the Ingress, then run helm upgrade
- Alternatively, configure cert-manager via deployment.certManager.* values
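
To compare what the Ingress actually serves with the values above, a small helper along these lines may help. It is a sketch, not part of the FME Flow chart: the function name ingress_summary is hypothetical, and it assumes kubectl is already pointed at the right cluster and is run from a POSIX shell.

```shell
# Hypothetical helper: print "<ingress>  <hosts>  <tls secret names>" for every
# Ingress in a namespace, for comparison with deployment.hostname and
# deployment.tlsSecretName in values.yaml.
# Usage: ingress_summary <ns>
ingress_summary() {
  kubectl get ingress -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.rules[*].host}{"\t"}{.spec.tls[*].secretName}{"\n"}{end}'
}
```

If the host column is empty or the secret name differs from deployment.tlsSecretName, correct the values and run helm upgrade again.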

Engines don’t scale from the Web UI

Symptoms
- UI accepts changes, but the engine count in Kubernetes does not change
- No new engine Pods after saving in UI
Likely cause
- In Kubernetes, engine definitions are controlled by the Helm values, not the Web UI
Resolution
- In values.yaml, set fmeflow.engines[].engines and apply with helm upgrade
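
After editing the engine count in values.yaml, the change can be applied and checked in one step. The sketch below assumes Helm 3; the function name apply_values and its argument order are hypothetical.

```shell
# Hypothetical helper: apply a values file, then print the user-supplied
# values that Helm now records for the release, so the engine settings can be
# confirmed.
# Usage: apply_values <release> <ns> [values-file]
apply_values() {
  helm upgrade "$1" safesoftware/fmeflow -n "$2" -f "${3:-values.yaml}" &&
    helm get values "$1" -n "$2"
}
```

Running kubectl get pods -n <ns> -w afterwards shows the engine Pods being added or removed.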

Wrong FME version pulled

Symptoms
- ImagePull error: manifest for safesoftware/fmeflow:<tag> not found
Likely cause
- Incorrect image tag
Resolution
- Set fmeflow.image.tag to the intended release (for example, 2025.1) and run helm upgrade

Helm repo/values not initialized

Symptoms
- Error: repository name (safesoftware) not found
- Error: chart "fmeflow" matching not found in safesoftware index
Likely cause
- The chart repo has not been added, or no values file is in use
Resolution
- Add the Safe Software repo first, then fetch the default values
  
  helm repo add safesoftware https://safesoftware.github.io/helm-charts
  
  helm show values safesoftware/fmeflow >> values.yaml
Proceed with the installation or upgrade.

502/504 through Ingress or UI timeouts

Symptoms
- 502 Bad Gateway from NGINX ingress
- 504 Gateway Timeout
- Upstream connect error or disconnect/reset before headers
Likely causes
- Backend Service has no ready endpoints
- Readiness/liveness probes failing
Resolution
- Run kubectl get endpoints <svc> -n <ns>; if the list is empty, fix the probes and ensure the Service selector matches the Pod labels
- Increase initialDelaySeconds and timeoutSeconds for slow starts
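
The endpoint check can be scripted so that it fails loudly when a Service has nothing behind it. This is a minimal sketch: the function name has_ready_endpoints is hypothetical, and it relies only on kubectl get endpoints with a jsonpath output.

```shell
# Hypothetical helper: succeed only if the Service has at least one ready
# endpoint address. Returns 2 if kubectl itself fails, 1 if the list is empty.
# Usage: has_ready_endpoints <svc> <ns>
has_ready_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 2
  if [ -n "$ips" ]; then
    echo "ready: $ips"
  else
    echo "no ready endpoints for $1" >&2
    return 1
  fi
}
```

An empty result usually means that no Pod matching the Service selector is passing its readiness probe.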

Pods stuck in Pending due to storage

Symptoms
- pod has unbound immediate PersistentVolumeClaims
- The pods are deployed but are not running
- Error: PVC <name> is Pending
- Error: 0/3 nodes available: volume node affinity conflict
- Error: Warning FailedMount kubelet MountVolume.SetUp failed for volume <volume_name>
Likely causes
- storageClassName mismatch or missing provisioner
- Access mode mismatch (RWO vs RWX)
- Capacity or quota exhausted
Resolution
- If using AKS (Azure Kubernetes Service), use the correct storage class. The "Setup Shared Storage using Azure Files" section in this article provides more details on the process
- Confirm that the default PV is bound to the default PVC
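
To find the claims that are holding Pods in Pending, the PVC list can be filtered on its STATUS column. A sketch, assuming the default kubectl get pvc column layout (NAME, then STATUS); the function name unbound_pvcs is hypothetical.

```shell
# Hypothetical helper: print the name and status of every PVC in a namespace
# that is not Bound.
# Usage: unbound_pvcs <ns>
unbound_pvcs() {
  kubectl get pvc -n "$1" --no-headers | awk '$2 != "Bound" { print $1, $2 }'
}
```

For any claim it lists, kubectl describe pvc <name> -n <ns> shows the provisioning events, including storage class and access mode errors.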

CrashLoopBackOff on startup

Symptoms
- CrashLoopBackOff, Back-off restarting failed container
- Exit Code: 1 when you describe the pod (kubectl describe pod <pod> -n <ns>)
Likely causes
- Misconfiguration or missing secrets
- Database or license server unreachable
- Init container failing
Resolution
- Verify env and secret values, and network access to DB and license server
- Restarting all of the pods can also help, because some services occasionally do not start in the correct sequence
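
Before restarting anything, it is worth capturing why the container last exited. The sketch below combines two standard kubectl calls; the function name crash_report is hypothetical, and for Pods with more than one container the logs call may also need -c <container>.

```shell
# Hypothetical helper: print "<container>  <reason>  <exit code>" for the last
# termination of each container, then the last 50 log lines from the crashed
# instance.
# Usage: crash_report <pod> <ns>
crash_report() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  kubectl logs --previous --tail=50 "$1" -n "$2"
}
```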

ImagePullBackOff

Symptoms
- ImagePullBackOff, ErrImagePull
- Error: failed to authorize: authentication required
Likely causes
- Bad image reference or tag
- Missing imagePullSecret
- Blocked egress to registry
Resolution
- Verify repository, tag, and credentials
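
To see all three items in one place, the Pod spec and its events can be printed together. This is a sketch with a hypothetical function name (pull_report); it uses only kubectl get with a jsonpath output and an event field selector.

```shell
# Hypothetical helper: print the image reference(s) on the first line, the
# configured imagePullSecrets on the second, then the events recorded for
# the Pod (pull errors appear here).
# Usage: pull_report <pod> <ns>
pull_report() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.spec.containers[*].image}{"\n"}{.spec.imagePullSecrets[*].name}{"\n"}'
  kubectl get events -n "$2" --field-selector "involvedObject.name=$1"
}
```

An empty second line means the Pod has no imagePullSecret, which matters only if the registry requires authentication.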

OOMKilled or slow performance

Symptoms
- Reason: OOMKilled in container status
- Memory cgroup out of memory in node logs
- High latency or queue growth in UI
Likely cause
- Memory limits are too low, or workload spikes
Resolution
- Increase container memory requests and limits
- Check node pressure and add capacity if needed
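
To find which Pods have been OOM-killed before raising limits, the last termination reason can be read from each Pod's status. A minimal sketch; the function name oom_killed is hypothetical.

```shell
# Hypothetical helper: print the name of every Pod in a namespace with at
# least one container whose last termination reason was OOMKilled.
# Usage: oom_killed <ns>
oom_killed() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

If metrics-server is installed in the cluster, kubectl top pod -n <ns> shows current memory usage to help size the new limits.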

Startup/readiness probe 503

Symptoms
- Readiness probe failed: HTTP 503
- Error: Liveness probe failed: connection refused
Likely causes
- Slow first start
- Wrong health endpoint or port
Resolution
- Increase initialDelaySeconds and timeoutSeconds
- Confirm probe path and port match the container’s health endpoint.
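
The probe settings currently in effect can be read back from the running Pod. The sketch below assumes the readiness probe is an httpGet probe, which the HTTP 503 symptom suggests but does not guarantee; the function name probe_settings is hypothetical.

```shell
# Hypothetical helper: print, per container,
# "<name>  <path>  <port>  <initialDelaySeconds>  <timeoutSeconds>"
# for the readiness probe. Fields are empty if the probe is not httpGet.
# Usage: probe_settings <pod> <ns>
probe_settings() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.readinessProbe.httpGet.path}{"\t"}{.readinessProbe.httpGet.port}{"\t"}{.readinessProbe.initialDelaySeconds}{"\t"}{.readinessProbe.timeoutSeconds}{"\n"}{end}'
}
```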

Database connection errors

Symptoms
- Connection refused to <db-host>:<port>
- SSL: certificate verify failed
- Authentication failed for the user
Likely causes
- Wrong host, port, credentials, or TLS
- Network rules blocking traffic between the cluster and the database
Resolution
- Correct the database host, port, credentials, and TLS settings in your values and secrets
- From a Pod, test connectivity: nc -vz <db-host> <port>
- Open firewall or NetworkPolicy where needed
- If using external Postgres, follow Deploying with an External Database to cross-check for misconfigurations.
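
The connectivity test can be run without opening an interactive shell. This sketch wraps the nc check in kubectl exec; it assumes nc is available in the container image, and the function name db_reachable is hypothetical.

```shell
# Hypothetical helper: test a TCP connection to the database from inside a Pod.
# Assumes the container image ships nc.
# Usage: db_reachable <pod> <ns> <db-host> <port>
db_reachable() {
  kubectl exec "$1" -n "$2" -- nc -vz "$3" "$4"
}
```

A success here combined with a failure in FME Flow points to credentials or TLS settings rather than the network.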

References and Additional Resources

Performing the Kubernetes deployment, including scaling engines via values.yaml. Also see Defining FME Engines for Kubernetes.
Deploying with a trusted certificate or cert-manager.
Deploying with an External Database for higher availability.
Deploying with an NFS Client Provisioner and Deploying across Multiple Hosts for multi-host or RWX storage.

Commands and Snippets

# Check endpoints for a Service
kubectl get endpoints <svc> -n <ns>

# View current container logs
kubectl logs <pod> -n <ns>

# Upgrade with a values file
helm upgrade <release> safesoftware/fmeflow -f values.yaml

# Get a quick cluster-wide view of pod health
kubectl get pods -A -o wide

# Describe a specific pod to inspect events, probes, and container state
kubectl describe pod <pod> -n <ns>

# Stream logs (and previous-crash logs) from a container
kubectl logs -f <pod> -n <ns> 
kubectl logs --previous <pod> -n <ns> 

# Verify Services have ready endpoints
kubectl get svc,endpoints -n <ns>

# Exec into a running container for on-box checks (curl, nc, env)
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh
