Question

FME Flow fault-tolerant setup: jobs stuck in queue after correcting engine/core configuration

  • April 28, 2026
  • 1 reply
  • 21 views

lborgerscgi
Observer

We are running 4 FME Flow 2025.1.2 environments (DTAP) and 1 FME Flow 2025.1.3 lab environment, all in a fault-tolerant setup with 2 cores and 2 engine hosts.

In our DTAP environments, we initially had an incorrect configuration:

  • processmonitorcoreconfig contained localhost instead of FQDNs for both cores
  • Each engine host was registered to a different core (processmonitorengineconfig)

This resulted in jobs mostly being picked up by a single engine host, depending on which core was leading.

In our lab environment, the setup is correct:

  • FQDNs are used
  • Both engine hosts are registered to a single core
  • Fault tolerance works as expected (both engines process jobs, and failover between cores works correctly)

We applied the same configuration changes to DTAP to match the lab setup. However, after registering both engine hosts to a single core, we observe the following issue:

  • Only a subset of jobs are processed correctly
  • Another subset remains stuck in the queue indefinitely (status = queued)
  • This happens consistently for part of the workload, not all jobs
  • Stuck jobs cannot be recovered via the UI and can only be removed by deleting them directly from the database

We suspect that these “stuck” jobs are being orchestrated by the non-leading core, which does not have an engine host registered.

One notable difference between lab and DTAP is in the fme_core_node database table:

  • In all environments, both cores are present
  • In DTAP, router_queuenodename and job_queuenodename contain localhost node for both cores
  • In lab, these fields contain FQDNs

We attempted to manually update these fields in DTAP to use FQDNs, but they are automatically overwritten back to localhost node.
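For troubleshooting, the values each core has actually registered can be read directly from the table mentioned above. A read-only query sketch, using only the table and column names described in this post (adjust schema and casing to your FME Flow database backend):

```sql
-- Compare the registered queue node names across cores; in a healthy
-- fault-tolerant setup these should be FQDNs, not localhost.
SELECT router_queuenodename, job_queuenodename
FROM fme_core_node;
```

Since manual updates are overwritten, this is useful only for confirming what the cores re-register, not for fixing it.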

Questions:

  • Could the localhost node values in fme_core_node cause jobs to be routed/orchestrated incorrectly and remain stuck in the queue?
  • Why would these values differ between environments while configuration files appear identical?
  • Is there a specific step or process required to ensure these fields are correctly set (e.g. during installation, migration, or clustering setup)?
  • Could this indicate that one of the cores is still partially configured with localhost internally?

Any guidance on where this routing behavior is controlled or how to correct it would be helpful.

1 reply

j.botterill
Influencer
  • April 29, 2026

Under C:\Program Files\FMEFlow\Server\, check:

  1. fmeFlowConfig.txt: does it have the correct config values?
  2. processMonitorConfigCore.txt: check the NODE_OVERWRITE value

Safe’s Process Monitor docs say:

  • NODE_NAME: if not assigned, the node takes the host name of the system. [docs.safe.com]
  • NODE_HOST: “the host name of the system on which it is running.”
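Per those directives, one way to rule out host-name resolution differences is to assign the node name explicitly instead of leaving it to default to the system host name. A sketch with illustrative values only (the exact directive placement varies by install, so verify against your own processMonitorConfigCore.txt):

```text
# processMonitorConfigCore.txt (illustrative FQDN values; verify directive
# names and placement against your installation before changing anything)
NODE_NAME=fmecore01.example.internal
NODE_HOST=fmecore01.example.internal
```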

So two environments can have identical-looking config files yet resolve node names differently at runtime depending on:

  • OS hostname values
  • DNS suffix / domain join state
  • how the service account resolves the hostname
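A quick way to see what “the host name of the system” resolves to on each machine is a small diagnostic script run under the same account as the FME Flow service (this is a generic sketch, not an FME tool; Python's socket module just asks the OS resolver):

```python
import socket

# Compare the short host name with the fully qualified domain name.
# If these differ between environments, or getfqdn() returns a
# non-qualified name, two hosts with identical config files can still
# register different node names at runtime.
short_name = socket.gethostname()
fqdn = socket.getfqdn()

print(f"gethostname(): {short_name}")
print(f"getfqdn():     {fqdn}")

if "." not in fqdn:
    print("Note: this host does not resolve to a fully qualified name; "
          "a service that defaults to the system host name would use the "
          "short name above.")
```

Run it on every core and engine host, under the service account, since DNS suffix handling can differ per account and per machine.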

Restart the FME Flow services after any change.