We are running 4 FME Flow 2025.1.2 environments (DTAP) and 1 FME Flow 2025.1.3 lab environment, all in a fault-tolerant setup with 2 cores and 2 engine hosts.
In our DTAP environments, we initially had an incorrect configuration:
- processmonitorcoreconfig contained localhost instead of FQDNs for both cores
- Each engine host was registered to a different core (processmonitorengineconfig)
This resulted in jobs mostly being picked up by a single engine host, depending on which core was leading.
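For reference, this is roughly how we scanned the process monitor configuration files for leftover localhost entries; the install path and the file-name pattern are from our environment and may need adjusting.

```python
from pathlib import Path

# Install path and file-name pattern are assumptions from our environment.
FME_SERVER_DIR = Path(r"C:\Program Files\FMEFlow\Server")

for cfg in FME_SERVER_DIR.glob("processMonitorConfig*.txt"):
    text = cfg.read_text(errors="ignore")
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Flag any non-comment line that still references localhost.
        if "localhost" in line.lower() and not line.lstrip().startswith("#"):
            print(f"{cfg.name}:{lineno}: {line.strip()}")
```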
In our lab environment, the setup is correct:
- FQDNs are used
- Both engine hosts are registered to a single core
- Fault tolerance works as expected (both engines process jobs, and failover between cores works correctly)
We applied the same configuration changes to DTAP to match the lab setup. However, after registering both engine hosts to a single core, we observe the following issue:
- Only a subset of jobs are processed correctly
- Another subset remains stuck in the queue indefinitely (status = queued)
- This happens consistently for part of the workload, not all jobs
- Stuck jobs cannot be recovered via the UI and can only be removed by deleting them directly from the database
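In case it helps with diagnosis, this is a sketch of how we pull the list of queued jobs via the REST API; the host name, token, and exact endpoint path are assumptions from our setup and may differ per FME Flow version.

```python
import requests

FLOW_URL = "https://fmeflow.example.com"  # placeholder host
TOKEN = "<api-token>"                     # token with permission to read jobs

# Endpoint path is our assumption for the v3 REST API; adjust if your version differs.
resp = requests.get(
    f"{FLOW_URL}/fmerest/v3/transformations/jobs/queued",
    headers={"Authorization": f"fmetoken token={TOKEN}",
             "Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
for job in resp.json().get("items", []):
    # Print enough detail to see which queue/engine each stuck job is waiting on.
    print(job.get("id"), job.get("status"), job.get("timeQueued"), job.get("engineName"))
```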
We suspect that these “stuck” jobs are being orchestrated by the non-leading core, which does not have an engine host registered.
One notable difference between lab and DTAP is in the fme_core_node database table:
- In all environments, both cores are present
- In DTAP, router_queue nodename and job_queue nodename contain localhost node for both cores
- In lab, these fields contain FQDNs
We attempted to manually update these fields in DTAP to use FQDNs, but they are automatically overwritten back to localhost node.
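For completeness, this is how we inspect the fme_core_node table; our repository database is PostgreSQL, and the connection details below are placeholders.

```python
import psycopg2

# Connection details are placeholders; our repository database is PostgreSQL.
conn = psycopg2.connect(host="fmedb.example.com", dbname="fmeflow",
                        user="fmeflow", password="<password>")
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM fme_core_node")
    cols = [d[0] for d in cur.description]
    for row in cur.fetchall():
        # In DTAP the router_queue / job_queue node-name columns show "localhost node".
        print(dict(zip(cols, row)))
```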
Questions:
- Could the localhost node values in fme_core_node cause jobs to be routed/orchestrated incorrectly and remain stuck in the queue?
- Why would these values differ between environments while the configuration files appear identical?
- Is there a specific step or process required to ensure these fields are correctly set (e.g. during installation, migration, or clustering setup)?
- Could this indicate that one of the cores is still partially configured with localhost internally?
Any guidance on where this routing behavior is controlled or how to correct it would be helpful.


