Question

FME Flow jobs stuck in queue after unexpected leader changes in database tables

  • June 1, 2026

lborgerscgi
Contributor

Hello,

We are still seeing an issue in our FME Flow 2025.1.2 fault-tolerant setup. In a previous forum post ("FME Flow fault-tolerant setup: jobs stuck in queue after correcting engine/core configuration"), we were able to solve part of the problem. However, jobs still occasionally get stuck in the queue.

When this happens, we inspect the database and see that some values appear to have been overwritten again. In the fme_core_node table, the router_queue_nodename and job_queue_nodename fields for one of the non-leader Core instances are changed so that they contain that Core's own FQDN. In fme_publisher_node, the "wrong" node is marked as leader = true, meaning a different leader than the one shown in fme_core_node. In some cases, we also see a different leader marked as true in fme_queue_node, although not always.
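To make the symptoms described above easier to spot, here is a minimal sketch of a consistency check we could run against rows exported from the three tables. The table names and the router_queue_nodename, job_queue_nodename and leader fields come from what we see in the database; the `nodename` key, the assumption that fme_core_node also exposes a `leader` flag, and the function names are hypothetical and only there for illustration. This is not an FME Flow API, just plain Python over rows that have already been fetched.

```python
from typing import Iterable, List, Mapping, Set

Row = Mapping[str, object]


def leaders(rows: Iterable[Row]) -> Set[object]:
    """Return the node names flagged as leader in one table's rows.

    Assumes each row has a hypothetical 'nodename' key and a boolean 'leader' key.
    """
    return {row["nodename"] for row in rows if row["leader"]}


def find_inconsistencies(
    core_rows: Iterable[Row],
    publisher_rows: Iterable[Row],
    queue_rows: Iterable[Row],
) -> List[str]:
    """Report the two symptoms from the post: leader flags that disagree
    between tables, and a non-leader Core pointing its queue fields at itself."""
    core_rows = list(core_rows)  # iterated twice below
    findings: List[str] = []
    core_leaders = leaders(core_rows)

    # Symptom 1: another table marks a different node as leader.
    for table, rows in (
        ("fme_publisher_node", publisher_rows),
        ("fme_queue_node", queue_rows),
    ):
        table_leaders = leaders(rows)
        if table_leaders != core_leaders:
            findings.append(
                f"{table} leader {sorted(map(str, table_leaders))} differs from "
                f"fme_core_node leader {sorted(map(str, core_leaders))}"
            )

    # Symptom 2: a non-leader Core has its own name in the queue fields.
    for row in core_rows:
        if row["leader"]:
            continue
        for column in ("router_queue_nodename", "job_queue_nodename"):
            if row[column] == row["nodename"]:
                findings.append(
                    f"non-leader {row['nodename']} has its own name in {column}"
                )
    return findings
```

An empty result would correspond to the state we see right after restarting the Cores. Running something like this on a schedule might at least tell us when the values change, which could then be correlated with network or database events.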

Restarting the Core(s) usually resolves the issue temporarily. After a restart, the expected values are present in the database again, jobs are picked up normally, and the system functions correctly. However, after a day or sometimes a few days, the issue reappears and the database values have been changed again.

One detail that may be relevant is that we only ever observe this behavior in the same two DTAP environments. Our other two environments, which have a comparable setup, do not exhibit this issue. This makes us wonder whether there is an environment-specific factor involved, such as connectivity, configuration, infrastructure, or something else affecting leader election and node coordination.

Our main question is: what process is responsible for updating these values? We suspect it may be related to connectivity or leader election, but we are not sure whether and how these fields are managed by FME Flow itself or whether something else could be overwriting them.

Has anyone seen similar behavior, or does anyone know under which circumstances these database values are updated?

2 replies

lborgerscgi
Contributor
  • Author
  • June 3, 2026

We are still interested in understanding the root cause, but we have identified a recovery procedure that consistently resolves the issue when jobs become stuck in the queue.

When this happens, we stop the Core service on the node that appears to be in the wrong state and wait until it disappears from the database. We then restart that Core. After that, we restart the Core that is currently acting as leader. Once both Core services have been restarted, the expected values are restored in the database and the queued jobs are picked up again.
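The order of the steps above matters, so here is a rough sketch of the sequence as a script. It is only an outline under assumptions: `stop_core`, `start_core` and `is_registered` are hypothetical callables that we would have to implement ourselves (for example with our own service manager commands and a database query), and the polling interval and timeout are arbitrary values, not anything documented by FME Flow.

```python
import time
from typing import Callable


def recover_stuck_queue(
    bad_node: str,
    leader_node: str,
    stop_core: Callable[[str], None],       # hypothetical: stops the Core service on a node
    start_core: Callable[[str], None],      # hypothetical: starts the Core service on a node
    is_registered: Callable[[str], bool],   # hypothetical: True while the node is still in the database
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Replay the manual recovery order: stop the Core in the wrong state,
    wait until it disappears from the database, start it again, then
    restart the Core that is currently acting as leader."""
    stop_core(bad_node)

    # Wait for the stopped Core to disappear from the database.
    waited = 0.0
    while is_registered(bad_node):
        if waited >= timeout_seconds:
            raise TimeoutError(f"{bad_node} is still registered after {timeout_seconds}s")
        sleep(poll_seconds)
        waited += poll_seconds

    start_core(bad_node)

    # Finally restart the current leader.
    stop_core(leader_node)
    start_core(leader_node)
```

Passing the three operations in as callables keeps the ordering logic separate from how the services are actually controlled in a given environment, and makes it possible to dry-run the sequence with stand-in functions first.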

We have also observed something that may help narrow down the cause. In our original configuration, both Engine Hosts were registered to the same Core. After changing the setup so that each Engine Host is registered to a different Core, we have not yet experienced jobs becoming stuck in the queue.

We are not sure whether this is related to the root cause or merely a coincidence, but we thought it was worth mentioning in case it helps with troubleshooting.


lborgerscgi
Contributor
  • Author
  • June 8, 2026

A small update after running with the modified configuration for a few days.

This morning, we noticed that several jobs remained in the queue even though there were idle engines available on one of the Engine Hosts.

Our current setup has one Engine Host registered to each Core. The jobs in question appeared to be orchestrated by one specific Core, while the engines associated with that Core's Engine Host were already fully occupied. At the same time, the engines on the other Engine Host remained idle and did not appear to pick up any of the queued jobs.

This makes us wonder whether the behavior is related to the way Engine Hosts are registered to Cores in a fault-tolerant setup. While registering one Engine Host to each Core seems to have prevented the original issue where jobs became permanently stuck in the queue, it now appears that workload may not be distributed across all available engines as we expected.

Has anyone seen similar behavior, or can anyone clarify how job distribution across Engine Hosts is expected to work in this scenario?