FME Flow jobs stuck in queue after unexpected leader changes in database tables

Question

Hello,

We are still seeing an issue in our FME Flow 2025.1.2 fault-tolerant setup. In a previous forum post, we were able to solve part of the problem: FME Flow fault-tolerant setup: jobs stuck in queue after correcting engine/core configuration | Community. However, jobs still occasionally get stuck in the queue.

When this happens, we inspect the database and see that some values appear to have been overwritten again. In the fme_core_node table, the router_queue_nodename and job_queue_nodename fields for one of the non-leading Core instances are changed, and that Core's own FQDN is written there. In fme_publisher_node, the "wrong" node is marked as leader = true, meaning a different leader than the one shown in fme_core_node. In some cases, we also see a different leader marked as true in fme_queue_node, although not always.

Restarting the Core(s) usually resolves the issue temporarily. After a restart, the expected values are present in the database again, jobs are picked up normally, and the system functions correctly. However, after a day or sometimes a few days, the issue reappears and the database values have been changed again.

One detail that may be relevant is that we only ever observe this behavior in the same two DTAP environments. Our other two environments, which have a comparable setup, do not exhibit this issue. This makes us wonder whether there is an environment-specific factor involved, such as connectivity, configuration, infrastructure, or something else affecting leader election and node coordination.

Our main question is: what process is responsible for updating these values? We suspect it may be related to connectivity or leader election, but we are not sure whether and how these fields are managed by FME Flow itself, or whether something else could be overwriting them.

Has anyone seen similar behavior, or does anyone know under which circumstances these database values are updated?
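For anyone who wants to catch the moment the values drift, the manual inspection described above could be scripted roughly as follows. This is only a sketch under assumptions: the table names, the router_queue_nodename and job_queue_nodename columns, and the leader flag come from this post, but the node-name column (`node_name` here) and the function name are hypothetical, and the real schema may differ. It also assumes that in a healthy state every Core row names the same node and each leader table flags exactly one leader. The rows are passed in as plain dictionaries, so fetching them from the database is left to whatever client is already in use.

```python
def leader_findings(core_rows, publisher_rows, queue_rows, name_col="node_name"):
    """Return a list of inconsistencies between the three node tables.

    Each *_rows argument is a list of dicts, one per table row.
    name_col is a HYPOTHETICAL column holding the node's FQDN in
    fme_publisher_node and fme_queue_node; adjust it to the real schema.
    """
    findings = []
    referenced = set()

    # Every Core row is assumed to point at the same node when healthy.
    for col in ("router_queue_nodename", "job_queue_nodename"):
        values = {row[col] for row in core_rows}
        referenced |= values
        if len(values) > 1:
            findings.append(
                f"fme_core_node.{col} differs between Cores: {sorted(values)}"
            )

    # Each leader table is assumed to flag exactly one leader.
    leaders = {}
    for table, rows in (
        ("fme_publisher_node", publisher_rows),
        ("fme_queue_node", queue_rows),
    ):
        leaders[table] = {row[name_col] for row in rows if row["leader"]}
        if len(leaders[table]) != 1:
            findings.append(
                f"{table} flags {len(leaders[table])} leaders: {sorted(leaders[table])}"
            )
        elif not leaders[table] <= referenced:
            findings.append(
                f"{table} leader {sorted(leaders[table])} is not named in fme_core_node"
            )

    # The two leader tables should agree with each other.
    if leaders["fme_publisher_node"] != leaders["fme_queue_node"]:
        findings.append("fme_publisher_node and fme_queue_node flag different leaders")

    return findings
```

Running a check like this on a schedule and logging the first non-empty result, together with a timestamp, might help correlate the change with network or infrastructure events in the two affected environments.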

lborgerscgi · Answer

A small update after running with the modified configuration for a few days.

This morning, we noticed that several jobs remained in the queue even though there were idle engines available on one of the Engine Hosts.

Our current setup has one Engine Host registered to each Core. The jobs in question appeared to be orchestrated by one specific Core, while the engines associated with that Core's Engine Host were already fully occupied. At the same time, the engines on the other Engine Host remained idle and did not appear to pick up any of the queued jobs.

This makes us wonder whether the behavior is related to the way Engine Hosts are registered to Cores in a fault-tolerant setup. While registering one Engine Host to each Core seems to have prevented the original issue where jobs became permanently stuck in the queue, it now appears that workload may not be distributed across all available engines as we expected.

Has anyone seen similar behavior, or can anyone clarify how job distribution across Engine Hosts is expected to work in this scenario?
