Hello,
We are still seeing an issue in our FME Flow 2025.1.2 fault-tolerant setup. In a previous forum post, "FME Flow fault-tolerant setup: jobs stuck in queue after correcting engine/core configuration", we were able to solve part of the problem. However, jobs still occasionally get stuck in the queue.
When this happens, we inspect the database and see that some values appear to have been overwritten again. In the fme_core_node table, the router_queue_nodename and job_queue_nodename fields for one of the non-leading Core instances are changed so that they contain that Core’s own FQDN. In fme_publisher_node, the “wrong” node is marked as leader = true, that is, a different leader than the one shown in fme_core_node. In some cases, though not always, we also see a different leader marked as true in fme_queue_node.
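To make the inconsistency easier to spot, the manual check described above can be sketched roughly as follows in Python. This is only an illustration and not an FME Flow API: it works on rows that have already been read from the three tables, and the key names "nodename" and "leader" are assumptions about how the node name and leader flag are exposed (only router_queue_nodename, job_queue_nodename and the leader flag in fme_publisher_node are taken from what we actually see).

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, object]


def leader_names(rows: Iterable[Row],
                 name_key: str = "nodename",    # assumed column name
                 leader_key: str = "leader") -> Set[str]:
    """Return the names of all nodes flagged as leader in one table."""
    return {str(row[name_key]) for row in rows if row.get(leader_key)}


def leader_disagreement(tables: Dict[str, List[Row]]) -> Dict[str, Set[str]]:
    """Return the leader set per table if the tables disagree, else {}."""
    per_table = {table: leader_names(rows) for table, rows in tables.items()}
    distinct = {frozenset(names) for names in per_table.values()}
    return per_table if len(distinct) > 1 else {}


def self_pointing_cores(core_rows: Iterable[Row],
                        name_key: str = "nodename") -> List[str]:
    """List non-leading Cores whose queue fields contain their own name."""
    flagged = []
    for row in core_rows:
        if row.get("leader"):
            continue  # only non-leading Cores show the symptom
        own = row[name_key]
        if own in (row.get("router_queue_nodename"),
                   row.get("job_queue_nodename")):
            flagged.append(str(own))
    return flagged
```

Running something like this on a schedule and logging the result with a timestamp would at least narrow down when the values change, which could then be correlated with the Core logs.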
Restarting the Core(s) usually resolves the issue temporarily. After a restart, the expected values are present in the database again, jobs are picked up normally, and the system functions correctly. However, after a day or sometimes a few days, the issue reappears and the database values have been changed again.
One detail that may be relevant is that we only ever observe this behavior in the same two DTAP environments. Our other two environments, which have a comparable setup, do not exhibit this issue. This makes us wonder whether there is an environment-specific factor involved, such as connectivity, configuration, infrastructure, or something else affecting leader election and node coordination.
Our main question is: what process is responsible for updating these values? We suspect it may be related to connectivity or leader election, but we are not sure whether and how these fields are managed by FME Flow itself or whether something else could be overwriting them.
Has anyone seen similar behavior, or does anyone know under which circumstances these database values are updated?
