FME Flow fault-tolerant setup: jobs stuck in queue after correcting engine/core configuration

Question

We are running 4 FME Flow 2025.1.2 environments (DTAP) and 1 FME Flow 2025.1.3 lab environment, all in a fault-tolerant setup with 2 cores and 2 engine hosts.

In our DTAP environments, we initially had an incorrect configuration:

processmonitorcoreconfig contained localhost instead of FQDNs for both cores
Each engine host was registered to a different core (processmonitorengineconfig)

This resulted in jobs mostly being picked up by a single engine host, depending on which core was leading.

In our lab environment, the setup is correct:

FQDNs are used
Both engine hosts are registered to a single core
Fault tolerance works as expected (both engines process jobs, and failover between cores works correctly)

We applied the same configuration changes to DTAP to match the lab setup. However, after registering both engine hosts to a single core, we observe the following issue:

Only a subset of jobs are processed correctly
Another subset remains stuck in the queue indefinitely (status = queued)
This happens consistently for part of the workload, not all jobs
Stuck jobs cannot be recovered via the UI and can only be removed by deleting them directly from the database

We suspect that these “stuck” jobs are being orchestrated by the non-leading core, which does not have an engine host registered.

One notable difference between lab and DTAP is in the fme_core_node database table:

In all environments, both cores are present
In DTAP, router_queuenodename and job_queuenodename contain localhost node for both cores
In lab, these fields contain FQDNs

We attempted to manually update these fields in DTAP to use FQDNs, but they are automatically overwritten back to localhost node.
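The lab-versus-DTAP comparison above can be made repeatable with a small check over exported rows. This is a minimal sketch, not an FME Flow tool: the two column names come from the question, while the function name, the row layout (one dict per row), and the sample values are hypothetical and only for illustration.

```python
# Column names are taken from the question; everything else is an assumption.
NODE_COLUMNS = ("router_queuenodename", "job_queuenodename")


def find_localhost_rows(rows, columns=NODE_COLUMNS):
    """Return (row_index, column, value) for every value that mentions localhost."""
    hits = []
    for index, row in enumerate(rows):
        for column in columns:
            value = str(row.get(column, ""))
            if "localhost" in value.lower():
                hits.append((index, column, value))
    return hits


# Hypothetical rows, as they might look after exporting fme_core_node to dicts.
sample_rows = [
    {"router_queuenodename": "localhost", "job_queuenodename": "localhost"},
    {"router_queuenodename": "core1.example.com", "job_queuenodename": "core1.example.com"},
]

for hit in find_localhost_rows(sample_rows):
    print(hit)
```

Running the same check against each environment's export would show at a glance which cores still report a localhost-based node name.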

Questions:

Could the localhost node values in fme_core_node cause jobs to be routed/orchestrated incorrectly and remain stuck in the queue?
Why would these values differ between environments while configuration files appear identical?
Is there a specific step or process required to ensure these fields are correctly set (e.g. during installation, migration, or clustering setup)?
Could this indicate that one of the cores is still partially configured with localhost internally?

Any guidance on where this routing behavior is controlled or how to correct it would be helpful.

lborgerscgi · Accepted Answer

I found a solution. The fme_queue_nodes table still contained a localhost entry. After deleting it and restarting all services, the correct FQDNs are shown in the queuenodename columns.

j.botterill · Answer

Do these files in C:\Program Files\FMEFlow\Server\ have the correct values?

fmeFlowConfig.txt: check that it has the correct config/value
processMonitorConfigCore.txt: check the NODE_OVERWRITE value

Safe’s Process Monitor docs say:

NODE_NAME: if not assigned, the node takes the host name of the system. [docs.safe.com]
NODE_HOST: “the host name of the system on which it is running.”

So two environments can have “identical looking” config yet resolve differently at runtime depending on:

OS hostname values
DNS suffix / domain join state
how the service account resolves the hostname
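To compare what each host actually reports, a short script can be run on every core and engine host, ideally under the same account as the FME Flow services. This is a minimal sketch using only Python's standard library. It shows how Python resolves the names, which is an assumption-laden stand-in and not necessarily how FME Flow resolves them internally. The function name is hypothetical.

```python
import socket


def describe_host_identity():
    """Collect the short host name and the FQDN as seen by the current account."""
    hostname = socket.gethostname()
    fqdn = socket.getfqdn()
    return {
        "hostname": hostname,
        "fqdn": fqdn,
        # A dot in the name suggests a DNS suffix was resolved.
        "fqdn_is_qualified": "." in fqdn,
        "mentions_localhost": "localhost" in hostname.lower()
        or "localhost" in fqdn.lower(),
    }


if __name__ == "__main__":
    for key, value in describe_host_identity().items():
        print(f"{key}: {value}")
```

If the output differs between lab and DTAP hosts, for example an unqualified name on one of them, that points at the OS or DNS level rather than the FME Flow configuration files.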

Restart the FME Flow services after any change.
