Solved

FME Flow fault-tolerant setup: jobs stuck in queue after correcting engine/core configuration

  • April 28, 2026

lborgerscgi
Participant

We are running 4 FME Flow 2025.1.2 environments (DTAP) and 1 FME Flow 2025.1.3 lab environment, all in a fault-tolerant setup with 2 cores and 2 engine hosts.

In our DTAP environments, we initially had an incorrect configuration:

  • processmonitorcoreconfig contained localhost instead of FQDNs for both cores
  • Each engine host was registered to a different core (processmonitorengineconfig)

This resulted in jobs mostly being picked up by a single engine host, depending on which core was leading.

In our lab environment, the setup is correct:

  • FQDNs are used
  • Both engine hosts are registered to a single core
  • Fault tolerance works as expected (both engines process jobs, and failover between cores works correctly)

We applied the same configuration changes to DTAP to match the lab setup. However, after registering both engine hosts to a single core, we observe the following issue:

  • Only a subset of jobs is processed correctly
  • Another subset remains stuck in the queue indefinitely (status = queued)
  • This happens consistently for part of the workload, not all jobs
  • Stuck jobs cannot be recovered via the UI and can only be removed by deleting them directly from the database

We suspect that these “stuck” jobs are being orchestrated by the non-leading core, which does not have an engine host registered.

One notable difference between lab and DTAP is in the fme_core_node database table:

  • In all environments, both cores are present
  • In DTAP, router_queuenodename and job_queuenodename contain localhost node for both cores
  • In lab, these fields contain FQDNs

We attempted to manually update these fields in DTAP to use FQDNs, but they are automatically overwritten back to localhost node.

Questions:

  • Could the localhost node values in fme_core_node cause jobs to be routed/orchestrated incorrectly and remain stuck in the queue?
  • Why would these values differ between environments while configuration files appear identical?
  • Is there a specific step or process required to ensure these fields are correctly set (e.g. during installation, migration, or clustering setup)?
  • Could this indicate that one of the cores is still partially configured with localhost internally?

Any guidance on where this routing behavior is controlled or how to correct it would be helpful.

3 replies

j.botterill
Influencer
  • April 29, 2026

In C:\Program Files\FMEFlow\Server\, check the following:

  1. Does fmeFlowConfig.txt have the correct config/value?
  2. In processMonitorConfigCore.txt, what is the NODE_OVERWRITE value?

Safe’s Process Monitor docs say:

  • NODE_NAME: if not assigned, the node takes the host name of the system. [docs.safe.com]
  • NODE_HOST: “the host name of the system on which it is running.”

So two environments can have “identical looking” config yet resolve differently at runtime depending on:

  • OS hostname values
  • DNS suffix / domain join state
  • how the service account resolves the hostname
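
To compare what each host actually resolves at runtime, a small script like the one below can be run on every core and engine host, ideally under the same account as the FME Flow services. It is a minimal sketch using only Python's standard `socket` module; the function name `describe_host_identity` is made up for illustration, and it says nothing about how FME Flow itself derives node names.

```python
import socket


def describe_host_identity():
    """Collect the host names the OS reports for this machine."""
    short_name = socket.gethostname()   # what the OS calls this host
    fqdn = socket.getfqdn()             # name after resolver / DNS suffix lookup
    try:
        address = socket.gethostbyname(short_name)
    except OSError:
        address = None                  # the name could not be resolved
    return {
        "hostname": short_name,
        "fqdn": fqdn,
        "address": address,
        # A loopback address or a 'localhost' FQDN hints that something on
        # this host maps the machine name to localhost.
        "looks_like_localhost": (
            fqdn.lower().startswith("localhost")
            or (address is not None and address.startswith("127."))
        ),
    }


if __name__ == "__main__":
    for key, value in describe_host_identity().items():
        print(f"{key}: {value}")
```

If the output differs between two environments with identical config files, the difference is likely in the OS or DNS setup rather than in FME Flow.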

Restart the FME Flow services after any change.


lborgerscgi
Participant
  • Author
  • April 29, 2026

Following the suggestions, we performed a series of checks and controlled tests to isolate the issue:

  • Configuration
    • Verified fmeFlowConfig.txt and processMonitorConfigCore.txt (and all other config files) across environments
    • NODE_NAME and NODE_HOST are explicitly set to FQDNs
    • No significant differences between LAB and DTAP
  • NODE_OVERWRITE
    • Tested with both true and false
    • No impact on behavior in either environment
  • Hostname / DNS resolution
    • Checked via hostname, nslookup, and PowerShell → all return correct FQDNs
    • No differences between LAB and DTAP
    • Hosts files are identical and do not map core hostnames to 127.0.0.1
  • Service account context
    • Verified hostname resolution under the service account context → also resolves to FQDN
    • No differences with LAB
  • Java / network
    • -Djava.net.preferIPv4Stack=true is already set in both environments
  • Runtime behavior (logs)
    • Logs confirm:
      • NODE_NAME is detected as FQDN
      • Utility engine registers using FQDN
  • Controlled startup tests

    • Stopped all services across all nodes
    • Waited until fme_core_node was empty
    • Started a single core in isolation

    Observations:

    • Node registers correctly using FQDN
    • router_queuenodename / job_queuenodename are initially empty
    • Shortly after, they are automatically set to localhost node
    • Manual updates to these fields are immediately overwritten
  • Log comparison (LAB vs DTAP)
    • Startup logs are largely identical
    • Could not identify a clear difference explaining the behavior
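
The hosts-file check from the list above can be repeated on each node with a short script. This is a sketch, not part of FME Flow: `find_loopback_mappings` is a hypothetical helper, and the host names in the sample are placeholders.

```python
def find_loopback_mappings(hosts_text, names):
    """Return the names from `names` that a hosts file maps to a loopback address."""
    wanted = {name.lower() for name in names}
    hits = set()
    for line in hosts_text.splitlines():
        entry = line.split("#", 1)[0].split()   # drop comments, split on whitespace
        if len(entry) < 2:
            continue                            # blank line or address without names
        address, aliases = entry[0], entry[1:]
        if address.startswith("127.") or address == "::1":
            hits.update(alias.lower() for alias in aliases if alias.lower() in wanted)
    return sorted(hits)


# Placeholder content; on a real node, read the hosts file instead.
sample = """
127.0.0.1   localhost
10.0.0.11   core1.example.org core1
127.0.0.1   core2.example.org   # stale entry
"""
print(find_loopback_mappings(sample, ["core1.example.org", "core2.example.org"]))
# prints ['core2.example.org']
```

On Windows the hosts file is normally at C:\Windows\System32\drivers\etc\hosts; pass its contents as `hosts_text` and the core FQDNs as `names`.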

Current understanding

  • Node identity (NODE_NAME / NODE_HOST) is correctly resolved and used
  • The issue appears isolated to queue/router node naming, where localhost node is assigned after initial registration
  • This behavior only occurs in DTAP; LAB correctly uses FQDN values for these fields

At this point it seems that:

  • The queue/router component determines its node name separately from NODE_NAME
  • In DTAP this step consistently resolves to localhost node, despite correct configuration and hostname resolution

Any insight into how router_queuenodename / job_queuenodename are derived, or what could cause them to fall back to localhost node, would be appreciated.


lborgerscgi
Participant
  • Author
  • Best Answer
  • May 1, 2026

I found a solution. In fme_queue_nodes there was still a localhost entry. After deleting it and restarting all services, the correct FQDNs are shown in the queuenodename columns.
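
For anyone wanting to check for leftover entries before deleting anything, here is a rough sketch. It assumes the table contents have already been exported to Python dictionaries (for example through a database driver or a CSV export). The column layout of fme_queue_nodes is not given in this thread, so the code scans every string value instead of a named column; the function name and the sample rows are hypothetical.

```python
def rows_mentioning_localhost(rows):
    """Return (row_index, column, value) for every string value containing 'localhost'."""
    findings = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if isinstance(value, str) and "localhost" in value.lower():
                findings.append((index, column, value))
    return findings


# Hypothetical rows; the real column names of fme_queue_nodes may differ.
exported = [
    {"node_name": "core1.example.org", "host": "core1.example.org"},
    {"node_name": "localhost", "host": "localhost"},
]
for index, column, value in rows_mentioning_localhost(exported):
    print(f"row {index}: {column} = {value}")
```

The same scan works on rows exported from fme_core_node, where router_queuenodename and job_queuenodename are the columns to watch.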