Question

FME Server Fault Tolerance with Distributed Engine Hosts

  • 19 June 2020


We're in the middle of setting up fault tolerance in FME Server with distributed engine hosts, per the optional configuration in the following documentation:

https://knowledge.safe.com/articles/74845/introducing-the-new-20181-fault-tolerant-architect.html

I found the following unanswered question posted by @swedper, and I have similar concerns about how distributed engine hosts work:

https://knowledge.safe.com/questions/107724/engines-disappear-when-active-core-is-down-in-a-fa.html

We are planning to have two completely independent sets of core and engine hosts, each in its own data center - one primary and one secondary. We have a load balancer that will send all traffic to the primary core host unless the primary goes down, in which case the load balancer will send traffic to the secondary core host. Both core hosts will be configured to point to the load balancer URL. Given that, I have some questions about how FME Server works with this setup:

  1. Where is the queue itself stored? Core, database, file system? I would hope either the file system or db...
  2. What happens to running jobs in the primary if the primary goes down?
  3. If the primary core host can't see the secondary engines and vice versa, how do we keep engine queues aligned between environments?
    1. Corollary question - do we tell the engine hosts to point to the load balancer URL? If we did that, our current plan breaks down a bit...

Any advice here would be appreciated.


1 reply


Hi @richsnyder0,

I've done my best to answer your questions below. We may need a support call to discuss the different options fully, so please open a support case if you have any questions and we can walk through everything. Please note this advice came from the ever so wise @steveatsafe, so any further questions should be directed at him :).

 

"We are planning to have two completely independent sets of core and engine hosts, each in its own data center - one primary and one secondary." Putting each set in its own data center is not recommended if you are planning for fault tolerance. If you are simply duplicating 2 cores/2 engines and the two sets will not be interacting with each other, then it should be fine.

"We have a load balancer that will send all traffic to the primary core host, unless the primary goes down." This isn’t how our fault-tolerant architecture was designed. In our design, both hosts are active and requests should be distributed round-robin between them, but it should be fine if you want to do this.

"...in which case the load balancer will send traffic to the secondary core host." We have our load balancers configured to health check both cores and send requests to both. The load balancer will mark a core unhealthy and stop using it until it is healthy again.
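
The health-check behaviour described above can be pictured with a small Python sketch. This is only an illustrative model, not FME Server or load balancer configuration; the class `RoundRobinBalancer`, its methods, and the host names `core-a` and `core-b` are made up for the example.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotates over cores and skips unhealthy ones."""

    def __init__(self, cores):
        self.cores = list(cores)
        # Every core starts out healthy until a health check says otherwise.
        self.healthy = {core: True for core in self.cores}
        self._next = 0

    def record_health_check(self, core, ok):
        """Store the result of the latest health check for one core."""
        self.healthy[core] = ok

    def pick(self):
        """Return the next healthy core in round-robin order."""
        for _ in range(len(self.cores)):
            core = self.cores[self._next]
            self._next = (self._next + 1) % len(self.cores)
            if self.healthy[core]:
                return core
        raise RuntimeError("no healthy core available")


lb = RoundRobinBalancer(["core-a", "core-b"])
print([lb.pick() for _ in range(3)])   # ['core-a', 'core-b', 'core-a']

lb.record_health_check("core-a", ok=False)
print([lb.pick() for _ in range(2)])   # ['core-b', 'core-b']
```

Once `core-a` passes a health check again, calling `record_health_check("core-a", ok=True)` puts it back into the rotation.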

 

"Both core hosts will be configured to point to the load balancer URL. Given that, I have some questions about how FME Server works with this setup:"

  1. "Where is the queue itself stored? Core, database, file system? I would hope either the file system or db..." We have an in-memory queue that is saved to the database at intervals so it can be recovered if something goes offline (a system crash, etc.). The queue is replicated between the cores, and one core is the leader. The database is a backup of the in-memory queue.

  2. "What happens to running jobs in the primary if the primary goes down?" This is not so easy to answer. If the job is on a distributed engine, it will continue to run, and the engine will report to the core when done. Finding that core missing, the engine will try the next core and report the job done there. If the engine is on the core system, the job will most likely fail if you shut down the core, because you can’t shut down the core without shutting down the engine service. The job will be reported as failed and resubmitted to another engine.

  3. "If the primary core host can’t see the secondary engines and vice versa, how do we keep engine queues aligned between environments?" There is a misconception here. There is no such thing as a primary core host, and the installation should not be implemented to act like this. The cores should be equal (and in the same data centre). As mentioned earlier, the in-memory queue is replicated across all cores.

  4. "Corollary question - do we tell the engine hosts to point to the load balancer url? If we did that, our current plan breaks down a bit..." No. The core installations should be treated as a single installation, in distributed mode with an external database and system share. FME Server should be able to handle this. Install the second core (the same data centre is recommended) and treat it as part of a single installation that uses the same external database and system share. The cores will be aware of one another, and requests via the load balancer can be handled by both. [EDIT: Engine hosts... missed this in my first answer above. If you have 2 core hosts and 2 engine hosts, point one engine host to one core host, and the other engine host to the other core host. This is not necessary, but if one core fails, half the engines will not be impacted at all. Recommended, but again, not 100% necessary. Do not go through the load balancer.]
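
To illustrate the behaviour described in answer 2, here is a rough Python sketch of a distributed engine reporting a finished job to the first core that responds. It is a simplified model and not FME Server's actual implementation; `report_job_done`, the `send` callback, `make_fake_send`, and the host names are all hypothetical.

```python
def report_job_done(job_id, cores, send):
    """Report a finished job to the first reachable core.

    `send(core, job_id)` is a hypothetical callback that raises
    ConnectionError when the core cannot be reached.
    """
    for core in cores:
        try:
            send(core, job_id)
            return core  # this core accepted the result
        except ConnectionError:
            continue  # core is down, try the next one
    raise ConnectionError(f"no core reachable for job {job_id}")


def make_fake_send(down):
    """Build a stand-in for the network call; cores in `down` are offline."""
    def send(core, job_id):
        if core in down:
            raise ConnectionError(f"{core} is unreachable")
    return send


# core-a went down while the job was running, so the result goes to core-b.
accepted_by = report_job_done(42, ["core-a", "core-b"], make_fake_send({"core-a"}))
print(accepted_by)  # core-b
```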

"Any advice here would be appreciated."

  • I would recommend you register half the engines with one core and the other half with the other core.
  • I would recommend keeping everything in the same data centre. Design for disaster recovery in the second data centre by replicating the database and system share there, performing backups and restoring to the DR site when necessary. If you need a 99.9999 system then perhaps there’s another way to swing this, and I would contact Safe Software.
  • Distributed engines that are not on a core system will behave better and switch to the new core more efficiently. It is hard to separate the engines on the core systems from the core, and they will typically go down when the core goes down, because the cause is more often a system problem. But this is a personal choice.
  • If you are actually doing the optional configuration listed here, I strongly recommend that you not separate the FME Server Application Server from the FME Server Core. Keep it simple; there is no value in separating them. If possible, do not put these components on separate systems. The diagram was written on the heels of retiring the active-passive model of FME Server 2017 and older. The Application Server and Core have a lot of moving parts, and that number is growing all the time.
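
The first recommendation, splitting the engines between the cores, can be pictured with a short Python sketch. The function names and host names are invented for illustration, and this models only the bookkeeping, not how engines are actually registered in FME Server.

```python
def assign_engines(engine_hosts, core_hosts):
    """Spread engine hosts over core hosts in alternating order."""
    assignment = {core: [] for core in core_hosts}
    for i, engine in enumerate(engine_hosts):
        assignment[core_hosts[i % len(core_hosts)]].append(engine)
    return assignment


def unaffected_engines(assignment, failed_core):
    """List the engines registered with a core other than the failed one."""
    return [engine
            for core, engines in assignment.items()
            if core != failed_core
            for engine in engines]


plan = assign_engines(["engine-1", "engine-2", "engine-3", "engine-4"],
                      ["core-a", "core-b"])
print(plan)
# {'core-a': ['engine-1', 'engine-3'], 'core-b': ['engine-2', 'engine-4']}

print(unaffected_engines(plan, "core-a"))
# ['engine-2', 'engine-4']
```

With an even split, losing one core leaves half of the engines untouched, which is the point of the recommendation.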
