Question

Issue with FME Server Engine processes shutting down on FME Server Docker Setup

  • 27 September 2018
  • 7 replies
  • 41 views

I have been trying to stand up an FME Server Docker environment to migrate our current FME Server setup to. I am using the 2018.1.0-20180801 image builds, which include an Nginx container to handle SSL. I have also converted the provided docker compose yaml file to a HashiCorp Terraform configuration (as we use this technology for provisioning/orchestration). We have also made a few modifications to the FME Engine container so that a persistent Kerberos ticket is created on the container for authentication to our MS SQL Server databases. The issue I am seeing, with both our modified FME Engine image and the default FME Engine image, is that any active engine running on one of the FME Engine containers will eventually shut down. The FME Engine container remains registered as an engine host with the FME Core, but no active engines are running on it (even though 1+ are requested). I can restart the engine service on the container, and it will appear as an active engine for a while before shutting down again. Looking into the logs, I see a message that says:

Could not read from socket; connection may have been lost Program Terminating

Translation FAILED.

Although, there have been times when this message is not even present in the log when the active engine process shuts down. I have compared the ports/processes running on the problem setup versus one created from Safe's provided docker compose yaml configuration, and I do not see any major differences. Could anyone provide suggestions as to where I should look to diagnose or resolve the issue? I am pretty sure it is something to do with the setup I have created, as I do not see the same issue when I stand up an environment from Safe's provided docker compose yaml configuration.


7 replies


Hi @rsmith,

 

When you say it will appear active for a while, does that mean that you can run jobs successfully on the engine during that time? Or does it shut down before you can even submit a job? Regarding the number of engines, it is recommended to scale with additional engine containers in a docker deployment instead of adding engines via the FME Server Web UI.
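
For illustration, scaling at the container level would look roughly like this (the service and stack names below are placeholders for whatever they are called in your compose file):

# compose-based deployment
$ docker-compose up -d --scale <engine-service>=3

# swarm stack deployment
$ docker service scale <stack-name>_<engine-service>=3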

Also, in general, the engines are expected to restart after a certain number of failed/successful jobs or after being idle for a certain amount of time. The settings can be found in the "Limits" section of this FME Server configuration file. It might be helpful to check whether the engines shut down because they hit one of those limits (you can change the default values for testing). This could help narrow down the reasons why they might not restart.
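
One quick way to see whether an engine is restarting because of a limit (or simply dying) is to watch the engine container's output around the time it disappears, along these lines (the container name is a placeholder):

# find the engine container, then follow its recent output
$ docker ps --filter "name=engine"
$ docker logs -f --tail 100 <engine-container>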

I am not too familiar with Terraform, but I understand that it can be used to deploy the containers across multiple hosts/VMs. If this is the case, you might also want to make sure communication is working properly through these ports used by the core.
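
As a rough connectivity check, you could test the core ports from inside an engine container with something like the following (host, port, and container names are placeholders, and nc may not be present in the image):

# test a single TCP port from inside the engine container
$ docker exec <engine-container> nc -zv <core-host> <core-port>

# if nc is not available, bash's built-in /dev/tcp redirection works too
$ docker exec <engine-container> bash -c 'echo > /dev/tcp/<core-host>/<core-port> && echo open'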

Since your Terraform deployment is not something we test internally, it is tricky to resolve this easily, but I hope my suggestions give you some ideas on how to troubleshoot it. Please let us know how this goes.


Alright, after a lot of testing I have narrowed the issue down to a particular case. It appears that running the FME Server environment as a Swarm cluster is causing the issue I noted above. I can get the environment going, have the engine licensed (so it shows a single active engine for the engine host), run a sample workspace just to be sure it is working, and then walk away for an hour or so. When I come back and re-run the sample workspace it fails to run; looking at the active engines, there are none (and the submitted workspace is stuck in the queue). I have attached the docker compose file I am using to spin up the environment (it is the one Safe provided, with a few modifications for switching from bridge networks to overlay networks). Hopefully this provides some better information and a reproducible problem.

 

 

docker-compose-safeyaml.zip

 

 

I am using the following command to deploy this compose file to a Swarm host. 

 

 

$ docker stack deploy --compose-file docker-compose-safe.yaml test
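
For anyone trying to reproduce this, the state of the stack can be checked with the usual swarm commands (the engine service name is whatever docker derives from the stack name and the compose service name):

# list the stack's services and how many replicas are running
$ docker stack services test

# show the tasks/containers and follow the output for the engine service
$ docker service ps test_<engine-service>
$ docker service logs -f test_<engine-service>
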
@rsmith thanks for sharing this. I'll give it a go and run it by our developers once reproduced, to see if we have options to improve this.

 

I am able to reproduce the issue with the engine disappearing when deploying your modified compose file to a single swarm host, but I could not see the error message that you previously reported. In which log file did you see the message?

 

 


 

 

@GerhardAtSafe that entry came from the current engine log. Although, I am pretty sure it does not have much to do with the actual issue, as more often than not there is now no error in the log at all when the engine quits.

 


 

@rsmith thanks for confirming this. I'll let you know once we get to the bottom of this.

 

 

Also, in case you plan to deploy across multiple hosts, check out this Q&A if you haven't yet.

 

 

We also provide a quick start script to deploy the same containers with k8s, which is still in tech preview but might be preferred over a docker swarm deployment in some cases: FME Server Container deployments

 


Hi @rsmith,

 

It seems like this is a known issue in docker swarm that one of our developers already reported: https://github.com/moby/moby/issues/33685

The issue is that long-running socket connections in docker swarm get closed after a period of inactivity, which applies to our engine connection. Once the engine is idle for around 15 minutes, the socket connection is terminated and the engine is no longer connected until it is restarted.

A workaround is to periodically send messages over the socket to keep the connection alive. We implemented tcp_keepalive on our socket connections, but the frequency is controlled by a kernel setting and defaults to 7200 seconds, which is too long for the docker timeout. This setting needs to be changed on the actual server that the containers are running on. Here is a support article from docker with more details: https://success.docker.com/article/ipvs-connection-timeout-issue

It also looks like docker is trying to fix this by allowing sysctl options to be set on services themselves, which would remove the need for setting this on the actual server: https://github.com/moby/moby/pull/37701
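
As a rough sketch of the workaround described in the docker support article (run on the swarm host itself, not inside a container; the 600-second value comes from that article, so double-check it against your environment):

# check the current keepalive interval (defaults to 7200 seconds)
$ sysctl net.ipv4.tcp_keepalive_time

# lower it below the IPVS idle timeout (around 900 seconds)
$ sudo sysctl -w net.ipv4.tcp_keepalive_time=600

# persist the setting across reboots
$ echo "net.ipv4.tcp_keepalive_time = 600" | sudo tee -a /etc/sysctl.conf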

 

I hope this helps you to resolve this issue.
