Question

Is it possible to know if a job (translation) has been launched by the Job Recovery process and what the cause was?


I have a job that processes a lot of data (an SQLite file of roughly 50 MB), and during execution the job gets automatically relaunched. The first job finishes without failure after the relaunch.

When I execute the same process with a smaller dataset, this doesn't happen. It is as if the recovery call to the REST API doesn't know that the job is still running.


4 replies


Hi @bbnj356,

 

FME Server will resubmit jobs under Job Recovery mode if the engine crashes during the translation. In this mode the job is resubmitted under the same Job ID; however, a different log is generated for each run. On the Jobs Completed page only the last job log is displayed, but you can find a history of the logs from the previous run(s) under Files & Connections > Resources > Logs > engine > current > jobs. Rather than a single file, you should see a folder for your Job ID, and inside it the job logs with a suffix _n, where _0 is the first run's log.
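If it helps to script this check, the folder layout described above can be inspected programmatically. A minimal sketch, assuming the per-job log files end in `_0.log`, `_1.log`, and so on (the exact naming, and the job ID in the example, are assumptions to verify against your own FME Server version):

```python
import re

def count_job_runs(log_filenames):
    """Count how many times a job ran, based on the log files in its
    per-job folder under Resources > Logs > engine > current > jobs.
    Assumes recovery runs produce logs suffixed _0, _1, ... (an
    assumption -- verify against your own FME Server version)."""
    return sum(1 for f in log_filenames if re.search(r"_\d+\.log$", f))

# Two matching logs would mean the job was resubmitted once:
print(count_job_runs(["job_51217_0.log", "job_51217_1.log"]))  # 2
```

Any count greater than 1 indicates Job Recovery resubmitted the job at least once.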

 

To troubleshoot why the engine is crashing during the job run, I'd recommend reviewing this job log alongside fmeprocessmonitorengine.log at the timestamp where the job log cuts off. That log file is located one folder up, in Files & Connections > Resources > Logs > engine > current.

 

Lastly, if you'd prefer that the job not be resubmitted, you can turn Job Recovery off by changing the FME Server configuration: https://docs.safe.com/fme/html/FME_Server_Documentation/AdminGuide/Job_Recovery.htm

Hi @hollyatsafe,

 

Thank you for the quick reply. I've checked the logs as suggested: I see no crashes, and the two jobs have different Job IDs. So based on your explanation, the Job Recovery mechanism isn't responsible for launching the second job. I've tried this several times on 2 different servers with the same result: a couple of minutes into job 1, a similar job with the same parameters is launched. Both jobs finish without errors or warnings. More strangely, this behaviour only happens when processing a large file using a Workspace App. When I run the workspace directly on the server for the same file, the second job isn't started; nor does it happen when I use a smaller file.

I reran the tests on 2 different servers; my current conclusions:

When running the process with the 50 MB file from the ServerApp, 2 jobs with different Job IDs but the same parameters get launched. The second job starts a couple of minutes after the first.

When using smaller files (6 MB and 21 MB), also from the ServerApp, only 1 job is started.

When starting the process directly on the server (without the ServerApp) for the 50 MB file, only 1 job is launched.

 

I can't find any indication of an error in the different log files.
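To make the comparison between the two runs systematic, one option is to pull the job list from the FME Server REST API and group records that share a workspace and parameter set. This is only a sketch: the v3 `transformations/jobs/completed` endpoint and `fmetoken` header follow the documented REST API, but the `workspace`, `parameters`, and `id` keys in the job records, the host name, and the token are all assumptions to check against your server's actual responses.

```python
import json
from collections import defaultdict

def find_duplicate_submissions(jobs):
    """Group job records by workspace and parameter values; any group
    with more than one Job ID is a candidate duplicate submission.
    The 'workspace', 'parameters', and 'id' keys are assumptions about
    the job JSON -- check the actual response from your server."""
    groups = defaultdict(list)
    for job in jobs:
        key = (job.get("workspace"),
               json.dumps(job.get("parameters"), sort_keys=True))
        groups[key].append(job.get("id"))
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

# Hypothetical query (host and token are placeholders):
# from urllib.request import Request, urlopen
# req = Request("https://myfmeserver/fmerest/v3/transformations/jobs/completed",
#               headers={"Authorization": "fmetoken token=MY_TOKEN",
#                        "Accept": "application/json"})
# jobs = json.load(urlopen(req)).get("items", [])
# print(find_duplicate_submissions(jobs))
```

Comparing the two jobs' submit times and source (app vs. direct run) in the same response may also show which client issued the second request.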


Hi @bbnj356,

This is a very curious problem. Do you have a load balancer in front of FME Server?

 

I am wondering if some kind of timeout is happening, where the load balancer loses its connection to FME Server after a certain duration and then resubmits the job request. That would fit with the job succeeding for smaller files and when run on FME Server directly.

Hi @hollyatsafe,

 

I'm pretty sure we don't have a load balancer in place, but I will check tomorrow. The weird thing is that this only occurs when using a Workspace App to submit the job. It is as if the Workspace App's run mechanism has 'forgotten' that it already launched the job, because when running the workspace directly there is only 1 job. Tomorrow I will also check whether the same behaviour occurs when I first put the file in a shared resources folder and select it there, instead of having it transferred to the server via drag and drop. That transfer takes quite some time (+30 sec).
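One quick, heuristic way to check for an intermediate proxy or load balancer is to look at the response headers the server returns: several proxies add identifying headers, though many strip them, so a clean result proves nothing. A sketch, with a placeholder host:

```python
def proxy_header_hints(headers):
    """Return response-header names that often indicate a reverse proxy
    or load balancer in the path. Heuristic only: absence of these
    headers does not prove there is no proxy."""
    hints = {"via", "x-forwarded-for", "x-cache", "x-amzn-trace-id"}
    return sorted(h for h in headers if h.lower() in hints)

# Hypothetical live check (host is a placeholder):
# import urllib.request
# resp = urllib.request.urlopen("https://myfmeserver/")
# print(proxy_header_hints(resp.headers.keys()))

print(proxy_header_hints({"Content-Type": "text/html", "Via": "1.1 proxy"}))  # ['Via']
```

Asking whoever administers the network is still the more reliable check.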

Reply