Question

FME Server: aborted job keeps running on engine



Whenever we cancel a running job (in our case: using the REST API), the job is cancelled successfully on the FME Server core (the job reports itself as aborted when we fetch the status), but the engine on which the job is running (on a different physical machine than the Server core) just keeps on going: the job simply runs until the end of the translation. This is not what I expected, but since I don't see any errors in the server logs e.g. about communication issues with the engine, I started to wonder.

Is this intended behaviour? If so, are there ways to actually stop the job process on the engine as well? If not, any suggestions on how to fix this?


17 replies


Hi Sander,

Cancelling the job should restart the engine, and the job should not finish, so what you are seeing is not expected. I quickly tested in a non-distributed environment and it worked, but we will need to test with the engine on a separate machine. Just to confirm your REST API call, are you doing something like this:

DELETE http://localhost/fmerest/v2/transformations/jobs/running/1221
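
For anyone reproducing this from a script, here is a minimal Python sketch of the DELETE call above. The function name `build_cancel_request` is hypothetical, and the `fmetoken` Authorization header assumes token-based authentication; adjust both to your own setup and check the REST API documentation for your FME Server version.

```python
import urllib.request


def build_cancel_request(base_url, job_id, token):
    """Build (but do not send) a DELETE request that cancels a running job.

    Assumption: the server accepts token auth via the 'fmetoken' scheme.
    """
    url = f"{base_url.rstrip('/')}/fmerest/v2/transformations/jobs/running/{job_id}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={
            "Authorization": f"fmetoken token={token}",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder host, job id and token, matching the example call above.
    req = build_cancel_request("http://localhost", 1221, "my-token")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

Building the request separately from sending it makes it easy to log exactly what goes to the server when comparing REST cancellation with the Web UI.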


Hi Again

I tested again with a distributed system where the FME Engine was on a separate machine and I still see the job getting cancelled properly and the engine being restarted. You might want to submit this to support if you don't mind sending your version information and log file directory.

Thanks

Hi Ken,

Thanks for the answers and good to know that it's not supposed to happen (which does make sense :)).

 

I did some more tests. The problem occurs only when cancelling through REST:

  • If I submit the job and cancel (delete) it using REST, the job starts and gets marked as aborted but isn't cancelled on the engine.
  • If I submit the job through REST and cancel it in the Web UI, the engine also cancels and restarts.

Maybe it's important to mention here that the job has been tagged (so it runs on a specific engine). If you have time, could you perhaps also try testing that?

Either way, I suppose I will submit a support case.

Thanks!


Hi Sander

I tried it by submitting via a REST POST to transformations/commands/submit/ with a tag to force a specific engine, and then deleting via REST. It still seems to work fine: the engine goes down and the job is not completed. I know this because a queued job is started. We'll look for this in support.
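
A rough Python sketch of that tagged submission, for anyone who wants to repeat the test. The function name `build_submit_request`, the repository and workspace path segments, and the request body shape (a `TMDirectives` object carrying `tag`) are assumptions based on the v2 REST API as I understand it; verify them against the API documentation for your version before relying on this.

```python
import json
import urllib.request


def build_submit_request(base_url, repository, workspace, tag, token):
    """Build (but do not send) a POST that submits a job routed by tag.

    Assumptions: token auth via the 'fmetoken' scheme, and a JSON body
    in which TMDirectives.tag selects the engine.
    """
    url = (
        f"{base_url.rstrip('/')}/fmerest/v2/transformations/commands/submit/"
        f"{repository}/{workspace}"
    )
    body = {
        "publishedParameters": [],        # no parameters in this sketch
        "TMDirectives": {"tag": tag},     # assumed routing directive
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"fmetoken token={token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder repository, workspace, tag and token.
    req = build_submit_request(
        "http://localhost", "Samples", "test.fmw", "engine-a", "my-token"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The job id returned by a real submission would then be the one passed to the DELETE call on transformations/jobs/running/.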

Thanks


This issue of cancelled jobs continuing to run in FME Server 2016.1 (and older) is observed when the workspace running on FME Server contains FMEServerJobSubmitter transformers that spawn child subprocesses.

In FME Server 2017.0, the underlying architecture of workflow management with FME Server Engines was changed. In testing with Build 17167, this issue no longer occurred.

This Knowledge Center article has been updated to reflect this information.

@sander_s - I would recommend testing this in FME Server 2017.0 Build 17167 (or newer). If the problem persists, please contact support (if you haven't already).

 

I'm not using the FMEServerJobSubmitter transformer, but thanks for the heads up!

 

As for the REST API: we haven't had any more problems with aborted jobs that keep on running. It may have been a configuration issue, because it stopped occurring after an upgrade.

 

@RylanAtSafe, do you know if this may also fix the issue we were having in C111837?

 

@larry - After reviewing ticket C111837, I do not think this fix is related to that issue. This resolution covers hanging fme.exe processes rather than abrupt crashes.

 

If you are still experiencing issues in your environment, I encourage you to contact support again, or reopen the existing thread for C111837.

 

Ok, thank you.

@RylanAtSafe, this issue is still happening on FME Server 2017.1 - Build 17539 - linux-x64.

A member of my team kicked off 2 concurrent processes running on individual engines using the REST API. I was able to cancel one of them through the Web UI, and its engine returned, but the other is still running. Its engine is also not showing on the Engines & Licensing page. He has also seen this behavior on previous job submissions. Thanks.

Hi @ggarza – I'm sorry to hear about this inconvenience. It's interesting that you cannot see the other FME Engine on the Engines & Licensing page of the Web Interface. Can you still see the Hosts listed?

 

Try upgrading your FMEServerJobSubmitter transformer to see if that helps (if the workflow was created in an older version of FME Workbench).

 

If you need to manually cancel the job, you can see the individual records in the fme_jobs table of the backend database.

 

 

In any event, this behaviour might warrant further investigation. Please consider opening a support ticket and providing the offending workspace(s), job logs, and FME Server logs (fmeserver.log, fmeprocessmonitorcore.log, fmeprocessmonitorengine.log).

 

@RylanAtSafe, yes, I can still see all the Hosts listed. We are not using the FMEServerJobSubmitter. I'll take a look at the backend database. Thanks.

 

@RylanAtSafe, I meant to say, I could see all the Hosts listed even though the Engine was missing. Once the process finished, the missing Engine reappeared on the Engines & Licensing page. Also, while the cancelled process was still running, I could not resubmit the job either by Web UI or REST call. However, my colleague was able to resubmit by pointing the job to a different output folder.

@RylanAtSafe, what did you mean by "if you need to manually cancel the job, you can see the individual records in the fme_jobs table of the backend database"? Do we need to change the job record using SQL? The reason I ask is that the FME support person who helped us get access to the database just told us not to mess with the database. How do we proceed when this happens again? Thanks.

 

@ggarza - I can't disagree. It's definitely a best practice not to mess with the database. I'm not exactly sure why this issue is occurring in the first place – please open a support ticket so that we can investigate further. (Referencing this Q&A thread will be helpful!)

 

*By my statement, I meant that if the job could not be cancelled via the FME Server Web Interface (i.e., the REST API), it could be manually removed (or "cancelled") by deleting the record from the fme_jobs table in the backend database.

 

I observed a similar issue today on my FME Server 2021.2 build. I cancelled the job in the FME Server web interface. I submitted case C670096, but I can't edit it at the moment to provide my log. Because our server is operational, I would like to know how to get the engine back as soon as possible so that it can process queued jobs. It seems that because it continues to process the cancelled job, my other jobs are running much slower than normal.

Hello @pinkautumn, sorry to hear you're having issues editing your case details. I think the easiest way to do this would be to respond to the case email you receive and include your log file. If you have issues replying to the support email, please do let us know! Best, Kailin Opaleychuk.
