Question

Parallel processing for multiple Workspace Runners

  • 7 September 2021

Good morning to everyone!

My goal is to build a Workspace (father) that divides the zipped files located in a path into N equal subsets, then passes each group to a WorkspaceRunner that calls another dedicated Workspace (son) to do the job I need.

The question is: how can I parallelize this process?

I mean, I have built the first Workspace (father) and it works quite well, but unfortunately I can process only a single subset at a time (first son Workspace). When the first one finishes, the process passes to the second flow, and so on until the end.

I would like to start all the WorkspaceRunners simultaneously. Is it possible? How? Every Workspace (son) will write to a different Geodatabase.

Thank you very much to everyone who will answer.

 

P.S.

FME v 2020.0 (Data Interoperability 2.6)

I have uploaded a picture of the Father Workspace.


4 replies


Use one WorkspaceRunner, not 4, set "Wait for Job to Complete" = "No" and set "Maximum Concurrent FME Processes" higher than 1.
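For readers who want to see the pattern outside Workbench, here is a minimal Python sketch of the same idea: submit every job without waiting on the previous one, and cap how many processes run at once. It is only an illustration of the pattern, not how WorkspaceRunner works internally. The jobs here are stand-in Python one-liners (an assumption for the example); in a real setup each command would launch the son Workspace for one group.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_job(command: List[str]) -> int:
    """Run one child process and return its exit code."""
    return subprocess.run(command, check=False).returncode


def run_all(commands: List[List[str]], max_concurrent: int) -> List[int]:
    """Start every command, keeping at most max_concurrent running at once."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() submits all jobs up front and returns results in input order.
        return list(pool.map(run_job, commands))


# Stand-in jobs: each one sleeps briefly and exits normally.
# A real job would be the command line that runs the son Workspace.
jobs = [[sys.executable, "-c", "import time; time.sleep(0.2)"] for _ in range(4)]
exit_codes = run_all(jobs, max_concurrent=4)
print(exit_codes)
```

With `max_concurrent=1` the same code degrades to the sequential behaviour described in the question, which makes it easy to compare the two.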

Hi Niels,

thank you for the answer.

I tried this with a previous version of the father Workspace, without results. At the beginning I just needed to merge some feature classes and tables coming from multiple Geodatabases (several hundred) into a single Geodatabase, and I had just a single WorkspaceRunner. I set the parameters as you suggest to enable multiple processes, but it didn't work. Probably having multiple processes writing to the same Geodatabase isn't allowed: the initial Workspace always ended after the first GDB was processed.

Right now I need to read hundreds of zipped GDB files, split them into groups of equal size and write each group (after some processing done in the son Workspace) to its own Geodatabase.

For example: I have 200 zipped GDBs in a directory.

In the father Workspace I need to:

  1. Copy and rename the zips to another path (without renaming them, the workspace stops with errors because of the name length);
  2. Classify and divide the zips into groups of fixed size. If I choose 50, that gives 4 groups;
  3. Pass each group to a WorkspaceRunner that calls a dedicated dynamic Workspace, which will unzip, process, merge the FCs/tables and write all the entities to a dedicated GDB;
  4. The output is N GDBs that will be used for another ETL.
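The grouping in step 2 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, done outside FME; the file names `gdb_000.zip` ... `gdb_199.zip` and the function name `split_into_groups` are made up for the example.

```python
from pathlib import Path
from typing import List, Sequence


def split_into_groups(paths: Sequence[Path], group_size: int) -> List[List[Path]]:
    """Split paths into consecutive groups of at most group_size items."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    ordered = sorted(paths)  # stable order, so the grouping is reproducible
    return [list(ordered[i:i + group_size]) for i in range(0, len(ordered), group_size)]


# Hypothetical input: 200 zipped geodatabases.
zips = [Path(f"gdb_{n:03d}.zip") for n in range(200)]
groups = split_into_groups(zips, 50)
print(len(groups), [len(g) for g in groups])
```

If the number of zips is not a multiple of the group size, the last group is simply smaller than the others.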

I'm afraid that if I used a single WorkspaceRunner (moving the group division part and all 4 writers into the son Workspace), I would have the same issue that I had with the first version.

Thanks again

Ah, I now better understand your challenge.

I believe your issue now is that the flow in Workbench is always sequential, not parallel, so the second WorkspaceRunner will only run when the jobs of the first WorkspaceRunner are finished.

I must admit I don't see how to work around this at the moment. If I were to set this up, I would use a real database like PostGIS, which should support concurrent editing of the same table, and a single WorkspaceRunner. But I have never tried it, so I can't say for sure that it works. And I can think of a couple of reasons why you won't want to go from GDB to PostGIS to GDB...

"Probably having multiple processes writing on the same Geodatabase isn't allowed, the initial Workspace ended always after the first GDB was processed."

Correct. A File Geodatabase is still file-based: the feature class gets locked while one of the processes is writing to it. Also see this article from Esri. Writing to the same GDB from different workspaces should not work.
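Because of that lock, any parallel setup needs one output GDB per job. A rough Python sketch of that rule follows: it builds one `fme <workspace> --<parameter> <value>` command line per group and refuses to continue if two groups point at the same geodatabase. The workspace name `son.fmw` and the published parameters `GROUP_ID` and `DEST_GDB` are hypothetical, so substitute whatever your son Workspace actually exposes.

```python
from pathlib import Path
from typing import Dict, List


def build_commands(workspace: str, targets: Dict[int, Path]) -> List[List[str]]:
    """Build one command line per group, each with its own output geodatabase."""
    seen = set()
    commands = []
    for group_id, gdb in sorted(targets.items()):
        key = str(gdb).lower()  # compare case-insensitively, as Windows paths are
        if key in seen:
            # Two writers on one File Geodatabase would run into the lock.
            raise ValueError(f"geodatabase used by more than one group: {gdb}")
        seen.add(key)
        commands.append([
            "fme", workspace,
            "--GROUP_ID", str(group_id),  # hypothetical published parameter
            "--DEST_GDB", str(gdb),       # hypothetical published parameter
        ])
    return commands


# Hypothetical layout: four groups, four separate output geodatabases.
targets = {n: Path("out") / f"group_{n}.gdb" for n in range(1, 5)}
commands = build_commands("son.fmw", targets)
print(commands[0])
```

The check is deliberately strict: it is cheaper to fail before any process starts than to have one job stop halfway on a locked feature class.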

Hi Niels,

Yes, that's exactly my problem: sequential flow. Passing through PostGIS and then writing back to GDB could probably improve the processing time a little, but I'm not so sure.

Right now I'm trying to find a way to keep a single WorkspaceRunner... maybe by adding another Workspace between father and son.

It's just an experiment; I really don't know how efficient it would be to set up a Workspace that calls another Workspace that calls another Workspace (and I'm still thinking about how to do that).

If I find a solution I'll let you know, provided I don't generate a black hole in the process!

Thanks again