
Hi All,

Today’s million-dollar question: I have created a workflow to process some polygon and table data into a template GDB. The process isn’t too complicated, but I think the most time-consuming transformers in the process are the DateTimeConverter and the FeatureWriter. I have done some tests and assume that processing around 94 million records will take days.

Mainly, writing 94 million records into a GDB template will be a nightmare.

Here is a screenshot of the workflow; I have also attached a version of it:

How can I improve the performance of the workflow and process around 94 million records in a clever way?

Open to suggestions :)

Thanks

I notice in the screenshot you have Feature Caching enabled; turn that off and you should get a massive performance boost. Also make sure your FME temp directory is set to an SSD rather than an HDD.
 

https://support.safe.com/hc/en-us/articles/25407446479373-Setting-a-temporary-file-location-for-FME-to-use-via-the-FME-TEMP-environment-variable
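
If you end up running the workspace from the command line, one way to point FME at an SSD scratch folder is to set FME_TEMP in the environment before launching fme.exe. A minimal sketch, assuming hypothetical paths for the FME install, the workspace and the SSD folder:

```python
# Sketch: run fme.exe with FME_TEMP pointed at an SSD scratch folder.
# All paths below are assumptions - adjust to your own install and drives.
import os
import subprocess

env = os.environ.copy()
env["FME_TEMP"] = r"D:\fme_temp"  # hypothetical SSD scratch folder

subprocess.run(
    [
        r"C:\Program Files\FME\fme.exe",        # hypothetical FME install path
        r"C:\workspaces\process_polygons.fmw",  # hypothetical workspace
    ],
    env=env,
    check=True,
)
```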


Another thing you could look to do is split the data into smaller subsets and process them one at a time. This could be done manually or via a parent/child workspace setup using the WorkspaceRunner.

You could also look to split it out based on the output feature classes you’re writing (so each workbench only writes one output FC rather than five).

 

Whilst some of these might not necessarily speed up the entire process, they will ‘micro-service’ it into smaller processes that can be rerun separately, and if one fails, you’ve only lost the progress of that one process.


Thanks @hkingsbury. I was planning to run it via the Quick Translator to improve the performance.

Where can I find info about the parent/child workspace runner? I couldn't find much online.

Would the ‘Group By’ option also help when processing massive amounts of data?

Thanks 


Looking at the workspace, I don’t really see anything in there that should cause a huge performance hit.

94 million records is a lot for sure; however, I wouldn’t expect days. I took a look at the writer and noticed that the Transaction Type is set to Edit Session. Is there a specific reason for this? I think this could be a big part of your performance drain. Transactions is the better choice if it’s an option, and could indeed change the process time from days to hours. Here’s a similar question where changing the transaction type had exactly that effect:



That and the Feature Caching of course as @hkingsbury mentioned. 

Do you see any mention in the log file about features being split out of bulk mode? If so, you should focus your attention on those spots to see if you can maintain bulk mode processing.

 



A parent/child setup would involve one ‘parent’ workspace that has a WorkspaceRunner in it. The parent process would be responsible for telling the child process (through the use of WHERE clauses etc. on the reader, via published parameters) what data to read in. Essentially, it splits the data into smaller chunks.

It’s likely that this offers no performance gains, but what it does provide is a safety net: you can rerun subsets of the data without needing to wait for the whole process to run again should it fail on a specific feature.

There is a very slim chance that you may see a minor performance increase using this. A large dataset may fill up the RAM and have to write to disk-based temp files, which is slower. But on the flip side, it’s very possible the overhead of starting up multiple smaller processes negates any processing speed improvements.
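
The same parent/child idea can also be sketched with FME’s Python API instead of a parent workspace, if that is easier to schedule. A rough sketch, assuming an OBJECTID-based split and a published parameter named WHERE_CLAUSE on the child workspace (both hypothetical):

```python
# Sketch: a "parent" script that runs a child workspace once per chunk,
# passing a WHERE clause through a published parameter.
# Workspace path, parameter name and ID ranges are all assumptions.
import fmeobjects

runner = fmeobjects.FMEWorkspaceRunner()
chunks = [(1, 10_000_000), (10_000_001, 20_000_000)]  # example ID ranges

for low, high in chunks:
    params = {"WHERE_CLAUSE": f"OBJECTID BETWEEN {low} AND {high}"}
    try:
        runner.runWithParameters(r"C:\workspaces\child_process.fmw", params)
    except fmeobjects.FMEException as ex:
        # Only this chunk needs a rerun; earlier chunks are already written.
        print(f"Chunk {low}-{high} failed: {ex}")
```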


Thanks @virtualcitymatt and @hkingsbury for your suggestions. 

I’ve just changed the FeatureWriter to the following settings and the performance has improved:

 

The total time of the process is now 12 hours to process and write 94 million records. A very good improvement, but it will need some tweaks.

Would the performance improve by increasing the Features Per Transaction from 5,000 to 20,000… or maybe by leaving it blank?


Increasing the transaction size probably won’t make any noticeable difference, especially on a GDB.
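
A back-of-the-envelope check of why: the batch size only changes how many commits happen, and at either size the commit overhead is already amortised over thousands of features.

```python
# Back-of-the-envelope: number of transactions/commits for 94M records.
records = 94_000_000
for batch in (5_000, 20_000):
    print(f"{batch:>6} features per transaction -> {records // batch:,} commits")
# 5,000 -> 18,800 commits; 20,000 -> 4,700 commits
```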

