Solved

Memory issues - Looping a folder

  • 27 February 2024
  • 10 replies
  • 73 views

Badge +4

Dear,

I have a folder with 2208 XML files, approx. 30 GB in total. I read these files using a File Path Reader followed by an XML reader, select some features from them, and save the results locally. However, with my 16 GB of RAM I run into memory issues. I have already reduced the amount of data as much as possible within my workbench, but it is still not enough.

  1. Is there a method to read the first 10 XML files, process them, save the outcome, and then move on to the next 10 files, like a loop?
  2. My second question is whether it will save memory if I use a fanout expression, or whether I should just write one final dataset with the results.
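Outside FME, the loop in question 1 amounts to chunking the file list into fixed-size batches. A minimal plain-Python sketch of that idea; the folder path and batch size are illustrative, and in FME the file list would come from the File Path Reader instead:

```python
from pathlib import Path

def batches(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Illustrative folder; in FME this list would come from the File Path Reader.
xml_files = sorted(str(p) for p in Path("C:/data/xml").glob("*.xml"))
for batch in batches(xml_files, 10):
    pass  # read these 10 files, process them, write the results, then continue
```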

My version is: FME 2023.2.1


Best answer by nielsgerrits 27 February 2024, 11:53


10 replies

Userlevel 6
Badge +32

There are multiple ways to solve this. One way is to use a WorkspaceRunner to start a second workspace that processes the file. You will need to create a parameter in the child workspace that the WorkspaceRunner can send the path to.

Another way is to create a custom transformer. You can then configure it to do parallel processing, but this is a more complex solution.

I don't think you will save memory using a fanout expression, because the writer does not know when the last feature within a group has arrived and has to wait for all features to arrive before writing can start.

Also, did you run the workspace with Feature Caching on? This is nice for development but eats memory, as it stores cached files for each output port on every transformer. Switching this off in heavy production runs really helps.

Badge +4

...

Thanks! Could you give some more support for the WorkspaceRunner and custom transformer ideas?

I switched Feature Caching off when running my complete script; that saves memory for sure.

Userlevel 6
Badge +32

...

Thanks! Could you give some more support for the WorkspaceRunner and custom transformer ideas?

I switched Feature Caching off when running my complete script; that saves memory for sure.

 

Sure, demo set attached.

Userlevel 6
Badge +32

Also a custom transformer sample added.

 

When I tested this with 30 files, the custom transformer was quicker than the WorkspaceRunner (20 vs. 30 seconds).

Badge +4

@nielsgerrits thanks a lot! That looks promising.

I applied this concept to my own workbench. Just to be sure: the parent workbench derives the path_windows value for the specified folder in the FeatureReader. Will this value automatically be transferred as a User Parameter to the workspace you choose in the WorkspaceRunner, or do I have to manually set this User Parameter in the child workspace?

Userlevel 6
Badge +32

First create the child workspace. Then go to the FeatureReader, click the arrow button to the right of the Dataset field, and choose User Parameter, Create User Parameter. If you like, you can change the values of Parameter Identifier and Prompt; click OK. Save the workspace.

Then create the parent workspace. Add a WorkspaceRunner and go to its settings. Select the child workspace. Now the published parameter(s) you created in the child become visible in the WorkspaceRunner. Click the arrow button next to the parameter and select the attribute you want to use; in this case, that is path_windows.
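For reference, the WorkspaceRunner hands the published parameter to the child much like the fme command line does (`fme child.fmw --PARAM value`). A sketch that builds one such invocation per file; the workspace and parameter names here are illustrative, not taken from the posts above:

```python
def child_commands(paths, workspace="child.fmw", param="SOURCE_XML"):
    """Build one fme command per input file; each passes the file path to
    the child workspace's published parameter (names are illustrative)."""
    return [["fme", workspace, f"--{param}", p] for p in paths]

cmds = child_commands(["a.xml", "b.xml"])
# Each command could be launched with subprocess.run(cmd) outside FME.
```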

Badge +4

Thanks @nielsgerrits. I will try the custom transformer as well! 

I have a question about the WorkspaceRunner. I applied this to my project and it's running now:

  • I restricted the WorkspaceRunner to run max 2 concurrent FME processes.
  • Wait for Job to complete: No
  • Workspace Runs per process: 1


1st question:
My memory seems to handle this, but what will happen if a started child workspace runs out of memory? Will my parent workspace give a warning, or will it simply continue with the next XML file? If a child workspace gives a warning, there is probably a way to write these warnings to a log file.

2nd question:
Some of the XMLs processed will not result in a final dataset because I use a SpatialFilter. Can I somehow get a message that a workspace was processed successfully but did not produce an output?

Userlevel 6
Badge +32

If a run fails, it will leave via the WorkspaceRunner's Failed output port. I'm not sure what happens in your case when it runs out of memory; does it crash?

When I have memory issues I see "Optimizing Memory..." messages in the log. If I understand correctly, FME starts swapping (writing in-memory data to disk to free memory), which causes a lot of writes/reads and kills performance. So when I see this, I stop the workspace and redesign it.

What I do when using a WorkspaceRunner is create a work list in a database format (because multiple processes may write to the same table at the same moment, which does not work with files, as those are read-only while being edited), where the child process updates the work list with, for example, file "1.txt", status = "done". This way I can keep track of what has been processed and what has not, and I can stop and restart the process without having to do it all over again.
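The work-list idea can be sketched with SQLite, which handles concurrent writers through file locking; the table and column names below are illustrative, not taken from the post:

```python
import sqlite3

# A small SQLite work list: child processes mark files done, so a stopped
# run can be resumed without reprocessing. Names are illustrative.
conn = sqlite3.connect("worklist.db")
conn.execute("CREATE TABLE IF NOT EXISTS worklist (file TEXT PRIMARY KEY, status TEXT)")

def enqueue(files):
    """Register files as 'todo'; already-known files keep their status."""
    conn.executemany(
        "INSERT OR IGNORE INTO worklist (file, status) VALUES (?, 'todo')",
        [(f,) for f in files],
    )
    conn.commit()

def mark_done(name):
    """Called by the child process when a file has been processed."""
    conn.execute("UPDATE worklist SET status = 'done' WHERE file = ?", (name,))
    conn.commit()

def pending():
    """Files still to process, e.g. after a stop/restart."""
    return [r[0] for r in conn.execute(
        "SELECT file FROM worklist WHERE status = 'todo' ORDER BY file")]
```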

I think the possibilities are almost endless; it just takes some work and a small development/test dataset to get it how you want it. You can send yourself an email using the Emailer when something happens. FeatureWriters are key because of their output port.

Badge +4

@nielsgerrits good insights!

I faced another interesting issue. For example:

1st run => output dataset A_test is saved.

2nd run => output dataset A_test is saved.

I thought I should include a unique counter somehow to prevent the outputs of the executed FME processes from overwriting each other. Do you have ideas for this? Maybe I could create this counter in the parent workspace and send it into each run as part of the output name for the dataset:

A_test_run1
A_test_run2 etc.

Userlevel 6
Badge +32

I use a datetime stamp for this: yyyymmddhhmmss. Create a Python scripted parameter and use it for the folder name. Generate it in the parent and pass it to the child through a parameter.
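As a sketch, the stamp described above could be generated like this; the folder prefix is illustrative, and in FME the value would be produced once by a Python scripted parameter in the parent:

```python
from datetime import datetime

# Generate a yyyymmddhhmmss stamp once in the parent workspace and pass it
# to each child run, e.g. as part of the output folder name.
run_stamp = datetime.now().strftime("%Y%m%d%H%M%S")
output_folder = f"A_test_{run_stamp}"
```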
