Question

Memory issues reading from large CSV file

  • 5 March 2018
  • 5 replies
  • 14 views

Badge +5

I'm using FME Desktop 64-bit (v 2017.1.2.1) and trying to read a large CSV file (approx. 1 GB and 60 million records) into a PostgreSQL database table, with no geometry involved. The only transformers I am using are an AttributeRenamer to map the fields and two ExpressionEvaluators to calculate km/h and m/s from mph.
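
For reference, the two conversions are just constant multipliers. A rough Python sketch of the arithmetic (the attribute name speed_mph is made up; in the workspace these live in the two ExpressionEvaluators):

# Exact conversion factors; speed_mph is a hypothetical attribute name.
MPH_TO_KMH = 1.609344   # 1 mile = 1.609344 km
MPH_TO_MS  = 0.44704    # 1 mph  = 0.44704 m/s

def convert(speed_mph):
    return speed_mph * MPH_TO_KMH, speed_mph * MPH_TO_MS

print(convert(60.0))    # (96.56064, 26.8224)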

My issue is that running the workbench consumes most of the 8 GB RAM available on the VM, and this appears to be because the CSV reader reads in all the records before the first transformer (the AttributeRenamer). I ran the workbench overnight and it failed with an FME app crash at approx. 52 million records processed. I'm assuming this was memory related, as the last few lines of the log are:

2018-03-03 01:20:55|17816.4| 1.0|STATRP|AttributeRenamer(AttrSetFactory): Processed 52421236 of 59977315 features

2018-03-03 01:20:56|17818.1| 2.8|STATRP|ResourceManager: Optimizing Memory Usage. Please wait...

2018-03-03 01:20:56|17818.1| 2.8|INFORM|ResourceManager: Optimizing Memory Usage. Please wait...

2018-03-05 12:11:39|17818.2| 0.1|WARN |Warning: not all FMESessions that were created were destroyed before shutdown. This may cause instability

2018-03-05 12:11:39|17818.2| 0.0|WARN |Warning: not all Stashed Objects that were registered were dropped before shutdown. This may cause instability

My question is, why is FME using so much memory reading from the file?

I'm not using any group-based transformers that would usually require all the features to be held in memory, so surely the records can be read from the CSV, processed and disposed of in sequence, with very little memory used?
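
Something like this rough Python sketch is the pattern I'd expect (not how FME works internally, and the file and column names are made up), where only one record is held in memory at a time:

import csv

# Hypothetical file and column names, for illustration only.
with open("speeds.csv", newline="") as f:
    for row in csv.DictReader(f):             # one record at a time
        kmh = float(row["speed_mph"]) * 1.609344
        ms  = float(row["speed_mph"]) * 0.44704
        # ... write the transformed record out, then move on ...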

Despite the large file size, I was expecting FME to handle this much better.

Cheers

John


5 replies

Badge +5
Just following up... I've been experimenting with this today and have found that the PostgreSQL writer does not appear to honour either the 'Bulk Insert' or the 'Features Per Transaction' parameter. I have these set to 'No' and '1000' respectively, but if I check the count in the target table while the process runs, no records are reported. There are also no writer transaction messages appearing in the log. This suggests FME is not writing anything until the last feature has entered the writer.
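
A simple way to watch the row count from outside FME is something like the following (a Python/psycopg2 sketch; the connection string and table name are placeholders):

import time
import psycopg2

# Placeholder connection details and table name.
conn = psycopg2.connect("dbname=targetdb user=fme host=localhost")
conn.autocommit = True

while True:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM speed_records")
        print("rows written:", cur.fetchone()[0])
    time.sleep(30)    # poll every 30 seconds while the workspace runs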

If I replace the PostgreSQL writer with a FeatureWriter transformer, with 'Bulk Insert' set to either 'Yes' or 'No' and 'Features Per Transaction' set to 1000, then I get log messages like the one below, and the counts in the target tables increase as the process runs.

2018-03-05 16:26:09| 75.8| 0.5|INFORM|Committed transaction at feature '23000'

Memory usage is also stable at approx. 150 MB.

I therefore suspect there is a problem with the PostgreSQL writer that is causing FME to attempt to write all the features as a single bulk transaction. The workaround is to use the FeatureWriter.
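
To illustrate why 'Features Per Transaction' matters, here is a rough Python/psycopg2 sketch of committing every N rows; without the periodic commit, nothing is visible in the table until the very end. The table, columns and connection details are made up, and this only shows the idea, not FME's internals:

import psycopg2

FEATURES_PER_TRANSACTION = 1000    # analogous to the writer parameter

# Placeholder data, table and connection details, purely for illustration.
rows = ({"kmh": i * 1.6, "ms": i * 0.45} for i in range(10000))
conn = psycopg2.connect("dbname=targetdb user=fme host=localhost")

with conn.cursor() as cur:
    for i, row in enumerate(rows, start=1):
        cur.execute(
            "INSERT INTO speed_records (kmh, ms) VALUES (%s, %s)",
            (row["kmh"], row["ms"]),
        )
        if i % FEATURES_PER_TRANSACTION == 0:
            conn.commit()          # rows become visible every 1000 features
conn.commit()                      # commit the final partial batch
conn.close()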

John

Userlevel 4
Badge +25

I think a good way to test would be to add a breakpoint just before the writer and run the workspace in breakpoints/debug mode. That way you can move one feature at a time through the workspace and see exactly what is happening. Setting Features Per Transaction to 1 or 2 would also make it easier to see. Ultimately you might be right about the problem, but sadly I can't test it because I'm out of the office today. Someone else at Safe will follow up for sure, though.

Userlevel 4
Badge +13

Hi @john_gis4busine -- thanks for the good detective work here. I wonder if it is possible you have more than one writer in your workspace. If the PostgreSQL writer was not the first one, then I'd expect exactly the behaviour you're seeing (and yes, it does make me cry). Could you double-check and let us know? If it is the second writer, please move it up to be the first and see if that makes any change. If it already was the first writer, please send the workspace and log file to support@safe.com and we'll dig in. It definitely shouldn't be doing what you're seeing.

Badge +5

Hi @daleatsafe & @Mark2AtSafe.

Thanks for the quick responses. Yes, the PostgreSQL writer is the second writer (I have a PostGIS writer in the same workbench). Sure enough, when I switch the writers around, the features are written by the PostgreSQL writer in transactions as the process runs. However, the PostGIS writer then has the same issue and its features are not written until the end.

I'll send a test workbench and logs from the two test runs with the writers switched.

Userlevel 4
Badge +13


Ah yes, the "multi-writers are not simultaneous" issue. We've long wanted to make that change. You've now given us more motivation.

In the meantime, using only 1 writer and a FeatureWriter for your other output will avoid problems. Or you could use 2 FeatureWriters.

Thanks for letting us know, and sorry you have to do a workaround.
