
Hi, what is the best way to avoid loading a single huge Parquet file (58 million records) completely into memory?

My memory keeps running out.

I have the option to split the data into smaller pieces with a Tester transformer that selects the data by date and then writes it out for further processing. I have done that with a child/parent workspace and a Tester plus a published parameter for the dates, but it still fails because of memory problems. Saving by date also takes very long, mainly because all 58 million records have to be loaded before they ever reach the Tester. I would love tips on splitting up the data before loading the complete 58 million records, or any other tips for dealing with Parquet files.
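For context, the kind of pre-splitting I have in mind would be something like the pyarrow sketch below, run outside the workspace (or from a PythonCaller). This is only a sketch: it assumes pyarrow is available, and the file path, output folder, and `date` column name are placeholders for my actual data.

```python
# Sketch only: paths and the "date" column name are placeholders.
import pyarrow.dataset as ds

# Open the 58-million-record file lazily; nothing is loaded into memory yet.
big = ds.dataset("huge_file.parquet", format="parquet")

# Stream the records into one folder per date value. write_dataset() scans
# the source in batches, so the whole file never has to fit in RAM at once.
ds.write_dataset(
    big,
    base_dir="split_by_date",
    format="parquet",
    partitioning=["date"],
    partitioning_flavor="hive",  # folders like date=2024-01-01/
)
```

Each date would then become its own small Parquet file, so a child workspace only has to read one slice at a time.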

 

Hello @wrijs,

 

I understand your problem with large data and a small amount of RAM.

 

Have you tried running your Parquet reader with the "Features to Read" parameters "Start Feature" and "Min Features to Read"?

 

I would try to run the workspace in pieces with these two parameters, like:

 

Workspace run 1:
Start Feature: 0
Min Features to Read: 1 000 000

Workspace run 2:
Start Feature: 1 000 000
Min Features to Read: 2 000 000

... and so on.
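If it turns into many runs, a small loop can generate the parameter values for each pass instead of typing them by hand. A minimal sketch of the arithmetic, assuming the second value is a running total as in the runs above (please verify how "Min Features to Read" behaves on a small test first):

```python
# Sketch of the run windows for 58 million records, 1 million per pass.
# "Min Features to Read" is treated here as a running total, matching the
# runs listed above; verify this on a small test before relying on it.
TOTAL_RECORDS = 58_000_000
CHUNK = 1_000_000

for run, start in enumerate(range(0, TOTAL_RECORDS, CHUNK), start=1):
    end = min(start + CHUNK, TOTAL_RECORDS)
    print(f"Workspace run {run}: Start Feature = {start}, "
          f"Min Features to Read = {end}")
```

The printed values could then be passed to the workspace as published parameters, for example from a batch script that runs the workspace once per window.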

 

I hope this helps!

 

Greetings, Michael

 

 

