Question

Tips for handling a huge single Parquet file, trying not to load the complete file.

  • July 22, 2023
  • 1 reply
  • 136 views

wrijs
Contributor
  • 11 replies

Hi, what is the best way to avoid loading a complete, huge single Parquet file (58 million records)?

My memory keeps running out.

I have the option to split the data into smaller pieces with a Tester transformer, selecting the data by date and then saving it for further processing. I have done that with a parent-child workspace and a Tester plus a published parameter for the dates, but it still goes wrong due to memory problems. The process takes very long to save by date, especially because all 58 million records are loaded before the data runs through the Tester. I would love tips on splitting up the data before loading the complete 58 million records, or any other tips for dealing with Parquet files.
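For illustration, this is the kind of split-before-load read I am after; a minimal sketch using Python's pyarrow library (the file path and "date" column name are placeholders, and this is outside FME, just to show the idea):

```python
# Minimal sketch (pyarrow assumed installed): read only the rows for one
# date, letting Parquet row-group statistics skip the rest of the file.
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("huge_file.parquet", format="parquet")  # hypothetical path

# Predicate pushdown: row groups whose min/max statistics exclude this
# date are never loaded into memory.
table = dataset.to_table(filter=pc.field("date") == "2023-07-01")
print(table.num_rows)
```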

 

1 reply

featuremichael
Enthusiast
  • 61 replies
  • July 24, 2023

Hello @wrijs,

 

I understand your problem with large data and a limited amount of RAM.

 

Have you tried running your Parquet reader with the Features to Read parameters "Start Feature" and "Min Features to Read"?

 

I would try to run my workspace in pieces with these two parameters, like:

 

Workspace run 1:

Start Feature: 0

Min Features to Read: 1 000 000

 

Workspace run 2:

Start Feature: 1 000 000

Min Features to Read: 1 000 000

 

and so on.....
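
If it helps to see the same run-in-pieces pattern outside FME, here is a minimal Python sketch with the pyarrow library (the path and chunk size are assumptions, not your workspace settings):

```python
# Minimal sketch: stream the Parquet file in batches of ~1,000,000 rows,
# so only one chunk is held in memory at a time.
import pyarrow.parquet as pq

pf = pq.ParquetFile("huge_file.parquet")  # hypothetical path
for i, batch in enumerate(pf.iter_batches(batch_size=1_000_000)):
    # 'batch' is a RecordBatch of up to 1,000,000 rows; process it and
    # let it go out of scope before the next chunk is read.
    print(f"chunk {i}: {batch.num_rows} rows")
```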

 

I hope this helps :-/

 

Greetings, Michael