Question

How can we read and process the data in batches of say 1 million records ?

  • January 31, 2022

bhavyagandhi
Contributor

Suppose I have a dataset containing 10 million records, and I want to read it in blocks of 1 million features and run my processing on the 10 batches in parallel. Can someone please suggest a method?

7 replies

takashi
Influencer
  • January 31, 2022

What is the source dataset format?


bhavyagandhi
Contributor
  • Author
  • January 31, 2022

Shapefile. But regardless of the format, I want to send 1 million records (or rows, if it's a database format) through as one set, and in parallel run 10 more batches like this through the same series of processes and transformers. I just want to know how this can be done in batches, in parallel.


takashi
Influencer
  • January 31, 2022

A possible way is to convert the transformers that need to run in parallel into a custom transformer, configure its parallel processing parameters, and run it for each group (i.e. each block of 1 million features).

The attached screenshots illustrate how you can create a transformer parameter linked to the Group By parameter, and set the Parallel Processing parameter to a parallel mode (Minimal or above).

custom-transformer-parameters-1

custom-transformer-parameters-2
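Outside FME, the group-then-parallelize idea described above can be sketched in Python. This is only an analogy and not how FME implements it: `process_group` is a hypothetical stand-in for the transformers inside the custom transformer, the `block` key plays the role of the Group By attribute, and a thread pool stands in for the worker processes that FME would start.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_group(group_id, features):
    """Hypothetical stand-in for the work done inside the custom transformer."""
    return group_id, sum(f["value"] for f in features)


def run_per_group(features, key="block", max_workers=4):
    """Split features by the group key, then process each group concurrently."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature[key]].append(feature)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(process_group, gid, feats)
            for gid, feats in groups.items()
        ]
        # Collect one (group_id, result) pair per group.
        return dict(f.result() for f in futures)


# Tiny demo: 10 features in blocks of 5, instead of 10 million in blocks of 1 million.
demo = [{"block": i // 5, "value": i} for i in range(10)]
results = run_per_group(demo)
```

Each group is handled as an independent unit of work, which is the property the Group By and Parallel Processing parameters rely on.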


bhavyagandhi
Contributor
  • Author
  • January 31, 2022

How can we process it in blocks, e.g. 1 million features, then the next million, and then the next? Is there a way to segregate the features into blocks and run them in parallel? I saw that all of the features going into the custom transformer through different streams are processed together one by one, not in parallel.


hkingsbury
Celebrity
  • January 31, 2022

Try setting the Group By Mode to "Process At End" and using a transformer such as the ModuloCounter to assign the features to X groups.
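As a rough illustration of the modulo approach (a Python sketch, not FME code), the function below assigns feature `i` to group `i % n_groups`. The zero-based count and the name `modulo_groups` are assumptions made for the example.

```python
def modulo_groups(n_features, n_groups):
    """Assign feature i to group i % n_groups (round-robin), assuming a zero-based count."""
    groups = {g: [] for g in range(n_groups)}
    for i in range(n_features):
        groups[i % n_groups].append(i)
    return groups


# Tiny demo: 10 features spread over 3 groups.
demo_groups = modulo_groups(10, 3)
```

With this scheme the members of each group are interleaved across the whole input, so no group is complete until the last feature has arrived. That is presumably why it is paired with "Process At End" rather than a mode that reacts to group changes.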


takashi
Influencer
  • January 31, 2022

I think it would be efficient to keep the order of the features in this case. A possible way is to use a Counter to add a sequential number to the features, then calculate a group ID (an integer) with this expression:

@floor(@Value(_count) / 1000000)

You can then set the Group By Mode parameter to "Process When Group Changes (Advanced)".

[Add] The attached screenshot illustrates my intention.

workflow-example
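The arithmetic behind the group ID expression above can be checked with a small Python sketch. It assumes the count starts at 0; `block_id` and `iter_blocks` are hypothetical names, and `itertools.groupby` only mimics the idea of releasing a group as soon as the group ID changes, which works because the features stay in order.

```python
import math
from itertools import groupby


def block_id(count, block_size):
    """Same arithmetic as @floor(@Value(_count) / 1000000), with a configurable block size."""
    return math.floor(count / block_size)


def iter_blocks(counts, block_size):
    """Yield (block_id, members) each time the block id changes; relies on ordered input."""
    for gid, members in groupby(counts, key=lambda c: block_id(c, block_size)):
        yield gid, list(members)


# Tiny demo: counts 0..9 in blocks of 4 instead of blocks of 1 million.
blocks = list(iter_blocks(range(10), block_size=4))
```

With a zero-based count, 10 million features map to group IDs 0 through 9. If the count started at 1 instead, the first block would hold 999,999 features and the last feature would land in an eleventh group on its own.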


bhavyagandhi
Contributor
  • Author
  • February 2, 2022

Thank you for the suggestions, appreciate it 😊

