Question

How can we read and process data in batches of, say, 1 million records?

  • 31 January 2022
  • 7 replies
  • 30 views

Suppose I have a dataset containing 10 million records, and I want to read it in blocks of 1 million features and run my processing in parallel as 10 batches. Can someone please suggest a method?


7 replies

What is the source dataset format?

Shapefile, but regardless of the format, I want to send 1 million records (or rows, if it is a database format) as one set, and run 10 such batches in parallel through the same series of processes and transformers. I just want to know how this can be done in batches, in parallel.

A possible way is to wrap the transformers that need to run in parallel in a custom transformer, configure its parallel processing parameters, and run it once per group (i.e. per block of 1 million features).

The attached screenshots illustrate how you can create a transformer parameter linked to the Group By parameter, and set the Parallel Processing parameter to a parallel mode (Minimal or above).

[Screenshots: custom-transformer-parameters-1, custom-transformer-parameters-2]
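For readers who think in code, the idea of "run the same processing once per group, with groups handled concurrently" can be sketched in plain Python. This is only an analogy, not FME's API: `process_group` and `run_groups_in_parallel` are hypothetical names, the summing step is a stand-in for whatever the custom transformer does, and a thread pool is used to keep the sketch short (for CPU-bound work in Python, `ProcessPoolExecutor` from the same module would be the usual swap).

```python
from concurrent.futures import ThreadPoolExecutor


def process_group(group):
    """Stand-in for the transformers wrapped in the custom transformer."""
    group_id, records = group
    # Hypothetical work: here we just sum the records of the group.
    return group_id, sum(records)


def run_groups_in_parallel(groups, max_workers=4):
    """Run process_group once per group, with groups handled concurrently.

    `groups` maps a group ID to the list of records in that group.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with groups.
        return dict(pool.map(process_group, groups.items()))


if __name__ == "__main__":
    print(run_groups_in_parallel({0: [1, 2, 3], 1: [4, 5]}))
```

The point of the sketch is that parallelism comes from the group ID: nothing runs in parallel unless the features carry an attribute that splits them into more than one group.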

How can we process it in blocks, i.e. the first 1 million features, then the next million, and so on? Is there a way to split the data into blocks and run them in parallel? I saw that all of the features entering the custom transformer through different streams are processed together, one by one, and not in parallel.

Try setting the Group By Mode to "Process At End" and using a transformer like the ModuloCounter to assign the features to X groups.
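As a rough illustration of what modulo-based grouping does, here is a plain-Python sketch (not FME code; `modulo_groups` is a hypothetical helper). Each record gets the group ID `index % group_count`, so consecutive records land in different groups and no group is complete until the last record has been read, which is presumably why "Process At End" fits this approach.

```python
from collections import defaultdict


def modulo_groups(records, group_count):
    """Assign each record to group (index % group_count)."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[index % group_count].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Ten records spread round-robin over three groups.
    print(modulo_groups(range(10), 3))
```

Note that this balances group sizes but does not keep neighbouring records together.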

I think it would be more efficient to keep the order of the features in this case. A possible way is to use a Counter to add a sequential number to each feature, then calculate a group ID (an integer) with this expression:

@floor(@Value(_count) / 1000000)

You can then set the Group By Mode parameter to "Process When Group Changes (Advanced)".

[Add] The attached screenshot illustrates my intention.

[Screenshot: workflow-example]
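The arithmetic behind the expression above can be sketched in plain Python (an analogy only, not FME code; `block_id` and `iter_blocks` are hypothetical names, and the counter is assumed to start at 0). Integer division of the sequential count by the block size gives the same ID to each run of 1,000,000 consecutive records, and because the IDs arrive in order, a block can be handed off as soon as the ID changes instead of waiting for the end of the data.

```python
from itertools import groupby


def block_id(count, block_size=1_000_000):
    """Same arithmetic as floor(count / block_size) for non-negative counts."""
    return count // block_size


def iter_blocks(records, block_size=1_000_000):
    """Yield (group_id, block) as soon as each contiguous block is complete."""
    numbered = enumerate(records)  # plays the role of the sequential counter
    by_block = groupby(numbered, key=lambda pair: block_id(pair[0], block_size))
    for group_id, members in by_block:
        yield group_id, [record for _, record in members]


if __name__ == "__main__":
    # Small block size so the behaviour is easy to see.
    for group_id, block in iter_blocks(range(7), block_size=3):
        print(group_id, block)
```

With 10 million records and the default block size, this would produce group IDs 0 through 9, one per block of 1 million.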

Thank you for the suggestions, appreciate it 😊
