Suppose I have a dataset containing 10 million records, and I want to read it 1 million features at a time and run my processing in parallel in 10 batches. Can someone please suggest a method?

How can we read and process the data in batches of, say, 1 million records?

takashi
Forum|Forum|3 years ago
January 31, 2022

What is the source dataset format?

Why not inspect features with Visual/Data Preview and Feature/Record Information before writing them into a destination dataset?

bhavyagandhi
Author
Contributor
Forum|Forum|3 years ago
January 31, 2022

Shapefile, but regardless of the format, I want to send 1 million records (or rows, if it is a database format) as one set, and run 10 batches like this in parallel through the same series of processes and transformers. I just wanted to know how this can be done in batches in parallel.

takashi
Forum|Forum|3 years ago
January 31, 2022

A possible way is to convert the transformers that you need to run in parallel into a custom transformer, configure its parallel processing parameters, and run it once for each group (i.e. each block of 1 million features).

The attached screenshots illustrate how you can create a transformer parameter linked to the Group By parameter, and set the Parallel Processing parameter to a parallel level (Minimal or above).

[Screenshot: custom-transformer-parameters-1]
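
To make the idea of "one run per group, with groups running in parallel" concrete, here is a minimal sketch in plain Python. It is only an illustration, not FME code and not a description of how FME implements parallel processing: `process_group` and `run_groups_in_parallel` are hypothetical names, the "processing" is a simple sum, and a thread pool is used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def process_group(group_id, features):
    """Hypothetical stand-in for the transformers inside the custom transformer."""
    return group_id, sum(features)


def run_groups_in_parallel(groups, max_workers=4):
    """Submit one task per group and collect the results keyed by group ID."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(process_group, group_id, features)
            for group_id, features in groups.items()
        ]
        # result() blocks until that group's task has finished.
        return dict(future.result() for future in futures)


# Three small groups standing in for three blocks of features.
groups = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}
print(run_groups_in_parallel(groups))  # {0: 6, 1: 15, 2: 24}
```

The point of the sketch is that the unit of parallelism is the group: each group is handed to a worker as a whole, so features need a group ID before they enter the custom transformer.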

bhavyagandhi
Author
Contributor
Forum|Forum|3 years ago
January 31, 2022

How can we process it in blocks, i.e. 1 million features, then the next million, and then the next? Is there a way to split the data into blocks and run them in parallel? I saw that all of the features entering the custom transformer through different streams are processed together, one by one, and not in parallel.

hkingsbury
Celebrity
Forum|Forum|3 years ago
January 31, 2022

Set the Group By Mode to "Process At End" and use a transformer such as the ModuloCounter to group features into X groups.
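
As a rough illustration of modulo-based grouping (plain Python, not FME code; `modulo_groups` is a hypothetical helper), each feature's running count modulo the number of groups gives its group ID:

```python
from collections import defaultdict


def modulo_groups(features, n_groups):
    """Assign each feature to group (running count % n_groups)."""
    groups = defaultdict(list)
    for count, feature in enumerate(features):
        groups[count % n_groups].append(feature)
    return dict(groups)


# Ten features spread round-robin over three groups.
print(modulo_groups(range(10), 3))
# {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}
```

Because the assignment is round-robin, the members of each group are interleaved across the whole input, so no group is complete until the last feature has arrived. That is consistent with processing the groups at the end rather than when the group ID changes.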

takashi
Forum|Forum|3 years ago
January 31, 2022

I think it would be efficient to keep the order of features in this case. A possible way is to use a Counter to add a sequential number to the features, then calculate a group ID (an integer) with this expression:

@floor(@Value(_count) / 1000000)

You can then set the Group By Mode parameter to "Process When Group Changes (Advanced)".

[Add] The attached screenshot illustrates my intention.

[Screenshot: workflow-example]
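
The same arithmetic can be sketched in plain Python (an illustration only, not FME code; `block_id` and `iter_blocks` are hypothetical names, and the count is assumed to start at 0). Integer division of the sequential count by the block size gives consecutive group IDs, and because the order is preserved, a block can be handed on as soon as the group ID changes:

```python
from itertools import groupby

BLOCK_SIZE = 1_000_000


def block_id(count, block_size=BLOCK_SIZE):
    """Same arithmetic as floor(count / block_size) for non-negative counts."""
    return count // block_size


def iter_blocks(features, block_size=BLOCK_SIZE):
    """Yield (group_id, features_in_block) each time the group ID changes."""
    numbered = enumerate(features)  # assumes the count starts at 0
    for group_id, members in groupby(
        numbered, key=lambda pair: block_id(pair[0], block_size)
    ):
        yield group_id, [feature for _, feature in members]


# A small block size so the behaviour is easy to see.
for group_id, block in iter_blocks("abcdefg", block_size=3):
    print(group_id, block)
# 0 ['a', 'b', 'c']
# 1 ['d', 'e', 'f']
# 2 ['g']
```

With a count starting at 0, counts 0 to 999,999 fall into group 0, counts 1,000,000 to 1,999,999 into group 1, and so on. If the count started at 1 instead, the first group would hold one feature fewer than the others.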

bhavyagandhi
Author
Contributor
Forum|Forum|3 years ago
February 2, 2022

Thank you for the suggestions, appreciate it 😊
