Question

How to feed a WorkspaceRunner parts of a 60-million-feature dataset

  • July 18, 2023

wrijs
Contributor
  • 11 replies

I would like to feed only 1,000,000 features at a time from a single 60,000,000-feature dataset (Parquet) to a WorkspaceRunner. Can I use a FeatureReader for this? I could select features based on attributes (dates or names), but I can't process the data per tile or with a spatial filter.

3 replies

hkingsbury
Celebrity
  • 1625 replies
  • July 18, 2023

That's not really how the WorkspaceRunner works. It runs the child workspace once for every feature it receives, so if you "sent" 1m features to it, you would trigger the workspace 1m times.

Instead, the workspace you're triggering needs its own filtering logic.

 

As a very simple example say you had the following data...

+-----------+-----------+-------+
|   name    |   type    | count |
+-----------+-----------+-------+
| banana    | fruit     |    10 |
| apple     | fruit     |     2 |
| carrot    | vegetable |   100 |
| chocolate | other     |    10 |
| potato    | vegetable |    53 |
+-----------+-----------+-------+

You'd set your child workspace (the one referenced in the WorkspaceRunner) to read data in based on the "type". This would be a published parameter.

 

In your parent workspace (the one that calls the WorkspaceRunner) you would have logic to work out which types exist. Then a single feature per type would trigger the WorkspaceRunner, passing that type to the child's published parameter.
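
To make the parent/child split concrete, here is a minimal Python sketch of the same idea outside FME. It reduces the example table to one entry per "type" and builds one child run per entry. The workspace name `child.fmw` and the published parameter name `TYPE` are hypothetical, and the `fme <workspace> --<parameter> <value>` command form is shown only to illustrate "one run per type"; inside FME the WorkspaceRunner does this for you.

```python
# Example rows from the table above.
rows = [
    {"name": "banana", "type": "fruit", "count": 10},
    {"name": "apple", "type": "fruit", "count": 2},
    {"name": "carrot", "type": "vegetable", "count": 100},
    {"name": "chocolate", "type": "other", "count": 10},
    {"name": "potato", "type": "vegetable", "count": 53},
]


def distinct_types(records):
    """Return each 'type' value once, in first-seen order."""
    return list(dict.fromkeys(r["type"] for r in records))


def child_commands(types, workspace="child.fmw", parameter="TYPE"):
    """Build one command per type (workspace and parameter names are hypothetical)."""
    return [["fme", workspace, f"--{parameter}", t] for t in types]


types = distinct_types(rows)
commands = child_commands(types)

# Five rows collapse to three triggers, so the child runs three times, not five.
for cmd in commands:
    print(" ".join(cmd))
```

The point is the reduction step: the parent only ever hands the runner one feature per group, and the child does the actual reading and filtering.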


wrijs
Contributor
  • Author
  • 11 replies
  • July 21, 2023

Thank you for your reply. This has helped me understand the WorkspaceRunner better. I have been trying things out in FME, but it is still very difficult to process the 60,000,000 records.

I now wonder whether the WorkspaceRunner can pass values to published parameters in the child workspace for Start Feature and Max Features to Read. I would love to avoid having a reader read in 60,000,000 records before it can do anything at all.
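
As an illustration of what that batching would look like, here is a small Python sketch that computes the (start, max features) pair for each run. It assumes the first feature is numbered 0; whether a given reader counts from 0 or 1 is an assumption to check against its documentation, and the `first` argument exists for that reason.

```python
def batch_ranges(total, size, first=0):
    """Return (start, max_features) pairs that together cover `total` features.

    `first` is the index of the first feature (assumed 0 here).
    """
    ranges = []
    start = first
    remaining = total
    while remaining > 0:
        n = min(size, remaining)  # the last batch may be smaller
        ranges.append((start, n))
        start += n
        remaining -= n
    return ranges


batches = batch_ranges(60_000_000, 1_000_000)
print(len(batches))   # number of child runs needed
print(batches[0])     # parameters for the first run
print(batches[-1])    # parameters for the last run
```

Each pair would become one feature in the parent workspace, with the two values mapped to the child's published parameters.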


hkingsbury
Celebrity
  • 1625 replies
  • July 26, 2023

You could do that by using a FeatureReader and exposing those parameters, but that approach is sometimes a bit flaky. I'd recommend using a WHERE clause instead.
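
A rough sketch of how the WHERE-clause route could be driven from a date attribute, since the original question mentions dates as a usable selector. It generates one half-open monthly range per run. The column name `created` is hypothetical, and the quoting and date-literal syntax are assumptions; the exact form depends on what the reader accepts for the format in question.

```python
from datetime import date


def next_month(d):
    """Return the first day of the month after `d`."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def month_where_clauses(start, end, column="created"):
    """Build one WHERE clause per calendar month in [start, end).

    Ranges are half-open (>= low, < high) so no feature lands in two batches.
    `column` is a hypothetical attribute name.
    """
    clauses = []
    low = date(start.year, start.month, 1)
    while low < end:
        high = next_month(low)
        clauses.append(
            f"\"{column}\" >= '{low.isoformat()}' AND \"{column}\" < '{high.isoformat()}'"
        )
        low = high
    return clauses


clauses = month_where_clauses(date(2022, 11, 1), date(2023, 2, 1))
for clause in clauses:
    print(clause)
```

Each clause would be carried by one feature in the parent workspace and passed to a published parameter that the child uses as its WHERE clause. If months hold very uneven feature counts, a narrower or wider window may give batches closer to the 1,000,000 target.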