Question

How to feed a WorkspaceRunner parts of a 60-million-feature dataset


wrijs
Contributor

I would like to input only 1,000,000 features at a time from a single 60,000,000-feature dataset (Parquet) to a WorkspaceRunner. Can I use a FeatureReader for this? I could select features based on attributes (dates or names), but I can't process the data per tile or with a spatial filter.

3 replies

hkingsbury
Celebrity
  • July 18, 2023

That's not really how the WorkspaceRunner works. It runs the child workspace once for every feature that enters it, so if you "sent" 1m features to it, you would trigger the child workspace 1m times.

The filtering logic needs to live in the workspace you're triggering.

 

As a very simple example, say you had the following data:

+-----------+-----------+-------+
|   name    |   type    | count |
+-----------+-----------+-------+
| banana    | fruit     |    10 |
| apple     | fruit     |     2 |
| carrot    | vegetable |   100 |
| chocolate | other     |    10 |
| potato    | vegetable |    53 |
+-----------+-----------+-------+

You'd set up your child workspace (the one referenced in the WorkspaceRunner) to read in only the data for a given "type", with that type supplied as a published parameter.

 

In your parent workspace (the one that calls the WorkspaceRunner), you would have logic to determine which distinct types exist. A single feature for each type would then trigger the WorkspaceRunner, so with the data above the child workspace would run three times (fruit, vegetable, other).
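
For readers who think in code, the parent/child split can be sketched in plain Python. This is not FME code: `child_workspace` is a hypothetical stand-in for the child workspace, and `type_param` stands in for its published parameter. The rows are the ones from the table above.

```python
# Illustrative only: plain Python mimicking the parent/child pattern.
rows = [
    {"name": "banana", "type": "fruit", "count": 10},
    {"name": "apple", "type": "fruit", "count": 2},
    {"name": "carrot", "type": "vegetable", "count": 100},
    {"name": "chocolate", "type": "other", "count": 10},
    {"name": "potato", "type": "vegetable", "count": 53},
]


def distinct_types(rows):
    """Parent logic: one trigger value per distinct type, in first-seen order."""
    return list(dict.fromkeys(row["type"] for row in rows))


def child_workspace(rows, type_param):
    """Hypothetical child: reads only the rows matching the published parameter."""
    return [row for row in rows if row["type"] == type_param]


# One trigger per type, so the child runs three times rather than five.
triggers = distinct_types(rows)
results = {t: child_workspace(rows, t) for t in triggers}
```

The point of the sketch is that the parent only passes a small parameter value per run; the child does the actual reading and filtering.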


wrijs
Contributor
  • Author
  • July 21, 2023

Thank you for your reply. This has helped me understand the WorkspaceRunner better. I have been trying things out in FME, but it is still very difficult to process the 60,000,000 records.

I now wonder whether the WorkspaceRunner can pass values to published parameters in the child workspace for the reader's Start Feature and Max Features to Read settings. I would love to avoid having a reader read in 60,000,000 records before it can do anything at all.
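
As a rough illustration of the chunking idea in this question, the parent would need one (start, max) pair per run. The sketch below is plain Python, not FME code; `chunk_windows` is a hypothetical helper, and whether the reader counts its start feature from 0 or 1 is an assumption left adjustable through `first_index`.

```python
def chunk_windows(total, chunk_size, first_index=0):
    """Yield (start_feature, max_features) pairs that together cover `total` features.

    `first_index` is an assumption: set it to 1 if the reader counts from 1.
    """
    for offset in range(0, total, chunk_size):
        yield (first_index + offset, min(chunk_size, total - offset))


# 60,000,000 features in chunks of 1,000,000 gives 60 child runs.
windows = list(chunk_windows(60_000_000, 1_000_000))
```

Each pair would become the parameter values on one trigger feature. The sketch says nothing about whether the reader can skip to the start feature without scanning the earlier ones.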


hkingsbury
Celebrity
  • July 26, 2023

You could do that, using a FeatureReader and exposing the parameters, but that approach is sometimes a bit flaky. I'd recommend using a WHERE clause instead.
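
A minimal sketch of the WHERE clause approach, assuming the data can be split by a date attribute as the original question suggests. It is plain Python, not FME code: the attribute name `record_date` is hypothetical, and the exact WHERE syntax the reader accepts is also an assumption. Each generated clause would be passed to the child workspace as a published parameter, one run per clause.

```python
from datetime import date


def month_ranges(start_year, start_month, n_months):
    """Yield (first_day, first_day_of_next_month) for n consecutive months."""
    year, month = start_year, start_month
    for _ in range(n_months):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield date(year, month, 1), date(next_year, next_month, 1)
        year, month = next_year, next_month


def where_clause(attr, lower, upper):
    """Build a half-open range filter so adjacent chunks never overlap."""
    return f"{attr} >= '{lower.isoformat()}' AND {attr} < '{upper.isoformat()}'"


# 'record_date' is a hypothetical attribute name.
clauses = [where_clause("record_date", lo, hi) for lo, hi in month_ranges(2023, 11, 3)]
```

Half-open ranges (`>=` lower, `<` upper) mean every record falls into exactly one chunk, even across the year boundary.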

