Question

Sampler - stop further upstream processing once sample size is reached


Userlevel 1
Badge +10

Is it possible to stop further upstream processing once the number of features sampled reaches the sampling rate when using First N Features?


15 replies

Userlevel 5
Badge +25

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve, it's not quite clear to me.

Userlevel 2
Badge +12

Could you use the "Max Features to Read" parameter instead of the Sampler?

Userlevel 1
Badge +10

Could you use the "Max Features to Read" parameter instead of the Sampler?

The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

Userlevel 5
Badge +25
The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.

Userlevel 1
Badge +10
Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.

Upstream is before the sampler - downstream is after the sampler. I want to avoid continuing to read further features and prolonging the processing time.

Userlevel 5
Badge +25
Upstream is before the sampler - downstream is after the sampler. I want to avoid continuing to read further features and prolonging the processing time.

Got my streams mixed up :)

Userlevel 2
Badge +12

What is your Source Format? If it is a database, you could use the WHERE clause to restrict the reading. If not, could you use a FeatureReader, using the restriction and the "Max Features to Read" parameter, to check the criteria and sample size?
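
For example, a minimal sketch of that kind of WHERE-clause restriction (the table and column names here are made up):

SELECT * FROM source_table WHERE meets_criteria = 'Y'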

We need a little more information, @egomm

Userlevel 1
Badge +10

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.
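
For illustration (made-up numbers): if the attribute values are 10, 20, 30 and 40, the cumulative totals are 10, 30, 60 and 100, so with a set value of 50 the Tester would pass only the first two features.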

Userlevel 2
Badge +12

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.

Seems like that could be done in the SQLCreator:

Something like this:

SELECT * FROM table WHERE value < (SELECT SUM(value) FROM table) AND ROWNUM < limit

Userlevel 1
Badge +10
Seems like that could be done in the SQLCreator:

Something like this:

SELECT * FROM table WHERE value < (SELECT SUM(value) FROM table) AND ROWNUM < limit

The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.

Userlevel 2
Badge +12
The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.

With the running total it will be like:

SELECT * FROM table t1 WHERE t1.value < (SELECT SUM(t2.value) FROM table t2 WHERE t2.value <= t1.value) AND ROWNUM < limit ORDER BY t1.value

 

Still fairly simple.
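
If the database supports window functions, the running total can also be written without the correlated subquery. This is only a sketch, assuming Oracle-style syntax; my_table, value, :total_limit and :sample_size are placeholder names:

-- Compute the running total once with a window function,
-- then keep only the first :sample_size rows under the limit.
SELECT *
FROM (
  SELECT t.*,
         SUM(t.value) OVER (ORDER BY t.value) AS running_total
  FROM my_table t
  ORDER BY t.value
)
WHERE running_total < :total_limit
  AND ROWNUM <= :sample_size
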
Userlevel 4
Badge +25

If you intend to use the same sample set again, why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it only saves time if you're going to use the sample dataset multiple times.

Userlevel 1
Badge +10

If you intend to use the same sample set again, why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it only saves time if you're going to use the sample dataset multiple times.

The source data is a moving target; the first 1000 records today won't be the same as the first 1000 records tomorrow.
Badge +4

I completely agree. Using the Sampler with a large data set wastes a lot of time, since it reaches the sampling rate and then continues to port the rest of the data set through the NotSampled port. I would like to see the Sampler transformer updated with an option to stop reading further records/features once the sampled limit is reached. Surely this would be a simple improvement for SAFE to implement.

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve, it's not quite clear to me.

The Terminator worked for me, somewhat. I needed the first 100 records from a file with millions of records, and it stopped reading at 100,000.
