Question

Sampler - stop further upstream processing once sample size is reached


ebygomm
Influencer

Is it possible to stop further upstream processing once the number of features sampled matches the sampling rate when using First N Features?

15 replies

redgeographics
Celebrity

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve? It's not quite clear to me.


erik_jan
Contributor
  • July 14, 2017

Could you use the "Max Features to Read" parameter instead of the Sampler?


ebygomm
Influencer
  • Author
  • July 14, 2017
erik_jan wrote:

Could you use the "Max Features to Read" parameter instead of the Sampler?

The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

redgeographics
Celebrity
ebygomm wrote:
The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.


ebygomm
Influencer
  • Author
  • July 14, 2017
redgeographics wrote:
Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.

Upstream is before the Sampler; downstream is after the Sampler. I want to avoid continuing to read further features and prolonging the processing time.


redgeographics
Celebrity
ebygomm wrote:
Upstream is before the Sampler; downstream is after the Sampler. I want to avoid continuing to read further features and prolonging the processing time.

Got my streams mixed up :)

erik_jan
Contributor
  • July 14, 2017

What is your Source Format? If it is a database, you could use a WHERE clause to restrict the reading. If not, could you use a FeatureReader, using the restriction and the "Max Features to Read" parameter to check the criteria and sample size?
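
For example, if the criteria can be expressed in SQL, a restriction along these lines could work (the table, column and values here are just placeholders, assuming an Oracle-style source where rownum limits the rows read):

Select * from source_table where criteria_field = 'some_value' and rownum <= 1000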

We need a little more information, @ebygomm


ebygomm
Influencer
  • Author
  • July 14, 2017

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total by summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.


erik_jan
Contributor
  • July 14, 2017
ebygomm wrote:

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total by summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.

Seems like that could be done in the SQLCreator:

Something like this:

Select * from table where value < (Select sum(value) from table) and rownum < limit

ebygomm
Influencer
  • Author
  • July 14, 2017
erik_jan wrote:
Seems like that could be done in the SQLCreator:

Something like this:

Select * from table where value < (Select sum(value) from table) and rownum < limit

The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.


erik_jan
Contributor
  • July 14, 2017
ebygomm wrote:
The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.

With the running total it would look like this:

Select * from table t1 where t1.value < (Select sum(t2.value) from table t2 where t2.value <= t1.value) and rownum < limit order by t1.value

Still fairly simple.
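
If the requirement is that the running total itself stays below a set value, as described earlier in the thread, a variant along these lines might be closer (set_value and sample_size are placeholders, and the running total assumes the rows are ordered by value):

Select * from table t1 where (Select sum(t2.value) from table t2 where t2.value <= t1.value) < set_value and rownum <= sample_size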


mark2atsafe
Safer

If you intend to use the same sample set again, then why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it's only saving time if you're going to use the sample dataset multiple times.


ebygomm
Influencer
  • Author
  • July 17, 2017
mark2atsafe wrote:

If you intend to use the same sample set again, then why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it's only saving time if you're going to use the sample dataset multiple times.

The source data is a moving target; the first 1000 records today won't be the same as the first 1000 records tomorrow.

bilal
Contributor
  • May 3, 2019

I completely agree. Using the Sampler with a large data set wastes a lot of time, since it reaches the sampling rate and then continues to pass the rest of the data set through the NotSampled port. I would like to see the Sampler transformer updated with an option to stop reading further records/features once the sampling limit is reached. Surely this would be a simple improvement for Safe to implement.


redgeographics wrote:

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve? It's not quite clear to me.

The Terminator worked for me, somewhat. I needed the first 100 records from a file with millions of records, and it stopped reading at 100,000.

