Question

Sampler - stop further upstream processing once sample size is reached


ebygomm
Influencer

Is it possible to stop further upstream processing once the number of features sampled matches the sampling rate when using First N Features?

15 replies

redgeographics
Celebrity

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve? It's not quite clear to me.


erik_jan
Contributor
  • July 14, 2017

Could you use the "Max Features to Read" parameter instead of the Sampler?


ebygomm
Influencer
  • Author
  • July 14, 2017
erik_jan wrote:

Could you use the "Max Features to Read" parameter instead of the Sampler?

The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

redgeographics
Celebrity
ebygomm wrote:
The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.


ebygomm
Influencer
  • Author
  • July 14, 2017
redgeographics wrote:
Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.

Upstream is before the Sampler; downstream is after the Sampler. I want to avoid continuing to read further features and prolonging the processing time.


redgeographics
Celebrity
ebygomm wrote:
Upstream is before the Sampler; downstream is after the Sampler. I want to avoid continuing to read further features and prolonging the processing time.

Got my streams mixed up :)

erik_jan
Contributor
  • July 14, 2017

What is your Source Format? If it is a database, you could use a WHERE clause to restrict the reading. If not, could you use a FeatureReader, using the restriction and the "Max Features to Read" parameter to check the criteria and sample size?
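
For example, if the criteria can be expressed in SQL, a restriction along these lines could work (the table, column and values here are just placeholders, assuming an Oracle-style source where rownum limits the rows read):

Select * from source_table where criteria_field = 'some_value' and rownum <= 1000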

We need a little more information, @ebygomm


ebygomm
Influencer
  • Author
  • July 14, 2017

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total by summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.


erik_jan
Contributor
  • July 14, 2017
ebygomm wrote:

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total by summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.

Seems like that could be done in the SQLCreator:

Something like this:

Select * from table where value < (Select sum(value) from table) and rownum < limit

ebygomm
Influencer
  • Author
  • July 14, 2017
erik_jan wrote:
Seems like that could be done in the SQLCreator:

Something like this:

Select * from table where value < (Select sum(value) from table) and rownum < limit

The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.


erik_jan
Contributor
  • July 14, 2017
ebygomm wrote:
The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.

With the running total it would look like this:

Select * from table t1 where t1.value < (Select sum(t2.value) from table t2 where t2.value <= t1.value) and rownum < limit order by t1.value

Still fairly simple.
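
If the requirement is that the running total itself stays below a set value, as described earlier in the thread, a variant along these lines might be closer (set_value and sample_size are placeholders, and the running total assumes the rows are ordered by value):

Select * from table t1 where (Select sum(t2.value) from table t2 where t2.value <= t1.value) < set_value and rownum <= sample_size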


mark2atsafe
Safer

If you intend to use the same sample set again, then why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it's only saving time if you're going to use the sample dataset multiple times.


ebygomm
Influencer
  • Author
  • July 17, 2017
mark2atsafe wrote:

If you intend to use the same sample set again, then why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it's only saving time if you're going to use the sample dataset multiple times.

The source data is a moving target; the first 1000 records today won't be the same as the first 1000 records tomorrow.

bilal
Contributor
  • May 3, 2019

I completely agree. Using the Sampler with a large data set wastes a lot of time, since it reaches the sampling rate and then continues to pass the rest of the data set through the NotSampled port. I would like to see the Sampler transformer updated with an option to stop reading further records/features once the sampling limit is reached. Surely this would be a simple improvement for Safe to implement.


redgeographics wrote:

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve? It's not quite clear to me.

The Terminator worked for me, somewhat. I needed the first 100 records from a file with millions of records, and it stopped reading at 100,000.

