Question

Sampler - stop further upstream processing once sample size is reached


Userlevel 1
Badge +10

Is it possible to stop further upstream processing once the number of features sampled reaches the sampling rate when using First N Features?


15 replies

Userlevel 5
Badge +25

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve, it's not quite clear to me.

Userlevel 2
Badge +12

Could you use the "Max Features to Read" parameter instead of the Sampler?

Userlevel 1
Badge +10

Could you use the "Max Features to Read" parameter instead of the Sampler?

The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

Userlevel 5
Badge +25
The sample I need has to meet certain criteria, which are tested in the first part of the workbench. I don't know in advance how many features I need to read to get the sample size I need.

Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.

Userlevel 1
Badge +10
Right, I think just the Sampler would do then. Set it to pass through the first x features, where x is your sample size. All other features will not be processed further upstream.

However... if you've reached your sample size and there are still features downstream, they will be processed up to the Sampler anyway, and there's no stopping that.

Upstream is before the sampler - downstream is after the sampler. I want to avoid continuing to read further features and prolonging the processing time.

Userlevel 5
Badge +25
Upstream is before the sampler - downstream is after the sampler. I want to avoid continuing to read further features and prolonging the processing time.

Got my streams mixed up :)

Userlevel 2
Badge +12

What is your Source Format? If it is a database, you could use the WHERE clause to restrict the reading. If not, could you use a FeatureReader, using the restriction and the "Max Features to Read" parameter, to check the criteria and sample size?
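
For example, a minimal sketch of that kind of WHERE-clause restriction (the table and column names here are made up):

SELECT * FROM source_table WHERE meets_criteria = 'Y'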

We need a little more information, @egomm

Userlevel 1
Badge +10

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.
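
For illustration (made-up numbers): if the attribute values are 10, 20, 30 and 40, the cumulative totals are 10, 30, 60 and 100, so with a set value of 50 the Tester would pass only the first two features.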

Userlevel 2
Badge +12

Features are read into the workspace using an SQLCreator. A StatisticsCalculator is then used to create a cumulative total summing one of the attributes, a Tester then passes features where the cumulative total is less than a set value, and the Sampler should then return the first x of these records.

Seems like that could be done in the SQLCreator:

Something like this:

SELECT * FROM table WHERE value < (SELECT SUM(value) FROM table) AND ROWNUM < limit

Userlevel 1
Badge +10
Seems like that could be done in the SQLCreator:

Something like this:

SELECT * FROM table WHERE value < (SELECT SUM(value) FROM table) AND ROWNUM < limit

The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.

Userlevel 2
Badge +12
The SQL requires a running total, which complicates matters. I'm trying to avoid creating a workspace with complex SQL queries that others will struggle to support, so I will probably just live with the extra wasted processing time.

With the running total it will be like:

SELECT * FROM table t1 WHERE t1.value < (SELECT SUM(t2.value) FROM table t2 WHERE t2.value <= t1.value) AND ROWNUM < limit ORDER BY t1.value

 

Still fairly simple.
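
If the database supports window functions, the running total can also be written without the correlated subquery. This is only a sketch, assuming Oracle-style syntax; my_table, value, :total_limit and :sample_size are placeholder names:

-- Compute the running total once with a window function,
-- then keep only the first :sample_size rows under the limit.
SELECT *
FROM (
  SELECT t.*,
         SUM(t.value) OVER (ORDER BY t.value) AS running_total
  FROM my_table t
  ORDER BY t.value
)
WHERE running_total < :total_limit
  AND ROWNUM <= :sample_size
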
Userlevel 4
Badge +25

If you intend to use the same sample set again, why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it only saves time if you're going to use the sample dataset multiple times.

Userlevel 1
Badge +10

If you intend to use the same sample set again, why not create a workspace that samples the data and writes the samples to an FFS file? Then use the FFS file as the source data in your main workspace. Of course, it only saves time if you're going to use the sample dataset multiple times.

The source data is a moving target; the first 1000 records today won't be the same as the first 1000 records tomorrow.
Badge +4

I completely agree. Using the Sampler with a large data set wastes a lot of time, since it reaches the sampling rate and then continues to port the rest of the data set through the NotSampled port. I would like to see the Sampler transformer updated with an option to stop reading further records/features once the sampled limit is reached. Surely this would be a simple improvement for SAFE to implement.

A Terminator attached to the non-sampled port maybe?

Could you explain a bit more what you're trying to achieve, it's not quite clear to me.

The Terminator worked for me, somewhat. I needed the first 100 records from a file with millions of records, and it stopped reading at 100,000.
