The basic workflow is:

  1. Read in features
  2. Count them (via StatisticsCalculator, Total Count)
  3. Depending on count value, calculate sample value
  4. Random sample features based on calculated sample value
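The steps above can be sketched in plain Python. This is only an illustration of the logic, not FME code; the `rate` and `minimum` parameters are assumptions standing in for whatever rule you use in step 3 to derive the sample size from the count:

```python
import random

def sample_features(features, rate=0.1, minimum=2):
    """Count the features, derive a sample size from the count,
    then randomly sample that many features."""
    total = len(features)                           # step 2: total count
    k = max(minimum, int(total * rate))             # step 3: derive sample size
    return random.sample(features, min(k, total))   # step 4: random sample

picked = sample_features(list(range(100)), rate=0.05)
len(picked)  # 5
```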

The Sampler doesn't allow the Sampling Rate (N) to be set from an attribute value. Is there a workaround?

No, that won't work. The Sampler requires the sampling rate to be a single value for all features. If you use attributes there is the potential that they have different values, which can lead to all kinds of issues.

I thought maybe wrapping the Sampler in a Custom Transformer and exposing the Sampling Rate parameter as a parameter of the Custom Transformer would work, but no...

There is the option to use a User Parameter for the sampling rate, but if you want to use that you'd have to cut the process in to two parts, one to perform steps 1-3 from your list and then call a 2nd workspace, using a WorkspaceRunner, to do part 4 using a User Parameter as input. The downside is that you're reading all of your data twice.


In my case, the sample size attribute will always have the same value on every feature.

Yeah, the two-workspace method is my least favorite and the last resort if no other method works. But maybe it's possible to pass the sample size attribute value in as the sampling rate via Python or something similar.


As an alternative, you could generate a random number attribute on each feature, sort by that attribute, add a Counter, and then use a Tester to keep only the features where the count is less than the sample size.


Count <= sample size in a Tester would give you the sample size. So, generate a random number, sort, and then take the first n records of the sorted random numbers (e.g. the first two for a sample size of 2).
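That transformer chain (random number generator, Sorter, Counter, Tester) is equivalent to this stand-alone Python sketch; the function name and signature are purely illustrative, not part of FME:

```python
import random

def random_sample(features, sample_size):
    # Random number generator: attach a random key to each feature
    keyed = [(random.random(), f) for f in features]
    # Sorter: order the features by their random key
    keyed.sort(key=lambda pair: pair[0])
    # Counter + Tester: keep features whose running count <= sample_size
    return [f for count, (_, f) in enumerate(keyed, start=1)
            if count <= sample_size]

chosen = random_sample(["a", "b", "c", "d", "e"], 2)
len(chosen)  # 2
```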

[Screenshot: Sample]


If you did want to go down the Python route, here's a PythonCaller example:

import fme
import fmeobjects
import random

class FeatureProcessor(object):
    def __init__(self):
        # Create an empty list to hold incoming features
        self.features = []
        self.n = None

    def input(self, feature):
        # Set the sample size from the first feature entering the transformer
        # (cast to int, since the attribute may be read as a string)
        if self.n is None:
            self.n = int(feature.getAttribute('samplesize'))
        # Add each feature to the list
        self.features.append(feature)

    def close(self):
        # Shuffle the list of features
        random.shuffle(self.features)
        if len(self.features) < self.n:
            print("Error: sample size greater than number of features")
        else:
            # Output the first n features
            for x in range(self.n):
                self.pyoutput(self.features[x])

But in this circumstance I would go with the random number generator, Sorter, Counter, and Tester option.
