Question

Quick alternative to StatisticsCalculator?


I'm calling 6 separate StatisticsCalculator transformers, and they each take a (relatively) long time to run. All I'm calculating each time is a min, max and count attribute so this seems like overkill.

Is there an easy alternative I'm missing? Is there a way to use a math function against the entire dataset for example?


15 replies

Have a look at the InlineQuerier.

A SQL statement might be easier and quicker.

Depending on your data, it might also be quicker to sort on your attribute and sample the first and/or last feature for the min/max.

The quickest way I've found to get a total count attribute is to use the Aggregator (with List) with a count attribute, followed by a Deaggregator to restore the original features.

If your data source is a SQL database, I think it would be hard to beat the performance of a SQLExecutor with something like:

select 
  max(my_value) as my_max, 
  min(my_value) as my_min, 
  sum(my_value) as my_sum 
from 
  my_table
where
  ...

Agreed. However, if the source data is not a database and is large enough that the StatisticsCalculator takes a long time to run, I wonder about the overhead of generating the SQLite database in the InlineQuerier.

 

 

I changed my mind: the quickest way to get a total feature count for medium-sized datasets is via Python (10% more efficient than the Aggregator/Deaggregator with 10,000 features).

 

 

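class FeatureProcessor(object):
    def __init__(self):
        self.Total = 0      # running feature count
        self.Features = []  # features held back until the total is known

    def input(self, feature):
        # Called once per incoming feature: count it and keep it for later.
        self.Total += 1
        self.Features.append(feature)

    def close(self):
        # Called after the last feature: write the total onto every stored
        # feature and send it downstream.
        for feature in self.Features:
            feature.setAttribute("TotalCount", self.Total)
            self.pyoutput(feature)

I haven't done any benchmarking for large or complex datasets (lots of attributes, heavy geometry).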

I suspect that if you have a large amount of data, the InlineQuerier overhead may not be too bad compared to running 6 StatisticsCalculators one after the other. When you set up the InlineQuerier, be sure to define only the columns you will select against, to speed things up even more.
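
To illustrate the idea outside FME, here is a minimal sketch using Python's standard sqlite3 module. It is not what the InlineQuerier does internally; it only shows the same pattern of loading a single column into a temporary SQLite table and computing min, max and count in one query. The sample records and the names my_table and my_value are hypothetical.

```python
import sqlite3

# Hypothetical records standing in for features read from a non-database source.
features = [
    {"id": 1, "my_value": 4.2, "label": "a"},
    {"id": 2, "my_value": 1.5, "label": "b"},
    {"id": 3, "my_value": 9.0, "label": "c"},
]

conn = sqlite3.connect(":memory:")

# Load only the column the query needs; the other attributes are left out
# so the temporary table stays as small as possible.
conn.execute("CREATE TABLE my_table (my_value REAL)")
conn.executemany(
    "INSERT INTO my_table (my_value) VALUES (?)",
    [(f["my_value"],) for f in features],
)

# One pass over the table yields all three statistics.
my_min, my_max, my_count = conn.execute(
    "SELECT min(my_value), max(my_value), count(*) FROM my_table"
).fetchone()
conn.close()

print(my_min, my_max, my_count)
```

Whether this beats several StatisticsCalculators on real data would need to be measured in the workspace itself.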

 

 

Hi @ld, before thinking of an alternative, I'd like to know why you need to use 6 StatisticsCalculators. If you intend to perform statistics calculation simultaneously on 6 attributes in the same feature type, just a single StatisticsCalculator would be enough.

Hi, they're all operating on separate datasets at different stages of the workbench. Believe me I would put them all through the same transformer if that were an option!

 

 

Hi, the input data is a .dwg and an XML file, not a SQL database.

 

 

Thank you, why didn't I think of Python? I will try this.

 

I got it. As the other experts suggested, the InlineQuerier or the PythonCaller would be a good solution.

Here is another way. If you are using FME 2018.0, you can use the Sampler (sampling only the last feature) to do that effectively. This custom transformer implementation might be useful.

 

quick-statisticscalculator.fmw (FME 2018.0.0.3)

 

 

 

Updated to add the source attribute name as prefix to the resulting attribute names.

 

quick-statisticscalculator-2.fmw (FME 2018.0.0.3)

 

 

Thanks @takashi, I'm stuck with 2017.1 for a little while longer but will look at this when our organisation updates to 2018.

 

It should be fairly simple to adapt the PythonCaller code to also track min and max values for any number of attributes.

 

 

The above code assumes that you want the values added to every feature. If you just want a summary feature instead, you don't need to build the feature list and can output a single feature in the close method. This will be a lot more efficient, since the features do not all need to be stored in memory.
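
A rough sketch of that summary-only variant, written so it runs outside FME: plain dicts stand in for features, and the class name RunningStats, its methods, and the attribute names height and width are all hypothetical. Inside a PythonCaller you would presumably call add(feature.getAttribute) from input() and write the summary values onto a single new feature in close(), assuming the accessor returns None for a missing attribute.

```python
class RunningStats:
    """Tracks count, min and max per attribute without storing any features."""

    def __init__(self, attr_names):
        self.attr_names = list(attr_names)
        self.count = 0
        self.minimum = {}
        self.maximum = {}

    def add(self, get_attribute):
        # get_attribute: any callable mapping an attribute name to its value,
        # or to None when the value is missing (assumption).
        self.count += 1
        for name in self.attr_names:
            raw = get_attribute(name)
            if raw is None or raw == "":
                continue  # missing values still count as features, but are skipped for min/max
            value = float(raw)
            if name not in self.minimum or value < self.minimum[name]:
                self.minimum[name] = value
            if name not in self.maximum or value > self.maximum[name]:
                self.maximum[name] = value

    def summary(self):
        # Result names are prefixed with the source attribute name.
        result = {"TotalCount": self.count}
        for name in self.attr_names:
            result[name + "_min"] = self.minimum.get(name)
            result[name + "_max"] = self.maximum.get(name)
        return result


# Hypothetical input: dicts standing in for features.
rows = [
    {"height": "3.5", "width": 2},
    {"height": 1, "width": None},
    {"height": 7, "width": 10},
]

stats = RunningStats(["height", "width"])
for row in rows:
    stats.add(row.get)

summary = stats.summary()
print(summary)
```

Memory use stays constant regardless of feature count, because only the running values are kept rather than the features themselves.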

 

 

@takashi's Sampler approach is a nice, efficient way of getting a summary feature.

 

 
