
I need to remove duplicates from datasets of varying sizes. Would it be more efficient to use a Sorter with a Sampler, or a StatisticsCalculator with a Tester?

I would use the Sampler with Group Processing set to whichever attribute holds your duplicate value. There is no need to use the Sorter before it. Alternatively, you can use the DuplicateFilter.
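
To illustrate the idea outside of FME, here is a minimal Python sketch of what keeping the first feature per group amounts to. The function name `first_per_group` and the sample records are hypothetical, and this is not how the Sampler or the DuplicateFilter is implemented internally; it only shows why no prior sort is needed.

```python
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def first_per_group(records: Iterable[T], key: Callable[[T], Hashable]) -> Iterator[T]:
    """Yield the first record seen for each key value, in input order."""
    seen = set()  # key values that have already been emitted
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            yield record  # passed on immediately, no sorting required


# Hypothetical sample data: "id" plays the role of the duplicate value.
rows = [
    {"id": 3, "name": "c"},
    {"id": 1, "name": "a"},
    {"id": 3, "name": "c-duplicate"},
    {"id": 1, "name": "a-duplicate"},
]
unique_rows = list(first_per_group(rows, key=lambda r: r["id"]))
```

Here `unique_rows` keeps one record per `id`, in the order the records arrived.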


Didn't even think of that! That really helps keep things simple!

 


I would avoid both the Sorter and the StatisticsCalculator. Both are blocking transformers, meaning they hold every feature until the whole input has arrived, so they can consume a lot of memory on larger datasets.
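
As a rough picture of what "blocking" costs, the hypothetical `dedupe_by_sorting` below has to read and hold every record before it can emit the first one. It is plain Python for illustration only, under the assumption that records are simple dictionaries; it is not FME's Sorter.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_by_sorting(records, field):
    """Sort-based de-duplication: nothing is emitted until all input is read."""
    # sorted() materialises the whole input in memory (the "blocking" step).
    ordered = sorted(records, key=itemgetter(field))
    # After sorting, duplicates sit next to each other, so keep one per run.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(field))]


# Hypothetical sample data.
rows = [{"id": 2}, {"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}]
deduped = dedupe_by_sorting(rows, "id")
```

A filter that only tracks the key values it has already seen still needs memory for those keys, but it can pass each feature on as soon as it arrives instead of waiting for the full dataset.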

As @dustin mentions, either the Sampler with a Group By or the DuplicateFilter (my personal recommendation) will work.


@bibold I would agree with @david_r: try the DuplicateFilter first. It should perform better than sorting.

