Question

Counting distinct values per column

  • 6 April 2020
  • 4 replies
  • 65 views

The starting point for my question is a table containing information about fruits and vegetables:

IDNameTypeColorKcal1AppleFruitGreen522BananaFruitYellow893CarrotVegetableOrange414KaleVegetableGreen395ZucchiniVegetableGreen17

 

I would like to use FME to list all distinct values per column, including a count per distinct value, such as:

ColumnNameVarietyNameCountNameApple1NameBanana1NameCarrot1NameKale1NameZucchini1TypeFruit2TypeVegetable3ColorGreen3ColorYellow1ColorOrange1Kcal521Kcal891Kcal411Kcal391Kcal171

 

I was considering to use an AttributeExploder to list all columns in separate rows, followed by a StatisticsCalculator to determine the count per variety. However, that seems to consume a lot of time for larger datasets (containing over 1.000.000 records).

Does anyone have suggestions how to accomplish the above result in a better fashion?


4 replies

Userlevel 2
Badge +17

Hi @fentanyl,

The Aggregator might be faster than the StatisiticsCalculator, especially before FME 2020:

Badge

Use an InlineQuerier & SQL Union statements....

SELECT 'Name' AS "ColumnName",
        Name AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Name
UNION
SELECT 'Type' AS "ColumnName",
        Type AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Type
UNION
SELECT 'Color' AS "ColumnName",
        Color AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Color
UNION
SELECT 'Kcal' AS "ColumnName",
        Kcal AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Kcal

I'm not sure how that'll affect your processing speed, but worth a try. 

0684Q00000ArJAcQAN.png

Badge

Use an InlineQuerier & SQL Union statements....

SELECT 'Name' AS "ColumnName",
        Name AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Name
UNION
SELECT 'Type' AS "ColumnName",
        Type AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Type
UNION
SELECT 'Color' AS "ColumnName",
        Color AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Color
UNION
SELECT 'Kcal' AS "ColumnName",
        Kcal AS "VarietyName",
        COUNT(ID) AS "Count"
FROM TableName
GROUP BY Kcal

I'm not sure how that'll affect your processing speed, but worth a try. 

0684Q00000ArJxwQAF.png

Userlevel 1
Badge +10

Python will probably be more efficient. I'm not sure how well/if this would work for larger/other datasets

 

count_unique.fmw

Reply