Question

Export Unique Values For Each Field From Multiple Datasets

  • 3 March 2017
  • 5 replies
  • 36 views

Badge

I'm trying to determine all of the unique values for each field for a dataset. I'd like to create a dynamic workspace because I need to do this for 100+ datasets. Each dataset has a different number of fields with different field names.

I'd either like to export a csv for each dataset with the field name and all the unique values listed.

The data will be coming from feature classes in an Oracle database. Any suggestions?


5 replies

Userlevel 2
Badge +12

Not sure if this can be done in FME alone.

But I would crate a value table first in Orcale using a PL/SQL script like this:

create table values_table(table varchar2(30), column varchar2(30), value varchar2(2000));

declare l_c cursor is select table_name, column_name from user_tab_columns; l_s_statement varchar2(2000); l_value varchar2(2000); begin for l_r in l_c loop for l_s_value in 'select distinct l_r.column_name from l_r.table_name' loop Insert into values_table (l_r.table_name, l_r.column_name, l_s_value ); end loop; end loop; end;Hope this helps solving your challenge.
Badge +16

Hi @mikey,

I think this can be done with creating a list of the values for each field and using the list histogrammer to find all the unique values per field.

Hope this helps.

Userlevel 4

You can use a dynamic reader (expose fme_feature_type) followed by an AttributeExploser. You can then create content statistics using a group by on the fme_feature_type and _attr_name, e.g. using a StatisticsCalculator.

That said, I think a solution like the one suggested by @erik_jan will be faster by an order of magnitude. The reason is that you won't have to extract the data from the database first, and also that using the AttributeExploder on large datasets tend to consume a lot of memory.

Userlevel 6
Badge +31

Or read the records from all_tab_cols with the SQLExecutor in FME and dynamically create "select distinct" statements for all table / column combinations. This way you can easily exclude columns containing primary keys etc as it is useless and time consuming to get the unique values of those.

This example took 1 minute to run.

00-workbench.fmw

Generally speaking it is best practice to let the database do the work if possible. Instead of loading all the data in FME and process it locally.

Select distinct doesn't work on BLOB's / CLOB's tho. You might need query all not null values from the db to analyse those.

Badge +22

The most efficient way I found for generic data (ie no taking advantage of built in database functionality) was to use python, to get the attribute names (feature.getAllAttributeNames()) add them to a master list (not all features necessarily have all attributes) and then loop through the masterList and for each attribute loop through the features and built the unique value list. I outputted one feature per attribute, with a list of unique values as well the number of features that had an "empty" value for that attribute.

Reply