I have about 20 tables which I would like to do some data profiling on. I would like to generate a summary showing the percentage of null or empty values in each column.
For example, if the input table looks like this:
I would want an output like this:
I have played around with a few transformers like AttributeValidator, StatisticsCalculator and ListBuilder, but is PythonCaller really needed here?
I am using FME 2020
Thanks!
Best answer by ebygomm
Python is probably the most efficient way if you have lots of tables, but there are ways to do this without it.
e.g. a NullAttributeMapper to map all values to 0, a second to map all null values to 1 (edited for accuracy), then an AttributeExploder followed by an Aggregator on AttributeValue, summing the values.
Or, after the NullAttributeMappers, you could use a StatisticsCalculator to sum the values and then pivot the output.
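If you do end up reaching for PythonCaller, a minimal sketch might look like the following. It uses the documented fmeobjects API, but the output attribute names are illustrative, not a definitive implementation. One caveat: attributes that are missing from a feature don't appear in getAllAttributeNames(), so counting truly "missing" values would need the full schema list up front.

import fmeobjects

class FeatureProcessor(object):
    def __init__(self):
        self.total = 0
        self.counts = {}

    def input(self, feature):
        # Tally null or empty values per attribute on each incoming feature
        self.total += 1
        for name in feature.getAllAttributeNames():
            is_null, is_missing, _ = feature.getAttributeNullMissingAndType(name)
            if is_null or is_missing or feature.getAttribute(name) == '':
                self.counts[name] = self.counts.get(name, 0) + 1

    def close(self):
        # Emit one summary feature per attribute
        # (pyoutput has been allowed in close() since FME 2016)
        for name, count in self.counts.items():
            summary = fmeobjects.FMEFeature()
            summary.setAttribute('attr_name', name)
            summary.setAttribute('percent_null', 100.0 * count / self.total)
            self.pyoutput(summary)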
If your backend database supports SQL, that would be my preferred solution, as it would be an order of magnitude quicker than doing it in FME, especially for larger tables. You can use the SQLExecutor in FME to get the results back into FME.
For example you could use the "Schema (any format)" reader to retrieve all the tables and column names, then send them to the SQLExecutor.
Example SQL:
SELECT COUNT(1) AS TotalRowCount
,COUNT(myColumn) AS TotalNotNull
,COUNT(1) - COUNT(myColumn) AS TotalNull
,100.0 * COUNT(myColumn) / COUNT(1) AS PercentNotNull
FROM myTable
This calculates the percentage of records where column myColumn is not null in table myTable.
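To run this across every column without hand-writing each statement, you could template the query. Here's a hedged Python sketch; the table and column names are placeholders standing in for what the schema reader would supply:

def profiling_sql(table, column):
    # Double quotes mark SQL identifiers, so unusual column names survive
    return (
        'SELECT COUNT(1) AS TotalRowCount'
        ', COUNT("{c}") AS TotalNotNull'
        ', COUNT(1) - COUNT("{c}") AS TotalNull'
        ', 100.0 * COUNT("{c}") / COUNT(1) AS PercentNotNull'
        ' FROM "{t}"'
    ).format(c=column, t=table)

for col in ('id', 'name', 'created_date'):  # e.g. from the schema reader
    print(profiling_sql('myTable', col))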
This is a great answer. I think you meant "a NullAttributeMapper to map all values to 0, a second to map all null values to 1". I hadn't realised that NullAttributeMapper could use a regex to process everything that isn't null or empty.
One problem I found in the solution was the need to use Selected Attributes on the NullAttributeMapper in order to process the "missing" values. I found that I then needed a new transformer for every input table. I got around this by first converting all the data to SQLite, which I think then uses proper nulls rather than missing values.
Yes, map all null values to 1; I've edited the answer now, thanks.
I got interested again in this solution since, as you point out, it would scale up quite well. However, it doesn't seem to work with SQLite. My input data is mainly spreadsheets and Esri shapefiles; it's easy to add a FeatureWriter to convert to SQLite Non-Spatial.
Here is my SQL from the SQLExecutor. I need a different SQL statement for each column, so I have substituted the column name in. What I think happens is that the SQLExecutor handles the substitution correctly where the substituted string stands in for a SQL expression, but not where it stands in for a schema reference such as COUNT(column_name).
SELECT '@Value(_attr_name)' AS AttrName
,COUNT(1) AS TotalRowCount
,COUNT('@Value(_attr_name)') AS TotalNotNull
,COUNT(1) - COUNT('@Value(_attr_name)') AS TotalNull
,100.0 * COUNT('@Value(_attr_name)') / COUNT(1) AS PercentNotNull
FROM"ForumExample"
Here is the table leading into my SQL Executor
And this is what I get out:
Indeed, you need to put double quotation marks around the column references (attribute names) inside the COUNT functions. If not, they'll be interpreted by SQLite as string literals rather than column references.
SELECT '@Value(_attr_name)' AS AttrName
,COUNT(1) AS TotalRowCount
,COUNT("@Value(_attr_name)") AS TotalNotNull
,COUNT(1) - COUNT("@Value(_attr_name)") AS TotalNull
,100.0 * COUNT("@Value(_attr_name)") / COUNT(1) AS PercentNotNull
FROM"ForumExample"
As an aside, there is a setting when reading Excel files that allows you to map empty cells to null values instead of missing.
Thank you - your SQL works. I hadn't appreciated that subtlety about single versus double quotes.