Skip to main content

I have a database table in which a number of the records are duplicated. I'd like to remove the duplicates but like to keep 1 record (so similar to the duplicateFilter unique port).

I have no unique identifier in the table so if I match a record (based on a combination of attributes) I'll remove all of them from the table, I need to add 1 of them again afterwards.

 

 

Currently I use the following approach:

  • duplicateFilter to find the duplicates
  • remove the records from the table using fme_db_operation = DELETE (this re
  • re-insert the unduplicated record

 

Any suggestions to improve this methodology?

Truncate the table and write all the non duplicates in again? Depends on what percentage of the table are duplicates


The fastest is almost certainly to use regular SQL in a SQLExecutor. Here's an example using in SQL Server:

WITH cte AS (
    SELECT 
        contact_id, 
        first_name, 
        last_name, 
        email, 
        ROW_NUMBER() OVER (
            PARTITION BY 
                first_name, 
                last_name, 
                email
            ORDER BY 
                first_name, 
                last_name, 
                email
        ) row_num
     FROM 
        sales.contacts
)
DELETE FROM cte
WHERE row_num > 1;

Assumes you have a table "sales.contacts" and you want to remove all duplicates of the combination of first_name + last_name + email. It only needs to be executed once. If you have the right indexes this should be very fast.


hi @david_r I am facing the same issue of duplicate records in the database table. Some records are being inserted multiple times in the database. I want to keep only one record and remove the duplicates, preserving the original or unique record in the table. I haven't used SQL Executor before and was wondering if the SQL query mentioned above would work for my situation as well.Thanks!


Hi @fme_can_do_it , that should work for Microsoft SQL Server. If you’re using anything else it might need to be adapted to that particular SQL flavor.

You can check the syntax by using this non-destructive query (i.e. nothing gets deleted, only selected):

WITH cte AS (
SELECT
contact_id, -- modify as needed
first_name, -- modify as needed
last_name, -- modify as needed
email, -- modify as needed
ROW_NUMBER() OVER (
PARTITION BY
first_name, -- modify as needed
last_name, -- modify as needed
email -- modify as needed
ORDER BY
first_name, -- modify as needed
last_name, -- modify as needed
email -- modify as needed
) row_num
FROM
sales.contacts -- modify as needed
)
SELECT * FROM cte
WHERE row_num > 1;

Of course you’ll have to modify/add/remove the columns and also modify your table name, I’ve flagged those with the comment “--modify as needed”

If you’re happy with the selection of records, you can delete the returned records by simply modifying the second to last line into DELETE FROM cte as in the example above. If you’re unsure, I recommend making a backup of the table first.


Of course, you could also do the same thing very easily with only FME and no SQL, e.g.

Reader → DuplicateFilter (Port: Duplicate) → DatabaseDeleter


@david_r  i used Duplicate filter in my workflow and routed all duplicate features to the writer. In the configuration of this writer, I specified the delete operation based on the objectID column, which serves as a unique identifier. It worked! thank you!


Reply