Skip to main content
Solved

Remove duplicates from database table (keeping all unique record)

  • January 11, 2021
  • 6 replies
  • 468 views

joeri
Contributor
Forum|alt.badge.img+6
  • Contributor

I have a database table in which a number of the records are duplicated. I'd like to remove the duplicates but like to keep 1 record (so similar to the duplicateFilter unique port).

I have no unique identifier in the table so if I match a record (based on a combination of attributes) I'll remove all of them from the table, I need to add 1 of them again afterwards.

 

 

Currently I use the following approach:

  • duplicateFilter to find the duplicates
  • remove the records from the table using fme_db_operation = DELETE (this re
  • re-insert the unduplicated record

 

Any suggestions to improve this methodology?

Best answer by david_r

The fastest is almost certainly to use regular SQL in a SQLExecutor. Here's an example using in SQL Server:

WITH cte AS (
    SELECT 
        contact_id, 
        first_name, 
        last_name, 
        email, 
        ROW_NUMBER() OVER (
            PARTITION BY 
                first_name, 
                last_name, 
                email
            ORDER BY 
                first_name, 
                last_name, 
                email
        ) row_num
     FROM 
        sales.contacts
)
DELETE FROM cte
WHERE row_num > 1;

Assumes you have a table "sales.contacts" and you want to remove all duplicates of the combination of first_name + last_name + email. It only needs to be executed once. If you have the right indexes this should be very fast.

View original
Did this help you find an answer to your question?

6 replies

ebygomm
Influencer
Forum|alt.badge.img+32
  • Influencer
  • January 11, 2021

Truncate the table and write all the non duplicates in again? Depends on what percentage of the table are duplicates


david_r
Celebrity
  • Best Answer
  • January 11, 2021

The fastest is almost certainly to use regular SQL in a SQLExecutor. Here's an example using in SQL Server:

WITH cte AS (
    SELECT 
        contact_id, 
        first_name, 
        last_name, 
        email, 
        ROW_NUMBER() OVER (
            PARTITION BY 
                first_name, 
                last_name, 
                email
            ORDER BY 
                first_name, 
                last_name, 
                email
        ) row_num
     FROM 
        sales.contacts
)
DELETE FROM cte
WHERE row_num > 1;

Assumes you have a table "sales.contacts" and you want to remove all duplicates of the combination of first_name + last_name + email. It only needs to be executed once. If you have the right indexes this should be very fast.


fme_can_do_it
Enthusiast
Forum|alt.badge.img+9

hi @david_r I am facing the same issue of duplicate records in the database table. Some records are being inserted multiple times in the database. I want to keep only one record and remove the duplicates, preserving the original or unique record in the table. I haven't used SQL Executor before and was wondering if the SQL query mentioned above would work for my situation as well.Thanks!


david_r
Celebrity
  • April 9, 2024

Hi @fme_can_do_it , that should work for Microsoft SQL Server. If you’re using anything else it might need to be adapted to that particular SQL flavor.

You can check the syntax by using this non-destructive query (i.e. nothing gets deleted, only selected):

WITH cte AS (
    SELECT 
        contact_id, -- modify as needed
        first_name, -- modify as needed 
        last_name,  -- modify as needed
        email,  -- modify as needed
        ROW_NUMBER() OVER (
            PARTITION BY 
                first_name,  -- modify as needed
                last_name,  -- modify as needed
                email -- modify as needed
            ORDER BY 
                first_name,  -- modify as needed
                last_name,  -- modify as needed
                email -- modify as needed
        ) row_num
     FROM 
        sales.contacts -- modify as needed
)
SELECT * FROM cte
WHERE row_num > 1;

Of course you’ll have to modify/add/remove the columns and also modify your table name, I’ve flagged those with the comment “--modify as needed”

If you’re happy with the selection of records, you can delete the returned records by simply modifying the second to last line into DELETE FROM cte as in the example above. If you’re unsure, I recommend making a backup of the table first.


david_r
Celebrity
  • April 9, 2024

Of course, you could also do the same thing very easily with only FME and no SQL, e.g.

Reader → DuplicateFilter (Port: Duplicate) → DatabaseDeleter


fme_can_do_it
Enthusiast
Forum|alt.badge.img+9

@david_r  i used Duplicate filter in my workflow and routed all duplicate features to the writer. In the configuration of this writer, I specified the delete operation based on the objectID column, which serves as a unique identifier. It worked! thank you!


Reply


Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

 
Cookie settings