Question

Data duplication detection

6 years ago
April 2, 2019
10 replies
93 views

+20

philippeb
Enthusiast
307 replies

Hi,

I have a database with FIRST_NAME, NAME and BIRTH_DATE.

I have to find all the potentially duplicate people.

If two features have FIRST_NAME and NAME values that are identical or look similar, I have to compare them to check whether they have the same BIRTH_DATE. If they do, it's the same person; if they don't, they are two different people.

In this example, John Lennon and Jon Lenon are written almost the same way and they have the same BIRTH_DATE, so we can consider them the same person.

But John Lennon (1950-01-01) is not the same John Lennon (1985-01-01).

ID  FIRST_NAME  NAME       BIRTH_DATE
1   John        Lennon     1950-01-01
2   Jon         Lenon      1950-01-01
3   John        Lennon     1985-01-01
4   Ringo       Star       1945-01-01
5   Ringolo     Star       1945-01-01
6   Ringo       Star       2000-01-01
7   George      Harrisson  2000-01-01

Do you have an idea how to do this kind of comparison? I tried the FuzzyDuplicateRemover, but it doesn't give very good results and it doesn't store the potential duplicates it finds in a list (so I can't compare BIRTH_DATE).

Thanks!

miladahmad
97 replies
6 years ago
April 2, 2019

Which criterion is used to determine that John Lennon and Jon Lenon are the same feature?

Anyway, if you want to detect the truly duplicated features, see the attached image.

+20

philippeb
Author
Enthusiast
307 replies
6 years ago
April 10, 2019

After a lot of thinking about this, I decided to start with a ListBuilder on the Birth_Date.

Then I would run a FuzzyDuplicateRemover on each of the lists (I don't know how to do that part yet, though).

It would be great to have a ListFuzzyDuplicateRemover.
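Outside of FME, the same idea (group by birth date first, then fuzzy-compare names only within each group) could be sketched in Python roughly as below. This is only an illustration under assumptions: the function name `candidate_duplicates`, the 0.85 threshold, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are my choices, not what FuzzyDuplicateRemover does internally.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Sample rows from the question: (ID, FIRST_NAME, NAME, BIRTH_DATE)
PEOPLE = [
    (1, "John", "Lennon", "1950-01-01"),
    (2, "Jon", "Lenon", "1950-01-01"),
    (3, "John", "Lennon", "1985-01-01"),
    (4, "Ringo", "Star", "1945-01-01"),
    (5, "Ringolo", "Star", "1945-01-01"),
    (6, "Ringo", "Star", "2000-01-01"),
    (7, "George", "Harrisson", "2000-01-01"),
]

def candidate_duplicates(rows, threshold=0.85):
    """Return sorted (id_a, id_b) pairs that share a BIRTH_DATE and have similar names.

    The threshold is an arbitrary assumption; tune it on real data.
    """
    # Step 1: one list of (id, full name) per birth date, like a ListBuilder.
    by_date = defaultdict(list)
    for pid, first, last, birth in rows:
        by_date[birth].append((pid, f"{first} {last}".lower()))

    # Step 2: compare every pair of names, but only inside the same list.
    pairs = []
    for group in by_date.values():
        for (id_a, name_a), (id_b, name_b) in combinations(group, 2):
            if SequenceMatcher(None, name_a, name_b).ratio() >= threshold:
                pairs.append((id_a, id_b))
    return sorted(pairs)

print(candidate_duplicates(PEOPLE))  # [(1, 2), (4, 5)]
```

Grouping first means the two John Lennons with different birth dates are never compared at all, and the number of pairwise comparisons stays small.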

FME Lover

+20

philippeb
Author
Enthusiast
307 replies
6 years ago
April 10, 2019

miladahmad wrote:

Which criterion is used to determine that John Lennon and Jon Lenon are the same feature?

Anyway, if you want to detect the truly duplicated features, see the attached image.

I thought maybe a ratio calculated from the number of letters that are the same between the two values.

I think the Fuzzy transformer will do the job though.
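For what it's worth, one possible reading of that "shared letters" ratio can be sketched in Python as follows. The function name `shared_letter_ratio` and the exact formula (twice the number of letters in common, divided by the total number of letters) are assumptions for illustration only.

```python
from collections import Counter

def shared_letter_ratio(a, b):
    """2 * (letters in common) / (total letters), ignoring case, spaces, and order."""
    letters_a = Counter(c for c in a.lower() if c.isalpha())
    letters_b = Counter(c for c in b.lower() if c.isalpha())
    total = sum(letters_a.values()) + sum(letters_b.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared letter.
    common = sum((letters_a & letters_b).values())
    return 2 * common / total

print(round(shared_letter_ratio("John Lennon", "Jon Lenon"), 3))   # 0.889
print(round(shared_letter_ratio("John Lennon", "Ringo Star"), 3))  # 0.211
```

Because this measure ignores letter order, anagrams score 1.0, so an order-aware measure such as an edit distance is probably the safer choice for names.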

FME Lover

+39

ebygomm
Influencer
3312 replies
6 years ago
April 10, 2019

Building on your idea of comparing each feature in a list for similarity.

I found some Python online for computing the Levenshtein distance, then used it to compare each value in the list to every other value. If the ratio is > 89 (an arbitrary figure I picked), the pair is treated as a match and assigned a match_group. A feature is output for each ID in the list.

fuzzyfun.fmwt
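For anyone who can't open the workspace, here is a rough Python sketch of the approach described above. It is not the script from the attachment: the function names, the ratio formula (substitutions counted as cost 2, a common convention), and the group-merging logic are all assumptions made for illustration.

```python
def levenshtein(a, b, sub_cost=1):
    """Edit distance between a and b via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def ratio(a, b):
    """Similarity in 0..100, with substitutions costing 2 (assumed convention)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * (total - levenshtein(a, b, sub_cost=2)) / total

def assign_match_groups(names_by_id, threshold=89):
    """Map each ID to a match_group; IDs linked by ratio > threshold share a group."""
    ids = sorted(names_by_id)
    group = {i: i for i in ids}  # every ID starts in its own group
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if ratio(names_by_id[a].lower(), names_by_id[b].lower()) > threshold:
                old, new = group[b], group[a]
                # Merge: relabel everything in b's group with a's label.
                for k in ids:
                    if group[k] == old:
                        group[k] = new
    return group

# One call per BIRTH_DATE list, using rows from the question.
print(assign_match_groups({1: "John Lennon", 2: "Jon Lenon"}))         # {1: 1, 2: 1}
print(assign_match_groups({6: "Ringo Star", 7: "George Harrisson"}))   # {6: 6, 7: 7}
```

With this ratio, "John Lennon" vs "Jon Lenon" scores exactly 90, so it only just clears the 89 cut-off; the threshold is worth tuning against real data.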

+20

philippeb
Author
Enthusiast
307 replies
6 years ago
April 10, 2019

ebygomm wrote:

Building on your idea of comparing each feature in a list for similarity.

fuzzyfun.fmwt

This is absolutely great! I can definitely do my analysis with this script. Thanks a lot!!!

FME Lover

+39

ebygomm
Influencer
3312 replies
6 years ago
April 10, 2019

philippeb wrote:

This is absolutely great! I can definitely do my analysis with this script. Thanks a lot!!!

There's a slight error in the All Matches column, but the match group is fine.

+20

philippeb
Author
Enthusiast
307 replies
6 years ago
April 10, 2019

ebygomm wrote:

There's a slight error in the All Matches column, but the match group is fine.

haha yeah I noticed that, but thanks for the confirmation ;)

FME Lover

djmcdermott
Contributor
42 replies
4 years ago
September 17, 2020

@ebygomm I have just stumbled across this thread and, on the face of it, it will resolve my current issue. Unfortunately, the link doesn't work. Does anyone know how I might be able to get a copy?

+20

philippeb
Author
Enthusiast
307 replies
4 years ago
September 17, 2020

I found it on my computer. Enjoy!

FME Lover

djmcdermott
Contributor
42 replies
4 years ago
September 17, 2020

philippeb wrote:

I found it on my computer. Enjoy!

Excellent. Thank you for your quick response!
