Question

Data duplication detection

5 years ago
2 April 2019
10 replies
3 views

philippeb
245 replies

Hi,

I have a database with FIRST_NAME, NAME and BIRTH_DATE.

I have to find all the potentially duplicate people.

If I find two features whose FIRST_NAME and NAME are written the same way or look similar, I have to compare them to check whether they have the same BIRTH_DATE. If they do, it's the same person; if they don't, it's two different people.

In this example, John Lennon and Jon Lenon are written almost the same way and they have the same BIRTH_DATE, so we can consider them the same person.

But John Lennon (1950-01-01) is not the same person as John Lennon (1985-01-01).

ID  FIRST_NAME  NAME       BIRTH_DATE
1   John        Lennon     1950-01-01
2   Jon         Lenon      1950-01-01
3   John        Lennon     1985-01-01
4   Ringo       Star       1945-01-01
5   Ringolo     Star       1945-01-01
6   Ringo       Star       2000-01-01
7   George      Harrisson  2000-01-01

Do you have an idea how to do this kind of comparison? I tried the FuzzyDuplicateRemover, but it doesn't give very good results, and it doesn't store the potential duplicates it finds in a list (so I can't compare BIRTH_DATE).

Thanks!

10 replies

Which criterion is used to determine that John Lennon and Jon Lenon are the same feature?

Anyway, if you want to detect the real duplicated features, see the attached image.

philippeb
Author
245 replies
5 years ago
10 April 2019

After a lot of thinking about this, I decided to use a ListBuilder on BIRTH_DATE at the beginning.

Then run a FuzzyDuplicateRemover on each of the lists. (I don't know how to do that part yet, though.)

It would be great to have a ListFuzzyDuplicateRemover.
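
For anyone trying the same idea outside of FME, here is a minimal Python sketch of the "group by BIRTH_DATE first, then compare names inside each group" step. It only produces the candidate pairs; the fuzzy name check would still have to be applied to each pair. The `PEOPLE` list is the sample table from the question, and `candidate_pairs` is a hypothetical helper name, not an FME transformer.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the question: (ID, FIRST_NAME, NAME, BIRTH_DATE)
PEOPLE = [
    (1, "John", "Lennon", "1950-01-01"),
    (2, "Jon", "Lenon", "1950-01-01"),
    (3, "John", "Lennon", "1985-01-01"),
    (4, "Ringo", "Star", "1945-01-01"),
    (5, "Ringolo", "Star", "1945-01-01"),
    (6, "Ringo", "Star", "2000-01-01"),
    (7, "George", "Harrisson", "2000-01-01"),
]


def candidate_pairs(people):
    """Yield pairs of IDs that share a BIRTH_DATE (hypothetical helper).

    Only these pairs need a fuzzy name comparison; people with different
    birth dates are never compared.
    """
    by_birth_date = defaultdict(list)
    for person_id, _first_name, _name, birth_date in people:
        by_birth_date[birth_date].append(person_id)
    for ids in by_birth_date.values():
        yield from combinations(ids, 2)


print(list(candidate_pairs(PEOPLE)))  # [(1, 2), (4, 5), (6, 7)]
```

Grouping first keeps the number of name comparisons small: the two John Lennons with different birth dates (IDs 1 and 3) never end up in the same pair.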

philippeb
Author
245 replies
5 years ago
10 April 2019

Which criterion is used to determine that John Lennon and Jon Lenon are the same feature?

I thought maybe a ratio calculated from the number of letters that are the same between the two values.

I think the Fuzzy transformer will do the job though.
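
As an illustration of that kind of ratio, here is a small Python sketch using the standard library's `difflib.SequenceMatcher`. Its `ratio()` is 2 * M / T, where M is the number of matching characters and T is the total length of both strings. This is only one possible definition of "letters that are the same", and `letter_ratio` is a hypothetical helper name; lower-casing the values first is also an assumption.

```python
from difflib import SequenceMatcher


def letter_ratio(a: str, b: str) -> float:
    """Return 2 * M / T for the two strings, ignoring case.

    M is the number of characters in matching blocks and T is the
    combined length of both strings, so 1.0 means identical values.
    """
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


print(letter_ratio("John Lennon", "Jon Lenon"))  # 0.9
```

A cut-off still has to be picked by hand; which value counts as "similar enough" depends on the data.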

ebygomm
Contributor
3079 replies
5 years ago
10 April 2019

Building on your idea of comparing each feature in a list for similarity.

I found some Python online for calculating the Levenshtein distance, and used it to compare each value in the list with every other value. If the ratio is > 89 (an arbitrary figure I picked), it counts as a match and the features are assigned a match_group. A feature is output for each ID in the list.

fuzzyfun.fmwt
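
The script inside the workspace is not reproduced in this thread, so the following is only a rough Python sketch of the approach described above, not the actual contents of fuzzyfun.fmwt. Assumptions: the names in one list are passed as a dict of ID to "FIRST_NAME NAME", the ratio is scaled 0-100 as (total length - distance) / total length, and a substitution costs 2 in the distance. All function names are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str, substitution_cost: int = 2) -> int:
    """Levenshtein-style distance: insert/delete cost 1, substitution cost 2 (assumed)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else substitution_cost
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale; 100 means identical strings."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * (total - edit_distance(a, b)) / total


def assign_match_groups(names: dict, threshold: float = 89) -> dict:
    """Map each ID to a match_group; IDs linked by similar names share a group."""
    group = {record_id: record_id for record_id in names}

    def find(record_id):
        # Follow links until reaching the representative ID of the group.
        while group[record_id] != record_id:
            record_id = group[record_id]
        return record_id

    for id_a, id_b in combinations(names, 2):
        if similarity(names[id_a], names[id_b]) > threshold:
            root_a, root_b = find(id_a), find(id_b)
            group[max(root_a, root_b)] = min(root_a, root_b)

    return {record_id: find(record_id) for record_id in names}


# One list per BIRTH_DATE, using the sample data from the question.
print(assign_match_groups({1: "John Lennon", 2: "Jon Lenon"}))        # {1: 1, 2: 1}
print(assign_match_groups({6: "Ringo Star", 7: "George Harrisson"}))  # {6: 6, 7: 7}
```

With this definition "John Lennon" and "Jon Lenon" score 90, just above the 89 cut-off. A different normalisation of the distance would need a different threshold.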

philippeb
Author
245 replies
5 years ago
10 April 2019

This is absolutely great! I can definitely do my analysis with this script. Thanks a lot!!!

ebygomm
Contributor
3079 replies
5 years ago
10 April 2019

There's a slight error with the All Matches column, but the match group is fine.

philippeb
Author
245 replies
5 years ago
10 April 2019

haha yeah I noticed that, but thanks for the confirmation ;)

@ebygomm I have just stumbled across this thread and, on the face of it, it will resolve my current issue. Unfortunately, the link doesn't work. Does anyone know how I might be able to get a copy?

I found it on my computer. Enjoy!

Excellent. Thank you for your quick response!
