Question

Data duplication detection


Badge +8

Hi,

I have a database with FIRST_NAME, NAME and BIRTH_DATE.

I have to find all the potentially duplicate people.

If two features have a FIRST_NAME and NAME that are written the same way or look similar, I have to compare them to check whether they have the same BIRTH_DATE. If they do, it's the same person; if they don't, it's two different people.

In this example, John Lennon and Jon Lenon are written almost the same way and they have the same BIRTH_DATE. We can consider it's the same person.

But John Lennon (1950-01-01) is not the same John Lennon (1985-01-01).

 

ID  FIRST_NAME  NAME       BIRTH_DATE
1   John        Lennon     1950-01-01
2   Jon         Lenon      1950-01-01
3   John        Lennon     1985-01-01
4   Ringo       Star       1945-01-01
5   Ringolo     Star       1945-01-01
6   Ringo       Star       2000-01-01
7   George      Harrisson  2000-01-01

 

Would you have an idea how to do this kind of comparison? I tried the FuzzyDuplicateRemover, but it doesn't give very good results, and it doesn't store the potential duplicates it finds in a list (so I can't compare their BIRTH_DATE).

Thanks!


10 replies

Badge

Which criterion is used to decide that John Lennon and Jon Lenon are the same feature?

Anyway, if you want to detect the true duplicate features, see the attached image.

Badge +8

After a lot of thinking about this, I decided to start with a ListBuilder on BIRTH_DATE.

Then I would run a FuzzyDuplicateRemover on each of the lists. (I don't know how to do that part yet, though.)

It would be great to have a ListFuzzyDuplicateRemover.
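Outside FME, the "one list per birth date" idea can be sketched in a few lines of Python. This is only an illustration of the grouping step, not an FME transformer: the `people` rows are copied from the table in the question, and `candidate_pairs` is a hypothetical helper name. It returns the pairs of IDs that share a BIRTH_DATE, which are the only pairs whose names still need a fuzzy comparison.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the question: (ID, FIRST_NAME, NAME, BIRTH_DATE)
people = [
    (1, "John", "Lennon", "1950-01-01"),
    (2, "Jon", "Lenon", "1950-01-01"),
    (3, "John", "Lennon", "1985-01-01"),
    (4, "Ringo", "Star", "1945-01-01"),
    (5, "Ringolo", "Star", "1945-01-01"),
    (6, "Ringo", "Star", "2000-01-01"),
    (7, "George", "Harrisson", "2000-01-01"),
]


def candidate_pairs(rows):
    """Yield pairs of IDs that share a BIRTH_DATE (hypothetical helper)."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row[3]].append(row)  # one list per birth date
    for group in by_date.values():
        # Every pair inside a list is a candidate for a name comparison.
        for a, b in combinations(group, 2):
            yield a[0], b[0]


pairs = list(candidate_pairs(people))
print(pairs)
```

Grouping first keeps the number of name comparisons small, since people with different birth dates are never compared at all.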

 

Badge +8


For the criterion, I thought maybe a ratio calculated from the number of letters that are the same between the two values.

I think the Fuzzy transformer will do the job though.
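A ratio like that can be tried out quickly with Python's standard `difflib` module. This is a rough sketch of the idea, not what the Fuzzy transformer does internally: `name_similarity` is a made-up helper, and joining first name and name with a space and lowercasing them are my own assumptions. `SequenceMatcher.ratio()` returns 2*M/T, where M is the number of matched characters and T is the total length of both strings, so 1.0 means identical.

```python
from difflib import SequenceMatcher


def name_similarity(first_a, last_a, first_b, last_b):
    """Return a 0.0-1.0 similarity between two full names (hypothetical helper)."""
    a = f"{first_a} {last_a}".lower()
    b = f"{first_b} {last_b}".lower()
    # ratio() is 2 * matched characters / total characters in both strings.
    return SequenceMatcher(None, a, b).ratio()


close = name_similarity("John", "Lennon", "Jon", "Lenon")
far = name_similarity("George", "Harrisson", "Ringo", "Star")
print(close, far)
```

The cut-off between "similar" and "different" still has to be chosen by hand and checked against real data.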

 

 

Userlevel 1
Badge +21

Building on your idea of comparing each feature in a list for similarity.

I found some Python online for computing the Levenshtein distance, then used it to compare each value in the list against every other value. If the ratio is > 89 (an arbitrary figure I picked), the pair is treated as a match and assigned a match_group. One feature is output for each ID in the list.

fuzzyfun.fmwt
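For readers who cannot open the workspace, here is a rough Python sketch of the approach described above. It is not the script from the attachment. The ratio formula (substitutions cost 2, result scaled to 0-100) is one common normalisation and is an assumption, as are the function names `indel_distance`, `ratio` and `assign_match_groups`. Only the > 89 threshold and the match_group idea come from the post.

```python
def indel_distance(a, b):
    """Edit distance where a substitution costs 2 (one delete plus one insert)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 2
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def ratio(a, b):
    """Similarity on a 0-100 scale (assumed normalisation)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * (total - indel_distance(a, b)) / total


def assign_match_groups(names, threshold=89):
    """Map each ID to a match_group; `names` is a dict of ID -> full name."""
    groups = {}
    next_group = 1
    ids = list(names)
    for idx, id_a in enumerate(ids):
        if id_a not in groups:
            groups[id_a] = next_group  # start a new group
            next_group += 1
        for id_b in ids[idx + 1:]:
            # Compare each value to every later value in the list.
            if id_b not in groups and ratio(names[id_a], names[id_b]) > threshold:
                groups[id_b] = groups[id_a]
    return groups


# One list per birth date, as in the ListBuilder idea from this thread.
print(assign_match_groups({1: "john lennon", 2: "jon lenon"}))
print(assign_match_groups({6: "ringo star", 7: "george harrisson"}))
```

The grouping here is greedy: a value joins the group of the first earlier value it matches, so long chains of near-matches may need a more careful clustering step.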

Badge +8


This is absolutely great! I can definitely do my analysis with this script. Thanks a lot!!!

Userlevel 1
Badge +21


There's a slight error in the All Matches column, but the match group is fine.

Badge +8


haha yeah I noticed that, but thanks for the confirmation ;)

Badge +5

@ebygomm I have just stumbled across this thread and, on the face of it, it will resolve my current issue. Unfortunately, the link doesn't work. Does anyone know how I might be able to get a copy?

Badge +8

I found it on my computer. Enjoy!

 

Badge +5


Excellent. Thank you for your quick response!
