Fuzzy matching the same column with a user-defined threshold

Question

Hi! I have a dataset of 75,000 entries that were typed in by users, so it contains many typos. The values are proper nouns; for example, the same name appears as "kathryn", "katheryn", and "kathrye". I'm trying to find out how to use a fuzzymatcher so that each of these variants is matched to the ID it most closely aligns with.

Some challenges here:

1) This is a register, so there may be 50 entries labelled "katheryn" and 50 entries labelled "kathryn". A fuzzymatcher without a threshold doesn't work for me, because each entry simply matches itself. For this reason I'm considering a similarity ratio between 0.7 and 0.99.

2) I want to preserve all entries, because they include other data that I need for further analysis.

Ideally, the output would be a column that assigns the same identifier to all entries that fall within the similarity threshold of each other. Do you know how I can do this?

Apologies if this isn't clear; I don't have much experience with FME or coding in general. Thanks!

danminneyatsaf · Answer

Hi @safeershersad, in case you didn't see it, I left a reply on your other post with an explanation and a demonstration workspace. All the best, Dan M
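
For readers who cannot open that workspace, the grouping described in the question can be sketched in plain Python. This is not the workspace referred to above and not an FME transformer; it is a minimal illustration that assumes the names are available as a list of strings. The function names `similarity` and `assign_group_ids` and the default threshold of 0.8 are hypothetical choices for this example. It scores each pair of distinct spellings with the standard library's `difflib.SequenceMatcher`, links pairs that score at or above the threshold, and gives every original row the ID of its linked group, so no rows are dropped.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def assign_group_ids(names, threshold=0.8):
    """Return one group ID per input row, in the original row order.

    Spellings whose ratio is >= threshold are linked, and links are
    transitive: if A matches B and B matches C, all three share an ID.
    """
    # Compare distinct, normalised spellings only; identical rows
    # (e.g. 50 copies of "kathryn") trivially share a group.
    unique = sorted({n.strip().lower() for n in names})
    parent = {u: u for u in unique}  # union-find forest

    def find(x):
        # Follow parent links to the group's representative spelling.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(unique, 2):
        if similarity(a, b) >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the alphabetically first spelling as representative.
                parent[max(ra, rb)] = min(ra, rb)

    roots = sorted({find(u) for u in unique})
    root_id = {root: i for i, root in enumerate(roots, start=1)}
    return [root_id[find(n.strip().lower())] for n in names]


names = ["kathryn", "katheryn", "kathrye", "Kathryn", "john", "jon"]
ids = assign_group_ids(names, threshold=0.8)
for name, group in zip(names, ids):
    print(group, name)
```

Two caveats apply to this sketch. The pairwise loop grows quadratically with the number of distinct spellings, so on 75,000 rows it is only practical if deduplication shrinks the list considerably. Because links are transitive, a low threshold such as 0.7 can chain unrelated names into one group, so it is worth inspecting the groups at a few thresholds before settling on one.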
