Question

Fuzzy matching the same column with a user-defined threshold

  • 4 November 2021

Hi! I have a dataset of 75,000 user-entered records, so it contains a lot of typos. The values are proper nouns, and the same name shows up spelled several ways, for example "kathryn" vs "katheryn" vs "kathrye". I'm trying to find out how to use a fuzzy matcher so that each of these variants is matched to the ID it most closely aligns with.

 

Some challenges here:

1) This is a register, so there may be 50 entries labelled "katheryn" and 50 entries labelled "kathryn". A fuzzy matcher without a threshold doesn't work for me, because every entry simply matches itself (or an identical copy) with a perfect score. For this reason I'm considering only matches with a similarity ratio between 0.7 and 0.99.

2) I want to preserve all entries, as entries include other data that I need for further analysis.

 

Ideally, the output would be a new column that gives the same identifier to all entries that fall within the similarity threshold of each other. Do you know how I can do this?

 

Apologies if this isn't clear; I don't have a ton of experience with FME or coding in general. Thanks!


1 reply


Hi @safeershersad, in case you didn't see it, I left a reply on your other post with an explanation and a demonstration workspace.

 

All the best,

Dan M
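
For readers who land on this thread without the linked workspace: the following is a minimal sketch of the idea in the question, written in Python rather than FME, and it is not the workspace from the other post. It assumes the names are available as a plain list, uses the standard-library `difflib.SequenceMatcher` ratio as the similarity measure, and introduces a hypothetical helper `assign_group_ids` with an assumed default threshold of 0.8. Only distinct spellings are compared, so identical copies never need to "match themselves", and every original entry is kept and receives a group identifier.

```python
from difflib import SequenceMatcher
from itertools import combinations


def assign_group_ids(names, threshold=0.8):
    """Return one group id per input entry; similar spellings share an id.

    Hypothetical helper for illustration. `threshold` is an assumed value;
    tune it on your own data.
    """
    # Compare distinct spellings only: 50 copies of "kathryn" collapse to one.
    distinct = sorted({n.strip().lower() for n in names})

    # Union-find structure: each spelling starts as its own group.
    parent = {n: n for n in distinct}

    def find(n):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Merge the groups of every pair of spellings that meets the threshold.
    for a, b in combinations(distinct, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the alphabetically smaller root so ids are deterministic.
                parent[max(ra, rb)] = min(ra, rb)

    # Number the groups 1, 2, 3, ... in alphabetical order of their roots.
    roots = sorted({find(n) for n in distinct})
    root_to_id = {root: i for i, root in enumerate(roots, start=1)}

    # One id per original entry, in the original order (no rows are dropped).
    return [root_to_id[find(n.strip().lower())] for n in names]


entries = ["kathryn", "katheryn", "kathrye", "kathryn", "robert", "robrt"]
group_ids = assign_group_ids(entries, threshold=0.8)
for name, gid in zip(entries, group_ids):
    print(name, gid)
```

Two caveats with this sketch: comparing every pair of distinct spellings is quadratic in their number, and groups are formed by chaining, so two spellings can end up together through an intermediate one even if they are not directly within the threshold of each other.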
