Question

Fuzzy matching within the same column

  • 4 November 2021
  • 5 replies
  • 42 views

Hi! I have a dataset with 75,000 entries that were user inputted, and therefore have several typos. For example, they are proper nouns, some are labelled "kathryn" vs "katheryn" vs "kathrye". I'm trying to find out how to use a fuzzymatcher where each of these is matched to the ID that they most align with.

 

Some challenges here:

1) This is a register, so there may be 50 entries that are labelled "katheryn" and 50 entries that are labelled "kathryn". If I use a fuzzymatcher which doesn't have a threshold, then it doesn't work for me as each one connects to itself. I'm considering a similarity ratio between 0.7 and 0.99 for this reason.

2) I want to preserve all entries, as entries include other data that I need for further analysis.

 

Ideally, the output would be a column which uses an identifier for all entries that fall within the similarity threshold. Do you know how I can do this?

 

Apologies if it isn't clear, I don't have a ton of experience with FME or coding in general. Thanks!


5 replies

Userlevel 2
Badge +10

Hi @safeershersad​, it looks like the FuzzyStringComparer might be the transformer you are looking for! It compare two string attributes and calculate a similarity ratio. This transformer also preserves all entries.

 

As for the threshold part of this workspace, you could use a Tester to test if the ratio has a value greater than 0.7. Features that have a ratio higher than 0.7 will go to the Passed port, whereas features that have a ratio lower than 0.7 will go to the Failed port.

You can then connect the Passed port to an AttributeManager to assign a new identifier to all those entries that fall within the similarity threshold.

 

To better demonstrate this I have attached a simple demonstration.

 

Let me know if this helps!

Hi @safeershersad​, it looks like the FuzzyStringComparer might be the transformer you are looking for! It compare two string attributes and calculate a similarity ratio. This transformer also preserves all entries.

 

As for the threshold part of this workspace, you could use a Tester to test if the ratio has a value greater than 0.7. Features that have a ratio higher than 0.7 will go to the Passed port, whereas features that have a ratio lower than 0.7 will go to the Failed port.

You can then connect the Passed port to an AttributeManager to assign a new identifier to all those entries that fall within the similarity threshold.

 

To better demonstrate this I have attached a simple demonstration.

 

Let me know if this helps!

Hi @danminneyatsaf​ , thanks for the demo - I took a look and I believe this won't work, as I'm looking for matches within the same column. I've attached a sample input and output that I'm trying to achieve. Is there an option for the fuzzy string comparer to search the same column until it finds a match above 0.7?

Userlevel 2
Badge +10

Hi @danminneyatsaf​ , thanks for the demo - I took a look and I believe this won't work, as I'm looking for matches within the same column. I've attached a sample input and output that I'm trying to achieve. Is there an option for the fuzzy string comparer to search the same column until it finds a match above 0.7?

@safeershersad​ are you trying to fuzzy compare each of these names against a single name/value (ex: Boatel) and then sort them by highest similarity ratio to lowest?

Hi @danminneyatsaf​ , thanks for the demo - I took a look and I believe this won't work, as I'm looking for matches within the same column. I've attached a sample input and output that I'm trying to achieve. Is there an option for the fuzzy string comparer to search the same column until it finds a match above 0.7?

@danminneyatsaf​  I'm trying to clean the list so that there aren't repeated close matches, like BOATEL1 and BOATELI. from what I'm seeing, it looks like this might not be possible as fuzzystringcomparer only checks similarity with one string per entry.

Userlevel 2
Badge +10

Hi @danminneyatsaf​ , thanks for the demo - I took a look and I believe this won't work, as I'm looking for matches within the same column. I've attached a sample input and output that I'm trying to achieve. Is there an option for the fuzzy string comparer to search the same column until it finds a match above 0.7?

@safeershersad​ I also found the FuzzyStringCompareFrom2Datasets transformer on the FME Hub. This transformer allows you to do what you were looking for above, which is compare each value against each value in a column. What you could do here is set the same column/attribute for the String attribute Feature Type 1 & String Attribute Feature Type 2. From testing this myself it seems that every value is compared against each value in a column.

Reply