Skip to main content
Solved

Fuzzy string matching from two datasets


Hi, FuzzyStringComparer is a great transformer but it only works within one dataset. Has anyone created a method of doing the same with two datasets?

 

 

I have done a feature merge on two streetname lists. As output I now have two lists which contain streetnames that are missing/wrong in either list.

 

 

So I would like to continue by doing a fuzzy comparison and return to ListA the most probable candidate from ListB for each attribute.

 

 

It is likely there are just typos, and I will then find them. Where the probability is very low the streetname is most likely missing completely or superfluous in either list which is also useful information.

 

 

Thanks for any ideas how to proceed...

Best answer by takashi

Hi @shelley98, the "Join On" parameter in the FeatureMerger also accepts an expression, so you can just set the same constant value (e.g. 1) to the field for both Requestor and Supplier to perform unconditional merging, like this.

Hope this helps.

View original
Did this help you find an answer to your question?

14 replies

david_r
Evangelist
  • October 14, 2014
Hi,

 

 

I would've done this using a PythonCaller and something like the difflib (https://docs.python.org/2/library/difflib.html) module.

 

 

Sample use:

 

 

from difflib import get_close_matches

 

matches = get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])

 

 

This will return the list ['apple', 'ape']

 

 

David

takashi
Influencer
  • October 16, 2014
Hi,

 

 

I guess that your requirement is to compare a string (e.g. "str1") in the dataset 1 with every string (e.g. "str2") in the dataset 2 with the same manner as the FuzzyStringComparer, and save the results into a list.

 

If my understanding is correct, this might help you.

 

 

1) FeatureMerger: Merge "str2" of every dataset 2 features to each dataset 1 feature. Specify a constant (e.g. "1") to the "Join On" parameter to perform unconditional merging, and:

 

Process Duplicate Feature: Yes

 

Duplicate Supplier List Name: dataset2 (for example)

 

 

2) PythonCaller: Calculate similarity ratios between "str1" and every "str2" stored in a list, and store the results in a list; expose resulting list name (e.g. "dataset2{}.ratio").

 

Script example:

 

-----

 

import difflib

 

def fuzzyCompareString(feature):

 

    str1 = feature.getAttribute('str1')

 

    for i, str2 in enumerate(feature.getAttribute('dataset2{}.str2')):

 

        ratio = difflib.SequenceMatcher(None, str1, str2).ratio()

 

        feature.setAttribute('dataset2{%d}.ratio'  % i, ratio)

 

-----

 

 

3) ListSorter: Sort "dataset2{}.ratio" by Numeric Decreasing.

 

 

The Python method (difflib....ratio()) in the example above is same as the method that is used in the FuzzyStringComparer.

 

 

Takashi

  • Author
  • October 17, 2014
Thank you both very much for your support. I will need to try this once I get a chance!

Hi @takashi I am trying to use your example above, please could you guide me through setting up the feature merger for unconditional merging? I am a complete novice with FME. Do I select fields from the attribute value list under both requester and supplier in the Join On box or do I select 'User Parameter' and set the constant to '1', or set up a conditional statement from the conditional value option.

Thanks


takashi
Influencer
  • Best Answer
  • June 8, 2016

Hi @shelley98, the "Join On" parameter in the FeatureMerger also accepts an expression, so you can just set the same constant value (e.g. 1) to the field for both Requestor and Supplier to perform unconditional merging, like this.

Hope this helps.


takashi
Influencer
  • June 8, 2016
shelley98 wrote:

Hi @takashi I am trying to use your example above, please could you guide me through setting up the feature merger for unconditional merging? I am a complete novice with FME. Do I select fields from the attribute value list under both requester and supplier in the Join On box or do I select 'User Parameter' and set the constant to '1', or set up a conditional statement from the conditional value option.

Thanks

Hi @shelley98, the "Join On" parameter in the FeatureMerger also accepts an expression, so you can just set the same constant value (e.g. 1) to the field for both Requestor and Supplier to perform unconditional merging, like this.

Hope this helps.


That's great, thanks so much @taskashi


Forum|alt.badge.img+4
  • July 28, 2016

@takashi Takashi-san strikes again, ????????!


Forum|alt.badge.img+4
  • July 31, 2016
takashi wrote:
Hi,

 

 

I guess that your requirement is to compare a string (e.g. "str1") in the dataset 1 with every string (e.g. "str2") in the dataset 2 with the same manner as the FuzzyStringComparer, and save the results into a list.

 

If my understanding is correct, this might help you.

 

 

1) FeatureMerger: Merge "str2" of every dataset 2 features to each dataset 1 feature. Specify a constant (e.g. "1") to the "Join On" parameter to perform unconditional merging, and:

 

Process Duplicate Feature: Yes

 

Duplicate Supplier List Name: dataset2 (for example)

 

 

2) PythonCaller: Calculate similarity ratios between "str1" and every "str2" stored in a list, and store the results in a list; expose resulting list name (e.g. "dataset2{}.ratio").

 

Script example:

 

-----

 

import difflib

 

def fuzzyCompareString(feature):

 

    str1 = feature.getAttribute('str1')

 

    for i, str2 in enumerate(feature.getAttribute('dataset2{}.str2')):

 

        ratio = difflib.SequenceMatcher(None, str1, str2).ratio()

 

        feature.setAttribute('dataset2{%d}.ratio'  % i, ratio)

 

-----

 

 

3) ListSorter: Sort "dataset2{}.ratio" by Numeric Decreasing.

 

 

The Python method (difflib....ratio()) in the example above is same as the method that is used in the FuzzyStringComparer.

 

 

Takashi

@takashi Going to start a new thread but I have 2 lists of names that are similar to @makela and looking to rate the matches for each one in a match ratio attribute like you describe.

So my question is: you would not use the Fuzzy String Comparer at all and just use that Python snippet within the PythonCaller in order to generate the list that contains the ratio attribute?


itay
Supporter
Forum|alt.badge.img+17
  • Supporter
  • August 1, 2016

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.


ciarab
Contributor
Forum|alt.badge.img+9
  • Contributor
  • April 27, 2017
itay wrote:

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

@itay what changes did you make to the transformer in order to compare two datasets.

 

 

I have tried the logic above from @takashi , it works but I have to match 10,000 address to 2 million and the unconditional merge on the feature merger is taking a very long time. Its just not feasible.

 

 

Did you have success in your method?

takashi
Influencer
  • April 27, 2017
itay wrote:

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

Hi @ciarab, if you need to compare every pair of strings from the two datasets according to the ratio of matched part, it would not be avoid to create 10K x 2M pairs unless the features could be grouped by some rules.

I think it would be better to consider performing CROSS JOIN operation with a SQL statement. If the source format is not a database supporting SQL, it would be necessary to save the features into a database beforehand.


ciarab
Contributor
Forum|alt.badge.img+9
  • Contributor
  • April 27, 2017
takashi wrote:

Hi @ciarab, if you need to compare every pair of strings from the two datasets according to the ratio of matched part, it would not be avoid to create 10K x 2M pairs unless the features could be grouped by some rules.

I think it would be better to consider performing CROSS JOIN operation with a SQL statement. If the source format is not a database supporting SQL, it would be necessary to save the features into a database beforehand.

Thanks @takashi that sounds like a good plan. I might try that out next though not sure I will be allowed access to a database.

 

 

 

I am currently testing just matching on address line 1 and then sending those that have the same address line through FuzzyStringComparer -- seems to be matching a good amount but I will see how I get on! For some reason I thought this would be easier then it is.

 


cdalessandro
Participant
Forum|alt.badge.img+4
  • Participant
  • August 30, 2017

If anyone is interest yet, I materialized Takashi indications (Thanks Takashi) and I published

"FuzzyStringCompareFrom2Datasets.fmx" in FME HUB


Reply


Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

 
Cookie settings