Solved

Fuzzy string matching from two datasets


Hi, FuzzyStringComparer is a great transformer but it only works within one dataset. Has anyone created a method of doing the same with two datasets?

 

 

I have done a feature merge on two streetname lists. As output I now have two lists which contain streetnames that are missing/wrong in either list.

 

 

So I would like to continue by doing a fuzzy comparison and return to ListA the most probable candidate from ListB for each attribute.

 

 

It is likely there are just typos, and I will then find them. Where the probability is very low the streetname is most likely missing completely or superfluous in either list which is also useful information.

 

 

Thanks for any ideas how to proceed...
icon

Best answer by takashi 9 June 2016, 01:59

View original

14 replies

Userlevel 4
Hi,

 

 

I would've done this using a PythonCaller and something like the difflib (https://docs.python.org/2/library/difflib.html) module.

 

 

Sample use:

 

 

from difflib import get_close_matches

 

matches = get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])

 

 

This will return the list ['apple', 'ape']

 

 

David
Userlevel 2
Badge +17
Hi,

 

 

I guess that your requirement is to compare a string (e.g. "str1") in the dataset 1 with every string (e.g. "str2") in the dataset 2 with the same manner as the FuzzyStringComparer, and save the results into a list.

 

If my understanding is correct, this might help you.

 

 

1) FeatureMerger: Merge "str2" of every dataset 2 features to each dataset 1 feature. Specify a constant (e.g. "1") to the "Join On" parameter to perform unconditional merging, and:

 

Process Duplicate Feature: Yes

 

Duplicate Supplier List Name: dataset2 (for example)

 

 

2) PythonCaller: Calculate similarity ratios between "str1" and every "str2" stored in a list, and store the results in a list; expose resulting list name (e.g. "dataset2{}.ratio").

 

Script example:

 

-----

 

import difflib

 

def fuzzyCompareString(feature):

 

    str1 = feature.getAttribute('str1')

 

    for i, str2 in enumerate(feature.getAttribute('dataset2{}.str2')):

 

        ratio = difflib.SequenceMatcher(None, str1, str2).ratio()

 

        feature.setAttribute('dataset2{%d}.ratio'  % i, ratio)

 

-----

 

 

3) ListSorter: Sort "dataset2{}.ratio" by Numeric Decreasing.

 

 

The Python method (difflib....ratio()) in the example above is same as the method that is used in the FuzzyStringComparer.

 

 

Takashi
Thank you both very much for your support. I will need to try this once I get a chance!

Hi @takashi I am trying to use your example above, please could you guide me through setting up the feature merger for unconditional merging? I am a complete novice with FME. Do I select fields from the attribute value list under both requester and supplier in the Join On box or do I select 'User Parameter' and set the constant to '1', or set up a conditional statement from the conditional value option.

Thanks

Userlevel 2
Badge +17

Hi @shelley98, the "Join On" parameter in the FeatureMerger also accepts an expression, so you can just set the same constant value (e.g. 1) to the field for both Requestor and Supplier to perform unconditional merging, like this.

Hope this helps.

Userlevel 2
Badge +17

Hi @takashi I am trying to use your example above, please could you guide me through setting up the feature merger for unconditional merging? I am a complete novice with FME. Do I select fields from the attribute value list under both requester and supplier in the Join On box or do I select 'User Parameter' and set the constant to '1', or set up a conditional statement from the conditional value option.

Thanks

Hi @shelley98, the "Join On" parameter in the FeatureMerger also accepts an expression, so you can just set the same constant value (e.g. 1) to the field for both Requestor and Supplier to perform unconditional merging, like this.

Hope this helps.

That's great, thanks so much @taskashi

Badge

@takashi Takashi-san strikes again, ????????!

Badge
Hi,

 

 

I guess that your requirement is to compare a string (e.g. "str1") in the dataset 1 with every string (e.g. "str2") in the dataset 2 with the same manner as the FuzzyStringComparer, and save the results into a list.

 

If my understanding is correct, this might help you.

 

 

1) FeatureMerger: Merge "str2" of every dataset 2 features to each dataset 1 feature. Specify a constant (e.g. "1") to the "Join On" parameter to perform unconditional merging, and:

 

Process Duplicate Feature: Yes

 

Duplicate Supplier List Name: dataset2 (for example)

 

 

2) PythonCaller: Calculate similarity ratios between "str1" and every "str2" stored in a list, and store the results in a list; expose resulting list name (e.g. "dataset2{}.ratio").

 

Script example:

 

-----

 

import difflib

 

def fuzzyCompareString(feature):

 

    str1 = feature.getAttribute('str1')

 

    for i, str2 in enumerate(feature.getAttribute('dataset2{}.str2')):

 

        ratio = difflib.SequenceMatcher(None, str1, str2).ratio()

 

        feature.setAttribute('dataset2{%d}.ratio'  % i, ratio)

 

-----

 

 

3) ListSorter: Sort "dataset2{}.ratio" by Numeric Decreasing.

 

 

The Python method (difflib....ratio()) in the example above is same as the method that is used in the FuzzyStringComparer.

 

 

Takashi

@takashi Going to start a new thread but I have 2 lists of names that are similar to @makela and looking to rate the matches for each one in a match ratio attribute like you describe.

So my question is: you would not use the Fuzzy String Comparer at all and just use that Python snippet within the PythonCaller in order to generate the list that contains the ratio attribute?

Badge +16

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

Badge +9

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

@itay what changes did you make to the transformer in order to compare two datasets.

 

 

I have tried the logic above from @takashi , it works but I have to match 10,000 address to 2 million and the unconditional merge on the feature merger is taking a very long time. Its just not feasible.

 

 

Did you have success in your method?
Userlevel 2
Badge +17

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

Hi @ciarab, if you need to compare every pair of strings from the two datasets according to the ratio of matched part, it would not be avoid to create 10K x 2M pairs unless the features could be grouped by some rules.

I think it would be better to consider performing CROSS JOIN operation with a SQL statement. If the source format is not a database supporting SQL, it would be necessary to save the features into a database beforehand.

Badge +9

Hi @ciarab, if you need to compare every pair of strings from the two datasets according to the ratio of matched part, it would not be avoid to create 10K x 2M pairs unless the features could be grouped by some rules.

I think it would be better to consider performing CROSS JOIN operation with a SQL statement. If the source format is not a database supporting SQL, it would be necessary to save the features into a database beforehand.

Thanks @takashi that sounds like a good plan. I might try that out next though not sure I will be allowed access to a database.

 

 

 

I am currently testing just matching on address line 1 and then sending those that have the same address line through FuzzyStringComparer -- seems to be matching a good amount but I will see how I get on! For some reason I thought this would be easier then it is.

 

Badge

If anyone is interest yet, I materialized Takashi indications (Thanks Takashi) and I published

"FuzzyStringCompareFrom2Datasets.fmx" in FME HUB

Reply