Solved

Fuzzy string matching from two datasets

10 years ago
October 14, 2014
14 replies
322 views

makela
3 replies

Hi, FuzzyStringComparer is a great transformer but it only works within one dataset. Has anyone created a method of doing the same with two datasets?

I have done a feature merge on two streetname lists. As output I now have two lists which contain streetnames that are missing/wrong in either list.

So I would like to continue by doing a fuzzy comparison and return to ListA the most probable candidate from ListB for each attribute.

It is likely there are just typos, and I will then find them. Where the probability is very low the streetname is most likely missing completely or superfluous in either list which is also useful information.

Thanks for any ideas how to proceed...

Best answer by takashi

Hi @shelley98, the "Join On" parameter in the FeatureMerger also accepts an expression, so you can just set the same constant value (e.g. 1) to the field for both Requestor and Supplier to perform unconditional merging, like this.

Hope this helps.

View original

Did this help you find an answer to your question?

david_r
8349 replies
10 years ago
October 14, 2014

Hi,

I would've done this using a PythonCaller and something like the difflib (https://docs.python.org/2/library/difflib.html) module.

Sample use:

from difflib import get_close_matches

matches = get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])

This will return the list ['apple', 'ape']

David

takashi
7681 replies
10 years ago
October 16, 2014

Hi,

I guess that your requirement is to compare a string (e.g. "str1") in the dataset 1 with every string (e.g. "str2") in the dataset 2 with the same manner as the FuzzyStringComparer, and save the results into a list.

If my understanding is correct, this might help you.

1) FeatureMerger: Merge "str2" of every dataset 2 features to each dataset 1 feature. Specify a constant (e.g. "1") to the "Join On" parameter to perform unconditional merging, and:

Process Duplicate Feature: Yes

Duplicate Supplier List Name: dataset2 (for example)

2) PythonCaller: Calculate similarity ratios between "str1" and every "str2" stored in a list, and store the results in a list; expose resulting list name (e.g. "dataset2{}.ratio").

Script example:

-----

import difflib

def fuzzyCompareString(feature):

str1 = feature.getAttribute('str1')

for i, str2 in enumerate(feature.getAttribute('dataset2{}.str2')):

ratio = difflib.SequenceMatcher(None, str1, str2).ratio()

feature.setAttribute('dataset2{%d}.ratio' % i, ratio)

-----

3) ListSorter: Sort "dataset2{}.ratio" by Numeric Decreasing.

The Python method (difflib....ratio()) in the example above is same as the method that is used in the FuzzyStringComparer.

Takashi

makela
Author
3 replies
10 years ago
October 17, 2014

Thank you both very much for your support. I will need to try this once I get a chance!

shelley98
3 replies
9 years ago
June 8, 2016

Hi @takashi I am trying to use your example above, please could you guide me through setting up the feature merger for unconditional merging? I am a complete novice with FME. Do I select fields from the attribute value list under both requester and supplier in the Join On box or do I select 'User Parameter' and set the constant to '1', or set up a conditional statement from the conditional value option.

Thanks

takashi
7681 replies
Best Answer
9 years ago
June 8, 2016

Hope this helps.

takashi
7681 replies
9 years ago
June 8, 2016

shelley98 wrote:

Thanks

Hope this helps.

shelley98
3 replies
9 years ago
June 9, 2016

That's great, thanks so much @taskashi

dmatranga
30 replies
8 years ago
July 28, 2016

@takashi Takashi-san strikes again, ????????!

dmatranga
30 replies
8 years ago
July 31, 2016

takashi wrote:

Hi,

If my understanding is correct, this might help you.

1) FeatureMerger: Merge "str2" of every dataset 2 features to each dataset 1 feature. Specify a constant (e.g. "1") to the "Join On" parameter to perform unconditional merging, and:

Process Duplicate Feature: Yes

Duplicate Supplier List Name: dataset2 (for example)

2) PythonCaller: Calculate similarity ratios between "str1" and every "str2" stored in a list, and store the results in a list; expose resulting list name (e.g. "dataset2{}.ratio").

Script example:

-----

import difflib

def fuzzyCompareString(feature):

str1 = feature.getAttribute('str1')

for i, str2 in enumerate(feature.getAttribute('dataset2{}.str2')):

ratio = difflib.SequenceMatcher(None, str1, str2).ratio()

feature.setAttribute('dataset2{%d}.ratio' % i, ratio)

-----

3) ListSorter: Sort "dataset2{}.ratio" by Numeric Decreasing.

The Python method (difflib....ratio()) in the example above is same as the method that is used in the FuzzyStringComparer.

Takashi

@takashi Going to start a new thread but I have 2 lists of names that are similar to @makela and looking to rate the matches for each one in a match ratio attribute like you describe.

So my question is: you would not use the Fuzzy String Comparer at all and just use that Python snippet within the PythonCaller in order to generate the list that contains the ratio attribute?

+17

itay
Supporter
1441 replies
8 years ago
August 1, 2016

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

ciarab
Contributor
167 replies
8 years ago
April 27, 2017

itay wrote:

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

@itay what changes did you make to the transformer in order to compare two datasets.

I have tried the logic above from @takashi , it works but I have to match 10,000 address to 2 million and the unconditional merge on the feature merger is taking a very long time. Its just not feasible.

Did you have success in your method?

takashi
7681 replies
8 years ago
April 27, 2017

itay wrote:

In the past I have used an edited tFuzzyStringComparer in a costum tranformer in which multiple sets are compared, the logica behind the FuzzyStringComparer can be viewed and edited to your needs.

Hi @ciarab, if you need to compare every pair of strings from the two datasets according to the ratio of matched part, it would not be avoid to create 10K x 2M pairs unless the features could be grouped by some rules.

I think it would be better to consider performing CROSS JOIN operation with a SQL statement. If the source format is not a database supporting SQL, it would be necessary to save the features into a database beforehand.

ciarab
Contributor
167 replies
8 years ago
April 27, 2017

takashi wrote:

Thanks @takashi that sounds like a good plan. I might try that out next though not sure I will be allowed access to a database.

I am currently testing just matching on address line 1 and then sending those that have the same address line through FuzzyStringComparer -- seems to be matching a good amount but I will see how I get on! For some reason I thought this would be easier then it is.

cdalessandro
Participant
19 replies
7 years ago
August 30, 2017

If anyone is interest yet, I materialized Takashi indications (Thanks Takashi) and I published

"FuzzyStringCompareFrom2Datasets.fmx" in FME HUB

Reply

Rich Text Editor, editor1

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Fuzzy string matching from two datasets