Question

Use the FuzzyStringComparer to iterate through 2 lists?

  • 3 August 2016
  • 6 replies

How can I use the FuzzyStringComparer on 2 lists?

For instance:

Table1_Attribute
ABC COMPANY
DEF COMPANY

Table2_Attribute
ABC INC
DEF

My intention is to generate a table as follows:

Table1_Attribute    Table2_Attribute    ratio
ABC COMPANY         ABC INC             0.55
DEF COMPANY         DEF                 0.43

Following a post from @takashi, I merged the 2 datasets using the FeatureMerger transformer with a constant of 1 as the join value. I then generated a list using the FeatureMerger transformer and called it dataset2.

I then tried to use the FuzzyStringComparer, but it asks me to pick a single attribute. If I choose the dataset2.list{}, it wants a single element of that list, and it doesn't seem to be able to iterate through the list.

Any help would be appreciated!


6 replies

Userlevel 1
Badge +12

You can use the Python in the FuzzyStringComparer (e.g. difflib.SequenceMatcher) to do it. I would suggest keeping one side as features and adding the other side's features to them as a list. Then you can cycle through the list, exactly as @takashi showed in

https://knowledge.safe.com/questions/3776/fuzzy-string-matching-from-two-datasets.html

Hopefully you aren't dealing with millions of features, as there will be many iterations in that case. Keep the smaller dataset as the list to help reduce iterations.
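Outside of FME, the idea of cycling one string through a list of candidates can be sketched roughly as below. This is only an illustration with plain Python strings: the function name compare_to_list is made up for the example, and the sample values are the ones from the question.

```python
import difflib

def compare_to_list(name, candidates):
    """Score one string against every candidate string (hypothetical helper)."""
    results = []
    for candidate in candidates:
        # ratio() returns a similarity score between 0.0 and 1.0.
        ratio = difflib.SequenceMatcher(None, name, candidate).ratio()
        results.append((candidate, ratio))
    return results

# Sample values taken from the question.
table2 = ["ABC INC", "DEF"]
for candidate, ratio in compare_to_list("ABC COMPANY", table2):
    print(candidate, round(ratio, 2))
```

The number of comparisons is the number of features multiplied by the length of the list, which is why large inputs get slow.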

Userlevel 2
Badge +17

Hi @dmatranga, if you need to compare the two tables row by row, add a row number attribute to both tables and then merge them using the row number as the join key, i.e. merge the first row of table 2 to the first row of table 1, the second row of table 2 to the second row of table 1, and so on.
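As a rough picture of what the row-number join produces, here is a sketch in plain Python rather than in the workspace. It assumes both tables are already in the same row order; compare_rows is a made-up name, and zip stands in for the join on row number.

```python
import difflib

def compare_rows(left, right):
    """Pair row i of one table with row i of the other and score each pair."""
    rows = []
    # zip pairs items by position, like a join on a shared row number.
    for left_name, right_name in zip(left, right):
        ratio = difflib.SequenceMatcher(None, left_name, right_name).ratio()
        rows.append((left_name, right_name, ratio))
    return rows

# Sample values taken from the question.
table1 = ["ABC COMPANY", "DEF COMPANY"]
table2 = ["ABC INC", "DEF"]
for row in compare_rows(table1, table2):
    print(row[0], row[1], round(row[2], 2))
```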

Badge

@takashi I did the steps below, but I'm still stuck when it comes to the FuzzyStringComparer:

  • Exposed the csv_line_number format attributes on both CSV readers, dataset1 and dataset2
  • For dataset1, I used a BulkAttributeRenamer to prefix the attributes with dataset1_
  • For dataset2, I used a BulkAttributeRenamer to prefix the attributes with dataset2_
  • Using @todd_davis's suggestion, used a ListBuilder to create a list of all the features in dataset1 and called it _dataset1LIST.
  • Used a FeatureMerger to merge on:
    • _dataset1LIST{0}.dataset1_csv_line_number
    • dataset2_csv_line_number

Reading 10 features from dataset1, and 1 feature from dataset2, I get this result:

Here's my config for FuzzyStringComparer:

  • String 1 Attribute:
    • _dataset1LIST{0}.dataset1_NAME
  • String 2 Attribute:
    • dataset2_NAME

The only problem is, this is a one-to-one comparison, with the resulting RATIO attribute just representing the result of a single comparison to the first element in the list.

If I want to compare a single feature from dataset2 to every feature in dataset1, is this possible using FuzzyStringComparer?

Badge

I'm also getting this error message when I use the method from https://knowledge.safe.com/questions/3776/fuzzy-string-matching-from-two-datasets.html

Python Exception <TypeError>: 'NoneType' object is not iterable

Error encountered while calling function `FuzzyStringCompare'

f_22(PythonFactory): PythonFactory failed to process feature

Userlevel 2
Badge +17

Hi @dmatranga, to compare a single feature from dataset 2 to every feature from dataset 1, you don't need to build a list. Merge the dataset 2 feature to every dataset 1 feature unconditionally and then apply the FuzzyStringComparer. To perform unconditional merging, set an identical value (e.g. 1) for the Join On parameter of both the Requestor and the Supplier.
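Joining on a constant pairs every Supplier with every Requestor, which amounts to a cross product. A minimal sketch of that idea in plain Python (not FME itself) is below. The function name cross_compare and the step that picks the best match are additions for illustration only.

```python
import difflib
from itertools import product

def cross_compare(dataset1_names, dataset2_names):
    """Pair every dataset2 name with every dataset1 name and score each pair."""
    rows = []
    # product() yields all combinations, like a merge on a constant key.
    for name2, name1 in product(dataset2_names, dataset1_names):
        ratio = difflib.SequenceMatcher(None, name1, name2).ratio()
        rows.append((name2, name1, ratio))
    return rows

# Sample values taken from the question.
dataset1_names = ["ABC COMPANY", "DEF COMPANY"]
dataset2_names = ["ABC INC"]

rows = cross_compare(dataset1_names, dataset2_names)
# Optional extra step: keep the highest-scoring dataset1 name.
best = max(rows, key=lambda row: row[2])
print(best)
```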

 

Userlevel 2
Badge +17

Is the "'NoneType' object is not iterable" error coming from this script?

import difflib

def fuzzyCompareString(feature):
    str1 = feature.getAttribute('str1')
    # enumerate() raises TypeError if getAttribute returns None here.
    for i, str2 in enumerate(feature.getAttribute('dataset2{}.str2')):
        # ratio() gives a similarity score between 0.0 and 1.0.
        ratio = difflib.SequenceMatcher(None, str1, str2).ratio()
        feature.setAttribute('dataset2{%d}.ratio' % i, ratio)

If so, the getAttribute call (feature.getAttribute('dataset2{}.str2')) probably returned None. More than likely, the specified list attribute does not exist or the list name is incorrect.
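One way to confirm this is to check for None before looping. The sketch below is a guess at how that could look, not the script from the linked post: StubFeature is a hypothetical stand-in so the example runs outside FME, and fuzzy_compare_safe and its list_name parameter are made-up names.

```python
import difflib

class StubFeature:
    """Hypothetical stand-in for an FME feature, for illustration only."""
    def __init__(self, attrs):
        self._attrs = dict(attrs)

    def getAttribute(self, name):
        # Like the real call, return None when the attribute is missing.
        return self._attrs.get(name)

    def setAttribute(self, name, value):
        self._attrs[name] = value

def fuzzy_compare_safe(feature, list_name='dataset2'):
    """Score str1 against each list element; return how many were scored."""
    str1 = feature.getAttribute('str1')
    candidates = feature.getAttribute('%s{}.str2' % list_name)
    if str1 is None or candidates is None:
        # Missing attribute or wrong list name: skip instead of raising TypeError.
        return 0
    for i, str2 in enumerate(candidates):
        ratio = difflib.SequenceMatcher(None, str1, str2).ratio()
        feature.setAttribute('%s{%d}.ratio' % (list_name, i), ratio)
    return len(candidates)

# A feature with no list attribute is skipped rather than failing.
print(fuzzy_compare_safe(StubFeature({'str1': 'ABC COMPANY'})))
```

If the function returns 0 for every feature, the list name passed in most likely does not match the list on the incoming features.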

 

 
