Solved

FuzzyStringCompareFrom2Datasets Slow

6 years ago
August 8, 2018
4 replies
136 views

andy_map
2 replies

I'm trying to compare two datasets on the address attribute string they share and get the fuzzy matching ratios. Reading was fast: the 350,000 records in dataset 1 and the 14,000 records in dataset 2 were read in less than a minute. I then sort both lists separately and pass them to the FuzzyStringCompareFrom2Datasets transformer. I have been running this workspace all day (about 5 hours so far) and it has only output 288 records. Is there a way to speed this up?


david_r
8352 replies
6 years ago
August 9, 2018

How large are the strings that you're comparing using the FuzzyStringCompare?

Which format are you writing to?

What happens if you disable the writer?

Also, do you really need the two Sorters?


paalped
Contributor
130 replies
Best Answer
6 years ago
August 9, 2018

FuzzyStringCompareFrom2Datasets does not seem to be designed for datasets this large. For each of your 350 000 features, it attaches a list of the 14 000 features from the other dataset, then searches through that list and compares each string with the string attribute you chose to compare (approximately 4 900 000 000 comparisons in total). It then sorts every list by ratio (that is, it sorts a list of length 14 000, 350 000 times) and chooses the entry with the greatest accuracy. This will of course be very time consuming with the sizes you are working with.
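To make the cost concrete, here is a minimal Python sketch of the all-pairs comparison described above. It is not the transformer's actual code: `difflib.SequenceMatcher` stands in for whatever ratio the transformer computes, and the function name `best_matches` and the sample addresses are made up for illustration. It also tracks a running maximum instead of sorting each list, which is one of the two costs mentioned above; the number of comparisons stays the same.

```python
from difflib import SequenceMatcher


def best_matches(master, candidates):
    """Brute force: compare every candidate against every master string.

    Returns (results, comparisons), where results holds one
    (candidate, best_master, best_ratio) tuple per candidate.
    """
    comparisons = 0
    results = []
    for cand in candidates:
        best_ratio, best_master = 0.0, None
        for m in master:
            comparisons += 1
            ratio = SequenceMatcher(None, cand, m).ratio()
            # Keep a running maximum rather than sorting the whole list.
            if ratio > best_ratio:
                best_ratio, best_master = ratio, m
        results.append((cand, best_master, best_ratio))
    return results, comparisons


# Hypothetical sample data, only to show the shape of the problem.
master = ["12 N MAIN ST", "14 N MAIN ST", "200 OAK AVE"]
candidates = ["12 N MAIN STREET", "200 OAKE AVE"]

results, comparisons = best_matches(master, candidates)
print(comparisons)        # len(master) * len(candidates)
print(350_000 * 14_000)   # the same product at the sizes in this thread
```

The comparison count is always the product of the two dataset sizes, so at 350 000 by 14 000 it reaches the 4.9 billion figure regardless of how fast a single comparison is.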


andy_map
Author
2 replies
6 years ago
August 9, 2018

david_r wrote:

How large are the strings that you're comparing using the FuzzyStringCompare?

Which format are you writing to?

What happens if you disable the writer?

Also, do you really need the two Sorters?

The strings are over 100 characters long because I'm combining address parts earlier in the workflow (street number, prefix, name, type, suffix). I'm writing to Excel. If I disable the writer or just connect the output to an Inspector, it is just as slow. I don't need the two Sorters; I added them after the first couple of runs, thinking that sorted input might make the transformer work more efficiently.


andy_map
Author
2 replies
6 years ago
August 9, 2018

paalped wrote:

FuzzyStringCompareFrom2Datasets does not seem to be designed for datasets this large. For each of your 350 000 features, it attaches a list of the 14 000 features from the other dataset, then searches through that list and compares each string with the string attribute you chose to compare (approximately 4 900 000 000 comparisons in total). It then sorts every list by ratio (that is, it sorts a list of length 14 000, 350 000 times) and chooses the entry with the greatest accuracy. This will of course be very time consuming with the sizes you are working with.

I'm looking for a way to compare address entries (dataset 2) against a master address table (dataset 1). The dataset 2 addresses initially did not have an exact match in the master table, so it would be nice to see what other problems exist in each entry, since users can enter anything they want. I have a few ways to narrow down dataset 2 (missing data, wrong city/county), but that is still only a reduction of about 20%. Ideally the output of this transformer would show that users are entering very dirty data that is not suitable for matching against our address table, but I need to prove that. I guess there are other ways to do this, but I was looking for something fast in FME that I could run multiple times a week.
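One general way to cut the number of comparisons, sketched here in Python, is blocking: group the master addresses by a cheap key and fuzzy-compare each entry only against master records with the same key. This is a swapped-in technique, not something the transformer is known to do. The key used here (the leading token, assumed to be the street number), the function names, and the sample addresses are all hypothetical, and `difflib.SequenceMatcher` is only a stand-in for the transformer's ratio.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(address):
    """Hypothetical blocking key: the first token, assumed to be the street number."""
    parts = address.split()
    return parts[0] if parts else ""


def best_matches_blocked(master, candidates):
    """Compare each candidate only against master strings sharing its block key."""
    blocks = defaultdict(list)
    for m in master:
        blocks[block_key(m)].append(m)

    comparisons = 0
    results = []
    for cand in candidates:
        best_ratio, best_master = 0.0, None
        for m in blocks.get(block_key(cand), []):
            comparisons += 1
            ratio = SequenceMatcher(None, cand, m).ratio()
            if ratio > best_ratio:
                best_ratio, best_master = ratio, m
        # best_master stays None when no master record shares the key.
        results.append((cand, best_master, best_ratio))
    return results, comparisons


# Hypothetical sample data.
master = ["12 N MAIN ST", "14 N MAIN ST", "200 OAK AVE"]
candidates = ["12 N MAIN STREET", "200 OAKE AVE", "999 ELM RD"]

results, comparisons = best_matches_blocked(master, candidates)
print(comparisons)
for row in results:
    print(row)
```

The trade-off is that an entry with a wrong or missing street number finds no block at all and comes back with `None`. For the goal of showing how dirty the user-entered data is, those unmatched entries may be worth counting in their own right.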
