Question

CRC calculator picking up duplicates

9 years ago
February 1, 2016
3 replies
103 views

reshu
6 replies

Hi,

I am trying to compare two datasets by calculating a CRC value from (building number, street name, suburb name) for each dataset and running both through the FeatureMerger (Unreferenced port) to pick up new updates.

This has picked up quite a few records that already exist. The building number can look like 37-39 or 25A, etc. Can the CRCCalculator handle this?

david_r
8352 replies
9 years ago
February 1, 2016

As far as I can tell, the CRCCalculator defaults to a 32-bit hash value based on some or all of the feature attributes. A 32-bit integer has about 4 billion possible values, so you should be very careful if you use the hash value to compare lots of features: different features might get the same CRC even though their attributes differ (a CRC collision). The probability of this happening increases with the number of features, approaching 50% at fewer than 80,000 features. You can read more about this here.
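To put rough numbers on that, here is a minimal Python sketch of the usual birthday-problem approximation. It assumes an ideal hash that spreads values uniformly, which a CRC only approximates, and the function name collision_probability is made up for this illustration.

```python
import math


def collision_probability(n_features: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of n distinct inputs share a hash value."""
    space = 2 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)), with N possible hash values.
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n_features * (n_features - 1) / (2 * space))


for n in (1_000, 10_000, 77_163, 1_000_000):
    p32 = collision_probability(n, 32)
    p64 = collision_probability(n, 64)
    print(f"{n:>9} features: 32-bit {p32:.4f}, 64-bit {p64:.2e}")
```

With a 32-bit hash the probability crosses 50% at roughly 77,000 features, while the same feature counts with a 64-bit hash stay many orders of magnitude lower.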

Also be aware that CRC/hash algorithms by design have no tolerance, e.g.

"25A" = 7B445B5C, while

"25 A" = CD999DE9, i.e. completely different

Also note that

"plumless" = 4DDB0C25 equals

"buckeroo" = 4DDB0C25, i.e. identical CRC
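If you want to try this outside FME, here is a small Python sketch using the standard library's zlib.crc32. This assumes the common CRC-32 variant and UTF-8 encoded strings, so the values are not guaranteed to match what the CRCCalculator produces for the same attributes; the helper name crc32_hex is only for illustration.

```python
import zlib


def crc32_hex(text: str) -> str:
    """Return the zlib CRC-32 of a UTF-8 encoded string as 8 hex digits."""
    return f"{zlib.crc32(text.encode('utf-8')):08X}"


for value in ("25A", "25 A", "plumless", "buckeroo"):
    print(f"{value!r:>12} -> {crc32_hex(value)}")

# A single extra space changes the checksum,
# while two unrelated words can still share one.
print(crc32_hex("25A") == crc32_hex("25 A"))
print(crc32_hex("plumless") == crc32_hex("buckeroo"))
```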

I see that FME 2015 and later lets you choose CRC64, which should greatly reduce the probability of a collision, but that probability will never be zero unless you only have one feature.

If you need to detect duplicates, you will probably be better off skipping the CRCCalculator and just comparing the attribute values directly. Also look at the DuplicateRemover and/or Matcher for this.

You may also want to consider doing some pre-processing on your attributes before comparing them, e.g. removing all spaces and converting to either upper or lower case.
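As a rough idea of what that pre-processing could look like, here is a minimal Python sketch (for example for a PythonCaller, or for testing outside FME). The function name normalize is hypothetical, and which characters you strip is something you would need to decide for your own data.

```python
def normalize(value: str) -> str:
    """Remove all whitespace and convert to upper case before comparing."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike.
    return "".join(value.split()).upper()


for raw in ("25A", "25 A", " 25a ", "37 - 39"):
    print(f"{raw!r:>10} -> {normalize(raw)!r}")
```

With this, "25A", "25 A" and " 25a " all end up as the same string, so they will compare as equal (and would also give the same CRC, if you still want one).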

David

+9

roland.martin
94 replies
9 years ago
February 1, 2016

Really interesting, thanks. On the other hand, the trouble with comparing attributes directly is that your chances of losing matches can be high, particularly for numeric values (there was an issue here recently where somebody was comparing ArcSDE to some other format and the numbers were coming out different).

david_r
8352 replies
9 years ago
February 2, 2016

I agree that comparisons are complicated, often more complicated than anticipated. On the other hand, I don't think a CRC/hash value is going to simplify things, as the quality of the hash isn't any better than the quality of the values going into it.

I'd rather concatenate my key attributes into a long string using some sort of delimiter (think CSV) and compare two readable and intelligible strings than two CRC values where you cannot easily tell exactly what went into them.
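To illustrate the idea, here is a minimal Python sketch of building such a readable key and using it to find the records that are really new. The field names (building_number, street_name, suburb_name), the "|" delimiter and the helpers make_key and find_new are all assumptions for this example, not anything taken from FME.

```python
DELIMITER = "|"  # assumed not to occur in any of the attribute values
KEY_FIELDS = ("building_number", "street_name", "suburb_name")  # hypothetical names


def make_key(record: dict) -> str:
    """Join the normalized key attributes into one readable string."""
    parts = ("".join(str(record[f]).split()).upper() for f in KEY_FIELDS)
    return DELIMITER.join(parts)


def find_new(existing: list, incoming: list) -> list:
    """Return the incoming records whose key is not among the existing ones."""
    known = {make_key(r) for r in existing}
    return [r for r in incoming if make_key(r) not in known]


existing = [
    {"building_number": "25A", "street_name": "High Street", "suburb_name": "Newtown"},
    {"building_number": "37-39", "street_name": "King Road", "suburb_name": "Oldtown"},
]
incoming = [
    {"building_number": "25 a", "street_name": "high street", "suburb_name": "Newtown"},
    {"building_number": "12", "street_name": "Queen Street", "suburb_name": "Oldtown"},
]

for record in find_new(existing, incoming):
    print(make_key(record))
```

When a record is unexpectedly reported as new, you can look at the key itself and see which attribute differs, which you cannot do with a CRC value.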
