Hi,

I am trying to compare two datasets by calculating a CRC value over (building number, street name, suburb name) for each dataset and running the results through the FeatureMerger (Unreferenced port) to pick up new updates.

This has picked up quite a few records that already exist. The building number can look like 37-39 or 25A, etc. Can the CRCCalculator handle this?

As far as I can tell, the CRCCalculator defaults to a 32-bit hash value based on some or all of the feature attributes. A 32-bit integer has about 4 billion possible values, so you should be very careful if you use the hash value to compare lots of features: different features might in fact get the same CRC even though their attributes differ (a CRC collision), and the probability of this happening increases with the number of features. By the birthday problem, the probability of at least one collision reaches about 50% at fewer than 80,000 features. You can read more about this here.
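To put numbers on that, here is a minimal sketch of the usual birthday-problem approximation. It assumes the hash values are uniformly distributed, which is an idealisation, and the function name `collision_probability` is just something made up for illustration:

```python
import math

def collision_probability(n_features: int, hash_bits: int = 32) -> float:
    """Approximate probability that at least two of n_features share a hash.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2N)),
    where N = 2**hash_bits is the number of possible hash values.
    Assumes uniformly distributed hash values.
    """
    space = 2 ** hash_bits
    exponent = n_features * (n_features - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-exponent)

if __name__ == "__main__":
    for n in (1_000, 10_000, 77_163, 80_000):
        p32 = collision_probability(n, 32)
        p64 = collision_probability(n, 64)
        print(f"{n:>7} features: 32-bit {p32:.4f}, 64-bit {p64:.2e}")
```

Running it with `hash_bits=64` shows why a 64-bit CRC makes collisions far less likely for the same number of features, though still not impossible.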

Also be aware that CRC/hash algorithms by design have no tolerance, e.g.

"25A" = 7B445B5C, while

"25 A" = CD999DE9, i.e. completely different

Also note that

"plumless" = 4DDB0C25 equals

"buckeroo" = 4DDB0C25, i.e. identical CRC
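If you want to try this outside FME, a small sketch using Python's standard `zlib.crc32` shows both effects. This assumes the common zlib CRC-32 variant and UTF-8 encoding; I have not checked that the CRCCalculator uses exactly the same variant, so its values may differ:

```python
import zlib

def crc32_hex(text: str) -> str:
    """Return the zlib CRC-32 of the UTF-8 encoded text as 8 hex digits."""
    return f"{zlib.crc32(text.encode('utf-8')):08X}"

if __name__ == "__main__":
    # A single extra space changes the checksum entirely,
    # while two unrelated words can share one.
    for value in ("25A", "25 A", "plumless", "buckeroo"):
        print(f"{value!r:>12} -> {crc32_hex(value)}")
```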

I see that FME 2015 and later lets you choose CRC64, which should greatly reduce the probability of a collision, but the probability will never be zero unless you only have one feature.

If you need to detect duplicates, you will probably be better off skipping the CRCCalculator and just comparing the attribute values directly. Also look at the DuplicateRemover and/or Matcher for this.

You may also want to consider doing some pre-processing on your attributes before comparing them, e.g. removing all spaces and converting to either upper or lower case.
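As a rough sketch of that kind of pre-processing (the function name `normalize` is hypothetical, and whether removing all spaces is right for street names depends on your data):

```python
def normalize(value: str) -> str:
    """Remove all whitespace and convert to upper case.

    After this, "25 A", "25a" and " 25A " all compare equal.
    """
    # str.split() with no arguments splits on any run of whitespace
    return "".join(value.split()).upper()

if __name__ == "__main__":
    for raw in ("25A", "25 A", "25a", " 37 - 39 "):
        print(f"{raw!r:>12} -> {normalize(raw)!r}")
```

Inside a workspace the same idea can be applied to the attributes before they reach whichever comparison you use.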

David


Really interesting, thanks. On the other hand, the trouble with comparing attributes directly is that your chances of losing matches can be high, particularly for numeric values (there was an issue here recently where somebody was comparing ArcSDE to some other format and the numbers were coming out different).


I agree that comparisons are complicated, often more complicated than anticipated. On the other hand, I don't think a CRC/hash value is going to simplify things, as the quality of the hash isn't any better than the quality of the values going into it.

I'd rather concatenate my key attributes into a long string using some sort of delimiter (think CSV) and compare two readable and intelligible strings than two CRC values where you cannot easily tell exactly what went into them.
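To illustrate the idea outside FME, here is a minimal sketch. The field names (`building_number`, `street_name`, `suburb_name`), the `|` delimiter and both function names are assumptions for the example, not anything taken from the original workspace:

```python
KEY_FIELDS = ("building_number", "street_name", "suburb_name")  # hypothetical names

def make_key(record: dict, fields=KEY_FIELDS, delimiter: str = "|") -> str:
    """Join the key attributes into one readable string.

    Each value is trimmed and upper-cased first. The delimiter must be a
    character that never occurs in the values themselves.
    """
    return delimiter.join(str(record.get(f, "")).strip().upper() for f in fields)

def find_new_records(existing: list, incoming: list) -> list:
    """Return the incoming records whose key is not present in existing."""
    known = {make_key(r) for r in existing}
    return [r for r in incoming if make_key(r) not in known]

if __name__ == "__main__":
    old = [{"building_number": "25A", "street_name": "High St", "suburb_name": "Newtown"}]
    new = [
        {"building_number": "25a ", "street_name": "high st", "suburb_name": "NEWTOWN"},
        {"building_number": "37-39", "street_name": "High St", "suburb_name": "Newtown"},
    ]
    for rec in find_new_records(old, new):
        print(make_key(rec))
```

When a record is unexpectedly reported as new, you can read its key and see which attribute differs, which is not possible with a CRC value.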

