As far as I can tell, the CRCCalculator defaults to a 32-bit hash value based on some or all of the feature attributes. A 32-bit integer has about 4 billion possible values, so you should be very careful if you use the hash value to compare lots of features: as the number of features grows, so does the risk of CRC collisions, meaning that different features may get the same CRC even though their attributes differ. Due to the birthday paradox, the probability of at least one collision reaches roughly 50% at fewer than 80 000 features. You can read more about this here.
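If you want to check that 50% figure yourself, here is a quick sketch using the standard birthday-paradox approximation (this is generic math, not anything specific to FME):

```python
import math

def collision_probability(n, bits=32):
    """Approximate probability that at least two of n random
    hash values collide (birthday approximation:
    p = 1 - exp(-n*(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    return 1 - math.exp(-n * (n - 1) / (2 * space))

# Around 0.5 already at ~77 000 features for a 32-bit hash.
print(collision_probability(77_000))

# For a 64-bit hash the same feature count is essentially collision-free.
print(collision_probability(77_000, bits=64))
```

This also shows why CRC64 (see below) helps so much: the hash space grows by a factor of 4 billion.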
Also be aware that CRC/hash algorithms by design have no tolerance, e.g.
"25A" = 7B445B5C, while
"25 A" = CD999DE9, i.e. completely different
Also note that
"plumless" = 4DDB0C25 equals
"buckeroo" =4DDB0C25, i.e. identical CRC
I see that FME 2015 and later lets you choose CRC64, which drastically reduces the probability of a collision, but the probability will never be zero unless you only have one feature.
If you need to detect duplicates, you will probably be better off skipping the CRCCalculator and just comparing the attribute values directly. Also look at the DuplicateRemover and/or Matcher for this.
You may also want to consider doing some pre-processing on your attributes before comparing them, e.g. removing all spaces and converting to either upper or lower case.
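As a sketch of such pre-processing (the normalization rules here are just examples; adapt them to whatever differences are insignificant in your data):

```python
def normalize(value):
    """Example attribute normalization before comparison:
    remove all whitespace and fold to upper case."""
    return "".join(str(value).split()).upper()

# "25 A" and "25a" now compare as equal.
print(normalize("25 A"))
print(normalize("25a"))
```

In FME itself you could do the equivalent with a StringReplacer and a StringCaseChanger, or in a PythonCaller.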
David
Really interesting, thanks. On the other hand, the trouble with comparing attributes directly is that your chances of losing matches can be high, particularly for numeric values (there was an issue here recently where somebody was comparing ArcSDE to some other format and the numbers were coming out different).
I agree that comparisons are complicated, often more complicated than anticipated. On the other hand, I don't think a CRC/hash value is going to simplify things, as the quality of the hash isn't any better than the quality of the values going into it.
I'd rather concatenate my key attributes into a long string using some sort of delimiter (think CSV) and compare two readable, intelligible strings than two CRC values, where you cannot easily tell exactly what went into them.
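A minimal sketch of that idea (the attribute names and dict representation of a feature are hypothetical, just for illustration):

```python
def comparison_key(feature, key_attrs, sep="|"):
    """Concatenate the key attributes into one readable,
    delimiter-separated string (think CSV)."""
    return sep.join(str(feature.get(a, "")) for a in key_attrs)

a = {"id": 25, "zone": "A", "area": 103.5}
b = {"id": 25, "zone": "A", "area": 103.5}
key_attrs = ["id", "zone", "area"]

print(comparison_key(a, key_attrs))   # 25|A|103.5
print(comparison_key(a, key_attrs) == comparison_key(b, key_attrs))
```

Unlike a CRC, a mismatch immediately shows you which attribute differed. Just pick a delimiter that cannot occur in the values themselves, otherwise two different attribute sets could concatenate to the same string.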