Question

Why is the CRCCalculator giving differing results for things that should be the same?

  • 19 January 2024
  • 6 replies
  • 25 views


I have two sets of address data that I'm trying to match up using the CRCCalculator, as described here: https://community.safe.com/s/article/creating-a-unique-identifier-crccalculator.

 

Geometry isn't a concern so there are only the 3 fields to match up but I'm getting different CRC values for things that should be the same.

image

Some of the algorithms give a handful of matches, but no more than a couple of dozen (out of about 12,000).

 

I've checked for trailing white spaces. What else is making it think these things are different?

 

Thanks


6 replies

Userlevel 6
Badge +36

What is the configuration of the CRCCalculator?

Userlevel 5

A couple of things:

  • It seems you're using a 16-bit CRC. If you have more than a handful of features you should consider using a different algorithm to avoid collisions (false positives), e.g. MD5
  • In the configuration, make sure to set "Calculate CRC on" = "Selected attributes only" and then manually select the attributes you want included. Otherwise there's a small chance that unexposed attributes (e.g. format attributes) are included, throwing off the results.
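
To put a rough number on the collision risk mentioned in the first point, here is a minimal Python sketch using the standard birthday approximation. It assumes checksum values are spread uniformly over the available bit width, which is an idealisation and not a statement about how the CRCCalculator behaves internally; the function names are made up for illustration.

```python
import math


def expected_colliding_pairs(n_items: int, bits: int) -> float:
    """Expected number of pairs sharing a checksum, assuming uniform values."""
    return n_items * (n_items - 1) / 2 / 2**bits


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: chance that at least one pair collides."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_colliding_pairs(n_items, bits))


if __name__ == "__main__":
    # 12,000 features, as in the question; bit widths chosen for comparison.
    for bits in (16, 32, 128):
        pairs = expected_colliding_pairs(12_000, bits)
        prob = collision_probability(12_000, bits)
        print(f"{bits:>3}-bit: expected colliding pairs = {pairs:.3g}, "
              f"P(at least one) = {prob:.3g}")
```

With about 12,000 features, a 16-bit checksum gives over a thousand expected colliding pairs under this assumption, so false positives are all but certain. Note that collisions produce extra matches, not missing ones, so they would not explain the low match count in the question.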
Badge +1

image

Badge +1

Hi @david_r​ 

 

I have tried all the algorithms. The most matches any of them managed was 32 out of twelve thousand records. I've also tried the ChangeDetector and I'm getting around eleven thousand matches (which is about what I would've expected).

 

It is set to calculate on the 3 address fields and nothing else.

Userlevel 6
Badge +36

I wonder a) whether the data types of the fields differ between the two data sets, and b) whether that would even affect anything.

 

Can you share a subset of your data?

Badge +1

Thanks @hkingsbury​ 

 

So it wasn't that the data were of different types; it was that the source fields had different lengths (even though the extra space was unused). I concatenated the six original fields into two new ones and then everything joined up.
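
For anyone hitting the same thing, the effect can be reproduced outside FME. The sketch below uses Python's `zlib.crc32`, not the CRCCalculator itself, and assumes the unused field width ended up in the checksum input as padding (spaces or NUL characters here; the thread does not say what the padding was). The `crc_key` helper and the sample address are hypothetical.

```python
import zlib


def crc_key(fields, normalise=True):
    """CRC-32 over a list of field values, optionally stripping padding."""
    parts = []
    for value in fields:
        text = str(value)
        if normalise:
            # Assumption: padding is trailing spaces or NUL characters.
            text = text.rstrip(" \x00")
        parts.append(text)
    # Join with a separator unlikely to occur in address data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    joined = "\x1f".join(parts)
    return zlib.crc32(joined.encode("utf-8"))


if __name__ == "__main__":
    # Same address, but the second source stores it in fixed-width fields.
    source_a = ["12", "High Street", "Leeds"]
    source_b = ["12".ljust(10), "High Street".ljust(40), "Leeds".ljust(30)]

    print(crc_key(source_a, normalise=False) == crc_key(source_b, normalise=False))
    print(crc_key(source_a) == crc_key(source_b))
```

A checksum only matches when the input bytes are identical, so any per-source difference in padding, width, or encoding has to be normalised away before the values are compared.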
