Solved

What causes my attribute values (in this case CASnumbers) to be distinct?

  • October 6, 2022
  • 5 replies
  • 17 views

thijsknapen
Contributor

Hi,

 

I encountered something peculiar last week. I had to compare two datasets about chemical substances measured in groundwater.

To compare the datasets, I used the ChangeDetector. While doing so, for one substance an update/change was found for the value of a CASnumber. However, when I look at the originalValue and the revisedValue, I don't see any difference. Also, when I perform a compare using (a plugin of) Notepad++, it also detects a difference in the two CASnumbers. So there seems to be a difference, I just can't seem to spot the difference ;)

 

Maybe some people here can help take a look and provide an explanation?

 

Any feedback is appreciated.

 

The CASnumbers are:

CASnumber_1 = '‎483-63-6'

CASnumber_2 = '483-63-6'

 

See also the screenshots below for a dummy workspace and the comparison using (a plugin of) Notepad++;

[Screenshots: dummy workspace; Notepad++ comparison]

5 replies

geomancer
Evangelist
  • 932 replies
  • Best Answer
  • October 7, 2022

Because they're different 😁 : CASnumber_1 contains some invisible character in front of the 4.

Saving the value of CASnumber_1 to a text file gives a file of 11 bytes. CASnumber_2 results in a text file of 8 bytes.

Comparing both files in PowerShell gives:

[Image: CASnumber_compare]

In the AttributeCreator, you can go to the start of CASnumber_1 and press Delete once. This will not visually change the value, but afterwards the TestFilter will indicate both values are equal.
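For anyone who wants to reproduce the byte counts outside of FME, here is a minimal Python sketch. It assumes the hidden character is U+200E (as identified further down in this thread) and that the text files were saved as UTF-8 without a BOM; the helper name `describe` is made up for illustration.

```python
import unicodedata

# The two values from the question. The leading U+200E in cas_1 is an
# assumption based on what the thread finds later on.
cas_1 = "\u200e483-63-6"
cas_2 = "483-63-6"

def describe(value):
    """Return (code point, Unicode name) for every character in value."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in value]

# Byte sizes when encoded as UTF-8: the invisible character takes 3 bytes.
size_1 = len(cas_1.encode("utf-8"))
size_2 = len(cas_2.encode("utf-8"))
first = describe(cas_1)[0]

print(size_1, size_2)  # 11 8
print(first)           # ('U+200E', 'LEFT-TO-RIGHT MARK')
```

Listing the code points this way makes the extra character visible even though an editor renders both values identically.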


thijsknapen
Contributor
  • Author
  • 155 replies
  • October 7, 2022

Thanks!

That indeed explains it. I guess invisible characters are not always easy to spot ;)

I opened my sample workspace in Notepad++ and made a similar observation:

<XFORM_PARM PARM_NAME="ATTR_TABLE" PARM_VALUE="&quot;&quot; CASnumber_1 SET_TO &lt;u200e&gt;483-63-6 CASnumber_2 SET_TO 483-63-6"/>


geomancer
Evangelist
  • 932 replies
  • October 7, 2022

Nice find!

U+200E is the Left-to-Right Mark, which is indeed an invisible character.

Makes one wonder why it's there...


thijsknapen
Contributor
  • Author
  • 155 replies
  • October 7, 2022

Yeah, I also found that. Also, &lt; is the XML encoding of '<' (less than) and &gt; is the XML encoding of '>' (greater than). So to me it looks like three characters, but maybe the '<' and '>' are stored/read as a kind of header/declaration, I guess as a kind of wrapper around the Left-to-Right Mark.
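To check what the XML layer alone does with that line, here is a small Python sketch that parses the XFORM_PARM element quoted above. The second step, treating a `<uXXXX>` token as an escape for a single code point, is only an assumption about how the workspace file stores such characters, and `decode_u_tokens` is a hypothetical helper, not an FME function.

```python
import re
import xml.etree.ElementTree as ET

# The XFORM_PARM line as it appears in the workspace file.
line = ('<XFORM_PARM PARM_NAME="ATTR_TABLE" PARM_VALUE="&quot;&quot; '
        'CASnumber_1 SET_TO &lt;u200e&gt;483-63-6 CASnumber_2 SET_TO 483-63-6"/>')

# Step 1: the XML parser resolves &quot;, &lt; and &gt; into literal characters.
parm_value = ET.fromstring(line).get("PARM_VALUE")

def decode_u_tokens(text):
    """Assumption: replace each <uXXXX> token with the code point it names."""
    return re.sub(r"<u([0-9a-fA-F]{4})>",
                  lambda m: chr(int(m.group(1), 16)), text)

# Step 2: turn the literal text "<u200e>" into the actual U+200E character.
decoded = decode_u_tokens(parm_value)

print(parm_value)
print("\u200e483-63-6" in decoded)
```

After step 1 the value contains the seven literal characters `<u200e>`; only under the assumption in step 2 does that become one invisible character in front of the 4.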

 

I also don't know why it's there in my dataset, but that's a different question :|

Thanks again for helping me spot it 🙂

First thing is knowing what's up, second thing is how to deal with it ;)
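As for dealing with it, one possible approach is to strip invisible formatting characters before comparing. The Python sketch below removes every character in Unicode general category Cf (format), which includes U+200E; the function name `strip_format_chars` is made up for illustration, and this is not a description of what any FME transformer does internally.

```python
import unicodedata

def strip_format_chars(value):
    """Drop every character whose Unicode general category is Cf (format)."""
    return "".join(ch for ch in value if unicodedata.category(ch) != "Cf")

dirty = "\u200e483-63-6"  # value with a leading Left-to-Right Mark
clean = "483-63-6"

cleaned = strip_format_chars(dirty)

print(dirty == clean)    # False
print(cleaned == clean)  # True
```

Category Cf also covers characters such as the zero-width joiner, which can be meaningful in some scripts, so this kind of blanket removal is safest for values like CAS numbers that should only contain digits and hyphens.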


geomancer
Evangelist
  • 932 replies
  • October 7, 2022

You're most welcome!