Question

Problem with string length


This is probably more of a Windows 10 problem, but maybe someone can help. If I use the StringLength function, I get wrong counts for words with German special characters (e.g. "Ä" or "ß") because each of them gets a length of 2 instead of 1. When I ran the same workspace on a colleague's machine, the length was calculated correctly, so I guess I have to change some region or language settings, but so far I have not found the culprit. Any idea what the issue is? Thank you for any help.

Edit: For some reason the StringLengthCalculator Transformer calculates the right length.

Which version of FME is this? I get the same length for both @StringLength() and the StringLengthCalculator in FME 2020.

It could be linked to the usage of Unicode, where the number of bytes doesn't necessarily correspond to the number of characters. Notably, the letter "Ä" is represented in UTF-8 by the two bytes c3 + 84 (hex). If an algorithm doesn't account for this, or if it doesn't know that it's a multibyte string, it might return the byte length (=2) rather than the character length (=1) for the string "Ä".
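To illustrate the difference outside FME, here is a minimal Python sketch. It is not how FME computes lengths; the helper name `lengths` is made up for this example. It simply compares the number of code points in a string with the number of bytes in its UTF-8 encoding.

```python
def lengths(text):
    """Return (code point count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))


# Escapes are used so the example does not depend on the file's own encoding.
samples = [
    "A",            # plain ASCII: 1 code point, 1 byte
    "\u00c4",       # Ä: 1 code point, 2 bytes (c3 84)
    "\u00df",       # ß: 1 code point, 2 bytes (c3 9f)
    "Stra\u00dfe",  # Straße: 6 code points, 7 bytes
]

for sample in samples:
    chars, utf8_bytes = lengths(sample)
    print(f"{sample}: {chars} character(s), {utf8_bytes} UTF-8 byte(s)")
```

A function that reports the second number instead of the first would give exactly the "2 instead of 1" result described in the question.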

If you send the string to e.g. the Logger in FME, you should be able to see the encoding associated with the attribute.


Hi @kasparlov,

Sorry to hear you are running into this issue.

This has been reported in the past and is currently being tracked in our system as FMEENGINE-48508 (also posted on this idea). For the time being, please continue using the StringLengthCalculator transformer.

If you'd like to be added as a contact for the tracked issue, please submit a case and reference FMEENGINE-48508 or let me know and I can create a case on your behalf. Additionally, please be sure to upvote and comment on the linked idea!


This is spot on with the developer comments on the tracked issue. The long-winded explanation can be found here: http://unicode.org/faq/char_combmark.html#7

It seems the expectation would be to count graphemes (what is rendered on screen as a single logical character) rather than bytes or code units.
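As a rough illustration of why graphemes, code points and bytes can all differ, here is a small Python sketch. Python's standard library has no full grapheme segmentation, so the helper `approx_grapheme_count` (a name invented for this example) only skips combining marks. That is a simplification and not the complete Unicode segmentation rules.

```python
import unicodedata


def approx_grapheme_count(text):
    """Approximate the grapheme count by ignoring combining marks.

    This is a simplification: it handles base letter + combining accent,
    but not emoji sequences or other complex clusters.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


precomposed = "\u00c4"   # Ä stored as a single code point
decomposed = "A\u0308"   # A followed by COMBINING DIAERESIS, renders as Ä

for label, text in [("precomposed", precomposed), ("decomposed", decomposed)]:
    print(
        label,
        "code points:", len(text),
        "UTF-8 bytes:", len(text.encode("utf-8")),
        "approx. graphemes:", approx_grapheme_count(text),
    )

# NFC normalization composes the pair back into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both forms look identical on screen, yet the decomposed one has 2 code points and 3 UTF-8 bytes, so "length" depends on which unit is being counted.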


Yeah, it's a fairly common challenge that is far from unique to FME.


Hi @chrisatsafe,

I guess StringLengthCalculator would be a workaround, but I still don't understand why the same workspace runs fine on other machines with the same FME version.

I'm on 2019.2.1 but get the same result on 2020 RC.


The answer may lie in the source dataset. Which format is it? Are there any encoding options on the reader? If the source is the HTTPCaller, make sure to specify the result encoding rather than letting FME guess. Encoding guesses may take the OS settings into account, depending on the circumstances, and that could potentially explain the issue.
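For what it's worth, the effect of a wrong encoding guess is easy to reproduce in plain Python. The sketch below assumes, purely for illustration, that the guessed encoding is Windows-1252; which encoding FME would actually guess on a given machine is not stated in this thread.

```python
# UTF-8 bytes for the single character Ä: c3 84
raw = "\u00c4".encode("utf-8")

# Decoded with the correct encoding, the two bytes become one character.
as_utf8 = raw.decode("utf-8")

# Decoded with a single-byte code page (Windows-1252 is an assumption here),
# each byte becomes its own character, so the string is now two characters long.
as_cp1252 = raw.decode("cp1252")

print(len(as_utf8), as_utf8)      # 1 Ä
print(len(as_cp1252), as_cp1252)  # 2 Ã„
```

If the data were misread like this, the length of 2 would be "correct" for the misdecoded string, which is why checking the encoding reported on the attribute is a useful first step.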


Encoding is set to UTF-8 in the reader


Then try the tip about sending the features to the Logger directly before the StringLength calculation, to check that it still says UTF-8 on those attributes. If not, then some transformer did something with the attribute encoding along the way.


I just checked - it's UTF-8 before and after the string length calculation


I found the problem. In the Region Settings on Windows 10 there is a checkbox "Beta: Use Unicode UTF-8 for worldwide language support" which was checked. When I unchecked the option the string length was calculated correctly.
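A possible explanation, sketched in Python: that Windows option switches the system's active code page to UTF-8. If the affected function measures the string in bytes of the active code page (an assumption, nothing in this thread confirms how it is implemented), the same text would have a different "length" depending on the setting. The helper name `byte_length` is made up for this example.

```python
def byte_length(text, code_page):
    """Number of bytes needed to store text in the given code page."""
    return len(text.encode(code_page))


word = "\u00c4\u00df"  # the two characters Ä and ß

# Typical Western European setting (Windows-1252): one byte per character.
print("cp1252:", byte_length(word, "cp1252"))  # 2

# With the UTF-8 option enabled: two bytes per character for these letters.
print("utf-8: ", byte_length(word, "utf-8"))   # 4
```

That would match the observation that the same workspace and FME version give different results on two machines.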



Good find, that's really interesting. @chrisatsafe this may be of relevance for the developers...

Noted, I'll add that as a comment on the tracked issue.

