Question

Problem with string length


pflegpet
Contributor

This is probably a Windows 10 problem rather than an FME one, but maybe someone can help. If I use the StringLength function, I get wrong counts for words with German special characters (e.g. "Ä" or "ß") because each of them gets a length of 2 instead of 1. When I ran the same workspace on a colleague's machine, the length was calculated correctly, so I guess I have to change some region or language settings, but so far I could not find the culprit. Any idea what the issue is here? Thank you for any help.

Edit: For some reason the StringLengthCalculator transformer calculates the right length.

14 replies

david_r
Celebrity
  • June 3, 2020

Which version of FME is this? I get the same length for both @StringLength() and the StringLengthCalculator in FME 2020.

It could be linked to the usage of Unicode, where the number of bytes doesn't necessarily correspond to the number of characters. Notably, the letter "Ä" (code point U+00C4) is represented in UTF-8 by the two bytes c3 + 84 (hex). If an algorithm doesn't account for this, or if it doesn't know that it's a multibyte string, it might return the byte length (=2) rather than the character length (=1) for the string "Ä".
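To make the byte-versus-character distinction concrete, here is a minimal Python sketch. It is not FME code and says nothing about how @StringLength() is implemented; it only shows the two counts for the strings mentioned in this thread.

```python
# Character count vs. UTF-8 byte count for the same string.
text = "Ä"
utf8_bytes = text.encode("utf-8")

char_count = len(text)        # number of Unicode code points
byte_count = len(utf8_bytes)  # number of bytes in the UTF-8 encoding
hex_bytes = utf8_bytes.hex()  # the raw bytes, as hex

print(char_count, byte_count, hex_bytes)  # 1 2 c384

# A longer word: only the "ß" takes two bytes, every other letter takes one.
word = "Straße"
word_chars = len(word)                  # 6
word_bytes = len(word.encode("utf-8"))  # 7
print(word_chars, word_bytes)
```

A function that reports 2 for "Ä" is therefore most likely counting bytes (or misinterpreting them) rather than characters.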

If you send the string to e.g. the Logger in FME you should be able to see the encoding associated with the attribute, e.g.


chrisatsafe
Contributor
  • Contributor
  • June 3, 2020

Hi @kasparlov,

Sorry to hear you are running into this issue.

This has been reported in the past and is currently being tracked in our system as FMEENGINE-48508 (also posted on this idea). For the time being, please continue using the StringLengthCalculator transformer.

If you'd like to be added as a contact for the tracked issue, please submit a case and reference FMEENGINE-48508 or let me know and I can create a case on your behalf. Additionally, please be sure to upvote and comment on the linked idea!


chrisatsafe
Contributor
  • Contributor
  • June 3, 2020
david_r wrote:

Which version of FME is this? I get the same length for both @StringLength() and the StringLengthCalculator in FME 2020.

It could be linked to the usage of Unicode, where the number of bytes doesn't necessarily correspond to the number of characters. Notably, the letter "Ä" (code point U+00C4) is represented in UTF-8 by the two bytes c3 + 84 (hex). If an algorithm doesn't account for this, or if it doesn't know that it's a multibyte string, it might return the byte length (=2) rather than the character length (=1) for the string "Ä".

If you send the string to e.g. the Logger in FME you should be able to see the encoding associated with the attribute, e.g.

This is spot on with the developer comments on the tracked issue. The long-winded explanation can be found here: http://unicode.org/faq/char_combmark.html#7

It seems the expectation would be to count graphemes (what is rendered on the screen as a single logical character) rather than bytes or code units.
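As a rough illustration of graphemes versus code points, the Python sketch below compares the precomposed "Ä" with the decomposed form "A" + combining diaeresis. The helper `approx_grapheme_count` is a hypothetical name and only an approximation: it skips combining marks and does not implement the full Unicode grapheme cluster rules (emoji sequences, for example, are not handled). It is not how FME counts.

```python
import unicodedata


def approx_grapheme_count(s: str) -> int:
    """Rough grapheme estimate: count code points that are not combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


precomposed = "\u00c4"  # "Ä" as a single code point
decomposed = "A\u0308"  # "A" followed by COMBINING DIAERESIS

# Both render as one character on screen, but the code point counts differ.
print(len(precomposed), len(decomposed))  # 1 2

# Ignoring combining marks gives the count a reader would expect.
print(approx_grapheme_count(precomposed), approx_grapheme_count(decomposed))  # 1 1

# NFC normalization folds the decomposed form back into the precomposed one.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)  # True
```

Normalizing to NFC before counting covers common cases like German umlauts, but not every sequence has a precomposed form, which is why grapheme-aware counting is a separate problem.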


david_r
Celebrity
  • June 3, 2020
chrisatsafe wrote:

This is spot on with the developer comments on the tracked issue. The long-winded explanation can be found here: http://unicode.org/faq/char_combmark.html#7

It seems the expectation would be to count graphemes (what is rendered on the screen as a single logical character) rather than bytes or code units.

Yeah, it's a fairly common challenge that is far from unique to FME.


pflegpet
Contributor
  • Author
  • Contributor
  • June 3, 2020
chrisatsafe wrote:

Hi @kasparlov,

Sorry to hear you are running into this issue.

This has been reported in the past and is currently being tracked in our system as FMEENGINE-48508 (also posted on this idea). For the time being, please continue using the StringLengthCalculator transformer.

If you'd like to be added as a contact for the tracked issue, please submit a case and reference FMEENGINE-48508 or let me know and I can create a case on your behalf. Additionally, please be sure to upvote and comment on the linked idea!

Hi @chrisatsafe,

I guess StringLengthCalculator would be a workaround, but I still don't understand why the same workspace runs fine on other machines on the same FME version.

pflegpet
Contributor
  • Author
  • Contributor
  • June 3, 2020
david_r wrote:

Which version of FME is this? I get the same length for both @StringLength() and the StringLengthCalculator in FME 2020.

It could be linked to the usage of Unicode, where the number of bytes doesn't necessarily correspond to the number of characters. Notably, the letter "Ä" (code point U+00C4) is represented in UTF-8 by the two bytes c3 + 84 (hex). If an algorithm doesn't account for this, or if it doesn't know that it's a multibyte string, it might return the byte length (=2) rather than the character length (=1) for the string "Ä".

If you send the string to e.g. the Logger in FME you should be able to see the encoding associated with the attribute, e.g.

I'm on 2019.2.1, but I get the same result on 2020 RC.


david_r
Celebrity
  • June 3, 2020
pflegpet wrote:

Hi @chrisatsafe,

I guess StringLengthCalculator would be a workaround, but I still don't understand why the same workspace runs fine on other machines on the same FME version.

The answer may be found in the source dataset. Which format is it? Are there any encoding options on the reader? If the source is the HTTPCaller, make sure to specify the result encoding rather than letting FME guess. Depending on the circumstances, encoding guesses may take the OS settings into account, and that could potentially explain the issue.
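To show why a wrong encoding guess changes the length, here is a small Python sketch that decodes the same two bytes under two different assumptions. The choice of cp1252 as the "guessed" legacy code page is an assumption for illustration (it is a common default on Western European Windows installs); this is not a description of how FME guesses.

```python
# The UTF-8 bytes for "Ä", as they might arrive from a file or HTTP response.
raw = "Ä".encode("utf-8")  # b'\xc3\x84'

# Decoded with the correct encoding: one character.
as_utf8 = raw.decode("utf-8")

# Decoded with a wrongly guessed single-byte code page: each byte becomes
# its own character, so the string is two characters long.
as_cp1252 = raw.decode("cp1252")

print(len(as_utf8), len(as_cp1252))  # 1 2
print(as_cp1252 == "\u00c3\u201e")   # True: "Ã" followed by "„"
```

Stating the encoding explicitly removes the dependency on whatever default the machine happens to have.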


pflegpet
Contributor
  • Author
  • Contributor
  • June 3, 2020
david_r wrote:

The answer may be found in the source dataset. Which format is it? Are there any encoding options on the reader? If the source is the HTTPCaller, make sure to specify the result encoding rather than letting FME guess. Depending on the circumstances, encoding guesses may take the OS settings into account, and that could potentially explain the issue.

Encoding is set to UTF-8 in the reader.


david_r
Celebrity
  • June 3, 2020
pflegpet wrote:

Encoding is set to UTF-8 in the reader

Then try the tip about sending the features to the Logger directly before the StringLength calculation, to check that it still says UTF-8 on those attributes. If not, then some transformer did something with the attribute encoding along the way.


pflegpet
Contributor
  • Author
  • Contributor
  • June 3, 2020
david_r wrote:

Then try the tip about sending the features to the Logger directly before the StringLength calculation, to check that it still says UTF-8 on those attributes. If not, then some transformer did something with the attribute encoding along the way.

I just checked: it's UTF-8 both before and after the string length calculation.


pflegpet
Contributor
  • Author
  • Contributor
  • June 3, 2020

I found the problem. In the Region Settings on Windows 10 there is a checkbox "Beta: Use Unicode UTF-8 for worldwide language support", which was checked. When I unchecked the option, the string length was calculated correctly.
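For anyone comparing two machines, a quick way to see the default text encoding a process inherits from the OS is sketched below in Python. The function name `describe_text_defaults` is made up for this example, and the assumption that the checkbox above changes these values (to a UTF-8 code page instead of a legacy one such as cp1252) should be verified on your own system; it tells you nothing directly about FME internals.

```python
import locale
import sys


def describe_text_defaults() -> dict:
    """Collect the default encodings this Python process picked up from the OS."""
    return {
        "platform": sys.platform,
        # Encoding used by default for text data according to the locale settings.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


info = describe_text_defaults()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this on the working and the non-working machine and comparing the output is a cheap first check before digging into region settings by hand.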

 

 


david_r
Celebrity
  • June 4, 2020
pflegpet wrote:

I found the problem. In the Region Settings on Windows 10 there is a checkbox "Beta: Use Unicode UTF-8 for worldwide language support", which was checked. When I unchecked the option, the string length was calculated correctly.
Good find, that's really interesting. @chrisatsafe, this may be of relevance for the developers...

chrisatsafe
Contributor
  • Contributor
  • June 4, 2020
david_r wrote:
Good find, that's really interesting. @chrisatsafe, this may be of relevance for the developers...

Noted, I'll add that as a comment on the tracked issue.

