Skip to main content

Hi All,

 

I have a string that ends with a couple of characters that cause PostGreSQL to fatally terminate the process when it receives them. Unfortunately I'm not certain that the string /always/ ends with these characters or I'd use the substringExtractor.

 

 

 

I get this error:

 

invalid byte sequence for encoding "UTF8": 0xc0 0x80

 

 

So what I'd like to do is trim them. The problem is - How to do that without resorting to Python?

 

 

I can't just copy/paste them into an attributeTrimmer because FME represents these with the special "replacement character" ? - https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29#Replacement_character - meaning what the trimmer is searching and looking to replace is that special unicode character (?), not the actual ones that I want to remove.

 

 

If I go into the source database and select the character in it's original encoding and paste that, it just pastes a space.

 

 

There's nothing in the AttributeTrimmer docs about this sort of trimming.

 

 

StringReplacer with a regexp and using a specific code point - https://www.regular-expressions.info/unicode.html#codepoint - doesn't work because FME explicitly doesn't allow the \\u modifier. Or the \\p modifier.

 

 

If I'm really desperate I can probably write some Python to do it, but I need to keep it fairly lightweight because this is going to be run for millions of features.

 

 

Does anyone have any suggestions for how to handle this?

 

 

Note: FME 2016.0

 

 

Thanks,

 

Jonathan
Hi @jonathan, the error message could appear when encoding of a SQL statement doesn't match the client encoding setting of the PostgreSQL database (UTF-8 in this case).

 

In which situation did you get the message?

 


Hi @jonathan, the error message could appear when encoding of a SQL statement doesn't match the client encoding setting of the PostgreSQL database (UTF-8 in this case).

 

In which situation did you get the message?

 

Hi @takashi - Good question. I don't know what the source encoding is - it's coming from an informix Text field, but it is going to utf-8 in the postgresql, and that's what raises the error. We know from experience it's only these two characters at the end of the string that cause an issue - And we definitely want to trim them off because they'll cause issues in other programs too, it's just a question of how.

 


Can you use some regex to keep extract everything but the invalid characters? I've found this easier in the past when trying to exclude non printing characters.

Do you know the code for the characters you want to remove?

You can try the syntax [\\x{0021}] in a string replacer using the relevant code instead of 0021 and replace with nothing


Can you use some regex to keep extract everything but the invalid characters? I've found this easier in the past when trying to exclude non printing characters.

Do you know the code for the characters you want to remove?

You can try the syntax [\\x{0021}] in a string replacer using the relevant code instead of 0021 and replace with nothing

Thanks for the thought @egomm. I did try the regex with the \\x syntax, the problem is I don't know the code.

 

The codes I have are from the error: 0xc0 0x80 - but if you try and put them into a regexp it won't accept them (field turns red) - the "x" isn't valid hex (09a-f).

 


I've resolved this by using a "negative" regex - I look for characters that are not words (\\w) or spaces (\\s) or quote marks. They have to be one or both of the last two characters. If they meet those criteria they get removed.

The regex is:

[^\\w\\s"]{0,2}$

I'd prefer to be explicit and search for the specified characters, but in lieu of that, this seems to work. Suggestions welcome.


Hi @jonathan

have you tried to copy/paste the characters into StringReplacer and run the translation? The characters will not be displayed... in a user friendly way, but they might be preserved and handled correctly by the transformer.

Another option would be to trim the very last character of the string, extract its code with CharacterCodeExtractor, and check if the code is 0x80 - if it is, the last two characters of the original string should be trimmed.

You might also want to try to check if the original string can be represented in UTF-8 using AttributeEncoder - if indeed there are not supported characters, the AttributeEncoder will fail.

Have you visualized the data with Data Inspector? What is the displayed problem attribute encoding in DI?


Hi @jonathan

have you tried to copy/paste the characters into StringReplacer and run the translation? The characters will not be displayed... in a user friendly way, but they might be preserved and handled correctly by the transformer.

Another option would be to trim the very last character of the string, extract its code with CharacterCodeExtractor, and check if the code is 0x80 - if it is, the last two characters of the original string should be trimmed.

You might also want to try to check if the original string can be represented in UTF-8 using AttributeEncoder - if indeed there are not supported characters, the AttributeEncoder will fail.

Have you visualized the data with Data Inspector? What is the displayed problem attribute encoding in DI?

UPD: it looks like you might be dealing with a double-byte character: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80 What is the language of your data?

 

 


Hi @jonathan

have you tried to copy/paste the characters into StringReplacer and run the translation? The characters will not be displayed... in a user friendly way, but they might be preserved and handled correctly by the transformer.

Another option would be to trim the very last character of the string, extract its code with CharacterCodeExtractor, and check if the code is 0x80 - if it is, the last two characters of the original string should be trimmed.

You might also want to try to check if the original string can be represented in UTF-8 using AttributeEncoder - if indeed there are not supported characters, the AttributeEncoder will fail.

Have you visualized the data with Data Inspector? What is the displayed problem attribute encoding in DI?

Hi @LenaAtSafe; thanks for your thoughts.

 

I've tried the copy/paste into StringReplacer; see my original question for that.

 

SubstringExtractor->CharacterCodeExtrator = both return '65533'.

 

AttributeEncoder - there is no utf-8 representation for this character (or double byte chracter if that's what it is); that's the problem. DI represents it as utf-8, and uses the ?. See original question for more info.

 

The language is English, but it's actually a data-format that seems to terminate the strings occasionally in this weird byte pair.

 


Hi @LenaAtSafe; thanks for your thoughts.

 

I've tried the copy/paste into StringReplacer; see my original question for that.

 

SubstringExtractor->CharacterCodeExtrator = both return '65533'.

 

AttributeEncoder - there is no utf-8 representation for this character (or double byte chracter if that's what it is); that's the problem. DI represents it as utf-8, and uses the ?. See original question for more info.

 

The language is English, but it's actually a data-format that seems to terminate the strings occasionally in this weird byte pair.

 

Hi @jonathan

 

thank you for your reply - I now understand better what test steps you've done (I did not ignore your original description, just was not sure about some of the details).

 

So, if the character is indeed not supported by UTF-8, AttributeEncoder will fail to transcode it - and the whole translation will fail. I wish AttributeEncoder had <Rejected> port - it would then filter out all problem strings for you to trim without failing the translation. We are adding <Rejected> port to all transformers (and all new transformers are created with <Rejected>), but it will take time to get all existing transformers updated.

 

Your regex solution is brilliant! Ideally you would want to find specific bytes with the regex... I am not sure whether FME can support this. Could you please try a\\xc0] \\x80] expression?

 


Hi Jonathan, I faced similar problem today.

 

 

My data was encoded as UTF-8, and the problem was similar to yours. I think the root cause is that FME is too intuitive and WYSIWYG oriented (What You See Is What You Get), but FME Data Inspector fails to reveal non-printable & special characters (not even raise Warnings in log files).

Please try 2 different solutions:

#1: Use TextEncoder (to URL percent encoding) in order to reveal any hidden characters; Remove the problematic characters using StringReplacer; Finally use TextDecoder to revert the URL percent encoding.

or

#2: Use StringReplacer to remove non-printable and undesirable characters; I've chosen to remove the range of non-printable \\x{0000}-\\x{001f} and some extended ASCII characters that are not common in my data \\x{007f}-\\x{00bf}. In StringReplacer I checked for every occurrences of these and replaced them to 'white space' or nothing. .\\x{0000}-\\x{001f}\\x{007f}-\\x{00bf}]

 


Hi Jonathan, I faced similar problem today.

 

 

My data was encoded as UTF-8, and the problem was similar to yours. I think the root cause is that FME is too intuitive and WYSIWYG oriented (What You See Is What You Get), but FME Data Inspector fails to reveal non-printable & special characters (not even raise Warnings in log files).

Please try 2 different solutions:

#1: Use TextEncoder (to URL percent encoding) in order to reveal any hidden characters; Remove the problematic characters using StringReplacer; Finally use TextDecoder to revert the URL percent encoding.

or

#2: Use StringReplacer to remove non-printable and undesirable characters; I've chosen to remove the range of non-printable \\x{0000}-\\x{001f} and some extended ASCII characters that are not common in my data \\x{007f}-\\x{00bf}. In StringReplacer I checked for every occurrences of these and replaced them to 'white space' or nothing. .\\x{0000}-\\x{001f}\\x{007f}-\\x{00bf}]

 

Thanks Fabio,

Had similar issue we were struggling with.

 

Your #1 solution worked like a charm.

Text encoder helped us identify a zero-length space %E2%80%8B.

 

https://stackoverflow.com/questions/63187010/e2808b-appears-in-url-net-core

https://www.w3schools.com/tags/ref_urlencode.ASP

 


Reply