Hi @jonathan, the error message can appear when the encoding of a SQL statement doesn't match the client encoding setting of the PostgreSQL database (UTF-8 in this case).
In which situation did you get the message?
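For reference, a quick way to confirm what the client encoding actually is - a minimal sketch using psycopg2, with placeholder connection parameters:

import psycopg2

# Placeholder connection details - adjust for your own database.
conn = psycopg2.connect(host="localhost", dbname="mydb", user="me", password="secret")
print(conn.encoding)                       # e.g. 'UTF8'
with conn.cursor() as cur:
    cur.execute("SHOW client_encoding;")   # the same information via SQL
    print(cur.fetchone()[0])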
Hi @takashi - Good question. I don't know what the source encoding is - it's coming from an Informix TEXT field, but it's being written to UTF-8 in PostgreSQL, and that's what raises the error. We know from experience that it's only these two characters at the end of the string that cause an issue, and we definitely want to trim them off because they'll cause issues in other programs too; it's just a question of how.
Can you use a regex to extract everything but the invalid characters? I've found this easier in the past when trying to exclude non-printing characters.
Do you know the code for the characters you want to remove?
You can try the syntax [\x{0021}] in a StringReplacer, using the relevant code instead of 0021, and replace with nothing.
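Outside FME, the same idea looks roughly like this in Python - note that Python's re module writes code points as \xhh or \uhhhh rather than the \x{....} form, and the character below is only an example:

import re

# Example only: strip the '!' character (code point 0x21) wherever it appears.
# In StringReplacer the equivalent pattern would be [\x{0021}].
cleaned = re.sub(r"[\x21]", "", "Hello!! world!")
print(cleaned)   # -> 'Hello world'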
Thanks for the thought @egomm. I did try the regex with the \x syntax; the problem is I don't know the code.
The codes I have are from the error: 0xc0 0x80 - but if you try to put them into a regexp it won't accept them (the field turns red) - the "x" isn't valid hex (0-9, a-f).
I've resolved this by using a "negative" regex - I look for characters that are not word characters (\w), whitespace (\s), or quote marks. They have to be one or both of the last two characters. If they meet those criteria they get removed.
The regex is:
[^\w\s"]{0,2}$
I'd prefer to be explicit and search for the specified characters, but in lieu of that, this seems to work. Suggestions welcome.
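For anyone reading along, this is roughly how that pattern behaves outside FME - a Python sketch with made-up sample strings:

import re

# The "negative" pattern: drop up to two trailing characters that are not
# word characters (\w), whitespace (\s) or a double quote.
pattern = re.compile(r'[^\w\s"]{0,2}$')

samples = ["plain text", 'quoted "text"', "broken text\ufffd\ufffd"]   # made-up examples
for s in samples:
    print(repr(pattern.sub("", s)))
# -> 'plain text', 'quoted "text"', 'broken text'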
Hi @jonathan
Have you tried copying and pasting the characters into a StringReplacer and running the translation? The characters will not be displayed in a user-friendly way, but they might be preserved and handled correctly by the transformer.
Another option would be to trim the very last character of the string, extract its code with CharacterCodeExtractor, and check if the code is 0x80 - if it is, the last two characters of the original string should be trimmed.
You might also want to check whether the original string can be represented in UTF-8 using AttributeEncoder - if there are indeed unsupported characters, the AttributeEncoder will fail.
Have you visualized the data with Data Inspector? What encoding does DI display for the problem attribute?
UPD: it looks like you might be dealing with a double-byte character:
https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80 What is the language of your data?
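Outside FME, the shape of that check-then-trim logic would be something like this in Python - a sketch of the idea only, since whether the trailing character actually reports as 0x80 depends on how the reader decoded the bytes:

# Sketch of the "inspect the last character, then trim the pair" idea.
def trim_trailer(text: str) -> str:
    if text and ord(text[-1]) == 0x80:
        return text[:-2]    # drop the two-character trailer
    return text

print(repr(trim_trailer("VALUE\xc0\x80")))   # -> 'VALUE'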
Hi @LenaAtSafe; thanks for your thoughts.
I've tried the copy/paste into StringReplacer; see my original question for that.
SubstringExtractor -> CharacterCodeExtractor = both return '65533'.
AttributeEncoder - there is no UTF-8 representation for this character (or double-byte character, if that's what it is); that's the problem. DI represents it as UTF-8 and uses the '?'. See the original question for more info.
The language is English, but it's a data format that seems to occasionally terminate the strings in this weird byte pair.
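For what it's worth, 65533 is U+FFFD, the Unicode replacement character that decoders typically substitute for byte sequences they can't interpret - a quick Python check:

print(ord("\ufffd"))                                   # -> 65533
print(b"\xc0\x80".decode("utf-8", errors="replace"))   # -> two replacement characters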
Hi @jonathan
thank you for your reply - I now understand better what test steps you've taken (I did not ignore your original description, I just wasn't sure about some of the details).
So, if the character is indeed not supported by UTF-8, AttributeEncoder will fail to transcode it - and the whole translation will fail. I wish AttributeEncoder had a <Rejected> port - it would then filter out all the problem strings for you to trim, without failing the translation. We are adding a <Rejected> port to all transformers (and all new transformers are created with one), but it will take time to get all existing transformers updated.
Your regex solution is brilliant! Ideally you would want to find the specific bytes with the regex... I am not sure whether FME can support this. Could you please try a [\xc0][\x80] expression?
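If the value can be handled as raw bytes at some point (for example in a PythonCaller before re-encoding), the explicit match is straightforward - a sketch under that assumption, not something verified inside StringReplacer:

import re

# Strip a trailing 0xC0 0x80 pair at the byte level.
raw = b"SOME TEXT\xc0\x80"               # made-up sample value
cleaned = re.sub(rb"\xc0\x80$", b"", raw)
print(cleaned)                            # -> b'SOME TEXT'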
Hi Jonathan, I faced a similar problem today.
My data was encoded as UTF-8, and the problem was similar to yours. I think the root cause is that FME is too intuitive and WYSIWYG-oriented (What You See Is What You Get): FME Data Inspector fails to reveal non-printable and special characters (it doesn't even raise warnings in the log files).
Please try 2 different solutions:
#1: Use TextEncoder (to URL percent encoding) in order to reveal any hidden characters; Remove the problematic characters using StringReplacer; Finally use TextDecoder to revert the URL percent encoding.
or
#2: Use StringReplacer to remove non-printable and undesirable characters; I chose to remove the range of non-printable characters \x{0000}-\x{001f} and some extended ASCII characters that are not common in my data, \x{007f}-\x{00bf}. In StringReplacer I checked for every occurrence of these and replaced them with a white space or nothing: [\x{0000}-\x{001f}\x{007f}-\x{00bf}]
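Outside FME, the shape of solutions #1 and #2 can be sketched in Python like this - the sample strings and the exact ranges are only illustrative:

import re
from urllib.parse import quote, unquote

value = "Some value\u00a0with a hidden character"   # made-up sample (non-breaking space)

# Solution #1: percent-encode to make hidden characters visible,
# remove them, then decode back.
encoded = quote(value)
print(encoded)                                   # the hidden character shows up as %C2%A0
cleaned = unquote(encoded.replace("%C2%A0", " "))
print(repr(cleaned))                             # -> 'Some value with a hidden character'

# Solution #2: drop control characters and the extended range in one pass
# (Python's re spells the class \xhh rather than \x{00hh}).
cleaned2 = re.sub(r"[\x00-\x1f\x7f-\xbf]", "", "bad\x1avalue\x07")
print(repr(cleaned2))                            # -> 'badvalue'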
Thanks Fabio,
We had a similar issue we were struggling with.
Your #1 solution worked like a charm.
TextEncoder helped us identify a zero-width space, %E2%80%8B.
https://stackoverflow.com/questions/63187010/e2808b-appears-in-url-net-core
https://www.w3schools.com/tags/ref_urlencode.ASP
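That byte sequence is indeed what a percent-encoded zero-width space (U+200B) looks like, which is easy to confirm with a quick Python check:

from urllib.parse import quote

# U+200B ZERO WIDTH SPACE percent-encodes to the three UTF-8 bytes E2 80 8B.
print(quote("\u200b"))                    # -> '%E2%80%8B'
print("\u200b".encode("utf-8").hex())     # -> 'e2808b'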