Question

Trimming invalid unicode

7 years ago
January 10, 2018
11 replies
245 views

jonathan
74 replies

Hi All,

I have a string that ends with a couple of characters that cause PostGreSQL to fatally terminate the process when it receives them. Unfortunately I'm not certain that the string /always/ ends with these characters or I'd use the substringExtractor.

I get this error:

invalid byte sequence for encoding "UTF8": 0xc0 0x80

So what I'd like to do is trim them. The problem is - How to do that without resorting to Python?

I can't just copy/paste them into an attributeTrimmer because FME represents these with the special "replacement character" ? - https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29#Replacement_character - meaning what the trimmer is searching and looking to replace is that special unicode character (?), not the actual ones that I want to remove.

If I go into the source database and select the character in it's original encoding and paste that, it just pastes a space.

There's nothing in the AttributeTrimmer docs about this sort of trimming.

StringReplacer with a regexp and using a specific code point - https://www.regular-expressions.info/unicode.html#codepoint - doesn't work because FME explicitly doesn't allow the \\u modifier. Or the \\p modifier.

If I'm really desperate I can probably write some Python to do it, but I need to keep it fairly lightweight because this is going to be run for millions of features.

Does anyone have any suggestions for how to handle this?

Note: FME 2016.0

Thanks,

Jonathan

takashi
7698 replies
7 years ago
January 10, 2018

Hi @jonathan, the error message could appear when encoding of a SQL statement doesn't match the client encoding setting of the PostgreSQL database (UTF-8 in this case).

In which situation did you get the message?

jonathan
Author
74 replies
7 years ago
January 10, 2018

takashi wrote:

Hi @jonathan, the error message could appear when encoding of a SQL statement doesn't match the client encoding setting of the PostgreSQL database (UTF-8 in this case).

In which situation did you get the message?

Hi @takashi - Good question. I don't know what the source encoding is - it's coming from an informix Text field, but it is going to utf-8 in the postgresql, and that's what raises the error. We know from experience it's only these two characters at the end of the string that cause an issue - And we definitely want to trim them off because they'll cause issues in other programs too, it's just a question of how.

+39

ebygomm
Influencer
3299 replies
7 years ago
January 10, 2018

Can you use some regex to keep extract everything but the invalid characters? I've found this easier in the past when trying to exclude non printing characters.

Do you know the code for the characters you want to remove?

You can try the syntax [\\x{0021}] in a string replacer using the relevant code instead of 0021 and replace with nothing

jonathan
Author
74 replies
7 years ago
January 10, 2018

ebygomm wrote:

Can you use some regex to keep extract everything but the invalid characters? I've found this easier in the past when trying to exclude non printing characters.

Do you know the code for the characters you want to remove?

You can try the syntax [\\x{0021}] in a string replacer using the relevant code instead of 0021 and replace with nothing

Thanks for the thought @egomm. I did try the regex with the \\x syntax, the problem is I don't know the code.

The codes I have are from the error: 0xc0 0x80 - but if you try and put them into a regexp it won't accept them (field turns red) - the "x" isn't valid hex (09a-f).

jonathan
Author
74 replies
7 years ago
January 10, 2018

I've resolved this by using a "negative" regex - I look for characters that are not words (\\w) or spaces (\\s) or quote marks. They have to be one or both of the last two characters. If they meet those criteria they get removed.

The regex is:

[^\\w\\s"]{0,2}$

I'd prefer to be explicit and search for the specified characters, but in lieu of that, this seems to work. Suggestions welcome.

lenaatsafe
275 replies
7 years ago
January 10, 2018

Hi @jonathan

have you tried to copy/paste the characters into StringReplacer and run the translation? The characters will not be displayed... in a user friendly way, but they might be preserved and handled correctly by the transformer.

Another option would be to trim the very last character of the string, extract its code with CharacterCodeExtractor, and check if the code is 0x80 - if it is, the last two characters of the original string should be trimmed.

You might also want to try to check if the original string can be represented in UTF-8 using AttributeEncoder - if indeed there are not supported characters, the AttributeEncoder will fail.

Have you visualized the data with Data Inspector? What is the displayed problem attribute encoding in DI?

lenaatsafe
275 replies
7 years ago
January 10, 2018

lenaatsafe wrote:

Hi @jonathan

You might also want to try to check if the original string can be represented in UTF-8 using AttributeEncoder - if indeed there are not supported characters, the AttributeEncoder will fail.

Have you visualized the data with Data Inspector? What is the displayed problem attribute encoding in DI?

UPD: it looks like you might be dealing with a double-byte character: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80 What is the language of your data?

jonathan
Author
74 replies
7 years ago
January 11, 2018

lenaatsafe wrote:

Hi @jonathan

You might also want to try to check if the original string can be represented in UTF-8 using AttributeEncoder - if indeed there are not supported characters, the AttributeEncoder will fail.

Have you visualized the data with Data Inspector? What is the displayed problem attribute encoding in DI?

Hi @LenaAtSafe; thanks for your thoughts.

I've tried the copy/paste into StringReplacer; see my original question for that.

SubstringExtractor->CharacterCodeExtrator = both return '65533'.

AttributeEncoder - there is no utf-8 representation for this character (or double byte chracter if that's what it is); that's the problem. DI represents it as utf-8, and uses the ?. See original question for more info.

The language is English, but it's actually a data-format that seems to terminate the strings occasionally in this weird byte pair.

lenaatsafe
275 replies
7 years ago
January 11, 2018

jonathan wrote:

Hi @LenaAtSafe; thanks for your thoughts.

I've tried the copy/paste into StringReplacer; see my original question for that.

SubstringExtractor->CharacterCodeExtrator = both return '65533'.

The language is English, but it's actually a data-format that seems to terminate the strings occasionally in this weird byte pair.

Hi @jonathan

thank you for your reply - I now understand better what test steps you've done (I did not ignore your original description, just was not sure about some of the details).

So, if the character is indeed not supported by UTF-8, AttributeEncoder will fail to transcode it - and the whole translation will fail. I wish AttributeEncoder had <Rejected> port - it would then filter out all problem strings for you to trim without failing the translation. We are adding <Rejected> port to all transformers (and all new transformers are created with <Rejected>), but it will take time to get all existing transformers updated.

Your regex solution is brilliant! Ideally you would want to find specific bytes with the regex... I am not sure whether FME can support this. Could you please try [\\xc0][\\x80] expression?

fabiofagundes
Contributor
4 replies
7 years ago
January 19, 2018

Hi Jonathan, I faced similar problem today.

My data was encoded as UTF-8, and the problem was similar to yours. I think the root cause is that FME is too intuitive and WYSIWYG oriented (What You See Is What You Get), but FME Data Inspector fails to reveal non-printable & special characters (not even raise Warnings in log files).

Please try 2 different solutions:

#1: Use TextEncoder (to URL percent encoding) in order to reveal any hidden characters; Remove the problematic characters using StringReplacer; Finally use TextDecoder to revert the URL percent encoding.

#2: Use StringReplacer to remove non-printable and undesirable characters; I've chosen to remove the range of non-printable \\x{0000}-\\x{001f} and some extended ASCII characters that are not common in my data \\x{007f}-\\x{00bf}. In StringReplacer I checked for every occurrences of these and replaced them to 'white space' or nothing. [\\x{0000}-\\x{001f}\\x{007f}-\\x{00bf}]

bilal
Contributor
14 replies
2 years ago
November 15, 2022

fabiofagundes wrote:

Hi Jonathan, I faced similar problem today.

Please try 2 different solutions:

Thanks Fabio,

Had similar issue we were struggling with.

Your #1 solution worked like a charm.

Text encoder helped us identify a zero-length space %E2%80%8B.

https://stackoverflow.com/questions/63187010/e2808b-appears-in-url-net-core

https://www.w3schools.com/tags/ref_urlencode.ASP

Reply

Rich Text Editor, editor1

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Trimming invalid unicode