Solved

How to clean strings from false bytes?

5 years ago
December 13, 2019
4 replies
117 views

+9

matself
Contributor
12 replies


Bulk copy failed on table 'tv.vagnat_framtida' using delimiter ':'. Error was 'ERROR:  invalid byte sequence for encoding "UTF8": 0xc3 0x3a
CONTEXT:  COPY vagnat_framtida, line 33

This error kept me busy for some hours exploring character encoding in shapefiles, FME and PostGIS. Which did not help. Not until I did some data digging and found the error.

The data in a shapefile apparently comes from a qualified geodata store, and some long text fields have been truncated in the conversion to shape, leaving what appears to be incomplete character codes. And this causes PostGIS problems. The error message comes from deep within PostGIS.

I have tried to cut a few bytes from the string with SubstringExtractor, but then the whole string became HEX. Very strange. Since it is invalid data, there seems to be no way of catching these characters with any of the FME string tools. And the error appears only in the Postgis writer, not before so it cannot be logged.

Basically, I am looking for suggestions on how to catch and clean the strings from false bytes. I do not mind truncating the string further, since an unknown part already is lost. I will enclose a zipped shapefile for your perusal. See field GenBeskr, line 33 and possibly elsewhere as well.

Best answer by samisnunu

Ok , I found the offending character for you..

if you check the field values in the Visual Preview window or closely in the Data Inspector you'll notice this character (?) at the end of the several strings.

So, to remove this invalid character(s), use the String Replacer, and set the mode to Replace Regular Expression. see snapshot below.

View original

Did this help you find an answer to your question?

+10

samisnunu
Contributor
65 replies
Best Answer
5 years ago
December 15, 2019

Ok , I found the offending character for you..

if you check the field values in the Visual Preview window or closely in the Data Inspector you'll notice this character (?) at the end of the several strings.

So, to remove this invalid character(s), use the String Replacer, and set the mode to Replace Regular Expression. see snapshot below.

+9

matself
Author
Contributor
12 replies
5 years ago
December 15, 2019

Thank you, @samisnunu so much. That was amazing.

I had of course seen the ? character. I know that this is a unicode catch-all U+FFFD, used to replace an unknown, unrecognized or unrepresentable character. And I did try to replace that as text (which it is not) and that failed. Hence my appeal for help. Mode Regular Expression did the trick, at least this time.

I also believed that the ? was generated in FME when reading the shapefile, but it probably originated in the supplying system.

Thanks once again. This was a useful lesson.

/Mats,E

E

esteban.carcamo
1 reply
2 years ago
January 11, 2023

Hello everyone, I would like to reopen this post.

I find a similar problem with a csv file.

This file has columns with an encoding problem with characters like Ã¢; Ã©; Ã´(which are respectively â, é, ô) 2023-01-11 12_02_14-FME Data Inspector - 2022.1

When I try to load thi file on my Postgis DB FME gives the error:

'ERROR: invalid byte sequence for encoding "UTF8": 0xc3 0x20

If I'm not mistaken, 0xc3 0x20 translates to (Ã + white space) but I couldn't find these two characters on Data Inspector...

For the moment I tried to transform the file with StringReplacer for the characters and AttributeEncoder for translate from ISO-8859-1 to UTF-8.

2023-01-11 12_03_02-_MULTI → POSTGIS (C__Users_EstebanCarcamo_Documents_FME_FMW_module_integ_donnees 2023-01-11 12_03_26-_MULTI → POSTGIS (C__Users_EstebanCarcamo_Documents_FME_FMW_module_integ_donnees

Data Inspector shows that all columns are in UTF-8 and no weird characters, but the same error remains.

2023-01-11 12_04_01-FME Data Inspector - 2022.1 My BD has UTF-8 encoding, Collation and Character type in English_United States.1252.

Have you any idea how can I handle this error?

Thanks!

+12

danminneyatsaf
Safer
484 replies
2 years ago
January 11, 2023

esteban.carcamo wrote: