Solved

How to clean strings from false bytes?

  • 13 December 2019
  • 4 replies
  • 28 views

Badge +8

Bulk copy failed on table 'tv.vagnat_framtida' using delimiter ':'. Error was 'ERROR:  invalid byte sequence for encoding "UTF8": 0xc3 0x3a
CONTEXT:  COPY vagnat_framtida, line 33

This error kept me busy for some hours exploring character encoding in shapefiles, FME and PostGIS. Which did not help. Not until I did some data digging and found the error.

The data in a shapefile apparently comes from a qualified geodata store, and some long text fields have been truncated in the conversion to shape, leaving what appears to be incomplete character codes. And this causes PostGIS problems. The error message comes from deep within PostGIS. 

I have tried to cut a few bytes from the string with SubstringExtractor, but then the whole string became HEX. Very strange. Since it is invalid data, there seems to be no way of catching these characters with any of the FME string tools. And the error appears only in the Postgis writer, not before so it cannot be logged.

Basically, I am looking for suggestions on how to catch and clean the strings from false bytes. I do not mind truncating the string further, since an unknown part already is lost. I will enclose a zipped shapefile for your perusal. See field GenBeskr, line 33 and possibly elsewhere as well.

icon

Best answer by samisnunu 15 December 2019, 05:33

View original

4 replies

Badge +2

Ok , I found the offending character for you..

if you check the field values in the Visual Preview window or closely in the Data Inspector you'll notice this character (?) at the end of the several strings.

So, to remove this invalid character(s), use the String Replacer, and set the mode to Replace Regular Expression. see snapshot below.

Badge +8

Thank you, @samisnunu so much. That was amazing.

 

I had of course seen the ? character. I know that this is a unicode catch-all U+FFFD, used to replace an unknown, unrecognized or unrepresentable character. And I did try to replace that as text (which it is not) and that failed. Hence my appeal for help. Mode Regular Expression did the trick, at least this time.

 

I also believed that the ? was generated in FME when reading the shapefile, but it probably originated in the supplying system.

 

Thanks once again. This was a useful lesson.

/Mats,E

Hello everyone, I would like to reopen this post.

I find a similar problem with a csv file.

This file has columns with an encoding problem with characters like â; é; ô(which are respectively â, é, ô)2023-01-11 12_02_14-FME Data Inspector - 2022.1 

When I try to load thi file on my Postgis DB FME gives the error:

'ERROR: invalid byte sequence for encoding "UTF8": 0xc3 0x20

If I'm not mistaken, 0xc3 0x20 translates to (Ã + white space) but I couldn't find these two characters on Data Inspector...

 

For the moment I tried to transform the file with StringReplacer for the characters and AttributeEncoder for translate from ISO-8859-1 to UTF-8.

2023-01-11 12_03_02-_MULTI → POSTGIS (C__Users_EstebanCarcamo_Documents_FME_FMW_module_integ_donnees2023-01-11 12_03_26-_MULTI → POSTGIS (C__Users_EstebanCarcamo_Documents_FME_FMW_module_integ_donnees 

Data Inspector shows that all columns are in UTF-8 and no weird characters, but the same error remains.

 

2023-01-11 12_04_01-FME Data Inspector - 2022.1My BD has UTF-8 encoding, Collation and Character type in English_United States.1252.

 

Have you any idea how can I handle this error?

 

Thanks!

 

 

 

Userlevel 2
Badge +10

Hello everyone, I would like to reopen this post.

I find a similar problem with a csv file.

This file has columns with an encoding problem with characters like â; é; ô(which are respectively â, é, ô)2023-01-11 12_02_14-FME Data Inspector - 2022.1 

When I try to load thi file on my Postgis DB FME gives the error:

'ERROR: invalid byte sequence for encoding "UTF8": 0xc3 0x20

If I'm not mistaken, 0xc3 0x20 translates to (Ã + white space) but I couldn't find these two characters on Data Inspector...

 

For the moment I tried to transform the file with StringReplacer for the characters and AttributeEncoder for translate from ISO-8859-1 to UTF-8.

2023-01-11 12_03_02-_MULTI → POSTGIS (C__Users_EstebanCarcamo_Documents_FME_FMW_module_integ_donnees2023-01-11 12_03_26-_MULTI → POSTGIS (C__Users_EstebanCarcamo_Documents_FME_FMW_module_integ_donnees 

Data Inspector shows that all columns are in UTF-8 and no weird characters, but the same error remains.

 

2023-01-11 12_04_01-FME Data Inspector - 2022.1My BD has UTF-8 encoding, Collation and Character type in English_United States.1252.

 

Have you any idea how can I handle this error?

 

Thanks!

 

 

 

@esteban.carcamo​ I would expect the AttributeEncoder to work in this case. As a workaround, you can define the encoding of the data in the CSV Reader itself when you add it to the workspace. When adding the CSV Reader, click the parameters button and then change the Character Encoding to Unicode 8-bit (utf-8).

image 

Testing this myself, I was able to write the â, é, ô characters to a PostgreSQL table. Let me know if this works for you. (Note: I am using FME Workbench 2022.2)

Reply