How to clean strings from false bytes?

Question

Bulk copy failed on table 'tv.vagnat_framtida' using delimiter ':'. Error was 'ERROR:  invalid byte sequence for encoding "UTF8": 0xc3 0x3a
CONTEXT:  COPY vagnat_framtida, line 33

This error kept me busy for some hours exploring character encoding in shapefiles, FME and PostGIS, which did not help. Only after some digging in the data did I find the cause.

The data in the shapefile apparently comes from a qualified geodata store, and some long text fields have been truncated in the conversion to shape, leaving what appear to be incomplete character codes. This causes problems in PostGIS; the error message comes from deep within PostGIS.

I have tried to cut a few bytes from the string with SubstringExtractor, but then the whole string turned into hex. Very strange. Since the data is invalid, there seems to be no way of catching these characters with any of the FME string tools. And the error appears only in the PostGIS writer, not before, so it cannot be logged.

Basically, I am looking for suggestions on how to catch and clean the false bytes from the strings. I do not mind truncating the string further, since an unknown part is already lost. I will enclose a zipped shapefile for your perusal. See field GenBeskr, line 33, and possibly elsewhere as well.
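To illustrate what the byte pair 0xc3 0x3a in the error suggests, here is a minimal Python sketch. It assumes the field was cut in the middle of a two-byte UTF-8 character (0xC3 is the lead byte of characters such as 'ä'), so that the next byte in the bulk-copy row is the ':' delimiter (0x3A). The sample text is made up, not taken from the enclosed shapefile.

```python
# Hypothetical field value; 'ä' encodes to the two bytes 0xC3 0xA4 in UTF-8.
text = "Framtida väg"
raw = text.encode("utf-8")

# Simulate a truncation that keeps the lead byte 0xC3 but drops 0xA4.
cut = raw[: raw.index(b"\xc3") + 1]

# In a ':'-delimited row, the dangling 0xC3 is now followed by 0x3A.
row = cut + b":"

try:
    row.decode("utf-8")
    valid = True
except UnicodeDecodeError as exc:
    valid = False
    bad_offset = exc.start  # byte offset of the offending lead byte

# Dropping undecodable bytes gives a clean (slightly shorter) string.
cleaned = cut.decode("utf-8", errors="ignore")

print(valid, bad_offset, repr(cleaned))
```

Decoding with errors="ignore" silently discards the incomplete byte, which matches the stated willingness to truncate the string a little further.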

samisnunu · Accepted Answer

OK, I found the offending character for you.

If you check the field values in the Visual Preview window, or look closely in the Data Inspector, you'll notice this character (?) at the end of several strings.

So, to remove the invalid character(s), use the StringReplacer and set the mode to Replace Regular Expression. See the snapshot below.
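For readers outside FME, the same idea can be sketched in plain Python. This assumes the offending character is the Unicode replacement character U+FFFD, as the follow-up answer suggests; the pattern and the function name are illustrative and are not the expression used in the snapshot.

```python
import re

# One or more consecutive U+FFFD replacement characters (assumed pattern).
REPLACEMENT_CHARS = re.compile("\uFFFD+")


def strip_replacement_chars(value: str) -> str:
    """Remove every U+FFFD from the string, wherever it occurs."""
    return REPLACEMENT_CHARS.sub("", value)


# Hypothetical field value ending in a replacement character.
sample = "Beskrivning av v\uFFFD"
print(repr(strip_replacement_chars(sample)))
```

Anchoring the pattern with `$` would restrict the removal to the end of the string, if characters elsewhere should be left for manual review.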

matself · Answer

Thank you so much, @samisnunu. That was amazing.

I had of course seen the ? character. I know that this is the Unicode catch-all U+FFFD, the replacement character used in place of an unknown, unrecognized or unrepresentable character. I did try to replace it as plain text (which it is not), and that failed. Hence my appeal for help. Regular expression mode did the trick, at least this time.

I also believed that the ? was generated in FME when reading the shapefile, but it probably originated in the supplying system.
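As a rough illustration of how such a character can arise, here is a small Python sketch. It assumes some decoder along the way substituted U+FFFD for an incomplete byte sequence; which system actually did so in this case is not known.

```python
# A string cut off after the lead byte of a two-byte UTF-8 character.
truncated = b"v\xc3"

# A lenient decoder substitutes U+FFFD for the incomplete sequence.
decoded = truncated.decode("utf-8", errors="replace")

# Once substituted, U+FFFD is an ordinary, valid character (EF BF BD in UTF-8).
reencoded = decoded.encode("utf-8")

print(repr(decoded), reencoded.hex())
```

Because U+FFFD is itself valid, it survives any later re-encoding, so a string can carry it from the supplying system all the way into the reader.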

Thanks once again. This was a useful lesson.

/Mats,E