Question

How to properly read ESRI Shapefile attributes that are in another language



I'm attempting to perform some transformations on an ESRI shapefile in FME and am running into some issues. The attributes within the shapefile are all in Spanish. When I export back out to ESRI shapefile, the output does not retain the shapefile's original character encoding; incorrect special characters are inserted instead.

Here is a view of the shapefile attributes table in ArcCatalog:

Here is a view of the shapefile through FME Inspector:

As you can see, FME does not seem to recognize Spanish characters. This hasn't been a problem for us when writing from a shapefile directly into our Microsoft SQL database: we just updated the user attribute types in the writer from varchar to nvarchar, and that handled the character encoding issue perfectly. Shapefiles created in Spanish with ESRI and written directly to our SQL database this way retained their original character encoding.

For another workflow, we needed to clip a large dataset (Highways covering all of Mexico) to smaller datasets (Highways clipped by state borders) so we used FME to perform our transformations. However, the result was a shapefile with unrecognizable characters within the attributes table.

When we attempted to write the new shapefiles created through FME to MySQL using a workflow identical to the one I described above, the unrecognizable characters were written to the database even after updating the attribute type column within the writer feature type properties.

So we've identified that FME is causing an issue but aren't sure how to update FME to handle these character encoding issues. Does anyone know how to get FME to recognize and properly handle these cases?


10 replies


Hi @jcroff, it seems that the ESRISHAPE reader reads Shapefiles using the OS default encoding unless told otherwise. This kind of problem can therefore occur if the source Shapefile was created with a different encoding.

Check whether the source Shapefile comes with a "*.cpg" file. It's a plain text file that names the encoding used to create the Shapefile.

[Correction] If the cpg file exists, the reader will use the encoding described in the file. If it doesn't exist, the reader will use the default encoding of the OS. Possible causes are therefore a missing cpg file or a cpg file that names the wrong encoding. In either case, you can specify the correct encoding through the "Character Encoding" parameter of the reader.
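
The cpg check described above can also be done outside FME. The following is a minimal Python sketch, not FME's own logic: the helper name `declared_encoding` is hypothetical, and it only assumes that the sidecar shares the shapefile's base name, uses a lowercase `.cpg` extension, and holds the encoding name as plain text.

```python
from pathlib import Path
from typing import Optional


def declared_encoding(shp_path: str) -> Optional[str]:
    """Return the encoding named in the .cpg sidecar, or None if there is none.

    Hypothetical helper: assumes the sidecar sits next to the .shp file
    and shares its base name.
    """
    cpg = Path(shp_path).with_suffix(".cpg")
    if not cpg.is_file():
        return None  # no sidecar, so a reader has to fall back to a default
    text = cpg.read_text(encoding="ascii", errors="replace").strip()
    return text or None  # treat an empty sidecar like a missing one
```

If this returns `None`, or a name that does not match how the data was really written, the encoding has to be supplied explicitly on the reader.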


The Shapefile Writer also has a setting for character encoding. Are you setting that? If you can write it correctly to SQL Server then the reading and transformation part should be OK.

I'm not quite clear about the screenshot above. Is that the view of the Shapefile in the Data Inspector before or after it has been translated? I'm assuming it is after?

Hey Mark, the first screenshot is a look at the attribute table of the shapefile through ArcCatalog. The next screenshot is a view of the attribute table of the shapefile through FME Data Inspector.

When I read the shapefile into FME, do some transformations on the data, and write it to a new shapefile, the incorrect character encoding persists.

I'm not seeing a cpg file for my shapefile. I'm taking a look at the parameters in the navigator right now, and I'm not seeing where I could set this advanced parameter.

I don't know why the Advanced section is not shown, but the Character Encoding parameter exists here.

You have set the parameter to "utf-8". Is your Shapefile actually encoded in UTF-8?

Hi @jcroff

I agree with @Mark2AtSafe: both the Shape Reader and the Shape Writer have Character Encoding parameters (please check Parameters when adding the Reader/Writer, or look in the Navigator under the Reader/Writer Parameters).

The Spanish data is most likely in either Windows-1252 or UTF-8. By default (with no cpg file and no explicitly set character encoding), the data is assumed to be in the system default encoding, which is quite likely Windows-1252 in your case. If reading with the default system encoding produced garbled characters, please try UTF-8.
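
To see why picking the wrong one of these two encodings garbles the text, the sketch below decodes the same value both ways. It is a plain Python illustration, not a description of FME internals; the sample value `En operación` is an assumption based on the attribute discussed in this thread, and `cp1252` is Python's codec name for Windows-1252.

```python
value = "En operación"

utf8_bytes = value.encode("utf-8")     # ó is stored as two bytes: C3 B3
cp1252_bytes = value.encode("cp1252")  # ó is stored as one byte: F3

# Reading UTF-8 bytes as Windows-1252 raises no error; it silently
# produces mojibake, because C3 and B3 are both valid cp1252 characters.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # En operaciÃ³n

# Decoding with the encoding the data was written in round-trips cleanly.
assert utf8_bytes.decode("utf-8") == value
assert cp1252_bytes.decode("cp1252") == value
```

The two byte strings differ only at the accented character, which is why plain ASCII attributes look fine under either setting.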

Hello, I tried UTF in both my reader and my writer and the characters are still all jumbled up.

I've changed the character encoding to Unicode 8-bit (utf-8) in my reader and my writer. When I take a look at the data attributes through FME Data Inspector, the data remains garbled.

When I view the shapefile in ArcCatalog, it's better but still not correct. I've highlighted the two issues I found: 'En operación' was altered to 'En operaci'. Ultimately, we'll write to MySQL, but we have some intermediate processing to perform on the data before we load it. We just want to ensure that we retain the correct encoding throughout the process.
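
One plausible explanation for the truncated value, offered as an assumption rather than a diagnosis of what FME does internally: if the attribute bytes are really Windows-1252 but are decoded as UTF-8, the byte for the accented character is not valid UTF-8, and the text before it is exactly 'En operaci'. A small Python sketch of that situation:

```python
# Bytes as a Windows-1252 writer would store the attribute value.
raw = "En operación".encode("cp1252")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that is not valid UTF-8.
    valid_prefix = raw[:err.start].decode("utf-8")
    bad_byte = raw[err.start]

print(valid_prefix)   # En operaci
print(hex(bad_byte))  # 0xf3
```

A decoder that stops at the first invalid byte would keep only that prefix, which matches the symptom seen here.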


@jcroff

Could you please submit a support request regarding this problem? Please e-mail your workspace, together with a sample of the source data, to support@safe.com

The data can get garbled:

  • at reading (but this should be controlled by the Reader Character Encoding parameter);
  • at writing (but this shouldn't be a problem with the Writer Character Encoding parameter either);
  • during processing - we need to take a closer look at this step.

We will find what is causing the problem and help you fix it.


We've tracked down the issue after some trial and error. After running into problems with the reader and writer character encoding set to Unicode (UTF-8), we decided to leave the character encoding unset to see what our attributes would look like. As it turns out, we had no further character encoding issues after that.

FME detects your default system character encoding and uses it if no explicit option is chosen. We therefore concluded that we should explicitly set the character encoding option to Windows Latin-1 ANSI (windows-1252). This ensures that if we share our workspace, the result will not be system-dependent.
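
Before fixing a writer to a single-byte encoding such as windows-1252, it can be worth confirming that every attribute value is representable in it. The following is a rough Python sketch of such a check, run outside FME; the helper name `unencodable` and the sample values are hypothetical, and `cp1252` is Python's codec name for Windows-1252.

```python
from typing import Iterable, List


def unencodable(values: Iterable[str], encoding: str = "cp1252") -> List[str]:
    """Return the values that cannot be written losslessly in `encoding`."""
    bad = []
    for v in values:
        try:
            v.encode(encoding)
        except UnicodeEncodeError:
            bad.append(v)
    return bad


# Spanish text fits in Windows-1252; the Polish place name does not.
sample = ["En operación", "Año 2015", "Łódź"]
print(unencodable(sample))  # ['Łódź']
```

An empty result means the chosen encoding can hold the data; anything listed would need UTF-8 or another encoding instead.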

Reader settings:

Writer Settings:

Resulting attributes:

If you take a look at the CONDICION field, you can see the accent mark over the o in En operación. This dataset is now correctly encoded.
