Question

Is it possible to extract the character encoding of an attribute?

  • 10 August 2017
  • 6 replies

Badge +7

Text attributes in FME can carry a character encoding, and the Data Inspector displays which encoding is in use.

I would like to fetch that "iso-8859-1", "utf-8" or "windows-1252" value. My guess is that the answer is no, but the question stands: is it possible to extract the encoding somehow? I know that the FME Objects Python API lets me detect whether an attribute is an encoded string (FMEFeature.getAttributeType() returns FME_ATTR_ENCODED_STRING), but it doesn't tell me what the encoding is. It seems to be stored as a (hidden) attribute property, though; otherwise the Data Inspector could not show it.
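
For reference, this is roughly how far the detection gets. It is only a sketch: the helper name encoded_string_attributes is made up for illustration, and it assumes the object passed in behaves like an fmeobjects.FMEFeature (only getAllAttributeNames() and getAttributeType() are called on it).

```python
def encoded_string_attributes(feature, encoded_type):
    """Return the names of attributes whose FME type equals encoded_type.

    Hypothetical helper: `feature` is assumed to behave like an
    fmeobjects.FMEFeature. Inside a PythonCaller, `encoded_type` would be
    fmeobjects.FME_ATTR_ENCODED_STRING.
    """
    return [
        name
        for name in feature.getAllAttributeNames()
        if feature.getAttributeType(name) == encoded_type
    ]

# Inside a PythonCaller (assumption: fmeobjects has been imported there):
#   names = encoded_string_attributes(feature, fmeobjects.FME_ATTR_ENCODED_STRING)
```

This tells me which attributes are encoded strings, but not which encoding each one uses.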

Depending on the answer(s) I will get here, I'm thinking of posting an idea for an EncodingExtractor transformer.


6 replies

Userlevel 5
Badge +25

I thought the Schema reader would be able to do that, but no (so that could be an idea too).

Userlevel 4

I'm curious, why do you need to know the encoding?

Would using an AttributeEncoder set to honor the input encoding to convert the strings to e.g. "Unicode (utf-8)" work? Having a known encoding, it should be fairly easy to take it from there.

Badge +7

@david_r: I am reading an attribute with a PythonCaller. In the PythonCaller, I do some manipulations and concatenations and then I write out a new attribute. I would like that output attribute to have the same encoding as the input attribute.

 

However, in order to preserve the encoding, I should also be able to specify the encoding when calling .setAttribute() on the feature, else it will be lost anyway:

 

  • Using Python 2.*, the result attribute is written as a system-encoded string (provided that the input is first converted from unicode to str using the .encode('utf8') method on the unicode object, although I would prefer to call .encode(<detected encoding>) instead).
  • Using Python 3.*, where .encode() returns a bytes object rather than a str, the result attribute is always written as a UTF-8 encoded string, even if the input was encoded as something else.

So I guess that even if it were possible to extract the encoding, it would not be possible to write the attribute with a PythonCaller using that same encoding, unless Safe changes the API. However, if I knew the input encoding, I could properly set the encoding after the PythonCaller using an AttributeEncoder like you said (but set to "Use Bytes" for the Python 2 case).
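
To illustrate why the encoding name matters, here is a minimal plain-Python 3 sketch with no FME involved. The function name concat_preserving_encoding and the sample strings are made up for illustration, and it assumes the input encoding is already known, which is exactly the piece I cannot get from FME today.

```python
def concat_preserving_encoding(raw: bytes, suffix: str, encoding: str) -> bytes:
    """Decode raw with the known encoding, append suffix, re-encode the same way."""
    text = raw.decode(encoding)
    return (text + suffix).encode(encoding)


# Sample input, assumed to arrive as ISO-8859-1 bytes.
source = "café".encode("iso-8859-1")

# Re-encoding with the same encoding keeps one byte per accented character.
same_encoding = concat_preserving_encoding(source, " crème", "iso-8859-1")

# Encoding the same text as UTF-8 yields a different byte sequence.
as_utf8 = (source.decode("iso-8859-1") + " crème").encode("utf-8")
```

Both results represent the same text, but the bytes differ, so a downstream step that expects the original encoding would misread the UTF-8 version.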

 

Userlevel 4
Thanks for the explanation, I see your point. Maybe post it as an idea? I'd vote for it.

 

 

You may want to consider sending this to Safe support as well.
Badge +7
Done! :)

 

Split it into 2 ideas actually:

 

https://knowledge.safe.com/idea/50224/python-api-new-setattributetype-method-for-fmefeat.html

 

https://knowledge.safe.com/idea/50225/transformer-to-extract-attribute-character-encodin.html

 

 

Userlevel 4
Upvoted x2 :-)
