Question

Is it possible to extract the character encoding of an attribute?


geosander

Text attributes can have a character encoding in FME, as we all know. The encoding in use is shown next to the attribute value in the Data Inspector.


I would like to fetch that "iso-8859-1", "utf-8" or "windows-1252" value. My guess is that the answer is no, but the question is: is it possible to extract the encoding somehow? I know that the FME Objects Python API lets me detect whether the attribute is an encoded string (FMEFeature.getAttributeType() ==> FME_ATTR_ENCODED_STRING), but it doesn't tell me what the encoding is. It seems to be stored as a (hidden) attribute property though, otherwise the Data Inspector could not show it.

Depending on the answer(s) I will get here, I'm thinking of posting an idea for an EncodingExtractor transformer.
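
To illustrate why the encoding label cannot simply be guessed from the attribute value, here is a minimal plain-Python sketch. It does not use FME at all, and the sample string is made up; it only shows how the three encodings named above relate to each other for one piece of text.

```python
# Hypothetical sample text; not taken from any FME dataset.
text = "café"

# The three encodings named in the question.
encodings = ["iso-8859-1", "utf-8", "windows-1252"]

# Encode the same text once per encoding and keep the raw bytes.
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(name, raw)

# The two single-byte encodings produce identical bytes for this text,
# so the bytes alone cannot tell which of the two was used.
same_bytes = encoded["iso-8859-1"] == encoded["windows-1252"]
```

Because different encodings can yield the same bytes, the encoding name has to come from metadata, which is what the question is asking FME to expose.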

6 replies

redgeographics
Celebrity

I thought the Schema reader would be able to do that, but no (so that could be an idea too).


david_r
Celebrity
  • August 10, 2017

I'm curious, why do you need to know the encoding?

Would using an AttributeEncoder set to honor the input encoding to convert the strings to e.g. "Unicode (utf-8)" work? Having a known encoding, it should be fairly easy to take it from there.


geosander
  • Author
  • August 10, 2017
@david_r: I am reading an attribute with a PythonCaller. In the PythonCaller, I do some manipulations and concatenations and then write out a new attribute. I would like that output attribute to have the same encoding as the input attribute.

However, to preserve the encoding, I would also need to be able to specify it when calling .setAttribute() on the feature; otherwise it will be lost anyway:

 

  • Using Python 2.*, the result attribute is written as a system-encoded string (provided that the input is first converted from unicode to str using the .encode('utf8') method on the unicode object, although I would prefer to call .encode(<detected encoding>) instead).
  • Using Python 3.*, where .encode() returns a bytes object instead of a Unicode string, the result attribute is always written as a UTF-8 encoded string, even if the input was encoded as something else.

So I guess that even if it were possible to extract the encoding, it would not be possible to write the attribute with a PythonCaller using that same encoding, unless Safe changes the API. However, if I knew the input encoding, I could properly set the encoding after the PythonCaller using an AttributeEncoder like you said (but set to "Use Bytes" for the Python 2 case).
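
The round trip described here can be sketched in plain Python 3, assuming the source encoding name were somehow known (which is exactly the piece FME does not expose according to this thread). The function name, the sample value and the suffix are hypothetical, and nothing below touches the fmeobjects API; it only shows the decode, manipulate, re-encode pattern.

```python
def rebuild_with_encoding(raw_value, source_encoding, suffix):
    """Decode bytes to text, manipulate the text, re-encode with the same encoding."""
    # Work on Unicode text so that concatenation is encoding-independent.
    text = raw_value.decode(source_encoding)
    result = text + suffix
    # In Python 3, str.encode() returns a bytes object.
    return result.encode(source_encoding)


# Hypothetical input: a value that arrived encoded as windows-1252.
raw_in = "Zürich".encode("windows-1252")
raw_out = rebuild_with_encoding(raw_in, "windows-1252", " (CH)")
print(raw_out)
```

Whether FME would then tag the written attribute with that encoding is a separate matter; the sketch only covers the byte-level part.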

 


david_r
Celebrity
  • August 10, 2017
Thanks for the explanation, I see your point. Maybe post it as an idea? I'd vote for it.

You may want to consider sending this to Safe support as well.

geosander
  • Author
  • August 10, 2017
Done! :)

 

Split it into 2 ideas actually:

https://knowledge.safe.com/idea/50224/python-api-new-setattributetype-method-for-fmefeat.html

https://knowledge.safe.com/idea/50225/transformer-to-extract-attribute-character-encodin.html

 

 



