Question

Is it possible to extract the character encoding of an attribute?

7 years ago
August 10, 2017
6 replies
244 views

geosander
330 replies

Text attributes can have a character encoding in FME, as we all know. The encoding that is used is shown in the Data Inspector, for instance:

I would like to fetch that "iso-8859-1", "utf-8" or "windows-1252" value. My guess is that the answer is no, but the question is: is it possible to extract the encoding somehow? I know that the FME Objects Python API allows me to detect whether the attribute is an encoded string (FMEFeature.getAttributeType() ==> FME_ATTR_ENCODED_STRING), but it doesn't tell me what the encoding is. It seems to be stored as a (hidden) attribute property though, otherwise the Data Inspector could not show it.
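For reference, here is a rough sketch of the type check described above. It only relies on the getAllAttributeNames() and getAttributeType() methods of FMEFeature. The helper name encoded_string_attributes, the FakeFeature class and the ENCODED placeholder value are made up so that the snippet runs outside FME; inside a PythonCaller you would pass the real feature and fmeobjects.FME_ATTR_ENCODED_STRING instead.

```python
def encoded_string_attributes(feature, encoded_type):
    """Return the names of attributes whose type equals encoded_type.

    Inside FME, pass fmeobjects.FME_ATTR_ENCODED_STRING as encoded_type.
    """
    return [
        name
        for name in feature.getAllAttributeNames()
        if feature.getAttributeType(name) == encoded_type
    ]


class FakeFeature:
    """Hypothetical stand-in for FMEFeature so the sketch runs outside FME."""

    def __init__(self, types):
        self._types = types  # attribute name -> type code

    def getAllAttributeNames(self):
        return list(self._types)

    def getAttributeType(self, name):
        return self._types[name]


ENCODED = 99  # placeholder value, NOT the real FME constant
demo = FakeFeature({"name": ENCODED, "count": 1})
encoded_names = encoded_string_attributes(demo, ENCODED)
```

This tells you which attributes are encoded strings, but, as said, not which encoding they use.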

Depending on the answer(s) I will get here, I'm thinking of posting an idea for an EncodingExtractor transformer.


redgeographics
Celebrity
3643 replies
7 years ago
August 10, 2017

I thought the Schema reader would be able to do that, but no (so that could be an idea too).

david_r
8355 replies
7 years ago
August 10, 2017

I'm curious, why do you need to know the encoding?

Would using an AttributeEncoder set to honor the input encoding to convert the strings to e.g. "Unicode (utf-8)" work? Having a known encoding, it should be fairly easy to take it from there.

geosander
Author
330 replies
7 years ago
August 10, 2017

david_r wrote:

I'm curious, why do you need to know the encoding?

Would using an AttributeEncoder set to honor the input encoding to convert the strings to e.g. "Unicode (utf-8)" work? Having a known encoding, it should be fairly easy to take it from there.

@david_r: I am reading an attribute with a PythonCaller. In the PythonCaller, I do some manipulations and concatenations and then I write out a new attribute. I would like that output attribute to have the same encoding as the input attribute.

However, in order to preserve the encoding, I should also be able to specify the encoding when calling .setAttribute() on the feature, else it will be lost anyway:

Using Python 2.*, the result attribute is written as a system encoded string (provided that the input is converted from Unicode to str first using the .encode('utf8') method on the Unicode object - although I would prefer to call .encode(<detected encoding>) instead).
Using Python 3.*, that returns a bytes object instead of a Unicode object, the result attribute is always written as a UTF-8 encoded string, even if the input was encoded as something else.

So I guess that even when it's possible to extract the encoding, it will not be possible to write it with a PythonCaller using that same encoding, unless Safe changes the API. However, if I knew the input encoding, I could properly set the encoding after the PythonCaller using an AttributeEncoder like you said (but set to "Use Bytes" for the Python 2 case).
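To illustrate what I am after, here is a minimal sketch in plain Python 3, without any FME API calls. It assumes the encoding name is already known (which is exactly what cannot be read from the feature at the moment) and that the attribute value is available as raw bytes. The function name rewrite_preserving_encoding and the sample values are made up for the example.

```python
def rewrite_preserving_encoding(raw: bytes, encoding: str, suffix: str) -> bytes:
    """Decode, manipulate and re-encode a value using one and the same encoding."""
    text = raw.decode(encoding)      # bytes -> str, using the known encoding
    result = text.strip() + suffix   # the "manipulations and concatenations"
    return result.encode(encoding)   # str -> bytes, same encoding as the input


# Sample input: a Latin-1 encoded value (the encoding is assumed to be known).
original = "café".encode("iso-8859-1")
out = rewrite_preserving_encoding(original, "iso-8859-1", " crème")

# The same text encoded as UTF-8 yields different bytes, which is why
# losing the encoding on output matters.
utf8_version = "café crème".encode("utf-8")
```

Here out holds the Latin-1 bytes of the concatenated text, while utf8_version is a different byte sequence for the same characters.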

david_r
8355 replies
7 years ago
August 10, 2017

geosander wrote:

However, in order to preserve the encoding, I should also be able to specify the encoding when calling .setAttribute() on the feature, else it will be lost anyway:

Using Python 2.*, the result attribute is written as a system encoded string (provided that the input is converted from Unicode to str first using the .encode('utf8') method on the Unicode object - although I would prefer to call .encode(<detected encoding>) instead).
Using Python 3.*, that returns a bytes object instead of a Unicode object, the result attribute is always written as a UTF-8 encoded string, even if the input was encoded as something else.

Thanks for the explanation, I see your point. Maybe post it as an idea? I'd vote for it.

You may want to consider sending this to Safe support as well.

geosander
Author
330 replies
7 years ago
August 10, 2017

david_r wrote:

Thanks for the explanation, I see your point. Maybe post it as an idea? I'd vote for it.

You may want to consider sending this to Safe support as well.

Done! :)

Split it into 2 ideas actually:

https://knowledge.safe.com/idea/50224/python-api-new-setattributetype-method-for-fmefeat.html

https://knowledge.safe.com/idea/50225/transformer-to-extract-attribute-character-encodin.html

david_r
8355 replies
7 years ago
August 10, 2017

geosander wrote:

Done! :)

Split it into 2 ideas actually:

https://knowledge.safe.com/idea/50224/python-api-new-setattributetype-method-for-fmefeat.html

https://knowledge.safe.com/idea/50225/transformer-to-extract-attribute-character-encodin.html

Upvoted x2 :-)
