Solved

Converting character encoding before reading text files


peteralstorp
Contributor

Hi, Safers!

 

I read Text Files as whole text at once, but I need to convert files that are not in UTF-8 to UTF-8 before reading. This is because FME needs to know the encoding when it reads the file.

 

I figure one easy way to do this is to store the files in a temp folder and, via the SystemCaller, run some Windows command to convert the files that are not already in UTF-8 (this can be many thousands of files, so preferably I would convert only the files that need converting) before reading them with the Text File reader with Character Encoding set to UTF-8.

 

Does anyone have any suggestions on how to solve this in an easy way? Preferably within the FME script if possible...

 

Kind regards, Peter

Best answer by david_r

peteralstorp wrote:

No, I have to check for each file.

If you can use Python, this may be a viable option, which could potentially save you a lot of guesswork: https://pypi.org/project/chardet/

You will notice that even this library can only detect the encoding with a certain confidence.


13 replies

david_r
Evangelist
  • November 30, 2020

Have you tried using the AttributeEncoder? It can convert between different encodings, including utf-8.


ebygomm
Influencer
  • Influencer
  • November 30, 2020

Do you already know what encoding the files have which are not UTF-8?


peteralstorp
Contributor
  • Author
  • Contributor
  • November 30, 2020
david_r wrote:

Have you tried using the AttributeEncoder? It can convert between different encodings, including utf-8.

I have to set the encoding when reading the file so this problem has to be solved before reading the file, no?


peteralstorp
Contributor
  • Author
  • Contributor
  • November 30, 2020
ebygomm wrote:

Do you already know what encoding the files have which are not UTF-8?

No, I have to check for each file.


ebygomm
Influencer
  • Influencer
  • November 30, 2020
peteralstorp wrote:

No, I have to check for each file.

So how are you determining the encoding you need to read from? Most programs that show you the encoding of a text file are giving a best guess, not an absolute answer.
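To illustrate why any detected encoding is only a guess, here is a minimal Python sketch. The byte values are an illustrative assumption, not taken from the files in this thread. The same three bytes decode without error under several single-byte encodings, so nothing in the bytes themselves says which reading was intended.

```python
# Illustrative bytes (an assumption, not taken from the thread's files).
raw = bytes([0xE5, 0xE4, 0xF6])

# Each decode below succeeds, so none of these encodings can be ruled out
# from the bytes alone.
as_latin1 = raw.decode("latin-1")  # Western European reading
as_cp1252 = raw.decode("cp1252")   # Windows Western European reading
as_cp1251 = raw.decode("cp1251")   # Windows Cyrillic reading

# Compare the results rather than printing them, to stay independent of
# the console's own encoding.
same_western = as_latin1 == as_cp1252
differs_cyrillic = as_latin1 != as_cp1251
```

Here the two Western readings give the Swedish letters å, ä, ö, while the Cyrillic reading gives three entirely different letters. A detector has to pick between them using statistics about which text looks more plausible.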


peteralstorp
Contributor
  • Author
  • Contributor
  • November 30, 2020
peteralstorp wrote:

No, I have to check for each file.

OK, I didn't know that. I open the file in Notepad++ now and check the encoding there. When it comes to this problem, I wish I lived in an English-speaking country and not Sweden...


david_r
Evangelist
  • November 30, 2020
peteralstorp wrote:

I have to set the encoding when reading the file so this problem has to be solved before reading the file, no?

The reader encoding only tells FME what to expect so that it can correctly encode the attributes for you, but you can override it in the AttributeEncoder. Of course, if you know the input encoding then that will simplify the task quite a bit.

Worst case, there is also the "Data File" reader which lets you read the binary contents of the file.


david_r
Evangelist
  • November 30, 2020
peteralstorp wrote:

No, I have to check for each file.

No BOM in the utf-8 files? It might be worth checking, as it would make it easier for you.
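A BOM check could be sketched in Python roughly as follows. The helper names `has_utf8_bom` and `read_head` are hypothetical, and where this would run (for example in a PythonCaller) is an assumption. Only `codecs.BOM_UTF8` comes from the standard library.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return raw.startswith(codecs.BOM_UTF8)


def read_head(path: str, size: int = 3) -> bytes:
    """Read only the first few bytes of a file, which is enough to look for a BOM."""
    with open(path, "rb") as f:
        return f.read(size)
```

A call such as `has_utf8_bom(read_head(path))` then flags the files that carry a BOM. A missing BOM proves nothing either way, since many editors write UTF-8 without one, so this only shortcuts the files that happen to have it.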


ebygomm
Influencer
  • Influencer
  • November 30, 2020
peteralstorp wrote:

No, I have to check for each file.

Encoding issues are the bane of my life at the moment, living in the UK isn't helping!

 

 


peteralstorp
Contributor
  • Author
  • Contributor
  • November 30, 2020
peteralstorp wrote:

No, I have to check for each file.

The files in my case can be created in any text editor or other programs so I can never know what to expect. Somehow I need to check.
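One possible way to do that check, at least for separating the files that are already UTF-8 from the ones that need converting, is a strict decode attempt. This is a minimal sketch, and `needs_conversion` is a hypothetical helper name.

```python
def needs_conversion(raw: bytes) -> bool:
    """Return True if the bytes are not valid UTF-8 and so need converting."""
    try:
        raw.decode("utf-8")  # strict by default: raises on invalid sequences
    except UnicodeDecodeError:
        return True
    return False
```

Plain ASCII files pass this check too, because ASCII is a subset of UTF-8. The check only says that a file is not UTF-8. It does not say which encoding the file is actually in, so that still has to be guessed separately.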

 


peteralstorp
Contributor
  • Author
  • Contributor
  • November 30, 2020
peteralstorp wrote:

No, I have to check for each file.

Haha, really! Yeah, there are probably many of us who would like to see a well thought out handling of this problem within FME.


david_r
Evangelist
  • Best Answer
  • November 30, 2020
peteralstorp wrote:

No, I have to check for each file.

If you can use Python, this may be a viable option, which could potentially save you a lot of guesswork: https://pypi.org/project/chardet/

You will notice that even this library can only detect the encoding with a certain confidence.
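As a rough sketch of how chardet might be used here: the function names, the 0.5 confidence threshold, and the idea of setting low-confidence files aside are all assumptions for illustration, not something from this thread. `chardet.detect()` takes bytes and returns a dictionary with `encoding` and `confidence` keys.

```python
from typing import Optional, Tuple

try:
    import chardet  # third-party: pip install chardet
except ImportError:  # keep the sketch importable without it
    chardet = None


def guess_encoding(raw: bytes) -> Tuple[Optional[str], float]:
    """Return (encoding, confidence); encoding is None when there is no guess."""
    try:
        raw.decode("utf-8")
        # Valid UTF-8 (including plain ASCII); treated as certain in this sketch.
        return "utf-8", 1.0
    except UnicodeDecodeError:
        pass
    if chardet is None:
        return None, 0.0
    result = chardet.detect(raw)
    return result["encoding"], result["confidence"] or 0.0


def to_utf8(raw: bytes, min_confidence: float = 0.5) -> Optional[bytes]:
    """Re-encode raw bytes as UTF-8, or return None if the guess is too weak."""
    encoding, confidence = guess_encoding(raw)
    if encoding is None or confidence < min_confidence:
        return None  # leave the file for manual inspection
    try:
        return raw.decode(encoding).encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        return None  # the guess was wrong or names a codec Python lacks
```

Files for which `to_utf8` returns `None` would still need a manual look, for example in Notepad++. The threshold is a trade-off: a higher value sets more files aside, a lower one risks converting with the wrong encoding.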


peteralstorp
Contributor
  • Author
  • Contributor
  • December 9, 2020
peteralstorp wrote:

No, I have to check for each file.

Thanks, @david_r, I was hoping for a simpler answer, but don't we always? I'll look into your proposed Python solution!

