
Hi,

When I set the CAT reader to read a .txt file, the first character it returns on the first line is a '¿ ', which messes up the content of the first row. All other rows are read fine. The contents of the first row are shifted because of the extra character at the start of the line. When I open the text file in a text editor, I see nothing wrong.

Any help would be welcome.

That would be the BOM (byte order mark), which tells FME which encoding is being used. Usually that character gets ignored when read (by FME or otherwise), but unfortunately I don't see an option for it in the CAT reader. I guess the CAT reader doesn't handle encoding properly, and I can confirm that it also occurs when I read an encoded file.
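To illustrate where the stray character comes from, here is a minimal Python sketch. It is not FME code, and the sample row is made up. It assumes the file was saved as UTF-8 with a BOM and shows what those three bytes look like to a reader that treats every byte as one windows-1252 character:

```python
import codecs

# Hypothetical first row of a fixed-width file, saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + b"0001 Amsterdam"

# A reader that maps each byte to one windows-1252 character
# sees three extra characters in front of the row.
as_single_byte = raw.decode("cp1252")
bom_as_text = as_single_byte[:3]

# Decoding with "utf-8-sig" drops the BOM if it is present.
as_utf8 = raw.decode("utf-8-sig")

print(repr(bom_as_text))
print(repr(as_utf8))
```

The last of those three characters is '¿', which matches the symptom described in the question, and the extra characters are what shift the fixed-width columns of the first row.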

There are a couple of possible solutions, although I don't really like them. Firstly, you could save the file without a BOM, in a plain single-byte encoding. For example, if I open my CAT file in Notepad++, use Encoding > Encode in ANSI, and then save it, FME reads it without that character (because it has no BOM).

The other thing you can do is to read the data with a Text File reader, then write it back out without a BOM (for example, set the Output Encoding parameter to windows-1252 and that should do it). Then you can read it with the CAT reader. In fact I tried that and it works fine.

So Text File reader > FeatureWriter (write to Text File w/o encoding) > FeatureReader (read it back as CAT).
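Outside FME, the same re-encoding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reencode_without_bom` is hypothetical, the source file is assumed to be UTF-8 (with or without a BOM), and every character in it is assumed to exist in windows-1252 (otherwise Python raises `UnicodeEncodeError`).

```python
def reencode_without_bom(src_path, dst_path,
                         src_encoding="utf-8-sig", dst_encoding="cp1252"):
    """Copy a text file, dropping any UTF-8 BOM and writing windows-1252.

    Hypothetical helper: assumes the source is UTF-8 and that all of its
    characters can be represented in the target encoding.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

The output file then starts directly with the first data character, so a byte-oriented fixed-width reader should see the columns where it expects them.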

But I'll query the developers about the issue too and get encoding added to the CAT reader if I can.



It's filed with the developers as FMEENGINE-37793. It's a medium priority because this is a fairly rare case. But hopefully it will get a fix sooner rather than later.



To clarify, the issue occurs when a FeatureReader reads, as CAT, a .txt file that was created in another workspace. Maybe this helps to fix the issue.



Hi @mark2atsafe, I know that the CAT reader cannot be used for a text file containing multibyte characters, such as Japanese, Chinese, and maybe UTF, since the alignments would be set with the number of characters, rather than number of bytes. The help on the CAT reader says "CAT files are ASCII database files", so I don't think Safe intend to implement the reader to support any text encoding other than ASCII. Am I right?



It would help if the workspace that creates the CAT file could somehow not set the encoding, or not set it to UTF. In short the CAT reader is reading a character it shouldn't. If you can avoid creating that character (by not writing the file with encoding) then it will solve the problem. But nothing apart from a developer fix can stop the CAT reader from picking up the BOM character in that way.



That's an interesting point about multibyte characters. As someone who works only in English, I sometimes forget that not every set of characters works the same way! I think this answer on StackOverflow outlines the reasons why this wouldn't work. Some codepoints might be double width which means - you are correct - it would be almost impossible to know for sure that the columns aligned properly with the text. So I'd be surprised if we did or could support other than ASCII.

Having said that, I still think we can do something better for this user. At the very least we could ignore the BOM character and assume that the user's data is all single-width ASCII.



Thanks for your comments regarding multibyte characters. I don't think it's absolutely impossible to improve the CAT reader to support other encodings, including multibyte ones. See this experimental custom transformer: MbStringByteSplitter.

In fact, I created this transformer in order to read a text file containing Japanese characters in Shift JIS encoding and the columns are aligned according to the number of bytes.
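The idea of aligning columns by bytes rather than characters can be sketched in Python. This is not the MbStringByteSplitter implementation, just an illustration: the function name `split_by_byte_widths`, the sample row, and the column widths are all made up, and it assumes every column boundary falls between characters, not inside a multibyte one.

```python
def split_by_byte_widths(line, widths, encoding="shift_jis"):
    """Split a fixed-width record using column widths measured in bytes.

    Hypothetical helper: assumes each boundary falls between characters.
    """
    data = line.encode(encoding)
    fields = []
    offset = 0
    for width in widths:
        chunk = data[offset:offset + width]
        # Decode each column separately and drop the padding spaces.
        fields.append(chunk.decode(encoding).rstrip())
        offset += width
    return fields


# Two double-byte characters plus two spaces fill a 6-byte column.
row = "東京  0001"
by_bytes = split_by_byte_widths(row, [6, 4])
by_chars = [row[:6], row[6:10]]  # naive character-based slicing
print(by_bytes)
print(by_chars)
```

With byte widths the two columns come out cleanly, while slicing the same row by character count pulls part of the second column into the first.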

Anyway, I agree that it would be better if the CAT reader could ignore BOM.



Many thanks @mark2atsafe! Just setting 'Write Byte order marker' to No in the Text File FeatureWriter didn't help, so we also changed the character encoding to Windows Latin-1 (windows-1252), and that did it.

