
I am suddenly receiving a pipe-separated text file with embedded pseudo-Unicode characters that are supposed to be a macronated 'o'. Unfortunately, each one translates to a ^Z in the ASCII file.

All the vowels can have macrons in Maori, and there is a new policy that all government departments must add macrons. If all software were Unicode-aware then this might work.

Some programs will handle this, reading the whole file regardless of the ^Z, but many stop. The FME CSV2 reader stops. Oddly, the FME Textfile reader does handle them with encoding set to DOS-Latin-1 (ibm-850).

What can I do?

The simplest idea is to translate the pair of characters ^Zo back to a plain ascii o.

Surely tr could just strip the ^Z...nope.

I have tried the UTF-8 encoding parameter on the CSV reader, and other tricks with tr, without success.

I have attached a test sample.asc of two records.

Unix wc -l returns a count of 1, not a good start since I can cat two records.

I have created a workaround based on this suggestion on Stack Overflow:

https://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space

def processFeature(feature):
    """Extract the record type from field 8 and strip ^Z."""
    buffer = feature.getAttribute('text_line_data')
    # Drop control characters (code points below 32), which includes ^Z (0x1A).
    fixed = ''.join(i if ord(i) > 31 else '' for i in buffer)
    # Field 8 of the pipe-separated record is at index 7.
    feature.setAttribute('rec', buffer.split('|')[7])
    feature.setAttribute('text_line_data', fixed)

I used the text reader to read in the whole file (it ignores ^Z - hooray!) and then a PythonCaller to strip off the ^Z and write each line out to another text file. Then I was able to use a CSVReader to read in the data successfully, splitting at the pipe separators. Perhaps I could have joined up the two processes with a workspace runner, but I just wanted my original workspace to run again.

I was not able to use the original startup Python script because Python also halts on an embedded ^Z.
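If the halt comes from the file being opened in text mode on Windows (an assumption on my part; older Python versions treat the byte 0x1A as end-of-file there), reading the file as raw bytes may sidestep it. A minimal sketch, where the names strip_sub_bytes and clean_file and both file paths are hypothetical:

```python
def strip_sub_bytes(data: bytes) -> bytes:
    """Remove every SUB (Ctrl-Z, 0x1A) byte, leaving all other bytes untouched."""
    return data.replace(b"\x1a", b"")


def clean_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path with SUB bytes removed.

    Binary mode ("rb"/"wb") means no byte is interpreted as end-of-file
    and line endings pass through unchanged.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_sub_bytes(data))
```

Something like clean_file('sample.asc', 'sample_clean.asc') would then leave a file with a plain 'o' wherever the original had ^Zo, so the macron is lost but the pipe-separated structure is preserved.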

Hi @kimo

I was able to use the CSV2 reader with a UTF-8 encoding parameter to read your sample file successfully.

In the Data Inspector, the macronated characters were prefixed with a substitute character (hex code \x1a). It is possible to match using hex codes in the StringReplacer in Replace Regular Expression mode as mentioned in this Q&A post, so I would recommend using this method to replace the substitute character in your text file. In addition, I have found it is also possible to replace this character by pasting the substitute character into the text editor in Replace Text mode of the StringReplacer.
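For anyone who wants to try the same hex-code pattern outside FME, here is a rough Python equivalent of the regular expression replacement. It is only a sketch: the function name remove_sub is made up for illustration, and it is not what the StringReplacer does internally.

```python
import re

# Matches the substitute character (SUB, Ctrl-Z) by its hex code.
SUB_PATTERN = re.compile(r"\x1a")


def remove_sub(text: str) -> str:
    """Return text with every substitute character removed."""
    return SUB_PATTERN.sub("", text)
```

Applied to a line read from the file, this leaves the plain vowel that followed each substitute character and does not touch the pipe separators.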

I have attached a sample workspace demonstrating these two approaches. I hope it helps.

kimo_StringReplacer_MacronatedCharacters.fmw

