csv and control-Z

Question

I am suddenly receiving a pipe-separated text file with embedded pseudo-Unicode characters that are supposed to be a macronated 'o'. Unfortunately, each one comes through as a ^Z in the ASCII file.

All the vowels can have macrons in Maori, and there is a new policy that all government departments must add macrons. If all software were Unicode-aware, this might work.

Some programs handle this and read the whole file regardless of the ^Z, but many stop at it. The FME CSV2 reader stops. Oddly, the FME Textfile reader does handle the ^Z characters with encoding set to DOS-Latin-1 (ibm-850).

What can I do?

The simplest idea is to translate the pair of characters ^Zo back to a plain ASCII o.
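As a rough illustration of that idea outside FME, here is a minimal Python sketch that works on raw bytes, so the ^Z (byte 0x1A) is treated as ordinary data rather than as an end-of-file marker. The function names and file paths are hypothetical, and the sketch assumes that simply dropping every 0x1A byte is acceptable, which turns the pair ^Zo into a plain o.

```python
SUB = b"\x1a"  # ASCII substitute character, displayed as ^Z


def strip_sub(data: bytes) -> bytes:
    """Remove every 0x1A byte, so b'\\x1ao' becomes b'o'."""
    return data.replace(SUB, b"")


def clean_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without 0x1A bytes.

    Both files are opened in binary mode so no byte is given special
    meaning. Returns the number of 0x1A bytes that were removed.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_sub(data))
    return data.count(SUB)
```

Something like `clean_file("sample.asc", "sample_clean.asc")` could then be run before pointing the CSV reader at the cleaned copy; the output file name is only an example.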

Surely tr could just strip the ^Z...nope.

I have tried the UTF-8 encoding parameter on the CSV reader, and other tricks with tr, without success.

I have attached a test file, sample.asc, containing two records.

Unix wc -l returns a count of 1, which is not a good start since I can cat two records.
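Since wc -l counts newline characters rather than records, a count of 1 for two visible records suggests the file holds only one newline, for example because the last record is not terminated. A small sketch for checking this on the raw bytes follows; the function name is hypothetical and the bytes in the usage note are made up for illustration, not taken from sample.asc.

```python
def terminator_counts(data: bytes) -> dict:
    """Count line terminators and 0x1A bytes in raw file content."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,                              # Windows-style endings
        "lf_only": data.count(b"\n") - crlf,       # bare LF (what wc -l counts, plus CRLF)
        "cr_only": data.count(b"\r") - crlf,       # bare CR
        "sub": data.count(b"\x1a"),                # ^Z characters
        "ends_with_newline": data.endswith((b"\n", b"\r")),
    }
```

For instance, `terminator_counts(b"a|b\x1ao\nc|d")` reports one bare LF, one ^Z, and no trailing newline, which is the kind of layout that would give a wc -l count of 1 for two records.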


Best answer by kimo 22 August 2019, 23:17


kimo · Accepted Answer

I have created a workaround based on this suggestion on Stack Exchange: https://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space

    def processFeature(feature):
        """Extract record type in field 8 and strip ^Z."""
        buffer = feature.getAttribute('text_line_data')
        # Assumption: the argument to join() did not survive in this post.
        # Dropping the ^Z (0x1A) characters matches what the docstring describes.
        fixed = ''.join(ch for ch in buffer if ch != '\x1a')
        feature.setAttribute("rec", buffer.split('|')[7])
        feature.setAttribute('text_line_data', fixed)

I used the text reader to read in the whole file (it ignores ^Z, hooray!) and then a PythonCaller to strip off the ^Z and write each line out to another text file. Then I was able to use a CSV reader to read in the data successfully, splitting at the pipe separators.

Perhaps I could have joined up the two processes with a WorkspaceRunner, but I just wanted my original workspace to run again. I was not able to use the original startup Python script because Python also halts on an embedded ^Z.

debbiatsafe · Answer

Hi @kimo

I was able to use the CSV2 reader with a UTF-8 encoding parameter to read your sample file successfully. In the Data Inspector, the macronated characters were prefixed with a substitute character (hex code \x1a).

It is possible to match on hex codes in the StringReplacer in Replace Regular Expression mode, as mentioned in this Q&A post, so I would recommend using this method to replace the substitute character in your text file. I have also found that you can replace this character by pasting the substitute character into the text editor in the StringReplacer's Replace Text mode.

I have attached a sample workspace demonstrating these two approaches. I hope it helps.

kimo_StringReplacer_MacronatedCharacters.fmw
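For comparison, the same hex-code match can be sketched with Python's re module. This is not the StringReplacer itself, only an illustration of the pattern \x1a; the function name and the sample string in the usage note are invented for the example.

```python
import re

# \x1a is the hex escape for the substitute character (^Z).
SUB_PATTERN = re.compile(r"\x1a")


def remove_sub(text: str) -> str:
    """Replace every substitute character with an empty string."""
    return SUB_PATTERN.sub("", text)
```

Calling `remove_sub("K\x1aoromiko|7")` on this made-up value leaves the pipe separator untouched and removes only the substitute character.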
