Skip to main content
Solved

CSV file with SOH control character as separator

  • April 2, 2018
  • 8 replies
  • 1889 views

aaron
Contributor
Forum|alt.badge.img+12

I have several CSV files exported from Hadoop so the separator is the SOH control character. What do I type as the separator for the CSV reader? I've tried the hex code (\\x01) and copying/pasting the SOH character from Notepad++ but neither work.

Thanks,

Aaron

Best answer by takashi

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

This post is closed to further activity.
It may be an old question, an answered question, an implemented idea, or a notification-only post.
Please check post dates before relying on any information in a question or answer.
For follow-up or related questions, please post a new question or idea.
If there is a genuine update to be made, please contact us and request that the post is reopened.

8 replies

takashi
Celebrity
  • Best Answer
  • April 3, 2018

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.


aaron
Contributor
Forum|alt.badge.img+12
  • Author
  • Contributor
  • April 3, 2018

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

Hi @takashi, I'm working with Twitter data so I can't use normal delimiter characters (commas, tabs, pipes, etc.) because they are sometimes included in the text of a tweet. In such cases, columns won't parse correctly for those records. I considered making up a delimiter with some random text (e.g. - qxz) but that's not ideal. Can someone think of any other/better solutions?

 


Forum|alt.badge.img

Hi @aaron

I agree with @takashi: replacing SOH control character with something... easier to handle is probably the best idea. I would read the source data as a text file using FeatureReader, replace the problem character, and write it as a temporary file (have you had a chance to try TempPathnameCreator?) using FeatureWriter. After this "prep" the file will be ready to be read with CSV Reader - you could use FeatureReader again to deal with a single translation.


aaron
Contributor
Forum|alt.badge.img+12
  • Author
  • Contributor
  • April 3, 2018

Hi @aaron

I agree with @takashi: replacing SOH control character with something... easier to handle is probably the best idea. I would read the source data as a text file using FeatureReader, replace the problem character, and write it as a temporary file (have you had a chance to try TempPathnameCreator?) using FeatureWriter. After this "prep" the file will be ready to be read with CSV Reader - you could use FeatureReader again to deal with a single translation.

@LenaAtSafe, I was hoping for a solution where I could parse the SOH control character directly, but I'll try one of the workarounds suggested.

 

 

Thanks,

 

 

Aaron

lifalin2016
Supporter
Forum|alt.badge.img+40
  • Supporter
  • April 5, 2018

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

Hi Aaron.

 

Although the (2017.1) doc for AttributeSplitter looks somewhat out of date, it does show some examples of using control characters as delimiters. Extrapolating from the shown examples, you may want to try if (^A) will work as a substitute for 0x01 (ascii 001).

 

 


aaron
Contributor
Forum|alt.badge.img+12
  • Author
  • Contributor
  • April 16, 2018

FYI, I successfully used a Text File reader and a StringReplacer followed by an AttributeSplitter to parse the data. With the StringReplacer, I had to use \\x01 in the Text to Match box; copying and pasting the SOH control character directly did not work. Thanks everyone for your help!


  • September 20, 2018

HI @aaron can you tell me the correct solution for this... if u send like query type also it would be best... like " fields terminated by '\\???' "


aaron
Contributor
Forum|alt.badge.img+12
  • Author
  • Contributor
  • September 20, 2018

HI @aaron can you tell me the correct solution for this... if u send like query type also it would be best... like " fields terminated by '\\???' "

@kauk, below is a screenshot of the StringReplacer I used to find and replace SOH and STX control characters, respectively.