Skip to main content
Solved

CSV file with SOH control character as separator


aaron
Contributor
Forum|alt.badge.img+10
  • Contributor

I have several CSV files exported from Hadoop so the separator is the SOH control character. What do I type as the separator for the CSV reader? I've tried the hex code (\\x01) and copying/pasting the SOH character from Notepad++ but neither work.

Thanks,

Aaron

Best answer by takashi

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

View original
Did this help you find an answer to your question?

8 replies

takashi
Contributor
Forum|alt.badge.img+19
  • Contributor
  • Best Answer
  • April 3, 2018

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.


aaron
Contributor
Forum|alt.badge.img+10
  • Author
  • Contributor
  • April 3, 2018
takashi wrote:

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

Hi @takashi, I'm working with Twitter data so I can't use normal delimiter characters (commas, tabs, pipes, etc.) because they are sometimes included in the text of a tweet. In such cases, columns won't parse correctly for those records. I considered making up a delimiter with some random text (e.g. - qxz) but that's not ideal. Can someone think of any other/better solutions?

 


Forum|alt.badge.img

Hi @aaron

I agree with @takashi: replacing SOH control character with something... easier to handle is probably the best idea. I would read the source data as a text file using FeatureReader, replace the problem character, and write it as a temporary file (have you had a chance to try TempPathnameCreator?) using FeatureWriter. After this "prep" the file will be ready to be read with CSV Reader - you could use FeatureReader again to deal with a single translation.


aaron
Contributor
Forum|alt.badge.img+10
  • Author
  • Contributor
  • April 3, 2018
lenaatsafe wrote:

Hi @aaron

I agree with @takashi: replacing SOH control character with something... easier to handle is probably the best idea. I would read the source data as a text file using FeatureReader, replace the problem character, and write it as a temporary file (have you had a chance to try TempPathnameCreator?) using FeatureWriter. After this "prep" the file will be ready to be read with CSV Reader - you could use FeatureReader again to deal with a single translation.

@LenaAtSafe, I was hoping for a solution where I could parse the SOH control character directly, but I'll try one of the workarounds suggested.

 

 

Thanks,

 

 

Aaron

lifalin2016
Contributor
Forum|alt.badge.img+29
  • Contributor
  • April 5, 2018
takashi wrote:

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

Hi Aaron.

 

Although the (2017.1) doc for AttributeSplitter looks somewhat out of date, it does show some examples of using control characters as delimiters. Extrapolating from the shown examples, you may want to try if (^A) will work as a substitute for 0x01 (ascii 001).

 

 


aaron
Contributor
Forum|alt.badge.img+10
  • Author
  • Contributor
  • April 16, 2018

FYI, I successfully used a Text File reader and a StringReplacer followed by an AttributeSplitter to parse the data. With the StringReplacer, I had to use \\x01 in the Text to Match box; copying and pasting the SOH control character directly did not work. Thanks everyone for your help!


  • September 20, 2018

HI @aaron can you tell me the correct solution for this... if u send like query type also it would be best... like " fields terminated by '\\???' "


aaron
Contributor
Forum|alt.badge.img+10
  • Author
  • Contributor
  • September 20, 2018
kauk wrote:

HI @aaron can you tell me the correct solution for this... if u send like query type also it would be best... like " fields terminated by '\\???' "

@kauk, below is a screenshot of the StringReplacer I used to find and replace SOH and STX control characters, respectively.

 


Reply


Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

 
Cookie settings