Skip to main content

I have several CSV files exported from Hadoop so the separator is the SOH control character. What do I type as the separator for the CSV reader? I've tried the hex code (\\x01) and copying/pasting the SOH character from Notepad++ but neither work.

Thanks,

Aaron

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.


Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

Hi @takashi, I'm working with Twitter data so I can't use normal delimiter characters (commas, tabs, pipes, etc.) because they are sometimes included in the text of a tweet. In such cases, columns won't parse correctly for those records. I considered making up a delimiter with some random text (e.g. - qxz) but that's not ideal. Can someone think of any other/better solutions?

 


Hi @aaron

I agree with @takashi: replacing SOH control character with something... easier to handle is probably the best idea. I would read the source data as a text file using FeatureReader, replace the problem character, and write it as a temporary file (have you had a chance to try TempPathnameCreator?) using FeatureWriter. After this "prep" the file will be ready to be read with CSV Reader - you could use FeatureReader again to deal with a single translation.


Hi @aaron

I agree with @takashi: replacing SOH control character with something... easier to handle is probably the best idea. I would read the source data as a text file using FeatureReader, replace the problem character, and write it as a temporary file (have you had a chance to try TempPathnameCreator?) using FeatureWriter. After this "prep" the file will be ready to be read with CSV Reader - you could use FeatureReader again to deal with a single translation.

@LenaAtSafe, I was hoping for a solution where I could parse the SOH control character directly, but I'll try one of the workarounds suggested.

 

 

Thanks,

 

 

Aaron

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \\x01 with a normal delimiter character (e.g. comma) with the StringReplacer (Replace Regular Expression mode), then split each text line by the delimiter character with the AttributeSplitter.

Hi Aaron.

 

Although the (2017.1) doc for AttributeSplitter looks somewhat out of date, it does show some examples of using control characters as delimiters. Extrapolating from the shown examples, you may want to try if (^A) will work as a substitute for 0x01 (ascii 001).

 

 


FYI, I successfully used a Text File reader and a StringReplacer followed by an AttributeSplitter to parse the data. With the StringReplacer, I had to use \\x01 in the Text to Match box; copying and pasting the SOH control character directly did not work. Thanks everyone for your help!


HI @aaron can you tell me the correct solution for this... if u send like query type also it would be best... like " fields terminated by '\\???' "


HI @aaron can you tell me the correct solution for this... if u send like query type also it would be best... like " fields terminated by '\\???' "

@kauk, below is a screenshot of the StringReplacer I used to find and replace SOH and STX control characters, respectively.

 


Reply