Solved

CSV file with SOH control character as separator

7 years ago
April 2, 2018
8 replies
1703 views


aaron
Contributor
66 replies

I have several CSV files exported from Hadoop, so the separator is the SOH control character. What do I type as the separator for the CSV reader? I've tried the hex code (\x01) and copying and pasting the SOH character from Notepad++, but neither works.

Thanks,

Aaron

Best answer by takashi

Hi @aaron, a workaround I can think of is to read the source file with the Text File reader, replace \x01 with a normal delimiter character (e.g. a comma) using the StringReplacer (Replace Regular Expression mode), then split each text line by that delimiter character with the AttributeSplitter.
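
For anyone preprocessing outside FME, the same three steps (read as text, replace \x01, split) can be sketched in Python. This is only an illustration of the idea, not an FME feature; the function name `parse_via_replacement` and the sample rows are made up for the example.

```python
def parse_via_replacement(raw_text, delimiter=","):
    """Mimic the workaround: replace SOH with a delimiter, then split each line.

    Hypothetical helper for illustration; assumes `delimiter` never occurs
    in the data itself.
    """
    rows = []
    for line in raw_text.splitlines():
        # Step 1: swap the SOH control character (0x01) for the delimiter.
        replaced = line.replace("\x01", delimiter)
        # Step 2: split the line into columns on that delimiter.
        rows.append(replaced.split(delimiter))
    return rows


sample = "1\x01alice\x01hello\n2\x01bob\x01hi"
rows = parse_via_replacement(sample)
```

The approach only holds up when the replacement delimiter never appears inside a field.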


takashi
7685 replies
Best Answer
7 years ago
April 3, 2018


aaron
Author
Contributor
66 replies
7 years ago
April 3, 2018

Hi @takashi, I'm working with Twitter data, so I can't use normal delimiter characters (commas, tabs, pipes, etc.) because they sometimes appear in the text of a tweet, and the columns won't parse correctly for those records. I considered making up a delimiter from some random text (e.g. qxz), but that's not ideal. Can anyone think of a better solution?
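
If preprocessing outside FME is an option, one way around the embedded-delimiter problem is to split on SOH directly instead of replacing it. A minimal Python sketch, assuming one record per line and no line breaks inside tweets; `read_soh_rows` and the sample record are hypothetical:

```python
import csv
import io


def read_soh_rows(stream):
    """Split each line on the SOH character (0x01) using the csv module.

    QUOTE_NONE makes quote characters inside a tweet plain text rather
    than CSV quoting.
    """
    reader = csv.reader(stream, delimiter="\x01", quoting=csv.QUOTE_NONE)
    return list(reader)


# A made-up record whose text field contains a comma, a pipe, and a tab.
sample = io.StringIO("1\x01Hello, world | tabs\tetc\n")
rows = read_soh_rows(sample)
```

Because SOH is the only character treated as a separator, commas, pipes, and tabs inside the tweet text stay within their column.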

lenaatsafe
275 replies
7 years ago
April 3, 2018

Hi @aaron

I agree with @takashi: replacing the SOH control character with something easier to handle is probably the best idea. I would read the source data as a text file using a FeatureReader, replace the problem character, and write the result to a temporary file using a FeatureWriter (have you had a chance to try the TempPathnameCreator?). After this "prep", the file will be ready to be read with the CSV reader; you could use a FeatureReader again to keep everything in a single translation.
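
The temporary-file idea can also be sketched outside FME. The Python below is an assumption-laden illustration, not the FeatureReader/FeatureWriter setup itself: it rewrites an SOH-separated file as a standard comma-separated file with quoting, so that fields containing commas survive. The name `soh_to_temp_csv` is hypothetical, and UTF-8 input with one record per line is assumed.

```python
import csv
import os
import tempfile


def soh_to_temp_csv(src_path):
    """Rewrite an SOH-separated file as a quoted, comma-separated temp file.

    Returns the path of the temporary file; the caller should delete it.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".csv")
    with open(src_path, newline="", encoding="utf-8") as src, \
            os.fdopen(fd, "w", newline="", encoding="utf-8") as dst:
        # Read: split on SOH only, treating quote characters as plain text.
        reader = csv.reader(src, delimiter="\x01", quoting=csv.QUOTE_NONE)
        # Write: default csv.writer quotes any field containing a comma or quote.
        writer = csv.writer(dst)
        writer.writerows(reader)
    return tmp_path
```

The quoting on the output side is what lets an ordinary CSV reader parse the file even when the text contains commas.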


aaron
Author
Contributor
66 replies
7 years ago
April 3, 2018

@LenaAtSafe, I was hoping for a solution where I could parse the SOH control character directly, but I'll try one of the workarounds suggested.

Thanks,

Aaron


lifalin2016
Contributor
574 replies
7 years ago
April 5, 2018

Hi Aaron.

Although the (2017.1) doc for the AttributeSplitter looks somewhat out of date, it does show some examples of using control characters as delimiters. Extrapolating from those examples, you may want to try whether ^A works as a substitute for 0x01 (ASCII 001).
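
For reference, ^A is caret notation for the control character with code 0x01: the letter's ASCII code with bit 0x40 flipped. Whether FME accepts this notation here is not confirmed in this thread; the short Python sketch below only shows the mapping, and `caret_to_char` is a hypothetical helper name.

```python
def caret_to_char(notation):
    """Convert caret notation such as '^A' to the control character it denotes."""
    if len(notation) != 2 or notation[0] != "^":
        raise ValueError("expected caret notation like '^A'")
    # Flipping bit 0x40 maps 'A' (0x41) to SOH (0x01), 'B' (0x42) to STX (0x02).
    return chr(ord(notation[1].upper()) ^ 0x40)


soh = caret_to_char("^A")
stx = caret_to_char("^B")
```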

-- Cheers, Lars I.


aaron
Author
Contributor
66 replies
7 years ago
April 16, 2018

FYI, I successfully used a Text File reader and a StringReplacer followed by an AttributeSplitter to parse the data. In the StringReplacer, I had to enter \x01 in the Text to Match box; copying and pasting the SOH control character directly did not work. Thanks, everyone, for your help!

kauk
1 reply
6 years ago
September 20, 2018

Hi @aaron, can you tell me the correct solution for this? If you could also send it in query form, that would be best, e.g. " fields terminated by '\???' "


aaron
Author
Contributor
66 replies
6 years ago
September 20, 2018

@kauk, below is a screenshot of the StringReplacer I used to find and replace SOH and STX control characters, respectively.
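
As a rough text equivalent of a regular-expression replacement for both characters, here is a Python sketch. The replacement characters ("|" for SOH, ";" for STX) are assumptions chosen for the example, not the values from the screenshot, and `replace_controls` is a hypothetical name. It also shows that the hex form \x01 and the octal form \001 name the same character; the octal form is, as far as I know, the one commonly written in Hive-style "fields terminated by" clauses, but treat that as an assumption to check against your own system.

```python
import re

# Assumed replacements for illustration only.
REPLACEMENTS = {"\x01": "|", "\x02": ";"}


def replace_controls(text):
    """Replace SOH (0x01) and STX (0x02) with their mapped characters."""
    # The character class matches either control character; the lambda
    # looks up the replacement for whichever one was matched.
    return re.sub(r"[\x01\x02]", lambda m: REPLACEMENTS[m.group(0)], text)


cleaned = replace_controls("a\x01b\x02c")

# Hex, octal, and numeric spellings of SOH are the same character in Python.
same_character = "\x01" == "\001" == chr(1)
```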
