Question

Too many text delimiters in csv files

11 years ago
October 7, 2013
4 replies
121 views

karl_zinglersen
9 replies

I have a large number (one for each day the past years) of text files of apparently same field structure. The fields are separated with semi colon ; delimiter, but with no text qualifyer e.g. " ".

However, some of the records have data, which include semi colons and consequently mess up and fails when importing into FME Workbench.

So far I have added the text files as readers in used the first line as column names, but the import apparently breaks when hitting the false semi colon delimiters.

What can I do to manage and clean up such failures?

takashi
7734 replies
11 years ago
October 7, 2013

Hi,

How can you distinguish delimiter semi colons from others?

If there is a regularity - e.g. delimiter semi colons are always followed by a white space and non-delimiters are not, I think it's possible to create an FME workspace to clean up them.

Otherwise it's difficult to automate completely. Some manual operations might be necessary, for example: Read the data with the TEXT reader line by line. Split every line with semi colons, count the number of columns. Save correct records and failed records separately. Correct failed records manually. Read again and merge them, then save them into the final destination dataset.

Takashi

david_r
8358 replies
11 years ago
October 8, 2013

Hi,

it's difficult to tell without having seen the CSV file in question, but as has been said, it is not trivial to separate content from delimiters if there are no other distinguishing features between them.

You could, however, try and replace the CSV reader with a PythonCreator and the script found here. It will use the Python "csv" module that is somewhat more robust than the CSV reader in FME.

David

karl_zinglersen
Author
9 replies
11 years ago
October 11, 2013

Hi, sorry I haven't replied yet - have had a small vacation too.

I'll test your different approaches. For my (at the moment) limited data set of 500K post, there seems to be only one failing entity (of some posts), so I'll surely be able to fix that. However, this would be a returning and updated data set (AIS-shiptracks from satellit monitoring).

I believe I can do Takashi's approach resonable easily, thanks.

But I'll also try David's PythonCreator - I have a personal goal to get more into tools like Python.

karl_zinglersen
Author
9 replies
11 years ago
November 14, 2013

Hi again,

I received a new similar data set recently and tried a different approach using regex (.*;){21} - to find lines with more than 20 semi colons - as a simple solution inside a StringSearcher.

Found at Stackoverflow http://stackoverflow.com/questions/14774559/with-regex-finding-lines-containing-more-than-n-semi-colons

Karl

Reply

Rich Text Editor, editor1

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Too many text delimiters in csv files