Question

Too many text delimiters in csv files

  • 7 October 2013
  • 4 replies
  • 19 views

I have a large number (one for each day the past years) of text files of apparently same field structure. The fields are separated with semi colon ; delimiter, but with no text qualifyer e.g. " ".

 

 

However, some of the records have data, which include semi colons and consequently mess up and fails when importing into FME Workbench.

 

 

So far I have added the text files as readers in used the first line as column names, but the import apparently breaks when hitting the false semi colon delimiters. 

 

 

What can I do to manage and  clean up such failures?

4 replies

Userlevel 2
Badge +17
Hi,

 

 

How can you distinguish delimiter semi colons from others?

 

If there is a regularity - e.g. delimiter semi colons are always followed by a white space and non-delimiters are not, I think it's possible to create an FME workspace to clean up them.

 

Otherwise it's difficult to automate completely. Some manual operations might be necessary, for example: Read the data with the TEXT reader line by line. Split every line with semi colons, count the number of columns. Save correct records and failed records separately. Correct failed records manually. Read again and merge them, then save them into the final destination dataset.

 

 

Takashi
Userlevel 4
Hi,

 

 

it's difficult to tell without having seen the CSV file in question, but as has been said, it is not trivial to separate content from delimiters if there are no other distinguishing features between them.

 

 

You could, however, try and replace the CSV reader with a PythonCreator and the script found here. It will use the Python "csv" module that is somewhat more robust than the CSV reader in FME.

 

 

David
Hi, sorry I haven't replied yet - have had a small vacation too.

 

I'll test your different approaches. For my (at the moment) limited data set of 500K post, there seems to be only one failing entity (of some posts), so I'll surely be able to fix that. However, this would be a returning and updated data set (AIS-shiptracks from satellit monitoring). 

 

 

I believe I can do Takashi's approach resonable easily, thanks.

 

But I'll also try David's PythonCreator - I have a personal goal to get more into tools like Python.
Hi again,

 

I received a new similar data set recently and tried a different approach using regex (.*;){21} - to find lines with more than 20 semi colons - as a simple solution inside a StringSearcher.

 

 

Found at Stackoverflow http://stackoverflow.com/questions/14774559/with-regex-finding-lines-containing-more-than-n-semi-colons 

 

 

Karl

Reply