Question

Split comma delimited text which contains commas. (Read CSV as TXT)

5 years ago
February 26, 2020
6 replies
1116 views

david.benoit
16 replies

I am looking to split some data that is stored as CSV but contains commas inside the values.

Unfortunately, it cannot be read with a CSV reader because of its complexity and its non-tabular (non table/row) format.

An example line of data might look like this:

"This data has , a comma in it", "So does, this one", This one does not,"but this one ,does", This one doesnt, or does this one,"Another comma, comma, comma",No comma, still none, still none

I considered doing 4 splits:

Split 1: ","

Split 2: ,"

Split 3: ",

Split 4: ,

And then join the results back together. Although this would work, I need to retain the order of the data and assign an ascending ID.

EG:

1 - This data has , a comma in it

2 - So does, this one

3 - This one does not

4 - but this one ,does

5 - This one doesnt

So my consideration will not work since it will put things out of order.

Secondly, I considered a pre-processing step that would read in all the data as CSV, then write it back out as CSV with a TAB delimiter instead of a COMMA.

But what happens here is that the original commas contained in the data are treated as delimiters. To complicate things further, this would have to be done in bulk, since there are many files.

Perhaps this solution is possible, but it would need a writer schema or some other setup that is beyond my understanding.

Any recommendations are appreciated.

+34

ebygomm
Influencer
3275 replies
5 years ago
February 26, 2020

You could use a regex in a StringReplacer to replace any comma that is followed by an even number of quote marks (i.e. any comma that sits outside a quoted field) with some other character, and then split on that character in the AttributeSplitter (assuming your quotes are balanced):

,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
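For anyone who wants to check the pattern outside FME, here is a minimal sketch in plain Python, using a shortened version of the sample line from the question. The function name `split_outside_quotes` is just an illustrative choice, and stripping the surrounding spaces and quotes afterwards is an assumption about the desired output, not something the pattern does by itself.

```python
import re

# The pattern from the post: a comma followed by an even number of quote
# marks up to the end of the string, i.e. a comma outside a quoted field.
OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_outside_quotes(line):
    """Split on commas outside quotes, then trim spaces and quote marks."""
    parts = OUTSIDE_QUOTES.split(line)
    return [part.strip().strip('"') for part in parts]

sample = ('"This data has , a comma in it", "So does, this one", '
          'This one does not,"but this one ,does", This one doesnt')

fields = split_outside_quotes(sample)

# Order is preserved, so an ascending ID can be assigned directly.
numbered = list(enumerate(fields, start=1))
for item_id, value in numbered:
    print(f"{item_id} - {value}")
```

Because the split keeps the fields in their original order, the ascending ID asked for in the question falls out of a simple enumeration.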

markatsafe
1891 replies
5 years ago
February 26, 2020

@david.benoit I think the CSV reader should be able to handle this. The Field Qualifier Character controls whether quoted fields can include the Delimiter Character. One problem you may have encountered is that the CSV reader has an 'auto' mode for the delimiter, and in this case it seems to pick <space> as the delimiter. So being explicit about the Delimiter Character might also help.

david.benoit
Author
16 replies
5 years ago
February 26, 2020

ebygomm wrote:

,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)

Thank you!

One thing I'm noticing is that it seems to get stuck (no error, no finish, just stuck!) on a line break, e.g.:

-------

,JOSMCAABFMK,2017-01-01,00:01:00,2017-01-02,00:01:00,MST,2017-01-01,07:01:00,2017-01-02,07:01:00,1.65,0.03,V0,0.70,0.03,V0,4.31,0.10,V0,-999.00,-999.00,M1,0.68,0.06,V0,3.10,0.28,V0,0.95,0.10,V0,2.64,0.17,V0,-999.00,-999.00,M1,1.05,0.06,V0,1.69,0.05,V0,0.13,0.21,V1,0.01,0.05,V1,0.51,0.13,V0,0.12,0.04,V0,4.19,0.10,V0,0.07,0.03,V0,-999.00,-999.00,M1,0.04,0.10,V1,-999.00,-999.00,M1,0.07,0.03,V0,0.03,0.07,V1,0.04,0.08,V1,4.34,0.10,V0,1.55,0.12,V0,0.16,0.13,V0,0.26,0.07,V0,2.45,0.14,V0,0.07,0.07,V0,0.26,

---------

thoughts?

david.benoit
Author
16 replies
5 years ago
February 26, 2020

markatsafe wrote:

Thanks @markatsafe.

My version of FME already has these settings as default. I will keep working on this option and follow up.

Thanks again,

Dave

david_r
8332 replies
5 years ago
February 27, 2020

Lots of good ideas here, I'll just add that it's also possible to use e.g. the Text Line reader to read either line-by-line or the entire file in one block, then use the Python CSV module on a per-line basis, as needed.

Example PythonCaller:

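import fmeobjects
import csv

def SplitCSVLine(feature):
    # text_line_data holds one line as read by the Text Line reader
    text = feature.getAttribute('text_line_data')
    if text:
        # csv.reader expects an iterable of lines, so wrap the single line in a list
        row = next(csv.reader([str(text)]))
        # Store the parsed fields as the list attribute values{}
        feature.setAttribute('values{}', row)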

Sample output:

You can then either explode the list or rename the individual items as necessary.
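The same csv-module approach can be tried outside FME on the sample line from the question. This is a minimal sketch in plain Python 3, not the PythonCaller code above. One thing to watch: the sample has a space between a delimiter and an opening quote (`, "So does, this one"`), and the module's default settings do not treat that as a quoted field, so the sketch passes `skipinitialspace=True`. Whether the real files have the same quirk is an assumption.

```python
import csv

sample = ('"This data has , a comma in it", "So does, this one", '
          'This one does not,"but this one ,does", This one doesnt')

# Default settings: the quote after ", " is not at the start of the field,
# so it is kept as a literal character and the field is split at its comma.
naive = next(csv.reader([sample]))

# skipinitialspace=True ignores the space after each delimiter, so the
# following quote is recognised as the start of a quoted field.
clean = next(csv.reader([sample], skipinitialspace=True))

for item_id, value in enumerate(clean, start=1):
    print(f"{item_id} - {value}")
```

Comparing `len(naive)` with `len(clean)` on a few real lines is a quick way to see whether this setting matters for a given file.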

david.benoit
Author
16 replies
5 years ago
February 27, 2020

ebygomm wrote:

,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)

Okay, after filtering out NULLs (which aren't needed anyway) and some giant chunks that are also not needed, I was able to make this solution work. It takes 12 minutes to run compared to about two minutes before, but this might be due to a slow network on the SQL connection today. Thank you!
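A possible factor in the runtime, not confirmed anywhere in this thread: the lookahead in the regex re-scans the rest of the string for every comma it tests, so the work grows roughly quadratically with the length of the string. On very long lines, or when a whole file is read as one block, a single left-to-right pass may be cheaper. The sketch below is plain Python, and the function name `split_quoted` is hypothetical. It tracks whether the scan is currently inside quotes and splits only on delimiters outside them. Like the regex, it assumes balanced quotes and leaves the quote marks in the output.

```python
def split_quoted(line, delimiter=",", quote='"'):
    """Split line on delimiter, ignoring delimiters inside quoted runs."""
    fields = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == quote:
            # Toggle the state; the quote character itself is kept.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == delimiter and not in_quotes:
            # A delimiter outside quotes ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    # The last field has no trailing delimiter.
    fields.append("".join(current))
    return fields

print(split_quoted('a,"b,c",d'))
```

Each character is visited once, so the cost is linear in the length of the line.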
