Skip to main content

I am looking to split some data, which is stored in csv but contains commas.

Unfortunately it cannot be read in using a CSV reader due to its complexity and non table / row format.

An example line of data might look like this:

 

"This data has , a comma in it", "So does, this one", This one does not,"but this one ,does", This one doesnt, or does this one,"Another comma, comma, comma",No comma, still none, still none

 

I considered doing 4 splits:

split 1: ","

 

Split 2: ,"

 

Split 3: ",

 

Split 4: ,

 

And then join the results back together. Although this would work, I need to retain the order of the data and assign an ascending ID.

 

 

EG:

1 - This data has , a comma in it

2 - So does, this one

3 - This one does not

4 - but this one ,does

5 - This one doesnt

 

So my consideration will not work since it will put things out of order.

 

Secondly, I considered doing a pre-processing step which would read in all the data as CSV, then Write it as CSV with a TAB delimiter instead of COMMA.

But what happens here is that, the original commas which were contained in the data, are being treated as delimiters. To complicate it more, this would have to be done in a bulk format since there are many many files.

Perhaps this solution is possible but there is a way to set up a writer schema or something that is beyond my understanding.

Any recommendations are appreciated.

 

 

You could use some regex in a stringreplacer to replace any comma that is preceded by an even number of quote marks with some other character and then use that in the attribute splitter (assuming your quotes are balanced)

,(?=(?:Â^\"]*\"a^\"]*\")*p^\"]*$)

0684Q00000ArKbXQAV.png


@david.benoit I think the CSV reader should be able to handle this. The Field Qualifier Character controls whether <quoted> fields can include the Delimieter Character. One problem you may have encountered is that the CSV reader has an 'auto' mode for the delimiter, so in this case it seems to use <space> as the default. So being explicit about the Delimeter Character might also help:


You could use some regex in a stringreplacer to replace any comma that is preceded by an even number of quote marks with some other character and then use that in the attribute splitter (assuming your quotes are balanced)

,(?=(?:Â^\"]*\"a^\"]*\")*p^\"]*$)

0684Q00000ArKbXQAV.png

Thank you!.

 

One thing im noticing is it seems to get stuck (no error, no finish .. just stuck!) on a line break.. eg: 

-------

,JOSMCAABFMK,2017-01-01,00:01:00,2017-01-02,00:01:00,MST,2017-01-01,07:01:00,2017-01-02,07:01:00,1.65,0.03,V0,0.70,0.03,V0,4.31,0.10,V0,-999.00,-999.00,M1,0.68,0.06,V0,3.10,0.28,V0,0.95,0.10,V0,2.64,0.17,V0,-999.00,-999.00,M1,1.05,0.06,V0,1.69,0.05,V0,0.13,0.21,V1,0.01,0.05,V1,0.51,0.13,V0,0.12,0.04,V0,4.19,0.10,V0,0.07,0.03,V0,-999.00,-999.00,M1,0.04,0.10,V1,-999.00,-999.00,M1,0.07,0.03,V0,0.03,0.07,V1,0.04,0.08,V1,4.34,0.10,V0,1.55,0.12,V0,0.16,0.13,V0,0.26,0.07,V0,2.45,0.14,V0,0.07,0.07,V0,0.26,

 

---------

thoughts?


@david.benoit I think the CSV reader should be able to handle this. The Field Qualifier Character controls whether <quoted> fields can include the Delimieter Character. One problem you may have encountered is that the CSV reader has an 'auto' mode for the delimiter, so in this case it seems to use <space> as the default. So being explicit about the Delimeter Character might also help:

Thanks @markatsafe.

 

 

My version of FME already has these settings as default. I will keep working on this option and follow up.

Thanks again,

Dave


Lots of good ideas here, I'll just add that it's also possible to use e.g. the Text Line reader to read either line-by-line or the entire file in one block, then use the Python CSV module on a per-line basis, as needed. 

Example PythonCaller:

import fmeobjects
import csv

def SplitCSVLine(feature):
    text = feature.getAttribute('text_line_data')
    if text:
        values = csv.reader(        feature.setAttribute('values{}', list(values)(0])

Sample output:

0684Q00000ArL92QAF.png

You can then either explode the list or rename the individual items as necessary.


You could use some regex in a stringreplacer to replace any comma that is preceded by an even number of quote marks with some other character and then use that in the attribute splitter (assuming your quotes are balanced)

,(?=(?:Â^\"]*\"a^\"]*\")*p^\"]*$)

0684Q00000ArKbXQAV.png

Okay after filtering out NULLs (which arent needed anyway) and some giant chunks that are also not needed, i was able to make this solution work. It takes 12 minutes to run compared to about two minutes before. but this might be due to a slow network on sql connection today. Thank you! 


Reply