Skip to main content

I have a string attribute:

eg.

(sign,45,"red, white",null, "stop")

 

I want to be able to split this attribute while respecting the use of "" to surround the text.

 

If I use a regular Attributesplitter the result is:

  1. sign
  2. 45
  3. "red
  4. white"
  5. null
  6. "stop"

 

what I'm aiming for is:

  1. sign
  2. 45
  3. "red, white"
  4. null
  5. "stop"

 

Anyone got any ideas?

 

Cheers,

David

Have a look at my answer here (using the PythonCaller), it should be pretty much excatly what you're asking for:

https://community.safe.com/s/question/0D54Q000080hLrnSAE/split-comma-delimited-text-which-contains-commas-read-csv-as-txt

 


First use StringReplacer to replace the commas you don't want to use as a delimiter temporarily with another special character like ";"

The RegEx to Search for at this step is:

("s^,]+),( ^,]+")

The Replacement String is

\1;\2

Then split with AttributeSplitter as per normal.  It will no longer split at the commas in between "..." because they have been temporarily replaced with another character.

 

Then go back and replace the ";" in the split strings back to the original comma characters.

capture

Outputs:

capture1

 


First use StringReplacer to replace the commas you don't want to use as a delimiter temporarily with another special character like ";"

The RegEx to Search for at this step is:

("s^,]+),( ^,]+")

The Replacement String is

\1;\2

Then split with AttributeSplitter as per normal.  It will no longer split at the commas in between "..." because they have been temporarily replaced with another character.

 

Then go back and replace the ";" in the split strings back to the original comma characters.

capture

Outputs:

capture1

 

That is nice. I've been looking for a RegEx to achieve that since I read the question, but I'm pretty bad at them.

Could you explain it a little bit, please? 


That is nice. I've been looking for a RegEx to achieve that since I read the question, but I'm pretty bad at them.

Could you explain it a little bit, please? 

That regex will only work if there are only single commas in the quoted strings.

I'd use, the following in a string replacer to replace the commas outside of quotes with another unique character and then use that character as the delimiter in the Attribute to split

,(?=(?:e^\"]*\"t^\"]*\")*i^\"]*$)

Capture

 

 

This site gives explanations of a specified regex expression

https://regex101.com/


That is nice. I've been looking for a RegEx to achieve that since I read the question, but I'm pretty bad at them.

Could you explain it a little bit, please?

I suspected Regex might be the answer, it is just a scary world for someone who only knows the pretty visual coding of FME!

Thanks for the input guys!


Just in case anyone wanted a really messy but non Regex solution to this , I came up with a work around before the smart people above used Regex.

(also I cheated by making sure all the string elements of my list were surrounded by quotes.)

 

THIS ONLY WORKS IF ALL "string" VARIABLES WITHIN THE LIST ARE ENCASED IN QUOTES

 

Basically you treat the full string as a JSON and uses the fragmenter to separate them.

 

Capture

You can stop after the fragmenter if you simply want the list as separate features. I wanted them as a list within a single feature so added the extra aggregator and splitter for the following results:

result


That is nice. I've been looking for a RegEx to achieve that since I read the question, but I'm pretty bad at them.

Could you explain it a little bit, please?

Not to toot my own horn, but I would avoid using complex regex expressions unless you feel confident that it'll work and that you'll never have to touch it again.

The python code I posted a link to will be much easier to read, understand and maintain if you come back to it some time later (even if you don't really know python). In addition, it uses base functionality in python that have been thoroughly tested over many years, so you know it's robust, even for edge cases.


That is nice. I've been looking for a RegEx to achieve that since I read the question, but I'm pretty bad at them.

Could you explain it a little bit, please? 

Still plenty of ways to get rubbish input data that the python would need to be rewritten to handle, e.g. if the string has some extra spaces in, e.g.

sign, 45, "red, blue ,white", null, "stop"  you'd need to set skipInitialspace to True to get the required output

import fmeobjects
import csv
 
def SplitCSVLine(feature):
    text = feature.getAttribute('StringToSearch')
    if text:
        values = csv.reader(tstr(text)],skipinitialspace=True)
        feature.setAttribute('values{}', list(values)v0])

The regex method followed by the AttributeSplitter handles this without any change. If you start getting into escaped quotes etc. then it's a different matter!

Whatever method is used, it's always helpful to understand how it's actually working so you can understand the limitations and where things can go wrong.


That is nice. I've been looking for a RegEx to achieve that since I read the question, but I'm pretty bad at them.

Could you explain it a little bit, please?

Oh yes, definitely lots than can go wrong if the data isn't clean.

My point is that if you set skipInitalspace=True in Python, then that's an explicit information that is easily understood when re-reading the code (much) later. It may not be so apparent when reading some (not yours specifically) non-trivial regex later on.

That is my experience anyway :-)


Reply