Question

Text To Column using RegEx


Badge +1

Hi,

I am using FME 2017.0 x64 and the input data is PostGIS data from server

So, there is a column in the table 'other_tags that contains strings as displayed. I want the values in the left of => to be the attribute name and the value in the right of => to be the value of that attribute. The two columns are separated by comma(,).

The catch is the attribute name are not in a certain order, so I found a bit of difficulty to create list.

Moreover, the values in the columns can be multiple, like surface could be- paved, unpaved, asphalt, grass and many more and since the workbench has to be used for multiple countries the string also cannot be defined.

Could there be some easy possible way to extract the values that can make workbench shorter and easier.

I tried string searcher with regex- surface"=>".*"$ but something is going wrong.

Thanks.


12 replies

Userlevel 4

You say "as displayed" but I can't see anything other than your question, did you maybe forget to post a screenshot?

A concrete example of the text you want to split up would be very helpful in trying to help.

Badge +1

You say "as displayed" but I can't see anything other than your question, did you maybe forget to post a screenshot?

A concrete example of the text you want to split up would be very helpful in trying to help.

Hi David, I have attached the screenshot now.

 

 

Badge

Hi @hellblazer,

Could you please provide some sample input data which will be helpful for providing a solution.

If the single attribute contains multiple values that you need to separate into new attributes, first you can use an AttributeSplitter to build a list.

However, as a reply to your RegEx question, you can pass the value in 'other_tags' into two new attributes (in this example have created 'att_name' and 'att_value') and use two StringReplacers to extract part before => as 'att_name' and part after => as 'att_value'. (Please see the attached workspace)

To extract value for 'att_name' you can use RegEx ^.*>

To extract value for 'att_value' RegEx is =>.*$

Then you can build the new attribute using 'att_name' and 'att_value'.

Note: FME is not directly supporting creating new attribute using a value of another attribute to use them within the workflow. However, you can create attribute in this manner and write them into final output using an SchemaSetter and setting the writer to Dynamic with "Schema From Schema Feature" as depicted in the attached sample workspace.

regex-sample.fmw

 

Regards,

Priyantha Pallegama

Userlevel 4

You can of course accomplish the same thing using transformers (e.g. AttributeSplitters etc), but here's a PythonCaller based solution that will create an attribute for each tag listed in "other_tags":

def SplitTags(feature):
    tags_str = feature.getAttribute('other_tags')
    if tags_str:
        # Split tags into key/value pairs in a dict
        tags = dict([x.split('=>') for x in tags_str.split(',')])
        # Trim superfluous spaces
        tags = {k.strip(): v.strip() for k, v in tags.iteritems()}
        # Trim double quotes
        tags = {k.strip('"'): v.strip('"') for k, v in tags.iteritems()}
        for k, v in tags.iteritems():
            feature.setAttribute(k, v)

Example input:

"int_ref" => "E 691", "name:ka" => "ka", "name:ru" => "ru", "surface" => "asphalt"

Example output:

`int_ref' has value `E 691'
`name:ka' has value `ka'
`name:ru' has value `ru'
`surface' has value `asphalt'

Note that there might be some edge cases where you'll get unexpected results, notably if the string "=>" or a comma occurs inside a tag value rather than as a separator. If that's the case, a more advanced parsing mechanism is needed.

Userlevel 4

If you want a purely regex-based solution, you can use this in a StringSearcher:

"surface"\s*=>\s*"(.+?)"

If you set the "subexpression matches list name" to "_subs", you'll find the surface value in the attribute _subs{0}.part

Badge +1

Hi @hellblazer,

Could you please provide some sample input data which will be helpful for providing a solution.

If the single attribute contains multiple values that you need to separate into new attributes, first you can use an AttributeSplitter to build a list.

However, as a reply to your RegEx question, you can pass the value in 'other_tags' into two new attributes (in this example have created 'att_name' and 'att_value') and use two StringReplacers to extract part before => as 'att_name' and part after => as 'att_value'. (Please see the attached workspace)

To extract value for 'att_name' you can use RegEx ^.*>

To extract value for 'att_value' RegEx is =>.*$

Then you can build the new attribute using 'att_name' and 'att_value'.

Note: FME is not directly supporting creating new attribute using a value of another attribute to use them within the workflow. However, you can create attribute in this manner and write them into final output using an SchemaSetter and setting the writer to Dynamic with "Schema From Schema Feature" as depicted in the attached sample workspace.

regex-sample.fmw

 

Regards,

Priyantha Pallegama

Hi @pallegama

 

 

Thanks for your efforts.I have attached the sampledata.zip. This will give you a more clear picture to you about my requirement.

 

Userlevel 2
Badge +17

Hi @hellblazer, alternatively, the JSONFlattener with this expression set to the JSON Document parameter can be used here.

{@ReplaceString(@Value(other_tags),=>,:)}

You can then expose required attribute names with the AttributeExposer, if necessary.

See also the attachment:  jsonflattener-demo.fmwt (FME 2017.0.1.1)

Userlevel 2
Badge +17

Hi @hellblazer, alternatively, the JSONFlattener with this expression set to the JSON Document parameter can be used here.

{@ReplaceString(@Value(other_tags),=>,:)}

You can then expose required attribute names with the AttributeExposer, if necessary.

See also the attachment:  jsonflattener-demo.fmwt (FME 2017.0.1.1)

I forgot the JSONFlattener has the "Attributes to Expose" parameter. The AttributeExposer is not essential to expose attribute names.

 

 

Badge +3

@hellblazer

If you try to use .* (any wordcarachter) for searching thenyou are also searching for comma's etc.

Supposing the strings only contain alphanumerical characters, spaces,questionmarks and semicolons then you can use following regexp in a (for instance) stringsearcher

"([a-zA-Z:\\s\\d\\?]+)"=>"([a-zA-Z:\\s\\d\\?]+)"

Badge +3

@hellblazer

If you try to use .* (any wordcarachter) for searching thenyou are also searching for comma's etc.

Supposing the strings only contain alphanumerical characters, spaces,questionmarks and semicolons then you can use following regexp in a (for instance) stringsearcher

"([a-zA-Z:\\s\\d\\?]+)"=>"([a-zA-Z:\\s\\d\\?]+)"

 

...and of course set All Matches Listname and subexpression etc. on the stringsearcher (if you use that one)

 

 

Badge +1

@hellblazer

If you try to use .* (any wordcarachter) for searching thenyou are also searching for comma's etc.

Supposing the strings only contain alphanumerical characters, spaces,questionmarks and semicolons then you can use following regexp in a (for instance) stringsearcher

"([a-zA-Z:\\s\\d\\?]+)"=>"([a-zA-Z:\\s\\d\\?]+)"

Hi @gio

 

I tried using your idea of the above reg ex, and it definitely works with the first element. Although, if I look at the overall number of elements like there are surface, lanes, name eng, rus, etc., I need to create a list for each. Also, the elements are not in order in each row, will that be an issue?

 

Badge +3
Hi @gio

 

I tried using your idea of the above reg ex, and it definitely works with the first element. Although, if I look at the overall number of elements like there are surface, lanes, name eng, rus, etc., I need to create a list for each. Also, the elements are not in order in each row, will that be an issue?

 

@hellblazer

 

 

You can use the string searcher and set All Matces and sub part list name.

 

All the matches will then be in the resulting list.

 

Order is no issue, all that matters is the pattern of the input data.

 

The Items in the character class (everything between the brackets) should cover the required extent.

 

Here is a pic

 

(I used David's test string, so I added optional spaces between the items, whitch your input has not and underscores to the regexp)

 

 

Sub match part 0 is attr_name and part 1 is it's value.

 

Reply