Question

Replace regex match with whole new string

  • 9 September 2019
  • 11 replies
  • 17 views

Badge

Hi,

I have a series of regular-expression and new-string pairs.

If the regular expression gets a match, I don't just want to replace the characters that match; I want to replace the whole matched string with the new string.

As an example:

If the old string is - CAROLA 1.6ABC

The regex is - C[AO]?[CR]R?O?O?L?L?L?LI?A?

The new string needs to be - COROLLA

To make matters more difficult, there are about 400+ unique pairings.

 

Thanks in advance


11 replies

Badge

Put a .* at the end? And a .* at the start if you want to get rid of prefixes as well. Or am I underthinking this?

.*C[AO]?[CR]R?O?O?L?L?L?LI?A?.*
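
For what it's worth, here's how that anchored pattern behaves with Python's re module (just a sketch, using the example string from the question):

import re

# Wrapping the pattern in .* makes the match consume the whole string,
# so the substitution replaces everything, not just the matched part.
pattern = r".*C[AO]?[CR]R?O?O?L?L?L?LI?A?.*"

print(re.sub(pattern, "COROLLA", "CAROLA 1.6ABC"))  # prints COROLLA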

Badge

Put a .* at the end? And a .* at the start if you want to get rid of prefixes as well. Or am I underthinking this?

.*C[AO]?[CR]R?O?O?L?L?L?LI?A?.*

But this actually sounds like a job for SchemaMapper instead.

Badge +3

@cj

As you apparently are trying to find all (or as many as possible) forms of something that sounds or looks like "COROLLA", you cannot capture it to be replaced, as you would be capturing any incorrectly written version.

 

You would be filtering and mapping.

You could use an AttributeCreator:

[regexp -all {C[AO]?[CR]R?O?O?L?L?L?LI?A?} {@Value(tt)}]!=0?"COROLLA":"NOPE"

(a conditional statement)

Or the provided conditional statement structure.
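
For illustration, the same filter-and-map logic in plain Python (a sketch only; the FME expression above is what you would put in the AttributeCreator, and the non-matching test string below is just an arbitrary example):

import re

# If the pattern matches anywhere in the value, map it to "COROLLA",
# otherwise flag it as "NOPE".
pattern = re.compile(r"C[AO]?[CR]R?O?O?L?L?L?LI?A?")

def map_model(value):
    return "COROLLA" if pattern.search(value) else "NOPE"

print(map_model("CAROLA 1.6ABC"))  # COROLLA
print(map_model("CIVIC 1.8"))      # NOPE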

Badge

But this actually sounds like a job for SchemaMapper instead.

The .* at each end would help with getting rid of suffixes and prefixes. The real challenge, however, is how to apply the 400+ combinations.

Badge

@cj

As you apparently are trying to find all (or as many as possible) forms of something that sounds or looks like "COROLLA", you cannot capture it to be replaced, as you would be capturing any incorrectly written version.

 

You would be filtering and mapping.

You could use an AttributeCreator:

[regexp -all {C[AO]?[CR]R?O?O?L?L?L?LI?A?} {@Value(tt)}]!=0?"COROLLA":"NOPE"

(a conditional statement)

Or the provided conditional statement structure.

Thanks, a conditional statement is the logic I am trying to achieve (if MATCH then replace string). The additional difficulty, however, is how to scale that to the 400+ unique combinations.

Badge

I have come up with the following Python script:

import re

old_string = "CAROLA 1.6ABC"
match = re.match("C[AO]?[CR]R?O?O?L?L?L?LI?A?", old_string)
if match is not None:
    # Replace the entire string, not just the matched characters
    old_string = "COROLLA"
    print(old_string)
else:
    print(match)

This does the search and replace well, but now I'm trying to figure out a way to scale it. Ideally it would run off some sort of lookup table containing all the unique regex/new-string combinations.
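
One possible shape for that, as a rough sketch (the table contents below are placeholders for the real 400+ pairs):

import re

# Lookup table of (regex, new string) pairs; only one real entry is
# shown, the rest stand in for the 400+ pairings.
lookup = [
    (r"C[AO]?[CR]R?O?O?L?L?L?LI?A?", "COROLLA"),
    # ... remaining regex / new-string pairs ...
]

def standardise(value):
    for pattern, new_string in lookup:
        if re.match(pattern, value):
            return new_string   # replace the whole string on the first match
    return value                # no pattern matched, leave the value as is

print(standardise("CAROLA 1.6ABC"))  # COROLLA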

Badge

Have managed to achieve the desired outcome with the following workflow;

 

Basically, merge the REGEX and NEWMODEL values as a list onto the incoming data, then use the StringSearcher to test all the old models against the regular expressions for that MAKE, and replace the old model with NEWMODEL for those that match.

It feels a bit inefficient; I still think there is a way using the above Python script, the lookup table, and maybe looping in a custom transformer.

Badge

I have come up with the following Python script:

import re

old_string = "CAROLA 1.6ABC"
match = re.match("C[AO]?[CR]R?O?O?L?L?L?LI?A?", old_string)
if match is not None:
    # Replace the entire string, not just the matched characters
    old_string = "COROLLA"
    print(old_string)
else:
    print(match)

This does the search and replace well, but now I'm trying to figure out a way to scale it. Ideally it would run off some sort of lookup table containing all the unique regex/new-string combinations.

An important thing here is whether you are doing a one-time translation of a fixed dataset or if you need to build a robust system to tackle incoming data of varying quality.

If it's the first, a fixed dataset, I would just set up the mapping in a spreadsheet and use SchemaMapper. But if you get new data all the time, you need to predict the errors in the data and/or build a system to catch the new variants that you need to incorporate in the mapping, making the system more and more robust as you go along.
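
If you do go the scripted route instead of SchemaMapper, the spreadsheet idea still applies; a rough sketch, assuming the mapping is exported as a CSV with hypothetical "regex" and "newstring" columns:

import csv

# Load the regex -> new-string mapping from a CSV exported from the
# spreadsheet. The file name and column headers are assumptions.
with open("model_mapping.csv", newline="") as f:
    lookup = [(row["regex"], row["newstring"]) for row in csv.DictReader(f)]

# "lookup" can then drive the matching loop from the earlier sketch,
# so new variants only need a new row in the spreadsheet.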

Badge

An important thing here is whether you are doing a one-time translation of a fixed dataset or if you need to build a robust system to tackle incoming data of varying quality.

If it's the first, a fixed dataset, I would just set up the mapping in a spreadsheet and use SchemaMapper. But if you get new data all the time, you need to predict the errors in the data and/or build a system to catch the new variants that you need to incorporate in the mapping, making the system more and more robust as you go along.

This is not a one-time translation, this will be processing new and updated data on a regular schedule.

Userlevel 1
Badge +10

Have managed to achieve the desired outcome with the following workflow;

[workflow screenshot]

 

Basically, merge the REGEX and NEWMODEL values as a list onto the incoming data, then use the StringSearcher to test all the old models against the regular expressions for that MAKE, and replace the old model with NEWMODEL for those that match.

It feels a bit inefficient; I still think there is a way using the above Python script, the lookup table, and maybe looping in a custom transformer.

I'd look at keeping the first part of your workflow as far as the FeatureMerger, then use a PythonCaller to loop through the list

e.g.

import fme
import fmeobjects
import re

def processFeature(feature):
    model = feature.getAttribute('model')
    # Pair each list entry's replacement string with its regex
    ziplist = zip(feature.getAttribute('_list{}.newstring'),
                  feature.getAttribute('_list{}.regex'))
    for new_string, pattern in ziplist:
        match = re.match(pattern, model)
        if match is not None:
            feature.setAttribute("newstring", new_string)
Badge +3

Actually, I don't think RegEx is the best solution here, as you would struggle to catch all varieties. Plus you would need to set up an entirely new RegEx when the next model is on your list.

 

Have you looked into the FuzzyStringComparer (or its big sister, FuzzyStringCompareFrom2Datasets)? Effectively it compares two strings and gives a difference ratio / similarity score between 0 and 1: 1 = the strings are identical, 0 = the strings are entirely different.
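
To illustrate the similarity-score idea outside FME (a sketch using Python's difflib as a stand-in; this is not necessarily how the FuzzyStringComparer scores internally, and the comparison strings are just examples):

from difflib import SequenceMatcher

# Similarity ratio between two strings: 1 = identical, 0 = entirely different
def similarity(a, b):
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

print(similarity("CAROLA", "COROLLA"))   # roughly 0.77, i.e. very similar
print(similarity("CAROLA", "CAMRY"))     # a much lower score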

See attached workbench
