REGEX Question


Badge +3

Given the following strings, how would I remove a repeated word, in this case the word Pensions? The first string contains Pensions once and the second contains it twice. Where it is repeated, I want to remove the first occurrence of Pensions but keep the second one.

2006SourcesOfPersonalIncomeTotalResponses06OtherSuperOrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver

2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensionsOrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver


5 replies

Userlevel 2
Badge +16

Not sure if you can use it (or if it is too specific for this case), but the StringReplacer lets you replace one piece of text with another.

In this case you could replace SuperPensions by Super.

That removes the first but not the last instance of Pensions.
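Outside FME, the same literal replacement can be sketched in Python. This is only an illustration of the idea, not the StringReplacer itself; the variable names are made up for the example.

```python
# The second string from the question, which contains "Pensions" twice.
TEXT = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

# Literal (non-regex) replacement: "SuperPensions" only occurs at the
# first instance, so the later "WarPensions" is left untouched.
result = TEXT.replace("SuperPensions", "Super")
print(result)
```

This only works because the word in front of the first Pensions (Super) is known in advance.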

Hope this helps.

Badge +3

This can be done with "Look-Aheads" in RegEx, but their syntax is somewhat complex.
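For reference, a look-ahead version might look like the following in Python. It is a minimal sketch using Python's re module rather than an FME transformer, and the function name is hypothetical.

```python
import re

def drop_first_if_repeated(text: str, word: str = "Pensions") -> str:
    """Remove the first `word` only when another `word` follows it."""
    escaped = re.escape(word)
    # (?=.*word) is a look-ahead: it checks that `word` appears again
    # later in the string without consuming any characters.
    pattern = escaped + r"(?=.*" + escaped + r")"
    # count=1 limits the substitution to the first qualifying match.
    return re.sub(pattern, "", text, count=1)

ONE = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuper"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)
TWO = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

print(drop_first_if_repeated(TWO))  # first "Pensions" removed
print(drop_first_if_repeated(ONE))  # unchanged, only one "Pensions"
```

A string with a single Pensions never satisfies the look-ahead, so it passes through untouched.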

It is a couple more Transformers, but using StringSearcher as the basis will do the job with much simpler forms of RegEx. In List Mode, StringSearcher will return all the single instances of the word "Pensions" and what character positions each instance starts at.

Using this, it is then possible:

a) To determine if there is more than one instance of "Pensions" in the result (by looking at how many search results are in the List). If there is more than one List item, then there are multiple instances of "Pensions".

b) To use StringReplacer to remove the first instance of "Pensions" by replacing the part of the string that matches "{CharactersBefore}Pensions" with just "{CharactersBefore}". For "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensionsOrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver", the RegEx used in the sample below evaluates to e.g. ".{53}Pensions", which means: find the 53 characters that precede the first "Pensions", together with the word itself, and replace that match with just those 53 characters. Leaving "Pensions" out of the replacement string effectively removes the first instance.
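Steps a) and b) can be sketched in Python as follows. This is an approximation of the workflow, not the workspace itself: the function names are hypothetical, and the ^ anchor and the capture group are assumptions added so that the pattern can only match at the start of the string and the preceding characters can be written back.

```python
import re

def find_positions(text: str, word: str = "Pensions") -> list[int]:
    # Rough analogue of StringSearcher in list mode: the start index
    # of every match of `word`.
    return [m.start() for m in re.finditer(re.escape(word), text)]

def remove_first_by_position(text: str, word: str = "Pensions") -> str:
    starts = find_positions(text, word)
    if len(starts) < 2:
        return text  # zero or one instance: nothing to remove
    # Builds e.g. "^(.{53})Pensions": 53 characters, then the word.
    pattern = r"^(.{%d})%s" % (starts[0], re.escape(word))
    # Write back only the captured preceding characters.
    return re.sub(pattern, r"\1", text, count=1)

ONE = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuper"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)
TWO = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

print(find_positions(TWO))
print(remove_first_by_position(TWO))
```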

 

...and if anyone is wondering why I used AttributeFilter instead of Tester: it is much, much faster because AttributeFilter is Bulk Mode enabled.

Userlevel 1
Badge +21

You can set up the StringReplacer as follows

This matches two groups. The first group contains the word Pensions and all characters after it, up to the next occurrence of Pensions. The second group matches Pensions. The \\1 replaces the match with the first group.
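The exact pattern from the screenshot is not reproduced here, so the patterns below are assumptions based on the description only. Read literally, a two-group pattern that writes back group 1 keeps the first Pensions and drops the second; moving the first Pensions outside the group gives what the question asks for. A Python sketch of both:

```python
import re

ONE = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuper"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)
TWO = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

# Literal reading of the description (assumed pattern):
# group 1 = first "Pensions" plus everything up to the next one,
# group 2 = the next "Pensions". Writing back \1 drops group 2.
as_described = re.sub(r"(Pensions.*?)(Pensions)", r"\1", TWO, count=1)

# Variant (assumed): the first "Pensions" sits outside the group, so
# writing back \1 drops the first instance and keeps the second.
keep_second = re.sub(r"Pensions(.*?Pensions)", r"\1", TWO, count=1)

print(as_described)
print(keep_second)
```

Neither pattern matches a string with only one Pensions, so such strings are left unchanged.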

Badge +8

removeredundantwordsfromstring.fmwt

Hi, I tried to catch your string in a regular expression and ended up with the following:

[A-Z]+[a-z]+[0-9]*|[0-9]+|[A-Z]+

It should be used in StringSearcher in a case sensitive way, and post processed using a combination of ListDuplicateRemover and ListConcatenator (see workspace attached).

One has to ask oneself what defines a word that can be repeated. My interpretation:

  • It starts with a capital
  • Or is just plain numbers
  • Or it only contains capitals

This is what the regular expression does in the case above, reading from left to right (the order of the alternatives matters, RE are not abelian...):

  • It looks for one or more capitals in the range A-Z, followed by
  • one or more lowercase character in the range a-z, followed by
  • zero or more digits

OR, if that does not succeed,

  • It looks for one or more digits

OR, if that does not succeed,

  • It looks for one or more capitals

In the advanced section of the StringSearcher, provide a name for the list of all matches.

Duplicates in this list can be removed using a ListDuplicateRemover, specifying the list.match{} list

The original string can be reconstructed by using a ListConcatenator, concatenating the list.match{} list. Do not forget to clear the default comma separator, otherwise the words in the concatenated string will be separated by commas.
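The search, de-duplicate and concatenate steps can be approximated in Python like this. It is a sketch, not the attached workspace: the function name is hypothetical, and it assumes the duplicate removal keeps the first occurrence of each word.

```python
import re

# Same expression as above; matching is case sensitive by default.
WORD_PATTERN = re.compile(r"[A-Z]+[a-z]+[0-9]*|[0-9]+|[A-Z]+")

def remove_repeated_words(text: str) -> str:
    # Like StringSearcher with a match list: every word, in order.
    words = WORD_PATTERN.findall(text)
    # Like ListDuplicateRemover (assumed to keep the first occurrence):
    # dict.fromkeys drops later duplicates and preserves order.
    unique = list(dict.fromkeys(words))
    # Like ListConcatenator with an empty separator.
    return "".join(unique)

TWO = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

words = WORD_PATTERN.findall(TWO)
result = remove_repeated_words(TWO)
print(words)
print(result)
```

Under that assumption the later duplicate is the one that disappears, so the first Pensions is kept rather than the second, and the repeated Or and Other are removed as well.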

Hope this helps.

 

Badge +3

Do you know what the duplicate word in each of your strings is going to be? Your example string shows multiple duplicate words, including 'Or' and 'Other', so you can't just use a regex that 'simply' filters out any duplicate word.
