REGEX Question

Forum|Forum|6 years ago
April 9, 2020
5 replies
70 views

gisgeek
Contributor

Given the following string, how would I remove a repeated word, in this case the word Pensions? I want to remove the first occurrence of Pensions but keep the second one.

2006SourcesOfPersonalIncomeTotalResponses06OtherSuperOrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver

2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensionsOrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver

erik_jan
Contributor
April 9, 2020

Not sure if you can use it (or if it is too specific for this case), but the StringReplacer allows for a text to be replaced by another text.

In this case you could replace SuperPensions by Super.

That removes the first but not the last instance of Pensions.
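Outside FME, the same literal replacement can be sketched in Python. This is only an illustration of the idea, not the StringReplacer itself, and the variable names are made up for the example.

```python
# Literal replacement, analogous in spirit to replacing SuperPensions by Super.
# Variable names are hypothetical; the string is the one from the question.
source = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

# "SuperPensions" occurs only once, so only the first "Pensions" disappears.
cleaned = source.replace("SuperPensions", "Super")
print(cleaned)
```

This only works because the first Pensions happens to follow Super; a string with a different word in front of it would need a different search text.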

Hope this helps.

bwn
Evangelist
April 9, 2020

This can be done with "Look-Aheads" in RegEx, but their syntax is somewhat complex.
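For anyone curious, a look-ahead version might look roughly like the Python sketch below. It is an assumption about what such a pattern could be, not something tested in FME, and the variable names are hypothetical.

```python
import re

# The string from the question that contains "Pensions" twice.
source = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

# Match "Pensions" only when another "Pensions" follows later in the string.
# The look-ahead (?=...) checks what follows without consuming it.
pattern = r"Pensions(?=.*Pensions)"

# count=1 limits the substitution to the first such match.
cleaned = re.sub(pattern, "", source, count=1, flags=re.DOTALL)
print(cleaned)
```

A string with only one Pensions is left untouched, because the look-ahead finds nothing to match after it.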

It takes a couple more Transformers, but using StringSearcher as the basis will do the job with much simpler forms of RegEx. In List Mode, StringSearcher returns every single instance of the word "Pensions" together with the character position at which each instance starts.

Using this, it is then possible:

a) To determine if there is more than one instance of "Pensions" in the result (by looking at how many search results are in the List). If there is more than one List item, then there are multiple instances of "Pensions".

b) To use StringReplacer to remove the first instance of "Pensions" by simply replacing the part of the string that matches "{CharactersBefore}Pensions" with just "{CharactersBefore}". For the string in the question that contains "Pensions" twice, the RegEx used in my sample evaluates to e.g. ".{53}Pensions", which means: find the 53 characters that precede the first "Pensions", together with the word itself, and replace the whole match with only those 53 preceding characters. Leaving "Pensions" out of the replacement string effectively removes the first instance of "Pensions".
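Steps a) and b) can be sketched in Python as follows. This is a rough analogue rather than the FME workspace: `re.finditer` stands in for StringSearcher in List Mode, and the function and variable names are hypothetical.

```python
import re

# The string from the question that contains "Pensions" twice.
source = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

def remove_first_if_repeated(text, word="Pensions"):
    # a) Collect the start position of every instance of the word.
    starts = [m.start() for m in re.finditer(re.escape(word), text)]
    if len(starts) < 2:
        return text  # zero or one instance: nothing to remove

    # b) Build e.g. "^(.{53})Pensions" and keep only the captured prefix.
    #    The ^ anchor is an addition here so the prefix is counted from the start.
    pattern = "^(.{%d})%s" % (starts[0], re.escape(word))
    return re.sub(pattern, r"\1", text, count=1, flags=re.DOTALL)

cleaned = remove_first_if_repeated(source)
print(cleaned)
```

Counting the matches first is what makes this safe for strings that contain the word only once: they are returned unchanged.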

...and if anyone is wondering why I used AttributeFilter instead of Tester: it is much, much faster because AttributeFilter is Bulk Mode enabled.

ebygomm
Evangelist
April 9, 2020

You can set up the StringReplacer as follows:

This matches two groups: the first group contains the word Pensions and all characters after it up until the next occurrence of Pensions. The second group matches Pensions. The \1 replaces the match with the first group.
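The exact expression from the StringReplacer setup is not reproduced in this thread, so the Python sketch below is an assumption about one way a capture group and \1 can remove the first Pensions while keeping the second. Its grouping differs from the description above, and the function name is hypothetical.

```python
import re

# The string from the question that contains "Pensions" twice.
source = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

def drop_first_of_two(text, word="Pensions"):
    w = re.escape(word)
    # Match the first word, then capture everything up to and including
    # the next occurrence. Replacing the match with \1 keeps only the capture.
    pattern = f"{w}(.*?{w})"
    return re.sub(pattern, r"\1", text, count=1, flags=re.DOTALL)

cleaned = drop_first_of_two(source)
print(cleaned)
```

If the word occurs only once, the pattern cannot match and the string comes back unchanged.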

helmoet
April 9, 2020

removeredundantwordsfromstring.fmwt

Hi, I tried to catch your string in a regular expression and ended up with the following:

[A-Z]+[a-z]+[0-9]*|[0-9]+|[A-Z]+

It should be used in StringSearcher in case-sensitive mode, with the result post-processed using a combination of ListDuplicateRemover and ListConcatenator (see workspace attached).

One has to ask oneself what defines a repeated word. My interpretation:

It starts with a capital
Or is just plain numbers
Or it only contains capitals

This is what the regular expression does in the case above. It works from left to right (REs are not abelian: the order of the alternatives matters).

It looks for one or more capitals in the range A-Z, followed by
one or more lowercase characters in the range a-z, followed by
zero or more digits

OR, if that does not succeed,

It looks for one or more digits

OR, if that does not succeed,

It looks for one or more capitals

In the advanced section of the StringSearcher, provide a name for the list of all matches.

Duplicates in this list can be removed using a ListDuplicateRemover, specifying the list.match{} list

The original string can be reconstructed by using a ListConcatenator, concatenating the list.match{} list. Do not forget to remove the default comma separator, otherwise the words in the concatenated string will be separated by commas.
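The whole chain (search, remove duplicates, concatenate) can be approximated in Python as below. It is a sketch of the idea rather than the attached workspace: `re.findall` stands in for StringSearcher, `dict.fromkeys` for ListDuplicateRemover, and `"".join` for ListConcatenator with an empty separator. The variable names are hypothetical.

```python
import re

# The string from the question that contains "Pensions" twice.
source = (
    "2006SourcesOfPersonalIncomeTotalResponses06OtherSuperPensions"
    "OrAnnuitiesOtherThanNZSuperVeteransOrWarPensionsCURP15YrsAndOver"
)

# Same expression as above; Python's re module is case sensitive by default.
WORD_PATTERN = r"[A-Z]+[a-z]+[0-9]*|[0-9]+|[A-Z]+"

# Split the string into "words" (no capture groups, so full matches are returned).
tokens = re.findall(WORD_PATTERN, source)

# Remove duplicates while keeping the first occurrence and the original order.
unique_tokens = list(dict.fromkeys(tokens))

# Concatenate without any separator.
cleaned = "".join(unique_tokens)
print(tokens)
print(cleaned)
```

On the string from the question this keeps the first Pensions and drops the second, and it also drops the second Other and the second Or. Note that NZSuper and Responses06 each come out as a single word, so the Super inside NZSuper is not treated as a duplicate.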

Hope this helps.

arnold_bijlsma
Enthusiast
April 9, 2020

Do you know what the duplicate word in each of your strings is going to be? Your example string shows multiple duplicate words, including 'Or' and 'Other', so you can't just use a regex that 'simply' filters out any duplicate word.
