Question

regular expression problem


Hi all

 

 

I am cleaning up a huge address dataset and I am now stuck, so I hope someone here has an idea.

 

 

I have a large number of addresses in one attribute column, all composed of the following:

 

 

"road name" "road number" "junk i need to remove"

 

 

example:

 

 

fictive street 123 ,23-65

 

 

My problem is that I want to remove everything after the road number (i.e. ,23-65).

 

 

The road names can contain any number of characters with a random number of white spaces. I'm guessing I need to use regular expressions, but I can't figure out how to select and remove all the junk text. The junk always comes after the road number and a white space, and it can be any number of characters long.

 

 

Any ideas?

 

 

7 replies

pratap
Contributor
  • July 27, 2015
Have you tried the AttributeSplitter with a space as the delimiter, followed by a StringConcatenator?

takashi
Contributor
  • July 27, 2015
Hi,

 

 

> the junk always comes after the road number and a white space

 

 

I would try using the StringReplacer with this setting.

 

Text to Match: ^(.*\d)\s.*$

Replacement Text: \1

 

Use Regular Expressions: yes
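
For anyone who wants to test the expression outside FME first, here is a minimal sketch in Python. It assumes Python's `re` module behaves like the StringReplacer's regex engine for this pattern, which is not guaranteed; the function name `strip_junk` is made up for the illustration.

```python
import re

# "Text to Match" and "Replacement Text" from the settings above, written as
# Python raw strings. Assumption: Python's re module stands in for FME's engine.
PATTERN = re.compile(r"^(.*\d)\s.*$")

def strip_junk(address: str) -> str:
    """Keep everything up to the last digit that is followed by whitespace."""
    # Strings that do not match the pattern are returned unchanged.
    return PATTERN.sub(r"\1", address)

print(strip_junk("fictive street 123 ,23-65"))  # fictive street 123
print(strip_junk("fictive street 123"))         # unchanged, nothing to strip
```

Because `.*` is greedy, the match ends at the last digit that is followed by whitespace. If the junk itself contains a digit followed by a space (for example "fictive street 123 ,23 65"), part of the junk is kept.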

 

 

Takashi

david_r
Evangelist
  • July 27, 2015
Hi

 

 

For some reason the regular expressions behave a bit unexpectedly (to me) in the StringSearcher et al.

 

 

But try the RegularExpressionMatcher (that is Python-based) from the FME Store and set it up as follows:

 

 

 

 

It will search for all the characters up until the first whitespace after the first group of numbers from the beginning of the line.

 

 

For a roadname attribute that contains "fictive street 123 ,23-65" it will return the list attribute "REM_matched_parts{0}" with the value "fictive street 123"

 

 

David

ebygomm
Influencer
  • July 27, 2015
A StringSearcher with the following regular expression should return everything you want, without the 'junk', in the matched result attribute:

 

 

^\D+[0-9]+
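
A quick way to see what this expression captures, sketched in Python. This assumes Python's `re` module as a stand-in for the StringSearcher; the helper name `road_and_number` is hypothetical.

```python
import re

# Expression from the reply above: one or more non-digits from the start of
# the string, followed by the first run of digits.
PATTERN = re.compile(r"^\D+[0-9]+")

def road_and_number(address):
    """Return the matched 'road name + number' part, or None if no match."""
    match = PATTERN.match(address)
    return match.group(0) if match else None

print(road_and_number("fictive street 123 ,23-65"))  # fictive street 123
print(road_and_number("route 66 road 12 ,4"))        # route 66
```

Note the second call: the match stops at the first group of digits, so a road name that itself contains a number is cut short.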

 

 

ebygomm
Influencer
  • July 27, 2015
Just a thought: do you need to allow for suffixes, e.g. fictive street 22a, 6340?

ebygomm
Influencer
  • July 27, 2015
The following should allow for that eventuality (assuming single-letter suffixes):

 

 

^\D+[0-9]+[a-z]?
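
The suffix variant can be tried out the same way. Again a sketch only, assuming Python's `re` module rather than FME's own engine; `road_number_suffix` is a made-up name.

```python
import re

# Expression from the reply above. [a-z]? accepts at most one lowercase
# letter directly after the road number.
WITH_SUFFIX = re.compile(r"^\D+[0-9]+[a-z]?")

def road_number_suffix(address):
    """Return road name, number and optional one-letter suffix, or None."""
    match = WITH_SUFFIX.match(address)
    return match.group(0) if match else None

print(road_number_suffix("fictive street 22a, 6340"))   # fictive street 22a
print(road_number_suffix("fictive street 123 ,23-65"))  # fictive street 123
print(road_number_suffix("fictive street 22A, 6340"))   # fictive street 22
```

As the last call shows, an uppercase suffix is not picked up by `[a-z]`; in Python, `[a-zA-Z]?` or the `re.IGNORECASE` flag would cover that case.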

Thank you all for your help. I was able to use your examples to clean up about 14500 addresses out of 14800. The rest are so screwed up that they will have to be cleaned more or less manually, but regular expressions sure are powerful :)
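
One way to separate the addresses that clean up automatically from the ones that need manual work, sketched in Python. This is not necessarily how it was done in this thread; the sample values and the `partition` helper are invented for the illustration, and the pattern is the suffix-aware expression posted earlier.

```python
import re

# Suffix-aware expression from the thread; Python's re is an assumption here.
PATTERN = re.compile(r"^\D+[0-9]+[a-z]?")

def partition(addresses):
    """Split addresses into (cleaned, needs_manual_review)."""
    cleaned, manual = [], []
    for address in addresses:
        match = PATTERN.match(address)
        if match:
            cleaned.append(match.group(0))
        else:
            manual.append(address)  # keep the original text for review
    return cleaned, manual

# Hypothetical sample data.
sample = [
    "fictive street 123 ,23-65",
    "fictive street 22a, 6340",
    "123 ,23-65",       # starts with a digit, so no road name is matched
    "no number here",   # contains no road number at all
]
cleaned, manual = partition(sample)
print(cleaned)
print(manual)
```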
