Hi all

 

 

I am cleaning up a huge address dataset and I am now stuck, so I hope someone here has an idea.

 

 

I have a large number of addresses in one attribute column that are all composed of the following:

 

 

"road name" "road number" "junk i need to remove"

 

 

example:

 

 

fictive street 123 ,23-65

 

 

My problem is that I want to remove everything after the road number (i.e. ",23-65").

 

 

The road names can contain any number of characters with a random number of white spaces. I'm guessing I need to use regular expressions, but I can't figure out how to select and remove all the junk text. The junk always comes after the road number and a white space, and it can be any number of characters long.

 

 

Any ideas?

 

 
Have you tried the AttributeSplitter with a space as the delimiter, followed by a StringConcatenator?
Hi,

 

 

> the junk always comes after the road number and a white space

 

 

I would try using the StringReplacer with this setting.

 

Text to Match: ^(.*\d)\s.*$

 

Replacement Text: \1

 

Use Regular Expressions: yes
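
For anyone who wants to check the pattern outside FME first, here is a minimal Python sketch of the same replacement. The function name strip_junk is made up for illustration, and Python's re module is assumed to behave like the StringReplacer for this simple pattern.

```python
import re

# Pattern from the settings above: greedily capture everything up to the
# last digit that is followed by whitespace, then match the rest of the line.
PATTERN = re.compile(r"^(.*\d)\s.*$")

def strip_junk(address: str) -> str:
    """Replace the whole line with capture group 1 (the part before the junk)."""
    return PATTERN.sub(r"\1", address)

print(strip_junk("fictive street 123 ,23-65"))  # fictive street 123
```

Note that the capture is greedy, so if the junk itself contains a digit followed by a space, the match extends to that later digit. Addresses with nothing after the road number do not match and are returned unchanged.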

 

 

Takashi
Hi

 

 

For some reason the regular expressions behave a bit unexpectedly (to me) in the StringSearcher et al.

 

 

But try the RegularExpressionMatcher (which is Python-based) from the FME Store and set it up as follows:

 

 

 

 

It will search for all the characters up until the first whitespace after the first group of numbers from the beginning of the line.

 

 

For a roadname attribute that contains "fictive street 123 ,23-65" it will return the list attribute "REM_matched_parts{0}" with the value "fictive street 123"
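
One pattern that fits this description, as a guess rather than the exact setting used in the transformer, can be tried in plain Python. Both the pattern and the function name matched_part are assumptions for illustration.

```python
import re

# Assumed pattern (not taken from the thread): optional non-digits, the first
# run of digits, then any non-whitespace characters directly attached to it.
FIRST_NUMBER = re.compile(r"^\D*\d+\S*")

def matched_part(roadname):
    """Return the text up to the first whitespace after the first number, or None."""
    m = FIRST_NUMBER.match(roadname)
    return m.group(0) if m else None

print(matched_part("fictive street 123 ,23-65"))  # fictive street 123
```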

 

 

David
A StringSearcher with the following regular expression should return everything you want, without the 'junk', in the matched result attribute:

 

 

^\D+[0-9]+
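
A quick way to try this expression is a few lines of Python. This is only a sketch: the function name road_and_number is invented here, and re.match stands in for the StringSearcher's matched result.

```python
import re

# One or more non-digits from the start of the line, then the first run of digits.
ROAD_AND_NUMBER = re.compile(r"^\D+[0-9]+")

def road_and_number(address):
    """Return the road name plus road number, or None if the pattern does not match."""
    m = ROAD_AND_NUMBER.match(address)
    return m.group(0) if m else None

print(road_and_number("fictive street 123 ,23-65"))  # fictive street 123
```

Because \D+ needs at least one non-digit at the start, an address that begins with a digit does not match at all and would have to be handled separately.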

 

 
Just a thought: do you need to allow for suffixes, e.g. fictive street 22a, 6340?
The following should allow for that eventuality (assuming single-letter suffixes):

 

 

^\D+[0-9]+[a-z]?
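
To see how the optional suffix behaves, here is a short Python sketch. The function name with_suffix is hypothetical and the sample addresses are the ones from this thread.

```python
import re

# Non-digits, the first run of digits, then at most one lowercase letter.
WITH_SUFFIX = re.compile(r"^\D+[0-9]+[a-z]?")

def with_suffix(address):
    """Return road name, number and an optional single-letter suffix, or None."""
    m = WITH_SUFFIX.match(address)
    return m.group(0) if m else None

print(with_suffix("fictive street 22a, 6340"))   # fictive street 22a
print(with_suffix("fictive street 123 ,23-65"))  # fictive street 123
```

As written, only lowercase suffixes are kept, so "22A" would come back as "22". If uppercase suffixes occur in the data, [a-zA-Z]? would cover them.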
Thank you all for your help. I was able to use your examples to clean up about 14500 addresses out of 14800. The rest are so screwed up that they will have to be cleaned more or less manually, but regular expressions sure are powerful 🙂
