
The following sample text strings were extracted from an AutoCAD DWG file.

I tried to build a regex pattern that removes the extra spaces inside each word, to make the text consistent and readable.

But the data is a mix of correctly and incorrectly spaced words.

Mostly, the bad pattern looks like: letter<space>letter<space>letter

 

R i v i e r e --->bad

St. Martin --->ok

Tatamaka River ---> bad

SIR MARTIN S. S. S --->ok

( J O H N K E N N E D Y L A N E ) --->bad

(A N D R E M A R T I N A V E N U E) --->bad

(CLAIRFONDS BRCH. RD No.3) --->ok

(EX M A R E G R A V I E R III S T R E E T ) --->bad

(SOLFERINO ROAD B77) --->ok

A.L.M. GELLE STREET --->ok

I think you probably won't find a single regex that solves all of this, since sometimes it's hard to tell what's right and what isn't. For example, what's wrong with Tatamaka River, and why is SIR MARTIN S. S. S right? Even the human-me can't tell!

 

However, a regexp with "text to replace": 

\s(\w)\s(\w)

And "replacement text": 

\1\2

Does the trick... okay? It matches a space, a word character, another space and another word character, and keeps only the two word characters. However, it struggles with EX M A R E G R A V I E R III S T R E E T, since it doesn't know that the spaces before and after III should stay. I think a good solution could work in two steps. First, find the spaces we want to keep, with this regexp:

(\w{2,})\s

and replace those with some other character. Perhaps we need one the other way around too, one that looks for a space \s before more than one letter \w{2,}, i.e. \s(\w{2,}).

 

So let's say you have three StringReplacers: one to replace good trailing spaces with #, one to replace good leading spaces with #, then one "big one" to remove the incorrect spaces. After that, the # placeholders can be turned back into spaces.
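For anyone who wants to try the idea outside FME, here is a rough Python sketch of those steps. It is only an illustration under a few assumptions: the function name fix_spacing is made up, # is assumed never to occur in the real data, a final step turns # back into a space, and the removal step uses a lookaround pattern instead of \s(\w)\s(\w) so that runs with an odd number of letters collapse completely.

```python
import re

def fix_spacing(text: str) -> str:
    """Protect 'good' spaces with #, drop the rest, then restore them."""
    # Step 1: a space right after a run of 2+ word characters is kept.
    text = re.sub(r"(\w{2,})\s", r"\1#", text)
    # Step 2: a space right before a run of 2+ word characters is kept.
    text = re.sub(r"\s(\w{2,})", r"#\1", text)
    # Step 3: any space still sitting between two word characters
    # separates single letters, so remove it.
    text = re.sub(r"(?<=\w)\s(?=\w)", "", text)
    # Step 4 (assumed): turn the placeholder back into a real space.
    return text.replace("#", " ")

for sample in [
    "R i v i e r e",
    "(EX M A R E G R A V I E R III S T R E E T )",
    "SIR MARTIN S. S. S",
]:
    print(fix_spacing(sample))
```

With the samples from the question, R i v i e r e comes out as Riviere and the III example keeps the spaces around III. Words whose boundary was already lost in the source (M A R E G R A V I E R) still merge into a single word, so some manual cleanup remains.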


Forgot to attach a workspace, which might help as a starting point.


Many thanks @fhilding for taking the time to look into these challenging regex scenarios.

I guess the \s(\w)\s(\w) pattern solved many of the issues.

However, it seems better to export the remaining unsolved strings to an Excel spreadsheet and clean them individually, as their spacing patterns are inconsistent.

 



I think that's probably the best way, yep! But if FME can do some of the work, that's not bad at all! :)


One other thing you could do is make it case-sensitive. So if you had this:

H igh Stree t

...we know that the space character should exist before Street because the S is upper-case. I can sort that out with this piece of Regex (goes into a StringReplacer transformer):

\s+(?=[a-z])

Just make sure that the Case Sensitive parameter is set correctly. But still, that only works when the string is in Title Case, not UPPER CASE.
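As a quick check outside FME, here is a minimal Python sketch, assuming the pattern is meant to read \s+(?=[a-z]). Python's re module is case-sensitive by default, which stands in for the Case Sensitive parameter here; the function name is hypothetical.

```python
import re

def drop_spaces_before_lowercase(text: str) -> str:
    # Remove any whitespace that is immediately followed by a lower-case
    # letter; whitespace before an upper-case letter is left alone.
    return re.sub(r"\s+(?=[a-z])", "", text)

print(drop_spaces_before_lowercase("H igh Stree t"))  # High Street
print(drop_spaces_before_lowercase("S T R E E T"))    # unchanged: no lower-case letters
```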

 

But you could go one step further and try to put spaces back where we think they should be. For example, we could follow up with a second StringReplacer that replaces "LANE" with "<space>LANE", like so:

(?=(STREET)|(LANE)|(AVENUE))

Actually, that will look for STREET OR LANE OR AVENUE. The "Question Mark-Equals" part is a lookahead: it finds the position just before the word without including the word itself in the match. So in effect, the replacement part (just a space character) gets inserted at that position.
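The same zero-width insertion can be sketched in Python. One assumption that is not part of the pattern above: a lookbehind (?<=\w) is added so that a space is only inserted when the road word is glued to the previous letter, which avoids doubling a space that is already there. The names ROAD_WORDS and split_road_word are made up for the example.

```python
import re

# Matches the empty position between a word character and one of the
# road words; nothing is consumed, so the replacement is purely inserted.
ROAD_WORDS = re.compile(r"(?<=\w)(?=STREET|LANE|AVENUE)")

def split_road_word(text: str) -> str:
    return ROAD_WORDS.sub(" ", text)

print(split_road_word("(JOHNKENNEDYLANE)"))    # (JOHNKENNEDY LANE)
print(split_road_word("A.L.M. GELLE STREET"))  # unchanged
```

A list like this will also split longer words that happen to end in one of the road words, so it is worth checking the results against the real street names.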

 


Many thanks @mark2atsafe, I tried the tokens you suggested, and also adapted the second one to match the spaced-out words, like (?=(S T R E E T)|(L A N E)|(A V E N U E)), and that was very helpful to further reduce the anomalies.

 

