Solved

Regex: How to remove spaces pattern in a w o r d

4 years ago
March 13, 2021
6 replies
2294 views


samisnunu
Contributor
65 replies

The following sample text strings were extracted from an AutoCAD DWG file.

I tried to build a regex pattern that removes the extra spaces inside each word, to make the text consistent and readable.

But the data is a mix of correctly and incorrectly spaced words.

Mostly, the bad pattern looks like: letter<space>letter<space>letter

R i v i e r e ---> bad

St. Martin ---> ok

Tatamaka River ---> bad

SIR MARTIN S. S. S ---> ok

( J O H N K E N N E D Y L A N E ) ---> bad

(A N D R E M A R T I N A V E N U E) ---> bad

(CLAIRFONDS BRCH. RD No.3) ---> ok

(EX M A R E G R A V I E R III S T R E E T ) ---> bad

(SOLFERINO ROAD B77) ---> ok

A.L.M. GELLE STREET ---> ok

Best answer by fhilding

I think it's probable that you won't find one regex to solve all of this - since sometimes it's a bit tough to know what's right and not. For example, what's wrong with Tatamaka River and why is SIR MARTIN S. S. S. right? Even the human-me can't tell!

However, a regexp with "text to replace":

\s(\w)\s(\w)

And "replacement text":

\1\2

That does the trick... okay? It matches a space, a single word character, another space and another word character, and keeps only the two word characters. However, it struggles with EX M A R E G R A V I E R III S T R E E T, since it doesn't know that the spaces before and after III should stay.

I think there could be a good solution that does this in two stages. First find the spaces we want to keep, with this regexp:

(\w{2,})\s

and replace those with some other character. Perhaps we need one the other way around too, which looks for a space \s followed by more than one letter \w{2,}.

So let's say you have three StringReplacers: One to replace good trailing spaces with #, one to replace good leading spaces with #, then one "big one" to remove the incorrect spaces.
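As a rough illustration outside FME, the single pattern and the three-replacer idea can be sketched in Python. The function names, the `#` marker default and the final "turn the marker back into a space" step are assumptions added for this sketch; the thread only describes the three replacements. It also assumes the input text never contains `#` itself.

```python
import re

def single_pass(text: str) -> str:
    # The one-step pattern from the answer:
    # space, word char, space, word char -> keep only the two word chars.
    return re.sub(r"\s(\w)\s(\w)", r"\1\2", text)

def three_step(text: str, marker: str = "#") -> str:
    # Replacer 1: protect "good" trailing spaces (a space after 2+ word chars).
    text = re.sub(r"(\w{2,})\s", r"\1" + marker, text)
    # Replacer 2: protect "good" leading spaces (a space before 2+ word chars).
    text = re.sub(r"\s(\w{2,})", marker + r"\1", text)
    # Replacer 3: the "big one" that removes the remaining incorrect spaces.
    text = single_pass(text)
    # Assumed final step (not spelled out in the thread): restore the spaces.
    return text.replace(marker, " ")

samples = [
    "R i v i e r e",
    "St. Martin",
    "SIR MARTIN S. S. S",
    "(EX M A R E G R A V I E R III S T R E E T )",
]

for s in samples:
    print(repr(s))
    print("  single pass:", repr(single_pass(s)))
    print("  three steps:", repr(three_step(s)))
```

In this sketch the spaces around III survive the three-step version but not the single pass. The last pattern still consumes letters two at a time, so a run of spaced letters can be left with one stray letter at the end, which fits the conclusion that some strings need manual cleanup.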


fhilding
56 replies
Best Answer
4 years ago
March 15, 2021


fhilding
56 replies
4 years ago
March 15, 2021

Forgot to attach a workspace, which might help as a starting point.


samisnunu
Author
Contributor
65 replies
4 years ago
March 15, 2021

Many thanks @fhilding for taking the time to look into these challenging regex scenarios.

I guess the \s(\w)\s(\w) pattern solved many of the issues.

However, it seems better to export the remaining unsolved strings to an Excel spreadsheet and clean them individually, as their spacing patterns are inconsistent.

fhilding
56 replies
4 years ago
March 15, 2021


I think that's probably the best way, yep! But if FME can do some of the work, that's not bad at all! :)


mark2atsafe
Safer
2518 replies
4 years ago
March 15, 2021

One other thing you could do is make it case-sensitive. So if you had this:

H igh Stree t

...we know that the space character should exist before Street because the S is upper-case. I can sort that out with this piece of Regex (goes into a StringReplacer transformer):

\s+(?=[a-z])

Just make sure that the Case Sensitive parameter is set so that the match is case-sensitive. Even so, this only works when the string is in Title Case, not UPPER CASE.
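A small Python sketch of the same lookahead, for anyone who wants to try it outside a StringReplacer. Python's `re` is case-sensitive by default, which stands in for the Case Sensitive parameter here. The function name is made up for this sketch, and "Rue de la Paix" is an invented test string, not one of the samples from the question.

```python
import re

# Whitespace that is immediately followed by a lowercase letter.
# The lookahead (?=[a-z]) checks the next character without consuming it.
SPACE_BEFORE_LOWERCASE = re.compile(r"\s+(?=[a-z])")

def join_before_lowercase(text: str) -> str:
    # Delete the matched whitespace; the lowercase letter itself is kept.
    return SPACE_BEFORE_LOWERCASE.sub("", text)

for s in ["H igh Stree t", "R i v i e r e", "St. Martin",
          "R I V I E R E", "Rue de la Paix"]:
    print(repr(s), "->", repr(join_before_lowercase(s)))
```

The UPPER CASE string comes back unchanged, as noted above. The invented "Rue de la Paix" example shows a further caveat: genuine words that start with a lowercase letter get glued to the previous word.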

But you could go one step further and try to put spaces back where we think they should be. For example, we could follow up with a second StringReplacer that replaces "LANE" with "<space>LANE", like so:

(?=(STREET)|(LANE)|(AVENUE))

Actually, that will look for STREET or LANE or AVENUE. The "question mark-equals" part, (?= ... ), is a lookahead: it finds the position just before the text but doesn't include the text in the match. So in effect, the replacement text (just a space character) gets inserted at that position.
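Here is a hedged Python sketch of that second replacement. The input strings are what the samples might look like after all spaces between letters have been removed. The function name and the extra step that collapses doubled spaces are assumptions for this sketch, not something from the thread.

```python
import re

# Zero-width lookahead: matches the position just before STREET, LANE or AVENUE.
SUFFIX_POSITION = re.compile(r"(?=(STREET)|(LANE)|(AVENUE))")

def reinsert_space(text: str) -> str:
    # Insert a space at every matched position (nothing is consumed).
    spaced = SUFFIX_POSITION.sub(" ", text)
    # Assumed cleanup: if a space was already there, we now have two; collapse them.
    return re.sub(r" {2,}", " ", spaced)

for s in ["(JOHNKENNEDYLANE)", "(ANDREMARTINAVENUE)", "A.L.M. GELLE STREET"]:
    print(repr(s), "->", repr(reinsert_space(s)))
```

One thing to watch: the lookahead fires wherever those letters occur, so a longer word that merely contains LANE or STREET would be split as well.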


samisnunu
Author
Contributor
65 replies
4 years ago
March 16, 2021

Many thanks @mark2atsafe, I tried the patterns you suggested, and also adapted the second one to match the spaced-out words, like (?=(S T R E E T)|(L A N E)|(A V E N U E)). That was very helpful in further reducing the anomalies.
