Question

Finding the Middle Spaces in the Field


Badge +1
I am working on an XP machine with FME'13. I have an Address field, and I need to find the spaces in the middle of the value (not the leading or trailing ones) in every row.

 

"St. Petters Road" has valid spacing (it need not be flagged), but

"St. Pe tters Road" is invalid,

and I need to flag the invalid values in a new column with 0 or 1.

Can we do that with transformers?

22 replies

Userlevel 2
Badge +17
If the valid format can be defined strictly as "3 words separated by single white spaces", these two approaches are possible.

 

 

1. AttributeSplitter + ListElementCounter + Tester

 

1) Split the string at white space (AttributeSplitter).

 

2) Count output list elements (ListElementCounter).

 

3) Determine if number of elements is 3 (Tester).

 

 

2. StringSearcher

 

This regular expression matches the valid format.

 

-----

 

^([^\s]+)\s([^\s]+)\s([^\s]+)$

 

-----

 

Features with a valid address will go to the MATCHED port.

 

 

There should be several other approaches.
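
For readers who want to try the two ideas outside FME, here is a minimal Python sketch. It is not FME's implementation: the function names are made up for illustration, and it assumes the splitter keeps empty parts when two delimiters are adjacent, which in FME depends on the AttributeSplitter settings.

```python
import re

# Pattern from the post: three runs of non-space characters,
# each separated by exactly one whitespace character.
THREE_WORDS = re.compile(r"^([^\s]+)\s([^\s]+)\s([^\s]+)$")


def has_three_parts_by_split(address: str) -> bool:
    """Approach 1: split at every single space and count the parts.

    str.split(" ") keeps empty strings between adjacent spaces, so a
    doubled space produces an extra (empty) part.
    """
    return len(address.split(" ")) == 3


def has_three_words_by_regex(address: str) -> bool:
    """Approach 2: test the whole string against the regular expression."""
    return THREE_WORDS.match(address) is not None


if __name__ == "__main__":
    for sample in ("St. Petters Road", "St. Pe tters Road"):
        print(repr(sample),
              has_three_parts_by_split(sample),
              has_three_words_by_regex(sample))
```

Both checks only count words; neither can tell whether a given word is spelled correctly.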
Badge +1
Hi Takashi,

 

 

Thanks for the reply.

 

No, there is no fixed number of words in an address. It can be a single word or as long as 10 words.
Userlevel 2
Badge +17
If you had to inspect them by eye, how would you determine whether an address is valid or not?

 

If the rule can be stated clearly, a machine can apply it. Otherwise, it would be difficult.
Userlevel 2
Badge +17
Considerations:

 

Does a valid word always begin with an uppercase letter, and does an invalid word always begin with a lowercase letter?

 

Do you have a list of valid words?

 

and so on.
Userlevel 4
Badge +13
Or just evaluate each space... This is not something I would try to automate unless extra information is available.
Badge +1
The only clue I have is that a space in the middle is valid only if an uppercase letter follows it; otherwise it is invalid.

 

Examples:

St. Petters Road (the first letter of each word is in caps)

St. Petters road (this will also be treated as valid; not a problem)

St. Pet ters Road (this will be invalid because "ters" does not begin with a capital letter)
Userlevel 2
Badge +17
This workflow sends features whose Address contains a word beginning with anything other than an uppercase letter to the "Invalid" inspector.

 

 

Badge +3
Hi,

 

 

This regexp grabs spaces in front of lowercase letters.

(\s)+[a-z]+

If you use Tcl in a Tester, the -all switch will find them all:

@Evaluate([regexp -all {(\s)+[a-z]+} "@Value(streetname)"]) != 0
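
As a rough equivalent outside FME, the same count can be sketched in Python. Tcl's `regexp -all` returns the number of matches, which `len(findall(...))` mirrors here; the function name is hypothetical, and the capturing group is dropped so that `findall` returns whole matches.

```python
import re

# Same idea as the Tcl expression: one or more whitespace characters
# followed by one or more lowercase ASCII letters.
LOWER_AFTER_SPACE = re.compile(r"\s+[a-z]+")


def count_lowercase_words_after_space(streetname: str) -> int:
    """Count the places where a space is followed by a lowercase letter."""
    return len(LOWER_AFTER_SPACE.findall(streetname))


if __name__ == "__main__":
    for name in ("St. Petters Road", "St. Pe tters Road", "St. Petters road"):
        # A non-zero count corresponds to the Tester condition "!= 0".
        print(repr(name), count_lowercase_words_after_space(name))
```

Note that "St. Petters road" is counted as well, which is the limitation discussed in the rest of this post.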

 

 

 

There is no way to equate "St. Petters road" with "St. Petters Road" unless you have both, for example in a database of correctly spelled street names. But then you wouldn't need to look for the lowercase letters any more, I guess.

 

 

Software can't decide whether "Pet ters" is a valid name or not. Only humans can, if they have a full database at their disposal (or a good and maybe large memory).

 

Me, I only know a St. Peters Road. So to me it would be flagged as invalid.

 

 

 

According to Neal Stephenson in Cryptonomicon, there are some weird names on this planet... Qghlmn and such... lol.
Userlevel 2
Badge +17
Gio is right. A computer can easily determine whether a word begins with an uppercase character or not, but it cannot decide whether "road" is valid unless you have a list of all the valid words (and maybe syntax analysis logic).

 

 

The workflow I posted was too cluttered. It can be replaced with a single Tester with this setting.

 

Left Value  |  Operator  |  Right Value  |  Negate

 

Address  |  Matches Regex  |  .+\s[^A-Z].*  |  (uncheck)

 

 

Plan B: Use your eyes and hands.

 

If you expect that there are not many invalid records, I think a partly manual operation could be a quicker solution. For example:

Step 1: Divide the records into a valid part and a (candidate) invalid part with the Tester.

Step 2: Manually extract the valid records from those that were determined as (candidate) invalid in Step 1.

Step 3: Merge the valid part (Step 1) and the valid records (Step 2) to create the final result.
Badge +1
OK, I understand that it is very difficult, and I suppose it is impossible without a database. But, using FME, can we find the uppercase letters in the middle and check for a space before every uppercase letter? If there is a space, flag 0; otherwise 1. Can we do that?
Userlevel 2
Badge +17
This regular expression matches strings which contain "at least one space followed by a character that is not an uppercase letter".

 

-----

 

.*\s[^A-Z].*

 

-----

 

If it satisfies your requirement, it can be used to filter features (Tester) or to do conditional value setting (AttributeCreator), etc.
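
To illustrate what conditional value setting with this expression would produce, here is a small Python sketch. The function name and the convention "1 = flagged, 0 = not flagged" are assumptions; the thread does not fix which value means what.

```python
import re

# Pattern from the post: some whitespace character followed by any
# character that is not an uppercase ASCII letter.
SUSPECT = re.compile(r".*\s[^A-Z].*")


def flag_address(address: str) -> int:
    """Return 1 if the address matches the pattern, else 0.

    The 1/0 meaning is an assumption made for this sketch.
    """
    return 1 if SUSPECT.search(address) else 0


if __name__ == "__main__":
    for sample in ("St. Petters Road", "St. Pet ters Road", "Main Road 12"):
        print(repr(sample), flag_address(sample))
```

Because `[^A-Z]` means "anything but an uppercase letter", a digit or a second space after a space is flagged too, so an address such as "Main Road 12" would be reported.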
Badge +1
Hello again,

 

I don't mind whether the word is correct or not. All I want is this: if there are three words, there should be 2 spaces; if there are 5 words, then 4 spaces.

Example:

St. Petters Road - It should contain only two spaces (assuming there are no leading and trailing spaces).

The road name can also be wrong (which can be seen in address component normalization).

Here, the road name can also be wrong, like St. Petters Ro ad (assuming that "Ro" and "ad" are two words, and they are correct).

Simply put, N words should be separated by exactly N-1 spaces (not counting leading and trailing spaces). Is there any possibility to do that?
Userlevel 2
Badge +17
Hi again,

 

 

I'm not sure what your exact requirement is.

Do you mean these are correct? (there is only one space between every two words)

 

-----

 

St.<space>Petters<space>Road

 

St.<space>Petters<space>Ro<space>ad

 

-----

 

 

And these are wrong? (there are two or more spaces between two words)

 

-----

 

St.<space>Petters<space><space>Road

 

St.<space><space>Petters<space>Road

 

-----

 

 

What about these?

 

-----

 

St.<space>Pet<space>ters<space>Road  <-- correct?

 

St.<space>Pet<space><space>ters<space>Road  <-- wrong?

 

-----
Badge +1
Hi Takashi,

 

 

Yes, you got it right:

St.<space>Pet<space>ters<space>Road  <-- correct

St.<space>Pet<space><space>ters<space>Road  <-- wrong

If there is more than one space between two words, it's incorrect. Only one space is allowed between two words.
Userlevel 2
Badge +17
OK. This regular expression matches a string which contains "more than one space between two words".

 

-----

 

.*[^\s]\s{2,}[^\s].*

 

-----

 

You can use the expression with "Matches Regex" operator as a test condition in the Tester or the AttributeCreator (conditional value setting).
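
If you want to check the behaviour of this expression before wiring it into a workspace, a quick Python sketch like the following can help. It only illustrates the pattern and is not FME's own regex engine; the function name is hypothetical.

```python
import re

# Pattern from the post: a non-space character, then two or more
# whitespace characters, then another non-space character.
INNER_MULTI_SPACE = re.compile(r".*[^\s]\s{2,}[^\s].*")


def has_multiple_inner_spaces(address: str) -> bool:
    """True if two words are separated by more than one space."""
    return INNER_MULTI_SPACE.search(address) is not None


if __name__ == "__main__":
    samples = [
        "St. Petters Road",      # single spaces only
        "St. Petters  Road",     # doubled space between words
        "  St. Petters Road",    # leading spaces only
        "St. Petters Road  ",    # trailing spaces only
    ]
    for sample in samples:
        print(repr(sample), has_multiple_inner_spaces(sample))
```

Since the pattern requires a non-space character on both sides of the run of spaces, leading and trailing spaces alone do not match.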

 

Badge +1
Thanks for the reply once again! :)

But I did not understand this: .*[^\s]\s{2,}[^\s].*

Can you please explain it? Thank you!
Userlevel 2
Badge +17
. represents any character.

* means 0 or more of the preceding element.

[ ] defines a character class.

^ means "not" at the start of a character class definition.

\s represents a white space character.

{N,} means N or more.

So,

.*  --> zero or more of any character

[^\s]  --> any character except white space

\s{2,}  --> two or more white spaces

 

 

See here to learn more about regular expressions.

 

http://www.regular-expressions.info/

 

Badge +1
Thank you Takashi for the reply,

 

 

Maybe we can also do this task using the Tester:

Address (field name)  Like  %__%

where the __ (underscores) stand for spaces.
Userlevel 2
Badge +17
Your solution is nearly equivalent to mine. Just be aware that these addresses will also match "Like  %__%", i.e. cases where there is more than one space before or after the address.

 

-----

 

<space><space>St.<space>Petters<space>Road

 

St.<space>Petters<space>Road<space><space>

 

-----
Badge +1
Exactly. Very true.

I am trying to find a solution for that, but have not succeeded yet! :(
Userlevel 2
Badge +17
I think the regular expression I suggested before is an appropriate solution. Are there any problems with using the regex?
Userlevel 2
Badge +17
Plan B :)

 

If you create a trimmed copy of the address beforehand, you can get the same result as the regular expression without changing the original address.

 

1) AttributeCreator

 

_trimmed = Address

 

2) AttributeTrimmer

 

Attributes to Trim: _trimmed

 

Trim Type: Both

 

3) Tester

 

_trimmed  Like  %__%
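
The difference between testing the raw address and testing the trimmed copy can be sketched in Python as follows. The function names are hypothetical, `str.strip()` stands in for the AttributeTrimmer, and the "Like" test is read as "contains two consecutive spaces", as described above.

```python
def has_double_space(text: str) -> bool:
    """Stand-in for the Like test with two spaces between the % signs."""
    return "  " in text


def has_inner_double_space(address: str) -> bool:
    """Trim a copy first, then test it; the original string is untouched."""
    trimmed = address.strip()  # plays the role of the _trimmed attribute
    return has_double_space(trimmed)


if __name__ == "__main__":
    samples = [
        "  St. Petters Road",   # leading spaces only
        "St. Petters Road  ",   # trailing spaces only
        "St. Pet  ters Road",   # doubled space in the middle
    ]
    for sample in samples:
        print(repr(sample),
              has_double_space(sample),
              has_inner_double_space(sample))
```

Testing the raw value reports all three samples, while testing the trimmed copy reports only the one with the doubled space in the middle.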
