Question

Finding the Middle Spaces in the Field

10 years ago
12 February 2014
22 replies
1 view

parashari
132 replies

I am working on an XP machine with FME'13. I have an Address field, and I need to find the spaces in the middle of each value (not the leading or trailing ones) in every row.

St. Petters Road has valid spaces (it need not be flagged), but

St. Pe tters Road is invalid.

I need to flag the invalid values in a new column using 0 and 1.

Can we do that with transformers?

takashi
Contributor
7538 replies
10 years ago
12 February 2014

If the valid format can be defined strictly as "3 words separated by white space", these two approaches are possible.

1. AttributeSplitter + ListElementCounter + Tester

1) Split the string at white space (AttributeSplitter).

2) Count output list elements (ListElementCounter).

3) Determine if number of elements is 3 (Tester).

2. StringSearcher

This regular expression matches the valid format.

-----

^([^\s]+)\s([^\s]+)\s([^\s]+)$

-----

Features with a valid address will go to the MATCHED port.

There are probably several other approaches.
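For anyone who wants to try the two ideas outside FME first, here is a minimal Python sketch of both. The function names are made up for illustration, and `str.split(" ")` is only assumed to behave like an AttributeSplitter that splits at every single space and keeps empty parts.

```python
import re

# Pattern from the StringSearcher suggestion: exactly three non-space
# tokens, each pair separated by one whitespace character.
THREE_WORDS = re.compile(r"^([^\s]+)\s([^\s]+)\s([^\s]+)$")


def is_three_words_by_split(address: str) -> bool:
    """Approach 1: split at every space and count the parts."""
    # Splitting on a literal space keeps empty strings for repeated
    # spaces, so "A  B C" yields 4 parts and is rejected.
    return len(address.split(" ")) == 3


def is_three_words_by_regex(address: str) -> bool:
    """Approach 2: test the string against the three-word pattern."""
    return THREE_WORDS.match(address) is not None


if __name__ == "__main__":
    for sample in ("St. Petters Road", "St. Pe tters Road"):
        print(sample, is_three_words_by_split(sample), is_three_words_by_regex(sample))
```

Both checks only work while "valid" really means "exactly three words", which is the assumption stated at the top of this reply.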

Hi Takashi,

Thanks for the reply.

No, there is no fixed number of words in an address. It can be a single word or as long as 10 words.

takashi
Contributor
7538 replies
10 years ago
12 February 2014

If you had to inspect them by eye, how would you determine whether an address is valid or not?

If the rule can be stated clearly, it can be applied by machine. Otherwise, it would be difficult.

takashi
Contributor
7538 replies
10 years ago
12 February 2014

Considerations:

Does a valid word always begin with an upper case letter, and does an invalid word always begin with a lower case letter?

Do you have a list of valid words?

and so on.

fmelizard
Contributor
3701 replies
10 years ago
12 February 2014

Or just evaluate each space... Not something I would try to automate unless extra information is available.

The only clue I have is that a space in the middle is valid only if an uppercase letter comes right after it; otherwise it is invalid.

Ex-

St. Petters Road (The first letter of each word should be in caps)

St. Petters road (This will also be treated as valid; not a problem)

St. Pet ters Road (This will be invalid because the first letter of "ters" is not in caps)

takashi
Contributor
7538 replies
10 years ago
12 February 2014

This workflow sends features whose Address contains a word beginning with anything other than an uppercase letter to the "Invalid" Inspector.

Hi,

This regexp grabs spaces in front of lowercase letters.

(\s)+[a-z]+

If you use Tcl in a Tester, the -all switch will find them all:

@Evaluate([regexp -all {(\s)+[a-z]+} "@Value(streetname)"]) != 0
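As a rough cross-check outside FME, the same count can be sketched in Python. Tcl's `regexp -all` returns the number of matches, so counting the non-overlapping matches should correspond to it; the function name is hypothetical.

```python
import re

# Same pattern as in the post: one or more whitespace characters
# followed by one or more lowercase ASCII letters.
LOWER_AFTER_SPACE = re.compile(r"(\s)+[a-z]+")


def count_lowercase_starts(streetname: str) -> int:
    """Count the places where whitespace is followed by lowercase letters."""
    # finditer yields one match object per non-overlapping match.
    return sum(1 for _ in LOWER_AFTER_SPACE.finditer(streetname))


if __name__ == "__main__":
    for name in ("St. Petters Road", "St. Petters road", "St. Pet ters Road"):
        # A count other than 0 corresponds to the Tester condition "!= 0".
        print(name, count_lowercase_starts(name))
```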

There is no way to equate St. Petters road with St. Petters Road unless you have both, for example in a database of correctly spelled street names. But then you wouldn't need to find the lowercase letters any more, I guess.

Software can't decide whether Pet ters is a valid name or not. Only humans can, if they have a full database at their disposal (or a good and maybe large memory).

Me, I only know a St. Peters road, so to me it would flag as invalid.

According to Neal Stephenson in Cryptonomicon, there are some weird names on this planet... Qwghlm and such... lol.

takashi
Contributor
7538 replies
10 years ago
13 February 2014

Gio is right. A computer can easily determine whether a word begins with an upper case character, but it cannot decide whether "road" is valid unless you have a list of all the valid words (and maybe syntax analysis logic).

The workflow I posted was too cluttered. It can be replaced with a single Tester with this setting.

Left Value | Operator | Right Value | Negate

Address | Matches Regex | .+\s[^A-Z].* | (uncheck)
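To see what this Tester setting would pass, here is a small Python sketch of the same pattern applied to the examples from earlier in the thread. It assumes Python's `re` treats the pattern the same way as FME's "Matches Regex"; the function name is illustrative only.

```python
import re

# Tester pattern from the post: some text, a whitespace character, then
# any character that is not an uppercase ASCII letter.
CANDIDATE_INVALID = re.compile(r".+\s[^A-Z].*")


def is_candidate_invalid(address: str) -> bool:
    """True if some whitespace is followed by a non-uppercase character."""
    return CANDIDATE_INVALID.search(address) is not None


if __name__ == "__main__":
    for sample in ("St. Petters Road", "St. Petters road", "St. Pet ters Road"):
        print(sample, is_candidate_invalid(sample))
```

Note that `[^A-Z]` also covers digits, punctuation and a second space, so an address such as "Main Street 12" would be reported as a candidate as well.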

Plan B: Use your eyes and hands.

If you expect the number of invalid records to be small, I think a partly manual operation could be a quicker solution. For example:

Step 1: Divide the records into a valid part and a (candidate) invalid part with the Tester.

Step 2: Manually pick out the valid records from those flagged as (candidate) invalid in Step 1.

Step 3: Merge the valid part (Step 1) and the valid records (Step 2) to create the final result.
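The three steps can be sketched in Python as follows. This is only an outline under assumptions: the function names are hypothetical, and the manual review in Step 2 is represented by a set of addresses that a person has confirmed as valid.

```python
import re

# Same test as the single Tester: whitespace followed by a character
# that is not an uppercase ASCII letter.
CANDIDATE_INVALID = re.compile(r".+\s[^A-Z].*")


def split_records(addresses):
    """Step 1: separate valid records from candidate invalid ones."""
    valid, candidates = [], []
    for address in addresses:
        if CANDIDATE_INVALID.search(address):
            candidates.append(address)
        else:
            valid.append(address)
    return valid, candidates


def merge_after_review(valid, candidates, confirmed_valid):
    """Step 3: add back the candidates a person confirmed in Step 2."""
    rescued = [a for a in candidates if a in confirmed_valid]
    return valid + rescued


if __name__ == "__main__":
    records = ["St. Petters Road", "St. Petters road", "St. Pet ters Road"]
    valid, candidates = split_records(records)
    # Step 2 happens by eye; here the reviewer accepted one candidate.
    print(merge_after_review(valid, candidates, {"St. Petters road"}))
```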

OK, I understand that this is very difficult, and I suppose it is impossible without a database. But using FME, can we find the uppercase letters in the middle and check for a space before every uppercase letter? If there is a space, flag 0, otherwise 1. Can we do that?

takashi
Contributor
7538 replies
10 years ago
13 February 2014

This regular expression matches strings that contain "at least one space followed by a character other than an uppercase letter".

-----

.*\s[^A-Z].*

-----

If it satisfies your requirement, it can be used to filter features (Tester) or for Conditional Value Setting (AttributeCreator), etc.

Hello again,

I don't mind whether the word is correct or not. All I want is this: if there are three words, there should be 2 spaces; if there are 5 words, then 4 spaces.

Ex-

St. Petters Road - It should contain only two spaces (assuming there are no leading and trailing spaces).

The road name can also be wrong (that can be dealt with in address component normalization).

Here, the road name can also be wrong, like St. Petters Ro ad (assuming that "Ro" and "ad" are two words, and that they are correct).

Simply put, for N words there should be (N-1) spaces (not counting leading and trailing spaces). Is there any way to do that?
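Stated as code, the "(N-1) spaces for N words" rule might look like the Python sketch below. It assumes only ordinary space characters matter (no tabs) and that leading and trailing spaces are ignored; the function name is made up.

```python
def has_single_spacing(address: str) -> bool:
    """True if N words are separated by exactly N-1 space characters."""
    trimmed = address.strip(" ")  # ignore leading and trailing spaces
    # Splitting on a single space leaves empty strings where spaces
    # repeat; dropping them gives the actual words.
    words = [part for part in trimmed.split(" ") if part]
    spaces = trimmed.count(" ")
    return spaces == max(len(words) - 1, 0)


if __name__ == "__main__":
    for sample in ("St. Petters Road", "St. Petters Ro ad", "St. Petters  Road"):
        print(repr(sample), has_single_spacing(sample))
```

The check passes for "St. Petters Ro ad" because it only counts spaces and does not judge whether the words themselves are correct.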

takashi
Contributor
7538 replies
10 years ago
18 February 2014

Hi again,

I'm not sure what your exact requirement is.

Do you mean these are correct? (there is only one space between every two words)

-----

St.<space>Petters<space>Road

St.<space>Petters<space>Ro<space>ad

-----

and these are wrong? (there are two or more spaces between two words)

-----

St.<space>Petters<space><space>Road

St.<space><space>Petters<space>Road

-----

What about these?

-----

St.<space>Pet<space>ters<space>Road <-- correct?

St.<space>Pet<space><space>ters<space>Road <-- wrong?

-----

Hi Takashi,

Yes, you got it right:

St.<space>Pet<space>ters<space>Road <-- correct

St.<space>Pet<space><space>ters<space>Road <-- wrong

If there is more than one space between two words, it's incorrect. Only one space is allowed between two words.

takashi
Contributor
7538 replies
10 years ago
18 February 2014

O.K. This regular expression matches a string that contains "more than one space between two words".

-----

.*[^\s]\s{2,}[^\s].*

-----

You can use the expression with the "Matches Regex" operator as a test condition in the Tester or the AttributeCreator (conditional value setting).
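A quick way to check the expression against the examples discussed above is a small Python sketch like this one. It assumes Python's `re` handles the pattern the same way as FME's "Matches Regex"; the function name is hypothetical.

```python
import re

# Pattern from the post: a non-space character, then two or more
# whitespace characters, then another non-space character.
REPEATED_INNER_SPACE = re.compile(r".*[^\s]\s{2,}[^\s].*")


def has_repeated_inner_space(address: str) -> bool:
    """True if two words are separated by more than one space."""
    return REPEATED_INNER_SPACE.search(address) is not None


if __name__ == "__main__":
    samples = [
        "St. Petters Road",      # single spaces only
        "St. Pet ters Road",     # single spaces only
        "St. Petters  Road",     # two spaces before Road
        "St. Pet  ters Road",    # two spaces inside
        "  St. Petters Road",    # leading spaces only
    ]
    for sample in samples:
        print(repr(sample), has_repeated_inner_space(sample))
```

Because the pattern requires a non-space character on both sides of the run of spaces, leading and trailing spaces alone do not trigger a match.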

Thanks for the reply once again! (:

But I did not understand this: .*[^\s]\s{2,}[^\s].*

Can you please explain it? Thank you!

takashi
Contributor
7538 replies
10 years ago
18 February 2014

. represents any character.

* means 0 or more of the preceding item.

[ ] defines a character class.

^ means "not" at the start of a character class definition.

\s represents a white space character.

{N,} means N or more.

So,

.* --> zero or more of any character

[^\s] --> any character except white space

\s{2,} --> two or more white spaces

See here to learn more about regular expressions.

http://www.regular-expressions.info/
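The breakdown above can also be written as a commented pattern. The sketch below uses Python's `re.VERBOSE` flag, which lets each piece sit on its own line with a comment; this is a Python feature used here for illustration and is not something FME is known to require.

```python
import re

# The same expression, one piece per line with its meaning.
ANNOTATED = re.compile(
    r"""
    .*        # zero or more of any character
    [^\s]     # one character that is not white space
    \s{2,}    # two or more white space characters
    [^\s]     # one character that is not white space
    .*        # zero or more of any character
    """,
    re.VERBOSE,
)

# The compact form as posted in the thread.
COMPACT = re.compile(r".*[^\s]\s{2,}[^\s].*")

if __name__ == "__main__":
    for sample in ("St. Petters Road", "St. Pet  ters Road"):
        # Both forms should agree on every input.
        print(repr(sample), bool(ANNOTATED.search(sample)), bool(COMPACT.search(sample)))
```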

Thank you Takashi for the reply,

Maybe we can also do this task using a Tester:

Address (Field Name) Like %__%

where the __ (underscores) stand for spaces.

takashi
Contributor
7538 replies
10 years ago
28 February 2014

Your solution is nearly equivalent to mine. Just be aware that the addresses below will also match "Like %__%", i.e. cases where there is more than one space before or after the address.

-----

<space><space>St.<space>Petters<space>Road

St.<space>Petters<space>Road<space><space>

-----
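The difference between the two tests can be sketched in Python. The Like condition is modelled here as a plain "contains two spaces" check, which assumes `%` matches any run of characters and that the two characters between the `%` signs are literal spaces; both function names are hypothetical.

```python
import re

REPEATED_INNER_SPACE = re.compile(r".*[^\s]\s{2,}[^\s].*")


def like_double_space(address: str) -> bool:
    """Rough stand-in for the Like test: two spaces anywhere."""
    return "  " in address


def regex_double_space(address: str) -> bool:
    """The regex test: two or more spaces between two words."""
    return REPEATED_INNER_SPACE.search(address) is not None


if __name__ == "__main__":
    for sample in ("  St. Petters Road", "St. Petters Road  ", "St. Petters  Road"):
        print(repr(sample), like_double_space(sample), regex_double_space(sample))
```

Only the third sample is reported by both tests; the first two are reported by the Like-style check alone.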

Exactly. Very true.

I am trying to find a solution for that, but have not succeeded yet! ):

takashi
Contributor
7538 replies
10 years ago
28 February 2014

I think the regular expression I suggested before is an appropriate solution. Is there any problem with using the regex?

takashi
Contributor
7538 replies
10 years ago
28 February 2014

Plan B :)

If you create a trimmed copy of the address beforehand, you can get the same result as the regular expression without changing the original address.

1) AttributeCreator

_trimmed = Address

2) AttributeTrimmer

Attributes to Trim: _trimmed

Trim Type: Both

3) Tester

_trimmed Like %__%
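The three transformers above can be approximated in Python as a sketch. It assumes the AttributeTrimmer removes white space from both ends, as `str.strip()` does, and that a flag of 1 means "more than one space between two words"; that encoding and the function name are illustrative only.

```python
def flag_repeated_inner_space(address: str) -> int:
    """Return 1 if the trimmed address contains two consecutive spaces."""
    # 1) and 2): work on a trimmed copy so the original value is untouched.
    trimmed = address.strip()
    # 3): stand-in for the Like test with two literal spaces.
    return 1 if "  " in trimmed else 0


if __name__ == "__main__":
    for sample in ("  St. Petters Road", "St. Petters Road  ", "St. Pet  ters Road"):
        print(repr(sample), flag_repeated_inner_space(sample))
```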
