Question

Splitting address strings

11 years ago
January 23, 2014
8 replies
248 views

si
21 replies

Hi,

Does anyone know the most efficient way to split a full address string into 4 fields? It must take different address string formats into consideration.

Split into:

A) House number (could be house name)

B) Street

C) Town

D) Postcode

Test examples of full address strings are:

139 Beech Road, LONDON, LW1 1AA

18 Sandy Lane, Oakland, NOTTINGHAM, NG1 1AA

8 London Road, Theakston, Brumington Stoke, LEEDS, LE11 1AA

Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA

david_r
8355 replies
11 years ago
January 23, 2014

Hi,

Have a look at the responses here.

David

si
Author
21 replies
11 years ago
January 23, 2014

Thanks for the help.

So far, AttributeCreator combined with the String Functions (TrimLeft / TrimRight etc.) seems to be helping a bit.

But I am now stuck on applying conditions to the first character (alpha or numeric).

philippeb
Enthusiast
308 replies
11 years ago
January 24, 2014

I agree; I would also like to know how to differentiate between letters and digits.

FME Lover

si
Author
21 replies
11 years ago
January 24, 2014

It might need a bit of tweaking, but I've pretty much done it with AttributeSplitter, AttributeCreators and AttributeClassifier.

To differentiate between digits and letters, I created a new attribute using the LEFT function in AttributeCreator to get the first character, and then used AttributeClassifier on this attribute to do a digit test.
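For readers working outside FME, the same first-character digit test can be sketched in Python. This is only an illustration of the idea, not the FME workflow itself; the function name `starts_with_digit` is made up for the example.

```python
def starts_with_digit(address: str) -> bool:
    """Return True when the first non-blank character is a digit."""
    stripped = address.lstrip()
    # An empty string has no first character, so it is treated as "not a digit".
    return bool(stripped) and stripped[0].isdigit()


samples = [
    "139 Beech Road, LONDON, LW1 1AA",
    "Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA",
]
for text in samples:
    kind = "house number" if starts_with_digit(text) else "house name"
    print(kind, "->", text)
```

Note that this test alone does not separate "2A" or "12-14" style numbers from the street name; it only decides which branch an address should take.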

gio
Contributor
2252 replies
11 years ago
January 24, 2014

^(\d*|[\da-zA-Z\s]*,)\s([a-zA-Z\s]*|[a-zA-Z\s]*,\s[a-zA-Z\s]*),\s([A-Z\s]*),\s([A-Z0-9\s]*)$

takes care of the first three examples.

You then need to expose matched parts 0-4 (5 reports or "captures") if you use the StringSearcher transformer.

If you use the AttributeCreator you can do

[regexp -inline {^(\d*|[\da-zA-Z\s]*,)\s([a-zA-Z\s]*|[a-zA-Z\s]*,\s[a-zA-Z\s]*),\s([A-Z\s]*),\s([A-Z0-9\s]*)$} $yourvariable] ..etc. to capture the parts.

Then you could use a ListIndexer to grab every captured part (indexes 0,2 etc.)

Or use a TclCaller and create FME attributes from the captures.

You can make similar expressions for the first 3 examples.

The fourth example is all letters and a zip. You can get the parts from the expression I made.

\d = digits, \w = word characters, \D = non-digits, \W = non-word characters, etc.

There are lots of very good Tcl sites around.
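To see what the expression above captures, here is a small Python sketch that applies it to the four test addresses from the question. Python's `re` module is used here only as a stand-in for the FME transformers and Tcl, and `capture_parts` is a hypothetical helper name.

```python
import re

# The expression from this post, split across lines for readability.
PATTERN = re.compile(
    r"^(\d*|[\da-zA-Z\s]*,)\s"
    r"([a-zA-Z\s]*|[a-zA-Z\s]*,\s[a-zA-Z\s]*),\s"
    r"([A-Z\s]*),\s"
    r"([A-Z0-9\s]*)$"
)


def capture_parts(address):
    """Return the four captured groups as a tuple, or None when there is no match."""
    match = PATTERN.match(address)
    return match.groups() if match else None


examples = [
    "139 Beech Road, LONDON, LW1 1AA",
    "18 Sandy Lane, Oakland, NOTTINGHAM, NG1 1AA",
    "8 London Road, Theakston, Brumington Stoke, LEEDS, LE11 1AA",
    "Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA",
]
for address in examples:
    print(capture_parts(address))
```

In this sketch the first two examples split into number, street (plus locality), town and postcode. The third example matches too, but its first group comes back as "8 London Road," (number and street together, with the trailing comma), so it would still need post-processing. The fourth example does not match the pattern as written.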

ebygomm
Influencer
3313 replies
11 years ago
January 24, 2014

Assuming these are UK addresses, you need to make sure you can also deal with the following.

Suffixes

2A Marhill Road, Carlton NOTTINGHAM NG4 3AH

Secondary Addressable Objects

2 Sandpiper House, Marhill Road, Carlton, NOTTINGHAM NG4 3AJ

Number ranges

12-14 Station Street

takashi
7715 replies
11 years ago
January 25, 2014

Hi,

I don't know the exact rules for address representation, but judging from your examples, the following rules seem to apply:

1) Address elements are separated by commas.
2) The first element is either "house number (digits) <space> street name" or "house name" (starting with a non-digit).
3) Only if the first element is "house name" is the second element "street name".
4) The last element is always "post code".
5) "town name" consists of the remaining one or more element(s).

If that's correct, these steps might help you.

1) Determine with a Tester whether the first element is "house number <space> street name" or "house name".
address Matches Regex ^[0-9]+\s.+$

2) If the string starts with digits (i.e. a house number), insert a comma between "house number" and "street name" using a StringReplacer. Otherwise do nothing.
Text to Find: ^([0-9]+)(.+)$
Replacement Text: \1,\2
Use Regular Expressions: yes

3) Move the last element (i.e. the post code) to the head of the string using another StringReplacer.
Text to Find: ^(.+),([^,]+)$
Replacement Text: \2,\1
Use Regular Expressions: yes

4) Split the string into 4 elements with a StringSearcher.
Regular Expression: ^([^,]+),([^,]+),([^,]+),(.+)$

Every output feature will have a list attribute (named "_matched_parts" by default) which contains these elements.
_matched_parts{0} = post code
_matched_parts{1} = house number or house name
_matched_parts{2} = street name
_matched_parts{3} = town name

Then rename and trim them, if necessary.
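The four steps can be tried outside FME with the following Python sketch. It reuses the same regular expressions with Python's `re` module in place of the Tester, StringReplacer and StringSearcher; the function name `split_address` and the dictionary keys are made up for this example.

```python
import re


def split_address(address):
    """Apply the four steps to one address string; return None if step 4 fails."""
    # Steps 1 and 2: if the string starts with a house number followed by a
    # space, insert a comma between the number and the street name.
    if re.match(r"^[0-9]+\s.+$", address):
        address = re.sub(r"^([0-9]+)(.+)$", r"\1,\2", address)

    # Step 3: move the last element (the post code) to the head of the string.
    address = re.sub(r"^(.+),([^,]+)$", r"\2,\1", address)

    # Step 4: split into post code, house, street and everything else (town).
    match = re.match(r"^([^,]+),([^,]+),([^,]+),(.+)$", address)
    if match is None:
        return None
    postcode, house, street, town = (part.strip() for part in match.groups())
    return {"house": house, "street": street, "town": town, "postcode": postcode}


print(split_address("139 Beech Road, LONDON, LW1 1AA"))
print(split_address("Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA"))
```

The `strip()` call plays the role of the final "trim" step, since the comma-separated pieces keep their leading spaces.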

Takashi

takashi
7715 replies
11 years ago
January 25, 2014

If you use this regular expression in the StringSearcher (4th step), the 2nd StringReplacer (3rd step) can be removed.
^([^,]+),([^,]+),(.+),([^,]+)$

In that case, the elements of the _matched_parts list will be:
_matched_parts{0} = house number or house name
_matched_parts{1} = street name
_matched_parts{2} = town name
_matched_parts{3} = post code

Anyway, I think the point is how to determine whether the first element is "house number + street name".
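A minimal Python sketch of this shorter variant, with the "does it start with a house number" check made explicit, might look like this. The names `STARTS_WITH_NUMBER`, `FOUR_PARTS` and `split_in_order` are hypothetical, and Python's `re` module stands in for the FME transformers.

```python
import re

# Same test as the Tester: digits, then whitespace, then anything.
STARTS_WITH_NUMBER = re.compile(r"^[0-9]+\s.+$")
# The reordered expression: house, street, town (greedy), post code.
FOUR_PARTS = re.compile(r"^([^,]+),([^,]+),(.+),([^,]+)$")


def split_in_order(address):
    """Return [house, street, town, postcode], or None when the split fails."""
    if STARTS_WITH_NUMBER.match(address):
        # Insert a comma between the house number and the street name.
        address = re.sub(r"^([0-9]+)(.+)$", r"\1,\2", address)
    match = FOUR_PARTS.match(address)
    if match is None:
        return None
    return [part.strip() for part in match.groups()]


print(split_in_order("18 Sandy Lane, Oakland, NOTTINGHAM, NG1 1AA"))
```

Because the third group is greedy and the last group cannot contain a comma, every element between the street and the post code ends up in the town field.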

If the house number always consists of digits only (and the house name never starts with a digit), it's easy.

However, if there are some exceptional conditions as EGomm mentioned, you will have to modify the first regular expression. How to do that depends on the actual data.
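As one possible illustration of such a modification, the Python sketch below widens the house number pattern to accept a single letter suffix ("2A") and a simple range ("12-14"). The pattern and the name `insert_comma_after_number` are assumptions made for this example, not a tested rule for real UK address data. In particular, it cannot tell that "2 Sandpiper House" is a secondary addressable object rather than a number plus street.

```python
import re

# Assumed pattern: digits, optional letter suffix, optional "-digits[letter]" range,
# then whitespace and the rest of the string.
HOUSE_NUMBER = re.compile(r"^(\d+[A-Za-z]?(?:-\d+[A-Za-z]?)?)\s+(.+)$")


def insert_comma_after_number(address):
    """Insert a comma after a leading house number; leave other strings unchanged."""
    match = HOUSE_NUMBER.match(address)
    if match is None:
        return address
    return f"{match.group(1)},{match.group(2)}"


for text in ["2A Marhill Road, Carlton NOTTINGHAM NG4 3AH", "12-14 Station Street"]:
    print(insert_comma_after_number(text))
```

The result can then be passed to the comma-based split from the earlier steps, provided the rest of the string really is comma-separated.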
