Question

Splitting address strings

  • 23 January 2014
  • 8 replies

Badge
Hi,

 

 

Does anyone know the most efficient way to split a full address string into 4 fields? It must take different address string formats into account.

 

 

Split into:

 

 

A) House number (could be house name)

 

B) Street

 

C) Town

 

D) Postcode

 

 

Test examples of full address strings are: 

 

139 Beech Road, LONDON, LW1 1AA

 

18 Sandy Lane, Oakland, NOTTINGHAM, NG1 1AA

 

8 London Road, Theakston, Brumington Stoke, LEEDS, LE11 1AA

 

Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA

 


8 replies

Userlevel 4
Hi,

 

 

Have a look at the responses here.

 

 

David
Badge
Thanks for the help.

 

 

So far, AttributeCreator combined with String Functions (TrimLeft / TrimRight etc.) seems to be helping a bit.

 

 

But I am now stuck on applying conditions to the first character (alpha or numeric).
Badge +2
I agree, I would also like to know how to distinguish letters from digits.
Badge
Might need a bit of tweaking but I've pretty much done it with AttributeSplitter, AttributeCreators and AttributeClassifier.

 

 

To differentiate between numbers and characters, I created a new attribute using the LEFT function in an AttributeCreator to get the first character, and then I used an AttributeClassifier on this attribute to do a digit test.
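Outside FME, the same first-character test can be sketched in a few lines of Python. This is only an illustration of the idea, not the AttributeClassifier itself; the function name `starts_with_digit` is made up for the example.

```python
def starts_with_digit(address: str) -> bool:
    """Return True when the first non-blank character is an ASCII digit."""
    # Equivalent of taking LEFT(address, 1); empty if the address is blank.
    first = address.lstrip()[:1]
    return first.isascii() and first.isdigit()


numbered = "139 Beech Road, LONDON, LW1 1AA"
named = "Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA"
print(starts_with_digit(numbered), starts_with_digit(named))
```

The first address starts with a house number and the second with a house name, so the two calls give different answers.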
Badge +3
^(\\d*|[\\da-zA-Z\\s]*,)\\s([a-zA-Z\\s]*|[a-zA-Z\\s]*,\\s[a-zA-Z\\s]*),\\s([A-Z\\s]*),\\s([A-Z0-9\\s]*)$

 

takes care of the first three examples.
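For anyone who wants to try the expression outside FME, here is a small Python check against the four test addresses from the question. It is only a sketch: the pattern is written with single backslashes in a raw string, Python's regex engine is not the one FME uses, and `split_address` is a name invented for the example.

```python
import re

# The expression from this post, split over several lines for readability.
PATTERN = re.compile(
    r"^(\d*|[\da-zA-Z\s]*,)\s"
    r"([a-zA-Z\s]*|[a-zA-Z\s]*,\s[a-zA-Z\s]*),\s"
    r"([A-Z\s]*),\s"
    r"([A-Z0-9\s]*)$"
)


def split_address(address):
    """Return the four captured groups, or None when the pattern fails."""
    match = PATTERN.match(address)
    return match.groups() if match else None


examples = [
    "139 Beech Road, LONDON, LW1 1AA",
    "18 Sandy Lane, Oakland, NOTTINGHAM, NG1 1AA",
    "8 London Road, Theakston, Brumington Stoke, LEEDS, LE11 1AA",
    "Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA",
]
for example in examples:
    print(split_address(example))
```

In Python the first three addresses match and the fourth does not, because it has more commas than the pattern allows. Note that for the third address the first group comes back as "8 London Road," (number and street together), so some extra splitting would still be needed there.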

 

You then need to expose matched parts 0-4 (5 reports, or "captures") if you use the StringSearcher transformer.

 

 

If you use an AttributeCreator you can do

 

[regexp -inline {^(\\d*|[\\da-zA-Z\\s]*,)\\s([a-zA-Z\\s]*|[a-zA-Z\\s]*,\\s[a-zA-Z\\s]*),\\s([A-Z\\s]*),\\s([A-Z0-9\\s]*)$} $yourvariable] etc. to capture the parts.

 

Then you could use a ListIndexer to grab each captured part (indexes 0, 2, etc.).

 

 

Or use a TclCaller and create FME attributes from the captures.

 

 

 

You can make similar expressions for

 

 

for the first 3 examples

 

 

The fourth example is all letters and a zip. You can get the parts from the expression I made.

 

 

\\d = digits, \\w = word characters, \\D = non-digits, \\W = non-word characters, etc.

 

There are lots of very good Tcl sites around.
Userlevel 1
Badge +10
Assuming these are UK addresses

 

 

You need to make sure you can also deal with suffixes

 

 

2A Marhill Road, Carlton NOTTINGHAM NG4 3AH

 

 

Secondary Addressable Objects

 

 

2 Sandpiper House, Marhill Road, Carlton, NOTTINGHAM NG4 3AJ

 

 

Number ranges

 

 

12-14 Station Street
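As a rough illustration of how a house-number test could be widened for these cases, here is a Python sketch. The pattern is an assumption made for this example (digits, an optional single-letter suffix, an optional range) and is not a complete rule for UK addresses; `split_number` is a hypothetical helper name.

```python
import re

# Assumed shape of a house number: digits, optional one-letter suffix (2A),
# optionally followed by a range part (12-14).
HOUSE_NUMBER = re.compile(r"^(\d+[A-Za-z]?(?:\s*-\s*\d+[A-Za-z]?)?)\s+(.+)$")


def split_number(first_element):
    """Split '<number> <rest>' into (number, rest); number is None for names."""
    text = first_element.strip()
    match = HOUSE_NUMBER.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text


for element in ["2A Marhill Road", "12-14 Station Street",
                "2 Sandpiper House", "Beech Tree Farm"]:
    print(split_number(element))
```

A pattern like this cannot tell that "2 Sandpiper House" is a secondary addressable object rather than a number plus street. That case would probably need the second comma-separated element to be inspected as well.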

 

 

 

 

Userlevel 2
Badge +17
Hi,

 

 

I don't know the exact rules for writing addresses, but judging from your examples there seem to be these rules:
1) Address elements are separated by commas.
2) The first element is either "house number (digits) <space> street name" or "house name" (starting with a non-digit).
3) Only if the first element is "house name" is the second element "street name".
4) The last element is always "post code".
5) "town name" consists of the remaining one or more element(s).

If that's correct, these steps might help you.

1) Determine whether the first element is "house number <space> street name" or "house name" with a Tester.
address  Matches Regex  ^[0-9]+\\s.+$

2) If the string starts with digits (i.e. house number), insert a comma between "house number" and "street name" using a StringReplacer. Otherwise do nothing.
Text to Find: ^([0-9]+)(.+)$
Replacement Text: \\1,\\2
Use Regular Expressions: yes

3) Move the last element (i.e. post code) to the head of the string using another StringReplacer.
Text to Find: ^(.+),([^,]+)$
Replacement Text: \\2,\\1
Use Regular Expressions: yes

4) Split the string into 4 elements with a StringSearcher.
Regular Expression: ^([^,]+),([^,]+),([^,]+),(.+)$

Every output feature will have a list attribute (named "_matched_parts" by default) which contains these elements.
_matched_parts{0} = post code
_matched_parts{1} = house number or house name
_matched_parts{2} = street name
_matched_parts{3} = town name

Then rename and trim them, if necessary.
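The four steps above can also be tried outside FME. The following is a Python sketch of the same sequence, using Python's `re` module in place of the Tester, StringReplacer and StringSearcher. The function name and the returned dictionary keys are made up for the example.

```python
import re


def split_address(address):
    """Follow the four steps: test, insert comma, move post code, split."""
    # Step 1: does the first element look like "<digits> <street name>"?
    if re.match(r"^[0-9]+\s.+$", address):
        # Step 2: insert a comma between house number and street name.
        address = re.sub(r"^([0-9]+)(.+)$", r"\1,\2", address)
    # Step 3: move the last element (post code) to the head of the string.
    address = re.sub(r"^(.+),([^,]+)$", r"\2,\1", address)
    # Step 4: split into four parts; the last group takes everything left.
    match = re.match(r"^([^,]+),([^,]+),([^,]+),(.+)$", address)
    if match is None:
        return None
    postcode, house, street, town = (part.strip() for part in match.groups())
    return {"house": house, "street": street, "town": town, "postcode": postcode}


print(split_address("139 Beech Road, LONDON, LW1 1AA"))
print(split_address(
    "Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA"))
```

With these rules, everything between the street and the post code ends up in the town field, so the second call returns "Millerton, Keyton, BRISTOL" as the town.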

 

Takashi
Userlevel 2
Badge +17
If you use this regular expression in the StringSearcher (4th step), the 2nd StringReplacer (3rd step) can be removed.
^([^,]+),([^,]+),(.+),([^,]+)$

In that case, the elements of the _matched_parts list will be:
_matched_parts{0} = house number or house name
_matched_parts{1} = street name
_matched_parts{2} = town name
_matched_parts{3} = post code

Anyway, I think the point is how to determine whether the first element is "house number + street name".
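To see how the single expression behaves, here is a short Python sketch. It assumes the comma between house number and street name has already been inserted (step 2 of the earlier post); `split_prepared` is a hypothetical name.

```python
import re

# Same expression as above: the greedy third group absorbs all town elements.
FOUR_PARTS = re.compile(r"^([^,]+),([^,]+),(.+),([^,]+)$")


def split_prepared(address):
    """Return [house, street, town, postcode], or None if there is no match."""
    match = FOUR_PARTS.match(address)
    return [part.strip() for part in match.groups()] if match else None


print(split_prepared("139, Beech Road, LONDON, LW1 1AA"))
print(split_prepared(
    "Beech Tree Farm, Old Dove Lane, Millerton, Keyton, BRISTOL, BR12 1AA"))
```

Because the last group cannot contain a comma, the post code is always the final element and the parts come out in their natural order.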

 

If the house number always consists of digits only (and the house name never starts with a digit), it's easy.

 

However, if there are exceptional cases like the ones EGomm mentioned, you will have to modify the first regular expression. How to do that depends on the actual data.
