Hi left65,
You can concatenate the various attributes, using an AttributeCreator (but maybe there are other solutions), in an new attribute containing the full address string.
Then I think you will have to deal with regular expressions and construct one or more of them based on rules derived from the data themself (this may be quite difficult, depending on the string composing the address). With these regular expressions you can setup one ore more chain of:
- StringReplacer (in mode 'Replace Regular Expression'), to split the string in "tokens" using a charachter of your choice (say '|', for example) ;
- AttributeSplitter to build a list of the "tokens";
- AttributeCreator to create new attributes from the list elements.
This online regex tester and debugger can help you to develop the regular expressions.
Hope that helps!
Hi,
You could enter a dummy-value for all the empty spaces using a stringreplacer and/or testing for empty strings and further handle the attributes in a list . That way you could sort through them and parse them differently depending if you want to split or trim.
@left65 It's fairly hard to parse and standardize addresses, as you've already found with your "very convoluted sql script". The equivalent in FME would be to either use that same script in the InlineQuerier or do the equivalent in regular expressions (StringSearcher).
You might try to use a geocoder. These sometimes return the standardized address, and separate address fields, along with the address location. You can also get back the JSON response that might be easier to parse.
thanks for the responses. i shall let you know how i go
Having done Address parsing a few times, all I will say is: RegEx is your friend! Addresses have so many different "rules" in the way they are entered, then RegEx is a way of distilling each of the different string patterns that can exist.
I've typically standardised Addresses with StringSearcher and StringReplacer using RegEx, which was far less tedious than CASE or LIKE statements in SQL.
Having done Address parsing a few times, all I will say is: RegEx is your friend! Addresses have so many different "rules" in the way they are entered, then RegEx is a way of distilling each of the different string patterns that can exist.
I've typically standardised Addresses with StringSearcher and StringReplacer using RegEx, which was far less tedious than CASE or LIKE statements in SQL.
Hi there bwn,
I have the same question as left65. Do you have an example of using RegEx to do this?
Hi there bwn,
I have the same question as left65. Do you have an example of using RegEx to do this?
The best tool to use to learn/practice is an online RegEx tool like (but there are others just as good)
https://regex101.com/
So take something simple like 154 Green Road
The string pattern here is Road Number{1 or more digits}-{space character}-Road Name{1 or more characters}-{space character}-Road Type{1 or more characters}
In RegEx, the pattern is therefore
\d+\s.+\s.+
\d is a single number character
+is 1 or more of the preceding type of pattern Eg. \d+ is 1 or more number characters
\s is a whitesapce
. is any character
.+ is 1 or more of any character
So now though we might want to extract the character "Groups". We can group things using brackets (). So let's group the Road Number, Road Name and Road Type
(\d+)\s\(.+)\s(.+)
In the order that the brackets appear Road Number is the first group, Road Name is the second group and Road Type is the third group.
In something like a StringReplacer or AttributeCreator, we can call back the value of that substring "group" by setting an Attribute to a value of "\1" for the first group of characters, "\2" for the second group of characters etc. In a RegEx enabled FME Transformer, it will extract each character group and await the user to call it back using Eg. "\1"
So take something complex like
54/154A Green Yellow Road E
This is now where it gets somewhat trial and error to come up with a general purpose RegEx pattern
So this is where there is optionally a Unit Number, a multiple word Road Name and an optional East, South, West, North Suffix to the Road Type
I usually like to break these more complex strings down piece by piece. So first lets separate the Road Number from the rest of the string
^(?:\d*\d\/){0,1}(\d+\w?)\s+(.*)$
The first bracketed expression: (?:\d*\d\/){0,1}
We are looking for 0 to 1 instances ie. "({0,1}" of Zero of More Digit Characters "\d*", followed by 1 Digit Characters "\d", followed by a / separator character "\/"
So this filters out any Unit numbers and its separator character "/" from the start.
The next group (\d+\w?) we are looking for a pattern of 1 or more Digits, followed by zero or 1 Word Characters. So Eg. "15" matches this pattern of having 1 or more digits, followed by 0 or 1 word characters, as does Eg. "154A". This is Group 1 and we can set this in StringReplacer or the same function in AttributeCreator as "\1" to return the Road Number
In this is example I put the residual text of into Group 2 so that we can split this off for the next RegEx to separate out the Road Type and Road Suffix, now that we have successfully extracted the Road Number component from the string.