Skip to main content

Dear all,

I’m spending a lot of time trying to sort out an issue with a substring operation but I cannot see my results are not correct in several places.

In FME workbench 2024.0, I use the the AttributeManager to read a substring from the attribute _response_body and to write it to the attribute info. The response body is the result from a reverse geo-coding with OpenStreetMap and I want to extract the location for the given coordinates. To extract the relevant part, I use the following sequence of String operators:

1) Take everything left of </result>
2) From this, take everything right of address_rank
3) From this, take everything right of >

The actual (unfortunately complicated) condition statement which I’m using is:

@Right(@Right(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1),@StringLength(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1))-@FindString(@Value(_response_body),address_rank)),@StringLength(@Right(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1),@StringLength(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1))-@FindString(@Value(_response_body),address_rank)))-@FindString(@Right(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1),@StringLength(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1))-@FindString(@Value(_response_body),address_rank)+1),>))


This seems to work fine for a number of datasets but it gives me a few extra characters in some cases:


The syntax and structure of the source text seems to be consistent:

670:
<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:46 +00:00" attribution="Data ©️ OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=58.152879104460816&lon=8.126058832869633&amp;format=xml&amp;zoom=12"><result place_id="172781434" osm_type="node" osm_id="31264100" ref="Kristiansand" lat="58.1461500" lon="7.9957333" boundingbox="57.9861500,58.3061500,7.8357333,8.1557333" place_rank="16" address_rank="16">Kristiansand, Agder, 4611, Norway</result><addressparts><city>Kristiansand</city><municipality>Kristiansand</municipality><county>Agder</county><ISO3166-2-lvl4>NO-42</ISO3166-2-lvl4><postcode>4611</postcode><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>

671:
<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:46 +00:00" attribution="Data ©️ OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=58.72402439827633&amp;lon=9.257224322307422&amp;format=xml&amp;zoom=12"><result place_id="177641629" osm_type="node" osm_id="2380166198" ref="Risør" lat="58.7207522" lon="9.2343838" boundingbox="58.6807522,58.7607522,9.1943838,9.2743838" place_rank="18" address_rank="16">Risør, Agder, 4956, Norway</result><addressparts><town>Risør</town><municipality>Risør</municipality><county>Agder</county><ISO3166-2-lvl4>NO-42</ISO3166-2-lvl4><postcode>4956</postcode><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>


672:
<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:47 +00:00" attribution="Data ©️ OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=60.51981765734847&amp;lon=4.907608675437386&amp;format=xml&amp;zoom=12"><result place_id="173934469" osm_type="relation" osm_id="10150629" ref="4626" lat="60.4267846" lon="4.7265289" boundingbox="60.1040492,60.7409047,4.2661374,5.2380227" place_rank="14" address_rank="14">Øygarden, Vestland, Norway</result><addressparts><municipality>Øygarden</municipality><county>Vestland</county><ISO3166-2-lvl4>NO-46</ISO3166-2-lvl4><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>

I attach the source data in an Excel file and the relevant part of my workbench.
It would be wonderful if somebody could help me understand why my substring operation fails is numerous cases. Is there a better way to extract the relevant data from the result string?

Thanks
Axel
 

There is something wrong with the XML for 670, this should read

<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:46 +00:00" attribution="Data ©️ OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=58.152879104460816&amp;lon=8.126058832869633&amp;format=xml&amp;zoom=12"><result place_id="172781434" osm_type="node" osm_id="31264100" ref="Kristiansand" lat="58.1461500" lon="7.9957333" boundingbox="57.9861500,58.3061500,7.8357333,8.1557333" place_rank="16" address_rank="16">Kristiansand, Agder, 4611, Norway</result><addressparts><city>Kristiansand</city><municipality>Kristiansand</municipality><county>Agder</county><ISO3166-2-lvl4>NO-42</ISO3166-2-lvl4><postcode>4611</postcode><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>

 

 


You can get the data you want with an XMLFlattener.

Afterwards you can combine the individual attributes.


Oh, and the problem in your complex statement is caused by the Ø and ø in the text. These probably make the @StringLength go awry.


But of course this really asks for a solution with a regular expression using subexpressions in a StringSearcher.

And the best part: this even works with the erroneous XML for 670.


Wow! Thank you very much. This was really helpful! 
In fact, I did not consider the StringSearcher, but it is actually a really helpful transformer, and thanks to you, I now know how it works...  😃
 

Thank you so much!


I think I should elaborate a bit on the regular expression I used:

(address_rank=\"\d+\"\>)(.+)(</result>)

The parentheses () indicate subexpressions (or capturing groups in RexEx lingo). The results of the subexpressions are added to the list named after Subexpression Matches List Name.

First subexpression address_rank=\"\d+\"\> : find the literal text address_rank, followed by an equal sign, a double quote, one or more digits, another double quote, and a closing angle bracket.

Second subexpression .+ : One or more characters (letter, digit, whitespace, almost everything) 

Third subexpression </result> : the literal text </result>

 

This way the second subexpression contains everything that is between the > after address_rank and </result>.


Reply