Skip to main content
Question

Substring operation incorrect in several cases


axcel.fme
Contributor
Forum|alt.badge.img+3

Dear all,

I’m spending a lot of time trying to sort out an issue with a substring operation but I cannot see my results are not correct in several places.

In FME workbench 2024.0, I use the the AttributeManager to read a substring from the attribute _response_body and to write it to the attribute info. The response body is the result from a reverse geo-coding with OpenStreetMap and I want to extract the location for the given coordinates. To extract the relevant part, I use the following sequence of String operators:

1) Take everything left of </result>
2) From this, take everything right of address_rank
3) From this, take everything right of >

The actual (unfortunately complicated) condition statement which I’m using is:

@Right(@Right(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1),@StringLength(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1))-@FindString(@Value(_response_body),address_rank)),@StringLength(@Right(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1),@StringLength(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1))-@FindString(@Value(_response_body),address_rank)))-@FindString(@Right(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1),@StringLength(@Left(@Value(_response_body),@FindString(@Value(_response_body),</result>)-1))-@FindString(@Value(_response_body),address_rank)+1),>))


This seems to work fine for a number of datasets but it gives me a few extra characters in some cases:


The syntax and structure of the source text seems to be consistent:

670:
<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:46 +00:00" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=58.152879104460816&lon=8.126058832869633&amp;format=xml&amp;zoom=12"><result place_id="172781434" osm_type="node" osm_id="31264100" ref="Kristiansand" lat="58.1461500" lon="7.9957333" boundingbox="57.9861500,58.3061500,7.8357333,8.1557333" place_rank="16" address_rank="16">Kristiansand, Agder, 4611, Norway</result><addressparts><city>Kristiansand</city><municipality>Kristiansand</municipality><county>Agder</county><ISO3166-2-lvl4>NO-42</ISO3166-2-lvl4><postcode>4611</postcode><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>

671:
<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:46 +00:00" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=58.72402439827633&amp;lon=9.257224322307422&amp;format=xml&amp;zoom=12"><result place_id="177641629" osm_type="node" osm_id="2380166198" ref="Risør" lat="58.7207522" lon="9.2343838" boundingbox="58.6807522,58.7607522,9.1943838,9.2743838" place_rank="18" address_rank="16">Risør, Agder, 4956, Norway</result><addressparts><town>Risør</town><municipality>Risør</municipality><county>Agder</county><ISO3166-2-lvl4>NO-42</ISO3166-2-lvl4><postcode>4956</postcode><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>


672:
<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:47 +00:00" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=60.51981765734847&amp;lon=4.907608675437386&amp;format=xml&amp;zoom=12"><result place_id="173934469" osm_type="relation" osm_id="10150629" ref="4626" lat="60.4267846" lon="4.7265289" boundingbox="60.1040492,60.7409047,4.2661374,5.2380227" place_rank="14" address_rank="14">Øygarden, Vestland, Norway</result><addressparts><municipality>Øygarden</municipality><county>Vestland</county><ISO3166-2-lvl4>NO-46</ISO3166-2-lvl4><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>

I attach the source data in an Excel file and the relevant part of my workbench.
It would be wonderful if somebody could help me understand why my substring operation fails is numerous cases. Is there a better way to extract the relevant data from the result string?

Thanks
Axel
 

6 replies

geomancer
Evangelist
Forum|alt.badge.img+50
  • Evangelist
  • August 9, 2024

There is something wrong with the XML for 670, this should read

<?xml version="1.0" encoding="UTF-8" ?>
<reversegeocode timestamp="Thu, 08 Aug 2024 09:13:46 +00:00" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright" querystring="lat=58.152879104460816&amp;lon=8.126058832869633&amp;format=xml&amp;zoom=12"><result place_id="172781434" osm_type="node" osm_id="31264100" ref="Kristiansand" lat="58.1461500" lon="7.9957333" boundingbox="57.9861500,58.3061500,7.8357333,8.1557333" place_rank="16" address_rank="16">Kristiansand, Agder, 4611, Norway</result><addressparts><city>Kristiansand</city><municipality>Kristiansand</municipality><county>Agder</county><ISO3166-2-lvl4>NO-42</ISO3166-2-lvl4><postcode>4611</postcode><country>Norway</country><country_code>no</country_code></addressparts></reversegeocode>

 

 


geomancer
Evangelist
Forum|alt.badge.img+50
  • Evangelist
  • August 9, 2024

You can get the data you want with an XMLFlattener.

Afterwards you can combine the individual attributes.


geomancer
Evangelist
Forum|alt.badge.img+50
  • Evangelist
  • August 9, 2024

Oh, and the problem in your complex statement is caused by the Ø and ø in the text. These probably make the @StringLength go awry.


geomancer
Evangelist
Forum|alt.badge.img+50
  • Evangelist
  • August 9, 2024

But of course this really asks for a solution with a regular expression using subexpressions in a StringSearcher.

And the best part: this even works with the erroneous XML for 670.


axcel.fme
Contributor
Forum|alt.badge.img+3
  • Author
  • Contributor
  • August 9, 2024

Wow! Thank you very much. This was really helpful! 
In fact, I did not consider the StringSearcher, but it is actually a really helpful transformer, and thanks to you, I now know how it works...  😃
 

Thank you so much!


geomancer
Evangelist
Forum|alt.badge.img+50
  • Evangelist
  • September 2, 2024

I think I should elaborate a bit on the regular expression I used:

(address_rank=\"\d+\"\>)(.+)(</result>)

The parentheses () indicate subexpressions (or capturing groups in RexEx lingo). The results of the subexpressions are added to the list named after Subexpression Matches List Name.

First subexpression address_rank=\"\d+\"\> : find the literal text address_rank, followed by an equal sign, a double quote, one or more digits, another double quote, and a closing angle bracket.

Second subexpression .+ : One or more characters (letter, digit, whitespace, almost everything) 

Third subexpression </result> : the literal text </result>

 

This way the second subexpression contains everything that is between the > after address_rank and </result>.


Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

 
Cookie settings