Extract XML from URL


Hi,

I'm using an HTTPCaller to connect to an XML API service.

 

Although I can connect and receive XML I can't create attributes from the file. I'm new to XML and don't know how to do this.

 

I tried using an XML reader and successfully extracted the attributes, but then FME 2019.2 keeps crashing after reading the file, or reads the file and then doesn't show the reader in the workbench, so I'm hoping this will be a more robust method.

 

I've tried using an XMLFragmenter to get the attributes, but they aren't showing. I assume I'm missing a step somewhere.

 

testxml.txt

 

I've attached a test file.

Thanks

N

 

 

Best answer by nielsgerrits 29 April 2020, 10:56

12 replies

First of all, there's an invalid character on lines 59456 and 59473, e.g.:

<notes>IPID - 10762190
SS - 42109, 25614

Actuator Max 30m ? Min 15m</notes>

You might want to fix those first, e.g. with a StringReplacer.
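If you want to locate such characters outside FME first, here is a minimal Python sketch. It assumes the problem characters are ones that XML 1.0 does not allow (mostly control characters). The function names are made up for illustration, and if the real cause is an encoding mismatch, reading the file with the right encoding is the better fix.

```python
import re

# Anything outside the XML 1.0 "Char" production: tab, LF, CR and the
# ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF are allowed.
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_chars(text):
    """Return (line, column, character) for every disallowed character."""
    hits = []
    # split("\n") rather than splitlines(), because splitlines() would
    # swallow some of the very control characters we are looking for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


def replace_invalid_chars(text, replacement="?"):
    """Swap every disallowed character for a harmless placeholder."""
    return INVALID_XML_CHARS.sub(replacement, text)


# Hypothetical sample with a control character (U+001A) in the notes.
sample = "<notes>ok</notes>\n<notes>Max 30m \x1a Min 15m</notes>"
print(find_invalid_chars(sample))
print(replace_invalid_chars(sample))
```

The line and column numbers are 1-based, so they should line up with what a text editor shows.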

You can then read the XML like this (or similar using the FeatureReader):

0684Q00000ArK1ZQAV.png

Result:

0684Q00000ArK9qQAF.png

When I try to read it using a FeatureReader I get:

When I check the file in Notepad++ around line 59455, I see some newlines, probably from the source data field SiteNotes.

This corrupts the XML structure, which is causing your issue.

You can also check the XML structure using only FME:

  • Read the XML as a text file and set the reader to read the whole file at once (entire text in one feature). Make sure you read it with the correct encoding (UTF-8).
  • Connect an XMLValidator. The validator will stop at the first error.
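For a quick check outside FME, the same idea can be sketched in Python with the standard library. This is only a rough equivalent under my own assumptions: it tests well-formedness, not validity against a schema, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_data):
    """Return None if the XML parses, else (line, column, message).

    Like the validator step above, parsing stops at the first error.
    Accepts str or bytes; bytes let the parser honour the encoding
    declared in the XML prolog.
    """
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


print(first_xml_error("<loggers><logger/></loggers>"))
print(first_xml_error("<loggers><logger></loggers>"))
```

To check a file, read it in binary mode and pass the bytes in, for example `first_xml_error(open("testxml.txt", "rb").read())`.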

Tip: Create two small samples of the file for faster iteration while working: one corrupt file and one good file. When it works, just switch to the original big file.

Use the good file to generate the paths from the structure you need with a FeatureReader.

So you need to replace or remove the incorrect newlines in SiteNotes. One way to do this:

  • Remove all correct newlines (not the incorrect ones): StringReplacer, Replace Regular Expression, replace >\n by >.
  • Replace all remaining newlines with another, unused character, for example a pipe. This way you can restore the newlines in SiteNotes in a later phase of the process: StringReplacer, Replace Regular Expression, replace \n by |.
  • Connect an XMLValidator to check whether this was the only error.
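The two replacement steps above can be sketched in Python as follows. This is an illustration, not the contents of the workspace: it assumes a "correct" newline is one that directly follows a closing `>`, and it also tolerates Windows line endings, which the steps above don't mention.

```python
import re


def repair_newlines(xml_text, placeholder="|"):
    """Strip structural newlines, then mask the ones left inside text."""
    # Step 1: remove newlines that directly follow '>' (between tags).
    text = re.sub(r">\r?\n", ">", xml_text)
    # Step 2: any newline still present sits inside element text;
    # replace it with a placeholder that can be restored later.
    return re.sub(r"\r?\n", placeholder, text)


# Hypothetical sample modelled on the notes element in the thread.
sample = (
    "<loggers>\n"
    "<logger>\n"
    "<notes>IPID - 1\nSS - 2\n\nMax</notes>\n"
    "</logger>\n"
    "</loggers>\n"
)
print(repair_newlines(sample))
```

One limitation: a newline that starts an element's text (right after the opening tag) looks structural to step 1 and is dropped rather than masked. Pick a placeholder that does not occur anywhere in the data.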

Now you can write the XML file or just process the corrected file.

  • XMLFragmenter. (Create a feature for each logger.) Match loggers/logger
  • XMLFlattener. (This can also be done in the XMLFragmenter, but for learning this might be easier to do step by step.) Match logger.
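To make the fragment-then-flatten idea concrete, here is a rough Python sketch. It is not how FME implements these transformers. It assumes the root element is `loggers`, and the child names in the sample (`id`, `site`, `name`) are invented for illustration.

```python
import xml.etree.ElementTree as ET


def flatten(element, prefix=""):
    """Turn leaf elements and XML attributes into {dotted.path: value}."""
    row = {}
    for name, value in element.attrib.items():
        row[f"{prefix}{name}"] = value
    for child in element:
        path = f"{prefix}{child.tag}"
        if len(child):  # the child has children of its own: recurse
            row.update(flatten(child, prefix=path + "."))
        else:
            row[path] = (child.text or "").strip()
    return row


def fragment_and_flatten(xml_data, match="logger"):
    """One flat dict per matched element, like one feature per logger."""
    root = ET.fromstring(xml_data)
    return [flatten(node) for node in root.findall(match)]


sample = """<loggers>
  <logger><id>1</id><site><name>A</name></site><notes>x|y</notes></logger>
  <logger><id>2</id><site><name>B</name></site><notes/></logger>
</loggers>"""

for row in fragment_and_flatten(sample):
    print(row)
```

Repeated sibling elements overwrite each other in this sketch, so nested repeating elements would need list handling on top of it.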

Edit: added workspace template

parsexml.fmwt

Thank you @david_r. The invalid character was created when I copied the XML into Notepad.

That's incredibly helpful - thank you so much for your detailed response!

You're welcome, but the real issue seems to have been the invalid character, as @david_r points out. Reading the file as UTF-8 text solved it; the workspace also works when you remove the StringReplacers. Facepalm.

Thank you.

 

My next question is how do I get nested XML? Using @david_r's method of reading in at the logger level, I now need to get the messages associated with each logger, but only the most recent message (the one with the highest id).

 

FME doesn't seem to expose all the message IDs when I just use <logger> as the element to match.

 

I've attached another file.

 

xml2.txt

 

 

The messages are there, but the standard behavior in FME is to either create an attribute or a list in each "logger" feature depending on the number (cardinality) of "message" objects. If there is only one "message", then the object is output as regular attributes, but if there are several "message" objects per "logger", then a list is output. This is a bit cumbersome because you will have to treat the two cases differently, unless you always have multiple messages per logger.

However, it is possible to tell FME to always use a list for a specific element. Here's an example of how to force the "messages" into a list:

0684Q00000ArJsYQAV.png

In the dialog "XML Flatten options" you will have to toggle advanced mode and type the following to specify the cardinality of the message objects:

cardinality="*/messages/message{}/+ /+"

This means that all "message" objects inside the "messages" element should be rendered as an FME list regardless of the number. You can then use a ListExploder to get all the messages per logger:

0684Q00000ArKAJQA3.png
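The "always a list" idea, plus picking the most recent message per logger as asked in the question, can be sketched in Python like this. It is an illustration outside FME under assumptions of mine: each `message` and each `logger` has an `id` child element holding an integer, and the `text` element and the function name are invented.

```python
import xml.etree.ElementTree as ET


def latest_message_per_logger(xml_data):
    """Map each logger id to the child values of its highest-id message."""
    root = ET.fromstring(xml_data)
    latest_by_logger = {}
    for logger in root.findall("logger"):
        # findall() always returns a list, whether a logger has zero,
        # one or many messages, so there is no special single-item case.
        messages = logger.findall("messages/message")
        if not messages:
            continue
        # Compare ids as integers; as strings "7" would sort above "12".
        latest = max(messages, key=lambda m: int(m.findtext("id")))
        latest_by_logger[logger.findtext("id")] = {
            child.tag: child.text for child in latest
        }
    return latest_by_logger


sample = """<loggers>
  <logger><id>1</id><messages>
    <message><id>3</id><text>a</text></message>
    <message><id>12</id><text>c</text></message>
    <message><id>7</id><text>b</text></message>
  </messages></logger>
  <logger><id>2</id><messages>
    <message><id>5</id><text>only</text></message>
  </messages></logger>
  <logger><id>3</id><messages/></logger>
</loggers>"""

print(latest_message_per_logger(sample))
```

In a workspace, the comparable step after the ListExploder would be to keep only the feature with the highest message id per logger.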

For reference, here's the relevant part of the documentation: https://docs.safe.com/fme/html/FME_Desktop_Documentation/FME_ReadersWriters/xml/structure_element.htm

That's very helpful - thank you. I may be back with more questions tho (sorry in advance)

Hi @david_r,

 

Looks like I'm going to have to parameterize my API call in order to fire a list of values into the URL string. I can't get this to work with a FeatureReader. Do I have to use an HTTPCaller and then an XMLFragmenter?

It should work just fine using the FeatureReader. Consider posting a new question (for better visibility) and also post some screenshots / relevant bits from the log. That way you'll get more eyes on your question.
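As a general illustration of firing a list of values into a URL string, here is a small Python sketch that builds one URL per value. The base URL and parameter names are placeholders, not the actual API from the question.

```python
from urllib.parse import urlencode


def build_urls(base_url, param_name, values, fixed_params=None):
    """Return one URL per value, with the query string safely encoded."""
    urls = []
    for value in values:
        params = dict(fixed_params or {})
        params[param_name] = value
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


# Hypothetical endpoint and parameters, for illustration only.
for url in build_urls(
    "https://example.com/api/loggers",
    "site",
    ["A 1", "B&2"],
    fixed_params={"format": "xml"},
):
    print(url)
```

Encoding the values matters as soon as they contain spaces or characters like `&`. In FME the comparable setup would be one feature per value, with the URL built from an attribute on that feature.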
