
FMErs I need some help!

I've been beating my head against the wall trying to figure out how to properly parse these HTML tables. I've used some of the examples on here (@takashi) as guides, but I just can't get it right.

http://webapps.rrc.state.tx.us/CMPL/publicSearchAction.do?packetSummaryId=171477&formData.methodHndlr.inputValue;=loadPacket&formData.hrefValue;=%257C1003%253Dhome%257C1005%253Dhome%257C1007%253D0&searchArgs.paramValue;=%257C0%253D04%252F01%252F2017%257C1%253D04%252F27%252F2017%257C2%253D01&pager.paramValue;=%7C1%3D1%7C2%3D100%7C3%3D221%7C4%3D0%7C5%3D3%7C6%3D10&pager.offset;=&publicUser;=

What we have

And want

 

I've tried different combinations of

  1. HTTPCaller > HTMLExtractor (but with the table CSS selector, I can't get the result to flatten without substring extraction)
  2. FeatureReader (HTML Table) (works great for the tables, but it loses the hyperlinks)

Any help would be greatly appreciated!

Hi @natehewes13, I looked at the HTML source document and found that the required data are stored in two <table> elements, which can be identified by their class names: "GroupBox1" and "DataGrid". You can therefore extract the <table> elements with the HTMLExtractor.

Unfortunately, the current HTMLExtractor (FME 2017.0) can't match class names that contain upper case characters. It's a known issue, and I hope it will be fixed in the near future.

As a workaround in the interim, change "GroupBox1" and "DataGrid" within the response body to lower case (the StringReplacer can be used), then extract the two <table> elements with the HTMLExtractor, using CSS selectors that match the lower-cased class names. You can then parse the extracted <table> elements as XML fragments.
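For anyone who wants to see the same steps outside of a workspace, here is a minimal Python sketch of the idea, not FME's implementation. It lower-cases the two class names, picks out the table with a given class, and keeps each cell's link target alongside its text. The sample HTML, the `viewPdf.do?id=1` link, and the helper names (`lowercase_class_names`, `TableRows`) are made up for illustration; only the class names come from the page discussed above.

```python
from html.parser import HTMLParser


def lowercase_class_names(html, names=("GroupBox1", "DataGrid")):
    """Mimic the StringReplacer step: lower-case each listed class name."""
    for name in names:
        html = html.replace(name, name.lower())
    return html


class TableRows(HTMLParser):
    """Collect (text, href) cells from <table> elements with a given class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.depth = 0      # > 0 while inside a matching table
        self.rows = []      # each row is a list of (text, href) tuples
        self._cell = None   # [text parts, href] for the cell being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            if self.depth or self.table_class in classes:
                self.depth += 1
        elif not self.depth:
            return
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = [[], None]
        elif tag == "a" and self._cell is not None:
            self._cell[1] = attrs.get("href")  # keep the hyperlink target

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell[0]).split())
            if not self.rows:           # cell without an enclosing <tr>
                self.rows.append([])
            self.rows[-1].append((text, self._cell[1]))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)


# Hypothetical stand-in for the HTTPCaller response body.
sample = """
<table class="GroupBox1"><tr><th>Operator</th><td>ACME OIL</td></tr></table>
<table class="DataGrid">
  <tr><th>Form</th><th>View Form/Attachment</th></tr>
  <tr><td>W-2</td><td><a href="viewPdf.do?id=1">View</a></td></tr>
</table>
"""

parser = TableRows("datagrid")
parser.feed(lowercase_class_names(sample))
grid_rows = parser.rows
```

Each row in `grid_rows` is a list of `(text, href)` pairs, so the link survives next to the visible cell text instead of being dropped.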


Additionally, if you want to create a working hyperlink in Excel, you will have to create two attributes:

1. 'View Form/Attachment' with value 'View'

2. 'View Form/Attachment.hyperlink' with the URL of the hyperlink.

Only the first needs to be defined as an attribute in the writer. The .hyperlink suffix is documented for the Excel reader but not for the writer.

One thing to keep in mind: there is a limit of 65,530 hyperlinks per worksheet (per the Excel specifications).
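As a rough illustration of the attribute pair described above, this Python sketch builds the two values for one row. The function name `hyperlink_attributes`, the relative link `viewPdf.do?id=1`, and the step of resolving it against the page URL are assumptions for the example; the attribute names and the 'View' label are the ones from this reply.

```python
from urllib.parse import urljoin


def hyperlink_attributes(column, label, href, base_url):
    """Return the display attribute plus its '.hyperlink' companion.

    Assumes href may be relative and resolves it against base_url.
    """
    return {
        column: label,
        f"{column}.hyperlink": urljoin(base_url, href),
    }


# Hypothetical values; the relative link is not taken from the real page.
attrs = hyperlink_attributes(
    "View Form/Attachment",
    "View",
    "viewPdf.do?id=1",
    "http://webapps.rrc.state.tx.us/CMPL/publicSearchAction.do",
)
```

In a workspace the same pair could be set with an AttributeCreator; the point is only that the URL goes in the `.hyperlink` attribute while the visible cell text goes in the plain one.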



The issue where upper case characters were not allowed in class names has been addressed; they should be accepted in FME 2017.1 build 17504.

 

 



 

 

Good to hear 🙂 Thanks for the update, @stephenwu.

 


Hi @natehewes13 ,

Could you please provide another link? I just want to check the consistency of the tags.

Thanks.

Lyes



This worked great! Thank you @takashi (and sorry this post is so late).

 

