Question

HTML Table & Hyperlink Parser


FMErs, I need some help!

I've been beating my head against the wall trying to figure out how to properly parse these HTML tables. I have used some of the examples on here (@takashi) as go-bys, but I just can't get it right.

http://webapps.rrc.state.tx.us/CMPL/publicSearchAction.do?packetSummaryId=171477&formData.methodHndlr.inputValue;=loadPacket&formData.hrefValue;=%257C1003%253Dhome%257C1005%253Dhome%257C1007%253D0&searchArgs.paramValue;=%257C0%253D04%252F01%252F2017%257C1%253D04%252F27%252F2017%257C2%253D01&pager.paramValue;=%7C1%3D1%7C2%3D100%7C3%3D221%7C4%3D0%7C5%3D3%7C6%3D10&pager.offset;=&publicUser;=

What we have

And want

 

I've tried different combinations of

  1. HTTPCaller > HTMLExtractor (but with the table CSS selector, I can't get the result to flatten without extracting substrings)
  2. FeatureReader (HTML Table) (works great for the tables, but it loses the hyperlinks)

Any help would be greatly appreciated!

6 replies

takashi
Influencer
  • May 9, 2017

Hi @natehewes13, I looked at the HTML source document and found that the required data are stored in two <table> elements, which can be identified by their class names, "GroupBox1" and "DataGrid". You can therefore extract the <table> elements with the HTMLExtractor.

However, unfortunately, the current HTMLExtractor (FME 2017.0) cannot match class names that contain upper case characters. It's a known issue and I hope it will be fixed in the near future.

As a workaround in the interim, change "GroupBox1" and "DataGrid" within the response body to lower case (the StringReplacer can be used), then extract the two <table> elements using the HTMLExtractor with this setting. You can then parse the extracted <table> elements as XML fragments.
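For readers who want to see the same idea outside of Workbench, here is a rough Python sketch of the workaround: lower-case the two class names, pull out the matching <table> elements, and keep each cell's link target alongside its text. It is not the FME workflow itself. SAMPLE_HTML, lowercase_class_names, TableExtractor and extract_tables are hypothetical names made up for illustration, and the sample markup is only a guess at the page's shape.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the HTTPCaller response body (assumption:
# the real page's markup will differ in detail).
SAMPLE_HTML = """
<table class="GroupBox1"><tr><th>Operator</th><td>ACME OIL</td></tr></table>
<table class="DataGrid">
  <tr><th>Form</th><th>View Form/Attachment</th></tr>
  <tr><td>W-2</td><td><a href="viewPdf.do?id=1">View</a></td></tr>
</table>
"""


def lowercase_class_names(body, names=("GroupBox1", "DataGrid")):
    """Mimic the StringReplacer step: lower-case the given names in the body."""
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return pattern.sub(lambda m: m.group(0).lower(), body)


class TableExtractor(HTMLParser):
    """Collect rows of the wanted tables, keeping hyperlink targets.

    Limitations of this sketch: no nested tables, and the class
    attribute must hold exactly one class name.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.tables = {}     # class name -> list of rows
        self.current = None  # class name of the table being read
        self.row = None      # cells of the row being read
        self.cell = None     # {"text": ..., "href": ...}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("class") in self.wanted:
            self.current = attrs["class"]
            self.tables.setdefault(self.current, [])
        elif self.current is None:
            return
        elif tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = {"text": "", "href": None}
        elif tag == "a" and self.cell is not None:
            self.cell["href"] = attrs.get("href")

    def handle_data(self, data):
        if self.cell is not None:
            self.cell["text"] += data

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag in ("td", "th") and self.cell is not None:
            self.cell["text"] = self.cell["text"].strip()
            if self.row is not None:
                self.row.append(self.cell)
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.tables[self.current].append(self.row)
            self.row = None
        elif tag == "table":
            self.current = None


def extract_tables(body, wanted=("groupbox1", "datagrid")):
    """Return {class name: rows} for the wanted tables in the HTML body."""
    parser = TableExtractor(wanted)
    parser.feed(lowercase_class_names(body))
    parser.close()
    return parser.tables


tables = extract_tables(SAMPLE_HTML)
for row in tables["datagrid"]:
    print([(cell["text"], cell["href"]) for cell in row])
```

Each cell keeps both its text and its href, which is the piece that gets lost when the table is read with the HTML Table reader alone.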


lars_de_vries

Additionally, if you want to create a working hyperlink in Excel, you will have to create two attributes:

1. 'View Form/Attachment' with value 'View'

2. 'View Form/Attachment.hyperlink' with the URL of the hyperlink.

Only the first needs to be exposed as an attribute on the writer. The .hyperlink suffix is documented for the Reader but not for the Writer.

One thing to keep in mind: Excel allows at most 65,530 hyperlinks per worksheet (see the Excel specifications and limits).
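To make the attribute naming concrete, here is a small Python sketch that builds the two attributes for one cell. In a workspace you would more likely set them with an AttributeCreator. The function name excel_link_attributes, the relative link, and the base URL used to resolve it are assumptions for illustration only; the actual href values on the page may already be absolute.

```python
from urllib.parse import urljoin


def excel_link_attributes(column, label, href, base_url):
    """Build the display attribute and its '.hyperlink' companion."""
    return {
        column: label,                                   # text shown in the cell
        column + ".hyperlink": urljoin(base_url, href),  # link target
    }


attrs = excel_link_attributes(
    "View Form/Attachment",
    "View",
    "viewPdf.do?id=1",  # hypothetical relative link taken from the table
    "http://webapps.rrc.state.tx.us/CMPL/publicSearchAction.do",
)
for name, value in attrs.items():
    print(name, "=", value)
```

Resolving the link against the page URL is a design choice in this sketch: a relative target would otherwise not open from the workbook.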


stephenwu

The issue where class names containing upper case characters were not matched has been addressed, and such class names should be supported in FME 2017.1 build 17504.

 

 


takashi
Influencer
  • June 7, 2017
Good to hear :-) Thanks for the update, @stephenwu.

 


mygis
Supporter
  • June 7, 2017

Hi @natehewes13 ,

Could you please provide another link? I just want to check the consistency of the tags.

Thanks.

Lyes


This worked great! Thank you @takashi (and sorry this post is so late).

 

