Question

HTML ancestors

  • June 12, 2018
  • 2 replies
  • 20 views

jdh
Contributor

I have some HTML data in the following structure:

 

<h2>Status</h2>
<h3>Place</h3>
<p><a name="1">Name</a></p> <p><a name="2">Name<blockquote><p>Description</p></blockquote></a></p> <h3>Place2</h3>
<p><a name="3">Name<blockquote><p>Description</p></blockquote></a></p>

but the line feeds are entirely erratic.


I need to have one feature per name anchor (which is easily enough done with the HTMLExtractor) but I also need to have the corresponding contents of the h2|h3 tags stored as attributes.

 

Normally I would read in the data line by line and use a TestFilter and variables to do so, but since the line breaks don't match the data structure in any way, I'm not sure of the best way to proceed.

2 replies

takashi
Celebrity
  • June 13, 2018

Hi @jdh, I think it's hard to accomplish that with CSS Selectors.

A workaround I can think of: collect all the elements you are interested in with a StringSearcher, save them into a list attribute, explode the list, and then parse the resulting elements one by one. If the elements you are interested in are <h2>, <h3>, and <a>, this regex matches them, for example:

<h2.+?</h2>|<h3.+?</h3>|<a.+?</a>
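
For readers who want to try the idea outside FME, here is a minimal Python sketch of the same approach. It is not the FME workflow itself: the function names, the dictionary keys, and the choice to reset the h3 value whenever a new h2 appears are all assumptions made for illustration. It collects the matches in document order, remembers the most recent heading text, and emits one record per anchor.

```python
import re

# The regex from the reply above; "." does not match a newline here,
# so this version assumes each element sits on a single line.
ELEMENT_RE = re.compile(r"<h2.+?</h2>|<h3.+?</h3>|<a.+?</a>")
TAG_RE = re.compile(r"<[^>]+>")
NAME_RE = re.compile(r'name="([^"]*)"')

# Sample data copied from the question.
SAMPLE = (
    '<h2>Status</h2>\n'
    '<h3>Place</h3>\n'
    '<p><a name="1">Name</a></p> <p><a name="2">Name<blockquote><p>Description'
    '</p></blockquote></a></p> <h3>Place2</h3>\n'
    '<p><a name="3">Name<blockquote><p>Description</p></blockquote></a></p>'
)


def text_of(element):
    """Drop all tags from a matched element and return the remaining text."""
    return TAG_RE.sub("", element).strip()


def anchors_with_headings(html):
    """Return one dict per <a> element, tagged with the preceding h2/h3 text."""
    features = []
    h2 = h3 = None
    for element in ELEMENT_RE.findall(html):
        if element.startswith("<h2"):
            # Assumption: a new h2 starts a new section, so forget the old h3.
            h2, h3 = text_of(element), None
        elif element.startswith("<h3"):
            h3 = text_of(element)
        else:
            match = NAME_RE.search(element)
            features.append(
                {"name": match.group(1) if match else None, "h2": h2, "h3": h3}
            )
    return features


if __name__ == "__main__":
    for feature in anchors_with_headings(SAMPLE):
        print(feature)
```

The headings act like the variables in a line-by-line read: each anchor simply inherits whatever heading values were seen last, so the erratic line breaks between elements no longer matter.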


jdh
Contributor
  • Author
  • June 13, 2018

That's definitely more elegant than solutions I was considering.


I did need to modify the regex to allow for closing tags split across multiple lines. (I may have mentioned the line feeds were erratic.)
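
The modified regex is not shown in the thread, so the following Python sketch is only one possible way to do it, not the pattern actually used. It assumes two things: that an element may span several lines (handled with `re.DOTALL`, so `.` also matches newlines) and that a line break may fall inside the closing tag just before the `>` (handled with `\s*`). The `\b` after each tag name is an extra assumption that keeps `<a` from matching other tags such as `<abbr>`.

```python
import re

# Original single-line pattern from the earlier reply, kept for comparison.
SINGLE_LINE_RE = re.compile(r"<h2.+?</h2>|<h3.+?</h3>|<a.+?</a>")

# Hypothetical multi-line variant: DOTALL lets "." cross line breaks, and
# "\s*" tolerates whitespace (including a newline) before the closing ">".
MULTI_LINE_RE = re.compile(
    r"<h2\b.+?</h2\s*>|<h3\b.+?</h3\s*>|<a\b.+?</a\s*>",
    re.DOTALL | re.IGNORECASE,
)

# Made-up input with line breaks in awkward places.
BROKEN_SAMPLE = '<h3>Place\n</h3>\n<p><a\nname="3">Name</a\n></p>'


def find_elements(html):
    """Return the matched h2/h3/a elements, in document order."""
    return MULTI_LINE_RE.findall(html)


if __name__ == "__main__":
    print(len(SINGLE_LINE_RE.findall(BROKEN_SAMPLE)))
    for element in find_elements(BROKEN_SAMPLE):
        print(repr(element))
```

The matched strings still contain the stray newlines, so collapsing whitespace before parsing each element is probably worthwhile.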