
I have a webpage with a list of links at the top of the page. Clicking a link jumps down the page to the relevant info (Name Name). I have extracted and parsed a list from the links, and I am now trying to get the data that each link points to. Any ideas? The hrefs are the links and the h2 ids are where the links take you. I want all the data from each h2 up to the next h2 id. So far I keep getting the same data, if I get any at all...

 

<li><a href="http://www.link.com"></a></li>

<li><a href="http://www.link2.com"></a></li>

</ul>

<br/>

<hr/>

<h2 id="1">Name Name</h2>

<p><strong>This is the beginning</strong><br/>DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA </p>

<p><strong>Description</strong><br/> DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA </p>

<p class="date"><strong>Date listed</strong><br/>2016-09-21</p>

<p class="date"><strong>Date reviewed</strong><br/>2017-08-11</p>

<p> </p><br/><h2 id="2">Name Name Name </h2>

<p><strong>This is the beginning</strong><br/> DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA </p>

<p><strong>Description</strong><br/> DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA </p>

<p class="date"><strong>Date listed</strong><br/>2013-08-17</p>

<p class="date"><strong>Date reviewed</strong><br/>2014-09-01</p>

 

 

 

The HTMLExtractor could help you, if you have extracted the URLs of the links.


Hi tmtech,

 

The HTML in question is not the easiest to do this with. Please find the attached FME Workspace for an example of how I've achieved what you're looking for, but if you have any way of modifying the webpage you're working with then I definitely recommend giving a parent div to each segment (as shown in the attached workspace). Please note: The method I've used is by no means the best way of doing this and is extremely "hardcoded", but hopefully it puts you on the right track.
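For anyone following along outside FME, the goal (everything after one `h2 id` up to the next) can be sketched in plain Python. This is only an illustration of the grouping idea, not what the attached workspace does; the class name `SectionSplitter`, the function `split_by_h2`, and the shortened sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Collect the text that follows each <h2 id="..."> up to the next <h2>."""

    def __init__(self):
        super().__init__()
        self.sections = {}       # h2 id -> list of text fragments
        self.current_id = None   # id of the section currently being filled
        self.in_heading = False  # True while inside the <h2> element itself

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # A new heading closes the previous section and opens a new one.
            self.current_id = dict(attrs).get("id")
            self.in_heading = True
            if self.current_id is not None:
                self.sections[self.current_id] = []

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace, the heading text, and anything before the first id.
        if text and self.current_id is not None and not self.in_heading:
            self.sections[self.current_id].append(text)


def split_by_h2(html):
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return parser.sections


# Shortened, hypothetical version of the markup from the question.
sample = """
<h2 id="1">Name Name</h2>
<p><strong>Description</strong><br/>DATA DATA</p>
<p class="date"><strong>Date listed</strong><br/>2016-09-21</p>
<p> </p><br/><h2 id="2">Name Name Name </h2>
<p><strong>Description</strong><br/>DATA</p>
<p class="date"><strong>Date listed</strong><br/>2013-08-17</p>
"""
sections = split_by_h2(sample)
```

Because the sections are keyed by the `h2` id, each link target gets its own block of text, which is the same effect a parent div per segment would give you.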

 

HTML Extractor Example.fmwt

 

Also a few useful links for you:

 

 

I hope this helps!

 

Cheers

Ryan


I had to massage it a little to fit how I got to the point of having the example data but this worked. Would you mind explaining what the concatenation is for? I understand the use of that transformer just not in this instance. Thank you very much for your help!!


Hi tmtech,

I'm glad this has done the trick for you. As for the StringConcatenator: because I had to pull in each of the four paragraphs separately, without it you're left with a set of separate lists, each holding its respective HTML element (see below).

By stitching these back together based on their list index/position (which I've done inside the StringConcatenator), you can turn multiple separate elements back into one block. Looking at the workspace now, I can see it's actually done something funny with the StringConcatenator!

The original Concatenated Result actually looked like this:

 

I wanted to use _element_index to specify which item of each list would be used. Sorry I didn't check this before uploading. I've attached a slightly amended version of the workspace which uses an AttributeCreator instead of the StringConcatenator to achieve the same thing. I didn't add an AttributeManager afterwards on this one, but that's just my personal preference for cleaning up unneeded attributes once I've got what I need.
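The stitching idea itself is independent of FME and may be easier to see in a few lines of Python. This is a rough sketch under assumptions: the four lists stand in for the four separately extracted paragraphs, the position in each list plays the role of `_element_index`, and the function name `stitch_by_index` and the sample values are invented for illustration.

```python
def stitch_by_index(*lists, sep="\n"):
    """Join the i-th item of every list into one block per index.

    zip() stops at the shortest list, so lists of unequal length are
    silently truncated to the shortest one.
    """
    return [sep.join(items) for items in zip(*lists)]


# Hypothetical stand-ins for the four paragraph lists pulled out per heading.
beginnings = ["beginning A", "beginning B"]
descriptions = ["description A", "description B"]
listed = ["2016-09-21", "2013-08-17"]
reviewed = ["2017-08-11", "2014-09-01"]

blocks = stitch_by_index(beginnings, descriptions, listed, reviewed, sep=" | ")
```

Each entry in `blocks` now holds the four pieces that belong to one heading, matched purely by position.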

 

HTML Extractor Example - Updated.fmw

 

Cheers

Ryan

 


Hi, thanks again. Yes, I am getting the same error with both the StringConcatenator and the AttributeCreator. I am getting 60 results for each header. I'm not sure it's picking up the value from the header.


No problem. Hmm, it sounds like the HTML you're using may need a few CSS selector tweaks then. I'm assuming there are many more "h2" tags on the page than just those you need. If the ones you need have particular classes, that would be useful: instead of the CSS Selector in the HTMLExtractor simply stating "h2", we can refine it to grab only the h2 elements you're interested in. The two links in my original answer should help you figure out exactly how this can be achieved.
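To illustrate why a bare "h2" selector can pick up too much, here is a small Python sketch that compares all `h2` elements with only those carrying an `id` or a given class. In standard CSS those filters would be written `h2[id]` and `h2.footer`; whether they behave the same way inside the HTMLExtractor is an assumption that is worth testing on the real page. The sample page, the class name `footer`, and the helper `find_h2` are hypothetical.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Record the attributes of every <h2> start tag."""

    def __init__(self):
        super().__init__()
        self.headings = []  # one attribute dict per <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.headings.append(dict(attrs))


def find_h2(html, require_id=False, css_class=None):
    collector = H2Collector()
    collector.feed(html)
    collector.close()
    found = collector.headings
    if require_id:
        # Roughly what the CSS selector h2[id] expresses.
        found = [a for a in found if a.get("id")]
    if css_class is not None:
        # Roughly what the CSS selector h2.<class> expresses.
        found = [a for a in found if css_class in (a.get("class") or "").split()]
    return found


# Hypothetical page with more h2 elements than the ones of interest.
page = """
<h2>Site navigation</h2>
<h2 class="footer">Contact</h2>
<h2 id="1">Name Name</h2>
<h2 id="2">Name Name Name</h2>
"""
all_h2 = find_h2(page)
with_id = find_h2(page, require_id=True)
footer_only = find_h2(page, css_class="footer")
```

Here the plain search matches four headings, while the `id` filter keeps only the two that the links at the top of the page point to.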

