Solved

Extract table from webpage and hyperlinks in table

  • January 13, 2021
  • 3 replies
  • 154 views

I'm trying to extract all of the data from a table that lives on a webpage along with the hyperlinks that are embedded inside of one of the columns. The HTML table reader is only able to return the data that is displayed on the webpage, not the hyperlinks that are present. On the flip side, I can only get the HTMLExtractor transformer to collect the first record in the table. Attached is my workbench... what can I do to collect all of the records in the table?

Best answer by caracadrian

@Chris Warren You are on the right track.

By modifying your workspace a bit you can obtain the desired result.

Set your second HTMLExtractor to Output: Return List Attribute.

Add a ListExploder for every list that you need, then merge the exploded features with a FeatureMerger, using one of them as the Requestor and the rest as Suppliers. Join on _element_index and set it to Process Duplicate Suppliers.

Then you can continue to your AttributeSplitter.

[Image: Explode HTML Lists]

By setting a ListExploder for every list you can obtain something like this:

[Image: Exploded Lists]


3 replies

warrendev
  • Enthusiast
  • 121 replies
  • January 13, 2021

Hi @johnscreekga,

One approach that I think would work is to set the output in the HTMLExtractor to return a list. Then you could explode the list items and join them by the element index. That should return all items in the table.
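For readers who want to see the explode-and-join idea outside of a workspace, here is a minimal Python sketch. It is not FME code: the helper names `explode` and `join_on_index` and the sample values are hypothetical, and it only mirrors the idea of tagging each list element with its position and then joining on that position.

```python
def explode(values, attr):
    """Turn one list into one record per element, tagged with its position."""
    return [{"_element_index": i, attr: v} for i, v in enumerate(values)]


def join_on_index(requestor, *suppliers):
    """Copy attributes from each supplier onto the requestor record that
    shares the same _element_index."""
    lookups = [{s["_element_index"]: s for s in supplier} for supplier in suppliers]
    merged = []
    for record in requestor:
        out = dict(record)
        for lookup in lookups:
            match = lookup.get(record["_element_index"])
            if match is not None:
                out.update(match)
        merged.append(out)
    return merged


# Hypothetical parallel lists, standing in for two list attributes
# extracted from the same table.
names = ["Parcel A", "Parcel B", "Parcel C"]
links = ["/docs/a.pdf", "/docs/b.pdf", "/docs/c.pdf"]

rows = join_on_index(explode(names, "name"), explode(links, "href"))
for row in rows:
    print(row)
```

The join only lines up correctly when every list has one entry per table row, in the same order. If some rows have no link, the positions drift and the join pairs the wrong values.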


caracadrian
  • Contributor
  • 571 replies
  • Best Answer
  • January 14, 2021

@Chris Warren You are on the right track.

By modifying your workspace a bit you can obtain the desired result.

Set your second HTMLExtractor to Output: Return List Attribute.

Add a ListExploder for every list that you need, then merge the exploded features with a FeatureMerger, using one of them as the Requestor and the rest as Suppliers. Join on _element_index and set it to Process Duplicate Suppliers.

Then you can continue to your AttributeSplitter.

[Image: Explode HTML Lists]

By setting a ListExploder for every list you can obtain something like this:

[Image: Exploded Lists]
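As a rough cross-check of what the workspace should produce, the same result (every row, with both the visible text and the embedded hyperlinks) can be sketched in plain Python using the standard library's `html.parser`. This is an illustration only, not part of the workspace above: the `TableLinkParser` class and the sample HTML are made up for the example, and real pages with nested tables would need more handling.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect every table row as a list of cells, keeping each cell's
    visible text and any href values found inside it."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read
        self._cell = None  # cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = {"text": "", "hrefs": []}
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._cell["hrefs"].append(href)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._row.append(self._cell)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical table, standing in for the page being scraped.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Document</th></tr>
  <tr><td>Parcel A</td><td><a href="/docs/a.pdf">View</a></td></tr>
  <tr><td>Parcel B</td><td><a href="/docs/b.pdf">View</a></td></tr>
</table>
"""

parser = TableLinkParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print([(cell["text"], cell["hrefs"]) for cell in row])
```

Because each cell keeps its own list of links, rows without a hyperlink simply end up with an empty `hrefs` list, which avoids the alignment problem that can occur when separate lists are joined by position.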


  • Author
  • 2 replies
  • January 14, 2021

@Chris Warren and @caracadrian ... thank you both so much for the push in the right direction. I never would have thought of this on my own.