Question

Extract data from webpage


ingalla
Contributor

Can anyone tell me the best method to use to scrape the following webpage

https://historicengland.org.uk/listing/the-list/list-entry/1435084

I am trying initially to extract the following data:-

  • Name
  • List Entry Number

I have tried using the HTTPCaller linked to an HTMLExtractor, but I cannot get anything extracted.

Thanks in advance

6 replies

itay
Supporter
  • Supporter
  • March 25, 2017

Hi @ingalla, how are you configuring the HTMLExtractor? Can you post a screenshot of the settings?


mark_1spatial
  • March 25, 2017

I tried the CSS Selector .attributeList p but can't get a handle on the following:

<div class="attributeList">
	<p><span>Name:</span> Basingstoke War Memorial</p>
	<p><span>List entry Number:</span> 1435084</p>
</div>

 

However, if I hack the HTML file and change the class value to all lower case ("attributelist"), the selector .attributeList p finds it.

Maybe I'm missing something simple or there is an issue with case sensitivity.
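
For comparison outside FME, here is a minimal Python sketch of what the selector .attributeList p would be expected to return for the snippet above. It uses only the standard library's html.parser. The names AttributeListParser and extract_attributes are made up for this illustration and are not part of FME or the HTMLExtractor. The class name is compared exactly, including case, so "attributelist" does not match.

```python
from html.parser import HTMLParser

# The fragment quoted in this post.
SAMPLE = """<div class="attributeList">
    <p><span>Name:</span> Basingstoke War Memorial</p>
    <p><span>List entry Number:</span> 1435084</p>
</div>"""


class AttributeListParser(HTMLParser):
    """Hypothetical helper: collects the text of each <p> inside the target <div>."""

    def __init__(self, class_name="attributeList"):
        super().__init__()
        self.class_name = class_name
        self.div_depth = 0   # greater than 0 while inside the target <div>
        self.in_p = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Exact, case-sensitive comparison of the class name.
            if self.div_depth or self.class_name in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self.rows.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.rows[-1] += data


def extract_attributes(html, class_name="attributeList"):
    """Return {label: value} for each <p> under the <div> with the given class."""
    parser = AttributeListParser(class_name)
    parser.feed(html)
    parser.close()
    result = {}
    for row in parser.rows:
        # Split "Name: Basingstoke War Memorial" into label and value.
        label, _, value = row.partition(":")
        result[label.strip()] = value.strip()
    return result


if __name__ == "__main__":
    print(extract_attributes(SAMPLE))
```

Calling extract_attributes(SAMPLE, "attributelist") returns an empty dictionary, which is the behaviour I would have expected from a case-sensitive match on the original page.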


takashi
Contributor
  • Contributor
  • March 26, 2017

@mark_1spatial, I also found that the HTMLExtractor doesn't work as expected if you set the class name ".attributeList" as the CSS Selector, in FME 2017.0.0.1 build 17271. I don't think you are missing something, and I'm afraid there could be a bug here.

@ingalla, in the interim (and in FME 2016 or earlier), you can use the StringSearcher to extract <div class="attributeList"> elements from the entire HTML document, and then extract your desired strings which are stored in the <p> elements under the <div>.

Regular Expression Example

<div class="attributeList">.+?</div>
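
To show the two-step idea outside FME, here is a rough Python sketch that applies the same expression and then pulls the label and value out of each <p> element. The second pattern (P_PATTERN) and the function name extract_with_regex are only assumptions for illustration. Python needs re.DOTALL here because the <div> spans several lines; whether the StringSearcher needs an equivalent setting depends on how the HTML is stored. The sketch also assumes there is no nested <div> inside the block.

```python
import re

# The fragment quoted earlier in this thread.
SAMPLE = """<div class="attributeList">
    <p><span>Name:</span> Basingstoke War Memorial</p>
    <p><span>List entry Number:</span> 1435084</p>
</div>"""

# Step 1: the expression from this post; DOTALL lets "." match newlines.
DIV_PATTERN = re.compile(r'<div class="attributeList">.+?</div>', re.DOTALL)

# Step 2 (assumed pattern): capture the label in <span> and the text after it.
P_PATTERN = re.compile(r"<p><span>(.+?):</span>\s*(.+?)</p>", re.DOTALL)


def extract_with_regex(html):
    """Return {label: value} from the first attributeList <div>, or {} if absent."""
    match = DIV_PATTERN.search(html)
    if match is None:
        return {}
    pairs = P_PATTERN.findall(match.group(0))
    return {label.strip(): value.strip() for label, value in pairs}


if __name__ == "__main__":
    print(extract_with_regex(SAMPLE))
```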

mark_1spatial
  • March 26, 2017
Yes, that is the same build number as mine. You posted here about case-sensitive tags:

https://knowledge.safe.com/questions/34058/how-to-parse-html-file.html


mygis
Contributor
  • Contributor
  • March 31, 2017

Hi @ingalla, I was able to extract the values with the method in the attached workspace. I put it together quickly and I am sure there are other methods, but this one works. I hope it works for you too.

Good luck.

Lyes.

 

 

extractvaluesfromhtml.fmw


stefanh
Contributor
  • Contributor
  • October 19, 2017

Thanks for the workspace example. This is very useful in my case.

 

 

