Can anyone tell me the best method to scrape the following webpage?

https://historicengland.org.uk/listing/the-list/list-entry/1435084

Initially I am trying to extract the following data:

  • Name
  • List Entry Number

I have tried using the HTTPCaller linked to an HTMLExtractor, but I cannot get anything extracted.

Thanks in advance

Hi @ingalla, how are you configuring the HTMLExtractor? Can you post a screenshot of the settings?


I tried the CSS Selector .attributeList p, but it does not match the following element:

<div class="attributeList">
<p><span>Name:</span> Basingstoke War Memorial</p>
<p><span>List entry Number:</span> 1435084</p>
</div>

 

However, if I hack the HTML file and change the class value to all lower case ("attributelist"), the selector .attributeList p finds it.

Maybe I'm missing something simple or there is an issue with case sensitivity.
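
For comparison, here is a minimal Python sketch (outside FME, standard library only) of what a case-sensitive match on the class value would be expected to return for this snippet. The class AttributeListParser and the helper rows_for are hypothetical names made up for this illustration, and the parser assumes the target div contains no nested div elements.

```python
from html.parser import HTMLParser


class AttributeListParser(HTMLParser):
    """Collects the text of each <p> inside a <div> with the given class."""

    def __init__(self, class_name="attributeList"):
        super().__init__()
        self.class_name = class_name
        self.in_div = False
        self.in_p = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but keeps
        # attribute values (such as the class) exactly as written.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.class_name in classes:
            self.in_div = True
        elif tag == "p" and self.in_div:
            self.in_p = True
            self.rows.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "div":
            self.in_div = False

    def handle_data(self, data):
        if self.in_p:
            self.rows[-1] += data


def rows_for(html, class_name):
    """Return the <p> texts found under <div class=class_name>."""
    parser = AttributeListParser(class_name)
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """<div class="attributeList">
<p><span>Name:</span> Basingstoke War Memorial</p>
<p><span>List entry Number:</span> 1435084</p>
</div>"""

print(rows_for(sample, "attributeList"))  # exact case: two rows
print(rows_for(sample, "attributelist"))  # lower case: no match
```

With the exact-case class name the sketch returns both rows, and with the lower-case name it returns nothing, which is the opposite of what the HTMLExtractor does here.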


@mark_1spatial, I also found that the HTMLExtractor doesn't work as expected if you set the class name ".attributeList" as the CSS Selector, in FME 2017.0.0.1 build 17271. I don't think you are missing something, and I'm afraid there could be a bug here.

@ingalla, in the interim (and in FME 2016 or earlier), you can use the StringSearcher to extract <div class="attributeList"> elements from the entire HTML document, and then extract your desired strings which are stored in the <p> elements under the <div>.

Regular Expression Example

<div class="attributeList">.+?</div>
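
To show how the two extraction steps fit together, here is a rough Python sketch using the same regular expression on the snippet quoted earlier in the thread. It is only an illustration, not the FME workflow itself: the second pattern and the function name extract_attributes are assumptions added for this example, and Python needs re.DOTALL so that "." also matches the line breaks inside the div.

```python
import re

# Sample HTML copied from the thread; in practice this would be the
# response body returned for the page.
html = """<div class="attributeList">
<p><span>Name:</span> Basingstoke War Memorial</p>
<p><span>List entry Number:</span> 1435084</p>
</div>"""

# Step 1: isolate the <div class="attributeList"> block.
div_pattern = re.compile(r'<div class="attributeList">.+?</div>', re.DOTALL)

# Step 2 (assumed pattern): split each <p> into its label and value.
pair_pattern = re.compile(r'<p><span>(.+?):</span>\s*(.+?)</p>')


def extract_attributes(document):
    """Return a dict of label -> value found in the attributeList div(s)."""
    result = {}
    for block in div_pattern.findall(document):
        for label, value in pair_pattern.findall(block):
            result[label.strip()] = value.strip()
    return result


attributes = extract_attributes(html)
print(attributes["Name"])               # Basingstoke War Memorial
print(attributes["List entry Number"])  # 1435084
```

Regular expressions are fragile against markup changes, so this is best treated as a stopgap until the CSS selector works as expected.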

Yes, that is the same build number as mine. You posted here about case-sensitive tags:

 

https://knowledge.safe.com/questions/34058/how-to-parse-html-file.html

 

 

 


Hi @ingalla, I was able to do it with this method. I put the workspace together quickly, and I am sure there are other methods, but this one works. I hope it works for you too.

Good luck.

Lyes.

 

 

extractvaluesfromhtml.fmw


Thanks for the workspace example. This is very useful in my case.

 

 

