Solved

Has anyone tried integrating Beautiful Soup with FME Desktop?


mygis
Supporter

Hello,

I'm wondering if anyone has tried using Beautiful Soup with FME Desktop.

Thanks.

Best answer by mark2catsafe

Yes, I believe we've used it with FME before. There used to be an article on the knowledgebase, but it was removed because it became out of date. In case it's of use, the following is the content of that article:

_____

Parsing HTML files with Beautiful Soup for Python

BeautifulSoup for Python is a powerful parser for HTML/XML. It can serve well as a replacement for standard FME tools such as StringSearcher (aka Grepper) or StringReplacer, whose use for HTML parsing is shown on the HTTPFetcher page.

The attached example takes an HTML page containing a few tables. Some of them are used for design purposes; the others contain useful information about extra-cost plug-ins for FME. BeautifulSoup scans through them, keeps only the necessary tables (<table> tags), and searches for rows (<tr> tags) and cells (<td> tags), turning them into feature types, features, and attributes respectively.

FME itself then takes care of exposing and renaming attributes, and of cleaning and replacing attribute values where necessary.
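
Since the original example is no longer available, the following is only a minimal sketch of the table/row/cell scan described above, not the article's script. It assumes the current `bs4` package (Beautiful Soup 4) rather than the old BeautifulSoup.py module the article refers to. The sample HTML, the `extract_tables` helper, and the rule that a table without <th> cells is a layout table are all assumptions made for illustration. Plain dictionaries stand in for FME feature types, features, and attributes; no FME API is used.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: one layout table and one data table.
HTML = """
<html><body>
<table class="layout"><tr><td>Navigation</td></tr></table>
<table id="plugins">
  <tr><th>Plug-in</th><th>Vendor</th></tr>
  <tr><td>Format A</td><td>Vendor X</td></tr>
  <tr><td>Format B</td><td>Vendor Y</td></tr>
</table>
</body></html>
"""


def extract_tables(html):
    """Return {table_name: [row_dict, ...]} for tables that have header cells."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for index, table in enumerate(soup.find_all("table")):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if not headers:
            continue  # assumption: a table without <th> cells is layout-only
        # The table name plays the role of a feature type.
        name = table.get("id") or "table_%d" % index
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                # Each row is one feature; header/cell pairs are its attributes.
                rows.append(dict(zip(headers, cells)))
        result[name] = rows
    return result


tables = extract_tables(HTML)
for name, rows in tables.items():
    for row in rows:
        print(name, row)
```

In a workspace, logic like this would typically live in a Python-scripted transformer, with each row dictionary written out as attributes on a feature; that wiring is left out here because it depends on the FME version.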

To use BeautifulSoup, Python 2.3 or higher must be installed (find more details here). BeautifulSoup.py should be placed either in \Python24\Lib\site-packages (to make it available to any workspace) or alongside the workspace that calls it (to keep the workspace portable).

Note that HTML can have a very complex structure, so no single Python script will work for every HTML file. Use this example as a simple introduction to HTML parsing.

Refer to the BeautifulSoup documentation for more details about HTML parsing.

_____

Sadly, the example mentioned in the article has also been removed and is no longer available, but I hope the above helps in some way.

Mark


6 replies


fmelizard
Safer
  • June 7, 2016

FYI, FME 2017 has a couple of transformers/readers that use this package. There is an HTML Extractor transformer, and a reader that just reads tables and lists from HTML pages. Watch for the beta coming soon...


mygis
Supporter
  • Author
  • Supporter
  • June 7, 2016

Hi @mark2catsafe, I was wondering where that article had gone. I have been using RegEx to parse HTML pages and extract tables automatically, which means removing all the unnecessary HTML tags. It is tedious work at the beginning, but it gets the job done.
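
For comparison, here is a small sketch of the two approaches on a single table row. It is not the workflow used in this thread: the sample markup and the regular expressions are assumptions for illustration, and the Beautiful Soup half assumes the `bs4` package is installed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample row with extra markup inside the cells.
ROW = (
    "<table><tr>"
    "<td><b>Format A</b></td>"
    "<td><a href='#'>Vendor X</a></td>"
    "</tr></table>"
)

# RegEx approach: capture each <td> body, then strip any tags left inside it.
cells_regex = [
    re.sub(r"<[^>]+>", "", cell).strip()
    for cell in re.findall(r"<td[^>]*>(.*?)</td>", ROW, flags=re.S)
]

# Beautiful Soup approach: let the parser find the cells and return their text.
soup = BeautifulSoup(ROW, "html.parser")
cells_soup = [td.get_text(strip=True) for td in soup.find_all("td")]

print(cells_regex)
print(cells_soup)
```

Both give the same result on simple markup like this. The regular expressions tend to need adjusting as soon as cells contain nested tables or unusual attributes, which is where a parser usually saves effort.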


mygis
Supporter
  • Author
  • Supporter
  • June 7, 2016
fmelizard wrote:

FYI, FME 2017 has a couple of transformers/readers that use this package. There is an HTML Extractor transformer, and a reader that just reads tables and lists from HTML pages. Watch for the beta coming soon...

Thanks @daleatsafe, I will surely keep an eye on the beta!


mygis
Supporter
  • Author
  • Supporter
  • June 7, 2016
fmelizard wrote:

FYI, FME 2017 has a couple of transformers/readers that use this package. There is an HTML Extractor transformer, and a reader that just reads tables and lists from HTML pages. Watch for the beta coming soon...

@mark2catsafe, @daleatsafe, just a random thought: wouldn't it be great to be able to accept multiple answers on a question here?


fmelizard
Safer
  • June 8, 2016
mygis wrote:

@mark2catsafe, @daleatsafe, just a random thought: wouldn't it be great to be able to accept multiple answers on a question here?

Yes, I agree. I've wanted to accept more than one answer in the past too. I'll suggest it.

