
Hi guys

I'm trying to understand how to extract data from a website, but it's very hard. (I'm not a Python expert, so FME is the perfect solution, but...)

If I inspect the web page, it's very easy to get the CSS selector, but then it's very hard to use it in FME.

 

For example, I have these CSS selectors for different values:

#ctl00_mainIndexContent_pnlRisultati > table > tbody > tr:nth-child(3) > td:nth-child(1)

#ctl00_mainIndexContent_pnlRisultati > table > tbody > tr:nth-child(3) > td:nth-child(2)

#ctl00_mainIndexContent_pnlRisultati > table > tbody > tr:nth-child(3) > td:nth-child(3)

 

What do I have to write in the CSS Selector field in FME to retrieve the data?

 

thx in advance

 

Francesco

Hi @checcosisani, you can just enter those CSS Selector strings in the "CSS Selector" column of the Extract Queries table in the HTMLExtractor parameters, provided "body" stores an HTML document that contains the elements the CSS Selectors point to.

For example (screenshot: 0684Q00000ArLInQAN.png):



Hi Takashi

 

Thanks for the quick reply, but unfortunately it doesn't work.

I've sent you the workspace with the URL.

I've tried a lot of combinations of CSS queries, but without success.

 

The goal in the end is also to obtain a list of data.

 

Thx

 

Francesco

 

test_css_tmp.fmw



I looked at the value of the "body" attribute (the source HTML text) entering HTMLExtractor_5 in your workspace and found that the target <table> element doesn't contain a <tbody>. That is, the <tr> elements are direct children of <table>, so in theory this CSS Selector should work:

#ctl00_mainIndexContent_pnlRisultati > table > tr:nth-child(3) > td:nth-child(1)

However, strangely, HTMLExtractor_5 routed the input feature to the <Rejected> port when I ran the workspace with FME 2019.2. I then ran the same workspace with FME 2020.0 and successfully got this result. I think the value of "ufficio_tmp" is the one you want.

0684Q00000ArMhLQAV.png

I suspect that FME 2019 has a defect in parsing HTML under some conditions. I would recommend upgrading to FME 2020.
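
A note on where the extra <tbody> comes from: browser developer tools show the parsed DOM, and the browser's HTML parser inserts a <tbody> into a table even when the raw HTML has none, so a selector copied from the inspector may not match the raw "body" text. Below is a minimal sketch in plain Python (standard library only) for checking which tag actually encloses the <tr> elements in the raw HTML. The sample markup and the helper name tr_parent_tags are made up for illustration and are not taken from the workspace.

```python
from html.parser import HTMLParser


class ParentTracker(HTMLParser):
    """Record the parent tag of every <tr> found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.tr_parents = []  # parent tag name of each <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.tr_parents.append(self.stack[-1] if self.stack else None)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tr_parent_tags(html):
    """Return the set of tag names that directly contain a <tr>."""
    tracker = ParentTracker()
    tracker.feed(html)
    tracker.close()
    return set(tracker.tr_parents)


# Made-up sample resembling the structure described above (no <tbody>).
raw_html = (
    '<div id="ctl00_mainIndexContent_pnlRisultati">'
    "<table><tr><td>a</td></tr><tr><td>b</td></tr></table></div>"
)

if "tbody" in tr_parent_tags(raw_html):
    print("use: ... > table > tbody > tr ...")
else:
    print("use: ... > table > tr ...")
```

If the set contains only "table", the "> tbody" step has to be dropped from the selector, as in the selector above.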



I'm wondering whether FME 2019 may not support the "nth-child()" selector.

This workspace is an alternative that doesn't use "nth-child()". If you cannot upgrade FME for some reason, consider applying this approach as a workaround.

test-css-tmp-2.fmw (FME 2019.2)


Hi Takashi

Thanks for the support.

Yes, FME 2020 supports the CSS selector, but now I have another blocker...

How can I "transpose" the values in the HTML table into separate fields?

Basically, the idea is (like in Python) to have all values belonging to the first column of the HTML table in the same field.

 

#ctl00_mainIndexContent_pnlRisultati > table > tbody > tr:nth-child(3) > td:nth-child(1)

#ctl00_mainIndexContent_pnlRisultati > table > tbody > tr:nth-child(8) > td:nth-child(1)

#ctl00_mainIndexContent_pnlRisultati > table > tbody > tr:nth-child(11) > td:nth-child(1)

 

I hope my explanation is clear

thx

 

Francesco



Do you mean that the first-column values in rows 3, 8, and 11 should be concatenated and stored in a single attribute?



Hi Takashi

Attached you can find the workspace that I use to extract the data (all data) from the website, but it contains a lot of workarounds because of my inexperience. My goal is a clean and efficient process to extract data from the web.

 

Any suggestions are more than welcome.

 

thx

 

Francesco

 

Milano_testcss_v1.zip



If I understand the structure of the source HTML and your requirement correctly, this workspace example might help you. You just need to transform / rename some attributes to achieve the final goal.

milano-testcss-example.fmw (FME 2019.2)
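
For comparison, the "like in Python" idea from the question (all values of one table column gathered together) might look roughly like the sketch below. It uses only the standard library and does not reproduce the attached workspace. The sample table and the names TableRows and table_columns are invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser
from itertools import zip_longest


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_columns(html):
    """Return the table as a list of columns (each a list of cell texts)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    # Skip rows without <td> cells (for example header-only rows).
    rows = [[cell.strip() for cell in row] for row in parser.rows if row]
    # Transpose rows into columns; short rows are padded with "".
    return [list(col) for col in zip_longest(*rows, fillvalue="")]


# Invented sample data, only to show the shape of the result.
sample = (
    "<table>"
    "<tr><td>Office A</td><td>10</td></tr>"
    "<tr><td>Office B</td><td>20</td></tr>"
    "</table>"
)
columns = table_columns(sample)
print(columns[0])  # all values of the first column
```

Each inner list corresponds to one table column, so columns[0] plays the role of the single field holding every first-column value.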



Super !

I have another question.

Some websites are not "scrapable" because they are dynamic (JavaScript).

If I have a simple Python script written with Selenium, can I run it using a PythonCaller?

Thx



Unfortunately, I don't think FME is capable of interpreting JavaScript to dynamically generate an HTML document.



Hi Takashi

Sorry for the late reply and for the "bad" request.

What I want to know is whether it's possible to run a Python script from FME (I mean using the PythonCaller).

The script needs the selenium library to be installed, so I don't know if this is possible. I'm not an expert in the PythonCaller.

If needed, I can share the script.

thx again

 

Francesco



Hi @checcosisani, in general, you can implement and run a Python script that uses any external modules with a PythonCaller, provided you have installed the required modules into your FME Python environment. See here to learn how to install an external module into FME Desktop.

Installing Python Packages to FME Desktop

However, I'm not sure whether the selenium module provides classes and/or functions to get your desired result, since I've never used it.

I'd recommend posting a new question if you want useful suggestions on using the selenium module. Hopefully someone in the Community has experience with it.
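
As a starting point for such a question, here is a rough, untested-in-FME sketch of how a PythonCaller class could wrap a Selenium call. It assumes that the selenium package has been installed into the FME Python environment, that Chrome and a matching driver are available on the machine, and that the incoming feature carries the page address in an attribute called "url" (a hypothetical name). The result is written to "body" so that an HTMLExtractor could read it. Pages that load data late may also need an explicit wait, which is not shown.

```python
def fetch_rendered_html(url):
    """Load a page in headless Chrome and return its current page source."""
    # Imported lazily so the script still loads where selenium is missing.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


class FeatureProcessor(object):
    """PythonCaller-style class: FME calls input() once per feature."""

    def __init__(self, fetcher=None):
        # FME creates the class without arguments; the optional fetcher
        # exists only so the class can be tried outside FME.
        self.fetcher = fetcher or fetch_rendered_html

    def input(self, feature):
        url = feature.getAttribute("url")  # hypothetical attribute name
        feature.setAttribute("body", self.fetcher(url))
        self.pyoutput(feature)  # pyoutput is supplied by FME at run time

    def close(self):
        pass
```

Keeping the browser code in its own function means it can be tested in a normal Python session first and only then pasted into the PythonCaller.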

