Question

Downloading Multiple ZIP files from URLs



Hi all,

I've got a problem that I'm trying to solve. I need to download a large number (circa 50) of zip files from this website. They all follow the same URL format, for example:

http://data.inspire.landregistry.gov.uk/Abertawe_-_Swansea.zip

Each zip file contains a GML file for that local authority area. I need to download each of them and merge them into one feature class in a file geodatabase. I have already tried going through manually, saving the URLs to a CSV, and then using the workbench below.

1. CSV Reader.

2. HTTP Caller with the following settings:

3. Feature Reader with the following settings:

3. Attribute Creator using 'fme_dataset' as the value of a new attribute called 'Local Authority'. This obviously produces a long file path in the following format:

I:\UK\OutsideLondon\Land_Ownership\Out_East_1\Data\Inspire\Gravesham.zip\Land_Registry_Cadastral_Parcels.gml.

I would then like to use a StringSearcher to strip out everything except 'Gravesham' so that I only have the local authority name as an attribute.

I then use an EsriReprojector to set the projection and finally a file geodatabase writer with the geometry set to polygon and the user attributes set to automatic.

This is the whole workbench (minus the StringSearcher, because I haven't worked on the regex yet).

When I try and run this I get the following error:

XML Parser error: 'Error in input dataset: file:///xxx/yyyy/zzzz/GIS/Data/Inspire/http_download_1493802040496_7056.html' line:1 column:103 message:unable to connect socket for URL

Along with the error, the files get read as far as the FeatureReader but all end up at the rejected port and when I inspect them there's no geometry present.

The questions I have are:

1. Is there something obvious that I'm doing wrong?

2. Is there any way to get a list of all the URLs together to be able to download?

3. Is there any source of help I can get for the regex on my point 3 above?

Thanks for any help anybody can give me.

17 replies

  • Author
  • May 3, 2017

The file extension for the HTTPCaller should be .zip instead of .zp above.


itay
Supporter
  • Supporter
  • May 3, 2017

Hi @dunuts,

I have downloaded the zip file you mention and read it correctly with the following settings in a GML reader:

This should also work in the FeatureReader for all the downloaded files.

Hope this helps.


david_r
Evangelist
  • May 3, 2017

The fact that you got an HTML file rather than a zip file back from the HTTPCaller suggests that your HTTP request was badly formed, or that some other error occurred. You could set a breakpoint (also called an inspection point in earlier versions of FME) just after the HTTPCaller and run the workspace. When it stops at the breakpoint, look at the contents of the file given in _response_file_path.
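The same check can also be done outside FME once a response file is on disk. The following is a minimal Python sketch, not part of the original workspace: the function name `describe_download` is hypothetical, and it only reports whether a file is a real zip archive or looks like an HTML/XML error page.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Return a rough label for a downloaded file: 'zip', 'html/xml' or 'unknown'."""
    p = Path(path)
    # is_zipfile() inspects the file contents, so the extension does not matter.
    if zipfile.is_zipfile(str(p)):
        return "zip"
    # Error pages are small text documents; look at the first bytes only.
    head = p.read_bytes()[:512].lstrip().lower()
    if head.startswith(b"<"):
        return "html/xml"
    return "unknown"
```

Calling `describe_download()` on the path stored in `_response_file_path` would return `'html/xml'` for a file like the `http_download_..._.html` one named in the error message, assuming it contains an error page rather than an archive.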


itay
Supporter
  • Supporter
  • May 3, 2017

Agreed. The URL http://data.inspire.landregistry.gov.uk/ is actually an S3 bucket, and accessing it via the HTTPCaller doesn't work as expected; it stops with the following error:

2017-05-03 13:53:11|  45.6|  0.0|ERROR |HTTPCaller(HTTPFactory): HTTP/FTP transfer error: 'Couldn't connect to server'
2017-05-03 13:53:11|  45.6|  0.0|ERROR |HTTPCaller(HTTPFactory): Please ensure that your network connection is properly set up
2017-05-03 13:53:11|  45.6|  0.0|ERROR |HTTPCaller(HTTPFactory): No proxy settings have been entered.  If you require a proxy to access external URLs, please ensure the appropriate information has been entered

Since I don't know whether the connection is set up properly, I would suggest trying the S3 transformers to get the data.

For the LocalAuthority attribute, I would use the AttributeSplitter on the path, grab the correct element, and clean it up.

I:\UK\OutsideLondon\Land_Ownership\Out_East_1\Data\Inspire\Gravesham.zip\Land_Registry_Cadastral_Parcels.gml. > split

Gravesham.zip > clean (remove .zip) > result is Gravesham

Hope this helps.
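For the regex route asked about in the original question, the split-and-clean idea can be expressed as a single pattern. This is a small Python sketch for illustration only: the function name `local_authority` is hypothetical, and whether the same pattern can be pasted unchanged into a StringSearcher is an assumption that has not been checked here.

```python
import re

# Capture the path element that ends in ".zip", for either slash style.
ZIP_NAME = re.compile(r"([^\\/]+)\.zip(?=[\\/]|$)", re.IGNORECASE)


def local_authority(fme_dataset):
    """Return the zip file's base name from a dataset path, or None if absent."""
    match = ZIP_NAME.search(fme_dataset)
    return match.group(1) if match else None


path = (r"I:\UK\OutsideLondon\Land_Ownership\Out_East_1\Data\Inspire"
        r"\Gravesham.zip\Land_Registry_Cadastral_Parcels.gml")
print(local_authority(path))  # Gravesham
```

The capture group holds everything between the last path separator before `.zip` and the extension itself, so names such as `Abertawe_-_Swansea` come through intact.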


  • Author
  • May 3, 2017
This does help. Can you recommend any resource to get started with S3 downloader etc.? Not sure where to even begin. Thanks

 


david_r
Evangelist
  • May 3, 2017
dunuts wrote:
This does help. Can you recommend any resource to get started with S3 downloader etc.? Not sure where to even begin. Thanks

 

This is a good starting point:

 

https://knowledge.safe.com/articles/24146/s3objectlister-s3downloader-and-s3uploader-transfo.html

takashi
Contributor
  • Contributor
  • May 3, 2017
The XML document seems to be just a listing of the contents of an S3 bucket, and I think each object (*.zip) can be downloaded with an ordinary HTTP GET request.

 

Actually I was able to download a zip file from this URL using the HTTPCaller. 

 

http://data.inspire.landregistry.gov.uk/Abertawe_-_Swansea.zip

 

Anyway, first of all, make sure that each URL read from the CSV table is the correct location of a zip file on the web.

 


takashi
Contributor
  • Contributor
  • May 3, 2017
In addition, this workflow extracts 350 Contents from this URL (XML).

 

URL: http://data.inspire.landregistry.gov.uk/
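To illustrate what extracting the Contents amounts to, here is a hedged Python sketch that pulls the `Key` values out of an S3-style bucket listing. The sample XML is a made-up, cut-down listing in the shape S3 normally returns, not the real response from this URL, and the function name `list_keys` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: a shortened listing in the usual S3 layout.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>Abertawe_-_Swansea.zip</Key><Size>123</Size></Contents>
  <Contents><Key>Gravesham.zip</Key><Size>456</Size></Contents>
  <Contents><Key>index.html</Key><Size>78</Size></Contents>
</ListBucketResult>"""


def list_keys(listing_xml):
    """Return the text of every <Key> element, ignoring the XML namespace."""
    root = ET.fromstring(listing_xml)
    keys = []
    for element in root.iter():
        # Tags look like "{namespace}Key"; compare on the local name only.
        if element.tag.rsplit("}", 1)[-1] == "Key" and element.text:
            keys.append(element.text)
    return keys


print(list_keys(SAMPLE_LISTING))
```

Each `Contents` element describes one object in the bucket, and its `Key` is the file name that gets appended to the bucket URL.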

 


  • Author
  • May 3, 2017
@takashi Thank you. I used your workflow below and got the 350 zip file names, which I wrote to a CSV. Do you know how I could connect the XMLFragmenter to a FeatureReader etc. in order to download the zip files? Thank you.

0684Q00000ArMNxQAN.png


itay
Supporter
  • Supporter
  • May 3, 2017
dunuts wrote:
@takashi Thank you. I used your below workflow and got the 350 zip files that I wrote to a CSV. Do you know how I could connect the XML fragmenter to a feature reader etc. in order to be able to download the zip files? Thank you.

That's easy: just concatenate the beginning of the URL (http://data.inspire.landregistry.gov.uk/) with the Key attribute to form the correct download link. This can all be done in the HTTPCaller's Request URL parameter.

 

Hope this helps.

 

 


itay
Supporter
  • Supporter
  • May 3, 2017
takashi wrote:
The XML document seems to be just a list of contents within a S3 bucket, and I think each content (*.zip) can be downloaded via general HTTP GET request.

 

Actually I was able to download a zip file from this URL using the HTTPCaller.

 

http://data.inspire.landregistry.gov.uk/Abertawe_-_Swansea.zip

 

Anyway, first of all, make sure that the URL read from the CSV table is the correct location of a zip file on web.

 

That was my initial idea, but I got the error mentioned above.

 

 


takashi
Contributor
  • Contributor
  • May 3, 2017
This screenshot illustrates the following procedure.

 

Request URL: http://data.inspire.landregistry.gov.uk/@Value(Key)

 

Note that there could be non-zip files among the contents, so you will have to filter them by checking the extension, for example.
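The concatenate-and-filter step can be sketched in Python as follows. This is an illustration rather than the workspace itself: `zip_urls` is a hypothetical name, the third key in the example is invented to show a name containing spaces, and percent-encoding the key with `quote()` is an assumption about what the server expects.

```python
from urllib.parse import quote

BASE_URL = "http://data.inspire.landregistry.gov.uk/"


def zip_urls(keys, base_url=BASE_URL):
    """Build a download URL for each key that ends in .zip (case-insensitive)."""
    urls = []
    for key in keys:
        if key.lower().endswith(".zip"):
            # quote() percent-encodes spaces and other unsafe characters.
            urls.append(base_url + quote(key))
    return urls


for url in zip_urls(["Abertawe_-_Swansea.zip", "index.html", "Isle of Wight.zip"]):
    print(url)
```

Keys without a `.zip` extension are skipped, which corresponds to testing the extension before the HTTPCaller in the workspace.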

 

0684Q00000ArMHzQAN.png

  • Author
  • May 4, 2017
@takashi, I've tried the exact workbench and settings you've described and I get the error below. Any ideas how I can fix this? Thank you.

0684Q00000ArMbiQAF.png


takashi
Contributor
  • Contributor
  • May 4, 2017
dunuts wrote:
@takashi, I've tried the exact workbench and settings you've described and I get the error below. Any ideas how I can fix this? Thank you.

It looks like the HTTP requests for some zip files did not get the expected response from the server. The exact reason cannot be identified from this, but I wonder whether each zip file actually exists at the requested URL. Check whether the zip file can be downloaded using a web browser.

 

 


  • Author
  • May 5, 2017

@takashi I have tried one or two of the URLs in the browser and they download the zip files without a problem. Would you have any idea if there's anything else I can try? Thanks


takashi
Contributor
  • Contributor
  • May 5, 2017
dunuts wrote:

@takashi I have tried one or two of the URLs in the browser and they download the zip files without a problem. Would you have any idea if there's anything else I can try? Thanks

Well, does the HTTPCaller download a zip file if you set a known URL as the Request URL? If it doesn't, there could be an issue with your network environment.
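One way to separate an FME problem from a network problem is to fetch a single known URL with a different tool on the same machine. The sketch below uses only Python's standard library; `try_download` is a hypothetical helper, and the 30-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def try_download(url, timeout=30):
    """Fetch one URL with the standard library and summarise what happened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
        # Zip archives start with the two bytes "PK".
        return {"ok": True, "bytes": len(body), "looks_like_zip": body[:2] == b"PK"}
    except (urllib.error.URLError, OSError) as exc:
        # URLError covers DNS, proxy and connection failures (HTTPError is a subclass).
        return {"ok": False, "error": str(exc)}


# Proxy settings that Python picks up from the environment or operating system.
print(urllib.request.getproxies())
```

If `try_download("http://data.inspire.landregistry.gov.uk/Abertawe_-_Swansea.zip")` also fails on that machine while a browser succeeds, a proxy that the browser is configured to use but other tools are not would be one possible explanation.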

 

 


  • Author
  • May 10, 2017
takashi wrote:
Well, does the HTTPCaller download a zip file if you set a known URL as the Request URL? If it doesn't, there could be an issue with your network environment.

 

 

@takashi I can download a normal file but I'm still getting the error with this. Thanks for all your help.
