Open

Don't extract zip files to a temp dir: stream directly into the next process step

Related products: FME Form
  • oweno
  • vlroyrenn
  • chriswilson

Along the lines of the themes mentioned in

https://community.safe.com/s/bridea/a0r4Q00000Hbr7TQAR/reading-from-zip-files-extract-only-file-specified-to-use-less-disk-space-in-t?currentpage=5


and

https://community.safe.com/s/bridea/a0r4Q00000HbrNzQAJ/select-data-in-compressed-files-eg-zip-downloaded-from-an-url-same-way-as-fr?currentpage=7


I've got a few hundred zip files, with around 1,000 files in each one. They are zipped because they are JSON files, which compress quite well, and because handling 100,000 files in a directory gets unwieldy.


When reading the features in using a wild-carded reader, e.g. *.zip[***.geojson],

the behaviour seems to be that it sits there for a very long time before starting to read any features. Combined with the messages in the console, this makes me think it's unzipping every zip file before starting to read any features, rather than unzipping and reading in parallel.


It would be far better to skip the whole stage of unzipping and writing to disk, and just stream the zip file contents: decompress the data as it is pulled off the stream and pipe it straight into the next process step.
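To illustrate the kind of behaviour I mean, here is a minimal sketch in plain Python. It is not how FME works internally, and it uses the standard library zipfile module rather than the streaming libraries linked in this post. The function name iter_geojson_features and the default pattern are made up for the example. It reads each archive member straight out of the zip and yields features as soon as the first member is parsed, with nothing extracted to a temp directory.

```python
import fnmatch
import json
import zipfile
from pathlib import Path
from typing import Iterator


def iter_geojson_features(zip_dir: Path, member_pattern: str = "*.geojson") -> Iterator[dict]:
    """Yield GeoJSON features member by member, without extracting to disk.

    Hypothetical helper for illustration only.
    """
    for zip_path in sorted(zip_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                if not fnmatch.fnmatch(name, member_pattern):
                    continue
                # ZipFile.open returns a file-like object that decompresses
                # as it is read; nothing is written to a temp directory.
                with archive.open(name) as member:
                    document = json.load(member)
                # Features are handed on before the next member is opened.
                yield from document.get("features", [])
```

Because this is a generator, the consumer starts receiving features after the first member is decoded, instead of waiting for every archive to be unpacked. One caveat: zipfile needs a seekable file, since it reads the central directory at the end of the archive, so this avoids the temp-dir extraction for local zips but is not true streaming from a non-seekable source such as an HTTP download.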


There is a Python-based streaming reader here:

https://stream-unzip.docs.trade.gov.uk/getting-started/

https://stream-zip.docs.trade.gov.uk/


A similar library is here:

https://github.com/allanlei/python-zipstream
