Along themes mentioned in
and
I've got a few hundred zip files, with around 1000 files in each one. They are zipped because they are JSON files, which compress quite well, and because handling 100,000 files in a directory gets unwieldy.
When reading the features in using a wildcarded reader path, e.g. *.zip[***.geojson], the behaviour seems to be that it sits there for a very long time before starting to read any features. Combined with the messages in the console, this makes me think it's unzipping every zip file before starting to read any features, rather than unzipping and reading in parallel?
It would be far better to skip the entire stage of unzipping and writing to disk and just stream the zip file contents, decompressing the data as it gets pulled off the stream and piping it into the next process step.
There is a Python-based streaming reader here:
https://stream-unzip.docs.trade.gov.uk/getting-started/
https://stream-zip.docs.trade.gov.uk/
similar lib here
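
To illustrate the kind of behaviour I mean, here is a rough sketch using only Python's standard zipfile module, not the stream-unzip library linked above. It decompresses each member in memory and yields features as soon as the first file is parsed, without writing anything to disk. The function name iter_features and the .geojson suffix default are my own illustrative choices, and I'm assuming each member is either a FeatureCollection or a single GeoJSON object. Note that zipfile needs a seekable file, so this is not true streaming of the archive bytes themselves; it only avoids the extract-everything-first step.

```python
import glob
import json
import zipfile
from typing import Iterator


def iter_features(zip_pattern: str, suffix: str = ".geojson") -> Iterator[dict]:
    """Lazily yield GeoJSON features from every zip matching zip_pattern.

    Hypothetical helper for illustration: nothing is extracted to disk,
    and the first feature is available as soon as the first member is parsed.
    """
    for zip_path in sorted(glob.glob(zip_pattern)):
        with zipfile.ZipFile(zip_path) as archive:
            for info in archive.infolist():
                # Skip directories and anything that is not a GeoJSON member.
                if info.is_dir() or not info.filename.lower().endswith(suffix):
                    continue
                # archive.open() returns a file-like object that decompresses
                # the member on read, in memory.
                with archive.open(info) as member:
                    doc = json.load(member)
                # Assumption: a member is a FeatureCollection or a single object.
                if isinstance(doc, dict) and doc.get("type") == "FeatureCollection":
                    yield from doc.get("features", [])
                else:
                    yield doc
```

Something like `for feature in iter_features("data/*.zip"): ...` would then start handing features to the next step after the first member is read, instead of after every archive has been unpacked.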