Hi all, I have been "gifted" 500 GB of contours (as 20,000 ~20 MB shapefile tiles) and have been asked to create a single dataset (merge and dissolve/union) for loading into an ESRI database. I have read many of the FME articles on big data processing and Lightspeed Batch Processing...
So it's essentially a shapefile to FileGDB translation. I always like a challenge, so I have had a think about how to achieve this...
- Use FME to process the data in batches, i.e. merge and dissolve the first 500 shapefiles into FME's FFS format, then the next 500, and so on, then merge the FFSs into larger and larger FFSs and eventually write it all out to FGDB.
Pros - processes locally, easy to set up and leave running.
Cons - not sure whether merging and dissolving the data repeatedly is more efficient than doing it all at once, or whether it will just be slow.
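If I went the batching route I'd probably drive it with a small script rather than kicking off 40 runs by hand. A minimal sketch of what I have in mind is below - the workspace name (merge_dissolve.fmw) and the published parameters SRC_SHAPEFILES / DEST_FFS are placeholders for whatever the real workspace ends up exposing, and I'd still need to confirm how the SHAPE reader wants multiple files passed on the command line:

```python
# Hypothetical batch driver: chunk the tiles into groups of 500 and run an
# FME workspace (merge_dissolve.fmw - placeholder name) on each group.
# SRC_SHAPEFILES and DEST_FFS are assumed published parameters; adjust to
# whatever the real workspace actually exposes.
import glob
import subprocess

TILES = sorted(glob.glob(r"D:\contours\*.shp"))   # assumed input location
BATCH_SIZE = 500

for i in range(0, len(TILES), BATCH_SIZE):
    batch = TILES[i:i + BATCH_SIZE]
    dest = rf"D:\work\contours_batch_{i // BATCH_SIZE:03d}.ffs"
    # Each path quoted and space-separated - need to confirm this is how the
    # SHAPE reader's dataset parameter expects multiple files on the CLI.
    src = " ".join(f'"{p}"' for p in batch)
    subprocess.run(
        ["fme", "merge_dissolve.fmw",
         "--SRC_SHAPEFILES", src,
         "--DEST_FFS", dest],
        check=True,
    )
```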
- Use FME to create 20,000 GeoJSON CSVs for upload to Google BigQuery and use the ST_UNION_AGG function to crunch the numbers on Google's infrastructure.
Pros - apparently very powerful hardware and a serverless data warehouse that "can process petabytes of data".
Cons - have to shift ~1 TB of data to the cloud and get it back again (each 20 MB shapefile turns into a ~60 MB CSV for loading into BigQuery), then convert the returned WKT/WKB geometry back to FGDB.
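For what it's worth, the dissolve itself should just be a GROUP BY on the elevation attribute with ST_UNION_AGG over the geography column. A rough sketch using the google-cloud-bigquery client is below (mydataset.contour_tiles, elev and geog are placeholder names, and it assumes the CSVs are already loaded with the geometry ingested as GEOGRAPHY):

```python
# Rough BigQuery sketch - mydataset.contour_tiles, elev and geog are
# placeholder names; assumes the tiles are already loaded and the geometry
# column has been ingested as a GEOGRAPHY.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE TABLE mydataset.contours_dissolved AS
SELECT
  elev,
  ST_UNION_AGG(geog) AS geog   -- dissolve all tile fragments per elevation
FROM mydataset.contour_tiles
GROUP BY elev
"""

client.query(sql).result()  # wait for the job to finish
```

One extra wrinkle: BigQuery GEOGRAPHY is WGS84 only, so the contours would need reprojecting on the way in and back out.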
- Spin up a PostGIS database locally and push the data into it. Use the database to do the dissolve (ST_Union as an aggregate) and then spit the result out through FME for translation to FGDB.
Pros - local processing
Cons - setup time for PostGIS and unknown processing time for such a large dataset on a local machine.
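The dissolve in PostGIS would look much the same - ST_Union as an aggregate, grouped by elevation. A quick sketch via psycopg2 is below; the connection details and the contours / elev / geom names are placeholders, and it assumes the tiles have already been loaded into a single table (FME or ogr2ogr could handle that bit):

```python
# Rough PostGIS sketch - connection details and the contours/elev/geom
# names are placeholders; assumes the 20,000 tiles are already loaded
# into a single "contours" table.
import psycopg2

conn = psycopg2.connect("dbname=gis user=postgres password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE contours_dissolved AS
        SELECT elev,
               ST_Union(geom) AS geom   -- aggregate dissolve per elevation
        FROM contours
        GROUP BY elev;
    """)
conn.close()
```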
- Write something in Python with GDAL/ogr2ogr to process the data.
Pros - lightweight tools, so they should be memory efficient.
Cons - have to figure it out (can do, though).
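A bare-bones version of the same idea with the GDAL/OGR Python bindings might look like the sketch below. The ELEV field name and paths are assumptions, and holding every geometry for an elevation in memory clearly won't scale to the full 500 GB, so in practice it would need the same batching as option 1:

```python
# Bare-bones GDAL/OGR sketch - the ELEV field name and paths are assumptions,
# and accumulating geometries in memory like this would need batching to
# cope with the full 500 GB.
import glob
from collections import defaultdict
from osgeo import ogr

geoms_by_elev = defaultdict(list)
srs = None

# Gather geometries from every tile, grouped by elevation value
for shp in sorted(glob.glob(r"D:\contours\*.shp")):
    ds = ogr.Open(shp)
    layer = ds.GetLayer(0)
    if srs is None:
        srs = layer.GetSpatialRef()
    for feat in layer:
        geoms_by_elev[feat.GetField("ELEV")].append(feat.GetGeometryRef().Clone())

# Write one dissolved feature per elevation into a GeoPackage
out_ds = ogr.GetDriverByName("GPKG").CreateDataSource(r"D:\work\contours_dissolved.gpkg")
out_layer = out_ds.CreateLayer("contours", srs, ogr.wkbMultiLineString)
out_layer.CreateField(ogr.FieldDefn("ELEV", ogr.OFTReal))

for elev, geoms in geoms_by_elev.items():
    merged = geoms[0]
    for g in geoms[1:]:
        merged = merged.Union(g)          # incremental dissolve per elevation
    feat = ogr.Feature(out_layer.GetLayerDefn())
    feat.SetField("ELEV", elev)
    feat.SetGeometry(ogr.ForceToMultiLineString(merged))
    out_layer.CreateFeature(feat)

out_ds = None  # flush and close the GeoPackage
```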
Thanks for reading my thoughts, and please give me any feedback/tips/suggestions/prayers/other options.
Steve