Hi All, I have been "gifted" 500 GB of contours (as 20,000 20 MB Shapefile tiles) and have been asked to create a single dataset (merge and dissolve/union) for loading into an ESRI database. I have read many of the FME articles on big data processing and Lightspeed Batch Processing...

So it's essentially a Shapefile-to-FileGDB translation. I always like a challenge, so I have had a think about how to achieve this...

  • Use FME to process the data in batches, i.e. merge and union the first 500 shapefiles into FME's FFS format, then the next 500, and so on, then merge the FFS files into larger and larger FFSs, and eventually write it all out to FGDB.

Pros - process it locally; easy to set up and let run.

Cons - not sure if merging then dissolving over and over again is more efficient than doing it all at once, or whether it will be sloooow.
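For what it's worth, a minimal sketch of how the batching could be driven from Python by calling FME on the command line. The workspace name (merge_tiles.fmw) and its published parameter names are placeholders for a workspace you would author yourself:

```python
# Hypothetical batch driver: merge the Shapefile tiles 500 at a time into
# staged FFS files by invoking an FME workspace from the command line.
import glob
import subprocess

tiles = sorted(glob.glob(r"D:\contours\*.shp"))
BATCH = 500

for i in range(0, len(tiles), BATCH):
    batch = tiles[i:i + BATCH]
    out_ffs = rf"D:\staging\batch_{i // BATCH:04d}.ffs"
    # FME accepts multiple source files in one published parameter as a
    # space-separated, quoted list (parameter names are workspace-specific).
    src = " ".join(f'"{t}"' for t in batch)
    subprocess.run(
        ["fme", "merge_tiles.fmw",
         "--SourceDataset_ESRISHAPE", src,
         "--DestDataset_FFS", out_ffs],
        check=True,
    )
```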

 

 

  • Use FME to create 20,000 GeoJSON CSVs for upload to Google BigQuery and use the ST_UNION_AGG function to crunch the numbers on Google's infrastructure.

Pros - apparently very powerful hardware and a NoSQL database that "can process petabytes of data".

Cons - have to shift 1 TB of data to the cloud and get it back again (a 20 MB shapefile turns into a 60 MB CSV to load into BigQuery), then process the GeoJSON/WKB/WKT results back to FGDB.
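To give a sense of what the BigQuery leg might look like, here is a rough sketch using the google-cloud-bigquery Python client. The table and column names (my_project.contours.tiles, wkt_geom, elevation) are made up, and it assumes the tiles have already been loaded:

```python
# Sketch: dissolve contours per elevation in BigQuery with ST_UNION_AGG.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT
      elevation,
      ST_ASTEXT(ST_UNION_AGG(ST_GEOGFROMTEXT(wkt_geom))) AS merged_wkt
    FROM `my_project.contours.tiles`
    GROUP BY elevation
"""
for row in client.query(sql).result():
    print(row.elevation, row.merged_wkt[:80])
```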

 

  • Spin up a PostGIS database locally and push the data into it. Use the database to process the data (ST_Union, the PostGIS aggregate equivalent) and then spit it out into FME for translation into FGDB.

Pros - local processing

Cons - setup time for PostGIS, and unknown processing time for a large dataset on a local machine.
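Once the data is staged in PostGIS, the dissolve itself can be a single SQL statement. A minimal sketch with psycopg2, assuming a staging table contours_staging with geom and elevation columns; ST_LineMerge stitches the unioned linework back into the longest possible continuous lines:

```python
# Sketch: merge and dissolve the staged contours inside PostGIS.
import psycopg2

conn = psycopg2.connect("dbname=contours user=postgres")
with conn, conn.cursor() as cur:
    # ST_Union aggregates all geometry per elevation; ST_LineMerge then
    # sews the resulting multilinestrings into continuous lines.
    cur.execute("""
        CREATE TABLE contours_merged AS
        SELECT elevation,
               ST_LineMerge(ST_Union(geom)) AS geom
        FROM contours_staging
        GROUP BY elevation;
    """)
```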

 

  • Write something in Python with GDAL/ogr2ogr to process the data.

Pros - seems like lightweight tools would be memory efficient.

Cons - have to figure it out (doable, though).
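As a sketch of the loading half of this option, the loop below appends every tile into a single GeoPackage layer with ogr2ogr (paths and the layer name are placeholders). The dissolve could then follow as an -sql step or be handed back to FME:

```python
# Sketch: append all 20,000 Shapefile tiles into one GeoPackage layer.
import glob
import subprocess

gpkg = r"D:\staging\contours.gpkg"
for shp in sorted(glob.glob(r"D:\contours\*.shp")):
    # -append creates the GeoPackage on the first call and appends after.
    subprocess.run(
        ["ogr2ogr", "-f", "GPKG", "-append", "-nln", "contours", gpkg, shp],
        check=True,
    )
```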

 

Thanks for reading my thoughts, and please give me any feedback/tips/suggestions/prayers/other options.

Steve

@goatboy Are the contours represented as lines or areas? I'm asking because you mention dissolving. Contour lines would only need joining at the boundaries.

As an aside - you can make a significant impact on the size of your data by using the CurveFitter. Might be worth considering.


Hi Mark, they are PolylineZ (I can strip the Z off them if that is desirable). Do you think the LineJoiner is a better route, or what were you thinking?

Interesting about the CurveFitter. I will look into that and see if I can utilize it.



I think the LineJoiner may be the way to go here. Testing with 4 shapefiles, FME takes 25 seconds to aggregate them but only 4 seconds using the LineJoiner...


Another option to consider: spin up an FME Cloud Enterprise instance (48 processors and 192 GB RAM). Yes, it'll cost you $10/hour, but I've done it a few times to process large amounts of data. You still have to get the data up into the cloud, though you can do that as zipped shapefiles.


@goatboy LineCombiner (previously called LineJoiner) can be used to merge the contours across the tile boundaries with the Group By set to the contour elevation.

With so much data you might want to do this as a two-step process.

  1. Load all the Shapefile tiles into a staging table in a suitable database, e.g. PostGIS, GeoPackage, etc. On load, identify which contours do not close by comparing the start/end coordinates; these should be the contours that touch a tile boundary. Add a flag closed? = y|n.
  2. Use a SQL WHERE clause for each contour interval and process each interval in turn. Pass the closed contours straight through; pass the open-ended contours through the LineCombiner, as sketched below.
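A rough sketch of that staging pattern with psycopg2 against PostGIS; the table and column names (contours_staging, geom, elevation) are assumptions, and ST_IsClosed does the start/end coordinate comparison:

```python
# Sketch: flag closed contours, then process one interval at a time.
import psycopg2

conn = psycopg2.connect("dbname=contours user=postgres")
with conn, conn.cursor() as cur:
    # Step 1: flag contours whose start and end points coincide.
    cur.execute("""
        ALTER TABLE contours_staging ADD COLUMN IF NOT EXISTS closed boolean;
        UPDATE contours_staging SET closed = ST_IsClosed(geom);
    """)
    # Step 2: closed contours pass straight through; the open-ended ones
    # for each interval are handed to the LineCombiner.
    cur.execute("SELECT DISTINCT elevation FROM contours_staging ORDER BY 1")
    for (elev,) in cur.fetchall():
        cur.execute(
            "SELECT ST_AsText(geom) FROM contours_staging "
            "WHERE elevation = %s AND NOT closed",
            (elev,),
        )
        open_contours = cur.fetchall()
        # ...feed open_contours to the LineCombiner, grouped by elevation
```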
