Hi All, I have been "gifted" 500 GB of contours (as 20,000 20 MB Shapefile tiles) and have been asked to create a single dataset (merge and dissolve/union) for loading into an ESRI database. I have read many of the FME articles on big data processing and Lightspeed Batch Processing...

So it's essentially a Shapefile-to-FileGDB translation. I always like a challenge, so I have had a think about how to achieve this...

  • Use FME to process the data in batches, i.e. merge and union the first 500 shapefiles into FME's FFS format, then the next 500, and so on, then merge the FFS files into larger and larger FFSs, and eventually write it all out to FGDB.

Pros - process it locally; easy to set up and let run.

Cons - not sure if merging then dissolving over and over again is more efficient than doing it all at once, or whether it will be sloooow.
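For what it's worth, a minimal sketch of how the batching could be driven from Python by calling FME on the command line. The workspace name (merge_tiles.fmw) and its published parameter names are placeholders for a workspace you would author yourself:

```python
# Hypothetical batch driver: merge the Shapefile tiles 500 at a time into
# staged FFS files by invoking an FME workspace from the command line.
import glob
import subprocess

tiles = sorted(glob.glob(r"D:\contours\*.shp"))
BATCH = 500

for i in range(0, len(tiles), BATCH):
    batch = tiles[i:i + BATCH]
    out_ffs = rf"D:\staging\batch_{i // BATCH:04d}.ffs"
    # FME accepts multiple source files in one published parameter as a
    # space-separated, quoted list (parameter names are workspace-specific).
    src = " ".join(f'"{t}"' for t in batch)
    subprocess.run(
        ["fme", "merge_tiles.fmw",
         "--SourceDataset_ESRISHAPE", src,
         "--DestDataset_FFS", out_ffs],
        check=True,
    )
```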

 

 

  • Use FME to create 20,000 GeoJSON CSVs for upload to Google BigQuery and use the ST_UNION_AGG function to crunch the numbers on Google's infrastructure.

Pros - apparently very powerful hardware and a NoSQL database that "can process petabytes of data".

Cons - have to shift 1 TB of data to the cloud and get it back again (a 20 MB shapefile turns into a 60 MB CSV to load into BigQuery), then process the GeoJSON/WKB/WKT results back to FGDB.
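To give a sense of what the BigQuery leg might look like, here is a rough sketch using the google-cloud-bigquery Python client. The table and column names (my_project.contours.tiles, wkt_geom, elevation) are made up, and it assumes the tiles have already been loaded:

```python
# Sketch: dissolve contours per elevation in BigQuery with ST_UNION_AGG.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT
      elevation,
      ST_ASTEXT(ST_UNION_AGG(ST_GEOGFROMTEXT(wkt_geom))) AS merged_wkt
    FROM `my_project.contours.tiles`
    GROUP BY elevation
"""
for row in client.query(sql).result():
    print(row.elevation, row.merged_wkt[:80])
```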

 

  • Spin up a PostGIS database locally and push the data into it. Use the database to process the data (ST_Union, the PostGIS aggregate equivalent) and then spit it out into FME for translation into FGDB.

Pros - local processing

Cons - setup time for PostGIS, and unknown processing time for a large dataset on a local machine.
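Once the data is staged in PostGIS, the dissolve itself can be a single SQL statement. A minimal sketch with psycopg2, assuming a staging table contours_staging with geom and elevation columns; ST_LineMerge stitches the unioned linework back into the longest possible continuous lines:

```python
# Sketch: merge and dissolve the staged contours inside PostGIS.
import psycopg2

conn = psycopg2.connect("dbname=contours user=postgres")
with conn, conn.cursor() as cur:
    # ST_Union aggregates all geometry per elevation; ST_LineMerge then
    # sews the resulting multilinestrings into continuous lines.
    cur.execute("""
        CREATE TABLE contours_merged AS
        SELECT elevation,
               ST_LineMerge(ST_Union(geom)) AS geom
        FROM contours_staging
        GROUP BY elevation;
    """)
```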

 

  • Write something in Python with GDAL/ogr2ogr to process the data.

Pros - seems like lightweight tools would be memory efficient.

Cons - have to figure it out (doable, though).
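As a sketch of the loading half of this option, the loop below appends every tile into a single GeoPackage layer with ogr2ogr (paths and the layer name are placeholders). The dissolve could then follow as an -sql step or be handed back to FME:

```python
# Sketch: append all 20,000 Shapefile tiles into one GeoPackage layer.
import glob
import subprocess

gpkg = r"D:\staging\contours.gpkg"
for shp in sorted(glob.glob(r"D:\contours\*.shp")):
    # -append creates the GeoPackage on the first call and appends after.
    subprocess.run(
        ["ogr2ogr", "-f", "GPKG", "-append", "-nln", "contours", gpkg, shp],
        check=True,
    )
```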

 

Thanks for reading my thoughts, and please give me any feedback/tips/suggestions/prayers/other options.

Steve

@goatboy Are the contours represented as lines or areas? I'm asking because you mention dissolving. Contour lines would only need joining at the boundaries.

As an aside - you can make a significant impact on the size of your data by using the CurveFitter. Might be worth considering.


Hi Mark, they are PolylineZ (I can strip the Z off them if that is desirable). Do you think the LineJoiner is a better route, or what were you thinking?

Interesting about the CurveFitter. I will look into that and see if I can utilize it.



I think the LineJoiner may be the way to go here. Testing with 4 shapefiles, FME takes 25 seconds to aggregate them but only 4 seconds using the LineJoiner...


Another option to consider: spin up an FME Cloud Enterprise instance (48 processors and 192 GB RAM). Yes, it'll cost you $10/hour, but I've done it a few times to process large amounts of data. You still have to get the data up into the cloud, though you can do that as zipped shapefiles.


@goatboy LineCombiner (previously called LineJoiner) can be used to merge the contours across the tile boundaries with the Group By set to the contour elevation.

With so much data you might want to do this as a two-step process.

  1. Load all the Shapefile tiles into a staging table in a suitable database, e.g. PostGIS, GeoPackage, etc. On load, identify which contours do not close by comparing the start/end coordinates; these should be the contours that touch a tile boundary. Add a flag closed? = y|n.
  2. Use a SQL WHERE clause for each contour interval and process each interval in turn. Pass the closed contours straight through; pass the open-ended contours through the LineCombiner, as sketched below.
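A rough sketch of that staging pattern with psycopg2 against PostGIS; the table and column names (contours_staging, geom, elevation) are assumptions, and ST_IsClosed does the start/end coordinate comparison:

```python
# Sketch: flag closed contours, then process one interval at a time.
import psycopg2

conn = psycopg2.connect("dbname=contours user=postgres")
with conn, conn.cursor() as cur:
    # Step 1: flag contours whose start and end points coincide.
    cur.execute("""
        ALTER TABLE contours_staging ADD COLUMN IF NOT EXISTS closed boolean;
        UPDATE contours_staging SET closed = ST_IsClosed(geom);
    """)
    # Step 2: closed contours pass straight through; the open-ended ones
    # for each interval are handed to the LineCombiner.
    cur.execute("SELECT DISTINCT elevation FROM contours_staging ORDER BY 1")
    for (elev,) in cur.fetchall():
        cur.execute(
            "SELECT ST_AsText(geom) FROM contours_staging "
            "WHERE elevation = %s AND NOT closed",
            (elev,),
        )
        open_contours = cur.fetchall()
        # ...feed open_contours to the LineCombiner, grouped by elevation
```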
