Question

localized operation on big datasets

  • 29 August 2018
  • 4 replies
  • 1 view

Badge

Hi All,

I want to count the number of houses that intersect (gas) lines, in order to do some quality checking.

Most gas pipes intersect only one house; any pipe with multiple intersections is suspect.

The gas pipes are of limited length (typically 30 meters).

However, I need to do this for > 1 million pipes against > 3 million houses, and that crashes my PC due to lack of memory.

I'm looking for a way to process this in chunks: divide the country into some grid, make the grid squares slightly overlapping (say 50 meters), and then intersect only the houses and gas pipes within each square. There is no need to check a pipe against houses in another city...
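To illustrate the kind of chunking I have in mind, here is a rough sketch in plain Python (not FME; bounding boxes only, with a made-up cell size and made-up data structures), just to show the idea of binning features into overlapping grid cells and comparing only within a cell:

```python
from collections import defaultdict

CELL_SIZE = 5000.0   # grid cell size in meters (an assumption for illustration)
OVERLAP = 50.0       # overlap so short pipes near a cell edge are not missed

def bbox_intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cells_for(bbox):
    """Yield the (col, row) indices of every grid cell whose expanded
    extent touches the feature's bounding box."""
    xmin, ymin, xmax, ymax = bbox
    c0, c1 = int((xmin - OVERLAP) // CELL_SIZE), int((xmax + OVERLAP) // CELL_SIZE)
    r0, r1 = int((ymin - OVERLAP) // CELL_SIZE), int((ymax + OVERLAP) // CELL_SIZE)
    for c in range(c0, c1 + 1):
        for r in range(r0, r1 + 1):
            yield c, r

def count_candidate_houses(pipes, houses):
    """pipes / houses: dicts of id -> bounding box.
    Returns pipe id -> number of distinct houses whose box overlaps it,
    computed cell by cell so only local features are compared."""
    houses_per_cell = defaultdict(list)
    for hid, hbox in houses.items():
        for cell in cells_for(hbox):
            houses_per_cell[cell].append((hid, hbox))

    hits = defaultdict(set)   # sets avoid double-counting pairs seen in two cells
    for pid, pbox in pipes.items():
        for cell in cells_for(pbox):
            for hid, hbox in houses_per_cell.get(cell, ()):
                if bbox_intersects(pbox, hbox):
                    hits[pid].add(hid)
    return {pid: len(found) for pid, found in hits.items()}
```

Pipes with a count greater than 1 would be the suspect ones.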

How can I achieve this localized processing (using e.g. the LineOnAreaOverlayer)?

I tried the Clipper with a hand-made "grid" that divides the country into 5 chunks, but I'm forced to process these slabs of land manually, running the workspace 5 times and combining the output afterwards.

I feel this must have been dealt with more elegantly and with less manual labor. Any suggestions on how to go about this? Thanks in advance for any suggestions you might have.

Regards,

Ronald van Aalst


4 replies

Userlevel 1
Badge +21

What format is your input data held in?

Userlevel 5
Badge +25

There are a couple of things you can consider:

  • Parallel processing: if your houses and pipelines both have a grid attribute, you can use it as the group for parallel processing. The work will be divided over multiple processor cores, and objects with a grid ID of 1 will not be compared against any other grid IDs. I doubt this will help you much, though, as you're running into memory issues.
  • Use a SpatialFilter instead of the LineOnAreaOverlayer. Setting it to "Filters First" prevents it from having to cache both datasets in memory before starting the operation, which can improve your memory usage.
  • Set up a master workspace with a WorkspaceRunner to run the processing workspace for every grid cell; see the command-line sketch after this list for the same batch idea. At least this will save you the manual labor.
  • Probably not practical due to the data being on your own network, but I'm mentioning it anyway: take it to FME Cloud! You can have a Premium instance (64 GB RAM) for $6/hour or an Enterprise instance (192 GB RAM) for $10/hour.
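If you prefer scripting over building a master workspace, the same batch pattern can also be driven from outside FME. A minimal sketch, assuming a processing workspace called process_cell.fmw (hypothetical name) that exposes published parameters XMIN, YMIN, XMAX and YMAX, and that the fme command is on your path:

```python
# Run the processing workspace once per grid cell from the command line.
# Workspace name, parameter names and the extent below are assumptions.
import subprocess

CELL = 5000.0     # cell size in meters (assumption)
OVERLAP = 50.0    # overlap between neighbouring cells

# Country-wide extent (placeholder numbers; replace with your own).
X0, Y0, X1, Y1 = 0.0, 300000.0, 280000.0, 620000.0

x = X0
while x < X1:
    y = Y0
    while y < Y1:
        subprocess.run(
            ["fme", "process_cell.fmw",
             "--XMIN", str(x - OVERLAP), "--YMIN", str(y - OVERLAP),
             "--XMAX", str(x + CELL + OVERLAP), "--YMAX", str(y + CELL + OVERLAP)],
            check=True,
        )
        y += CELL
    x += CELL
```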
Badge

@redgeographics, thanks for your suggestions. I will investigate numbers 2 and 3, as the problem is memory-bound. Unfortunately number 4, while very valid, would have a huge cost in the meetings required to gain approval for exporting the data (looking at you, GDPR...).

@egomm, the chunks of land are in a Shapefile, the gas pipes are in an Oracle table, and the houses are in an FFS file (distilled from an Oracle table of 12 million entries).

I would welcome some tips on how to generate a grid in FME (from e.g. an extent or bounding box) and how to apply a (collection of) grid cells to my intersection operation.

Userlevel 1
Badge +21

The 2DGridCreator or the 2DGridAccumulator can create your grid.

If your data is in Oracle and properly spatially indexed, you can save a lot of time by reading in only the data for one grid square at a time and performing the spatial comparison on that subset. Then, as mentioned, you can set up a master workspace to run each grid square through, one (or 7) at a time.
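For example, per grid square you could run something along these lines (a sketch only: it assumes Oracle Spatial with an SDO_GEOMETRY column, a spatial index on it, cx_Oracle as the driver, and placeholder table/column names):

```python
# Fetch only the pipes that fall inside one (slightly expanded) grid square,
# letting the Oracle spatial index do the filtering. Names are placeholders.
import cx_Oracle  # assumed driver; connection setup is omitted

SQL = """
SELECT p.pipe_id
FROM   gas_pipes p
WHERE  SDO_FILTER(
         p.geom,
         SDO_GEOMETRY(2003, NULL, NULL,                 -- NULL SRID assumed; use your data's SRID
                      SDO_ELEM_INFO_ARRAY(1, 1003, 3),  -- optimized rectangle
                      SDO_ORDINATE_ARRAY(:xmin, :ymin, :xmax, :ymax))
       ) = 'TRUE'
"""

def pipes_in_cell(connection, xmin, ymin, xmax, ymax, overlap=50.0):
    """Return the pipe ids whose geometry interacts with the expanded cell window."""
    with connection.cursor() as cur:
        cur.execute(SQL, xmin=xmin - overlap, ymin=ymin - overlap,
                         xmax=xmax + overlap, ymax=ymax + overlap)
        return [row[0] for row in cur]
```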
