Question

Strategies to process huge datasets?

  • 12 April 2021
  • 5 replies
  • 66 views

Badge +14
  • Contributor
  • 120 replies

This is more of a hypothetical question. I don't have a problem that needs solving, but I am pondering the possibilities and limits of FME.

What are the strategies for processing huge datasets? I imagine there is a limit to the resource management that FME can do on its own. At some point the software might need some "help" to break the data down into manageable chunks. How is this done? Tiling? Memory management? Renting a supercomputer? Something else?

Trying to think of an example, I came up with the following: how would one go about generalizing the contour lines for a whole country in one run without watching life go by?

Please note that I am not asking how to generalize. My question is what strategies there are to speed up huge processing operations like the example above.


5 replies

Badge +20

For very large jobs you can bring in your data by chunks or tiles.

Use a FeatureReader instead of a conventional reader, and use the Spatial Filter parameter of the FeatureReader.

Call the workspace via a WorkspaceRunner from a "coordinating" workspace.

Do as little processing in each workspace as possible and serialize workspaces via WorkspaceRunner.

Use VMs on Azure or AWS to process the data or try FME Cloud. VMs are pretty inexpensive these days.
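To make the tiling idea concrete, here is a minimal sketch in plain Python of how a coordinating step might compute the tile bounding boxes it hands to each run. It does not call any FME API; the function name `make_tiles`, the extent and the tile size are assumptions for illustration only.

```python
import math
from typing import List, Tuple

# (min_x, min_y, max_x, max_y) in the dataset's coordinate units
BBox = Tuple[float, float, float, float]


def make_tiles(extent: BBox, tile_size: float) -> List[BBox]:
    """Cover the extent with square tiles; tiles on the far edges are clipped."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    min_x, min_y, max_x, max_y = extent
    cols = math.ceil((max_x - min_x) / tile_size)
    rows = math.ceil((max_y - min_y) / tile_size)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0 = min_x + c * tile_size
            y0 = min_y + r * tile_size
            tiles.append((x0, y0,
                          min(x0 + tile_size, max_x),
                          min(y0 + tile_size, max_y)))
    return tiles


if __name__ == "__main__":
    # Hypothetical extent: 250 x 100 units, tiled at 100 units
    for tile in make_tiles((0.0, 0.0, 250.0, 100.0), 100.0):
        print(tile)
```

Each bounding box would then be passed to the child workspace as parameters, for example to drive the spatial filter.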

Badge +22

I would, if at all possible, break the data into tiles. As @caracadrian suggests, you don't need to tile your original data.

If your process does not require knowledge of other features, you can simply break the data up using the features-to-read and start-feature settings on your reader.
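As a rough illustration of splitting by feature count, the sketch below computes (start, count) pairs that a coordinating step could pass to each run. The function name `feature_ranges` is hypothetical, and the 0-based start index is an assumption; check how your reader numbers its start feature before relying on it.

```python
from typing import List, Tuple


def feature_ranges(total_features: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (start_feature, features_to_read) pairs covering all features.

    Assumption: start_feature is 0-based. Adjust if your reader counts from 1.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(chunk_size, total_features - start))
        for start in range(0, total_features, chunk_size)
    ]


if __name__ == "__main__":
    # Hypothetical dataset of 10 features read 4 at a time
    print(feature_ranges(10, 4))
```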

If you do need to interact with multiple features, then you can either use the Spatial Filter parameter on the FeatureReader or publish the bounding box on a regular reader.

In either case you have a main workspace that reads in the tiles (or generates them on the fly) and then a WorkspaceRunner or FMEServerJobSubmitter that processes the individual tiles, using parameters to control which tile to process.

 

I disagree that the helper workspace should do as little as possible, with many of them chained together, as I/O can be very expensive. You do want to make sure that your workspace is not approaching your memory limit; ideally it would use only 1/4 or 1/8 of the RAM, so you can take advantage of multiple concurrent processes. The WorkspaceRunner is effectively limited to 7 child processes (8 total including the main workspace), and FME Server by the number of engines.
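The sizing rule above comes down to simple arithmetic, sketched here in Python. The function name `max_parallel_jobs` and the example figures are assumptions; the default cap of 7 is the WorkspaceRunner child-process limit mentioned in this reply, and on FME Server you would pass the number of engines instead.

```python
def max_parallel_jobs(total_ram_gb: float,
                      peak_job_ram_gb: float,
                      process_cap: int = 7) -> int:
    """Estimate how many child jobs can run at once.

    The result is bounded by memory (how many jobs fit in RAM at their
    peak usage) and by the launcher's process cap. At least one job is
    always allowed, even if it does not comfortably fit.
    """
    if peak_job_ram_gb <= 0:
        raise ValueError("peak_job_ram_gb must be positive")
    fits_in_ram = int(total_ram_gb // peak_job_ram_gb)
    return max(1, min(fits_in_ram, process_cap))


if __name__ == "__main__":
    # Hypothetical machine with 32 GB where each tile job peaks at 8 GB
    print(max_parallel_jobs(32, 8))
```

If a job peaks at 1/4 of the RAM, four can run side by side; at 1/8 the memory would allow eight, but the cap of 7 applies.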

 

The other major benefit to tiling your data is that if there is a problem, you do not have to restart the entire process from the beginning, just from where things went wrong. We once had a major power failure (longer than the UPS could handle) 600 hours into a process.
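One way to get that restart behaviour is to record each finished tile as you go. The following is a minimal sketch in plain Python, not an FME feature: `run_with_checkpoint`, the JSON checkpoint file and the `process_tile` callback are all hypothetical names standing in for whatever launches your per-tile workspace.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Set


def run_with_checkpoint(tile_ids: Iterable[str],
                        process_tile: Callable[[str], None],
                        checkpoint_path: str) -> Set[str]:
    """Process tiles in order, skipping any already listed in the checkpoint."""
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for tile_id in tile_ids:
        if tile_id in done:
            continue  # finished in an earlier run
        process_tile(tile_id)  # may raise; earlier tiles are already saved
        done.add(tile_id)
        # Rewrite the checkpoint after every tile so a crash loses at most one
        path.write_text(json.dumps(sorted(done)))
    return done


if __name__ == "__main__":
    run_with_checkpoint(["t1", "t2", "t3"], print, "tiles_done.json")
```

After a failure, running the same call again picks up at the first tile that is not in the checkpoint file.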

Badge +2

@aron It is a great question, and there will never be a single answer because it depends on the task at hand.

  • Let-the-database-do-your-work: any time you can push processing onto your database, it usually helps, e.g. for table joins or to reduce the volume of data read (rather than filtering in FME). Tutorial with some ideas.
  • Blocking transformations: if you're not using blocking transformations then you can handle larger datasets. For example, in Generalizer, if Preserve Shared Boundaries = Yes, then Generalizer has to hold all features to find the areas' common edges, generalize them and then rebuild the areas. For your contour example you wouldn't need that, so you'd be non-blocking and have faster throughput.
  • Process smaller chunks of data, as suggested by both @jdh and @caracadrian. This could be by tile, region or attribution. So sticking with your contour example, it could be by county or by contour elevation.
    • WorkspaceRunner, to run the different regions in parallel (to a degree, using Maximum Concurrent FME Processes). Some thoughts & gotchas on parallel processing here.
    • FME Server Automations now have great tools for parallelization (not paralyzation!)
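To illustrate the first and third points together, here is a small sketch using Python's built-in `sqlite3` module: the join and the filter run inside the database, so only one county's contours ever reach the client. The table and column names (`contours`, `counties`, `county_id`, `elevation`) and the sample rows are invented for the example; your schema and database will differ.

```python
import sqlite3
from typing import List, Tuple


def contours_for_county(conn: sqlite3.Connection, county: str) -> List[Tuple[int, float]]:
    """Let the database join and filter, returning only the rows needed."""
    sql = """
        SELECT c.id, c.elevation
        FROM contours AS c
        JOIN counties AS k ON k.id = c.county_id
        WHERE k.name = ?
        ORDER BY c.id
    """
    return conn.execute(sql, (county,)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE counties (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE contours (id INTEGER PRIMARY KEY,
                               county_id INTEGER,
                               elevation REAL);
        INSERT INTO counties VALUES (1, 'North'), (2, 'South');
        INSERT INTO contours VALUES (1, 1, 100.0), (2, 1, 110.0), (3, 2, 100.0);
    """)
    return conn


if __name__ == "__main__":
    print(contours_for_county(demo_connection(), "North"))
```

The same query, parameterised by county name, could serve as the per-chunk read for each parallel run.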

Badge +9

Thanks for the question and the great suggestions. I am trying to process a raster dataset with 145 million cells and just ran a 'quick' test, which took over 15 hours to complete but was a good exercise to see how I can improve it.

 

Should I be tiling the raster in the database or is that something that can be done in FME?

Badge +14

There are others who can give you a more knowledgeable reply, but I have tried the Tiler a few times with good results. What I did was break the dataset down into tiles and then run the desired transformers on these in order. If needed (e.g. using a Dissolver or similar), one could run further transformers on the tiled output to get everything back into one "set".
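The split, process, merge pattern described here can be sketched in plain Python on a small list-of-lists raster. This is only an illustration of the idea, not how the Tiler works internally. The function names are hypothetical, the grid is assumed to be non-empty and rectangular, and the per-cell function stands in for whatever processing you run on each tile. Operations that look at neighbouring cells would need overlapping tiles, which this sketch does not handle.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[float]]


def split_into_tiles(grid: Grid, tile_rows: int, tile_cols: int) -> Dict[Tuple[int, int], Grid]:
    """Cut the grid into tiles keyed by their (row, column) offset."""
    tiles = {}
    for r in range(0, len(grid), tile_rows):
        for c in range(0, len(grid[0]), tile_cols):
            tiles[(r, c)] = [row[c:c + tile_cols] for row in grid[r:r + tile_rows]]
    return tiles


def merge_tiles(tiles: Dict[Tuple[int, int], Grid], n_rows: int, n_cols: int) -> Grid:
    """Write each tile back at its offset to rebuild one grid."""
    out = [[None] * n_cols for _ in range(n_rows)]
    for (r, c), tile in tiles.items():
        for i, row in enumerate(tile):
            out[r + i][c:c + len(row)] = row
    return out


def process_tiled(grid: Grid, tile_rows: int, tile_cols: int,
                  cell_fn: Callable[[float], float]) -> Grid:
    """Apply cell_fn tile by tile, then merge the results into one grid."""
    tiles = split_into_tiles(grid, tile_rows, tile_cols)
    processed = {key: [[cell_fn(v) for v in row] for row in tile]
                 for key, tile in tiles.items()}
    return merge_tiles(processed, len(grid), len(grid[0]))


if __name__ == "__main__":
    raster = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(process_tiled(raster, 2, 2, lambda v: v * 10))
```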
