For very large jobs you can bring your data in by chunks or tiles.
Use a FeatureReader instead of a conventional reader and use its Spatial Filter parameter.
Call the workspace via a WorkspaceRunner from a "coordinating" workspace.
Do as little processing in each workspace as possible and serialize the workspaces via WorkspaceRunner.
Use VMs on Azure or AWS to process the data, or try FME Cloud. VMs are pretty inexpensive these days.
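If the "coordinating" piece ends up being a script rather than a workspace, something along these lines is one way to serialize the chunk runs. This is a minimal sketch only, assuming FME's Python (fmeobjects) is available; `process_chunk.fmw` and its `CHUNK_ID` published parameter are placeholders for your own worker workspace:

```python
# Sketch: run one worker workspace per chunk, one after the other.
# Assumes fmeobjects (ships with FME) and a hypothetical worker
# workspace process_chunk.fmw with a published parameter CHUNK_ID.
import fmeobjects

runner = fmeobjects.FMEWorkspaceRunner()

for chunk_id in range(10):  # one run per chunk, executed serially
    try:
        runner.runWithParameters(
            r"C:\fme\process_chunk.fmw",
            {"CHUNK_ID": str(chunk_id)},
        )
        print(f"Chunk {chunk_id} finished")
    except fmeobjects.FMEException as e:
        print(f"Chunk {chunk_id} failed: {e}")
        break  # stop here so the remaining chunks can be rerun later
```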
I would, if at all possible, break the data into tiles. As @caracadrian suggests, you don't need to tile your original data.
If your process does not require knowledge of other features, you can simply break the data up using the Features to Read and Start Feature parameters on your reader.
If you do need to interact with multiple features, then you can either use the Spatial Filter parameter on the FeatureReader or publish the bounding box parameters on a regular reader.
In either case you have a main workspace that reads in the tiles (or generates them on the fly) and then a WorkspaceRunner or FMEServerJobSubmitter that processes the individual tiles, using parameters to control which tile to process.
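For the bounding-box variant, the tile extents can be generated up front and handed to each run as parameter values. A rough sketch; the MIN_X/MIN_Y/MAX_X/MAX_Y names and the extent numbers are placeholders for whatever published parameters your worker workspace actually exposes:

```python
# Sketch: split one overall extent into nx * ny tile extents and emit
# each as a dict of (hypothetical) published parameter values.
def make_tiles(xmin, ymin, xmax, ymax, nx, ny):
    """Yield one parameter dict per tile."""
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    for i in range(nx):
        for j in range(ny):
            yield {
                "MIN_X": str(xmin + i * dx),
                "MIN_Y": str(ymin + j * dy),
                "MAX_X": str(xmin + (i + 1) * dx),
                "MAX_Y": str(ymin + (j + 1) * dy),
            }

# Each dict would become one job: pass it to the WorkspaceRunner /
# FMEServerJobSubmitter as that run's published parameter values.
for params in make_tiles(500000, 5400000, 600000, 5500000, 4, 4):
    print(params)
```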
I disagree that the helper workspace should do as little as possible, with many of them chained together, as I/O can be very expensive. You do want to make sure that your workspace is not approaching your memory limit; ideally you would only be using 1/4 or 1/8 of the RAM, so you can take advantage of multiple concurrent processes. The WorkspaceRunner is effectively limited to 7 child processes (8 total, including the main workspace) and FME Server by the number of engines.
The other major benefit of tiling your data is that if there is a problem, you do not have to restart the entire process from the beginning, just from where things went wrong. We once had a major power failure (longer than the UPS could handle) 600 hours into a process.
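That restart-from-failure idea can be as simple as writing each finished tile ID to a file and skipping those IDs on the next run. A small sketch, with run_tile() standing in for however you actually launch the per-tile job:

```python
# Sketch: checkpoint completed tiles so a rerun picks up where it stopped.
import os

DONE_FILE = "tiles_done.txt"

def load_done():
    """Return the set of tile IDs already completed in earlier runs."""
    if not os.path.exists(DONE_FILE):
        return set()
    with open(DONE_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def run_tile(tile_id):
    # Placeholder: launch the per-tile workspace here (WorkspaceRunner,
    # FME Server job, or command line) and raise on failure.
    print(f"processing tile {tile_id}")

done = load_done()
tile_ids = [f"{r}_{c}" for r in range(4) for c in range(4)]

for tile_id in tile_ids:
    if tile_id in done:
        continue  # already processed in an earlier run
    run_tile(tile_id)
    with open(DONE_FILE, "a") as f:
        f.write(tile_id + "\n")  # checkpoint only after success
```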
@aron It is a great question and there will never be a single answer because it will depend on the task at hand.
- Let the database do your work: any time you can push processing onto your database it usually helps, e.g. table joins, or reducing the volume of data read (rather than filtering in FME). Tutorial with some ideas.
- Blocking transformers: if you're not using blocking transformers then you can handle larger datasets. For example, in the Generalizer, if Preserve Shared Boundaries = Yes, then the Generalizer has to hold all the features to find the edges shared between areas, generalize them, and then rebuild the areas. For your contour example you wouldn't need that, so you'd have no blocking and faster throughput.
- Process smaller chunks of data, as suggested by both @jdh and @caracadrian. This could be by tile, region or attribution. So, sticking with your contour example, it could be by county or by contour elevation.
- WorkspaceRunner, to run the different regions in parallel (to a degree, using Maximum Concurrent FME Processes). Some thoughts & gotchas on parallel processing here; see the driver-script sketch after this list.
- FME Server Automations now have great tools for parallelization (not paralyzation!)
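To make the parallel-run idea concrete, here is a hedged sketch of a small driver script that launches one FME process per region from the command line, capped in the same spirit as Maximum Concurrent FME Processes. The fme.exe path, the worker workspace, the REGION parameter name and the region list are all assumptions to adapt:

```python
# Sketch: run several regions in parallel, at most MAX_CONCURRENT at a time.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

FME_EXE = r"C:\Program Files\FME\fme.exe"   # adjust to your install
WORKSPACE = r"C:\fme\process_region.fmw"    # hypothetical worker workspace
REGIONS = ["north", "south", "east", "west"]
MAX_CONCURRENT = 4                          # leave RAM headroom per process

def run_region(region):
    # FME's command line accepts published parameters as --NAME value pairs
    cmd = [FME_EXE, WORKSPACE, "--REGION", region]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return region, result.returncode

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    futures = [pool.submit(run_region, r) for r in REGIONS]
    for fut in as_completed(futures):
        region, rc = fut.result()
        print(f"{region}: {'ok' if rc == 0 else 'FAILED'}")
```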
Thanks for the question and the great suggestions. I am trying to process a raster dataset with 145 million cells and just ran a 'quick' test which took over 15 hours to complete, but it was a good exercise just to see how I can improve it.
Should I be tiling the raster in the database or is that something that can be done in FME?
There are others who can give you a more knowledgeable reply, but I have tried the Tiler a few times with good results. What I did was break the dataset down into tiles and then run the desired transformers on these in order. If needed (e.g. when using a Dissolver or similar), one could run the transformers on the tiled output to get everything back into one "set".