
I have lots of files (read into FME one by one), each of which contains details of many polylines. This detail consists of:

1) A LineID

2) A set of coordinates that build the line.

It looks like the LineIDs ARE sorted, but the coordinates are supplied with a vertex number and these are definitely not sorted.

I have to use a Sorter, but this is a major bottleneck: even with a few hundred thousand rows of data, the memory usage climbs unnervingly. When I throw the real data at the process, I'm not confident it won't fall over at some point.

What I'd like to do is sort coordinates per file: every time a new filename arrives on a feature, that would let the Sorter know to release the already-sorted features from the previous filename and let them pass on through the process.

I tried to create a GroupedSorter by wrapping a Sorter up as a custom transformer and exposing the Parallel Processing Group parameter, so that I could group by fme_basename. This hasn't worked; all the features are still trapped until everything has been sorted.

Please help me avoid this; it's giving me sleepless nights.

Hi David,

How about using a database as an intermediate format?

Store all the data in a table and use SQL to sort it.

Databases are usually pretty good at this kind of task.

Erik


Before I give the proposal below, let me assure you that we are working very hard to make this type of thing "just happen"... but until then, here is some advice from the dev team:

PointCloud big data techniques could help. Try turning the data into a point cloud first by using a PointCloudCombiner, grouping by LineID, and preserving the vertex number as a component (in addition to the X, Y, and Z). Then the PointCloudSorter could sort each group individually. The point cloud can then be broken up afterwards using the PointCloudCoercer. Follow that with a PointConnector and you should be golden.
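
If it helps to see the logic outside FME, here is a rough plain-Python sketch of what that transformer chain accomplishes for each group (it is not the point cloud machinery itself, just the equivalent sort-and-reconnect logic). The field names (LineID, vertex_no, x, y, z) are stand-ins for illustration, not the actual schema:

```python
# Sketch only: the per-LineID vertex sort that the PointCloudSorter /
# PointCloudCoercer / PointConnector chain performs, in plain Python.
# Field names here are assumptions for illustration.
from collections import defaultdict

def sort_vertices(rows):
    """rows: iterable of dicts, one per vertex, in arbitrary order."""
    lines = defaultdict(list)
    for row in rows:
        lines[row["LineID"]].append(row)  # group by LineID (PointCloudCombiner's job)
    for line_id, vertices in lines.items():
        # order each group by its vertex number (PointCloudSorter's job)
        vertices.sort(key=lambda v: v["vertex_no"])
        coords = [(v["x"], v["y"], v["z"]) for v in vertices]
        yield line_id, coords  # build the polyline (PointConnector's job)
```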


Do let us know if this helps... (in general, turning big data problems into Point Cloud exercises has been a winner in some of the cases I've seen.)


I was thinking that too, Erik: the InlineQuerier, perhaps... but unfortunately we save the FME features, so it will also use memory. These days, perhaps write everything out to SQLite with a FeatureWriter and then follow that with a SQLCreator to read it all back in; that would exercise SQLite's ability to sort. It would be interesting to compare that against the point cloud technique above.
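
As a minimal sketch of that FeatureWriter-to-SQLite-to-SQLCreator idea in plain Python: write the vertices to an on-disk table, then stream them back with an ORDER BY so SQLite does the sorting. The schema and names here are illustrative assumptions, not the real attribute names:

```python
# Sketch of the SQLite round-trip: disk-backed sort instead of in-memory.
import sqlite3

rows = [("L1", 2, 10.0, 5.0), ("L1", 1, 0.0, 0.0)]  # stand-in for the incoming features

con = sqlite3.connect("vertices.db")  # on disk, so memory stays flat
con.execute("CREATE TABLE IF NOT EXISTS vertices (line_id TEXT, vertex_no INTEGER, x REAL, y REAL)")
con.executemany("INSERT INTO vertices VALUES (?, ?, ?, ?)", rows)  # the FeatureWriter step
con.commit()
con.execute("CREATE INDEX IF NOT EXISTS ix ON vertices (line_id, vertex_no)")

# The SQLCreator step: let SQLite do the sort and stream rows back in order.
for line_id, vertex_no, x, y in con.execute(
        "SELECT line_id, vertex_no, x, y FROM vertices ORDER BY line_id, vertex_no"):
    pass  # rebuild each polyline as its vertices arrive in order
con.close()
```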


Why not use a filename reader and pass each file to a sorter workspace separately using a WorkspaceRunner?
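
For anyone who would rather drive that per-file pattern from a script than a parent workspace (it is also roughly what the batch approach further down this thread does), here is a hedged sketch using the fme command line; the workspace name, source folder, and published parameter name are made up for illustration:

```python
# Run the sorting workspace once per file, so each job's memory is
# released when the process exits. Names below are assumptions.
import subprocess
from pathlib import Path

for src in sorted(Path("incoming").glob("*.csv")):
    subprocess.run(
        ["fme", "sort_lines.fmw", "--SourceDataset", str(src)],
        check=True,  # stop the batch if one file fails
    )
```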



Hi Dale - Thanks for the detail. I have replaced the Sorter with the technique you've described and it does indeed work. The technique isn't quite ready for mass consumption, because retaining attributes needs a bit of effort (especially when you have lots of them), but it's good to put it into practice.

Using just the Sorter, the memory usage climbs because it's a blocker; using the point cloud approach, the memory use is a flat line because the features are released point cloud by point cloud into the rest of the process. This certainly gives us an option if we want to run all the data through the process at the same time, as, fingers crossed, the memory use wouldn't become unmanageable. But... there's always a but... using the Sorter on a single sample file, the whole process took just over 1 minute; using the point cloud approach, despite the more consistent memory use, it took 17 minutes. I can only assume this is because in this instance we're constructing so many small point clouds.




So for now we think we can get away with running the process in batch: we're controlling the feeding of data to the FMW with a batch file, calling multiple jobs in series and releasing the memory at the end of each load. It's not quite where we wanted to be, but it is at least possible thanks to how flexible FME is and the many options there are to run it.

I dream of the day when managing memory utilisation is a thing of the past, so it's good to hear you have this one in the cross-hairs!

Thanks again, Dave




Thanks Mark - Presorting the data is our "get out of jail free" card... we're going to keep that one in our back pocket until we really have to use it. Thanks for the tip.

