
I have lots of files (read into FME one by one), each of which contains details of many polylines. This detail consists of:

1) A LineID

2) A set of coordinates that build the line.

It looks like the LineIDs ARE sorted, but the coordinates are supplied with a vertex number and these are definitely not sorted.

I have to use a Sorter, but this is a major bottleneck: even with a few hundred thousand rows of data, the memory usage climbs unnervingly. When I throw the real data at the process, I'm not confident it won't fall over at some point.

What I'd like to do is sort coordinates per file: every time a new filename arrives on a feature, that would let the Sorter know to release the already-sorted features from the previous filename and let them pass on through the process.

I tried to create a GroupedSorter by wrapping a Sorter up as a custom transformer and exposing the Parallel Processing Group parameter, so that I could group by fme_basename. This hasn't worked; all the features are still trapped until everything has been sorted.

Please help me avoid this; it's giving me sleepless nights.

Hi David,

How about using a database as an intermediate format?

Store all the data in a table and use SQL to sort it.

Databases are usually pretty good at this kind of task.

Erik


Before I give the proposal below, let me assure you that we are working very hard to make this type of thing "just happen"... but until then, here is some advice from the dev team:

PointCloud big data techniques could help. Try turning the data into a point cloud first by using a PointCloudCombiner, grouping by LineID, and preserving the vertex number as a component (in addition to the X, Y, and Z). Then the PointCloudSorter could sort each group individually. The point cloud can then be broken up afterwards using the PointCloudCoercer. Follow that with a PointConnector and you should be golden.
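
If it helps to see the logic outside FME, here is a rough plain-Python sketch of what that transformer chain accomplishes for each group (it is not the point cloud machinery itself, just the equivalent sort-and-reconnect logic). The field names (LineID, vertex_no, x, y, z) are stand-ins for illustration, not the actual schema:

```python
# Sketch only: the per-LineID vertex sort that the PointCloudSorter /
# PointCloudCoercer / PointConnector chain performs, in plain Python.
# Field names here are assumptions for illustration.
from collections import defaultdict

def sort_vertices(rows):
    """rows: iterable of dicts, one per vertex, in arbitrary order."""
    lines = defaultdict(list)
    for row in rows:
        lines[row["LineID"]].append(row)  # group by LineID (PointCloudCombiner's job)
    for line_id, vertices in lines.items():
        # order each group by its vertex number (PointCloudSorter's job)
        vertices.sort(key=lambda v: v["vertex_no"])
        coords = [(v["x"], v["y"], v["z"]) for v in vertices]
        yield line_id, coords  # build the polyline (PointConnector's job)
```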


Do let us know if this helps... (in general, turning big data problems into Point Cloud exercises has been a winner in some of the cases I've seen.)


I was thinking that too, Erik: the InlineQuerier, perhaps... but unfortunately we save the FME features, so it will also use memory. These days, perhaps write everything out to SQLite with a FeatureWriter and then follow that with a SQLCreator to read it all back in; that would exercise SQLite's ability to sort. It would be interesting to compare that against the point cloud technique above.
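
As a minimal sketch of that FeatureWriter-to-SQLite-to-SQLCreator idea in plain Python: write the vertices to an on-disk table, then stream them back with an ORDER BY so SQLite does the sorting. The schema and names here are illustrative assumptions, not the real attribute names:

```python
# Sketch of the SQLite round-trip: disk-backed sort instead of in-memory.
import sqlite3

rows = [("L1", 2, 10.0, 5.0), ("L1", 1, 0.0, 0.0)]  # stand-in for the incoming features

con = sqlite3.connect("vertices.db")  # on disk, so memory stays flat
con.execute("CREATE TABLE IF NOT EXISTS vertices (line_id TEXT, vertex_no INTEGER, x REAL, y REAL)")
con.executemany("INSERT INTO vertices VALUES (?, ?, ?, ?)", rows)  # the FeatureWriter step
con.commit()
con.execute("CREATE INDEX IF NOT EXISTS ix ON vertices (line_id, vertex_no)")

# The SQLCreator step: let SQLite do the sort and stream rows back in order.
for line_id, vertex_no, x, y in con.execute(
        "SELECT line_id, vertex_no, x, y FROM vertices ORDER BY line_id, vertex_no"):
    pass  # rebuild each polyline as its vertices arrive in order
con.close()
```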


Why not use a filename reader and pass each file to a sorter workspace separately using a WorkspaceRunner?
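
For anyone who would rather drive that per-file pattern from a script than a parent workspace (it is also roughly what the batch approach further down this thread does), here is a hedged sketch using the fme command line; the workspace name, source folder, and published parameter name are made up for illustration:

```python
# Run the sorting workspace once per file, so each job's memory is
# released when the process exits. Names below are assumptions.
import subprocess
from pathlib import Path

for src in sorted(Path("incoming").glob("*.csv")):
    subprocess.run(
        ["fme", "sort_lines.fmw", "--SourceDataset", str(src)],
        check=True,  # stop the batch if one file fails
    )
```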



Hi Dale - Thanks for the detail. I have replaced the Sorter with the technique you've described and it does indeed work. The technique isn't quite ready for mass consumption, because retaining attributes needs a bit of effort (especially when you have lots of them), but it's good to put it into practice.

Using just the Sorter, the memory usage climbs because it's a blocker; using the point cloud approach, the memory use is a flat line because the features are released point cloud by point cloud into the rest of the process. This certainly gives us an option if we want to run all the data through the process at the same time, as, fingers crossed, the memory use wouldn't become unmanageable. But... there's always a but... using the Sorter on a single sample file, the whole process took just over 1 minute; using the point cloud approach, despite the more consistent memory use, it took 17 minutes. I can only assume this is because in this instance we're constructing so many small point clouds.




So for now we think we can get away with running the process in batch: we're controlling the feeding of data to the FMW with a batch file, calling multiple jobs in series and releasing the memory at the end of each load. It's not quite where we wanted to be, but it is at least possible thanks to how flexible FME is and the many options there are to run it.

I dream of the day when managing memory utilisation is a thing of the past, so it's good to hear you have this one in the cross-hairs!

Thanks again, Dave




Thanks Mark - Presorting the data is our "get out of jail free" card... we're going to keep that one in our back pocket until we really have to use it. Thanks for the tip.

