Question

Memory issues with NeighborFinder + ListExploder


Badge

Dear community,

I need to calculate the weighted distance from many points to all the others within a radius of 2000 m. For this task I combined the NeighborFinder with a ListExploder and an ExpressionEvaluator to calculate the weighted distance per point.

Now, the ListExploder produces a huge number of elements, which causes memory issues, even though I'm using a 64-bit machine with 65 GB of RAM. Is there another way of solving this task, e.g. without the ListExploder? I think I need the ListExploder to be able to calculate the weighted distance for every point in the ExpressionEvaluator...

Thank you for your help,

Vincent


13 replies

Userlevel 2
Badge +19
Does a single list trigger the memory issue, or do you have several lists?

Badge

It's only one list. In fact, the problem occurs later in the process, when all the elements from the ListExploder have to be processed.

Userlevel 4
Badge +25

I would say that if you want an output record for every distance, then yes you will need a separate feature and so have to explode the list. But, to be honest, if you can create a list of that size, exploding it shouldn't cause any issues. It's especially odd because the ListExploder removes the list itself, so the subsequent features should use no more memory than the list did.

I'm thinking the problem occurs in the other processing that you do. Are there any group-based transformers in there? If so, they could be piling up data while previous features still have their unexploded lists. Can you put a FeatureHolder transformer after the ListExploder, and maybe run the workspace up to that point (or one transformer after)? I'm just thinking to prove that at least the number of features is not a problem in itself. How many features exit the ListExploder, by the way?

The other thing is, are you using Feature Caching? If so, try turning that off. If you're caching lists, and all the features from an exploded list, multiple times, then that's going to be an issue. If you need Feature Caching, then try putting a bookmark around the parts that don't need caching and collapsing the bookmark. That will prevent caching at that point (or at least limit it to a single cache).

If you can post a screenshot of the workspace - we mainly need to know what transformers follow the ListExploder - that would help to diagnose whether there is a problem, or what parameters to use to avoid it.

Badge

Thank you for your helpful answer. The problem was a group-based transformer later in the process. If I turn on the Aggregator's "input is ordered by group" option it works; it just takes a very long time to accomplish the task (more than a day). The NeighborFinder has to process 20'000 base features and 60'000 candidates, resulting in 800'000'000 features after the ListExploder... and I'm not using Feature Caching.

Userlevel 4
Badge +25

Just to state the obvious, the features do need to be ordered by group if you choose that option. Otherwise the results can be misleading.

Also, are you creating a list in the Aggregator? Because that would be a huge memory sink (and don't even try to inspect the output with a 100k+ feature list). And are you merging geometry by assembling multiple levels? If you need that functionality, fair enough, but you should do what you can to reduce memory use here: remove any unnecessary attributes coming into the Aggregator, and don't use Aggregator options that consume memory for no reason.

Badge +1

If any process takes more than the time to drink a cup of coffee, interrupt the process and find a better way! I would not be happy with a workflow that takes more than a day. You have a fast machine so there must be some really inefficient steps. Perhaps look at some of the raster tools to do the weighted distance calcs.

Badge

You are right, I'm not happy with my workflow either and I'm sure there is a better way to do it...I am converting my input raster to points to be able to calculate the distances...could you elaborate on your idea?

I uploaded part of the workflow, maybe this helps...


Badge

I'm not creating lists in the Aggregator, but I do assemble geometry (on one level only). This creates multipoints, but I only need one point per location, which is why I'm using the GeometryReplacer (see screenshot above). I also removed all the unnecessary attributes...

Badge +3

Wouldn't coordinate extraction and an unconditional merge, then calculating the distance from the merged coordinates, be faster or more efficient than having the NeighborFinder plough through them?

You have plenty of memory.

Also avoid nested lists (list1{}.list2{} etc.); these are very slow. I mention this because I see two Aggregators in a row.
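
For illustration only (not part of the original reply): a minimal PythonCaller-style sketch of that idea, assuming hypothetical attribute names _x/_y (base point) and _cand_x/_cand_y (candidate point) written by CoordinateExtractors before the unconditional merge, and the same 2000 m cutoff the NeighborFinder uses:

import math

def processFeature(feature):
    # hypothetical attribute names - adjust to whatever the CoordinateExtractors write
    x1 = float(feature.getAttribute('_x'))
    y1 = float(feature.getAttribute('_y'))
    x2 = float(feature.getAttribute('_cand_x'))
    y2 = float(feature.getAttribute('_cand_y'))
    d = math.hypot(x2 - x1, y2 - y1)   # straight-line distance in map units
    if d <= 2000:                      # same radius as the NeighborFinder
        feature.setAttribute('_distance', d)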

Userlevel 4
Badge +25

So I can think of a few things to consider or try...

1. Do you get a message in your log along the lines of "ResourceManager: Optimizing Memory Usage. Please wait..."? If so, that's when FME starts caching data, and that's when really big slowdowns occur. If we can reduce memory use to the point where we don't get that message, then the run time should drop significantly too. If you're ending up with 800 million features, even saving a few bytes per feature adds up.

Note: 65 GB / 800 million features = roughly 81 bytes per feature!

2. I don't know that the Sorter transformers are doing much for you. If they are only there to feed an XXX First or Ordered by Group parameter, I could easily believe that the sorting takes longer than the time it saves in the subsequent transformers (unless you really do want the data in order, in which case they are obviously necessary).

3. Put the GeometryExtractors directly after the PointOnAreaOverlayer, where you would only need one.

4. You could probably combine some of the Testers (between the PointOnAreaOverlayer and the NeighborFinder) into a single TestFilter.

5. What does Tester8 do? Could you move it before the NeighborFinder? Or if it's looking for distances >2000 you could put that into the NeighborFinder as the max distance instead.

Could you share a copy of your workspace, and a log file if possible? It'll be easier to see exactly what might be improved if I can see the different parameters used.

Badge +1

I assume you need Inverse Distance Weighting or a similar function. These can all be done in memory with a NumPy array and a few convenient functions. You can convert a raster to a NumPy array and run a Python function in a few seconds. For example, this thread on Stack Exchange has a nice self-contained demo that runs three different analyses in 6 seconds:

inverse-distance-weighted-idw-interpolation-with-python

My Python install includes SciPy and NumPy; the default Python install for FME probably doesn't have these modules. You would use the PythonCaller to run the script. R may also have a suitable function, in which case you could use the RCaller. The point is that raster operations are extremely fast and the functions are well known - leverage them inside your workspace.
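
For a rough idea of what that looks like, here is a minimal sketch (not taken from the linked demo) that assumes the raster cells have already been read into NumPy arrays of coordinates and values, and that NumPy/SciPy are available to the Python interpreter FME uses:

import numpy as np
from scipy.spatial import cKDTree

def idw(xy_known, values, xy_query, radius=2000.0, power=2.0):
    # Inverse-distance-weighted value for each query point,
    # using only neighbours within `radius` map units.
    tree = cKDTree(xy_known)
    result = np.full(len(xy_query), np.nan)
    for i, idx in enumerate(tree.query_ball_point(xy_query, r=radius)):
        if not idx:
            continue                   # no neighbours inside the radius
        d = np.linalg.norm(xy_known[idx] - xy_query[i], axis=1)
        d[d == 0] = 1e-10              # avoid division by zero at coincident points
        w = 1.0 / d ** power
        result[i] = np.sum(w * values[idx]) / np.sum(w)
    return result

The KD-tree performs the same radius search as the NeighborFinder, but entirely in memory and without creating hundreds of millions of intermediate features.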

Badge

With the help of @Mark2AtSafe I used a PythonCaller instead of ListExploder/ExpressionEvaluator/Aggregator to accomplish the task. The code I used:

import fme
import fmeobjects
import math

# Template Function interface:
# When using this function, make sure its name is set as the value of
# the 'Class or Function to Process Features' transformer parameter
def processFeature1(feature):
    # 'neighbors{}.distance' is the list attribute built by the NeighborFinder;
    # guard against features that have no neighbours (attribute missing)
    list_of_distances = feature.getAttribute('neighbors{}.distance') or []
    sum_of_distances = 0.0
    number_per_cell = 0
    for distance in list_of_distances:
        number_per_cell += 1
        # weighting function: f(d_ij) = sqrt(2 * d_ij + 1) - 1
        f_dij = math.sqrt(2 * float(distance) + 1) - 1
        sum_of_distances += f_dij
    feature.setAttribute("sum_of_distances", sum_of_distances)
    feature.setAttribute("number_per_cell", number_per_cell)
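
As a quick sanity check of the weighting outside FME (plain Python, no fmeobjects needed; the distances below are just made-up example values):

import math

distances = [100.0, 500.0, 1500.0]   # example neighbour distances in metres
print(sum(math.sqrt(2 * d + 1) - 1 for d in distances))   # sum_of_distances for this feature
print(len(distances))                                      # number_per_cell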

This reduced the calculation time drastically.

But to reduce the calculation time even more, an interpolation method as proposed by @kimo is probably the way to go.

Userlevel 4
Badge +25

Glad you got it working. I'm a little sad that I had to recommend Python, but it turned out to be the right call. I think the ExpressionEvaluators (which are Tcl-based) were the slow point. I've passed this on to our developers, and hopefully it will help show where we need some performance improvements.
