Question

Finding Neighbors mathematically (increase performance)

  • 20 November 2019
  • 14 replies
  • 13 views

Badge +1

Hi guys,

I got a challenge for you: I have 500k+ grid points with a distance of 0.5m and need to find the neighbors from each point within a given radius (let's say 2m).

The NeighborFinder takes horribly long. The SpatialRelator has about the same calculation time.

What I'm now thinking is to find the neighbors mathematically rather than spatially. For each point I can create the attributes minX, minY, maxX and maxY. Then I just need to test whether another point's coordinates fall inside those bounds, right?!

In the end, every point should have a list of all its neighbors (and their attributes).

Any ideas? Or any hints how I can speed up the spatial tester?

Cheers,

 

Maria


14 replies

Userlevel 4

I suspect that the NeighborFinder does indeed work "mathematically" inside :-)

First of all, make sure you're running the very latest FME version; there has been some impressive progress on spatial performance lately (FME 2019+). With 500k+ features, also make sure to use 64-bit FME.

If you use the numeric solution outlined above, you'll effectively do a square buffer around each point, meaning that you're not guaranteed an exact 2 m distance around each point (the square picks up corner points beyond 2 m). If that's not an issue, however, it could well be faster, depending on how you implement it.
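As a rough illustration (plain Python, not FME), here's the difference between the square min/max prefilter and the true circular test on a 0.5 m grid: the square keeps corner points up to 2·√2 ≈ 2.83 m away that the circle rejects.

```python
import math

def bbox_neighbors(points, p, r):
    """Coarse filter: axis-aligned square of half-width r around p."""
    px, py = p
    return [q for q in points
            if q != p and abs(q[0] - px) <= r and abs(q[1] - py) <= r]

def radius_neighbors(points, p, r):
    """Exact filter: Euclidean distance <= r."""
    px, py = p
    return [q for q in points
            if q != p and math.hypot(q[0] - px, q[1] - py) <= r]

# A small 0.5 m grid around the origin stands in for the 500k-point dataset.
pts = [(x * 0.5, y * 0.5) for x in range(-5, 6) for y in range(-5, 6)]
square = bbox_neighbors(pts, (0.0, 0.0), 2.0)   # 80 candidates
circle = radius_neighbors(pts, (0.0, 0.0), 2.0)  # 48 true neighbors
```

The usual trick is to combine both: the cheap square test throws away most points, and the exact distance test only runs on the survivors.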

 

Userlevel 4
Badge +25

You *could* try speeding it up by tiling the data first, one tile per core, then running it in parallel. With FME 2019+, parallel processing must be run in a custom transformer. Results (and performance) might be unexpected, though.

 

 

Not sure about "mathematically" - I mean, in the end spatial is just math... Also, you would need to test each point against each other point; that's 500,000 tests for each of the 500,000 points. FME will build a spatial index and use that to speed things up for the NeighborFinder.

 

 

But maybe I'm missing something here and there is a clever way to do it.

 

 

Badge +22

If the points are a true grid with even spacing, couldn't you just assume the neighbours by row/col attributes?

 

 

ex. Point i,j will have neighbours:
            (i+2, j-1)  (i+2, j) (i+2, j+1) 
(i+1, j-2)  (i+1, j-1)  (i+1, j) (i+1, j+1) (i+1, j+2)
(i, j-2)    (i, j-1)             (i, j+1)   (i, j+2)
(i-1, j-2)  (i-1, j-1)  (i-1, j) (i-1, j+1) (i-1, j+2)
            (i-2, j-1)  (i-2, j) (i-2, j+1) 

The exact kernel will depend on the buffer distance vs. cell spacing, and on whether any part of the cell being within the buffer distance counts, or only the centroid matters.
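A sketch of that idea (hypothetical helper, not part of FME): the kernel offsets can be derived from the buffer radius and cell spacing, using the centroid-only rule. The 20-offset kernel above corresponds to a radius of about √5 cells; for the 2 m / 0.5 m case in the question the offsets reach 4 cells out.

```python
def neighbor_kernel(radius, spacing):
    """Row/col offsets (di, dj) whose cell centroid lies within `radius`
    of the origin cell, for a regular grid with the given spacing."""
    n = int(radius // spacing)  # farthest whole-cell offset worth checking
    return [(di, dj)
            for di in range(-n, n + 1)
            for dj in range(-n, n + 1)
            if (di, dj) != (0, 0)
            and (di * spacing) ** 2 + (dj * spacing) ** 2 <= radius ** 2]

# 2 m radius on a 0.5 m grid: each point has 48 centroid-based neighbors.
kernel = neighbor_kernel(2.0, 0.5)
```

With a true even-spaced grid, each point's neighbors are then just the points at `(i + di, j + dj)` for every offset in the kernel - no spatial test needed at all.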

Badge +1

Thanks to all for the input!

I updated to the most recent version. It does speed things up, but the NeighborFinder still takes a while (500k points: ~30 min).

It seems I'm a blockhead: how can I actually get the calculated neighbors? I'd need to merge every point with every other point, wouldn't I?!

Badge +1

D'oh! I found out what really slows down my workspace: creating a list of closest candidates makes it at least ten times slower (3 minutes vs. 30 minutes at 500k points)!

Userlevel 4
Badge +25


Yeah, for sure - the list is the killer here. No matter which way you cut it, if you want the list of all the attributes it will sadly be quite slow. Although I think that when you don't generate the list, it only matches one candidate to each point (see the log file for warnings). So in a sense it doesn't really find all 10 closest.

Badge +3

Optimisation here is essentially just creating and searching an R-Tree index, which is the core of how spatial indexing and searching works. Essentially, for point A (Xa, Ya) you want to find any other points with abs(X-Xa)<2 AND abs(Y-Ya)<2 if you are testing a proximity of 2 metres. This is a rectangular search pattern (exactly what R-Trees do). The NeighborFinder can then be used in Group By mode to refine which of those grouped points are within a circular distance of 2 metres, which will work far faster than in a naive, non-grouped mode. Alternatively, you can solve the refinement in SQL using the distance formula as well.

This can be handled quite readily in SQL, which leads to potentially using the InlineQuerier:

  1. Feed the grid point dataset (PointID, XCoord, YCoord) into the InlineQuerier as "GridPoints".
  2. Issue the SQL command "CREATE INDEX idx_PointID ON GridPoints (PointID)". This returns an empty result, but that's fine; it just creates the B-Tree indices that we are going to fudge a little into behaving like an R-Tree index. This could also be done by writing out to SpatiaLite, which has simpler SpatiaLite-specific SQL syntax that uses the native R-Tree indexing built into its extended version of SQLite, but this example is pure plain SQL, so it needs these B-Tree indices instead.
  3. Similarly issue "CREATE INDEX idx_XCoords ON GridPoints (XCoord)".
  4. Similarly issue "CREATE INDEX idx_YCoords ON GridPoints (YCoord)".
  5. Then issue the query:

    SELECT A.PointID AS OriginPointID, B.PointID AS ProximityPointID
    FROM GridPoints B, GridPoints A
    WHERE B.rowID IN (SELECT rowID FROM GridPoints WHERE GridPoints.XCoord < A.XCoord+2 AND GridPoints.XCoord > A.XCoord-2)
    AND B.rowID IN (SELECT rowID FROM GridPoints WHERE GridPoints.YCoord < A.YCoord+2 AND GridPoints.YCoord > A.YCoord-2)
    AND (B.XCoord-A.XCoord)*(B.XCoord-A.XCoord)+(B.YCoord-A.YCoord)*(B.YCoord-A.YCoord) < (2*2)
    AND B.rowID NOT IN (SELECT rowID FROM GridPoints WHERE GridPoints.PointID = A.PointID)

    This would probably take a lot of separate explaining, but it's the way to optimise SQLite to make it use its indices effectively, as per https://www.sqlite.org/optoverview.html

A word of warning, though: I haven't fully tested whether InlineQuerier's implementation of SQLite makes and uses indices created with "CREATE INDEX ..." statements. But the above is the raw SQL way spatial indices work: a coarse minimum-bounding-rectangle search first to group up proximate geometries, followed by a final distance test.
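As a standalone stand-in for the InlineQuerier (plain Python + sqlite3 on a small grid, with an inclusive <= variant of the distance test; the query above used a strict <), the coarse-rectangle-then-exact-distance pattern looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE GridPoints (PointID INTEGER, XCoord REAL, YCoord REAL)")

# A small 21 x 21 grid at 0.5 m spacing stands in for the 500k-point dataset.
pts = [(i, (i % 21) * 0.5, (i // 21) * 0.5) for i in range(21 * 21)]
cur.executemany("INSERT INTO GridPoints VALUES (?, ?, ?)", pts)

# The B-Tree indices from the post; SQLite's planner can use one per table scan.
cur.execute("CREATE INDEX idx_XCoords ON GridPoints (XCoord)")
cur.execute("CREATE INDEX idx_YCoords ON GridPoints (YCoord)")

# Coarse rectangular filter on the indexed columns, then the exact
# squared-distance test, for a 2 m radius (inclusive of exactly 2 m).
cur.execute("""
    SELECT A.PointID, B.PointID
    FROM GridPoints A, GridPoints B
    WHERE B.XCoord BETWEEN A.XCoord - 2 AND A.XCoord + 2
      AND B.YCoord BETWEEN A.YCoord - 2 AND A.YCoord + 2
      AND (B.XCoord - A.XCoord) * (B.XCoord - A.XCoord)
        + (B.YCoord - A.YCoord) * (B.YCoord - A.YCoord) <= 4
      AND A.PointID <> B.PointID
""")
rows = cur.fetchall()
```

Each interior point of the grid comes back with 48 neighbours, matching the 2 m / 0.5 m kernel size.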

 

Badge +3

...and here you go @gpt_executer! An FME workflow using the power of an R*Tree index search combined with a super-powerful but lightweight SQL engine, SQLite. It takes ~15 minutes to output a 24,000,000-row, completely flat lookup table of OriginPointID vs. ProximityPointID where the distance is <=2 metres in a 500,000-point, 0.5 m point grid. The underlying SQL query itself only takes ~4 minutes to complete, but there is a bottleneck in how fast FME can read the query results back in.

The workflow generates a 2D grid of 500,000 points and sends it out to SpatiaLite (an extension of SQLite) to run an R*Tree nearest-neighbour SQL query (SpatiaLite exposes an accessor to its built-in geometry R*Tree index through the virtual table "SpatialIndex"). The *.sqlite file could be made temporary by inserting a TempPathNameCreator.

...and here is the specific SQL that searches the R*Tree. It creates a 4 m x 4 m square around each point, and the engine uses the min/max X, Y of this search geometry to look for points spatially indexed to that location. The final refinement checks the actual distance of the smaller group of points returned, using ST_Distance (a rough equivalent of the NeighborFinder).

gridsearchwithspatialite.fmw

Userlevel 4
Badge +25


This is an awesome answer. I was wondering if min/max values in SQL would do the trick but I hadn't even considered R-Tree indexing. I'll be taking a close look at the workspace to see what I can learn (gotta keep learning!)

Badge +3


Thanks @mark2atsafe. One of the things that drew me to FME was its ability to seamlessly move data from the workspace out to SQLite or SpatiaLite and back, so that the power of both environments can be unlocked simultaneously. I've been using both engines for years to smash out very high-performance data joins and queries on extremely large datasets, although unlocking the performance lies in building their indices and knowing the nuances of how to use them.

I found some further optimisations. Firstly, my 15-minute measurement was with Feature Caching on. I didn't think that would be a huge overhead, but it turns out it was: with it off, the whole 24 million features of OriginPoint vs. ProximityPoint come back from the SQLExecutor in only 6 minutes! This is nearly as fast as the underlying SQLite/SpatiaLite engine executes the query.

A further incremental gain comes from a couple of other SQLite tricks: SQLite can pre-process query results into temporary RAM tables, and you can then point the SQLExecutor to read from the table that the backend has already materialised in RAM.

The SQL for SQLExecutor in the sample workflow is a few more lines, but it goes like this:

FME_SQL_DELIMITER ;
-- The line above tells FME that the statement delimiter is ";".
PRAGMA temp_store = 2;
-- Hints to the SQLite engine that temporary tables should be written to RAM;
-- it tries to obey this where possible.
CREATE TEMP TABLE tmpGridPointResults AS
SELECT A.PointID AS OriginPointID, B.PointID AS ProximityPointID
FROM GridPoints A, GridPoints B
WHERE ST_Distance(A.Geometry, B.Geometry) <= 2
  AND A.PointID <> B.PointID
  AND B.rowID IN (SELECT rowID FROM SpatialIndex
                  WHERE f_table_name = 'GridPoints'
                    AND Search_Frame = ST_Expand(A.Geometry, 2));
-- Pre-processes the results into a temporary RAM table in the backend.
-- Temp tables in SQLite are deleted when the connection closes, i.e. when
-- the SQLExecutor finishes.
SELECT * FROM tmpGridPointResults;
-- All the preceding statements return empty results, so the only "data" the
-- SQLExecutor sees when the statements are strung together is the read-back
-- of the already-written temporary RAM table.

Using this cuts out ~15% of the processing time, reducing the run to 5 minutes.
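Outside FME, the same temp-table pattern can be sketched with plain sqlite3 (no SpatiaLite here, so a simple ABS() filter on a 1D line of points stands in for the ST_Distance/Search_Frame predicates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hint SQLite to keep temporary tables in RAM (2 = MEMORY).
cur.execute("PRAGMA temp_store = 2")

cur.execute("CREATE TABLE GridPoints (PointID INTEGER, XCoord REAL, YCoord REAL)")
# Ten points on a line at 0.5 m spacing.
cur.executemany("INSERT INTO GridPoints VALUES (?, ?, ?)",
                [(i, i * 0.5, 0.0) for i in range(10)])

# Materialise the expensive join once into a TEMP table...
cur.execute("""
    CREATE TEMP TABLE tmpGridPointResults AS
    SELECT A.PointID AS OriginPointID, B.PointID AS ProximityPointID
    FROM GridPoints A, GridPoints B
    WHERE ABS(B.XCoord - A.XCoord) <= 2 AND A.PointID <> B.PointID
""")

# ...then the only statement that returns rows is a plain read-back,
# mirroring what the SQLExecutor sees. TEMP tables vanish when the
# connection closes.
rows = cur.execute("SELECT * FROM tmpGridPointResults").fetchall()
```

The point of the pattern is that the engine does all the heavy lifting before the client starts pulling rows, so the fetch loop streams from an already-built RAM table.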

Userlevel 4


I think you should come to the FME UC 2020 and present on this topic; there's too little discussion about this type of interaction, and I think you'd find a lot of interest. I'd definitely be in the crowd.

Badge +1


Thanks, @bwn! I couldn't answer earlier b/c of busy days, but your suggestions were very helpful! We reached our goal, and it opened our minds a bit wider for further calculations! Thanks again! I would love to come to the FME UC 2020 and see/hear/talk to all the amazing people like you, but the busy days keep going on...

Badge +2

@gpt_executer The NeighborFinder does use an R*Tree spatial index. We tested your example workspace and it ran in about 5 min (FME 2019.2) on my laptop. Using the option to limit the Number of Neighbors to Find can sometimes change the performance (you'd need a value of about 50 for this analysis). I suspect there is something else going on in your actual workspace. Make sure Feature Caching is OFF!

Userlevel 4
Badge +25


This was actually my experience as well, even when looking for all matches. The result was with list attributes, though, and exploding the list into individual features took some time, pushing the process time up a little. I can imagine that if you had a tonne of attributes, this list building would take a while.
