Question

Running one attribute (point feature) through the workspace at a time?


Badge +3

Hi,

 

I want to get the shortest path from a point in dataset A to the nearest 5 points in dataset B. I have figured out how to do this, the only problem is dataset A has 100,000 points (I need to get the nearest 5 points in dataset b for all 100,000 points). Is there a way I can release a single point until it has run through the workspace and then the next point is released? I hope this makes sense, thanks for the help.

B


15 replies

Userlevel 5
Badge +25

If you're using the NeighborFinder I think it will process the candidate points one by one anyway. You can consider a parent/child approach with a WorkspaceRunner but that means you will have to read dataset B once for every point in dataset A so that is certainly not going to be of any help.

Badge +3

If you're using the NeighborFinder I think it will process the candidate points one by one anyway. You can consider a parent/child approach with a WorkspaceRunner but that means you will have to read dataset B once for every point in dataset A so that is certainly not going to be of any help.

Thanks for the help, I'm using a network so FromToBuilder and ShortestPathFinder, I need to get the distances along the network, so I don't think NeighborFinder will work. It looks like WorkspaceRunner might be the only way to go?

Thanks

B

Userlevel 5
Badge +25

Thanks for the help, I'm using a network so FromToBuilder and ShortestPathFinder, I need to get the distances along the network, so I don't think NeighborFinder will work. It looks like WorkspaceRunner might be the only way to go?

Thanks

B

ShortestPathFinder is a heavy processing task, in that case it might be worthwile trying the WorkspaceRunner approach. Alternatively it might be worth unloading this to a cloud instance, that would come at a cost but the specs are a lot higher.

Badge +3

ShortestPathFinder is a heavy processing task, in that case it might be worthwile trying the WorkspaceRunner approach. Alternatively it might be worth unloading this to a cloud instance, that would come at a cost but the specs are a lot higher.

Thanks, I can get access to our server here if need be. Within the WorkspaceRunner because my dataset A is a single shapefile, how can I stutter the release of the points one by one?

Thanks

B

Userlevel 1
Badge +21

I don't think that one feature at a time with a workspace runner is going to be the most efficient way to do this - you would then have to read in your network 100,000 times.

Using a workspace runner to run a 1000 features through at a time may be more efficient - but a lot will depend on the rest of the process and complexity of your network.

20,000 features takes about 10 minutes for me - this is split into approx 50 groups and run by a workspace runner with 6 concurrent fme processes.

Badge +3

I don't think that one feature at a time with a workspace runner is going to be the most efficient way to do this - you would then have to read in your network 100,000 times.

Using a workspace runner to run a 1000 features through at a time may be more efficient - but a lot will depend on the rest of the process and complexity of your network.

20,000 features takes about 10 minutes for me - this is split into approx 50 groups and run by a workspace runner with 6 concurrent fme processes.

Thanks for the answer. I've uploaded my workspace, would it need to be modified to take 1000 features at a time but only give results for one at a time (if that makes sense?)?

Can I use the WHERE Clause to release each of the features individually?

SHORTEST_PATH.fmw

Userlevel 1
Badge +21

Thanks for the answer. I've uploaded my workspace, would it need to be modified to take 1000 features at a time but only give results for one at a time (if that makes sense?)?

Can I use the WHERE Clause to release each of the features individually?

SHORTEST_PATH.fmw

I don't have 2019 on my machine so I cannot view your workspace unfortunately. Typically, if you send 1000 from-to lines into the shortest path finder you should get 1000 results out.

Userlevel 4
Badge +25
Is there a way I can release a single point until it has run through the workspace and then the next point is released?

Well, that's sort of how FME works anyway. One point at a time is read and processed as far as it can go before the next needs to be read. The problem here is that the ShortestPathFinder is a group-based transformer and will demand all of the features before it starts processing. But even were it otherwise, I don't see an issue because the other features are just 99,999 points held in memory, which is pretty negligible). It's not like your workspace looks that wrong to me. It looks fine.

Anyway, although you don't specify, I'm guessing the problem is one of performance. You're trying to find the nearest five B points to every A point by finding the distance between every A point and every B point, then just picking the five shortest.

That's technically the correct solution, but I think a key issue there is how many features are in dataset B. If there are only 6 points in dataset B then you're probably doing the best you can. If there are (for example) 1,000 points in dataset B, then you're doing 100,000 * 1,000 shortest path calculations (by my calculations that's 100,000,000 operations, which is a huge amount of processing).

So I think there's a few things to consider or try out...

1) Can you carry this out on just a single point from dataset A? How long does that take to process? Can you adjust the parameters in the ShortestPathFinder to reduce that time? Because multiply this by 100,000 and that's the overall time you are looking at. The best thing that the WorkspaceRunner idea does is allow parallel processing. So if you have a quad core machine, with parallel processing you could probably reduce this time by 75%. Does that bring it down to an acceptable time?

2) What is your data like? Is the network a bunch of city blocks in a typical North American city, or is it more random and sprawling? How far apart are the B features (and again, how many are there)? Does considering the data like this show any assumptions you can make to reduce the number of operations? For example, can you say "there are 1,000 B features, but in reality only the closest ten would be likely candidates, so I'll use the NeighborFinder to find the closest ten and run my ShortestPathFinder on that reduced dataset"? Or "...in reality only B features within 1km would be likely candidates, so I'll create a buffer and do an overlay to throw out B features outside 1km"? In short, can you make assumptions to reduce the number of potential B features being tested?

3) What's the road network quality? Is it a high-precision dataset? Can we run it through a Generalizer transformer to reduce the number of points? For example, if you only measure routes to 100 metres, then having points every 10cm is obviously excessive. So can we reduce the network size to speed up processing that way?

4) Can we try a different approach? The NetworkCostCalculator might be one idea. You could use that to produce a set of isochrones (contours/polygons of equal travel time) for each A point and then overlay dataset B. That way might be more efficient (it won't be quite as accurate, but that might be a trade-off that you're willing to accept). Or can you use the NeighborFinder instead? Is straight-line distance an acceptable alternative to network/road distance? Because that would be way quicker.

I hope this helps. I know it seems complicated, but the issue is that finding a shortest path is actually a very, very complex process. It's literally impossible to guarantee finding the correct answer. All we can do is run through as many different solutions as possible, looking for something shorter (even then, if the "correct" answer comes out first, we still have to keep looking, because we don't know it's the correct answer)! So multiply this by 100,000 and it becomes a very intensive process.

So I think the best solution is to try and filter the B features, to reduce their numbers. But I'll have a try to put together my own example and experiment with that to see what I can do.

Userlevel 4
Badge +25

I asked our development team and there is an algorithm that they have in their code that will do this, it's just not exposed in the ShortestPathFinder transformer. So I am filing enhancement requests and hopefully this means it will be available in FME in the not too distant future.

Badge +3

I asked our development team and there is an algorithm that they have in their code that will do this, it's just not exposed in the ShortestPathFinder transformer. So I am filing enhancement requests and hopefully this means it will be available in FME in the not too distant future.

Brilliant, thanks so much for the help Mark, it's been really useful. I am tending to this workspace again today and I will update you with what I follow through with.

Thanks

B

Badge +3
Is there a way I can release a single point until it has run through the workspace and then the next point is released?

Well, that's sort of how FME works anyway. One point at a time is read and processed as far as it can go before the next needs to be read. The problem here is that the ShortestPathFinder is a group-based transformer and will demand all of the features before it starts processing. But even were it otherwise, I don't see an issue because the other features are just 99,999 points held in memory, which is pretty negligible). It's not like your workspace looks that wrong to me. It looks fine.

Anyway, although you don't specify, I'm guessing the problem is one of performance. You're trying to find the nearest five B points to every A point by finding the distance between every A point and every B point, then just picking the five shortest.

That's technically the correct solution, but I think a key issue there is how many features are in dataset B. If there are only 6 points in dataset B then you're probably doing the best you can. If there are (for example) 1,000 points in dataset B, then you're doing 100,000 * 1,000 shortest path calculations (by my calculations that's 100,000,000 operations, which is a huge amount of processing).

So I think there's a few things to consider or try out...

1) Can you carry this out on just a single point from dataset A? How long does that take to process? Can you adjust the parameters in the ShortestPathFinder to reduce that time? Because multiply this by 100,000 and that's the overall time you are looking at. The best thing that the WorkspaceRunner idea does is allow parallel processing. So if you have a quad core machine, with parallel processing you could probably reduce this time by 75%. Does that bring it down to an acceptable time?

2) What is your data like? Is the network a bunch of city blocks in a typical North American city, or is it more random and sprawling? How far apart are the B features (and again, how many are there)? Does considering the data like this show any assumptions you can make to reduce the number of operations? For example, can you say "there are 1,000 B features, but in reality only the closest ten would be likely candidates, so I'll use the NeighborFinder to find the closest ten and run my ShortestPathFinder on that reduced dataset"? Or "...in reality only B features within 1km would be likely candidates, so I'll create a buffer and do an overlay to throw out B features outside 1km"? In short, can you make assumptions to reduce the number of potential B features being tested?

3) What's the road network quality? Is it a high-precision dataset? Can we run it through a Generalizer transformer to reduce the number of points? For example, if you only measure routes to 100 metres, then having points every 10cm is obviously excessive. So can we reduce the network size to speed up processing that way?

4) Can we try a different approach? The NetworkCostCalculator might be one idea. You could use that to produce a set of isochrones (contours/polygons of equal travel time) for each A point and then overlay dataset B. That way might be more efficient (it won't be quite as accurate, but that might be a trade-off that you're willing to accept). Or can you use the NeighborFinder instead? Is straight-line distance an acceptable alternative to network/road distance? Because that would be way quicker.

I hope this helps. I know it seems complicated, but the issue is that finding a shortest path is actually a very, very complex process. It's literally impossible to guarantee finding the correct answer. All we can do is run through as many different solutions as possible, looking for something shorter (even then, if the "correct" answer comes out first, we still have to keep looking, because we don't know it's the correct answer)! So multiply this by 100,000 and it becomes a very intensive process.

So I think the best solution is to try and filter the B features, to reduce their numbers. But I'll have a try to put together my own example and experiment with that to see what I can do.

Hi @mark2atsafe,

I've been making updates to this workspace and have decided to give option 2 a try using NeighbourFinder, it has worked with cutting processing time significantly which is great. I am having an issue getting a consistent output however. I am using a WorkspaceRunner to run each feature individually, but I am getting varied outputs each time. For example there are 79 features, each feature should give 5 results therefore 395 rows populated in excel. However, I am sometimes getting 355/365/375, I am confused as to why this is happening. The parameters in the workspace are not changing. Is this something to do with shortest paths in general, as you say above 'It's literally impossible to guarantee finding the correct answer '. I have attached the the two workspaces.

(*Edit would either @redgeographics @egomm have any idea on solving this problem?)

Thanks for the help.

B

RUNNER_SHORTEST_PATH.fmw

SHORTEST_PATH.fmw

Userlevel 1
Badge +21

Hi @mark2atsafe,

I've been making updates to this workspace and have decided to give option 2 a try using NeighbourFinder, it has worked with cutting processing time significantly which is great. I am having an issue getting a consistent output however. I am using a WorkspaceRunner to run each feature individually, but I am getting varied outputs each time. For example there are 79 features, each feature should give 5 results therefore 395 rows populated in excel. However, I am sometimes getting 355/365/375, I am confused as to why this is happening. The parameters in the workspace are not changing. Is this something to do with shortest paths in general, as you say above 'It's literally impossible to guarantee finding the correct answer '. I have attached the the two workspaces.

(*Edit would either @redgeographics @egomm have any idea on solving this problem?)

Thanks for the help.

B

RUNNER_SHORTEST_PATH.fmw

SHORTEST_PATH.fmw

I think the reason you're getting different results is you are writing to the same excel file each time and there is some sort of conflict which means that two different outputs put the output on the same excel row, so one result effectively overwrites each other.

I'd write the output from each workspace runner to a file, then combine these together once all the processes have run.

Badge +3

I think the reason you're getting different results is you are writing to the same excel file each time and there is some sort of conflict which means that two different outputs put the output on the same excel row, so one result effectively overwrites each other.

I'd write the output from each workspace runner to a file, then combine these together once all the processes have run.

Thanks @egomm,

You appear to be correct, the excel file is overwriting rows seemingly at random. I changed the Writer to write out to individual excel files and it didn't leave any out. In the end I decided to write to Microsoft Access which returned all results.

 

Thanks very much for the help.

B

Badge +3

Hi @mark2atsafe,

I've been making updates to this workspace and have decided to give option 2 a try using NeighbourFinder, it has worked with cutting processing time significantly which is great. I am having an issue getting a consistent output however. I am using a WorkspaceRunner to run each feature individually, but I am getting varied outputs each time. For example there are 79 features, each feature should give 5 results therefore 395 rows populated in excel. However, I am sometimes getting 355/365/375, I am confused as to why this is happening. The parameters in the workspace are not changing. Is this something to do with shortest paths in general, as you say above 'It's literally impossible to guarantee finding the correct answer '. I have attached the the two workspaces.

(*Edit would either @redgeographics @egomm have any idea on solving this problem?)

Thanks for the help.

B

RUNNER_SHORTEST_PATH.fmw

SHORTEST_PATH.fmw

Hi @mark2atsafe / @egomm / @redgeographics

I've run into a little road block while trying to perform this task. While running on a relatively small sample 500 of point A and 150 of point B the amount of data being stored in my temp drive is becoming unmanageable. I'm getting 400gb of data way before the task is anywhere near complete. This seems like way too much for such a small amount of data. Is there a way to reduce this? It is consistently crashing my computer.

 

Thanks for any help.

B

Userlevel 4
Badge +13

Hi @mark2atsafe / @egomm / @redgeographics

I've run into a little road block while trying to perform this task. While running on a relatively small sample 500 of point A and 150 of point B the amount of data being stored in my temp drive is becoming unmanageable. I'm getting 400gb of data way before the task is anywhere near complete. This seems like way too much for such a small amount of data. Is there a way to reduce this? It is consistently crashing my computer.

 

Thanks for any help.

B

Hi @bjudes Do you have feature caching turned on? See https://knowledge.safe.com/articles/579/performance-tuning-fme.html and https://knowledge.safe.com/articles/79739/feature-caching-and-performance.html I created some sample data with a similar number of features and I didn't encounter any problems with temp files. If feature caching isn't your problem, can you send your source data to us via https://www.safe.com/support/report-a-problem/ ?

Reply