
Hi,

Faced with having to handle multiple large datasets (5M+ features), I'm looking into whether I can improve performance. I just had one of these datasets crash after 12 hours!

The task is matching any source dataset (spatially only) with a dataset of buffered municipalities, so there are no common attributes, i.e. no possible values for "Group By". Any source feature could possibly end up in any number of municipalities.
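For reference, the core operation is a plain one-to-many spatial join. Outside FME it would look roughly like this (a geopandas sketch with made-up file names; the predicate= keyword needs geopandas 0.10+):

```python
import geopandas as gpd

features = gpd.read_file("source_features.gpkg")        # 5M+ source features
municipalities = gpd.read_file("municipalities.gpkg")   # already buffered

# One-to-many join: a feature intersecting several buffered municipalities
# is emitted once per municipality, as required here.
matched = gpd.sjoin(features, municipalities, how="inner", predicate="intersects")
matched.to_file("matched.gpkg")
```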

Is it possible to get a performance boost under these circumstances? And if so, how?

Cheers

A tiling strategy would help (tile all the sets involved).

Tile by block or by municipality.

Or run per tile, e.g. via a WorkspaceRunner.

Post-process the border objects.

"Any source feature could possibly end up in any number of municipalities."

How? ;)
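To make the per-tile idea concrete, here is a minimal pure-Python sketch of assigning features to tiles (the tile size and key scheme are assumptions, not part of the suggestion itself):

```python
import math

TILE = 10_000  # tile size in map units (assumed metres)

def tile_keys(bounds, size=TILE):
    """Yield (col, row) keys for every tile a bounding box touches.

    A feature spanning a tile border gets several keys; those duplicates
    are the 'border objects' that need de-duplication in post-processing.
    """
    minx, miny, maxx, maxy = bounds
    for col in range(math.floor(minx / size), math.floor(maxx / size) + 1):
        for row in range(math.floor(miny / size), math.floor(maxy / size) + 1):
            yield (col, row)

# Example: a bbox straddling a tile border ends up in two tiles.
print(list(tile_keys((9_500, 0, 10_500, 500))))  # [(0, 0), (1, 0)]
```

Each tile can then be run as its own job, e.g. from a WorkspaceRunner.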



I'm considering tiling, but one based on first aggregating municipalities into regions. I really don't want to have to break up my workspace into multiple workspaces.

You ask "How?". What I meant was that there is no way (except by spatial comparison) to pre-determine which municipalities a given feature might end up in. Note that I've added a km-wide buffer around each of the municipalities to account for border issues, so each feature may end up in 1-3 or even more municipalities. It's a requirement from the customer.
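The buffering itself is just a plain geometric buffer; in geopandas terms it would be something like the following (assuming a metric CRS and made-up file names):

```python
import geopandas as gpd

munis = gpd.read_file("municipalities.gpkg")
# Widen each municipality by ~1 km so border features are also
# captured by the neighbouring municipalities.
munis["geometry"] = munis.geometry.buffer(1000)
munis.to_file("municipalities_buffered.gpkg")
```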


Hi, @lifalin2016

You can use the FeatureReader to link your input features to the municipalities using the spatial filter option 'Intersects'. This transformer lets you add the attributes of the initiator (your municipalities) to the output (your input dataset). It will read all input features that overlap an incoming municipality (initiator), regardless of whether they end up linked to one or to several municipalities.

It will also be more memory-friendly, since it does not require reading all the data before the analysis starts.
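Outside FME, the same initiator-driven pattern looks roughly like this (a geopandas sketch; file names are assumptions, and pushing the bbox filter down to the reader needs a format with a spatial index, such as GeoPackage):

```python
import geopandas as gpd

municipalities = gpd.read_file("municipalities_buffered.gpkg")

for _, muni in municipalities.iterrows():
    # Read only the source features whose bounding box overlaps this
    # municipality, so the full 5M+ features never sit in memory at once.
    candidates = gpd.read_file("source_features.gpkg", bbox=muni.geometry.bounds)
    hits = candidates[candidates.intersects(muni.geometry)]
    # ... attach the municipality's attributes and write `hits` out ...
```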



Hi gio.

Well, I reconsidered your grid suggestion and found a way to utilize it. I first created a grid of 1 km cells and sorted out all cells lying uniquely within a single municipality. I then preemptively matched my features with these grid cells, only performing the "expensive" matching on municipality polygons for the features not entirely within these grid cells. It seems that about 2/3 can be sorted this way.
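In outline, the pre-filter does this (a hedged geopandas sketch of the same idea; all file and column names are made up):

```python
import geopandas as gpd

grid = gpd.read_file("grid_1km.gpkg")
munis = gpd.read_file("municipalities_buffered.gpkg")
features = gpd.read_file("source_features.gpkg")

# Cells that fall entirely within a buffered municipality...
within = gpd.sjoin(grid, munis, how="inner", predicate="within")
# ...and within exactly one (cells inside several overlapping buffers
# appear more than once and are dropped).
unique = within[~within.index.duplicated(keep=False)]
unique = unique.rename(columns={"index_right": "muni_id"})

# Cheap pass: features wholly inside a uniquely assigned cell inherit
# that cell's municipality without touching the municipality polygons.
cheap = gpd.sjoin(features, unique, how="left", predicate="within")
resolved = cheap[cheap["index_right"].notna()]

# Expensive pass only on the remainder (about 1/3 of the features here).
leftover = features.loc[cheap[cheap["index_right"].isna()].index]
rest = gpd.sjoin(leftover, munis, how="inner", predicate="intersects")
```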

I'm running my two 5-million-plus feature sets with this improved workflow (which crashed before), hoping it'll speed things up. I'll let y'all know how it went.

Cheers
