Question

More efficient aggregation


Badge +1

I have a dataset with 29 million polygons, some of which share the same attribute value (say, Key ID). I need to aggregate them by Key ID, but the dataset is not ordered.

After nearly 12 hours of run time, it has aggregated only 1.1 million features.

I need to find a better way of aggregating. Any ideas?

 

The source data is read from about 10 shapefiles.

I have access to PostGIS if that would help.


7 replies

Badge +21

1. Use Generate Workspace to set up a SHP -> FFS translation. Change the settings on the FFS writer to include indexes.

2. Then try reading the FFS, sorting, and aggregating.

Userlevel 4
Badge +26

The other thing is to make sure there aren't unneeded attributes.

 

 

You could also import them all into a PostGIS database and read in only the records that are actually duplicates, with something like this: https://stackoverflow.com/questions/28156795/how-to-find-duplicate-records-in-postgresql. You would also want the result to be sorted by the ID, and you would need to make sure that the ID field has an index so the query is fast.

 

 

Not sure if that would be faster than what @sigtill has suggested, but in general this is the kind of work a database could do for you if you have access to one.

 

 

Of course, you will still need to somehow get the records whose ID is unique.
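
As a rough sketch of the query shape described above: one query returns only the rows whose ID occurs more than once, sorted by ID, and a second returns the rows whose ID is unique. The table name `polygons` and the columns `fid` and `key_id` are hypothetical, and an in-memory SQLite database stands in for PostGIS so the example is self-contained. The same SQL should carry over to PostgreSQL, but it has not been timed against a dataset of this size.

```python
import sqlite3

# Hypothetical schema; SQLite is only a stand-in for PostGIS here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE polygons (fid INTEGER PRIMARY KEY, key_id TEXT)")
# Index the grouping column so GROUP BY / ORDER BY can use it.
conn.execute("CREATE INDEX polygons_key_id_idx ON polygons (key_id)")
conn.executemany(
    "INSERT INTO polygons (key_id) VALUES (?)",
    [("B",), ("A",), ("C",), ("A",), ("B",), ("A",)],
)

# Rows whose key_id occurs more than once, sorted so each group is contiguous.
DUPLICATED_ROWS = """
    SELECT p.fid, p.key_id
    FROM polygons AS p
    JOIN (
        SELECT key_id
        FROM polygons
        GROUP BY key_id
        HAVING COUNT(*) > 1
    ) AS d ON d.key_id = p.key_id
    ORDER BY p.key_id, p.fid
"""

# Rows whose key_id occurs exactly once; these need no aggregation.
UNIQUE_ROWS = """
    SELECT MIN(fid) AS fid, key_id
    FROM polygons
    GROUP BY key_id
    HAVING COUNT(*) = 1
    ORDER BY key_id
"""

duplicated = conn.execute(DUPLICATED_ROWS).fetchall()
unique = conn.execute(UNIQUE_ROWS).fetchall()
```

Only the `duplicated` rows would then need to go through the aggregation step; the `unique` rows can bypass it and be merged back in afterwards.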

 

 

Badge +1

Is the sorter more efficient than the aggregator then?

I will give it a try

 

Userlevel 4
Badge +26

The Sorter works in Bulk Mode, so it should be faster. Which version of FME are you using? If you have access to 2020, then Shapefile reading is much faster too.

Badge +1

I am on 2020. Running some time trials on the different combinations of shp / ffs and sorter / no sorter.

Sorter is definitely faster.

Userlevel 4
Badge +26

Nice - performance is always a fun thing to play with, and trying to improve it is always a learning experience. For me, it's just such a great way to learn. When someone complains that something is too slow, I see it as a fun challenge and an opportunity to learn something new.

 

Always takes time though, but you just learn so much!

 

Good luck!

1. Use Generate Workspace and SHP -> FFS. Change the settings on the FFS-writer to include indexes

Then you can try reading the FFS - sorting and Aggregating

I'm normally a lurker, but I had to say this answer has SAVED ME.

I'm aggregating road names for a hefty dataset of 34 million points, and I'd been running into translation failures hours into trying to aggregate normally.

I will be sharing this tip with my team.

 

My process (not sure if I took extra steps or skipped things, but it worked):

  1. Load the original dataset and use an AttributeKeeper to keep only what I need
  2. Use a FeatureWriter to write to FFS format (this produced a ton of extra files, so I wish I'd done this in a separate folder)
  3. Bring in the new FFS file and use a Sorter
  4. Use an Aggregator (I found that even with sorting, "aggregating when group changed" created duplicates, so I left Group By Mode as Process At End (Blocking))
  5. Write out the aggregated table as a new output so I don't have to do this again in other tasks
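
To illustrate why the order of the input matters in step 4, here is a minimal Python sketch of "emit a group whenever the key changes" aggregation. It is a general illustration, not FME's implementation, and the `(key, area)` records and the function name are made up for the example. If features with the same key are not contiguous, this style of aggregation emits the same key more than once. That is one possible explanation for the duplicates mentioned above, for instance if the sort key does not exactly match the group-by attribute.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (key_id, area) records standing in for features.
features = [("B", 1.0), ("A", 2.0), ("B", 3.0), ("A", 4.0), ("C", 5.0)]


def aggregate_on_group_change(records):
    """Emit one (key, count, total_area) row each time the key changes.

    Only the current group is held in memory, which is why this is cheap
    on sorted input, but it is only correct if equal keys are contiguous.
    """
    out = []
    for key, members in groupby(records, key=itemgetter(0)):
        areas = [area for _, area in members]
        out.append((key, len(areas), sum(areas)))
    return out


# Unsorted input: keys "A" and "B" each show up twice in the output.
unsorted_result = aggregate_on_group_change(features)

# Sorted on the same key used for grouping: exactly one row per key.
sorted_result = aggregate_on_group_change(sorted(features, key=itemgetter(0)))
```

A blocking aggregation avoids the problem by waiting for all input before grouping, at the cost of holding every feature until the end.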

 

Went from a multi-hour process to less than half an hour, including setup.

 

Stellar help. THANK YOU!
