There is a custom transformer, ListCombiner, which assigns a group ID based on the values in the lists; you could then use a DuplicateFilter on this attribute.
If the lists are always identical, you could sort the list, concatenate it, and then use that attribute in the DuplicateFilter instead.
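As a small sketch of the sort-and-concatenate idea (the function name and the assumption that the list members arrive as strings are mine, purely for illustration):

```python
def cluster_key(cust_ids):
    """Build a stable key from a list of member IDs by sorting and joining.

    Identical groups produce identical keys regardless of list order,
    so a DuplicateFilter on this key keeps one feature per group.
    """
    return ",".join(sorted(cust_ids, key=int))

# Two features whose overlap lists contain the same IDs in a different
# order collapse to the same key.
key_a = cluster_key(["12", "3", "7"])
key_b = cluster_key(["7", "12", "3"])
```

Sorting numerically (`key=int`) rather than lexically avoids "12" sorting before "3" and producing different keys for the same group.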
I had not appreciated that the outputs of PointOnPointOverlayer would not necessarily form identical groups: if a number of points are individually clustered within the specified "tolerance" but the cluster as a whole is wider than "tolerance", then points on one edge of the cluster will not include the points on the opposite side of the cluster in their overlap lists.
However, PointOnPointOverlayer still seems a good starting point for my required clustering algorithm (is it more or less efficient than NeighborFinder?), but the problem has become more difficult! The only solution I have so far is to post-process any output features where _overlaps > 0 with the following code in a PythonCaller, to produce a general clustering algorithm. [Edit: hence a "Tester" is needed on the input to the PythonCaller to bypass it when _overlaps == 0.] Each input feature has a unique ID "cust_id", and these are added to the list "connection_pt" in the preceding PointOnPointOverlayer. Each output feature is given an additional field "cluster_id":
import fme
import fmeobjects

class FeatureProcessor(object):
    def __init__(self):
        self.clusters = {}  # Dictionary of lists of feature IDs (key = cluster ID)
        self.features = {}  # Temporary store for all input features
        self.cluster_id_count = 1

    def input(self, feature):
        # Get unique IDs for current and nearby features
        current_fid = int(feature.getAttribute('cust_id'))
        other_fids = feature.getAttribute('connection_pt{}.cust_id')

        # Store current feature indexed by feature ID
        self.features[current_fid] = feature

        # Create list of IDs: the current feature plus its neighbours
        nearby_fids = [current_fid]
        for fid in other_fids:
            nearby_fids += [int(fid)]

        # Iterate through all nearby features
        current_cluster_id = -1
        unseen_fids = []
        for feature_id in nearby_fids:
            # Search for feature in existing clusters
            found = False
            # Copy the keys: we may delete clusters while iterating
            for cluster_id in list(self.clusters.keys()):
                cluster = self.clusters[cluster_id]
                if feature_id in cluster:
                    found = True

                    # If no current cluster, make this the current one
                    if current_cluster_id == -1:
                        current_cluster_id = cluster_id

                    # If not already the current cluster, merge the clusters
                    elif current_cluster_id != cluster_id:
                        self.clusters[current_cluster_id].extend(self.clusters[cluster_id])
                        del self.clusters[cluster_id]

                    # End of search for current point ID
                    break

            # If feature never seen, remember it for later
            if not found:
                unseen_fids += [feature_id]

        # Need to add any unseen features to a cluster
        if len(unseen_fids) > 0:

            # If no existing cluster contains any of the nearby features,
            # create a new cluster of all the unseen features
            if current_cluster_id == -1:
                self.clusters[self.cluster_id_count] = unseen_fids
                self.cluster_id_count += 1

            # Otherwise append the unseen features to the appropriate cluster
            else:
                self.clusters[current_cluster_id].extend(unseen_fids)

    # After the last feature is received, traverse the clusters to
    # output each feature with its associated cluster_id
    def close(self):
        for cluster_id, cluster in self.clusters.items():
            # Output modified features, grouped by cluster_id
            for feature_id in cluster:
                feature = self.features[feature_id]
                feature.setAttribute("cluster_id", cluster_id)
                self.pyoutput(feature)
For my purposes I have modified the final output of the above to output just the last feature in each cluster and to sum the properties of another field over all points within each cluster.
I'm still interested to hear if anyone can see a simpler and/or more efficient way to implement this in FME (or spots any problems with the above!).
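The merge logic above can be exercised outside FME with plain neighbour lists standing in for features; this is only a sketch of the same transitive-grouping idea, not the PythonCaller itself:

```python
def cluster(neighbor_lists):
    """Group IDs transitively: two IDs share a cluster if any neighbour
    list links them, directly or through a chain of other lists.

    Mirrors the merge step in the PythonCaller above, using plain lists
    of ints instead of FME features.
    """
    clusters = {}   # cluster_id -> list of member IDs
    next_id = 1
    for fids in neighbor_lists:
        current = -1
        unseen = []
        for fid in fids:
            found = False
            for cid in list(clusters):   # copy: we may delete while iterating
                if fid in clusters[cid]:
                    found = True
                    if current == -1:
                        current = cid
                    elif current != cid:
                        clusters[current].extend(clusters[cid])
                        del clusters[cid]
                    break
            if not found:
                unseen.append(fid)
        if unseen:
            if current == -1:
                clusters[next_id] = unseen
                next_id += 1
            else:
                clusters[current].extend(unseen)
    return clusters

# Lists [1, 2] and [2, 3] chain into one cluster; [9] stays on its own.
groups = cluster([[1, 2], [2, 3], [9]])
```

This kind of standalone harness makes it easy to check edge cases (e.g. two existing clusters being merged by a later feature) before running against real data.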
Does every point form a cluster? If you have any points that don't have a list created by the PointOnPointOverlayer, I think your Python will fail?
The ListCombiner will handle this situation
ebygomm - thanks for your comments/suggestions - I hadn't appreciated from your initial comment that ListCombiner would cope with all the lists being different. ListCombiner sounds like it should cover the general clustering requirement - it will be interesting to see how the performance compares to the Python as I am processing ~50,000 points.
You correctly point out that if you pushed points through the Python code that did not have lists then it would fail. However, in the text I indicate that the python is to "post-process any output features where _overlaps > 0" - I put a "Tester" just before the code to bypass the PythonCaller in such cases where _overlaps == 0 (I'll add a comment within the code to clarify).
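An alternative to the upstream Tester would be a guard at the top of input() itself. Sketched here with a plain dict standing in for an FME feature and hypothetical callback names (the real PythonCaller would call feature.getAttribute() and self.pyoutput() instead):

```python
def route_feature(feature_attrs, process, passthrough):
    """Route a feature: if the overlap list is missing or empty, bypass
    the clustering step, mirroring the role of the upstream Tester.

    `feature_attrs` is a plain dict standing in for an FME feature;
    `process` / `passthrough` stand in for the two downstream paths.
    """
    other_fids = feature_attrs.get('connection_pt{}.cust_id')
    if not other_fids:          # attribute absent, None, or empty list
        passthrough(feature_attrs)
    else:
        process(feature_attrs)

processed, passed = [], []
route_feature({'cust_id': '1'}, processed.append, passed.append)
route_feature({'cust_id': '2', 'connection_pt{}.cust_id': ['1']},
              processed.append, passed.append)
```

Either approach works; the in-code guard just keeps the workspace one transformer shorter at the cost of a slightly busier PythonCaller.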
Yes, I didn't read your text properly, if you've already excluded the features without lists that wouldn't be a problem.
The ListCombiner is Python-based, just packaged into a custom transformer, so I wouldn't expect any difference in performance.
Finally got round to testing a ListCombiner + StatisticsCalculator + DuplicateFilter solution against my custom Python: it performs rather better, so +1 for using ListCombiner! For 10,000 features the ListCombiner solution ran in 8s (vs 11s for mine), but for 118,000 features the improvement was much more significant: the ListCombiner solution ran in 1m05s (vs 9m52s for mine). I had a look at the custom Tcl + Python used by ListCombiner and couldn't really follow what it was doing, but I'm happy using it as it appears to be continually supported as FME versions change. Thanks again for the pointer @ebygomm!