There is a custom transformer, ListCombiner, which assigns a group ID based on the values in the lists; you could then use a DuplicateFilter on this attribute.
If the lists are always identical, you could sort the list, concatenate it, and then use that attribute in the DuplicateFilter instead.
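As a small sketch of the sort-and-concatenate idea (the function name and the assumption that the list members arrive as strings are mine, purely for illustration):

```python
def cluster_key(cust_ids):
    """Build a stable key from a list of member IDs by sorting and joining.

    Identical groups produce identical keys regardless of list order,
    so a DuplicateFilter on this key keeps one feature per group.
    """
    return ",".join(sorted(cust_ids, key=int))

# Two features whose overlap lists contain the same IDs in a different
# order collapse to the same key.
key_a = cluster_key(["12", "3", "7"])
key_b = cluster_key(["7", "12", "3"])
```

Sorting numerically (`key=int`) rather than lexically avoids "12" sorting before "3" and producing different keys for the same group.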
I had not appreciated that the outputs of PointOnPointOverlayer would not necessarily form identical groups: if a number of points are individually clustered within the specified "tolerance" but the cluster as a whole is wider than "tolerance", then points on one edge of the cluster will not include the points on the opposite side of the cluster in their overlap lists.
However, PointOnPointOverlayer still seems a good starting point for my required clustering algorithm (is it more or less efficient than NeighborFinder?), but the problem has become more difficult! The only solution I have so far is to post-process any output features where _overlaps > 0 with the following code in a PythonCaller, to produce a general clustering algorithm. [Edit: hence a "Tester" is needed on the input to the PythonCaller to bypass it when _overlaps == 0.] Each input feature has a unique ID "cust_id", and these are added to the list "connection_pt" in the preceding PointOnPointOverlayer. Each output feature is given an additional field "cluster_id":
import fme
import fmeobjects

class FeatureProcessor(object):
    def __init__(self):
        self.clusters = {}  # Dictionary of lists of feature IDs (key = cluster ID)
        self.features = {}  # Temporary store for all input features
        self.cluster_id_count = 1

    def input(self, feature):
        # Get unique IDs for current and nearby features
        current_fid = int(feature.getAttribute('cust_id'))
        other_fids = feature.getAttribute('connection_pt{}.cust_id')

        # Store current feature indexed by feature ID
        self.features[current_fid] = feature

        # Create list of IDs: the current feature plus its neighbours
        nearby_fids = [current_fid]
        for fid in other_fids:
            nearby_fids += [int(fid)]

        # Iterate through all nearby features
        current_cluster_id = -1
        unseen_fids = []
        for feature_id in nearby_fids:
            # Search for feature in existing clusters
            found = False
            # Copy the keys: we may delete clusters while iterating
            for cluster_id in list(self.clusters.keys()):
                cluster = self.clusters[cluster_id]
                if feature_id in cluster:
                    found = True

                    # If no current cluster, make this the current one
                    if current_cluster_id == -1:
                        current_cluster_id = cluster_id

                    # If not already the current cluster, merge the clusters
                    elif current_cluster_id != cluster_id:
                        self.clusters[current_cluster_id].extend(self.clusters[cluster_id])
                        del self.clusters[cluster_id]

                    # End of search for current point ID
                    break

            # If feature never seen, remember it for later
            if not found:
                unseen_fids += [feature_id]

        # Need to add any unseen features to a cluster
        if len(unseen_fids) > 0:

            # If no existing cluster contains any of the nearby features,
            # create a new cluster of all the unseen features
            if current_cluster_id == -1:
                self.clusters[self.cluster_id_count] = unseen_fids
                self.cluster_id_count += 1

            # Otherwise append the unseen features to the appropriate cluster
            else:
                self.clusters[current_cluster_id].extend(unseen_fids)

    # After the last feature is received, traverse the clusters to
    # output each feature with its associated cluster_id
    def close(self):
        for cluster_id, cluster in self.clusters.items():
            # Output modified features, grouped by cluster_id
            for feature_id in cluster:
                feature = self.features[feature_id]
                feature.setAttribute("cluster_id", cluster_id)
                self.pyoutput(feature)
For my purposes I have modified the final output of the above to output just the last feature in each cluster and to sum the properties of another field over all points within each cluster.
I'm still interested to hear if anyone can see a simpler and/or more efficient way to implement this in FME (or spots any problems with the above!).
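The merge logic above can be exercised outside FME with plain neighbour lists standing in for features; this is only a sketch of the same transitive-grouping idea, not the PythonCaller itself:

```python
def cluster(neighbor_lists):
    """Group IDs transitively: two IDs share a cluster if any neighbour
    list links them, directly or through a chain of other lists.

    Mirrors the merge step in the PythonCaller above, using plain lists
    of ints instead of FME features.
    """
    clusters = {}   # cluster_id -> list of member IDs
    next_id = 1
    for fids in neighbor_lists:
        current = -1
        unseen = []
        for fid in fids:
            found = False
            for cid in list(clusters):   # copy: we may delete while iterating
                if fid in clusters[cid]:
                    found = True
                    if current == -1:
                        current = cid
                    elif current != cid:
                        clusters[current].extend(clusters[cid])
                        del clusters[cid]
                    break
            if not found:
                unseen.append(fid)
        if unseen:
            if current == -1:
                clusters[next_id] = unseen
                next_id += 1
            else:
                clusters[current].extend(unseen)
    return clusters

# Lists [1, 2] and [2, 3] chain into one cluster; [9] stays on its own.
groups = cluster([[1, 2], [2, 3], [9]])
```

This kind of standalone harness makes it easy to check edge cases (e.g. two existing clusters being merged by a later feature) before running against real data.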
Does every point form a cluster? If you have any points that don't have a list created by the PointOnPointOverlayer, I think your Python will fail?
The ListCombiner will handle this situation
ebygomm - thanks for your comments/suggestions - I hadn't appreciated from your initial comment that ListCombiner would cope with all the lists being different. ListCombiner sounds like it should cover the general clustering requirement - it will be interesting to see how the performance compares to the Python as I am processing ~50,000 points.
You correctly point out that if you pushed points through the Python code that did not have lists then it would fail. However, in the text I indicate that the python is to "post-process any output features where _overlaps > 0" - I put a "Tester" just before the code to bypass the PythonCaller in such cases where _overlaps == 0 (I'll add a comment within the code to clarify).
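An alternative to the upstream Tester would be a guard at the top of input() itself. Sketched here with a plain dict standing in for an FME feature and hypothetical callback names (the real PythonCaller would call feature.getAttribute() and self.pyoutput() instead):

```python
def route_feature(feature_attrs, process, passthrough):
    """Route a feature: if the overlap list is missing or empty, bypass
    the clustering step, mirroring the role of the upstream Tester.

    `feature_attrs` is a plain dict standing in for an FME feature;
    `process` / `passthrough` stand in for the two downstream paths.
    """
    other_fids = feature_attrs.get('connection_pt{}.cust_id')
    if not other_fids:          # attribute absent, None, or empty list
        passthrough(feature_attrs)
    else:
        process(feature_attrs)

processed, passed = [], []
route_feature({'cust_id': '1'}, processed.append, passed.append)
route_feature({'cust_id': '2', 'connection_pt{}.cust_id': ['1']},
              processed.append, passed.append)
```

Either approach works; the in-code guard just keeps the workspace one transformer shorter at the cost of a slightly busier PythonCaller.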
Yes, I didn't read your text properly, if you've already excluded the features without lists that wouldn't be a problem.
The ListCombiner is Python-based, just packaged into a custom transformer, so I wouldn't expect any difference in performance.
Finally got round to testing a ListCombiner + StatisticsCalculator + DuplicateFilter solution against my custom Python: it performs rather better, so +1 for using ListCombiner! For 10,000 features the ListCombiner solution ran in 8s (vs 11s for mine), but for 118,000 features the improvement was much more significant: the ListCombiner solution ran in 1m05s (vs 9m52s for mine). I had a look at the custom Tcl + Python used by ListCombiner and couldn't really follow what it was doing, but I'm happy using it as it appears to be continually supported as FME versions change. Thanks again for the pointer @ebygomm!