Can you add a small sample dataset?
Hi @ebygomm.
Could not resist doing a small test:
4 Creators, each creating 10 records with _creation_instance from 0-9.
4 AttributeCreators creating test=1 or test=2.
A ChangeDetector with key _creation_instance (so there are duplicate keys in the test set), checking the test attribute.
All output as Unchanged.
So, even with duplicate keys, as long as there is a matching record, the output is Unchanged.
Yes, I've not been able to create shareable test data that exhibits the issue. But I can send 7 features into each port of the ChangeDetector and get 5 unchanged and 2 edits. If I reverse-sort one of the inputs, everything exits via the Unchanged port...
Feature order definitely impacts the ChangeDetector: the original order of the revised features results in a report of 2 features updated, whilst reversing the feature order results in them being reported as unchanged (which I consider correct).
Now to work out how to sort each input so that they are in the same order. I need some sort of 'sort by all attributes' (the order itself isn't important, it seems, only that both inputs are sorted in the same way).
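A minimal, untested PythonCaller-style sketch of such a 'sort by all attributes' step (the class name and buffering approach are just placeholders; it assumes the standard PythonCaller template and the fmeobjects feature methods). One copy in front of each ChangeDetector input should give both streams the same deterministic order:

```python
# Minimal PythonCaller sketch: buffer every incoming feature, then emit them
# in a deterministic order derived from all attribute values. Placing one of
# these before each ChangeDetector input gives both streams the same order.
import fmeobjects  # standard PythonCaller template import


class SortByAllAttributes(object):
    def __init__(self):
        self._features = []

    def input(self, feature):
        self._features.append(feature)

    def close(self):
        def sort_key(feature):
            # Stringify every (name, value) pair so mixed types still
            # compare consistently; missing values become 'None'.
            names = sorted(feature.getAllAttributeNames())
            return tuple((name, str(feature.getAttribute(name))) for name in names)

        for feature in sorted(self._features, key=sort_key):
            self.pyoutput(feature)
```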
@ebygomm the other possibility to consider is that it may not be the order of features, but rather that Transformers like Counter etc. can break the workflow's Features out of Bulk Mode. I have noticed some Transformers misbehave in Bulk Mode, and in some of my workflows I have to put in Transformers to break them out of Bulk Mode into the more traditional Feature-by-Feature processing.
To test this, instead of a Sorter, Counter, etc., try putting AttributeKeepers (Select All Attributes) in to pre-process the Features just before they enter the ChangeDetector. AttributeKeeper is guaranteed to split the Features out of Bulk Mode, so this will test whether it is actually Feature order or Bulk Mode that is causing this.
Edit: It doesn't appear to be a Bulk Mode issue; inserting an AttributeKeeper/PythonCaller before the ChangeDetector makes no difference to the output.
A workspace that hopefully demonstrates the issue
change_detection_test.fmwt
Not Bulk Mode then, so something else... Following this thread with interest since I'm moving some of my more mature workflows from pure Inserts to Change Detection and Updates.
To the group as a whole - what should ChangeDetector do if a duplicate key is encountered?
Change detection is usually carried out on a unique identifier for the dataset if you're using the Update Detection Key option, i.e. GUID, OBJECTID, GlobalID, etc.
Should ChangeDetector reject features if a duplicate key is encountered in the 'original' input?
Should ChangeDetector reject features if a duplicate key is encountered in the 'revised' input?
I'd have to think about that. I know what I don't want it to do is report different changes based on the order of features entering; that strikes me as a little unsafe. If it's a requirement that the key attributes are unique for the ChangeDetector to work correctly, then probably features should be rejected on both sides?
Ideally, what I probably want is some sort of child of the Matcher and the ChangeDetector, as I'm not really using the ChangeDetector as intended. I'm not looking for changes but rather to prove things are the same. The Matcher works for this where everything matches, but if things don't match it's a game of spot the difference. The ChangeDetector provides a convenient list of differences.
For my use case, now that I'm aware the ordering is important, I can resolve it by sorting the data instead.
I believe the preferred approach would be a parameter that lets you pick an option.
Like the FeatureMerger has:
- Ignore Duplicate Keys --> Unchanged if the same key(s) exist in the Revised input
- Don't Ignore Duplicate Keys --> first Unchanged, each additional one Added
- Reject Duplicate Keys --> every duplicate key Rejected
@markatsafe I don't think there necessarily has to be a handler for duplicate keys in the 'original' input. For pure change detection to inform an UPDATE or DELETE, you would process all matching Features in the Original input. So if there were 2 matching Features with an identical key, both of them would be detected as potentially needing to be updated, depending on the attribute value comparison options selected. Similarly, if there were 2 non-matching Features with an identical key, both of them would be detected as needing to be deleted. This all matches similar behavior in SQL-based DB transactions, where the change detection can be done either with Primary Key joined to Primary Key, or with Foreign Key (Original) joined to Primary Key (Revised).
The only question comes down to the default behavior where there are duplicates in the Revised input. Some GIS software uses the first feature only by default, and in SQL the result is non-deterministic, although in practice it also tends to be influenced by the order of features, depending on the DB software; it is nonetheless generally treated as a kind of error condition that may or may not work, with unreliable results. Personally I think by default these should be spat out on a Rejected port, and the user should build handlers for this with pre-processing Transformers to make the Revised input have unique keys if they want them used in the comparison. This can be done by e.g. Sorter+DuplicateFilter, StatisticsCalculator, Sampler, etc., depending on what update logic the user needs.
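To illustrate that pre-processing idea, here is a rough plain-Python sketch (the record layout and the 'objectid' key are invented) of keeping the first Revised record per key and setting the extras aside, roughly what a Sorter + DuplicateFilter pair would do:

```python
# Plain-Python sketch of the pre-processing idea: keep the first Revised
# record per key and route the extras aside (roughly Sorter + DuplicateFilter).
revised = [
    {"objectid": 1, "name": "A"},
    {"objectid": 2, "name": "B"},
    {"objectid": 2, "name": "B-duplicate"},
]

seen = set()
unique_revised, duplicates = [], []
for record in revised:
    key = record["objectid"]
    if key in seen:
        duplicates.append(record)      # candidates for a Rejected-style port
    else:
        seen.add(key)
        unique_revised.append(record)  # unique keys go on to change detection

print(unique_revised)  # [{'objectid': 1, 'name': 'A'}, {'objectid': 2, 'name': 'B'}]
print(duplicates)      # [{'objectid': 2, 'name': 'B-duplicate'}]
```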
Comparing as an M:M relationship vs. a 1:1 relationship would solve it, I think. Duplicates in the Original, the Revised, or both would be handled. As it is now, the first match is found and the rest are treated as inserts or deletes.
Probably not a good idea to reject features. It would force users to work around the ChangeDetector.
OK - so quite a wide range of use cases here. Just to clarify - we're only referring to the case where the Update Detection Key Attribute is used.
1. Duplicate on the Original (M:1 match). What should be the outcome?
   1.1. Reject - strict
   1.2. Allow - very lenient - first match wins with the Revised. Match made in the order of arrival in Original/Revised. Duplicate Original flagged as a DELETE
   1.3. Allow - very lenient as 1.2 above, but add a WARN message
   1.4. Other...
2. Duplicate on the Revised (1:M match). What should be the outcome?
   2.1. Reject - strict
   2.2. Allow - very lenient - first match wins with the Revised. Match made in the order of arrival in Original/Revised. Duplicate Revised flagged as an INSERT
   2.3. Allow - very lenient as 2.2 above, but add a WARN message
   2.4. Other...
3. Duplicate on the Original & Revised (M:M match). What should be the outcome?
   3.1. Reject - strict
   3.2. Allow - very lenient - first match wins with each pair of Original/Revised. Matches made in the order of arrival in Original/Revised. 'Spare' Original/Revised flagged as DELETE or INSERT
   3.3. Allow - very lenient as 3.2 above, but add a WARN message
   3.4. Other...
I'm in favour of strict - IDs are unique. We're not treating the Revised input as a history table, I don't think.
Really thinking about it, I'm with strict: reject all duplicates via Rejected port(s). My original thought was about comparing multiple sources to a master source, but that's easily done by stacking ChangeDetectors for each input.
The point is to compare records 1:1. The workaround for duplicates is pretty simple: assign a unique value to each unique feature. Something has to be different, or it's the same record and one needs to be deleted or made unique. For the spatial side of a duplicate, they would either be aggregated together or have a unique ID assigned to each that corresponds to the record in the database, to make them unique.
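As a rough illustration of that workaround (field names and values are invented), a per-key counter can make otherwise duplicate keys unique before they reach the ChangeDetector:

```python
# Invented example of the 'make the key unique' workaround: append a per-key
# counter so each otherwise-duplicate feature carries a distinct key.
from collections import defaultdict

records = [
    {"objectid": 10, "status": "open"},
    {"objectid": 10, "status": "closed"},
    {"objectid": 11, "status": "open"},
]

counts = defaultdict(int)
for record in records:
    counts[record["objectid"]] += 1
    # e.g. '10_1', '10_2', '11_1' - apply the same rule to both inputs
    record["change_key"] = "{}_{}".format(record["objectid"], counts[record["objectid"]])

print([r["change_key"] for r in records])  # ['10_1', '10_2', '11_1']
```

Note that which duplicate receives the _1 vs _2 suffix still depends on feature order, so the same consistent sort would need to be applied to both inputs first.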
@markatsafe, given that the main use case for ChangeDetector is to inform the backend UPDATE SQL statements used by a DB writer, then:
1. "Other". M:1 should be no problem: all M Original records get compared against the 1 Revised record and flagged as Change/No Change. Take a generic scenario where a previous workflow had built a Building Access Log Event table of employees accessing company buildings ("DateTime", "Employee ID", "Employee Name"), where "Employee ID" is obviously non-unique and occurs M times, but your workflow is to detect where an employee has changed their legal name in the HR system and the Event Log table needs to be retrospectively updated with the new Employee Name, using Employee ID as the key to compare M records in the Building Access Log table against 1 record in the HR Employee table (a rough sketch of this scenario follows below).
2. "Reject". I can't see 1:M as a valid use case. Variations I've seen, say in ArcGIS, offer FIRST, LAST, AVERAGE etc. pre-processor options to the user to resolve the M cases, but FME has this capability within its other Transformers. In most cases I think M Revised records will be a workflow design error, and the user needs a clear signal that they exist, so that they either add a pre-processor to resolve the attribute values to a unique Revised key, or add a Connector+Transformer on the Rejected port to handle it post-ChangeDetector.
3. Same case as 2. Default to the Rejected port; the user handles it with pre- or post-Transformers.
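For point 1, a rough paper model of that Building Access Log scenario (all names and values are invented) might look like this: many Original rows per Employee ID, each compared against the single Revised HR row for that ID.

```python
# Invented data modelling the M:1 scenario: many access-log rows share an
# Employee ID, each compared against the single HR row for that ID.
hr_names = {101: "Jane Smith-Jones", 102: "Bob Lee"}   # Revised: one row per key

access_log = [                                          # Original: M rows per key
    {"when": "2024-01-02 08:01", "employee_id": 101, "employee_name": "Jane Smith"},
    {"when": "2024-01-03 08:05", "employee_id": 101, "employee_name": "Jane Smith"},
    {"when": "2024-01-03 08:07", "employee_id": 102, "employee_name": "Bob Lee"},
]

# Every matching Original row whose name differs is flagged for update.
updates = [
    row for row in access_log
    if row["employee_id"] in hr_names
    and row["employee_name"] != hr_names[row["employee_id"]]
]

print(len(updates))  # 2 - both log rows for employee 101 need the new name
```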
I think it should be reject strict, with a flag indicating whether the rejected feature came from the Original or Revised port.
Thanks everyone. Great feedback here. I'll take this back to our development team and see what changes are feasible.