
Hello,

I am trying to identify attribute changes between files in two datasets. The dataset is refreshed every month (approx. 5,000 files), and I want to identify changes between files with the same filename in the updated and previous datasets.

For example, File A has 10 records in Month 1 and 20 records in Month 2. I want to identify how many records have been added or deleted, but the file may not have a unique identifier, so I assume it would have to be based on a count?

I think the ChangeDetector is the way to go, but I am unsure. Can anyone provide some guidance on this?

You mention "records", is this a geographic or a non-geographic dataset?

 

 



Sorry, yes it's geographic, so by records I mean geographical points in that file.

 

 


Yes, the ChangeDetector is a good place to start, although there's also an UpdateDetector custom transformer on the FME Hub which could be useful. The input is two datasets (old and new). It compares geometries, so a unique identifier is not necessary.

Depending on your data, you can consider lenient geometry matching (i.e. the order of the coordinates is irrelevant).
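
To illustrate the idea of comparing by geometry without a unique identifier, here is a minimal Python sketch. It is not how the ChangeDetector is implemented; it only shows the principle. The function names, the 6-decimal rounding tolerance and the sample coordinates are all assumptions made for the example.

```python
from collections import Counter

def geometry_key(coords, lenient=False):
    """Build a hashable key from a feature's coordinates.

    coords is a sequence of (x, y) tuples; a point feature has one tuple.
    With lenient=True the order of the coordinates is ignored.
    The rounding to 6 decimals is an assumed tolerance, not an FME setting.
    """
    pts = [(round(x, 6), round(y, 6)) for x, y in coords]
    return tuple(sorted(pts)) if lenient else tuple(pts)

def count_changes(old_features, new_features, lenient=False):
    """Return (added, deleted, unchanged) counts based on geometry only."""
    old = Counter(geometry_key(f, lenient) for f in old_features)
    new = Counter(geometry_key(f, lenient) for f in new_features)
    added = sum((new - old).values())      # geometries only in the new file
    deleted = sum((old - new).values())    # geometries only in the old file
    unchanged = sum((old & new).values())  # geometries present in both
    return added, deleted, unchanged

# Hypothetical point data for the same file in two months
month1 = [[(0.0, 0.0)], [(1.0, 1.0)], [(2.0, 2.0)]]
month2 = [[(0.0, 0.0)], [(2.0, 2.0)], [(3.0, 3.0)], [(4.0, 4.0)]]
print(count_changes(month1, month2))  # (2, 1, 2)
```

Using a multiset (Counter) rather than a plain set means duplicate points in a file are still counted individually.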


You could use the Matcher to detect changes between the two datasets, either by attribute, geometry or both. You can analyze the fme_feature_type attribute on the NotMatched port to get the features which do not exist in the other dataset. The Matcher is also really fast.

The FME Hub also has the UpdateDetector, which I think is really useful if you need to differentiate between updated, inserted and deleted records between the datasets. It uses a combination of the ChangeDetector and the Matcher to give a very detailed way to detect changes. In my experience, however, it is more useful if you have a shared, unique identifier between the datasets.
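
For the case where a shared, unique identifier does exist, the inserted/updated/deleted distinction can be sketched in plain Python as below. This is only an illustration of the classification logic, not the UpdateDetector itself; the "id" and "name" attributes and the sample records are hypothetical.

```python
def classify_changes(old_records, new_records, key="id"):
    """Classify records as inserted, deleted or updated by a shared key.

    Each record is a dict; `key` names the (assumed) unique identifier.
    """
    old = {r[key]: r for r in old_records}
    new = {r[key]: r for r in new_records}
    inserted = sorted(new.keys() - old.keys())  # ids only in the new data
    deleted = sorted(old.keys() - new.keys())   # ids only in the old data
    # ids in both datasets whose other attributes differ
    updated = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"inserted": inserted, "deleted": deleted, "updated": updated}

old_data = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
new_data = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}, {"id": 4, "name": "d"}]
print(classify_changes(old_data, new_data))
# {'inserted': [4], 'deleted': [3], 'updated': [2]}
```

Without such an identifier, an updated record cannot be told apart from one deletion plus one insertion, which is why the identifier makes the result more detailed.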


Hi @johnwk, I guess the file name indicates a kind of data, and there are a number of files with the same name, one for each of two or more consecutive months. That is, if data "A" has 10 months of content, you may have 10 files named "A". Is that correct, or are there just two months?

And do you only need the difference in the number of records between two adjacent months for each file?



Hi @takashi, thanks for your response. The plan is to compare this month's data with the previous month's data (so comparing two datasets in total). For a start, yes, I would like the difference in the number of records in a file between the two months. Some records may have been added to a file, so I want to know how many have been added or deleted.

 

Yes, two adjacent months to see what is changing month to month.

 


Hi @johnwk, yet another option, and a favorite of mine, is to use the CRCCalculator, as shown in the following article:

https://knowledge.safe.com/articles/1124/creating-a-unique-identifier-crccalculator.html
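
The general idea of a checksum per record can be sketched in Python with the standard zlib.crc32 function. This is a rough analogue under my own assumptions, not the CRCCalculator's actual behaviour: the way the record is serialized, the function names and the sample records are all made up for illustration.

```python
import zlib
from collections import Counter

def record_crc(record):
    """CRC-32 of a record's values, taken in sorted attribute-name order.

    The "name=value|..." serialization is an assumption for this sketch.
    """
    text = "|".join(f"{name}={record[name]}" for name in sorted(record))
    return zlib.crc32(text.encode("utf-8"))

def crc_diff(old_records, new_records):
    """Return (added, deleted) counts by comparing per-record checksums."""
    old = Counter(record_crc(r) for r in old_records)
    new = Counter(record_crc(r) for r in new_records)
    return sum((new - old).values()), sum((old - new).values())

old_points = [{"x": 1.0, "y": 2.0, "name": "a"},
              {"x": 3.0, "y": 4.0, "name": "b"}]
new_points = [{"x": 1.0, "y": 2.0, "name": "a"},
              {"x": 3.0, "y": 4.0, "name": "B"},   # attribute changed
              {"x": 5.0, "y": 6.0, "name": "c"}]   # new point
added, deleted = crc_diff(old_points, new_points)
print(added, deleted)
```

Note that a record whose attributes changed shows up as one deleted checksum plus one added checksum, since there is no identifier to link the two versions.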

Hope this helps.



Consider a case where there is a single data file per month for multiple consecutive months. If you create a sequence of features, each with an attribute (e.g. "num_records") storing the number of records for that month, the AttributeCreator or AttributeManager with the "Enable Adjacent Feature Attributes" option can be used to calculate the difference in the number of records between the current month and the previous month.

Note: If the "Default Value" is set to 0, a missing attribute is treated as 0, so the result for the first month will be equal to its number of records.
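
The adjacent-feature calculation described above can be sketched outside FME as follows. This is just an illustration of the arithmetic, with a default of 0 standing in for the missing previous value; the function name and month labels are hypothetical.

```python
def monthly_differences(counts, default=0):
    """Difference in record count between each month and the previous one.

    counts is a list of (month_label, num_records) in chronological order.
    The first month has no predecessor, so `default` is used instead.
    """
    results = []
    previous = default
    for month, num_records in counts:
        results.append((month, num_records - previous))
        previous = num_records
    return results

# File A: 10 records in month 1, 20 in month 2, 15 in month 3
file_a = [("month1", 10), ("month2", 20), ("month3", 15)]
print(monthly_differences(file_a))
# [('month1', 10), ('month2', 10), ('month3', -5)]
```

As in the note above, the first result equals the first month's record count because the missing previous value is treated as 0.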

 

I think this method can also be applied to the case where there are multiple data files per month. The way to create sequences of features for each data depends on the directory structure, folder/file naming rules, file format, etc.
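
For multiple files per month, one way to picture the pairing is a lookup by filename. The sketch below assumes you have already obtained a record count per file for each month (how you get those counts depends on the folder structure and format, as noted); the function name and filenames are hypothetical.

```python
def per_file_differences(previous_counts, current_counts):
    """Per-file change in record count between two months.

    Both arguments map filename -> number of records. A file missing
    from one month is treated as having 0 records in that month.
    """
    names = sorted(previous_counts.keys() | current_counts.keys())
    return {n: current_counts.get(n, 0) - previous_counts.get(n, 0)
            for n in names}

previous_month = {"A": 10, "B": 5}
current_month = {"A": 20, "C": 7}
print(per_file_differences(previous_month, current_month))
# {'A': 10, 'B': -5, 'C': 7}
```

A net count only shows the balance of additions and deletions; a file with 3 points added and 3 deleted would report 0, so a geometry or attribute comparison is still needed to see those changes.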

The options described above are summarized in the KnowledgeBase article on Change Detection.

