Solved

How to use Change Detector to identify changes in a file between two datasets

8 years ago
October 4, 2016
9 replies
337 views

johnwk
16 replies

Hello,

I am trying to identify any attribute changes between files in two datasets. The dataset is refreshed every month (approx. 5,000 files), and I want to identify changes between files with the same filename in the updated and previous datasets.

For example, File A has 10 records in Month 1 and 20 records in Month 2. I want to identify how many records have been added or deleted, but the file may not have a unique identifier, so I assume the comparison would have to be based on a count?

I think the ChangeDetector is the way to go, but I am unsure. Can anyone provide some guidance on this?

Best answer by redgeographics

Yes, the ChangeDetector is a good place to start, although there's also an UpdateDetector custom transformer in the FME Hub which could be useful. Input is 2 datasets (old and new). It will compare geometries so a unique identifier is not necessary.

Depending on your data you can consider doing lenient geometry matching (i.e. the order of the coordinates is irrelevant)
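
To illustrate the idea of comparing by geometry when no unique identifier exists, here is a minimal Python sketch. It is not how the ChangeDetector is implemented; the function name `count_point_changes` and the rounding `precision` parameter are assumptions made for this example. It treats each dataset as a multiset of point coordinates and counts what appears only in the new or only in the old set.

```python
from collections import Counter


def count_point_changes(old_points, new_points, precision=6):
    """Count added, deleted and unchanged points without a unique identifier.

    Points are compared purely by their rounded coordinates (hypothetical
    helper, not part of FME).
    """
    def key(pt):
        # Round so that tiny floating-point noise does not register as a change.
        return tuple(round(c, precision) for c in pt)

    old = Counter(key(p) for p in old_points)
    new = Counter(key(p) for p in new_points)

    added = new - old        # present in the new dataset only
    deleted = old - new      # present in the old dataset only
    unchanged = old & new    # present in both

    return {
        "added": sum(added.values()),
        "deleted": sum(deleted.values()),
        "unchanged": sum(unchanged.values()),
    }


month1 = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
month2 = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
print(count_point_changes(month1, month2))
# {'added': 2, 'deleted': 1, 'unchanged': 2}
```

Note that with this approach a point that has moved is reported as one deletion plus one addition, since there is no identifier to link the two positions.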


redgeographics
Celebrity
3643 replies
8 years ago
October 4, 2016

You mention "records", is this a geographic or a non-geographic dataset?

johnwk
Author
16 replies
8 years ago
October 4, 2016

redgeographics wrote:

You mention "records", is this a geographic or a non-geographic dataset?

Sorry, yes it's geographic, so by records I mean geographical points in that file.

redgeographics
Celebrity
3643 replies
Best Answer
8 years ago
October 4, 2016

Depending on your data you can consider doing lenient geometry matching (i.e. the order of the coordinates is irrelevant)
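
As a rough illustration of what "the order of the coordinates is irrelevant" can mean, the Python sketch below contrasts a strict comparison with an order-insensitive one. The function names are hypothetical, and the exact rules FME applies for lenient matching may differ from this simplification.

```python
def strict_match(coords_a, coords_b):
    """Geometries match only if the coordinates are identical and in the same order."""
    return list(coords_a) == list(coords_b)


def lenient_match(coords_a, coords_b):
    """Geometries match if they contain the same coordinates in any order."""
    return sorted(coords_a) == sorted(coords_b)


line_old = [(0, 0), (5, 0), (5, 5)]
line_new = [(5, 5), (5, 0), (0, 0)]  # same line, digitized in reverse

print(strict_match(line_old, line_new))   # False
print(lenient_match(line_old, line_new))  # True
```

Ignoring order entirely is a blunt simplification: two different shapes that happen to share the same set of vertices would also be reported as matching here.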

david_r
8355 replies
8 years ago
October 4, 2016

You could use the Matcher to detect changes between the two datasets, either by attribute, geometry or both. You can analyze the fme_feature_type attribute on the NotMatched port to get the features that do not exist in the other dataset. The Matcher is also really fast.

The FME Hub also has the UpdateDetector, which I think is really useful if you need to differentiate between updated, inserted and deleted records between the datasets. It uses a combination of the ChangeDetector and the Matcher to give a very detailed way to detect any changes. In my experience, however, it is more useful if you have a shared, unique identifier between the datasets.
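
To show why a shared identifier helps when separating updated, inserted and deleted records, here is a small Python sketch of that classification. It is only an illustration of the concept, not the UpdateDetector's implementation; the function `classify_changes` and the `id` key are assumptions for this example.

```python
def classify_changes(old_rows, new_rows, key="id"):
    """Split records into inserted, deleted, updated and unchanged by a shared key."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}

    common = old.keys() & new.keys()
    return {
        "inserted": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        # Same identifier in both datasets, but some attribute value differs.
        "updated": sorted(k for k in common if old[k] != new[k]),
        "unchanged": sorted(k for k in common if old[k] == new[k]),
    }


previous = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
current = [
    {"id": 2, "name": "b"},
    {"id": 3, "name": "C"},
    {"id": 4, "name": "d"},
]
print(classify_changes(previous, current))
# {'inserted': [4], 'deleted': [1], 'updated': [3], 'unchanged': [2]}
```

Without the identifier, record 3 could only be reported as one deletion and one insertion rather than as an update.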

takashi
7715 replies
8 years ago
October 4, 2016

Hi @johnwk, I guess the file name indicates a kind of data, and there are a number of files with the same name, one for each of two or more consecutive months. For example, if data "A" has 10 months of content, you may have 10 files named "A". Is that correct, or are there just two months?

And, do you need to get only the difference of number of records between two adjacent months for each file?

johnwk
Author
16 replies
8 years ago
October 10, 2016

takashi wrote:

And, do you need to get only the difference of number of records between two adjacent months for each file?

Hi @takashi, thanks for your response. The plan is to compare this month's data with the previous month's data (so comparing two datasets in total). For a start, yes, I would like the difference in the number of records in a file between the two months. Some records may have been added to or removed from a file, so I want to know how many have been added or deleted.

Yes, two adjacent months to see what is changing month to month.

itay
Supporter
1441 replies
8 years ago
October 10, 2016

Hi @johnwk, yet another option, and a favorite of mine, is to use the CRCCalculator as shown in the following article:

https://knowledge.safe.com/articles/1124/creating-a-unique-identifier-crccalculator.html

Hope this helps.
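
The checksum approach can be sketched in Python as follows. This is an illustration under assumptions, not the CRCCalculator's actual algorithm: `zlib.crc32` stands in for the checksum, and the helper names and the way features are serialized are made up for this example. Each feature gets a checksum of all its values, which then acts as a substitute identifier when comparing the two datasets.

```python
import zlib
from collections import Counter


def feature_crc(feature):
    """Checksum of all attribute values, independent of attribute order."""
    # Sort the keys so the same feature always serializes the same way.
    payload = "|".join(f"{k}={feature[k]!r}" for k in sorted(feature))
    return zlib.crc32(payload.encode("utf-8"))


def crc_diff(old_features, new_features):
    """Count features whose checksum occurs in only one of the datasets."""
    old = Counter(feature_crc(f) for f in old_features)
    new = Counter(feature_crc(f) for f in new_features)
    return {
        "only_in_new": sum((new - old).values()),  # added or modified
        "only_in_old": sum((old - new).values()),  # deleted or modified
    }


previous = [
    {"x": 1.0, "y": 2.0, "name": "a"},
    {"x": 3.0, "y": 4.0, "name": "b"},
]
current = [
    {"name": "a", "y": 2.0, "x": 1.0},   # same feature, attributes reordered
    {"x": 3.0, "y": 4.0, "name": "B"},   # attribute value changed
    {"x": 5.0, "y": 6.0, "name": "c"},   # new feature
]
print(crc_diff(previous, current))
```

A 32-bit checksum can in principle collide, so two different features may occasionally produce the same value; for a monthly report of counts that risk is usually small, but it is worth keeping in mind.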

takashi
7715 replies
8 years ago
October 10, 2016

johnwk wrote:

Yes, two adjacent months to see what is changing month to month.

Consider a case where there is a single data file per month for multiple consecutive months. If you create a sequence of features, each with an attribute (e.g. "num_records") storing the number of records for that month, the AttributeCreator or AttributeManager with the "Enable Adjacent Feature Attributes" option can be used to calculate the difference in the number of records between the current month and the previous month. e.g.

Note: In the example above, since the "Default Value" is set to 0, a missing attribute is treated as 0, so the result for the first month equals its number of records.

I think this method can also be applied to the case where there are multiple data files per month. The way to create the sequence of features for each kind of data depends on the directory structure, folder/file naming rules, file format, etc.
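
The adjacent-feature calculation described above can be mimicked outside FME with a short Python sketch. The function name `monthly_differences`, the month labels and the counts are invented for illustration; the `default` parameter plays the role of the "Default Value" of 0 mentioned in the note.

```python
def monthly_differences(counts, default=0):
    """Difference in record count between each month and the one before it.

    `counts` is a sequence of (month, num_records) pairs in chronological order.
    The first month has no predecessor, so `default` is used in its place.
    """
    diffs = []
    previous = default
    for month, num_records in counts:
        diffs.append((month, num_records - previous))
        previous = num_records
    return diffs


counts = [("2016-08", 10), ("2016-09", 20), ("2016-10", 15)]
print(monthly_differences(counts))
# [('2016-08', 10), ('2016-09', 10), ('2016-10', -5)]
```

This gives only the net change per month: if five records were added and five deleted in the same month, the difference is 0, so a geometry or attribute comparison is still needed to count additions and deletions separately.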

markatsafe
1891 replies
8 years ago
October 11, 2016

The options described above are summarized in the KnowledgeBase article on Change Detection.
