Solved

Testing dataset quality


Badge

Hello, I have a query please. The attached table is a snapshot (fictional names etc used) of an Excel dataset of 10000 records, where employees of a company are each given a Card to access the office building, and this dataset logs their details each time they have entered the building, by swiping their card at the entrance door, in the past week.

In the ideal world, every employee should have a unique "Employee Number" and also a unique "Card Number." So in this example, John Smith should have an "Employee Number" of 100, and a "Card Number" of 1. However, we can see that he also has a "Card Number" of 2 on two occasions he has entered the building, for whatever reason.

Also, we can see that for Sarah Brown, her "Card Number" should indeed be 5, but for some reason, she has two "Employee Numbers" (400 and 500) when in fact she should just have one "Employee Number" of 400.

What is the best way, using FME, that I can test a dataset like this, so that it "passes through" (to the next stage of the workbench process) each occurrence where an employee has the expected "Employee Number" and expected "Card Number" for each of their records, but also flags up all occurrences where they have either different "Employee Numbers" or different "Card Numbers" ?

Thanks

icon

Best answer by fmelizard 25 July 2017, 19:48

View original

7 replies

Badge +6

Hello @cliff_wade,

 

I think the best place to start would be to use the TestFilter Transformer, this would allow you to do some conditional tests on the data. For example,

 

If FirstName = John

 

And

 

CardNumber != 1

 

Output Port: Fail

 

Else: Unfiltered

 

This would return 2 features out of the fail (can be named something different) port where the name John Smith used an employee card that was not 1. This doesn't seem 100% efficient but is currently the only way I can envision it. If you know the expected values for each employee it should work.

 

I hope that helps.
Userlevel 4
Badge +13

Hi @cliff_wade A Matcher can flag these occurrences for you employee-test.fmw

Badge

Thanks Trentasafe. However, the assumption is that I do not know the Employee Number or Card Number for every employee (let us say there are 500 employees.) So my question is more about how can I run a test where the output will result in every employee having just one Employee Number and one Card Number (as expected) ? In other words, the Employee Number should always have the same Card Number assigned to it (and any instances where this is not the case are flagged up as "errors")

(To DanAtSafe - thanks also for your answer, but unfortunately the version of FME I have does not support the Matcher transformer.)

Any solutions to my query would be much appreciated....

Userlevel 4

Thanks Trentasafe. However, the assumption is that I do not know the Employee Number or Card Number for every employee (let us say there are 500 employees.) So my question is more about how can I run a test where the output will result in every employee having just one Employee Number and one Card Number (as expected) ? In other words, the Employee Number should always have the same Card Number assigned to it (and any instances where this is not the case are flagged up as "errors")

(To DanAtSafe - thanks also for your answer, but unfortunately the version of FME I have does not support the Matcher transformer.)

Any solutions to my query would be much appreciated....

What version of FME do you have that doesn't support the Matcher?

 

It's available on FME Desktop Base version and up since at least FME 2011.
Badge

Yes apologies. When I first opened the workbench from DanAtSafe, I had a message to say my version did not support everything in his workbench, but as you say Matcher is available. It seems Matcher may be the solution, so I will try testing it numerous ways to see the outputs.

Badge

I have done some testing, and the Matcher is doing what I want apart from one thing.

Where there are 2 or more records which have the same Employee Number and Card Number, they are passed through OK. However, where there is only one record (eg just one instance of an employee) then it is passed to the "NotMatched" port. However, I wish all records to be passed through, even if there is just one instance, so that I am left with all records (employees) which always have the same Employee Number and Card Number - even though for some employees they may have say 20 records, whereas others just have one....

I also thought the "NotMatched" port was for any records where either the Employee Number or Card Number differ in any way from those which may have been passed through OK for the same employee ?

Any ideas please ?

Badge

Hi @cliff_wade, I have attached a workspace with 2 matchers in it. The first matches on names. Those are not matched represent a unique instance of a person and can be passed through. Those that do match are connected to another matcher. These are now compared using cardID, empID, and a matchID (generated from the first matcher). Those that match are ok and can be processed as desired. The ones that don't match are ones that have issues with the card or employee number and can be dealt with seprartely. Hope this helps. matcher.fmw

Reply