Question

Compare XML Datasets


Hi,

 

 

I have an interesting analytical question that I am trying to solve in FME. I have two (or more) XML files which I need to compare for differing values in the same datatype and report on them. I have created a CSV file with a vertical data model with the structure:

 

 

PrimaryID, Datafile, AttributeName, AttributeValue

 

 

What I need to accomplish is having FME report

 

 

Where PrimaryID = PrimaryID and Datafile <> Datafile and AttributeName = AttributeName and AttributeValue <> AttributeValue

 

 

Something like this

 

 

PrimaryID, Datafile, AttributeName, AttributeValue
1001, FileA, Engine, 4Cylinder
1001, FileA, Colour, Blue
1001, FileB, Engine, 6 Cylinder
1001, FileB, Colour, Blue
1002, FileA, Engine, 4Cylinder
1002, FileA, Colour, Blue
1002, FileB, Engine, 4Cylinder
1002, FileB, Colour, Blue

 

 

The result would be that FME would report

 

 

PrimaryID, Datafile, AttributeName, AttributeValue
1001, FileA, Engine, 4Cylinder
1001, FileB, Engine, 6 Cylinder

 

 

These values differ even though they relate to the same car (ID 1001): one file reports the car as having a 4-cylinder engine and the second file reports the same car as having a 6-cylinder engine.
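
For anyone who wants to check the expected result outside FME, the rule above can be sketched in plain Python. This is only an illustration, not an FME workflow: the function name `find_conflicts` and the in-memory `rows` list are assumptions made for the example.

```python
from collections import defaultdict

# Rows in the vertical model: (PrimaryID, Datafile, AttributeName, AttributeValue)
rows = [
    ("1001", "FileA", "Engine", "4Cylinder"),
    ("1001", "FileA", "Colour", "Blue"),
    ("1001", "FileB", "Engine", "6 Cylinder"),
    ("1001", "FileB", "Colour", "Blue"),
    ("1002", "FileA", "Engine", "4Cylinder"),
    ("1002", "FileA", "Colour", "Blue"),
    ("1002", "FileB", "Engine", "4Cylinder"),
    ("1002", "FileB", "Colour", "Blue"),
]


def find_conflicts(rows):
    """Return rows for which another file holds a different value
    for the same PrimaryID and AttributeName."""
    # Group rows by the (PrimaryID, AttributeName) pair.
    groups = defaultdict(list)
    for row in rows:
        primary_id, _datafile, name, _value = row
        groups[(primary_id, name)].append(row)

    conflicts = []
    for group in groups.values():
        for row in group:
            # Datafile <> Datafile and AttributeValue <> AttributeValue
            if any(o[1] != row[1] and o[3] != row[3] for o in group):
                conflicts.append(row)
    return conflicts


conflicts = find_conflicts(rows)
for row in conflicts:
    print(", ".join(row))
```

With the sample data this prints only the two Engine rows for ID 1001.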

 

 

At the moment, I can have multiple XML files to read, some of which may be missing attributes, so I would need to detect those as well.
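
A rough Python sketch of the missing-attribute check, again outside FME and only as an illustration. It assumes that every file is expected to report every attribute seen for an ID in any file; `find_missing` and the sample rows are made up for the example.

```python
from collections import defaultdict


def find_missing(rows):
    """Return (PrimaryID, Datafile, AttributeName) triples where a file
    lacks an attribute that some other file reports for the same ID."""
    # Assumption: every file seen in the data should carry every attribute.
    files = sorted({datafile for _pid, datafile, _name, _value in rows})

    # (PrimaryID, AttributeName) -> set of files that report it
    present = defaultdict(set)
    for pid, datafile, name, _value in rows:
        present[(pid, name)].add(datafile)

    missing = []
    for key in sorted(present):
        pid, name = key
        for datafile in files:
            if datafile not in present[key]:
                missing.append((pid, datafile, name))
    return missing


rows = [
    ("1001", "FileA", "Engine", "4Cylinder"),
    ("1001", "FileA", "Colour", "Blue"),
    ("1001", "FileB", "Engine", "6 Cylinder"),  # FileB has no Colour row
]
missing = find_missing(rows)
print(missing)
```

Here FileB would be flagged as missing Colour for ID 1001.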

 

 

My process is to read them in and then use an AttributeExploder to expose all of the XML tags. From there I use a Matcher to match on the PrimaryID values, and so on, to constrain the list to what is not matched. It is at this point that the process begins to fail.

 

 

Any thoughts would be greatly appreciated.

 

 

Thanks,

 

 

Kieren

11 replies

Userlevel 4
Hi,

 

 

While FME is pretty good with XML, personally, I'd look into more specialised tools for this scenario. Here's a discussion about various alternatives (http://blogs.msdn.com/b/dmahugh/archive/2008/06/18/open-xml-diff-tools.aspx).

 

 

David
Userlevel 2
Badge +17
Hi,

 

 

I think PrimaryID + AttributeName can be considered a composite primary key. That is, the key is unique within a dataset.

 

If my understanding is correct, the Matcher transformer can detect value mismatches among features having the same key (ID and attribute).

 

-----
Match Geometry: NONE
Attribute Matching Strategy: Match Selected Attributes
Selected Attributes: PrimaryID AttributeName
Attributes That Must Differ: AttributeValue
-----

 

 

However, it might not be enough if there are 3 or more datasets. In a case such as the following example, the Matcher will not output either FileB or FileC, because they have the same attribute value (6 Cylinder), although FileA will be output.

 

-----
1001, FileA, Engine, 4 Cylinder
1001, FileB, Engine, 6 Cylinder
1001, FileC, Engine, 6 Cylinder
-----

 

 

If you need to get both FileB and FileC in such a case, the FeatureMerger can be used additionally.

 

-----
All the original features --> Requestor
Matched features from the Matcher --> Supplier
Join On: PrimaryID = PrimaryID and AttributeName = AttributeName
-----
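
The intent of the join can be sketched in plain Python. This is not FME code and does not reproduce the Matcher's exact behaviour; it only mirrors the idea of first finding the keys that have conflicting values and then joining all original rows back on PrimaryID and AttributeName. The function name and the extra Colour rows are assumptions added for illustration.

```python
def rows_for_conflicting_keys(rows):
    """Keep every original row whose (PrimaryID, AttributeName) key
    has more than one distinct AttributeValue across the files."""
    # Step 1: collect the distinct values per key.
    values_by_key = {}
    for pid, _datafile, name, value in rows:
        values_by_key.setdefault((pid, name), set()).add(value)
    conflicting = {key for key, values in values_by_key.items() if len(values) > 1}

    # Step 2: join the original rows back on the key.
    return [row for row in rows if (row[0], row[2]) in conflicting]


rows = [
    ("1001", "FileA", "Engine", "4 Cylinder"),
    ("1001", "FileB", "Engine", "6 Cylinder"),
    ("1001", "FileC", "Engine", "6 Cylinder"),
    ("1001", "FileA", "Colour", "Blue"),  # added for illustration
    ("1001", "FileB", "Colour", "Blue"),  # added for illustration
]
selected = rows_for_conflicting_keys(rows)
for row in selected:
    print(", ".join(row))
```

All three Engine rows (FileA, FileB and FileC) come out, while the Colour rows are dropped.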

 

 

Hope this helps,

 

Takashi
Badge +3
 Hi,

 

 

 

You can use a ListBuilder grouped on PrimaryID and AttributeName, then a ListElementCounter. Select the features with an element count > 1 and then use two ListDuplicateRemovers in sequence, one for DataFile and one for AttributeValue (the order does not matter). Then test for the existence of a second record (like _list{1}.AttributeName exists), which should not exist and therefore yields your result. Explode it.

 

 

Zoom in picture to see settings.

 

 

Badge +3
Btw, it is indeed as Takashi says: PrimaryID and AttributeName are used as a key.

 

 

Matcher can be used in this way too.

 

You use the key, then at the Matched output port you add another sequence of Matchers.

 

This time you need to use "_matched_id" and "AttributeValue" as the key, followed by (order does not matter) "_matched_id" and "DataFile". For the latter two you need to use the NotMatched output port.

 

 

Actually, that's even better: you just have 3 Matchers in a row!

 

Userlevel 2
Badge +17
Inspired by Gio's first post. How about this workflow?

 

 

Badge +3
According to Kieren's boolean rule, for this set:

 

 

1001, FileA, Engine, 4 Cylinder
1001, FileB, Engine, 6 Cylinder
1001, FileC, Engine, 6 Cylinder

 

 

Only the first row should pass.

 

 

I have managed to make a workbench that does it correctly:

 

 

 

and the custom transformer that is the core of it:

 

 

 

I tested it with all possible combinations.

 

 

This is a flexible solution; you can add more boolean variables to it without much change needed.

 

 

It looks simple, but it took me a while to find this solution.

 

I discarded 3 or more techniques.

 

 

Greets

 

Badge +3
My other suggestions only worked for the initial example data.

 

My last solution has no such limit.
Apologies for the late reply. This is an unbelievably awesome community. I had been working with an InlineQuerier transformer (actually chaining them) and trying to fight through the SQL.

 

 

But all of your solutions are fantastic. Truly fantastic.

 

 

As my FME skills improve, I certainly hope that I can contribute back (if anyone needs ArcGIS and ArcGIS Mobile help, do let me know, but it's the wrong forum for that)
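
For reference, the SQL route mentioned above can be sketched as a single self-join. The example below runs the query through Python's `sqlite3` module rather than inside FME; the table name `attrs` is made up, and the query has not been tested in an InlineQuerier, so treat it as a starting point only.

```python
import sqlite3

rows = [
    ("1001", "FileA", "Engine", "4Cylinder"),
    ("1001", "FileA", "Colour", "Blue"),
    ("1001", "FileB", "Engine", "6 Cylinder"),
    ("1001", "FileB", "Colour", "Blue"),
    ("1002", "FileA", "Engine", "4Cylinder"),
    ("1002", "FileA", "Colour", "Blue"),
    ("1002", "FileB", "Engine", "4Cylinder"),
    ("1002", "FileB", "Colour", "Blue"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the vertical data model.
conn.execute(
    "CREATE TABLE attrs ("
    "PrimaryID TEXT, Datafile TEXT, AttributeName TEXT, AttributeValue TEXT)"
)
conn.executemany("INSERT INTO attrs VALUES (?, ?, ?, ?)", rows)

# Self-join: same ID and attribute, different file, different value.
query = """
SELECT DISTINCT a.PrimaryID, a.Datafile, a.AttributeName, a.AttributeValue
FROM attrs AS a
JOIN attrs AS b
  ON a.PrimaryID = b.PrimaryID
 AND a.AttributeName = b.AttributeName
 AND a.Datafile <> b.Datafile
 AND a.AttributeValue <> b.AttributeValue
ORDER BY a.PrimaryID, a.AttributeName, a.Datafile
"""
result = conn.execute(query).fetchall()
for row in result:
    print(", ".join(row))
conn.close()
```

The `DISTINCT` matters once there are three or more files, because a row can then pair with several conflicting rows.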
Gio,

 

 

I cannot quite seem to get the top StringConcatenator set up with the query. Is it possible to share your workbench? I really like the use of the ListHistogrammer. This was a completely new transformer to me (I had noticed it, just had never thought about using it).
Gio,

 

 

Another question if you don't mind. In the custom transformer you make mention of an output attribute called Koppeling. Where does this get set?

 

 

Thanks,

 

 

Kieren
Badge +3
Hi Entombed,

 

 

 

The attribute Koppeling is set in both concatenators separately.

 

 

To make this bench, you must first build one custom transformer. When it's finished, you then copy and paste it. Note that the input and output attributes get "locked" at that point: it becomes hard or impossible to change the transformer, and it can only be altered or adapted by reducing it to a single instance.

 

It can be tricky to get this bit set up.

 

 

I will share the bench.

 

 

(btw. Koppeling is Dutch for Coupler, Link, connector. It's the attribute to merge on)
