Solved

CSV file with duplicate rows whose data needs to be preserved in new columns.

  • 2 January 2018
  • 7 replies
  • 36 views

Hi all,

I'm new to FME and am trying to deal with a dataset that contains duplicate rows. However, each duplicate carries some information that I need to preserve in new columns.

A simplified version is something like this:

PropertyID | Address       | Contact name  | Contact details
0001       | 1 High Street | Mr Smith      | 01212328000
0001       | 1 High Street | Mr Jones      | 01212328001
0002       | 2 Low Street  | Mr Birmingham | 01212328002
0002       | 2 Low Street  | Mr Warwick    | 01212328003

And what I need to end up with is more like this:

 

PropertyID | Address       | Contact name  | Contact details | Contact name 2 | Contact details 2
0001       | 1 High Street | Mr Smith      | 01212328000     | Mr Jones       | 01212328001
0002       | 2 Low Street  | Mr Birmingham | 01212328002     | Mr Warwick     | 01212328003

It's been suggested to me to use a FeatureMerger after multiple DuplicateFilters, but I can't figure out how to bring each set of filtered duplicates back into a single dataset whilst preserving the data from the duplicate entries.
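Outside FME, the reshape being asked for can be sketched in plain Python (a hypothetical illustration of the logic, not the workspace itself): group rows by PropertyID and spill each extra contact into numbered columns.

```python
import csv
from io import StringIO

# Sample data mirroring the table in the question.
SRC = """\
PropertyID,Address,Contact name,Contact details
0001,1 High Street,Mr Smith,01212328000
0001,1 High Street,Mr Jones,01212328001
0002,2 Low Street,Mr Birmingham,01212328002
0002,2 Low Street,Mr Warwick,01212328003
"""

def widen(rows):
    """Collapse duplicate PropertyIDs, numbering the extra contact columns."""
    merged = {}  # dicts preserve insertion order in Python 3.7+
    for row in rows:
        key = row["PropertyID"]
        if key not in merged:
            merged[key] = dict(row)
        else:
            # "Contact name" itself matches the prefix, so the count starts
            # at 1 and the first duplicate becomes "Contact name 2".
            n = sum(1 for k in merged[key] if k.startswith("Contact name")) + 1
            merged[key][f"Contact name {n}"] = row["Contact name"]
            merged[key][f"Contact details {n}"] = row["Contact details"]
    return list(merged.values())

rows = widen(csv.DictReader(StringIO(SRC)))
```

Because the extra columns are numbered on the fly, this copes with any number of duplicates per property.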

Any help, pointers or example workspaces would be greatly appreciated.

Cheers, Mark


Best answer by fmelizard 2 January 2018, 18:19


7 replies

Userlevel 4
Badge +13

Hi @mrb. I hope the attached workspace template, csvduplicatehandling.fmwt, helps you get started. It reads the CSV file, combines rows that share a PropertyID using a FeatureMerger, then uses a DuplicateFilter to keep only one copy of each merged row.

Userlevel 2
Badge +17

Hi @mrb, if the maximum number of duplicates is known (e.g. 2), the Aggregator (or ListBuilder) and AttributeManager (or AttributeRenamer) might help you.
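Assuming an aggregation step has already built one feature per PropertyID holding a list of its contacts (roughly what the Aggregator or ListBuilder produce), the renaming step for a known maximum could look like this plain-Python sketch (the names below are illustrative, not FME's API):

```python
# Hypothetical sketch of the "known maximum" renaming idea.
MAX_DUPLICATES = 2

def rename_contacts(feature):
    """Map list elements to fixed column names, up to MAX_DUPLICATES."""
    out = {"PropertyID": feature["PropertyID"], "Address": feature["Address"]}
    for i, (name, details) in enumerate(feature["contacts"][:MAX_DUPLICATES]):
        # The first contact keeps the plain column names; later ones
        # get " 2", " 3", ... suffixes.
        suffix = "" if i == 0 else f" {i + 1}"
        out[f"Contact name{suffix}"] = name
        out[f"Contact details{suffix}"] = details
    return out

feature = {"PropertyID": "0001", "Address": "1 High Street",
           "contacts": [("Mr Smith", "01212328000"),
                        ("Mr Jones", "01212328001")]}
row = rename_contacts(feature)
```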

Thank you for your reply @NatalieAtSafe.

 

I am still a little confused, as I can't see how the contact columns from the duplicates are added into the merged row, thus preserving all of the data.

 

It's interesting that you filter the duplicates after the merge, I was looking at the other way round.

 

 

Hi @takashi,

 

Thanks for your reply. I did look at the Aggregator, but the number of duplicates for each row isn't constant, so I couldn't get it to work.

 

Userlevel 4
Badge +13

If there is a maximum number of duplicates that you know ahead of time, then you can rename that maximum number of times in the AttributeManager.

 

 

One could even write a workspace that computes that maximum for you, and then you'd know it ahead of time.
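That pre-computation is easy to sketch outside FME (a hypothetical helper, not an FME transformer): count rows per PropertyID and take the largest count.

```python
from collections import Counter

def max_duplicates(property_ids):
    """Largest number of rows sharing a single PropertyID."""
    return max(Counter(property_ids).values())

# One property appears three times, so the worst case is 3.
worst_case = max_duplicates(["0001", "0001", "0002", "0002", "0002"])
```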

 

 

Having FME do it all automatically is possible, but this type of dynamic output schema (i.e. where the columns are unknown at design time) is a bit trickier. If you really need that then let us know -- a wee bit of Python may be needed, but victory can certainly be attained.
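As a rough illustration of what that Python might do (hypothetical, not taken from any workspace), the dynamic part boils down to discovering the output columns from the widened data itself:

```python
# Hypothetical sketch of the dynamic-schema step: once the rows have
# been widened, the output columns are whatever keys actually occur.
def dynamic_schema(wide_rows):
    """Union of column names across rows, in first-seen order."""
    cols = []
    for row in wide_rows:
        for key in row:
            if key not in cols:
                cols.append(key)
    return cols

schema = dynamic_schema([
    {"PropertyID": "0001", "Contact name": "Mr Smith",
     "Contact name 2": "Mr Jones"},
    {"PropertyID": "0002", "Contact name": "Mr Birmingham"},
])
```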

 

 

Thank you @NatalieAtSafe

 

The list is where I was going wrong - when I popped a list in, all became clear - Thanks. :-)

 

Badge +2

If you don't know the number of contacts at an address, then a little Python might be needed. Also, if this is a large dataset, a full outer join in the FeatureMerger will have a detrimental impact on performance. So either use an Aggregator as suggested by @takashi, or handle the duplicates before the FeatureMerger.

Example workspace (2017): csvduplicatehandling-v01.fmwt

Reply