Question

Trying to combine duplicate files from an attachment table



Hello,

I am trying to pull PDF files from an attachment table without ending up with duplicate documents. Neither the attachment table nor the related table has anything I can use as a unique identifier for the documents. I have tried two different methods, and both only partially work.

1) Use an Aggregator grouped by DATA. This works OK, but some of the DATA values differ even though the documents themselves are the same.

2) Use an Aggregator grouped by DATA_SIZE. This works OK as well, but I'm still getting duplicates. I investigated a little, and it looks like even though DATA_SIZE appears the same in the inspector table, some of the sizes are slightly different when the Aggregator runs. When I pull up the properties for each of the duplicate output PDFs, the difference appears to be between the file size and the size on disk.

Any ideas how to get these to aggregate properly?


2 replies


Sounds like you have the attachment PDF file contents in an attribute. Try making a CRC on that attribute -- that should reduce the contents down to a single number. Then use DuplicateFilter on the generated CRC attribute. Is there some reason you need things joined together by an Aggregator?
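For anyone who wants to see the idea outside of a workspace, here is a minimal Python sketch of the same approach: compute a CRC over the blob bytes, then keep only the first record seen for each CRC value. It is not FME code. The DATA key mirrors the attribute named in the question, while feature_id, drop_duplicate_blobs, and the sample records are hypothetical names made up for illustration.

```python
import zlib


def drop_duplicate_blobs(records):
    """Keep the first record for each distinct CRC-32 of its blob bytes."""
    seen = set()
    unique = []
    for record in records:
        # CRC-32 reduces the whole blob to a single 32-bit number.
        crc = zlib.crc32(record["DATA"])
        if crc not in seen:
            seen.add(crc)
            unique.append(record)
    return unique


# Hypothetical sample data: records 1 and 2 carry byte-identical blobs.
records = [
    {"feature_id": 1, "DATA": b"%PDF-1.4 report A"},
    {"feature_id": 2, "DATA": b"%PDF-1.4 report A"},
    {"feature_id": 3, "DATA": b"%PDF-1.4 report B"},
]

unique = drop_duplicate_blobs(records)
print([r["feature_id"] for r in unique])
```

Note that the two sample blobs "report A" and "report B" have the same length, so a size-based key would merge them, while the CRC keeps them apart. The CRC only matches byte-identical content, so two PDFs that look the same but were saved with different internal bytes will still come out as separate records.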


Hi Dale,

Thanks for your response. I didn't know about the CRC. That helps with creating a unique ID. I have been looking at this some more and have determined that aggregating on the actual blob data (or on the CRC) is the better way to go, since the file size can cause some bad matches. Aggregating on the data takes care of quite a few duplicates.

I am aggregating these in order to create a list of all of the features that are linked to the same document. I assign each feature in the list the same ID, which can be used to tie back to one document. This reduces the number of documents by quite a bit; however, I am still left with some duplicate documents in the end. I think at this point I'll just have to open the documents that are the same file size, verify they are the same document, and change the IDs manually.
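The grouping step described above can be sketched roughly as follows in plain Python (again, not FME code). This version swaps the CRC for a SHA-256 digest, which is an assumption on my part rather than what was used in the workspace; a 32-bit CRC can collide by accident across a large set of files, and a longer digest makes that far less likely. The names feature_id, group_features_by_document, and the DOC1-style IDs are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_features_by_document(records):
    """Map a shared document ID to the features whose blobs are byte-identical."""
    groups = defaultdict(list)
    for record in records:
        # Identical bytes always give the same digest, so it works as a group key.
        digest = hashlib.sha256(record["DATA"]).hexdigest()
        groups[digest].append(record["feature_id"])
    # Hand out short sequential IDs in the order each document was first seen.
    return {f"DOC{i}": ids for i, ids in enumerate(groups.values(), start=1)}


# Hypothetical sample data: features 1 and 3 link to the same document.
records = [
    {"feature_id": 1, "DATA": b"%PDF-1.4 report A"},
    {"feature_id": 2, "DATA": b"%PDF-1.4 report B"},
    {"feature_id": 3, "DATA": b"%PDF-1.4 report A"},
]

documents = group_features_by_document(records)
print(documents)
```

Each key in the result stands for one stored document, and its list holds every feature that should tie back to it. Documents that are visually the same but not byte-identical will still land in separate groups, which matches the leftover duplicates that need a manual check.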

Thanks,

Justin
