Skip to main content
Hi there, I thought this seemed simple but i've become a bit stuck:

 

 

I have x features (many millions but im running a testset of 20k) and they all have an Parcel number. Each feature has an associated Category (one of eight) that is made up of 1 to 3 letters.

 

 

Now, each parcel number does not necessarily occur just once; there could be up to 9 instances of each parcel number and each instance may, or may not, have the same Category code. 

 

 

I wanted to test the Category attribute - grouped by Parcel - to see if there was a complete set of Category duplicates ('Yes') or if the Category differed at all ('No') for a given parcel. 

 

 

(Bonus qu: In some cases where there are 6 instances of a Parcel and there are 3 different categories, it is clear to see that the 3 different categories have been duplicated by mistake - because there are associated identical areas - does anyone know how to clean those out??)

 

 

thanks all
Hi,

 

 

The requirement and actual data condition is vague, but the Matcher might help you.

 

If you explain more concretely the condition using simplified example, we can think of more concrete solution.

 

 

Takashi

Sorry for any vagueness, here is a crude example of the data with only the relevant attributes filled in;

 

 

 

 

As you can see, some of the ParcelIDs are unique, that is fine, but some of them exist more than once and, for those that occur more than once, sometimes the Category code is the same across instances and sometimes it is different (ID 112 is the same but you can see 113's are not all the same - i am concerned where they are all the same).

 

 

I have put in a column saying 'duplicate' which is what im aiming to automate across many millions of features.

Actually I have come up with a very crude working using Listbuilder to extract Category grouped by ParcelID then AttributeCreator and a long series of testers and composite testers to see if categories match each other or not. There has to be an easier way as i've used about 12 transformers and still havent got it right!


I thought you are trying to find not-unique pair of Parcel ID and Category code. That is, expected result should be:

 

Unique:

 

111 CA

 

113 GP

 

114 F

 

Duplicated:

 

112 CP

 

113 GT ??

 

115 CP

 

 

But in your example, "113 GT" is not classified as "Duplicated".

 

Could you please define the condition where the pair of Parcel ID and Category should be decided as "Duplicated"?
Hi Takashi, no not duplicated pairs, I am looking for duplicates to be identified when all the parcels of a given ID have the same category. If one instance of a parcel ID has a different category to the others, then it is not duplicate.
OK. I think this workflow would be one of ways to do that.

 

 

1) Aggregator_1 counts number of records for every identical ParcelID.

 

-----

 

Group By: ParcelID

 

Mode: Attributes Only

 

Count Attributes: _num1

 

 

2) Sampler selects records having unique combination of ParcelID and Category.

 

-----

 

Group By: ParcelID, Category

 

Sampling Type: First N Features

 

Sampling Amount: 1

 

 

3) Aggregator_2 counts number of the combinations for every identical ParcelID. If all the records of an identical PracelID has the same Category code, _num2 will be 1.

 

-----

 

Group By: ParcelID

 

Mode: Attributes Only

 

Count Attributes: _num2

 

 

4) FeatureMerger merges _num1 and _num2 to original records.

 

-----

 

Join On: ParcelID

 

Feature Merge Type: Attribute Only

 

Process Duplicate Suppliers: Yes  *this is important

 

 

5) AttributeCreator creates "Duplicate" attribute. If _num1 is greater than 1 and also _num2 is equal to 1, set "yes". Otherwise set "no".

 

-----

 

Attribute Name: Duplicate

 

Value: 2 Possible Values *use Conditional Value setting

 

If 1 < @Value(_num1) AND @Value(_num2) = 1  Then  yes

 

Else  no
The workflow can be replaces with an InlineQuerier.

 

SQL Query Example:

 

-----

 

select *,

 

case

 

  when 1 < _num1 and _num2 = 1 then 'yes'

 

  else 'no'

 

end as Duplicate

 

from Parcel as a

 

inner join (select ParcelID, count(*) as _num1

 

  from Parcel group by ParcelID) as b

 

  on b.ParcelID = a.ParcelID

 

inner join (select ParcelID, count(*) as _num2

 

  from (select distinct ParcelID, Category from Parcel) group by ParcelID) as c

 

  on c.ParcelID = a.ParcelID

 

-----

 

 


Thanks very much Takashi, i'll give it a go today 
At first inspection the workflow works correctly, thank you very much indeed. S.

Reply