Solved

Unique port of DuplicateFilter returning duplicates

  • 27 February 2024
  • 8 replies
  • 86 views

Badge +5

Hi,

I’ve got a strange issue.

I am using a FeatureMerger to match values from a lookup table.

Then I use a DuplicateFilter to check that all the terms from the lookup table are used. Strangely, with certain non-ASCII characters, duplicates come out of the Unique port!

 

Another strange thing is that these duplicates in the Unique port don't appear when I check for duplicates directly on the lookup table.

So what happens in the FeatureMerger that makes the DuplicateFilter malfunction?

 

Here’s what the Excel lookup table looks like.

 

When I replace the non-ASCII character, the problem doesn't appear.

What should I do?

 

Best answer by tim_bkr 28 February 2024, 14:55

Userlevel 6
Badge +32

Hard to say without data, but it looks like there is a newline in every other attribute? The white space above the text in the cell in the Excel screenshot?

These can be removed using a StringReplacer with regex and \n:
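For readers without the screenshot, the same idea can be sketched outside FME. The snippet below is only a rough Python analogue of a regex replace on `\n`, not the StringReplacer configuration itself; the function name `strip_newlines` is hypothetical, and it also removes a preceding carriage return, which is an assumption about how the Excel cell stores its line breaks.

```python
import re


def strip_newlines(value: str) -> str:
    """Remove line feeds (and an optional preceding carriage return) from a value."""
    return re.sub(r"\r?\n", "", value)


# A value with a stray leading line break compares equal to the clean value
# only after the line break has been stripped.
raw = "\nA (Substance d'origine)"
clean = "A (Substance d'origine)"
print(raw == clean)
print(strip_newlines(raw) == clean)
```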

 

Badge +5


Hi @nielsgerrits ,

Thanks.

No, the only newlines or carriage returns are in cells that contain two lines of text in Excel.

 

Userlevel 6
Badge +32

It is hard to say without data. Can you share an .ffs file with the incorrect output from the DuplicateFilter Unique output port?

Badge +5

Hey @nielsgerrits ,

Sure. Here it is. I’ve provided two examples.

I had to zip it as .ffs isn’t accepted.

Cheers,

Timothée

Userlevel 6
Badge +32

Hey @tim_bkr,

In 20240228_duplicate_in_unique_CATEG_INV.ffs, when I double-click row 4, CATEG_INV has the value

A (Substance d'origine)
B (Structure d'origine)

which is different from the value

A (Substance d'origine)

on row 3.

So these are different, right? Or did I get your question wrong?

Badge +5

You need to sort on either column.

The duplicates are those with the text “C (Caractère spécifique d'origine)”

 

Userlevel 6
Badge +32

Sorry, I misunderstood. I now see the duplicate value “C (Caractère spécifique d'origine)” in the file “20240228_duplicate_in_unique_CATEG_INV.ffs”.

When I use a DuplicateFilter in 2021.2.6 and 2023.2.2, it works fine. Which version of FME do you use?

Badge +5

Hey!

2021.2.0

I solved the problem using an AttributeEncoder.

I find this behavior weird.
To my understanding, this means that:

  1. Two different encodings can come from the same input dataset
  2. These two encodings remain different within an FME data stream (or what do you call data that flows through an FME connector?)
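The thread does not establish exactly how the two values differed. One way two strings can look identical yet compare as different is when they use different Unicode normalization forms, for example a precomposed "è" versus "e" followed by a combining accent. The Python sketch below illustrates that possibility only; it is an assumption, not a confirmed diagnosis of this dataset, and `unique_values` is a hypothetical helper, not an FME function.

```python
import unicodedata


def unique_values(values, normalize=False):
    """Return the first occurrence of each distinct value.

    With normalize=True, values are compared after NFC normalization,
    so visually identical strings collapse into one entry.
    """
    seen = set()
    unique = []
    for value in values:
        key = unicodedata.normalize("NFC", value) if normalize else value
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


# Assumed example: the same label stored once precomposed (NFC)
# and once decomposed (NFD). Both render identically on screen.
composed = "C (Caract\u00e8re sp\u00e9cifique d'origine)"
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)                                      # False
print(len(unique_values([composed, decomposed])))                  # 2
print(len(unique_values([composed, decomposed], normalize=True)))  # 1
```

If something like this is what happened, bringing every value to one consistent representation before the duplicate check would explain why re-encoding the attribute made the duplicates disappear.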

 
