Solved

Unique port of DuplicateFilter returning duplicates

  • February 27, 2024
  • 8 replies
  • 252 views

tim_bkr
Participant

Hi,

I’ve got a strange issue.

I am using a FeatureMerger to match values from a lookup table.

Then I use a DuplicateFilter to check that all the terms from the lookup table are used. Strangely, with certain non-ASCII characters, duplicates come out of the Unique port!

 

Another strange thing is that these duplicates in the Unique port don't appear when I check for duplicates directly on the lookup table.

So what happens in the FeatureMerger that makes the DuplicateFilter malfunction?

 

Here’s what the Excel lookup table looks like.

 

When I replace the non-ASCII character, the problem doesn't appear.

What should I do?

 



8 replies

nielsgerrits
VIP

Hard to say without data, but it looks like there is a newline in every other attribute? The white space above the text in the cell in the Excel screenshot?

These can be removed using a StringReplacer with regex and \n:
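For readers who want to see the idea outside FME, here is a minimal Python sketch of the same regex approach. The function name `strip_newlines` is made up for illustration, and matching `\r` as well as `\n` is an added assumption (Excel cells can carry either); it is not a description of how the StringReplacer works internally.

```python
import re


def strip_newlines(value: str) -> str:
    """Remove every carriage return and line feed from a string."""
    # [\r\n]+ matches one or more CR/LF characters in a row.
    return re.sub(r"[\r\n]+", "", value)


# Hypothetical cell value with a stray leading newline.
cell = "\nA (Substance d'origine)"
print(repr(strip_newlines(cell)))
```

Note that this also joins the lines of cells that intentionally hold two lines of text, so it only suits values where newlines are never meaningful.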

 


tim_bkr
Participant
  • Author
  • Participant
  • February 27, 2024


Hi @nielsgerrits ,

Thanks.

No, the only newlines or carriage returns are in Excel cells that contain two lines of text.

 


nielsgerrits
VIP

It is hard to say without data. Can you share a .ffs with the incorrect output from the DuplicateFilter's Unique output port?


tim_bkr
Participant
  • Author
  • Participant
  • February 28, 2024

Hey @nielsgerrits ,

Sure. Here it is. I’ve provided two examples.

I had to zip it as .ffs isn’t accepted.

Cheers,

Timothée


nielsgerrits
VIP

Hey @tim_bkr,

In 20240228_duplicate_in_unique_CATEG_INV.ffs, when I double-click row 4, CATEG_INV has the value

A (Substance d'origine)
B (Structure d'origine)

which is different from the value

A (Substance d'origine)

on row 3.

So these are different, right? Or did I get your question wrong?


tim_bkr
Participant
  • Author
  • Participant
  • February 28, 2024

You need to sort on either column.

The duplicates are those with the text “C (Caractère spécifique d'origine)”
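One way to check whether two values that look identical really are identical is to list their code points. The Python sketch below does that for two made-up spellings of "Caractère": one with a precomposed "è" and one with "e" followed by a combining accent. That the values in the .ffs differ in exactly this way is an assumption, not something confirmed in this thread, and `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(value: str) -> list:
    """Return one 'U+XXXX NAME' line per code point in the string."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in value]


# Two strings that render the same but are stored differently (illustrative data).
composed = "Caract\u00e8re"      # precomposed e with grave accent
decomposed = "Caracte\u0300re"   # plain e followed by a combining grave accent

print(composed == decomposed)    # the two strings do not compare equal
for line in describe(decomposed):
    print(line)
```

If the two lists of code points differ, a plain string comparison treats the values as distinct even though they display identically.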

 


nielsgerrits
VIP

Sorry, I misunderstood. I now see the duplicate value “C (Caractère spécifique d'origine)” in the file “20240228_duplicate_in_unique_CATEG_INV.ffs”.

When I use a DuplicateFilter in 2021.2.6 and 2023.2.2, it works fine. Which version of FME are you using?


tim_bkr
Participant
  • Author
  • Participant
  • Best Answer
  • February 28, 2024

Hey!

2021.2.0

I solved the problem using an AttributeEncoder.

I find this behavior weird.
To my understanding, it means that:

  1. Two different encodings can come from the same input dataset
  2. These two encodings remain different within an FME data stream (or what do you call data that flows through an FME connector?)
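
The general effect behind those two points can be sketched in plain Python: bring every value to one common representation before comparing it. This is only an illustration under the assumption that the values differ in Unicode normalization form. It does not show what the AttributeEncoder does internally, and `unique_values` is a hypothetical helper.

```python
import unicodedata


def unique_values(values, form="NFC"):
    """Return the distinct values, comparing them after Unicode normalization."""
    seen = set()
    unique = []
    for value in values:
        key = unicodedata.normalize(form, value)  # one canonical representation
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Illustrative data: the same text stored in precomposed and decomposed form.
values = [
    "C (Caract\u00e8re sp\u00e9cifique d'origine)",
    "C (Caracte\u0300re spe\u0301cifique d'origine)",
]
print(len(set(values)))            # 2: a plain comparison sees two values
print(len(unique_values(values)))  # 1: after normalization they coincide
```

A duplicate check that compares raw strings would pass both values as unique, which matches the symptom described in the question.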