Solved

How to Find Unicode Characters in Text String

10 months ago
August 16, 2024
3 replies
564 views

juliarozema
Contributor
44 replies

We have some text fields that shouldn’t have any UNICODE characters. Turns out at least one has snuck into the data. Is there a way to test for UNICODE characters and then remove them?

FME 2022.2.3

Best answer by bwn

Since FME uses a Perl implementation of RegEx, you could use a StringSearcher with a RegEx to find strings that contain a non-ASCII character, along with the character positions.

The RegEx pattern to look for is [^[:ascii:]]
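
If you want to sanity-check the same idea outside FME, here is a rough Python sketch. Note that Python's re module does not support POSIX classes like [:ascii:], so this uses the equivalent range [^\x00-\x7F] instead. The function name find_non_ascii is made up for illustration, and positions here are 0-based, which may not match how StringSearcher reports them.

```python
import re

# Equivalent of [^[:ascii:]]: any character outside U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(text):
    """Return (position, character) pairs for every non-ASCII character."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]


# "Caf\u00e9 \u2013 ok" contains an e-acute and an en dash.
print(find_non_ascii("Caf\u00e9 \u2013 ok"))
```

An empty list means the string is pure ASCII.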



david_r
8355 replies
10 months ago
August 19, 2024

You could try the AttributeEncoder with “Replace invalid characters”=Yes to transform the string to e.g. Latin-1 (or whatever you require), then compare the string before and after to see if any extended Unicode characters were replaced/removed.
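
The same encode-then-compare idea can be sketched in Python, purely as an illustration of the logic and not of what the AttributeEncoder does internally. The function name replace_unencodable is hypothetical, and Python substitutes "?" for characters the target encoding cannot represent, which may differ from FME's replacement character.

```python
def replace_unencodable(text, encoding="latin-1"):
    """Round-trip text through `encoding`, replacing what it cannot represent.

    Returns (cleaned_text, changed), where `changed` is True if at least
    one character had to be replaced.
    """
    cleaned = text.encode(encoding, errors="replace").decode(encoding)
    return cleaned, cleaned != text


# The e-acute exists in Latin-1 and survives; the en dash does not.
print(replace_unencodable("Caf\u00e9 \u2013 ok"))
```

Comparing before and after only tells you that something changed, not where, which is why it pairs well with a pattern search.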


bwn
Evangelist
562 replies
Best Answer
10 months ago
August 20, 2024

Gives

juliarozema
Author
Contributor
44 replies
10 months ago
August 22, 2024

Thank you @bwn this helps us see the special characters that have snuck into our data.
@david_ Thank you for also taking the time to submit an idea. Your idea works in conjunction with @bwn.

First use the StringSearcher to find a Match for the UNICODE characters.
Then use the AttributeEncoder to remove them.
Then use the StringSearcher (during development) to confirm the UNICODE characters are gone.

Thank you both :)
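
For anyone who wants to prototype the find / remove / confirm steps above before building the workspace, here is a minimal Python sketch. It assumes plain ASCII is the target and simply drops anything else, which is one possible choice and not necessarily what your AttributeEncoder settings will do. The name clean_field is made up for this example.

```python
import re

# Any character outside the ASCII range U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_field(value):
    """Find, remove, and confirm removal of non-ASCII characters."""
    # Step 1: find the offending characters.
    found = NON_ASCII.findall(value)
    # Step 2: remove them by dropping anything ASCII cannot encode.
    cleaned = value.encode("ascii", errors="ignore").decode("ascii")
    # Step 3: development-time check that nothing slipped through.
    assert NON_ASCII.search(cleaned) is None
    return cleaned, found


# A no-break space (U+00A0) hiding between two words.
print(clean_field("Main St\u00a0East"))
```

Dropping characters can glue words together, as with the no-break space here, so replacing with a visible placeholder may be safer for some fields.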
