Solved

How to Find Unicode Characters in Text String

  • August 16, 2024
  • 3 replies
  • 628 views

juliarozema
Contributor

We have some text fields that shouldn’t contain any Unicode (non-ASCII) characters. It turns out at least one has snuck into the data. Is there a way to test for Unicode characters and then remove them?

FME 2022.2.3



3 replies

david_r
Celebrity
  • August 19, 2024

You could try the AttributeEncoder with “Replace invalid characters” = Yes to transform the string to e.g. Latin-1 (or whatever encoding you require), then compare the string before and after to see whether any extended Unicode characters were replaced or removed.
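For anyone who wants to see the before/after comparison outside FME, here is a minimal Python sketch of the same idea. It is only an illustration, not what the AttributeEncoder does internally; the function name `changed_by_encoding` and the sample values are made up for the example.

```python
def changed_by_encoding(text, encoding="latin-1"):
    """Round-trip text through the target encoding and report whether it changed.

    Characters that the encoding cannot represent are replaced with '?',
    so a difference means at least one such character was present.
    """
    round_tripped = text.encode(encoding, errors="replace").decode(encoding)
    return round_tripped != text, round_tripped


# Hypothetical sample values: the first fits in Latin-1, the second does not.
for value in ["caf\u00e9", "a\u2014b"]:
    print(repr(value), changed_by_encoding(value))
```

Note that Latin-1 still allows characters such as é, so if the goal is strictly ASCII you would pass `encoding="ascii"` instead.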


bwn
Evangelist
  • Best Answer
  • August 20, 2024

Since FME uses Perl-style regular expressions, you could use a StringSearcher with a regular expression to find strings that contain a non-ASCII character, along with the character positions of each match.

The RegEx pattern to look for is [^[:ascii:]]
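To illustrate what the pattern matches, here is a small Python sketch. One caveat: Python's `re` module does not support POSIX classes like `[[:ascii:]]`, so the sketch uses the equivalent range `[^\x00-\x7F]`. The helper name `find_non_ascii` and the sample string are assumptions for the example only.

```python
import re

# Same meaning as [^[:ascii:]]: any character outside U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(text):
    """Return (position, character) pairs for every non-ASCII character."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]


# Hypothetical field value with a no-break space hiding after "St".
sample = "Main St\u00a0West"
print(find_non_ascii(sample))
```

An empty list means the string is plain ASCII; otherwise each pair gives the zero-based position and the offending character.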

 





juliarozema
Contributor
  • Author
  • August 22, 2024

Thank you @bwn, this helps us see the special characters that have snuck into our data.
@david_r, thank you for also taking the time to submit an idea. Your idea works in conjunction with @bwn's.

First, use the StringSearcher to find matches for the Unicode characters.
Then use the AttributeEncoder to remove them.
Then use the StringSearcher again (during development) to confirm the Unicode characters are gone.
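The three steps above can be sketched in Python as a rough stand-in for the transformers. This is an assumption-laden illustration rather than the actual workspace: `clean_field` is a hypothetical name, and dropping characters via an ASCII encode is just one way to mimic the removal step.

```python
import re

# Stand-in for the [^[:ascii:]] pattern, written as a range Python's re accepts.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_field(value):
    """Find non-ASCII characters, remove them, then confirm none remain."""
    # Step 1: find the offending characters (the StringSearcher step).
    found = NON_ASCII.findall(value)
    # Step 2: drop anything that cannot be encoded as ASCII (the encoder step).
    cleaned = value.encode("ascii", errors="ignore").decode("ascii")
    # Step 3: re-check during development that nothing slipped through.
    assert NON_ASCII.search(cleaned) is None
    return cleaned, found


# Hypothetical value containing a zero-width space after "Lot".
print(clean_field("Lot\u200b 12"))
```

Returning the list of found characters alongside the cleaned value makes it easy to log which records were affected before the data is overwritten.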

 

Thank you both :)