Question

Check for non ASCII characters

  • March 25, 2019
  • 5 replies
  • 356 views

david.benoit

I am building a data quality control layer and currently checking data formatting.

As part of the format checking, I would like to ensure the data contains no uncommon characters, which I guess means anything OUTSIDE the normal ASCII character set (0 - 127).

Efficiency would be imperative here considering this is a very small part of a very large processing "layer" and will likely involve checking every character in a potentially very large dataset.

I'm guessing regex would be ideal, but I can't figure out a way to implement it without explicitly listing every single strange character, and I'm unfamiliar with regex as it is...

Any recommendations?

Thanks!

5 replies

ebygomm
Influencer
  • Influencer
  • 3429 replies
  • March 25, 2019

Regex to search for non-ASCII characters:

[^\x00-\x7F]

Efficiency-wise, it might be faster to do this in Python, although it depends on exactly what the outcome needs to be. Do you need to identify the non-ASCII characters themselves, or just the values that contain them?
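For reference, a minimal Python sketch of both outcomes, assuming the values arrive as ordinary `str` objects. The function names are made up for illustration and are not part of any FME API. `str.isascii()` needs Python 3.7 or later.

```python
import re
from typing import List

# Same character class as above: anything outside the 7-bit ASCII range (0-127).
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def has_non_ascii(value: str) -> bool:
    """Return True if the value contains at least one non-ASCII character."""
    return NON_ASCII.search(value) is not None


def find_non_ascii(value: str) -> List[str]:
    """Return the non-ASCII characters in the value, in order of appearance."""
    return NON_ASCII.findall(value)


def has_non_ascii_fast(value: str) -> bool:
    """Yes/no check without regex, using the built-in str.isascii() (Python 3.7+)."""
    return not value.isascii()
```

If only a pass/fail flag per value is needed, the `isascii()` variant avoids the regex engine entirely. If the offending characters have to be reported, `find_non_ascii` returns them. Note that a plain double quote (") is ASCII and is not flagged, whereas typographic "curly" quotes are.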


david.benoit
  • Author
  • 16 replies
  • March 25, 2019


Thanks very much.

I notice it also picks up on double quotes (") but that is probably for the best.

ebygomm
Influencer
  • Influencer
  • 3429 replies
  • March 25, 2019

That regex doesn't match double quotes for me

david.benoit
  • Author
  • 16 replies
  • March 25, 2019

Ah, my mistake. It's actually this character that's getting picked up:

 

as opposed to:

"


david_r
Celebrity
  • 8394 replies
  • March 25, 2019

Concerning performance and regex, it would be interesting to hear from Safe whether the StringSearcher / AttributeValidator compiles the regex once before the first feature, or once per feature. For huge datasets and/or complex regexes, the performance difference can be quite noticeable.

@mark2atsafe ?
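To illustrate the compile-once idea in plain Python (this says nothing about how the StringSearcher or AttributeValidator behave internally), here is a rough sketch that compiles the pattern once and reuses it for every value. `count_flagged` is a hypothetical helper, and `values` stands in for whatever attribute values are being checked.

```python
import re

# Compiled a single time, before any values are processed.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def count_flagged(values):
    """Count how many values contain at least one non-ASCII character."""
    # The precompiled pattern is reused for every value in the loop.
    return sum(1 for value in values if NON_ASCII.search(value))
```

In Python specifically, the `re` module keeps an internal cache of recently compiled patterns, so calling `re.search(pattern, value)` inside the loop does not fully recompile each time, but compiling up front still avoids the per-call cache lookup.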