
Check for non ASCII characters


david.benoit

I am building a data quality control layer and currently checking data formatting.

As part of the format checking, I would like to ensure the data contains no uncommon characters, which I take to mean anything outside the standard ASCII character set (0-127).

Efficiency is imperative here: this is a very small part of a very large processing "layer" and will likely involve checking every character in a potentially very large dataset.

I'm guessing regex would be ideal, but I can't figure out a way to implement it without explicitly listing every single strange character, and I'm unfamiliar with regex as it is...

Any recommendations?

Thanks!


ebygomm
Influencer
  • March 25, 2019

Regex to search for non ASCII

[^\x00-\x7F]

In terms of efficiency, it might be better to do something in Python, although that depends on exactly what the outcome needs to be. Do you need to identify the non-ASCII characters themselves, or just the values that contain them?
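
For the Python route, a minimal sketch of both outcomes might look like the following. It assumes Python 3.7 or later (for `str.isascii`), and the function names `has_non_ascii` and `find_non_ascii` are illustrative only; they are not part of FME or any library.

```python
import re

# Same pattern as the regex above, compiled once at import time.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def has_non_ascii(value):
    """Return True if the value contains any character outside 0-127."""
    # str.isascii() (Python 3.7+) is True for the empty string as well.
    return not value.isascii()


def find_non_ascii(value):
    """Return the non-ASCII characters found in the value, in order."""
    return NON_ASCII.findall(value)
```

If only a pass/fail flag per value is needed, `has_non_ascii` is likely enough; `find_non_ascii` is for reporting which characters were the problem.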


david.benoit

Thanks very much.

I notice it also picks up double quotes (") but that is probably for the best.

ebygomm
Influencer
  • March 25, 2019
That regex doesn't match double quotes for me

david.benoit

Ah, my mistake. It's actually this character that's getting picked up:

 

as opposed to:

"
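
When a character looks like a plain `"` but still gets flagged, it can help to print its code point and Unicode name. The sketch below uses typographic double quotes (U+201C and U+201D) purely as an assumed example of such look-alikes; the helper name `describe_non_ascii` is hypothetical.

```python
import unicodedata


def describe_non_ascii(value):
    """List (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in value
        if ord(ch) > 127
    ]


# Assumed sample: typographic quotes around one word, plain ASCII quotes around another.
sample = "\u201cquoted\u201d and \"plain\""
for _, code_point, name in describe_non_ascii(sample):
    print(code_point, name)
```

The plain ASCII quote (U+0022) is inside the 0-127 range, so it does not appear in the result.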


david_r
Evangelist
  • March 25, 2019

Concerning performance and regex, it would be interesting to hear from Safe whether the StringSearcher / AttributeValidator compiles the regex once before the first feature, or once for each feature. For huge datasets and/or complex regexes the performance difference can be quite noticeable.

@mark2atsafe ?
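
To illustrate the compile-once pattern in general terms, here is a small Python sketch. It says nothing about how the StringSearcher or AttributeValidator behave internally; `count_flagged` is a hypothetical helper, and the list of strings stands in for a stream of feature values.

```python
import re

PATTERN = r"[^\x00-\x7F]"
COMPILED = re.compile(PATTERN)  # compiled once, before the first value


def count_flagged(values):
    """Count the values that contain at least one non-ASCII character."""
    search = COMPILED.search  # bind the method once, outside the loop
    return sum(1 for v in values if search(v) is not None)
```

Python's `re` module keeps an internal cache of compiled patterns, so calling `re.search(PATTERN, v)` per value does not recompile each time, but it still pays a cache lookup on every call that the precompiled object avoids.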

 

