
I am building a data quality control layer and currently checking data formatting.

As part of the format checking, I would like to ensure none of the data contains uncommon characters, which I guess means anything OUTSIDE the normal ASCII character set (0 - 127).

Efficiency is imperative here: this is a very small part of a very large processing "layer" and will likely involve checking every character in a potentially very large dataset.

I'm guessing regex would be ideal, but I can't figure out a way to implement it without explicitly listing every single strange character, and I'm unfamiliar with regex as it is...

Any recommendations?


Thanks!

Regex to search for non ASCII

[^\x00-\x7F]

Efficiency-wise it might be more efficient to do something in Python, although it depends on exactly what the outcome needs to be. Do you need to identify the non-ASCII characters themselves, or just identify the values that contain them?
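To illustrate the difference between the two outcomes, here is a minimal Python sketch (function names are my own, not from any particular product). It compiles the `[^\x00-\x7F]` pattern once and reuses it, and also shows `str.isascii()` (Python 3.7+), which answers the yes/no question without a regex at all:

```python
import re

# Compile the pattern once; reuse it for every value checked.
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def has_non_ascii(value: str) -> bool:
    """Yes/no: does this value contain any character outside ASCII 0-127?"""
    return NON_ASCII.search(value) is not None

def find_non_ascii(value: str) -> list:
    """Report the offending characters themselves, if you need them."""
    return NON_ASCII.findall(value)

def has_non_ascii_fast(value: str) -> bool:
    """Same yes/no check via str.isascii(), which runs in C and is
    typically faster than the regex when you only need a boolean."""
    return not value.isascii()
```

For example, `has_non_ascii("café")` is `True` and `find_non_ascii("café")` returns `['é']`, while a value made only of plain ASCII passes cleanly.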



Thanks very much.

I notice it also picks up on double quotes (") but that is probably for the best.
That regex doesn't match double quotes for me

Ah, my mistake. It's actually a typographic ("curly") quote character that's getting picked up, as opposed to the plain ASCII double quote:

"

Concerning performance and regex, it would be interesting to hear from Safe whether the StringSearcher / AttributeValidator compiles the regex once before the first feature, or recompiles it for each feature. For huge datasets and/or complex regexes the performance difference can be quite noticeable.
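I don't know how those transformers handle this internally, but the compile-once vs. compile-per-feature distinction can be sketched in Python (a hypothetical illustration, not FME's actual implementation). Note that Python's `re` module caches recently used patterns, so the per-call penalty there is smaller than in an engine with no such cache:

```python
import re

values = ['record-%d' % i for i in range(1000)]  # stand-in dataset

def per_call():
    # Pattern string passed to re.search on every value; without an
    # internal cache this would recompile the regex each time.
    return [re.search(r'[^\x00-\x7F]', v) for v in values]

# Compile once, before the loop, then reuse the compiled object.
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def compile_once():
    return [NON_ASCII.search(v) for v in values]
```

Both functions return the same matches; timing them with `timeit` on a large dataset shows the gap, which grows with pattern complexity.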

@mark2atsafe ?