Question

Check for non-ASCII characters

6 years ago
March 25, 2019
5 replies
303 views

david.benoit
16 replies

I am building a data quality control layer and currently checking data formatting.

As part of the format checking, I would like to ensure the data contains no uncommon characters, which I guess means anything OUTSIDE the standard ASCII character set (0 - 127).

Efficiency would be imperative here considering this is a very small part of a very large processing "layer" and will likely involve checking every character in a potentially very large dataset.

I'm guessing regex would be ideal, but I can't figure out a way to implement it without explicitly listing every single strange character, and I'm unfamiliar with regex as it is...

Any recommendations?

Thanks!

ebygomm
Influencer
3241 replies
6 years ago
March 25, 2019

Regex to search for non ASCII

[^\x00-\x7F]

Efficiency-wise, it might be faster to do this in Python, although that depends on exactly what the outcome needs to be. Do you need to identify the non-ASCII characters themselves, or just the values that contain them?
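For reference, a minimal Python sketch of both outcomes, assuming Python 3.7+ (for `str.isascii`) and values that are already decoded strings. The function names `contains_non_ascii` and `find_non_ascii` are made up for illustration and are not part of any FME API.

```python
import re

# Same pattern as above: any single character outside 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def contains_non_ascii(value):
    """Return True if the string holds at least one non-ASCII character."""
    # str.isascii() is a built-in check, so no regex is needed for a yes/no answer.
    return not value.isascii()


def find_non_ascii(value):
    """Return (index, character) pairs for every non-ASCII character."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(value)]
```

If only a pass/fail flag per value is needed, the first function is probably enough; the second is for reporting which characters caused the failure and where they sit in the value.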

david.benoit
Author
16 replies
6 years ago
March 25, 2019

ebygomm wrote:

Regex to search for non ASCII

[^\x00-\x7F]

Thanks very much.

I notice it also picks up on double quotes (") but that is probably for the best.

ebygomm
Influencer
3241 replies
6 years ago
March 25, 2019

david.benoit wrote:

Thanks very much.

I notice it also picks up on double quotes (") but that is probably for the best.

That regex doesn't match double quotes for me

david.benoit
Author
16 replies
6 years ago
March 25, 2019

ebygomm wrote:

That regex doesn't match double quotes for me

Ah, my mistake. It's actually this character that's getting picked up:

”

as opposed to the plain ASCII double quote (").
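The difference between the two quote characters shows up in their code points. A small Python sketch using the standard `unicodedata` module; the helper name `describe` is hypothetical:

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))


straight = "\u0022"  # the plain double quote typed on a keyboard
curly = "\u201d"     # the "smart" closing quote that word processors insert

for ch in (straight, curly):
    # Anything with a code point of 128 or above falls outside ASCII.
    print(describe(ch), "-", "ASCII" if ord(ch) < 128 else "non-ASCII")
```

The straight quote is U+0022 and sits inside the 0 - 127 range, while the curly one is U+201D, which is why `[^\x00-\x7F]` matches only the latter.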

david_r
8317 replies
6 years ago
March 25, 2019

Regarding performance and regex, it would be interesting to hear from Safe whether the StringSearcher / AttributeValidator compiles the regex once before the first feature, or once for each feature. For huge datasets and/or complex regexes the performance difference can be quite noticeable.
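To illustrate the compile-once idea in general terms, here is a rough Python sketch. It says nothing about how the FME transformers behave internally, and the generator name `flag_records` is an assumption made for the example.

```python
import re

# Compiled a single time, before any record is processed.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def flag_records(records):
    """Yield (record_number, value) for each value containing non-ASCII."""
    search = NON_ASCII.search  # bound once, reused for every record
    for number, value in enumerate(records):
        if search(value):
            yield number, value
```

Python's `re` module also caches recently compiled patterns, so the gap between the two approaches may be smaller there than in engines that recompile on every call.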

@mark2atsafe ?
