Solved

Identifying which alphabet a string belongs to

5 years ago
June 19, 2020
3 replies
172 views

robbie_botha
70 replies

Hi there,

I have a database with a series of different street names from all over the world.

In some regions whose local languages use multiple alphabets, the names are mixed: one name is in Latin, and another in Thai, Arabic, Cyrillic, etc.

What I'd like to do is pretty simple on paper: look at each string in my list and determine which alphabet it uses. The result could be as simple as a new attribute with a value such as "Latin" or "Japanese", and so on.

I've tried the CharacterCodeExtractor, but checking only the first character is not always useful, especially in cases like Greek, where some of the letters appear in both the Greek and Latin alphabets.

Any ideas on this?

Thanks a lot!

jdh
Contributor
1981 replies
Best Answer
5 years ago
June 19, 2020

I would look into converting the string to Unicode code points via the TextEncoder and then looking up each code to see which Unicode block it falls into. https://www.compart.com/en/unicode/block
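
For anyone who wants to try the block-lookup idea outside of a transformer chain (for example in a PythonCaller or a standalone script), here is a minimal sketch in Python. It is an illustration under assumptions, not the exact workflow described above: the `BLOCKS` table is a small hand-picked subset of Unicode block ranges, and the names `BLOCKS`, `block_label` and `dominant_script` are made up for this example. It counts every letter in the string instead of only the first one, then picks the most common label.

```python
from collections import Counter

# Hand-picked subset of Unicode blocks (assumption: extend this for your data).
# Each entry is (first code point, last code point, label).
BLOCKS = [
    (0x0000, 0x024F, "Latin"),      # Basic Latin through Latin Extended-B
    (0x0370, 0x03FF, "Greek"),      # Greek and Coptic
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK"),        # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "Hangul"),     # Hangul Syllables
]


def block_label(ch):
    """Return the label of the block containing ch, or "Other"."""
    cp = ord(ch)
    for first, last, label in BLOCKS:
        if first <= cp <= last:
            return label
    return "Other"


def dominant_script(text):
    """Return the most common block label among the letters in text."""
    # Only letters are counted, so spaces, digits and punctuation are ignored.
    counts = Counter(block_label(ch) for ch in text if ch.isalpha())
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]
```

With this table, `dominant_script("Οδός Ερμού")` returns `"Greek"` and `dominant_script("Main Street")` returns `"Latin"`. Mapping several blocks to one label (for example Hiragana and Katakana to "Japanese") is a choice left to whoever builds the table.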

jlbaker2779
194 replies
5 years ago
June 19, 2020

You could download the Unicode dictionary from https://www.unicode.org/reports/tr38/, sample the first 3-4 characters from any given attribute using a set of SubStringExtractors, build the list of codes, then run statistics on them grouped by phrase to find out whether they all came from the same code set.

Alternatively, if you don't have a huge number of strings to process, you could simply pass each code to a Unicode search using an HTTPCaller, with a variable in place of the #code part of the URL, and then parse the information from the response.
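
A rough sketch of the sampling idea in Python, in case it helps to see it outside of transformers. This is an assumption-laden illustration: instead of a downloaded dictionary it uses the standard `unicodedata` module, and it treats the first word of each character's Unicode name (such as `LATIN` or `CYRILLIC`) as a crude script label. The function name `sample_scripts` and the default sample size of 4 are made up for this example.

```python
import unicodedata


def sample_scripts(text, n=4):
    """Return the set of rough script labels for the first n letters of text."""
    labels = set()
    letters_seen = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        # Heuristic: the first word of the character name, e.g. "GREEK".
        name = unicodedata.name(ch, "UNKNOWN")
        labels.add(name.split()[0])
        letters_seen += 1
        if letters_seen == n:
            break
    return labels
```

If the returned set has exactly one element, the sampled characters agree on a script; more than one element flags a mixed string that may need a closer look.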

https://unicodelookup.com/#3937/1

robbie_botha
Author
70 replies
5 years ago
June 24, 2020

jdh wrote:

I would look into converting the string to unicode points via the TextEncoder and then looking up the code to see what unicode block they fall into. https://www.compart.com/en/unicode/block

Thanks, this approach worked for what I'm looking for!
