Solved

Identifying which alphabet a string belongs to



Hi there,

 

I have a database with a series of different street names from all over the world.

In some regions that use multiple alphabets in their local languages, I have a mix of them, i.e. one name in Latin and another in Thai, Arabic, Cyrillic, etc.

What I'd like to do is pretty simple on paper: look at each string in my list and determine which alphabet it uses. Something as simple as a new attribute indicating the alphabet as "Latin" or "Japanese", and so on.

I've tried the CharacterCodeExtractor, but checking only the first character is not always useful, especially in cases like Greek, where some of the letters exist in both the Greek and Latin alphabets.

Any ideas on this?

 

Thanks a lot!


3 replies

jdh
  • Contributor
  • Best Answer
  • June 19, 2020

I would look into converting the string to Unicode code points via the TextEncoder and then looking up each code to see which Unicode block it falls into: https://www.compart.com/en/unicode/block
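For anyone who wants to try the block lookup outside of a workspace first (or inside a scripted step), here is a minimal Python sketch of the same idea. It is only an illustration: the `BLOCKS` table is a small hand-picked subset of the Unicode blocks, and the function names `block_of` and `dominant_block` are made up for this example, so extend the table with whatever blocks your data needs.

```python
from collections import Counter

# Hand-picked subset of Unicode blocks (start, end, name).
# Assumption: extend this table from the full block list as needed.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]


def block_of(ch):
    """Return the block name for one character, or "Unknown"."""
    code = ord(ch)
    for start, end, name in BLOCKS:
        if start <= code <= end:
            return name
    return "Unknown"


def dominant_block(text):
    """Return the most common block among the letters of the string."""
    # Skip digits, spaces and punctuation, which are shared across scripts.
    counts = Counter(block_of(ch) for ch in text if ch.isalpha())
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for street in ["Main Street", "Москва", "Αθήνα"]:
        print(street, "->", dominant_block(street))
```

Counting every letter rather than only the first one also sidesteps the lookalike problem: a Greek capital alpha and a Latin "A" look the same but have different code points, so they land in different blocks.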

 



You could download the Unicode dictionary from https://www.unicode.org/reports/tr38/, sample the first 3-4 characters of each attribute value using a set of SubStringExtractors, build the list of codes, and then run statistics on them, grouped by phrase, to find out whether they all come from the same code set.
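The sample-and-compare idea can be sketched in Python roughly like this. Note the swap: instead of a downloaded dictionary, this version uses Python's built-in `unicodedata` module and takes the first word of each character's Unicode name (e.g. "GREEK", "CYRILLIC") as a rough script label. That labelling trick and the function names are assumptions for illustration, not something the transformers do for you.

```python
import unicodedata
from collections import Counter


def script_label(ch):
    """Rough script label: the first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def sample_scripts(text, sample_size=4):
    """Count script labels over the first few letters of the string."""
    letters = [ch for ch in text if ch.isalpha()][:sample_size]
    return Counter(script_label(ch) for ch in letters)


def classify(text, sample_size=4):
    """Return one label if the sample agrees, otherwise "Mixed"."""
    counts = sample_scripts(text, sample_size)
    if not counts:
        return "Unknown"   # no letters at all, e.g. only digits
    if len(counts) == 1:
        return next(iter(counts))
    return "Mixed"         # the sampled letters come from several scripts


if __name__ == "__main__":
    for street in ["Main Street", "Москва", "12 \u0391\u03b8\u03ae\u03bd\u03b1"]:
        print(street, "->", classify(street))
```

A "Mixed" result is a hint to sample more characters or inspect that value by hand, since a short sample can straddle two scripts.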

 

 

 

 

Alternatively, if you don't have a billion strings to process, you could simply pass each code to a Unicode lookup site using an HTTPCaller, with a variable as the #code part of the URL, and then parse the information from the response.

 

https://unicodelookup.com/#3937/1
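If a web request per character is too slow, the same kind of lookup can be done offline. The sketch below is a swapped-in alternative rather than the HTTPCaller route: it takes a decimal code such as the 3937 in the link above and asks Python's built-in `unicodedata` module for the character's name and category. The function name `describe_code_point` is hypothetical.

```python
import unicodedata


def describe_code_point(code):
    """Describe a decimal code point using the local Unicode database."""
    ch = chr(code)
    return {
        "code_point": f"U+{code:04X}",                 # conventional hex notation
        "name": unicodedata.name(ch, "<unnamed>"),     # fallback if no name exists
        "category": unicodedata.category(ch),          # e.g. "Ll" for lowercase letter
    }


if __name__ == "__main__":
    print(describe_code_point(3937))
    print(describe_code_point(945))
```

The name usually starts with the script, so parsing it is much the same job as parsing the web page, just without the network round trip.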

 

 


jdh wrote:

I would look into converting the string to Unicode code points via the TextEncoder and then looking up each code to see which Unicode block it falls into: https://www.compart.com/en/unicode/block

 

Thanks, this approach worked for what I'm looking for!

