Solved

Identifying which alphabet a string belongs to

  • June 19, 2020
  • 3 replies
  • 216 views


Hi there,

 

I have a database with a series of different street names from all over the world.

In regions whose local languages use more than one alphabet, the names are mixed: one name may be in Latin and the next in Thai, Arabic, Cyrillic, etc.

What I'd like to do is pretty simple on paper: look at each string in my list and determine which alphabet it uses. A new attribute with a value such as "Latin" or "Japanese" would be enough.

I've tried the CharacterCodeExtractor, but checking only the first character is not always useful, especially in cases like Greek, where some of the letters look the same in both the Greek and Latin alphabets.

Any ideas on this?

 

Thanks a lot!



3 replies

jdh
  • Contributor
  • 2002 replies
  • Best Answer
  • June 19, 2020

I would look into converting the string to Unicode code points via the TextEncoder and then looking up each code to see which Unicode block it falls into. https://www.compart.com/en/unicode/block
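For anyone who wants to try the block lookup outside a workspace (or inside a PythonCaller), here is a minimal Python sketch of the idea. It is not the TextEncoder workflow itself: it uses Python's built-in `ord()` to get code points, and the `BLOCKS` table, the script labels, and the function names `script_of` and `dominant_script` are all assumptions made for illustration. The table covers only a handful of blocks and would need extending from the block list linked above.

```python
from collections import Counter

# Hand-picked subset of Unicode block ranges (inclusive), labelled by script.
# This table is an illustrative assumption, not a complete list of blocks.
BLOCKS = [
    (0x0000, 0x024F, "Latin"),     # Basic Latin through Latin Extended-B
    (0x0370, 0x03FF, "Greek"),     # Greek and Coptic
    (0x0400, 0x04FF, "Cyrillic"),  # Cyrillic
    (0x0590, 0x05FF, "Hebrew"),    # Hebrew
    (0x0600, 0x06FF, "Arabic"),    # Arabic
    (0x0E00, 0x0E7F, "Thai"),      # Thai
    (0x3040, 0x30FF, "Japanese"),  # Hiragana and Katakana
    (0x4E00, 0x9FFF, "CJK"),       # CJK Unified Ideographs (shared by several languages)
    (0xAC00, 0xD7AF, "Korean"),    # Hangul Syllables
]


def script_of(char):
    """Return the label of the block containing char, or "Unknown"."""
    code = ord(char)
    for start, end, label in BLOCKS:
        if start <= code <= end:
            return label
    return "Unknown"


def dominant_script(text):
    """Return the most common label among the letters in text.

    Digits, spaces, punctuation and combining marks are skipped, so a
    house number at the start of a street name does not affect the result.
    """
    counts = Counter(script_of(c) for c in text if c.isalpha())
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for name in ["12 Main Street", "Невский проспект", "ถนนสุขุมวิท"]:
        print(name, "->", dominant_script(name))
```

Counting every letter rather than only the first one sidesteps the look-alike problem: a Greek capital alpha and a Latin A are drawn the same way but have different code points, so each is counted under its own block.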

 


  • 194 replies
  • June 19, 2020

You could download the Unicode dictionary from https://www.unicode.org/reports/tr38/, sample the first 3-4 characters of each attribute value using a set of SubStringExtractors, build the list of codes, then run statistics on them grouped by phrase to find out whether they all come from the same code set.
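A rough Python sketch of the sampling idea, in case it helps. Note the substitutions: instead of a downloaded dictionary and SubStringExtractors, it uses the character names from Python's standard `unicodedata` module and treats the first word of each name (for example `LATIN`, `GREEK`, `CYRILLIC`) as the "code set". That first-word shortcut is a heuristic assumption, and the function names `sample_scripts` and `classify` are made up for this example.

```python
import unicodedata


def sample_scripts(text, n=4):
    """Return the set of labels seen in the first n letters of text.

    The label is the first word of the character's Unicode name, which is
    a heuristic stand-in for a proper script or block lookup.
    """
    letters = [c for c in text if c.isalpha()][:n]
    return {unicodedata.name(c, "UNKNOWN").split()[0] for c in letters}


def classify(text, n=4):
    """Label text by its sampled letters, or "MIXED" if they disagree."""
    labels = sample_scripts(text, n)
    if not labels:
        return "UNKNOWN"   # no letters at all, e.g. only digits
    if len(labels) == 1:
        return next(iter(labels))
    return "MIXED"         # sampled letters come from different sets


if __name__ == "__main__":
    for name in ["Baker Street", "Невский проспект", "42"]:
        print(name, "->", classify(name))
```

Sampling only a few letters keeps the work per string small, at the cost of missing a script change later in the string; raising `n` trades speed for coverage.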

 

 

 

 

Alternatively, if you don't have a billion strings to process, you could simply pass each code to a Unicode lookup site using an HTTPCaller, with a variable as the #code part of the URL, and then parse the information from the response.

 

https://unicodelookup.com/#3937/1

 

 


  • Author
  • 70 replies
  • June 24, 2020


 

Thanks, this approach worked for what I'm looking for!