
Hi there,

 

I have a database with a series of different street names from all over the world.

In regions whose local languages use more than one alphabet, the names are mixed: one name may be in Latin, and another in Thai, Arabic, Cyrillic, etc.

What I'd like to do is pretty simple on paper: look at each string in my list and determine which alphabet it uses. A new attribute indicating the alphabet as "Latin" or "Japanese", and so on, would be enough.

I've tried the CharacterCodeExtractor, but checking only the first character is not always useful, especially in cases like Greek, where some of the letters exist in both the Greek and Latin alphabets.

Any ideas on this?

 

Thanks a lot!

I would look into converting the string to Unicode code points via the TextEncoder and then looking up each code to see which Unicode block it falls into. https://www.compart.com/en/unicode/block
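
In case it helps, here is a minimal Python sketch of that lookup, for example for use in a PythonCaller. It is only an illustration under some assumptions: the block table is a small hand-picked subset of the Unicode block ranges, and the names BLOCKS, block_of and dominant_alphabet are made up for this example, so the table would need extending to cover the scripts in your data. It looks at every letter in the string and reports the most common block, rather than relying on the first character alone.

```python
from collections import Counter

# Partial, hand-picked table of Unicode block ranges (inclusive).
# Assumption: extend this with whatever blocks your data needs.
BLOCKS = [
    (0x0000, 0x024F, "Latin"),       # Basic Latin .. Latin Extended-B
    (0x0370, 0x03FF, "Greek"),       # Greek and Coptic
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x30FF, "Kana"),        # Hiragana and Katakana
    (0x4E00, 0x9FFF, "CJK"),         # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "Hangul"),      # Hangul Syllables
]


def block_of(ch):
    """Return the label of the table entry containing ch, or 'Other'."""
    code = ord(ch)
    for start, end, label in BLOCKS:
        if start <= code <= end:
            return label
    return "Other"


def dominant_alphabet(text):
    """Return the most common label among the letters of text."""
    # isalpha() skips digits, spaces, punctuation and combining marks.
    counts = Counter(block_of(ch) for ch in text if ch.isalpha())
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


print(dominant_alphabet("Main Street 12"))  # Latin
print(dominant_alphabet("ถนนสุขุมวิท"))        # Thai
```

Counting all letters instead of one means a stray Latin character inside a Greek or Cyrillic name does not flip the result.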

 


You could download the Unicode dictionary from https://www.unicode.org/reports/tr38/, sample the first 3-4 characters of each attribute value using a set of SubstringExtractors, build the list of codes, and then run statistics on them grouped by phrase to find out whether they all come from the same code set.
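
A rough Python sketch of this sample-and-compare idea, with some assumptions stated up front: instead of a downloaded data file it uses Python's built-in unicodedata module, and it treats the first word of each character's Unicode name (LATIN, GREEK, THAI, ...) as the "code set", which is a heuristic and not an official script property. The function names and the default sample size of 4 are hypothetical.

```python
import unicodedata


def sample_code_sets(text, sample_size=4):
    """Return the set of name prefixes for the first few letters of text."""
    # Only letters are sampled, so house numbers and spaces are ignored.
    letters = [ch for ch in text if ch.isalpha()][:sample_size]
    # Heuristic: the first word of the Unicode character name,
    # e.g. 'GREEK SMALL LETTER THETA' -> 'GREEK'.
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in letters}


def is_single_code_set(text, sample_size=4):
    """True if all sampled letters share one prefix (False if no letters)."""
    return len(sample_code_sets(text, sample_size)) == 1


for name in ["Main Street", "улица Ленина"]:
    print(name, sample_code_sets(name), is_single_code_set(name))
```

Strings where is_single_code_set returns False are the mixed ones worth a closer look, for example by raising the sample size to cover the whole string.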

 

 

 

 

Alternatively, if you don't have a billion strings to process, you could simply pass each code to a Unicode lookup site using an HTTPCaller, with a variable in place of the code after the #, and then parse the information from the response.

 

https://unicodelookup.com/#3937/1

 

 


I would look into converting the string to Unicode code points via the TextEncoder and then looking up each code to see which Unicode block it falls into. https://www.compart.com/en/unicode/block

 

Thanks, this approach worked for what I'm looking for!

