Identifying from which alphabet a string belongs to.

Question

Hi there,

I have a database with a series of different street names from all over the world.

In some regions that use multiple alphabets in their local languages, I have mixes of this. I.e. one name in Latin, and another in Thai, Arabic, Cyrillic, etc.

What I'd like to do is pretty simply on paper: Look at each string in my list and determine what the alphabet used is. As simple as a new attribute indicating the language as "Latin" or "Japanese", and so on.

I've tried the CharacterCodeExtractor, but checking only the first character is not always useful, especially in cases like greek where some of the letters exist in both Greek and Latin alphabets.

Any ideas on this?

Thanks a lot!

icon

Best answer by jdh 19 June 2020, 17:45

View original

jdh · Accepted Answer

I would look into converting the string to unicode points via the TextEncoder  and then looking up the code to see what unicode block they fall into.  https://www.compart.com/en/unicode/block

jlbaker2779 · Answer

You could download the unicode dictionary from https://www.unicode.org/reports/tr38/, sample the first 3-4 characters from any given attribute using a set of SubStringExtractors, build the list of codes, then run statistics on them grouped by phrase to find out if they all came from the same code set.

Alternately, if you don't have a billion to do you could simply pass the code to a unicode search using a HTTPCaller with a variable as the #code and then parse the information from there.

https://unicodelookup.com/#3937/1

Identifying from which alphabet a string belongs to.

3 replies

Reply

Community Stats

Reply

Community Stats

Sign up

An FME Account is required to contribute

Login to the community

An FME Account is required to contribute

Scanning file for viruses.

This file cannot be downloaded