I have a set of .doc files and need to retrieve the XY coordinates from them; I'm attaching an example here. The problem is made worse by the fact that almost everything is in Cyrillic. I tried reading a file as txt and then using StringSearcher, but my regex knowledge is probably not sufficient. Any help would be appreciated.
Hi,

you can use the following regular expressions to decompose your coordinates:

N (\d+)\D(\d+)\D(\d+)\D

E (\d+)\D(\d+)\D(\d+)\D
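
For anyone who wants to try the two patterns outside FME, here is a minimal Python sketch. The `decompose` helper and the sample line are hypothetical: the digits are taken from the attribute values reported later in this thread, but the separator characters are only an assumption about how the source text looks.

```python
import re

# Same patterns as above: three digit groups, each followed by one non-digit character.
N_PATTERN = re.compile(r"N (\d+)\D(\d+)\D(\d+)\D")
E_PATTERN = re.compile(r"E (\d+)\D(\d+)\D(\d+)\D")


def decompose(text):
    """Return (north_parts, east_parts) as tuples of digit strings, or None where absent."""
    n = N_PATTERN.search(text)
    e = E_PATTERN.search(text)
    return (n.groups() if n else None, e.groups() if e else None)


# Hypothetical sample line; the real documents may be laid out differently.
sample = "N 59°33,067' E 152°51,715'"
north, east = decompose(sample)
print(north)  # ('59', '33', '067')
print(east)   # ('152', '51', '715')
```

The parts are kept as strings on purpose, so that a leading zero such as the one in `067` is not lost.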

 

 

Example:

David
Hi David, thanks. The problem is that when I try to read the .doc file as txt, the structure gets corrupted somehow, so the expression you gave me doesn't work. I can send you the .doc file if you could have a look, but I did not find a way to attach it here. Thanks again!
Hi,

I'm interested in this subject.

You can upload the sample file to a service like Dropbox or Google Drive and paste its shared link here. Then we can all look at the file.

Takashi
Iijima-san, thanks for your response and interest. Here you go: https://drive.google.com/file/d/0B4_4CnzRy4CvWF91M00yVmsxc2c/view?usp=sharing
I was able to retrieve parts of the coordinates with this data flow after converting the Word doc to plain text.

The resulting feature contains these attributes; you can then convert them to degrees.

Attribute(encoded: utf-8)         : `_E{0}' has value `152'
Attribute(encoded: utf-8)         : `_E{1}' has value `51'
Attribute(encoded: utf-8)         : `_E{2}' has value `715'
Attribute(encoded: utf-8)         : `_N{0}' has value `59'
Attribute(encoded: utf-8)         : `_N{1}' has value `33'
Attribute(encoded: utf-8)         : `_N{2}' has value `067'

David's regex worked as expected 🙂
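
The conversion to degrees could look roughly like the Python sketch below. It rests on an assumption that is not confirmed anywhere in this thread: that the third part holds the decimal digits of the minutes (so `33` and `067` mean 33.067 minutes), which seems plausible because values like 715 cannot be seconds. The function name `to_decimal_degrees` is hypothetical.

```python
def to_decimal_degrees(degrees, minutes, minute_fraction):
    """Combine degrees, whole minutes and fractional-minute digits into decimal degrees.

    ASSUMPTION: the third part is the decimal fraction of the minutes
    (e.g. '33' and '067' mean 33.067 minutes), not seconds.
    All three arguments are digit strings, so leading zeros are preserved.
    """
    decimal_minutes = float(f"{minutes}.{minute_fraction}")
    return int(degrees) + decimal_minutes / 60.0


# Values taken from the _N{} and _E{} attributes listed above.
lat = to_decimal_degrees("59", "33", "067")
lon = to_decimal_degrees("152", "51", "715")
print(round(lat, 6), round(lon, 6))
```

If the third part turns out to be something else (for example seconds with an implied decimal point), only the line computing `decimal_minutes` would need to change.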
I pasted it into a text file and saved it with the encoding set to Unicode; this keeps the Cyrillic characters.

Then you can use a regexp on it. Even if you don't know what the words mean, you can still search for the word that precedes the coordinate sets.

This will get both coordinates:
(NÂ\\s\\d°??]+)|(E°\\s\\d°??]+)

 

 

I used it in the StringSearcher.

But I had to copy the single and double quote characters from the text. I initially entered them through the keyboard, but then the regexp only matched up to the degree mark. With copy and paste it worked.
Gio,

you can avoid messing with the weird quotation marks etc. by using \D rather than string literals. The \D symbol matches any single non-digit character.
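
To illustrate the difference, here is a small Python sketch that compares a pattern with hard-coded ASCII marks against the \D version. The three sample strings are hypothetical spellings of the same value; which mark characters the real documents use is not known from this thread.

```python
import re

# Three hypothetical spellings of the same value with different mark characters.
variants = [
    "N 59\u00b033'067\"",           # degree sign, ASCII apostrophe and double quote
    "N 59\u00b033\u2032067\u2033",  # degree sign, prime and double prime
    "N 59\u00b033\u2019067\u201d",  # degree sign, typographic (curly) quotes
]

# Hard-codes the ASCII marks, so it depends on exactly which characters were typed.
literal = re.compile(r"N (\d+)°(\d+)'(\d+)\"")
# Accepts any single non-digit character as a separator.
generic = re.compile(r"N (\d+)\D(\d+)\D(\d+)\D")

literal_hits = [bool(literal.search(v)) for v in variants]
generic_hits = [bool(generic.search(v)) for v in variants]
print(literal_hits)  # [True, False, False]
print(generic_hits)  # [True, True, True]
```

The trade-off is that \D is less strict: it will also accept a comma, a space or a letter between the digit groups.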

 

 

David
