Question

Read Microsoft Word file

  • 13 March 2017
  • 5 replies
  • 54 views

I have a bunch of Microsoft Word files. I want to count the words in each one and then use the most common word with geographic content to generate a polygon of the area the document is about.

My problem is that I can't read a Word file in FME. I have built the whole workspace to do what I want, but I have about 2,000 Word files and I don't want to convert all of them to txt.

Does anyone have a solution to this?
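For reference, the counting step described above could be sketched in Python roughly like this. It assumes the document text is already available as a string; `PLACE_NAMES` and `most_common_place` are hypothetical names for illustration, and a real gazetteer would replace the tiny set used here.

```python
import re
from collections import Counter

# Hypothetical gazetteer: in practice this would come from a place-name dataset.
PLACE_NAMES = {"stockholm", "uppsala", "malmo"}

def most_common_place(text, place_names=PLACE_NAMES):
    """Return the most frequent word in text that is a known place name, or None."""
    # Split into lowercase words made of letters only (digits and punctuation dropped)
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w in place_names)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The returned name could then be looked up against whatever dataset holds the polygons. Multi-word place names would need a different matching approach than this single-word lookup.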


5 replies

Userlevel 4
Badge +30

Hi @linhgg2, are you trying to read .doc files?

I have FME Desktop 2017 installed and I didn't see a reader for Microsoft Word.

Userlevel 2
Badge +16

There is no MS Word reader available in FME 2017 yet.

So I do not see any option other than converting to text.

Userlevel 4

If your document is a .docx file, it is actually a zip archive containing several XML files and other resources that you can read with FME. You can see this by opening the file in 7zip.

But I agree that unless you feel adventurous, it is probably easier to convert it to text first.
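For the adventurous route, here is a minimal Python sketch of pulling the text out of a .docx without converting it first. It assumes the main body is stored in `word/document.xml` inside the archive, which is the usual layout for .docx files; `docx_to_text` is a hypothetical helper name, and the sketch ignores headers, footers, and footnotes. It will not work for the older binary .doc format.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of a paragraph is split across w:t elements, one per run
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

Something like this could be looped over a folder of files, or possibly adapted to run inside a scripted step in the workspace, so that the 2,000 documents never need to be opened in Word.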

Badge +7

I have the same problem with the Ordnance Survey Local Custodians table:

https://www.ordnancesurvey.co.uk/docs/product-schemas/addressbase-products-local-custodian-codes.zip

I tried using the XML Reader but it won't open the .docx.

Converting the file to text in Word seems to result in the loss of the table structure.

I found saving the Word doc as HTML worked quite well (although still a manual step). Once it is in that format, FME will read it using the HTML Table Reader, and even strips out the title and text above the table.

I did find that the column headings got treated as a data row. Maybe the headings are formatted as HTML TD tags rather than TH ones. A handy update to the Reader would be to have something similar to CSV where you can specify whether there's a header row. A workaround is to tell the HTML Table Reader to start at feature 2.
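Outside FME, the same header-row handling can be sketched in Python with the standard library's `html.parser`. This is only an illustration of the idea, not how the HTML Table Reader works internally: it collects every table row, whether the cells are TD or TH, and treats the first row as the column headings. `TableRows` and `table_records` are hypothetical names, and the sketch assumes the page contains a single, non-nested table.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in the document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse the whitespace and line breaks that Word's HTML export adds
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def table_records(html):
    """Treat the first row as column headings and return the rest as dicts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *data = parser.rows
    return [dict(zip(header, row)) for row in data]
```

Because the first row is always consumed as the header, it does not matter whether Word wrote the headings as TD or TH cells.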

Badge +3

I had some data handed to me in Word not long ago.

I just put it in a txt/csv file and proceeded from there.

Where it was formatted in some way in Word, I basically used VariableSetters and VariableRetrievers and a lot of regular expressions in StringSearchers, etc.
