Question

Read Microsoft Word file

  • March 13, 2017
  • 5 replies
  • 490 views

I have a bunch of Microsoft Word files. I want to count the words in each one and then use the most common word with geographic content to generate a polygon for the area the document is about.

My problem is that I can't read a Word file in FME. I have built the whole workspace to do what I want, but I have about 2,000 Word files and I don't want to convert all of them to .txt.

Does anyone have a solution to this?


5 replies

danilo_fme
Celebrity
  • 2077 replies
  • March 13, 2017

Hi @linhgg2, are you trying to read .doc files?

I have FME Desktop 2017 installed and I didn't see a Reader for Microsoft Word.


erik_jan
Contributor
  • 2179 replies
  • March 13, 2017

There is no MS Word Reader available in FME 2017 yet.

So I do not see any option other than converting to text.


david_r
Celebrity
  • 8392 replies
  • March 13, 2017

If your document is a .docx file, it is actually a zip archive containing several XML files and other parts that you can read with FME. Here's what it might look like when opened in 7zip:

[Screenshot: contents of a .docx archive opened in 7zip]

But I agree that unless you feel adventurous, it is probably easier to convert it to text first.
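For anyone who does feel adventurous, here is a minimal Python sketch of the zip/XML route, which could also be adapted for a PythonCaller. It assumes the main text lives in `word/document.xml` and uses the standard WordprocessingML namespace; the function names `docx_paragraphs` and `word_counts` are made up for illustration, and text in headers, footers, or text boxes is not handled specially.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Standard WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as plain strings."""
    # A .docx is a zip archive; the body text is in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W + "p"):
        # A paragraph is split into runs; the text sits in <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def word_counts(path):
    """Count lower-cased words in the document body (hypothetical helper)."""
    text = " ".join(docx_paragraphs(path)).lower()
    return Counter(re.findall(r"\w+", text))
```

Calling `word_counts("report.docx").most_common(10)` (with a file name of your own) would give the ten most frequent words, and looping over a folder of files with `pathlib.Path(folder).glob("*.docx")` avoids converting each document by hand. This only works for .docx; the older binary .doc format is not a zip archive.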


tim_wood
Contributor
  • 311 replies
  • November 1, 2017

I have the same problem with the Ordnance Survey Local Custodians table:

https://www.ordnancesurvey.co.uk/docs/product-schemas/addressbase-products-local-custodian-codes.zip

I tried using the XML Reader but it won't open the .docx.

Converting the file to text in Word seems to result in the loss of the table structure.

I found that saving the Word doc as HTML worked quite well (although it is still a manual step). Once it is in that format, FME will read it using the HTML Table Reader, and it even strips out the title and text above the table.

I did find that the column headings got treated as a data row. Maybe the headings are formatted as HTML TD tags rather than TH ones. A handy update to the Reader would be to have something similar to CSV where you can specify whether there's a header row. A workaround is to tell the HTML Table Reader to start at feature 2.
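If the manual "save as HTML" step becomes a nuisance, one alternative is to pull the table straight out of the .docx with a short script and hand FME a CSV, where the header row can be specified. The sketch below is an assumption-laden illustration, not something tested against the Ordnance Survey file: it assumes the tables are plain `w:tbl` / `w:tr` / `w:tc` elements in `word/document.xml`, and the names `docx_tables` and `write_csv` are hypothetical.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_tables(path):
    """Return every table in a .docx as a list of rows of cell strings."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # Join the text of each paragraph directly inside the cell.
                paras = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                cells.append(" ".join(paras).strip())
            rows.append(cells)
        tables.append(rows)
    return tables


def write_csv(rows, csv_path):
    """Write one table to CSV; the first row is whatever Word had on top."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

Text above the table is ignored because only `w:tbl` elements are visited. Merged cells and nested tables are not handled specially, so the output is worth checking before relying on it.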


gio
Contributor
  • 2252 replies
  • November 1, 2017

I had some data handed to me in Word not long ago.

I just stuffed it into a txt/csv file and proceeded from there.

Where it had some formatting in Word, I basically used VariableSetters and VariableRetrievers and a lot of regular expressions in StringSearchers, etc.