1 Reply Latest reply on Jun 19, 2006 12:57 PM by Newsgroup_User

    Read contents of Word docs?

    kitty1967
      Hello all:

      I have a directory of Word documents that I need to loop through, reading the contents of each and saving various parts of the text into a database.

      I've used cfdirectory to loop through the directory and then cffile action="read" to read the contents of the file into a variable. However, I have what appears to be binary information stored before and after the text that is saved in the variable specified in the cffile tag.

      How can I get rid of this so that I'm left with just the text contained in the Word file?

      TIA
      Lisa
        • 1. Re: Read contents of Word docs?
          When you read a binary file you get binary data. Word .doc files are
          not text files. If you cannot convert the files to .txt, or at least
          .rtf, you will have to use the Word COM object to parse them. This is
          a very problematic solution because it involves installing MS Word on
          the server. The trouble is that MS Word is not designed to run on a
          server, and both Adobe (née Macromedia) and Microsoft warn against
          doing so.

          If you do go that route, make sure you have good access to the server.
          Any time your code does something that causes MS Word to ask a question
          in a dialog box, the dialog appears on the server's screen and Word
          locks up until somebody sitting at the server answers it. Since Word
          is not a server application, it has no way to send these dialogs to
          clients.

          Now, since you can read some of the text from the binary, you may be able
          to get it out with regex or other string processing, but that does not
          sound like fun to me.
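
          For what it's worth, here is a rough sketch of that string-processing
          idea, written in Python rather than CFML just to show the shape of
          it. It is a heuristic, not a real .doc parser: it assumes the text
          sits in the file either one byte per character or as UTF-16LE, and it
          only keeps plain ASCII characters. The function name
          extract_text_runs and the min_len cutoff are made up for this
          example.

```python
import re


def extract_text_runs(data, min_len=8):
    """Return runs of printable ASCII text found in raw bytes, in file order.

    Heuristic only: looks for text stored one byte per character and for
    text stored as UTF-16LE (each ASCII character followed by a zero byte).
    """
    n = str(min_len).encode("ascii")
    # At least min_len printable ASCII bytes in a row.
    narrow = re.compile(rb"[\x20-\x7e]{" + n + rb",}")
    # At least min_len (printable byte, zero byte) pairs in a row.
    wide = re.compile(rb"(?:[\x20-\x7e]\x00){" + n + rb",}")

    found = [(m.start(), m.group().decode("ascii"))
             for m in narrow.finditer(data)]
    found += [(m.start(), m.group().decode("utf-16-le"))
              for m in wide.finditer(data)]

    # Sort by byte offset so the runs come back in the order they appear.
    return [text for _, text in sorted(found)]
```

          You would read each file in binary mode and pass the bytes in.
          Expect the result to include things that are not body text (font
          names, author fields and the like), and expect accented characters
          to split or drop out of runs, so some filtering afterwards is
          likely still needed.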
