3 Replies Latest reply on Jan 30, 2013 10:08 PM by Jörg Hoh

    Upload an MS Word doc to WCM and automatically convert it to a node?

    apark2900 Level 1

      Hi,

       

      Does anybody have a solution that lets users upload an MS Word doc to WCM and automatically extracts the text inside it into a node?

        • 1. Re: Upload an MS Word doc to WCM and automatically convert it to a node?
          Sham HC Level 7

          I do not have a sample to share with you. You might have to develop a custom process that extracts the text and creates a node, and then call it from a workflow.

          • 2. Re: Upload an MS Word doc to WCM and automatically convert it to a node?
            Gregor Zurowski Level 1

            CQ5 is bundled with Apache Tika (http://tika.apache.org), a content analysis library that extracts metadata and content from different file formats. CQ5 uses Tika in combination with Lucene for full-text indexing of assets, but you can also use it in your own projects. Extracting text from an MS Word document is fairly simple using Tika and can be achieved with a few lines of code:

             

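              import org.apache.tika.metadata.Metadata;
              import org.apache.tika.parser.ParseContext;
              import org.apache.tika.parser.Parser;
              import org.apache.tika.parser.microsoft.OfficeParser;
              import org.apache.tika.sax.BodyContentHandler;
              import org.xml.sax.ContentHandler;

              [...]

              // wordStream is assumed to be an InputStream over the uploaded Word file
              ContentHandler contentHandler = new BodyContentHandler();
              Metadata metaData = new Metadata();
              // OfficeParser covers the legacy OLE2 formats such as .doc
              Parser parser = new OfficeParser();
              parser.parse(wordStream, contentHandler, metaData, new ParseContext());

              // the handler's toString() returns the extracted body text
              log.debug("content: {}", contentHandler.toString());

              // create a node and populate with the extracted content

              [...]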

             

            Put your content extraction code into a custom process step, e.g. by implementing the com.day.cq.workflow.exec.WorkflowProcess interface. Depending on your use case, you can add the process step to an existing workflow (e.g. DAM Update Asset) or use it in a custom workflow.
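
The "create a node and populate" part depends on your content model, but as a rough sketch: once you have the extracted text as a String, you need a node name and some properties to write. The helper below uses only the JDK (Java 8 or later). The class name, the "text" property name, and the choice to treat the first non-empty paragraph as the title are all my assumptions, not anything CQ5 or Tika prescribes. It also assumes the extracted paragraphs arrive separated by line breaks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps extracted plain text to a node name and properties. */
final class ExtractedTextMapper {

    private ExtractedTextMapper() {}

    /** Splits the text on line breaks and keeps the trimmed, non-empty lines. */
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Builds a lowercase, hyphen-separated name from a title; falls back to "untitled". */
    static String nodeName(String title) {
        String name = title.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")   // collapse anything else into one hyphen
                .replaceAll("^-+|-+$", "");      // drop leading and trailing hyphens
        return name.isEmpty() ? "untitled" : name;
    }

    /** Assumption: first paragraph is the title, the remaining paragraphs are the body. */
    static Map<String, Object> toProperties(String text) {
        List<String> paras = paragraphs(text);
        String title = paras.isEmpty() ? "" : paras.get(0);
        List<String> body = paras.subList(Math.min(1, paras.size()), paras.size());

        Map<String, Object> props = new LinkedHashMap<>();
        props.put("jcr:title", title);
        props.put("text", String.join("\n", body)); // "text" is a made-up property name
        return props;
    }
}
```

In the process step you would then call something like ExtractedTextMapper.toProperties(contentHandler.toString()) and write the entries to a new node through the JCR API (javax.jcr.Node#addNode and #setProperty). Where the node lives and which node type it gets is up to your project.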

             

            If you are using CQ 5.5 and building your project with Maven, add the following dependencies to your POM:

             

              <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-core</artifactId>
                <version>1.0</version>
                <scope>provided</scope>
              </dependency>

              <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-parsers</artifactId>
                <version>1.0</version>
                <scope>provided</scope>
              </dependency>

             

            Hope this helps,

            Gregor

            • 3. Re: Upload an MS Word doc to WCM and automatically convert it to a node?
              Jörg Hoh Adobe Employee

              In addition to Gregor's note:

               

              Be aware that Apache Tika relies on technology based on reverse engineering, so not all features of the MS Office formats are supported, and there is no guarantee that the parsers can cope with a given document at all. I have heard that the latest docx format still causes some problems.
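
One way to act on this caveat is to check which container format an upload uses before you hand it to a parser. Legacy .doc files are OLE2 compound documents and .docx files are ZIP packages, and the two start with different signature bytes. The sketch below uses only the JDK; the class and enum names are hypothetical. Note that a ZIP signature only tells you the file might be a .docx, since any ZIP archive starts the same way.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: guesses the container format from the leading bytes. */
final class WordFormatSniffer {

    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature of OLE2 compound documents (.doc, .xls, .ppt)
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of ZIP archives (.docx is a ZIP package)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    private WordFormatSniffer() {}

    /** Classifies the first len bytes of head. */
    static Format sniff(byte[] head, int len) {
        if (startsWith(head, len, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, len, ZIP_MAGIC)) return Format.ZIP;
        return Format.UNKNOWN;
    }

    /** Peeks at the stream and rewinds it, so the parser still sees the whole file. */
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        in.mark(head.length);
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) break; // end of stream before the buffer was full
            read += n;
        }
        in.reset();
        return sniff(head, read);
    }

    private static boolean startsWith(byte[] data, int len, int[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

With the result you can decide whether to parse the file, route it to a different parser, or reject it with a clear message. If your stream does not support mark/reset, wrapping it in a java.io.BufferedInputStream is the usual workaround.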

               

              kind regards,

              Jörg
