2 Replies Latest reply on Jun 22, 2013 5:05 PM by mike_ottinger

    CQ5.6 and pptx parsing with Tika

    mike_ottinger

      Hi Folks,

       

      Simple issue really: any PowerPoint document (ppt or pptx) that I attempt to parse with Tika (version 1.3.0.r1436209 according to Felix) within CQ5.6 fails with the following:

       

      Caused by: java.lang.NoClassDefFoundError: org/openxmlformats/schemas/drawingml/x2006/main/CTBlipFillProperties

                at org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator.extractContent(XSLFPowerPointExtractorDecorator.java:152)

                at org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator.buildXHTML(XSLFPowerPointExtractorDecorator.java:79)

                at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:105)

                at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)

                at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)

                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

                at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)

                at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:221)

       

      My code for parsing is pretty straightforward:

       

      AutoDetectParser parser = new AutoDetectParser();

      Metadata metadata = new Metadata();

      metadata.set(Metadata.RESOURCE_NAME_KEY, asset.getName());

      // stream comes from the asset's original rendition stream

      parser.parse(stream, contentHandler, metadata, new ParseContext());

       

      Is there another approach I should take with parsing PPTs?

       

       

      Thanks!

        • 1. Re: CQ5.6 and pptx parsing with Tika
          Jörg Hoh Adobe Employee

          The org.openxmlformats packages are provided not by Tika, but by the "com.adobe.granite.poi" wrapper package. I see it exporting these packages:

           

          org.openxmlformats.schemas.officeDocument.x2006.customProperties,version=1.1.0

          org.openxmlformats.schemas.officeDocument.x2006.extendedProperties,version=1.1.0

          org.openxmlformats.schemas.presentationml.x2006.main,version=1.1.0

          org.openxmlformats.schemas.wordprocessingml.x2006.main,version=1.1.0

           

          It looks like the org.openxmlformats.schemas.drawingml packages are missing. Can you file a Daycare ticket and ask for the inclusion of these packages?

           

          (As a temporary workaround you can create a wrapping OSGi bundle for these libs and deploy it to Felix.)
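
To confirm that a wrapping bundle actually closes the gap, one option is to probe whether the missing class can be loaded before handing documents to Tika. The following is only a minimal sketch: the class name `ClassVisibilityCheck` and the helper `isLoadable` are hypothetical, and it uses nothing beyond the JDK. Inside CQ you would most likely want to pass the class loader of a Tika class rather than your own bundle's loader, since the failing lookup in the stack trace happens inside Tika's code; that choice is an assumption about the deployment, not something the thread confirms.

```java
public class ClassVisibilityCheck {

    /**
     * Returns true if the named class can be loaded through the given loader.
     * The class is not initialized, so no static initializers run.
     */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies cannot be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this loader stands in for the one that Tika's parser uses.
        ClassLoader loader = ClassVisibilityCheck.class.getClassLoader();

        String[] names = {
            // The class named in the NoClassDefFoundError from the stack trace.
            "org.openxmlformats.schemas.drawingml.x2006.main.CTBlipFillProperties",
            // A class that is always present, as a sanity check.
            "java.lang.String"
        };
        for (String name : names) {
            System.out.println(name + " loadable: " + isLoadable(name, loader));
        }
    }
}
```

If the first line still reports false after the wrapper bundle is deployed, the drawingml package is probably not being exported by the wrapper, or not imported by the bundle doing the parsing.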

           

          Jörg

          • 2. Re: CQ5.6 and pptx parsing with Tika
            mike_ottinger Level 1

            Thanks Jörg, I'll get a DayCare ticket out and I'll wrap the artifact myself.

             

            Thanks again!