4 Replies Latest reply: Feb 21, 2012 8:46 PM by Excel Soft

    Extracting the PDF contents

    Excel Soft Community Member

      Hello,

       

           I have read in one of the documents that PDF contents can be extracted using an accessibility plug-in in the library AcrobatAccess.lib. I searched for this library but could not find it. I have the following questions:

       

      1. In one of the posts I read that we need to contact the dev center for the library. Is it licensed if the purpose of use is something other than screen readers?

      2. Is it possible to access each and every bit of information in the PDF?

      3. I need to convert PDF to EPUB. Is there any plug-in available for such a conversion?

      4. Where can I get the SDK, along with AcrobatAccess.lib, to develop an application for PDF information extraction?

       

      Regards,

      Excel

        • 1. Re: Extracting the PDF contents
          lrosenth Adobe Employee

          I don’t know who told you specifically about the Accessibility plugin…

           

          But yes, you can write your own plugin for Acrobat (in C/C++) that extracts the contents of a PDF by iterating over all the objects. You will need a copy of Adobe Acrobat (NOT READER!) and the Acrobat SDK to do this.

          • 2. Re: Extracting the PDF contents
            Excel Soft Community Member

            Hello,

             

                 I am reading a chapter titled "Reading PDF files through the DOM Interface" in the document named "Reading PDF through Accessibility Interfaces". I am pasting a paragraph from this chapter below.

             

                 "Acrobat 6.0 and higher defines a document object model(DOM) that provides more complete access to the document structure than the MSAA interface. The Accessibility plug-in defines and exports five COM interfaces in AcrobatAccess.lib that exposes Acrobat's document hierarchy"

             

                 1. Please comment on the above.

                 2. One more question: is this the same DOM you are proposing to use in C/C++ to extract the content?

                

                     

            Regards,

            Excel

            • 3. Re: Extracting the PDF contents
              lrosenth Adobe Employee

              Yes, that’s specifically for use by Accessibility devices (aka screen readers).

               

              What I am proposing is completely different, but gives you a MUCH richer set of APIs to work with.

              • 4. Re: Extracting the PDF contents
                Excel Soft Community Member

                Hello,

                 

                     Thanks for the response. Please elaborate on the method for PDF content extraction, and please share any related documents. The purpose of this extraction is to convert the PDF file to EPUB format.

                 

                Regards,

                Excel