4 Replies. Latest reply: Jul 23, 2014 1:40 PM by mfhughesjr

    How to determine (in bulk) whether a document has OCR text and what reader versions are supported?

    mfhughesjr Community Member

      Help!

       

      I've inherited a task that has spanned many years and was not well thought out across the transitions.  At least three project teams have scanned and stored tens of thousands of documents as PDFs.  It was later discovered that the teams did not apply a uniform standard for which Adobe Reader versions each PDF should support, and that not all documents appear to have been OCR'ed as part of the scan process.


      This has resulted in two major problems.  First, PDFs which support all Reader versions are bloated and consume significant amounts of storage; second, the automated processing tools that depend on the OCR text fail once they pass the front and rear cover sheets (which do contain extractable text).  I need to know whether there is a way to scan PDFs in bulk to determine which Reader versions are supported (say 8.0 to current), and whether the OCR'ed / extractable text extends beyond the first few and last pages of each PDF.
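
      For the version half of the question, one possible starting point is a sketch like the following. It assumes Python is available and reads only the `%PDF-x.y` header at the start of each file. The function names and the mapping table are illustrative, not part of any Adobe tool. The mapping reflects the Acrobat release that introduced each PDF version (1.7 corresponds to Acrobat 8). Note that a document catalog can carry a `/Version` entry that overrides the header, which this sketch does not check, and that the `*.pdf` pattern is case-sensitive on some platforms.

```python
import re
from pathlib import Path

# Acrobat/Reader release that introduced each PDF version.
PDF_TO_ACROBAT = {"1.3": "4", "1.4": "5", "1.5": "6", "1.6": "7", "1.7": "8"}

HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def header_version(data: bytes):
    """Return the version string from the %PDF-x.y header, or None if absent."""
    match = HEADER_RE.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def scan_versions(root):
    """Yield (path, pdf_version, acrobat_release) for every *.pdf under root."""
    for path in sorted(Path(root).rglob("*.pdf")):
        with open(path, "rb") as fh:
            version = header_version(fh.read(1024))
        yield path, version, PDF_TO_ACROBAT.get(version, "unknown")
```

      Looping over `scan_versions("some/folder")` and writing the tuples to a CSV would give a per-file inventory that can be sorted by version.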

       

      I have been manually fixing individual files with Adobe Acrobat 9.0.  I can force Acrobat to re-OCR and save the files, but I would rather not re-process the entire existing bulk unless absolutely necessary.  If I could determine which files need fixing and process only those, it would save man-years of work.
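
      To triage which files need re-OCR, a rough sketch is to extract text page by page and flag pages that yield little or none. This assumes the third-party `pypdf` package is installed; the helper names and the 20-character threshold are arbitrary choices for illustration. Genuinely blank pages will also be flagged, so the result is a candidate list rather than a verdict.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose stripped text is shorter than min_chars."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def needs_ocr(page_texts, min_chars=20):
    """True if at least one page has too little extractable text."""
    return bool(pages_without_text(page_texts, min_chars))


def extract_page_texts(path):
    """Return the extracted text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily so the helpers above work without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

      A file whose flagged pages are everything except the first and last few would match the cover-sheet pattern described above, for example `pages_without_text(extract_page_texts("scan.pdf"))` on a hypothetical `scan.pdf`.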

       

      Thanks, in advance, for any assistance.

       

      Michael