2 Replies. Latest reply: Jun 25, 2014 8:43 AM by runuz

    How can I OCR only part of a long document?

    runuz Community Member

      OCR with a PDF file is usually very simple.  Not here.  This is a legal-type document (PDF) with line numbers on the left and vertical lines on both sides of the text I want to OCR.  If OCR includes the line numbers I will have to spend hours removing them.  The left side also includes a law firm's name and the bottom has the title of the document, both of which I have blocked out.  (Line numbering was fairly common on pre-printed legal stationery in the typewriter days; most courts have gotten beyond the way things were done in those days, but obviously not all.  It certainly makes OCR a challenge!)  In much earlier versions of Acrobat I used a simple tool called Eraser from a British firm (long gone) that neatly erased anything from a PDF file, such as all the stuff I want to get rid of here, before running the OCR and saving the document as a Word file.  That tool does not work with Acrobat 9 or with the version I am now using, Acrobat X.  Oh, and just retyping all the text is not an option for me.

       

      There MUST be a way to do this!  Any suggestions?

       

      Sample page.jpg

        • 1. Re: How can I OCR only part of a long document?
          CtDave CommunityMVP

          Acrobat 8 through XI (at least the Pro versions) provide the Redaction tool.

          Used properly, this tool completely removes the selected PDF page content

          (so it is worth reading the application Help on it).

           

           

          Be well...

          • 2. Re: How can I OCR only part of a long document?
            runuz Community Member

            Redaction will certainly do the job, but it is horribly tedious**.  My document is 19 pages.  With 3 such redactions per page I am looking at close to 60 redactions, one at a time.  All the redactions are identical.  That is, the stuff I want to remove on the left side of the page is the same for each of the 19 pages.  Same with the stuff to be removed on the right and the bottom.  I can't see any way to apply my redaction to more than 1 page at a time.  If doing so is impossible, is there a way to extract (by cropping or otherwise) the part of the page I want to use?  Obviously it is the same for all 19 pages.  Is it possible that what I want to remove is in some removable "layer" separate from the text in question?

             

            **Even more tedious than the low tech solution I have been using for the occasions when I confronted this problem:

            1. Print a copy of the document
            2. Using scissors cut out the stuff I want to get rid of, a few pages at a time.
            3. Using Acrobat scan the resulting document
            4. Save the result as a PDF file
            5. Open the PDF file and save it as a Word file
            6. Open the Word file and copy/paste the absurdly formatted text into a clean Word document.  This strips out the bizarre formatting due to the scanning process.
            7. Clean up the new Word document, save it and
            8. Finally get to work using the text.

            FYI the text I want to extract consists of written interrogatories, or questions, for which I must prepare responses.  I want the responses to include the questions.
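
For the cropping idea raised in reply 2 (the same left, right and bottom strips on all 19 pages), here is a minimal sketch of how one identical crop could be applied to every page outside Acrobat. It is not something suggested in the thread. It assumes the third-party pypdf package is installed; the names `Box`, `keep_region` and `crop_all_pages` and the trim values are hypothetical. Note that setting a crop box only hides the trimmed strips rather than deleting them the way redaction does, and whether a given OCR engine ignores the hidden area would need to be checked on a test page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A page rectangle in PDF points (1 point = 1/72 inch)."""
    left: float
    bottom: float
    right: float
    top: float


def keep_region(page: Box, trim_left: float, trim_right: float,
                trim_bottom: float) -> Box:
    """Return what remains of `page` after trimming the left, right
    and bottom strips. The top edge is left where it is."""
    box = Box(page.left + trim_left,
              page.bottom + trim_bottom,
              page.right - trim_right,
              page.top)
    if box.left >= box.right or box.bottom >= box.top:
        raise ValueError("trim amounts leave no visible area")
    return box


def crop_all_pages(src: str, dst: str, trim_left: float,
                   trim_right: float, trim_bottom: float) -> int:
    """Apply the same crop to every page of `src` and write `dst`.
    Returns the number of pages processed."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported
    # here so keep_region() stays usable without it.
    from pypdf import PdfReader, PdfWriter
    from pypdf.generic import RectangleObject

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        mb = page.mediabox
        full = Box(float(mb.left), float(mb.bottom),
                   float(mb.right), float(mb.top))
        box = keep_region(full, trim_left, trim_right, trim_bottom)
        # The crop box limits what viewers display; the content outside
        # it is hidden, not removed from the file.
        page.cropbox = RectangleObject(
            (box.left, box.bottom, box.right, box.top))
        writer.add_page(page)
    with open(dst, "wb") as fh:
        writer.write(fh)
    return len(reader.pages)
```

A call such as `crop_all_pages("interrogatories.pdf", "cropped.pdf", 72, 36, 54)` (file names and point values are placeholders to be measured from the actual page) would replace the roughly 57 manual redactions (19 pages times 3) with one pass over the file; the cropped file could then be run through OCR as usual.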