3 Replies Latest reply: Jan 17, 2012 8:09 PM by CtDave

    Acrobat X pro OCR 2 pdf & edit issue?

    bobtwz99

      I'm using the latest rev of Acrobat X Pro.  I have a few small paper docs and I'm creating a PDF via scan.  This works fine. Now I want to make a few very minor word changes and correct any OCR errors.  I use Tools-Recognize Text and OCR suspects.  In turn each is found and shown in a bounding box.

       

      I follow the screen prompts, click on the highlighted text and type/correct a few characters and the change is made on screen.  Then as directed I click "Accept and Find" (or Close)... but as soon as I do this, the text I just corrected (no matter how simple) immediately reverts to its original content.

       

      I must be doing something wrong?  But what?  I've tried various options in OCR and now I'm back to the defaults.

       

      Please Help?!

        • 1. Re: Acrobat X pro OCR 2 pdf & edit issue?
          CtDave CommunityMVP

          Unless you OCR with ClearScan the OCR output is a layer of hidden text.

          The Searchable Image OCR process does some "dress up" of the image and leaves the hidden/invisible OCR text layer.

          The Searchable Image (Exact) OCR process leaves the scanned image untouched ("exact") and leaves the hidden/invisible OCR text layer.

          For these, when you find/fix suspects you are touching the hidden text layer. You are not touching the image.

          To edit the image you'd use an image editor.

          So, your suspects corrections reside in the hidden text layer.
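
          A rough way to picture this is the toy Python model below. It is only a sketch: the class and method names are made up for illustration and are not anything Acrobat exposes. It assumes a Searchable Image page is just a picture plus a separate invisible text string.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    """Toy model of a Searchable Image page: a picture plus invisible text."""
    image: bytes       # what is displayed and printed
    hidden_text: str   # what Search/Find and text export use

    def fix_suspect(self, wrong: str, right: str) -> None:
        # A suspect correction edits only the hidden text layer.
        self.hidden_text = self.hidden_text.replace(wrong, right, 1)

    def visible_content(self) -> bytes:
        # The viewer paints the scanned image; the hidden text is never drawn.
        return self.image


# Hypothetical page whose scan shows "he11o" and whose OCR read it the same way.
page = ScannedPage(image=b"<pixels showing 'he11o'>", hidden_text="he11o")
before = page.visible_content()
page.fix_suspect("he11o", "hello")

print(page.hidden_text)                   # the corrected search text
print(page.visible_content() == before)   # the picture did not change
```

          In this model the correction is kept, but only in the part of the page that is never drawn, which would explain why the text on screen appears to revert.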

           

          If ClearScan is used, the OCR process replaces recognized images of characters with an internal font. Scanned images of any characters not recognized by ClearScan are left as a bitmap image.

          You can perform a reasonable measure of edits to ClearScan text but not to the same extent done with a word processing application.

           

          Be well...

          • 2. Re: Acrobat X pro OCR 2 pdf & edit issue?
            bobtwz99 Community Member

            In general I understand what you are saying. I do realize that there are (3?) options that can be set as you mentioned:  ClearScan, Searchable Image and Searchable Image (Exact).  But I still need a pointer I suppose?

            You mention that when I use find/fix suspects I'm touching (updating) only(?) the hidden layer.  Ok, that sounds plausible... but how do I (or can I) make changes that will somehow (thru some method) cause these hidden changes to become "visible"? I mean become useful and visible in the updated (and saved) pdf?

             

            I knew that a pdf could have visible/hidden layers...but unless my edits (minor) can be reflected in the newly saved pdf (which then can be printed) I don't see the value?  And on some paper docs I scanned to PDF, my minor edits did "stick"...thus my confusion!

             

            You also mentioned (or suggested) I'd be better off using the ClearScan OCR, but that the altered characters are replaced with an internal font?

             

            You also mentioned using an "image editor"  What editor are you referring to?  Hopefully not something at a pixel level.

             

            Maybe I should just say that my goal is to OCR scan to PDF, making only some extremely minor character changes in very tiny documents, no graphics, just text.  In most cases, the words I want to change ARE flagged as suspects anyway.  What method/process would you suggest?  My end goal is very very minor character corrections (most already flagged as suspects) and then saving the corrected OCR back to a corrected pdf.

             

            Thanks for your help

             

            Bob

            • 3. Re: Acrobat X pro OCR 2 pdf & edit issue?
              CtDave CommunityMVP

              Backing up here. Some practical demonstrations that are easily performed.


              Take a sheet of paper that has an imprint of some text, less text is better for this.
              Using Acrobat, create PDF from Scanner. Now perform OCR with Searchable Image.
              Use Select All (Ctrl+A) to display where the hidden/invisible text layer from OCR is positioned.
              The blue highlighting, for the most part, will overlap the scanned image of the text.
              Now select the TouchUp Object / Edit Object tool and double-click on the PDF page.
              You'll see a bounding rectangle appear. Drag this down about an inch.
              Leave the TouchUp Object / Edit Object tool and select the Hand tool.
              Now, Select All (Ctrl+A).
              You will observe, by the highlighted hidden text layer, that the OCR output is above the image of text that you dragged down to a lower position on the PDF page.


              Repeat on a fresh PDF and use Searchable Image (Exact).


              Both demonstrate that these two OCR methods' output text is not associated with the image of text.


              Another way to demonstrate this, after creating the hidden text layer, is to use Examine Document. If Hidden (Invisible) text is present, Examine Document will locate it and permit one to preview/view the characters.


              A third method is to Save As / Export to a text file.
              If the PDF is only the scanned image then there is no content exported. If the PDF has a hidden text layer from OCR output then there is content exported.
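
              If you want to check a batch of exported text files rather than eyeball each one, something like the following Python sketch would do. It assumes the export is plain text readable as UTF-8, and the file path passed to check_export is hypothetical; the function names are mine, not Acrobat's.

```python
from pathlib import Path


def has_text_layer(exported_text: str) -> bool:
    """Return True if a PDF-to-text export holds any non-whitespace text."""
    return bool(exported_text.strip())


def check_export(txt_path: str) -> bool:
    # txt_path: hypothetical file produced by Save As / Export to a text file.
    # "utf-8-sig" also tolerates a leading byte-order mark, if one is present.
    text = Path(txt_path).read_text(encoding="utf-8-sig", errors="replace")
    return has_text_layer(text)


print(has_text_layer(""))               # image-only scan: nothing exported
print(has_text_layer(" \n\f\n"))        # only blank lines and form feeds
print(has_text_layer("Invoice 42\n"))   # an OCR text layer was present
```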


              Returning to a PDF which has had the image dragged down some try this:
              Using the TouchUp Text or Edit Text tool, go to the upper-left-most region where the first of the hidden text layer is located.
              Select a string of text. Right click for the context menu and select Properties.
              In the TouchUp Properties dialog locate "Fill:" and click on the adjacent square.
              Select a color (say red). Click the Close button and click somewhere outside the PDF page.
              Observe that the image of the text string (now below the no longer hidden text string) has not been changed ('touched').


              What makes the OCR output "hidden"? Acrobat's OCR process creates characters that use  a text rendering mode of "3".
              Mode 3 renders the character glyphs with neither stroke nor fill, thus "invisible".
              (Ref. Section 9.3.6 of ISO 32000-1)
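
              For the curious, the Python sketch below shows what that looks like inside a page content stream, where the rendering mode is set by the "Tr" operator. It is a deliberately simplified reader, not a real PDF parser: it assumes an already-decompressed stream, only handles the simple "Tj" show-text operator, and ignores escaped or nested parentheses, hex strings and "TJ" arrays. The sample stream is made up.

```python
import re

# A token is either a literal string "( ... )" without nested or escaped
# parentheses, or any run of non-space characters (numbers, names, operators).
TOKEN = re.compile(r"\([^()\\]*\)|\S+")


def invisible_strings(content_stream: str) -> list[str]:
    """List strings shown with Tj while the text rendering mode is 3."""
    mode = 0      # the default text rendering mode is 0 (fill)
    saved = []    # q / Q save and restore the graphics state, including Tr
    prev = ""     # previous token, i.e. the operand of a one-operand operator
    hidden = []
    for tok in TOKEN.findall(content_stream):
        if tok == "q":
            saved.append(mode)
        elif tok == "Q" and saved:
            mode = saved.pop()
        elif tok == "Tr":
            mode = int(prev)
        elif tok == "Tj" and mode == 3:
            hidden.append(prev[1:-1])   # drop the surrounding parentheses
        prev = tok
    return hidden


# Made-up stream: two strings drawn in mode 3, one drawn in mode 0.
sample = (
    "BT /F1 12 Tf 3 Tr 72 700 Td (Acrobat X) Tj ET "
    "q 0 Tr BT (shown) Tj ET Q "
    "BT (Pro) Tj ET"
)
print(invisible_strings(sample))   # ['Acrobat X', 'Pro']
```

              Note that the mode set by "3 Tr" stays in effect for later text objects until it is changed or an enclosing q/Q pair restores the earlier state, which is why "Pro" is still reported as invisible.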


              Changing hidden/invisible characters to "visible".  Do this by using the TouchUp Properties dialog to change the Fill color.
              However, as demonstrated above, changing OCR output characters does not change the image of characters.
              If you were to change the layer of hidden characters such that fill/stroke were present you would still have the image of characters that comprise the PDF page content.


              Keep in mind that the primary intent of OCR (Searchable Image / Searchable Image (Exact)) is to support Search/Find.
              It is not intended as a mechanism to "replace" the scanned image of text.


              To edit images you need to use an image editor.
              In Acrobat : Go Edit > Preferences > select the TouchUp category > in the pane on the right the "Choose Image Editor" button permits browsing to and selecting an Image Editor. If Photoshop is installed then, during install, Acrobat will select it by default.
              n.b., for the "Choose Page/Object Editor" an application such as Adobe Illustrator would be used.


              ClearScan.
              After performing ClearScan open the PDF's Document Properties and view the Fonts tab.
              Fonts provided by ClearScan can be identified by the "Fdnnnn" entry. It is a font internal to/created by Acrobat/ClearScan.
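
              If you have the font names as a list (for example, copied from the Fonts tab), a small Python filter can pick out the ClearScan ones. This is a sketch that assumes "nnnn" stands for one or more digits; the sample names are invented for illustration.

```python
import re

# Assumption: a ClearScan font name starts with "Fd" followed by digits.
CLEARSCAN_FONT = re.compile(r"^Fd\d+")


def clearscan_fonts(font_names):
    """Return the names that look like ClearScan-generated fonts."""
    return [name for name in font_names if CLEARSCAN_FONT.match(name)]


# Invented sample list, not taken from a real document.
names = ["Fd1234", "Helvetica", "Fd877", "TimesNewRoman"]
print(clearscan_fonts(names))   # ['Fd1234', 'Fd877']
```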


              I suspect that, for your purposes, you want to use ClearScan.
              When ClearScan recognizes the image of a character the output replaces the image.
              A character that is not recognized by ClearScan is left as a bitmap image.
              Character(s) that are 'almost' there would be the "suspect(s)".
              So, the original image of the hardcopy no longer exists.


              Keep in mind that if the actual scanned image of the hardcopy is required or of importance for some reason, then use Searchable Image (Exact), which does not alter the image.  If this is the case, tweaks to the hidden text layer are non-productive. The "real" content is the image, which has become the in-use substitute for the hardcopy. The OCR output exists to support search/find, no more, no less.


              Be well...

               

              Message was edited by: CtDave