bobtwz99

Acrobat X pro OCR 2 pdf & edit issue?

Jan 16, 2012 11:38 AM

Tags: #text #edit #ocr #scan #acrobat_10_pro #correct #suspect

I'm using the latest rev of Acrobat X Pro.  I have a few small paper docs and I'm creating a PDF via scan.  This works fine. Now I want to make a few very minor word changes and correct any OCR errors.  I use Tools-Recognize Text and OCR suspects.  Each suspect is found in turn and shown in a bounding box.

 

I follow the screen prompts, click on the highlighted text and type/correct a few characters, and the change is made on screen.  Then, as directed, I click "Accept and Find" (or Close), but as soon as I do this, the text I just corrected (no matter how simple) immediately reverts to its original content.

 

I must be doing something wrong?  But what?  I've tried various options in OCR and now I'm back to the defaults.

 

Please Help?!

 
Replies
  • Jan 16, 2012 6:57 PM   in reply to bobtwz99

    Unless you OCR with ClearScan the OCR output is a layer of hidden text.

    The Searchable Image OCR process does some "dress up" of the image and leaves the hidden/invisible OCR text layer.

    The Searchable Image (Exact) OCR process leaves the scanned image untouched ("exact") and leaves the hidden/invisible OCR text layer.

    For these, when you find/fix suspects you are touching the hidden text layer. You are not touching the image.

    To edit the image you'd use an image editor.

    So, your suspects corrections reside in the hidden text layer.

     

    If ClearScan is used, the OCR process replaces recognized images of characters with an internal font. Scanned images of any characters not recognized by ClearScan are left as a bitmap image.

    You can perform a reasonable measure of edits to ClearScan text, though not to the extent possible in a word processing application.

     

    Be well...

     
  • Jan 17, 2012 8:09 PM   in reply to bobtwz99

    Backing up here. Some practical demonstrations that are easily performed.


    Take a sheet of paper that has an imprint of some text, less text is better for this.
    Using Acrobat, create PDF from Scanner. Now perform OCR with Searchable Image.
    Use Select All (Ctrl+A) to display where the hidden/invisible text layer from OCR is positioned.
    The blue highlighting, for the most part, will overlap the scanned image of the text.
    Now select the TouchUp Object / Edit Object tool and double-click on the PDF page.
    You'll see a bounding rectangle appear. Drag this down about an inch.
    Leave the TouchUp Object / Edit Object tool and select the Hand tool.
    Now, Select All (Ctrl+A).
    You will observe, by the highlighted hidden text layer, that the OCR output is above the image of text that you dragged down to a lower position on the PDF page.


    Repeat on a fresh PDF and use Searchable Image (Exact).


    Both demonstrate that these two OCR methods' output text is not associated with the image of text.


    Another way to demonstrate this, after creating the hidden text layer, is to use Examine Document. If Hidden (Invisible) text is present, Examine Document will locate it and permit one to preview/view the characters.


    A third method is to Save As / Export to a text file.
    If the PDF is only the scanned image then there is no content exported. If the PDF has a hidden text layer from OCR output then there is content exported.
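
    The same check can be roughed out in a script. What follows is only a sketch under stated assumptions, not how Acrobat's export works: it assumes page content streams are either uncompressed or Flate-compressed, and it merely looks for text-showing operators (Tj / TJ) between BT and ET. The names has_text_layer and make_fake_pdf are hypothetical, and the "PDF" built here is a synthetic fragment for illustration, not a complete file. Binary image data can produce false positives, so treat the result as a hint.

```python
import re
import zlib

# Stream bodies sit between "stream" and "endstream" keywords.
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# A text object (BT ... ET) containing at least one text-showing operator.
TEXT_SHOW = re.compile(rb"BT\b.*?(?:Tj|TJ)\b.*?ET\b", re.DOTALL)


def has_text_layer(pdf_bytes):
    """Heuristic: True if any stream appears to contain shown text."""
    for match in STREAM.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes after the Flate data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # assume the stream was stored uncompressed
        if TEXT_SHOW.search(data):
            return True
    return False


def make_fake_pdf(content):
    """Hypothetical helper: wrap one Flate-compressed content stream."""
    body = zlib.compress(content)
    header = b"1 0 obj\n<< /Length " + str(len(body)).encode() + b" /Filter /FlateDecode >>\n"
    return b"%PDF-1.4\n" + header + b"stream\n" + body + b"\nendstream\nendobj\n"


# A page that only paints a scanned image, and one that also carries OCR text.
image_only = make_fake_pdf(b"q 612 0 0 792 0 0 cm /Im0 Do Q")
with_ocr = make_fake_pdf(b"q 612 0 0 792 0 0 cm /Im0 Do Q BT /F1 10 Tf 3 Tr (Hello) Tj ET")

print(has_text_layer(image_only))  # False
print(has_text_layer(with_ocr))    # True
```

    For real documents a proper PDF library is the safer route, since this sketch ignores other stream filters and encryption.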


    Returning to a PDF which has had the image dragged down some try this:
    Using the TouchUp Text or Edit Text tool, go to the upper-left-most region where the first of the hidden text layer is located.
    Select a string of text. Right click for the context menu and select Properties.
    In the TouchUp Properties dialog locate "Fill:" and click on the adjacent square.
    Select a color (say red). Click the Close button and click somewhere outside the PDF page.
    Observe that the image of the text string (now below the no longer hidden text string) has not been changed ('touched').


    What makes the OCR output "hidden"? Acrobat's OCR process creates characters that use  a text rendering mode of "3".
    Mode 3 renders the character glyphs with neither stroke nor fill, thus "invisible".
    (Ref. Section 9.3.6 of ISO 32000-1)
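
    To make the "Tr" operator concrete, here is a minimal sketch that walks an already-decoded page content stream and reports which strings are shown while the text rendering mode is 3. The function name text_runs and the sample stream are made up for illustration. The tokenizer is deliberately simplistic: it assumes no nested parentheses in strings and no inline images or dictionaries, so it is not a real PDF parser.

```python
import re

TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"       # literal string (nested parentheses not handled)
    rb"|<[0-9A-Fa-f\s]*>"           # hex string
    rb"|/[^\s/\[\]()<>]*"           # name, e.g. /F1
    rb"|[^\s/\[\]()<>]+"            # number or operator
)
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")
SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}


def text_runs(content):
    """Return (string_token, rendering_mode) for each string that is shown."""
    mode = 0        # the default text rendering mode is 0 (fill)
    saved = []      # q/Q save and restore the graphics state, which includes the mode
    operands = []
    runs = []
    for match in TOKEN.finditer(content):
        tok = match.group()
        if tok[:1] in (b"(", b"<", b"/") or NUMBER.fullmatch(tok):
            operands.append(tok)
            continue
        # Anything else is treated as an operator that consumes the operands.
        if tok == b"Tr" and operands and NUMBER.fullmatch(operands[-1]):
            mode = int(float(operands[-1].decode("ascii")))
        elif tok == b"q":
            saved.append(mode)
        elif tok == b"Q" and saved:
            mode = saved.pop()
        elif tok in SHOW_OPS:
            for operand in operands:
                if operand[:1] in (b"(", b"<"):
                    runs.append((operand, mode))
        operands = []
    return runs


# Made-up stream: paint the scan, show hidden OCR text, then show visible text.
sample = (b"q 612 0 0 792 0 0 cm /Im0 Do Q "
          b"BT /F1 10 Tf 3 Tr 72 700 Td (Hello) Tj 0 Tr (Seen) Tj ET")
hidden = [s for s, mode in text_runs(sample) if mode == 3]
print(hidden)  # [b'(Hello)']
```

    In the sample, the image is painted by "Do" and the OCR string is a separate object shown under mode 3, which mirrors the point made above: the hidden text and the image of the text are independent pieces of page content.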


    To change hidden/invisible characters to "visible", use the TouchUp Properties dialog to change the Fill color.
    However, as demonstrated above, changing OCR output characters does not change the image of characters.
    If you were to change the layer of hidden characters such that fill/stroke were present you would still have the image of characters that comprise the PDF page content.


    Keep in mind that the primary intent of OCR (Searchable Image / Searchable Image (Exact)) is to support Search/Find.
    It is not intended as a mechanism to "replace" the scanned image of text.


    To edit images you need to use an image editor.
    In Acrobat, go to Edit > Preferences and select the TouchUp category. In the pane on the right, the "Choose Image Editor" button permits browsing to and selecting an image editor. If Photoshop is installed then, during install, Acrobat will select it by default.
    n.b., for the "Choose Page/Object Editor" an application such as Adobe Illustrator would be used.


    ClearScan.
    After performing ClearScan open the PDF's Document Properties and view the Fonts tab.
    Fonts provided by ClearScan can be identified by the "Fdnnnn" entry. It is a font internal to/created by Acrobat/ClearScan.


    I suspect that, for your purposes, you want to use ClearScan.
    When ClearScan recognizes the image of a character the output replaces the image.
    A character that is not recognized by ClearScan is left as a bitmap image.
    Character(s) that are 'almost' there would be the "suspect(s)".
    So, the original image of the hardcopy no longer exists.


    Keep in mind that if the actual scanned image of the hardcopy is required or of importance for some reason, then use Searchable Image (Exact), which does not alter the image.  If this is the case, tweaks to the hidden text layer are non-productive. The "real" content is the image, which has become the in-use substitute for the hardcopy. The OCR output exists to support search/find, no more, no less.


    Be well...

     

    Message was edited by: CtDave

     