Proofread and correct OCR'd text in Acrobat 10 Pro

Jan 10, 2012 2:00 PM

How do you proofread and correct text produced by OCR from a scanned document, in Acrobat 10 Pro?

 

I scan (many, large) paper documents, then use Recognise Text. After the OCR phase, if I save PDFs as text, I can see many scan errors.

 

I would like to be able to correct those errors in the scanned text, so that names etc. can be successfully searched. However, I cannot find any way to view and correct the scanned text.

 

I experimented with Tools / Content / Edit Document Text, but I cannot see how to display the scanned text to allow correction. It appears to operate on the PDF image. But if I try to change the document image to correct known errors (e.g. in spacing), and then save the PDF as text again, the string where I changed the image becomes gibberish.

 

How is Edit Document Text supposed to work? Is there any way to achieve what I am looking for (fixing many errors in large OCR'd documents)?

 

Regards,

Sue.

 
Replies
  • Jan 11, 2012 3:53 AM   in reply to SueJB2

    Select the text and change the color of the text. Then you can see and change the text.

     
  • Jan 11, 2012 4:44 AM   in reply to SueJB2

    For editing the OCRed text, go to Tools pane > Recognize Text > Find First Suspect.

    Now click on any text and edit it. The edits are made on the hidden text layer, so you will not be able to see your modifications. However, you can verify them by selecting the text and copying it into Notepad.

     

    Please note that this workflow will not work for ClearScan documents.
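The copy-to-Notepad check described above can also be done on whole files. Here is a minimal sketch, assuming you have saved the PDF as text once before editing and once after. The function name `changed_lines` and the sample strings are illustrative only, and nothing here is part of Acrobat.

```python
import difflib


def changed_lines(before: str, after: str) -> list[str]:
    """Return only the lines that differ between two text exports.

    Lines prefixed "- " come from the export made before editing,
    lines prefixed "+ " from the export made after editing.
    """
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop them.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical sample text standing in for two "Save As text" exports.
    before_edit = "Mr Srnith signed\nthe contract"
    after_edit = "Mr Smith signed\nthe contract"
    for line in changed_lines(before_edit, after_edit):
        print(line)
```

If the list comes back empty after you edited and saved, the correction probably did not reach the hidden text layer.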

     
  • Jan 11, 2012 5:52 AM   in reply to apangasa

    The function will not find all suspects.

     
  • Jan 11, 2012 9:57 PM   in reply to SueJB2

    Thanks for trying these things out.

     

    Your workflow up to Find Suspects and highlighting looks good. But you do not need to right-click on the text. After highlighting the text, just start editing it. Save and close the file. When you reopen the file, the edited text will have been retained.

    This worked for me.

     
  • Jan 12, 2012 8:16 PM   in reply to SueJB2

    As I mentioned, you would not be able to see the changes you make.

     

    However, if you save the changes, then select the text by highlighting it and copy and paste it elsewhere, the corrected text will be displayed.

     
  • Jan 14, 2012 10:00 AM   in reply to SueJB2

    Start with a fresh scan of the textual content on the hardcopy and bring it into PDF.

    Make a working copy of the PDF. Use Acrobat's ClearScan. Make a working copy of this changed PDF.

    Using the Edit Document Text tool, select textual content on a PDF page.

    Right click for the context menu. Select "Properties".

    In the dialog you will observe the font (Fd####) that ClearScan used to replace the image of characters it recognized.

    Use the drop-down menu associated with "Font" to select an available system font.

    The text you'd selected previously changes to this new font. Tick the selection "Embed".

    Close the dialog.

    Repeat for each page of the PDF.

    n.b., I'd do Save As often & park an incrementing prefix or suffix to the PDF being processed. Any "oops" & I'd be able to step back to previously saved work rather than lose it all.
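The "Save As often with an incrementing suffix" habit can also be scripted outside Acrobat. This is a rough sketch, assuming the working PDF is closed and saved on disk. The helper name `save_increment` and the `_001`, `_002` naming pattern are my own choices, not anything Acrobat provides.

```python
import shutil
import tempfile
from pathlib import Path


def save_increment(pdf_path: str) -> Path:
    """Copy pdf_path to the next free name, e.g. scan_001.pdf, scan_002.pdf."""
    src = Path(pdf_path)
    n = 1
    while True:
        candidate = src.with_name(f"{src.stem}_{n:03d}{src.suffix}")
        if not candidate.exists():
            break
        n += 1
    # copy2 keeps timestamps, which helps when stepping back through versions.
    shutil.copy2(src, candidate)
    return candidate


if __name__ == "__main__":
    # Demonstration on a throwaway file; "scan.pdf" is a placeholder name.
    with tempfile.TemporaryDirectory() as folder:
        working = Path(folder) / "scan.pdf"
        working.write_bytes(b"placeholder content")
        print(save_increment(str(working)).name)
        print(save_increment(str(working)).name)
```

Run it after each page or batch of font changes, so that any "oops" costs only the work since the last copy.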

     

    Once completed use Save As > Microsoft Word > Word Document or Word 97-2003 Document.

    And/Or Save As > More Options > Rich Text Format

     

    Now, use MS Word to cleanup the content's grammar, spelling, layout, format, & other blivets.

    Once in a "smile, be happy" state of mind with the Word file use PDFMaker ("Acrobat" on the ribbon) to Create PDF.

    Do look over PDFMaker's configuration to ensure that it is set to meet your needs (selected Distiller job option, etc.).

     

    Create the PDF.

     

    I have found that, for even moderately long hardcopy of textual content, there is greater efficacy achieved by simply rekeying into a new file with a follow up grammar/spell/look-it-over check.

    Of course this pre-supposes one is a somewhat competent "touch typist" (a phrase that dates me, eh?).

     

    Be well...

     
  • Jan 18, 2012 9:06 PM   in reply to SueJB2

    "This is a 76-page document, and the users will expect to see the image looking like the scanned original."

     

    That locks it down. The only way to satisfy this is Searchable Image (Exact).
    The scanned image serves as an objective replacement for the source hardcopy.
    The OCR output exists to facilitate search/find.

    At the end of the day there is no practical means of editing OCR's Hidden Text layer with it in the PDF.
    That's not to say you cannot work at it and get results. But, the operative word is practical.
    In that context you may want to look over a reply I made here:
    http://forums.adobe.com/thread/950209?tstart=0  

     

    To increase accuracy of OCR recognition:
    Yes, there are dedicated OCR applications (desktop or server). Having used several of each as well as Acrobat's OCR, I've learned that the scanner and the quality of the hardcopy source are also significant.

    Regarding the remainder of your post above.

    Ok, I cannot replicate what you describe with a PDF I've been using.
    It is a scanned image of a single page of textual content.
    After ClearScan I can export to Word (&, of course, have some cleanup required).
    I can use the TouchUp Text / Edit Document Text tool to select all the PDF page's content (the ClearScan output).
    Changed the font to TimesNewRoman, saved, and exported to Word.
    The content in Word needed cleanup.


    Next, I selected various words and typed in a replacement word.
    After a Save I Exported to Word. The changed words carried through.


    re: Q1 - What you describe is symptomatic of the Hidden text output of Searchable Image / Searchable Image (Exact) and not ClearScan. So, I'm perplexed.

    re: Q2 - An advantage of ClearScan is being able to edit a text string to correct it. So, sure, why not correct? With that said, it can be a tedious and labor intensive activity. As well, typos are possible during correction which begs the question "Who bells the cat?"  8^) 

    re: Q3 - If corrections to the ClearScan output meet your needs, an export to Word may not be needed.
    However, sometimes ClearScan cannot recognize the image of a character and leaves it as a bitmapped image.
    So, to correct you'd have to get into a word processor.

    re: Q4 - Goes back to Q3.


    Here are some useful video tutorials:
    http://acrobatusers.com/tutorials/clearscan-vs-imagetext-ocr 

    A listing of others: http://acrobatusers.com/tutorials/filter/search&keywords=scanning%20ocr&tut_type=Video&channel=tutorials/

    At Adobe TV:
    http://acrobatusers.com/tutorials/filter/search&keywords=scanning%20ocr&tut_type=Video&channel=tutorials/


    Be well...

     
  • Jan 19, 2012 12:19 PM   in reply to SueJB2

    7. Oct 13, 2010 12:13 PM (in response to (Dave_Rado))

    Re: Is it possible to correct Acrobat's OCR errors?

     

    I came across this thread looking for the same information.  After playing around with some settings in Acrobat 8, I discovered the following steps to make the invisible OCR text visible for editing:

     

    To make the OCR hidden text visible, use the Text TouchUp Tool and change the color from No Color to a visible color.  Then edit the text.  Then change the text color back to No Color.

     

    1) With the Text TouchUp tool, select all text on the page (ctrl-A).

    2) Right-click the page and click "Properties".

    3) Go to the "Text" tab and select a Fill color for the font.  Now the overlaid text is visible!

    4) Make any text corrections needed.

    5) Select all text again and choose "No Color" for the fill text to make it invisible again.

    6) Save.

     

    If seeing the scanned image is getting in the way, you can also go to Edit > Preferences > Page Display and uncheck the option to "Show large images".

     

     

     

    Hope this helps!

     
  • Jan 24, 2012 10:20 PM   in reply to CtDave

    Does it work in Acrobat 9? I found some differences there. Can you tell me how it works? Thank you!

     
  • Mar 2, 2012 11:43 AM   in reply to milray2

    milray2, I too found this thread while looking for the same information. Thanks for posting your findings - very helpful! This worked for me, exactly as you described, in Acrobat 9.

     
  • Apr 19, 2012 6:12 PM   in reply to SueJB2

    Sue:

     

    You've really posed a perfect issue. The problem with the text that AAX produces as a result of its OCR is that when you go to index these files, all this gibberish ends up in the index. Check dtSearch Desktop sometime, and you will wonder how it gets any work done at all with what it has to contend with. It's an improvement over AA 9, but still not acceptable. Most people don't scan legal discovery documents, so would not be terribly inconvenienced by this problem -- until they searched for a string and couldn't find the document.
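To get a feel for how much gibberish would reach an index, one option is to run a crude filter over the text saved from the PDF. The sketch below is a heuristic of my own, not how dtSearch or Acrobat decide anything. It assumes plain English ASCII text, and it flags tokens that mix letters with digits or symbols, plus lowercase-containing words of four or more letters with no vowel. It will miss plausible-looking misreads such as "Srnith" and will wrongly flag some abbreviations.

```python
import re

# A "plain" word: letters only, optionally joined by hyphens or apostrophes.
PLAIN_WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
EDGE_PUNCT = ".,;:!?()[]\"'"


def looks_suspect(token: str) -> bool:
    """Crude guess at whether a token is OCR debris (assumption-laden heuristic)."""
    core = token.strip(EDGE_PUNCT)
    if not any(ch.isalpha() for ch in core):
        return False  # numbers, dates, and bare punctuation are left alone
    if not PLAIN_WORD.fullmatch(core):
        return True  # letters mixed with digits or symbols, e.g. "c0ntract"
    if core.isupper():
        return False  # treat all-caps tokens as acronyms
    # Longer words with no vowel at all are unlikely to be real English.
    return len(core) >= 4 and not any(ch in "aeiouy" for ch in core.lower())


def suspect_tokens(text: str) -> list[str]:
    """Return the whitespace-separated tokens that look like OCR debris."""
    return [tok for tok in text.split() if looks_suspect(tok)]


if __name__ == "__main__":
    # Hypothetical sample line standing in for text saved from an OCR'd PDF.
    print(suspect_tokens("The c0ntract was signed by Mr. Smith in 2012, rnrn."))
```

A list like this only points at places worth proofreading. The corrections themselves still have to be made in the hidden text layer or in a dedicated OCR tool.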

     

    The use of the Text TouchUp tool is not a solution, as the visuals of the displayed text are terrible, for one thing.

     

    At the moment, what I do is use AAX for scanning to image, not OCR.  Then I take a second pass at the dox with ABBYY FineReader.  At least there you can edit what gets into the hidden text.

     

    In addition, there is the problem of AAX's poor deskew algorithm when optimizing the PDF. Only the one it invokes in OCR works worth anything. I'm still working on this aspect of the problem.

     

    If you come up with an improvement on this problem, with or without AAX, I for one would like to know.

     

    -- Roy Zider

     