How do you proofread and correct text produced by OCR from a scanned document, in Acrobat 10 Pro?
I scan (many, large) paper documents, then use Recognise Text. After the OCR phase, if I save PDFs as text, I can see many scan errors.
I would like to be able to correct those errors in the scanned text, so that names etc. can be successfully searched. However, I cannot find any way to view and correct the scanned text.
I experimented with Tools / Content / Edit Document Text, but I cannot see how to display the scanned text to allow correction. It appears to operate on the PDF image. But if I try to change the document image to correct known errors (e.g. in spacing), and then save the PDF as text again, the string where I changed the image becomes gibberish.
How is Edit Document Text supposed to work? Is there any way to achieve what I am looking for (fixing many errors in large OCR'd documents)?
For editing the OCRed text, go to Tools pane > Recognize Text > Find First Suspect.
Now click on any text and edit it. The edits are made to the hidden text layer, so you will not see them on screen. As a workaround, you can verify your modifications by selecting the text and copying it into Notepad.
Please note that this workflow will not work for ClearScan documents.
Thanks, Bernd and apangasa.
I tried the method you describe. I OCR'd a scanned file using Searchable Image (Exact). I saved it as PDF and as txt; the txt revealed many scannos.
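For a long document, one way to triage the exported txt is to flag tokens that look like OCR damage before reading it line by line. The following is only a minimal sketch: the function name find_suspects and the three patterns are illustrative assumptions, not anything Acrobat provides, and the patterns will also flag legitimate tokens such as "1st", so they need tuning for your documents.

```python
import re

# Heuristic patterns that often indicate OCR damage.
# These are assumptions for illustration; tune them for your own documents.
SUSPECT_PATTERNS = [
    re.compile(r"[A-Za-z]\d|\d[A-Za-z]"),  # letter fused with digit, e.g. "c0urt"
    re.compile(r"[|~^{}\[\]<>\\_]"),        # symbols rarely seen in plain prose
    re.compile(r"(.)\1\1"),                 # same character three times in a row
]

def find_suspects(text):
    """Return (line_number, token) pairs for tokens matching any heuristic."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            if any(p.search(token) for p in SUSPECT_PATTERNS):
                hits.append((line_no, token))
    return hits

sample = "The c0urt heard the appeal.\nJudgment was de|ivered in 1998."
for line_no, token in find_suspects(sample):
    print(line_no, token)
```

This only points at likely trouble spots in the text export; it does not change the hidden text layer in the PDF.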
I found that Find Suspects found nothing, so I highlighted text where I knew there was an error. Right mouse click then gave me a list of options, of which only "Replace text" sounded useful, so I used that. It presented as if I was typing an annotation, and it put a blue line through the text I had "replaced".
I then saved the file as PDF and as txt. The txt had no changes whatsoever: no corrections versus the original.
Am I doing something wrong?
Thanks for trying this out.
Your workflow up to Find Suspects and highlighting looks good. But you need not right-click on the text. After highlighting the text, just start typing to edit it. Save and close the file. When you reopen the file, the edited text will be retained.
This worked for me.
Just to confirm, should I overtype the text that I see on the image, with the same text?
I am able to highlight text in blue, but it will not allow me to type anything. If I highlight in blue and retype the text, nothing happens.
My conclusion is that there is no way to proofread and correct the scanned text using Acrobat X. Am I being too pessimistic?
Are there any tools other than Acrobat X that I could use to proofread and correct the hidden text? Adobe, or from other vendors?
As I mentioned, you would not be able to see the changes you make.
However, if you save the changes, then select the text by highlighting it and copy and paste it elsewhere, the corrected text will be displayed.
I "made the changes" - that is, I clicked and dragged over text in the image, which highlighted it in blue; I then retyped what the text said - which had no audible or visible effect that I could notice.
I then saved the pdf as a pdf, and as a text file.
I then opened the text file and looked for the changes I had supposedly made. But absolutely nothing had changed.
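Rather than scanning two text exports by eye, the before and after files can be compared mechanically. This is a hedged sketch using Python's standard difflib module; the function name changed_lines and the sample strings are made up for illustration. An empty result means the correction did not reach the text layer.

```python
import difflib

def changed_lines(before_text, after_text):
    """Return the added/removed lines between two text exports."""
    diff = list(difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile="before.txt",
        tofile="after.txt",
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    ))
    # The first two entries are the "---" / "+++" file headers; skip them,
    # then keep only lines marked as removed ("-") or added ("+").
    return [line for line in diff[2:] if line.startswith(("+", "-"))]

before = "R v Hunt\nR v Mayrlck\n"
after = "R v Hunt\nR v Mayrick\n"
print(changed_lines(before, after))
```

In practice the two strings would be read from the txt saved before the edit and the txt saved after it.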
Was I doing it wrong? If so what should I have done instead?
But this question is academic, because this method, even if it worked, would be far too laborious to be worthwhile.
I want to open and edit the whole text layer and save it, still attached to the PDF. How would I do that?
Start with a fresh scan of the textual content on the hardcopy and bring it into PDF.
Make a working copy of the PDF. Use Acrobat's ClearScan. Make a working copy of this changed PDF.
Using the Edit Document Text tool, select textual content on a PDF page.
Right click for the context menu. Select "Properties".
In the dialog you will observe the font (Fd####) that ClearScan used to replace the image of characters it recognized.
Use the drop-down menu associated with "Font" to select an available system font.
The text you'd selected previously changes to this new font. Tick the selection "Embed".
Close the dialog.
Repeat for each page of the PDF.
n.b., I'd do Save As often & park an incrementing prefix or suffix to the PDF being processed. Any "oops" & I'd be able to step back to previously saved work rather than lose it all.
Once completed use Save As > Microsoft Word > Word Document or Word 97-2003 Document.
And/Or Save As > More Options > Rich Text Format
Now, use MS Word to clean up the content's grammar, spelling, layout, format, & other blivets.
Once in a "smile, be happy" state of mind with the Word file use PDFMaker ("Acrobat" on the ribbon) to Create PDF.
Do look over PDFMaker's configuration to ensure that it is set to meet your needs (selected Distiller job option, etc.)
Create the PDF.
I have found that, for even moderately long hardcopy of textual content, there is greater efficacy achieved by simply rekeying into a new file with a follow up grammar/spell/look-it-over check.
Of course this pre-supposes one is a somewhat competent "touch typist" (a phrase that dates me, eh?).
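The n.b. above about saving often with an incrementing suffix can also be done outside Acrobat, by copying the working PDF before each risky step. A minimal sketch, assuming a naming scheme of name_v001.pdf, name_v002.pdf and so on; the function name save_versioned_copy and that scheme are illustrative choices, not an Acrobat feature.

```python
import shutil
from pathlib import Path

def save_versioned_copy(pdf_path):
    """Copy pdf_path to the next free name like scan_v001.pdf; return the new path."""
    src = Path(pdf_path)
    n = 1
    while True:
        candidate = src.with_name(f"{src.stem}_v{n:03d}{src.suffix}")
        if not candidate.exists():
            shutil.copy2(src, candidate)  # copies contents and file timestamps
            return candidate
        n += 1

# Example (hypothetical file name): call this after closing the PDF in Acrobat.
# save_versioned_copy("scan.pdf")  # creates scan_v001.pdf, then scan_v002.pdf, ...
```

Any "oops" then means stepping back to the highest-numbered copy that is still good.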
Hello CtDave; Thanks for your reply.
I've followed your instructions down to the point of changing font. When you say, "to an available system font" how do I tell which fonts are system fonts? I chose Arial.
Q1: It only let me select one line of text at a time. Do I have to do this process line by line?
Q2: When I changed the font, it now shows the OCR'd text with errors. You don't say to correct the errors, but do you mean that I should correct them?
Q3: As I understand it, you say that I should then save the PDF as Word, make this Word doc into a PDF and use this as a replacement for the original PDF. Is this what you mean?
I corrected one line of my PDF after I changed its font. When I saved the PDF as a Word document, the line I had corrected now displays in the Word document as gibberish:
&$ 5XDQDWUL +817 Y 5 &$ 0DUDPD 0$<5,&. Y 5 &$
(should be three sets of case number and case name).
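The gibberish in this sample is not random: every non-space character is exactly 29 code points below a readable character. That pattern is consistent with text being exported as raw glyph indexes instead of Unicode, though that cause is an assumption, and the offset of 29 is inferred from this one sample only. A minimal sketch (shift_decode is a hypothetical helper name):

```python
def shift_decode(garbled, offset=29):
    """Shift every non-space character up by `offset` code points."""
    return "".join(ch if ch == " " else chr(ord(ch) + offset) for ch in garbled)

sample = "&$ 5XDQDWUL +817 Y 5 &$ 0DUDPD 0$<5,&. Y 5 &$"
print(shift_decode(sample))
# prints: CA Ruanatri HUNT v R CA Marama MAYRICK v R CA
```

This recovers the text of one exported line for inspection; it does not repair the font mapping inside the PDF, and other fonts or files may use a different offset or no fixed offset at all.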
Q4: Should I simply change the font, line by line, throughout the PDF - save it as Word, correct the Word document and then turn it into a PDF? This is a 76-page document, and the users will expect to see the image looking like the scanned original.
"This is a 76-page document, and the users will expect to see the image looking like the scanned original."
That locks it down. The only way to satisfy this is Searchable Image (Exact).
The scanned image serves as an objective replacement for the source hardcopy.
The OCR output exists to facilitate search/find.
At the end of the day there is no practical means of editing OCR's Hidden Text layer with it in the PDF.
That's not to say you cannot work at it and get results. But, the operative word is practical.
In that context you may want to look over a reply I made here:
To increase accuracy of OCR recognition:
Yes, there are dedicated OCR applications (desktop or server). Having used several of each as well as Acrobat's OCR, I've learned that the scanner and the quality of the hardcopy source are also significant.
Regarding the remainder of your post above.
Ok, I cannot replicate what you describe with a PDF I've been using.
It is a scanned image of a single page of textual content.
After ClearScan I can export to Word (&, of course, have some cleanup required).
I can use the TouchUp Text / Edit Document Text tool to select all the PDF page's content (the ClearScan output).
Changed the font to TimesNewRoman, saved, and exported to Word.
The content in Word needed cleanup.
Next, I selected various words and typed in a replacement word.
After a Save I Exported to Word. The changed words carried through.
re: Q1 - What you describe is symptomatic of the Hidden text output of Searchable Image / Searchable Image (Exact) and not ClearScan. So, I'm perplexed.
re: Q2 - An advantage of ClearScan is being able to edit a text string to correct it. So, sure, why not correct? With that said, it can be a tedious and labor intensive activity. As well, typos are possible during correction which begs the question "Who bells the cat?" 8^)
re: Q3 - If corrections to the ClearScan output meet your needs, an export to Word may not be needed.
However, sometimes ClearScan cannot recognize the image of a character and leaves it as a bitmapped image.
So, to correct you'd have to get into a word processor.
re: Q4 - Goes back to Q3.
Here are some useful video tutorials:
Oct 13, 2010 12:13 PM (in response to Dave_Rado)
Re: Is it possible to correct Acrobat's OCR errors?
I came across this thread looking for the same information. After playing around with some settings in Acrobat 8, I discovered the following steps to make the invisible OCR text visible for editing:
To make the OCR hidden text visible, use the TouchUp Text tool and change the color from No Color to a visible color. Then edit the text. Then change the text color back to No Color.
1) With the TouchUp Text tool, select all text on the page (Ctrl-A).
2) Right-click the page and click "Properties".
3) Go to the "Text" tab and select a Fill color for the font. Now the overlaid text is visible!
4) Make any text corrections needed.
5) Select all text again and choose "No Color" for the fill text to make it invisible again.
If seeing the scanned image is getting in the way, you can also go to Edit > Preferences > Page Display and uncheck the option to "Show large images".
Hope this helps!
You've really posed a perfect issue. The problem with the text that AAX produces as a result of its OCR is that when you go to index these files, all this gibberish ends up in the index. Check dtSearch Desktop sometime, and you will wonder how it gets any work done at all with what it has to contend with. It's an improvement over AA 9, but still not acceptable. Most people don't scan legal discovery documents, so would not be terribly inconvenienced by this problem -- until they search for a string and can't find the document.
The use of the TouchUp Text tool is not a solution, as the visuals of the displayed text are terrible, for one thing.
At the moment, what I do is use AAX for scanning to image, not OCR. Then I take a second pass at the dox with ABBYY FineReader. At least there you can edit what gets into the hidden text.
In addition, there is the problem of AAX's poor deskew algorithm when optimizing the PDF. Only the one it invokes in OCR works worth anything. I'm still working on this aspect of the problem.
If you come up with an improvement on this problem, with or without AAX, I for one would like to know.
-- Roy Zider