-
1. Re: How to correct incorrectly OCR'ed pdf file?
Bill@VT Dec 23, 2010 5:17 AM (in response to asdfabcedasf)
Turn on suspects and then go through the file looking for errors. That may still not catch an error if Acrobat thought the text was correct. In that case, select the TouchUp Text tool and edit the text. It takes time and is not always easy. I am assuming you are using ClearScan (AA9 and later) if you are seeing the errors.
-
2. Re: How to correct incorrectly OCR'ed pdf file?
asdfabcedasf Dec 23, 2010 4:56 PM (in response to Bill@VT)
The TouchUp Text tool works, but I cannot see what I actually type. Is there a mode that shows what I type when I need it and hides it when I don't?
-
3. Re: How to correct incorrectly OCR'ed pdf file?
CtDave Dec 23, 2010 9:11 PM (in response to asdfabcedasf)
Not being able to "see" what is selected by the TouchUp Text tool indicates the OCR method used was Searchable Image or Searchable Image (Exact).
Both provide an OCR output that is a hidden "layer" to the PDF page content.
The first method will do some image "dress up". The second method leaves the image unaltered.
Examine Document, once it has finished, lets you preview the hidden text.
Alternatively, the OCR's hidden text (and any edits made with the TouchUp Text tool) can be viewed by exporting or saving the PDF as a text file.
ClearScan replaces characters it recognizes with an Acrobat-created font (Fd####).
Characters not recognized are left as a bit-mapped image.
You'll see any changes to ClearScan output performed with the TouchUp Text tool.
Be well...
-
4. Re: How to correct incorrectly OCR'ed pdf file?
asdfabcedasf Dec 24, 2010 2:38 AM (in response to CtDave)
Yes, the OCR method was Searchable Image (Exact). Examine Document does give me a preview of all the hidden text, but it is too slow to be useful: it has to examine the whole PDF file before I can see the preview. What I want is to see the hidden text while I'm using the TouchUp Text tool, so that I can tell which hidden text is wrong and whether my corrections are typed correctly. Is there an option in Acrobat that allows me to do so?
-
5. Re: How to correct incorrectly OCR'ed pdf file?
Bill@VT Dec 24, 2010 8:42 AM (in response to asdfabcedasf)
I am on the wrong machine to check, but on the left icon set there should be a selection to look at the structure of the OCR'd document. That allows you to view the text layer. I think you can make changes there, but I have never tried it.
-
6. Re: How to correct incorrectly OCR'ed pdf file?
CtDave Dec 24, 2010 10:01 AM (in response to asdfabcedasf)
Bill is referring to the Content panel.
View > Navigation Panels > Content.
The Content panel's Options menu can be used to turn on highlighting of selected items.
The TouchUp Text tool permits touchup - but the fonts provided by OCR are not system fonts.
Edits can result in an alert to this effect.
Going into the tool's Properties dialog (under the Text tab) permits changes to font.
Edits do overwrite the selected OCR output, so what the Content panel shows goes away.
There's no "refresh" to show what may have been keyed in as new characters.
Expect to save after each edit and then go back to viewing the Content panel.
All in all - exceptionally tedious.
Working over the OCR output of Searchable Image / Searchable Image (Exact) can be done directly in the PDF.
It calls for high labor input and poses a very real possibility that the PDF will get fraggled and become unusable.
An alternative (if the OCR text *has* to be spot on) might be to save out the OCR to a text file.
Clean that up and produce a PDF. Append this to the PDF containing the scanned image.
Consider removal of the initial OCR output so as to not have it present as a possible source of confusion.
Be well...
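
If you go the "save out the OCR to a text file and clean that up" route, a small script can help locate likely misreads before you start editing by hand. The following is only a rough Python sketch under stated assumptions: the exported text is already in a string, and a "suspect" is any token with a digit wedged between letters (for example "qu1ck"). The function name `flag_suspects` and the sample text are made up for illustration, and the heuristic will also flag legitimate tokens such as "H2O".

```python
import re

# Letters with one or more digits wedged between them, e.g. "qu1ck" or "d0g".
_DIGIT_IN_WORD = re.compile(r"[A-Za-z]+\d+[A-Za-z]+")


def flag_suspects(text: str) -> list[tuple[int, str]]:
    """Return (line_number, token) pairs that look like OCR misreads.

    This is a heuristic only: it knows nothing about Acrobat's own
    "suspects" feature and will miss errors that produce valid words.
    """
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            if _DIGIT_IN_WORD.search(token):
                suspects.append((line_number, token))
    return suspects


# Hypothetical exported OCR text, for illustration only.
exported = "The qu1ck brown fox\njumps over the lazy d0g in 2010"
for line_number, token in flag_suspects(exported):
    print(line_number, token)
```

Plain numbers such as "2010" are left alone because the pattern requires letters on both sides of the digit.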
-
7. Re: How to correct incorrectly OCR'ed pdf file?
Bill@VT Dec 24, 2010 4:39 PM (in response to CtDave)
On AA9 I found the ClearScan text to be the special outline fonts. I saw this when I saved as a DOC file (it looked terrible). However, when I saved as a DOC file from a searchable OCR, the text was a system font. So I am not sure about the font in the searchable format; they may be system fonts. Again, I am on the wrong machine to check. I tend to use this desktop at home and not my tablet with AA9.
-
8. Re: How to correct incorrectly OCR'ed pdf file?
asdfabcedasf Dec 25, 2010 6:26 AM (in response to CtDave)
I'm not clear where "the tool's Properties dialog (under the Text tab)" is. Would you please show me?
In the alternative that you mentioned, how do I remove the initial OCR output? If I do so, I will not be able to select text as I look at the image, right? I'll have to go to the appended PDF, which doesn't include the original image?
If my understanding of this alternative is correct, it may not be a good choice for me. For example, the PDF has both text and equations. Since OCR cannot process equations correctly, the text file I export includes the OCR output of the equations, which makes it unreadable. The PDF created from the extracted text would need substantial effort to clean up. In the end, this may not be worth the effort. Do I understand your alternative solution correctly?
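
One way to reduce the clean-up effort described above is to separate prose lines from lines that are probably garbled equations before editing the exported text. Below is a minimal Python sketch, assuming the exported text is in a string and that equation debris has a low share of alphabetic characters. The function names and the 0.7 threshold are arbitrary choices for illustration, not anything Acrobat provides, and the threshold would need tuning on real documents.

```python
def alpha_ratio(line: str) -> float:
    """Fraction of non-space characters in the line that are letters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def split_prose_and_noise(text: str, threshold: float = 0.7):
    """Split text into (prose_lines, noise_lines) by letter share.

    Lines at or above the threshold are treated as prose; the rest are
    treated as likely equation or symbol debris. Blank lines are dropped.
    """
    prose, noise = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if alpha_ratio(line) >= threshold:
            prose.append(line)
        else:
            noise.append(line)
    return prose, noise


# Hypothetical exported text, for illustration only.
prose, noise = split_prose_and_noise("Energy is conserved.\n\nE = mc^2\n")
print(prose)
print(noise)
```

The noise lines can then be reviewed or dropped separately instead of being cleaned up character by character.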
-
9. Re: How to correct incorrectly OCR'ed pdf file?
CtDave Dec 25, 2010 5:06 PM (in response to asdfabcedasf)
TouchUp Properties dialog:
With the TouchUp Text tool selected the cursor becomes an "I-beam" shape.
Select some content or click to achieve a vertical blinking cursor line.
Right click the mouse to open the tool's context menu.
At the bottom, select "Properties..." to open the TouchUp Properties dialog.
The dialog has four tabs: Content, Tag, Text, and Color. The Text tab is presented by default.
The Text tab provides font related information.
Note that Bill is correct - OCR output of Searchable Image or Searchable Image (Exact) is typically a font that can be associated with a system font (Times, Helvetica, Times Roman & variations of these). The rendering mode is such that the characters are invisible (hidden).
(Good to know is that there is a Preflight Checkup that can look for/identify characters that use this rendering mode.)
While within the Text tab you can change to a system font, change font size, and other related characteristics.
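
For anyone who wants to check for the invisible rendering mode outside Acrobat, here is a minimal Python sketch. It assumes you already have a page's content stream decoded to plain text (that extraction step is not shown and would need a PDF library), and it relies on the PDF convention that the `Tr` operator sets the text rendering mode, with mode 3 meaning the glyphs are neither filled nor stroked. The function names and the sample stream are made up for illustration, and this is not a description of how the Preflight check works internally.

```python
import re

# Matches the PDF text rendering mode operator with a single-digit
# operand, e.g. "3 Tr". The lookbehind avoids matching the tail of a
# longer number such as "13 Tr".
_TR_OPERATOR = re.compile(r"(?<![\w.])(\d)\s+Tr\b")


def rendering_modes(content_stream: str) -> list[int]:
    """Return every text rendering mode set in a decoded content stream."""
    return [int(m.group(1)) for m in _TR_OPERATOR.finditer(content_stream)]


def has_invisible_text(content_stream: str) -> bool:
    """True if the stream switches to mode 3 (neither fill nor stroke)."""
    return 3 in rendering_modes(content_stream)


# Hypothetical, hand-written stream fragment for illustration only.
sample = "BT /F1 12 Tf 3 Tr 72 700 Td (hidden OCR text) Tj ET"
print(has_invisible_text(sample))
```

This is a naive text scan: a string literal that happens to contain "3 Tr" would be counted as well, so treat a positive result as a hint rather than proof.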
To remove OCR output:
Use Acrobat Pro's "Examine Document" feature to remove invisible (hidden) text produced via OCR.
Yes, you are correct, if this is done only the image remains.
OCR and equations/symbols:
Yes these can be problematic.
Short of re-mastering the source content with an authoring application, you may have to accept what OCR provides.
At the end of the day, editing OCR output can be gnarly. ClearScan makes it easier, but if additional "clean up" is needed this can become resource intensive.
What are the deliverable's "wants" versus "needs" & the "why" of each?
What resources are in hand?
OCR is, well OCR. Routine textual content can/does come through rather well. Other content, not so well. Any "clean up" can become an aggravating experience rather quickly.
fwiw - When the need is present, I've found it to be more efficacious to re-master scanned technical content rather than to attempt OCR output "clean up".
Be well...
-
10. Re: How to correct incorrectly OCR'ed pdf file?
theleftymop Mar 11, 2011 12:03 PM (in response to asdfabcedasf)
I am using Adobe Acrobat Pro 9, and I was running into the same problem. I don't believe there is a quick and easy way to make wholesale changes/corrections to the OCR "result text". I have binders of older documents on a shelf at my office, and I would love to be able to start scanning them into PDF format and then run them through OCR so that I can perform key word searches on them. For my purposes, I have been using "Searchable Image (Exact)" output.
For some of the documents, this is a snap. I scan, do OCR, and that is it. However, some of the older documents are too messy for this process to run smoothly. For things like copies of faxes, or 40-year-old pink pages of typewritten facsimile paper, performing OCR results in so many errors as to completely frustrate a word search. I tried for hours to find a solution that would let me open up the hidden OCR text overlay, look at it, and add/delete characters as needed.
Although I couldn't find anything within Acrobat Pro 9, my search eventually led me to a program called Infix. It has a feature that allows you to fully edit the OCR text layer, while still looking at the original image text faded into the background. You can type whatever you want, formatted however you want, and you can move the text around to line up with the original image's text. The program works very well, and I highly recommend it to anyone who needs to do intensive editing to OCR text layers of PDF files.
I know this thread is a few months old, but it popped up repeatedly in my searches for a solution, so I hope the above can be of some use to people who are in the same boat as I am.
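
If fixing the OCR layer is not practical, another angle on the word search problem described above is to search the exported text with some tolerance for misreads. The following is a minimal Python sketch using the standard library's `difflib.get_close_matches`. It assumes the OCR text has been exported to a string; the function name `fuzzy_find`, the 0.8 cutoff, and the sample sentence are illustrative choices only, and it works on the exported text, not inside Acrobat's own search.

```python
import difflib


def fuzzy_find(keyword: str, text: str, cutoff: float = 0.8) -> list[str]:
    """Return words from text that closely match keyword, best match first.

    Words are lower-cased and stripped of surrounding punctuation before
    comparison, so the result shows the normalized forms.
    """
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    words.discard("")
    return difflib.get_close_matches(
        keyword.lower(), sorted(words), n=10, cutoff=cutoff
    )


# Hypothetical OCR output where "m" was misread as "rn".
ocr_text = "The agreernent was signed in Verrnont."
print(fuzzy_find("agreement", ocr_text))
```

A lower cutoff finds more damaged words but also returns more false hits, so it is worth trying a few values on a sample page.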



