ClearScan over a searchable text? (OCR accuracy)

New Here, Jan 21, 2010

Hello,

I've got some PDF files with searchable text, but unfortunately they are hard to read (the letters are not smooth at all).
If I use the ClearScan option on a document that is already searchable, would it affect the OCR accuracy?

Any help would be greatly appreciated,
Cheers

PS: I was told the correct DPI for ClearScan was 300 and not 600. Right or wrong?

LEGEND, Jan 24, 2010

Hi Marianna,

For any OCR, 300 ppi is generally used to give a good balance between OCR accuracy and file size.
While 400 ppi or even 600 ppi can yield somewhat more accuracy of OCR, the file size can get large.
Optimizing the PDF to reduce file size typically results in destructive removal of image pixels.
This degrades the visual quality of the image.
As you drop below 300 ppi, OCR accuracy falls off dramatically.
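
As a rough illustration of the file-size side of that trade-off, here is a minimal sketch of how the raw pixel count grows with scan resolution. It assumes a US Letter page (8.5 x 11 in) stored as uncompressed 8-bit grayscale, which is my assumption and not something from the thread; scanned PDFs are normally compressed, so real files will be smaller, but the raw pixel count still quadruples when the resolution doubles. The function names are made up for the example.

```python
def pixel_count(ppi, width_in=8.5, height_in=11.0):
    """Number of pixels in one page scanned at `ppi` pixels per inch.

    The default page size (US Letter) is an assumption for illustration.
    """
    return round(width_in * ppi) * round(height_in * ppi)


def raw_size_mb(ppi, bytes_per_pixel=1, width_in=8.5, height_in=11.0):
    """Uncompressed image size in megabytes (1 MB = 1,000,000 bytes).

    bytes_per_pixel=1 corresponds to 8-bit grayscale (assumed).
    """
    return pixel_count(ppi, width_in, height_in) * bytes_per_pixel / 1_000_000


if __name__ == "__main__":
    for ppi in (150, 300, 400, 600):
        print(f"{ppi} ppi: {pixel_count(ppi):,} pixels, "
              f"about {raw_size_mb(ppi):.1f} MB uncompressed")
```

Going from 300 ppi to 600 ppi doubles both dimensions, so the page carries four times as many pixels before any compression is applied.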

Because you mentioned "searchable" in the context of OCR along with letters that are not smooth, it sounds like the source paper was not of good quality or the scanner needed cleaning and inspection. Neither "Searchable Image" nor "Searchable Image (Exact)" affects the existing scanned image (unless downsampling is used, which destructively removes pixels). I mention this because you do not "see" the OCR output from either of these OCR methods.
OCR from Searchable Image or Searchable Image (Exact) adds a layer of hidden text to the PDF page.
It is not part of the PDF page content (which is the scanned image).


ClearScan uses a custom Adobe font to replace the image's characters while leaving a low-resolution copy of the image in the background.
With ClearScan, you can edit "suspect" words. Those "suspects", if not edited, remain as a bitmapped image.
Note that sometimes the scanned image of some characters cannot be processed by OCR.
Typically, these are not provided to you as "suspects" by ClearScan and remain in the PDF page content as a bitmapped image.

If you want to edit the ClearScan OCR output later you can.
First, you must change the PDF page content (ClearScan output) to a different font, one that is installed on your system and is not "locked" by license restrictions.


Something else that is good to know, and may be useful at some point, is that Acrobat 9's Preflight has a Fixup that will embed the OCR output's "hidden text".
A Batch Sequence could be built around this.

There is also something you can configure when bringing TIFF (or any other supported file format) into PDF via Acrobat.

Go into Preferences. Select the "create PDF" category. Look for the file format in the window showing file formats.

Select it. Often, there is an "Edit" button. Use it to get the dialog that lets you edit some of the parameters being used for the conversion.


Be well...

Explorer, Aug 16, 2010

Dave. I see you keep giving essentially the same answer whenever any user queries the Suspect functionality in Acrobat 9 OCR, i.e. when they ask how to identify and correct OCR errors.

"With ClearScan, you can edit "suspect" words. Those "suspects", if not edited remain as a bit mapped image.
Note that sometimes, some characters' scanned image cannot be processed by OCR.
Typically, these are not provided to you as "suspects" by ClearScan and remain in the PDF page content as a bit mapped image.

If you want to edit the ClearScan OCR output later you can.
First, you must change the PDF page content (ClearScan output) to a different font, one that is installed on your system and is not "locked" by license restrictions."

You keep missing the point. If a user runs ClearScan OCR and then uses 'Find First...' or 'Find All...' suspects, Acrobat 9 never identifies any. Therefore, they can't even start to use the convoluted 'change font and then edit' method.

If you believe I am incorrect, please upload a PDF image file which we can test in Acrobat 9.x with ClearScan and which DOES allow 'suspects' to be identified.

Until then, maybe best to put your OCR advice on hold....

Looking forward to your upload.

LEGEND, Aug 23, 2010

Well, the "Find Suspects" sure does not.
Now, back in the day of Acrobat 9.0, I played with ClearScan a lot. As I recollect, I got to do the 'suspect' thing.
Sure cannot replicate that now.
Maybe the recollection just ain't so.
Maybe the transition from 9.0 to 9.3.4 resulted in an even more refined ClearScan, eh?

"Therefore, they can't even start to use the convoluted 'change font and then edit' method."


But, you know, you don't need it to change ClearScan's "Fd...." fonts.


As I've said before, use the TouchUp Text tool to select some ClearScan output text.
From within the Properties dialog, on the Text tab, you can change Font, Font Size or Font color.

Rather than providing you something out of Captivate, placed in a PDF for access via acrobat.com
that demonstrates this, just view David Mankin's "Scanning and OCR" eSeminar
http://adobechats.adobe.acrobat.com/p49554903/


David demonstrates changing ClearScan Font and Font Size just past 0:46:20 on the timeline.

Really not convoluted at all.

Of course, the "gold standard" would be to transcribe the content on the hard copy to a word processor file, no?

Then the whole OCR thing becomes moot.


Something worth noting.
There have been occasions where I've worked with a scanned image processed with ClearScan
and, when I've gone to use the TouchUp Text tool to change the font configuration, I've been greeted by:

     Accumulated text within the attempted selection area is rotated other than horizontal or vertical.
    TouchUp cannot create a text selection.

Be well...
