3 Replies Latest reply on Aug 23, 2010 8:38 PM by CtDave

    ClearScan over a searchable text? (OCR accuracy)

    Marianna Sol

      Hello,

       

      I've got some PDF files with searchable text, but unfortunately they are hard to read (the letters are not smooth at all).
      If I use the ClearScan option on a document that is already searchable, would it affect the OCR accuracy?

       

      Any help would be greatly appreciated,
      Cheers

       

      PS: I was told the correct DPI for ClearScan was 300 and not 600. Right or wrong?

        • 1. Re: ClearScan over a searchable text? (OCR accuracy)
          CtDave Level 5

          Hi Marianna,

          For any OCR, 300 ppi is generally used to give a good balance between OCR accuracy and file size.
          While 400 ppi or even 600 ppi can yield somewhat better OCR accuracy, the file size can get large.
          Optimizing the PDF to reduce file size typically results in destructive removal of image pixels.
          This degrades the visual quality of the image.
          As you drop below 300 ppi, OCR accuracy falls off dramatically.
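
          To put rough numbers on the file-size side of that trade-off, here is a minimal Python sketch of how the raw pixel data grows with resolution. The page size (US Letter), the 8-bit grayscale depth and the lack of compression are assumptions for illustration only; real scanned PDFs are compressed and will be much smaller.

```python
# Rough, uncompressed size of one scanned page at several resolutions.
# Assumptions (not from the thread): US Letter page (8.5 x 11 in),
# 8-bit grayscale (1 byte per pixel), no compression.

def raw_scan_bytes(ppi, width_in=8.5, height_in=11.0, bytes_per_pixel=1):
    """Return the uncompressed size in bytes of a page scanned at `ppi`."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bytes_per_pixel

for ppi in (300, 400, 600):
    size_mb = raw_scan_bytes(ppi) / 1_000_000
    print(f"{ppi} ppi: {size_mb:.1f} MB uncompressed")
```

          Under those assumptions the page comes to about 8.4 MB at 300 ppi, 15.0 MB at 400 ppi and 33.7 MB at 600 ppi. Doubling the resolution quadruples the pixel count, which is why 600 ppi files get large quickly.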

          Because you mentioned "searchable" in the context of OCR and letters that are not smooth, it sounds like the source paper that was scanned was not of good quality, or the scanner needed cleaning and inspection. Neither "Searchable Image" nor "Searchable Image (Exact)" affects the existing scanned image (unless downsampling is used, which destructively removes pixels). I mention this because you do not "see" the OCR output from either of these OCR methods.
          OCR from Searchable Image or Searchable Image (Exact) adds a layer of hidden text to the PDF page.
          It is not part of the PDF page content (which is the scanned image).


          ClearScan uses a custom Adobe font to replace the image's characters while leaving a low-resolution copy of the image in the background.
          With ClearScan, you can edit "suspect" words. Those "suspects", if not edited, remain as a bitmapped image.
          Note that sometimes the scanned image of some characters cannot be processed by OCR.
          Typically, these are not provided to you as "suspects" by ClearScan and remain in the PDF page content as a bitmapped image.

          If you want to edit the ClearScan OCR output later you can.
          First, you must change the PDF page content (ClearScan output) to a different font, one that is installed on your system and is not "locked" by license restrictions.


          Something else that is good to know, and may be useful at some point, is that Acrobat 9's Preflight has a Fixup that will embed the OCR output's "hidden text".
          A Batch Sequence could be built around this.

           

           

          There is something else to configure when bringing TIFF (or any other supported file format) into PDF via Acrobat:

          Go into Preferences. Select the "create PDF" category. Look for the file format in the window showing file formats.

          Select it. Often, there is an "Edit" button. Use it to get the dialog that lets you edit some of the parameters being used for the conversion.


          Be well...

          • 2. Re: ClearScan over a searchable text? (OCR accuracy)
            MikkM Level 1

            Dave. I see you keep giving essentially the same answer whenever any user queries the Suspect functionality in Acrobat 9 OCR, i.e. when they ask how to identify and correct OCR errors.

             

            "With ClearScan, you can edit "suspect" words. Those "suspects", if not edited remain as a bit mapped image.
            Note that sometimes, some characters' scanned image cannot be processed by OCR.
            Typically, these are not provided to you as "suspects" by ClearScan and remain in the PDF page content as a bit mapped image.

            If you want to edit the ClearScan OCR output later you can.
            First, you must change the PDF page content (ClearScan output) to a different font, one that is installed on your system and is not "locked" by license restrictions.

            "

             

            You keep missing the point. If a user uses ClearScan OCR, then uses 'Find First...' or 'Find All...' suspects, Acrobat 9 never identifies any. Therefore, they can't even start to use the convoluted 'change font and then edit' method.

             

            If you believe I am incorrect, please upload a PDF image file which we can test in Acrobat 9.x with ClearScan and which DOES allow 'suspects' to be identified.

             

            Until then, maybe best to put your OCR advice on hold....

             

             

            Looking forward to your upload.

            • 3. Re: ClearScan over a searchable text? (OCR accuracy)
              CtDave Level 5

              Well, the "Find Suspects" sure does not.
              Now, back in the day of Acrobat 9.0, I played with ClearScan a lot. As I recollect, I got to do the 'suspect' thing.
              Sure cannot replicate that now.
              Maybe the recollection just ain't so.
              Maybe the transition from 9.0 to 9.3.4 resulted in an even more refined ClearScan, eh?

               

              "Therefore, they can't even start to use the convoluted 'change font and then edit' method."


              But, you know, you don't need it to change ClearScan's "Fd...." fonts.


              As I've said before, use the TouchUp Text tool to select some ClearScan output text.
              From within the Properties dialog, on the Text tab, you can change Font, Font Size or Font color.

               

              Rather than providing you something out of Captivate, placed in a PDF for access via acrobat.com
              that demonstrates this, just view David Mankin's "Scanning and OCR" eSeminar
              http://adobechats.adobe.acrobat.com/p49554903/


              David demonstrates changing ClearScan Font and Font Size just past 0:46:20 on the timeline.

               

              Really not convoluted at all.

              Of course, the "gold standard" would be to transcribe the content on the hard copy to a word processor file, no?

              Then the whole OCR thing becomes moot.


              Something worth noting.
              There have been occasions where I've worked with a scanned image processed with ClearScan
              and, when I've gone to use the TouchUp Text tool to change the font configuration, I've been greeted by:

                   Accumulated text within the attempted selection area is rotated other than horizontal or vertical.
                  TouchUp cannot create a text selection.

               

              Be well...