1 Reply Latest reply on Apr 23, 2012 12:39 PM by Allta Media

    OCR and Reducing file size


      I have a large document (a book) that I am trying to scan. I will be scanning it chapter by chapter. The book was printed in grayscale, so I don't have a pure BLACK AND WHITE document. I would like to optimize the file size, but I have a few questions about that.


      Currently running:

      Windows 7

      Acrobat Pro X

      Epson GT-S80 High-speed scanner


      1. What is a good typical workflow? I have tried scanning the documents to PDF using the scanner's software and then opening them in Acrobat to OCR them. I have also tried using Acrobat's Scan feature with OCR as one of the steps in the scanning process. I have tried letting both programs do their own color mode detection, where they mix black and white and grayscale pages to reduce the file size, but I have typically told them to stick with grayscale because that gives me the cleanest and clearest document. Does anyone have recommendations for getting a good quality image using a mix of black and white and grayscale, or should I keep using just grayscale?


      2. I am having some trouble, I think, with the file size. I have a 12-page document that I believe was either scanned at 300 dpi, or scanned at full resolution (because I used ClearScan) and then downsampled to 300 dpi. I don't remember exactly, but the file is about 2.20 MB, which works out to roughly 185 KB per page. I would think there is a way to get a smaller file.


      3. For text recognition purposes, this document is not ideal because it is a collection of PowerPoint slide handouts (2 - 3 slides per page), and in some cases there is text on top of an image in the slides, which seems very hard for the OCR to discern.


      4. Once a document has been scanned and OCR has been run on it, I was under the impression that the OCR text sits in a separate layer, and that (if Searchable Text is chosen) you basically have a scanned image with another layer of searchable text. Because the OCR'd text is "there somewhere", is it possible to remove the scanned image of the text and keep just the raw recognized text, similar to what I would get if I created the document in Word and saved it as a PDF?


      5. Going back to number 1: suppose I am stuck with leaving the scanned image behind and just running OCR. What is the optimal way to reduce the file size of the PDF? I had read that scanning at 600 dpi may help with the text recognition. The same article suggested doing the higher resolution scan and using ClearScan because it would a) recognize the text better and b) convert the text image to actual text and reduce the file size. From there, should I then just run the PDF Optimizer to downsample the images to a certain DPI to further reduce the size?
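
      To put rough numbers on questions 2 and 5, here is a minimal sketch of the arithmetic. It assumes US Letter pages (8.5 x 11 inches), 8-bit grayscale (one byte per pixel), and that "MB" means 1024 x 1024 bytes; none of those are confirmed for my actual scans, and the function names are just made up for illustration.

```python
# Back-of-the-envelope numbers for scanned-page file sizes.
# Assumptions (not confirmed for the actual scans): US Letter pages,
# 8-bit grayscale, 1 MB = 1024 * 1024 bytes, 1 KB = 1024 bytes.

def per_page_kb(total_mb, pages):
    """Average size per page in KB for a file of total_mb megabytes."""
    return total_mb * 1024 / pages

def raw_gray_page_mb(width_in, height_in, dpi):
    """Uncompressed size in MB of one 8-bit grayscale page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels / (1024 * 1024)  # one byte per pixel

def pixel_ratio(scan_dpi, target_dpi):
    """How many times more pixels a scan_dpi image has than a target_dpi one."""
    return (scan_dpi / target_dpi) ** 2

if __name__ == "__main__":
    # The 12-page, 2.20 MB file from question 2.
    print(f"average per page: {per_page_kb(2.20, 12):.1f} KB")
    # One Letter page at 300 dpi grayscale before any compression.
    print(f"raw 300 dpi page: {raw_gray_page_mb(8.5, 11, 300):.1f} MB")
    # Pixel-count factor between a 600 dpi scan and a 300 dpi downsample.
    print(f"600 dpi vs 300 dpi pixel count: {pixel_ratio(600, 300):.0f}x")
```

      The ratio only describes pixel counts. The compressed size inside the PDF will not shrink by exactly the same factor, since that depends on the compression used and on what is on the page.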


      Hopefully you all can understand what I am saying and help fill in some gaps.