I'm using Adobe Acrobat X.
I was pretty happy with the size of a 216-page PDF that I am working on, approximately 6 MB.
The PDF consists of only CCITT G4 encoded black and white pages, scanned from photocopies. I used Recognize Text, which seemed to work pretty well, and increased the file size by an insignificant amount.
Then I spent a couple of hours using Find All Suspects. Recognize Text interpreted a lot of stray marks on the original photocopies as text, and I went through and marked all of them as "not text".
I didn't realize the file had grown to 22.9 MB until I tried to email it to myself.
What do you suppose happened? Did I miss a step?
When you run the Recognize Text command, there is a Settings section, and if you click the Edit... button you can select Searchable Image, Searchable Image (Exact), or ClearScan. A blog post titled "Better PDF OCR: ClearScan is smaller, looks better" helps to explain the differences.
I read the blog. I think ClearScan is against the principles of digital archiving. I don't want to use ClearScan on the current document that I am working on. I want to preserve the scanned image of the original document.
As mentioned, the file size that I ended up with after using Text Recognition, of around 6 MB, is satisfactory. The percentage shown by Audit Space Usage for the fonts is acceptable. I still don't know why all that document overhead got added.
Tinkering with optimizing a copy of the 22.9MB file is a good suggestion. However, I've gone back to an earlier version of the PDF from before I used Text Recognition, and would rather start over from that point.
Here are a couple of screenshots to demonstrate the process that is causing the bloat. You can see that Text Recognition picked up stray marks on the edge of the scanned photocopy paper:
I started over with a copy of the PDF version from before using Text Recognition. I did Text Recognition; once again, that increased the file size by only a small amount. Then I started marking these Suspects as Not Text again. Although I didn't do all 216 pages again, I could see the file significantly growing in size after every Save.
Also, whenever I would save, I would see a message "Replacing Fonts", which doesn't make sense. I want to delete these Suspects, as if they were never recognized by Text Recognition in the first place.
LoriAUC wrote:
Have you tried just doing a simple Save As > Reduced File Size?
I went back to the 22.9MB file to try this. Indeed, it reduced the file size to 3.5MB, and got rid of most of the bloat.
However, I think it also reduced the quality of the PDF because it added another re-encoding and/or downsampling step, which I don't want to do.
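One rough way to check whether the page images were re-encoded is to compare which stream filters appear in the file before and after Reduced File Size. The sketch below is only a heuristic and is not an Acrobat feature: it counts filter names in the raw bytes. The function name `count_filters` is made up for illustration. A change from /CCITTFaxDecode to, say, /JBIG2Decode would suggest re-encoding, but identical counts do not prove the images are untouched, since an image can be downsampled and written back with the same filter.

```python
import sys
from typing import Dict

# Standard PDF stream filter names worth comparing between two versions of a file.
FILTERS = (
    b"/CCITTFaxDecode",  # CCITT G3/G4 fax compression (the original scans here)
    b"/JBIG2Decode",     # JBIG2, another bilevel image codec
    b"/DCTDecode",       # JPEG
    b"/JPXDecode",       # JPEG 2000
    b"/FlateDecode",     # zlib/deflate, used for many non-image streams too
)

def count_filters(data: bytes) -> Dict[str, int]:
    """Count occurrences of each filter name in the raw PDF bytes (heuristic only)."""
    return {name.decode("ascii"): data.count(name) for name in FILTERS}

if __name__ == "__main__":
    # Usage: python count_filters.py before.pdf after.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            counts = count_filters(f.read())
        print(path, counts)
```

Running it on the 22.9 MB file and on the 3.5 MB file side by side would show whether the mix of filters changed.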
I'd really like to know what designating Suspects as Not Text is supposed to do, and possibly avoid using it if the purpose of it is different from what I want it to do.
We tried marking suspects as "Not Text" in 3-4 files, but the file size increase was marginal, and even that can be attributed to the font information that gets embedded while correcting suspects.
Would it be possible for you to share the file on which you are encountering this issue for our investigation?
How about throwing it up on Adobe Sendnow (free) and then posting the link?
Hi,
On incrementally saving the file, the file size grows, and this is in accordance with how the feature is designed.
I tried saving the file again with a different name, and this time I could not reproduce the issue. Could you please confirm whether you are also seeing the same behavior for a non-incremental file save?
Thanks
apangasa wrote:
Hi,
On incrementally saving the file, the file size grows, and this is in accordance with how the feature is designed.
I tried saving the file again with a different name, and this time I could not reproduce the issue. Could you please confirm whether you are also seeing the same behavior for a non-incremental file save?
Thanks
I am not sure that I understand your question. Do you mean Save versus Save As?
My final step is usually Save As...PDF/A. However, the bloated file was probably from before doing that final step. So I just tried opening the bloated file, and doing Save As...PDF/A, which does produce a file of significantly reduced size. However, I get the spinning beach ball, and have to force quit Acrobat. (I tried it twice.)
"Incremental save" means an ordinary Save, not Save As.
By design, this always makes the file bigger, sometimes much bigger.
It is designed to be very quick, so it never deletes anything. And if you make a change you have both the old and the new.
Save As, meaning an ordinary Save As without optimization or PDF/A or anything, removes the deleted stuff without touching the quality. I think it is this vital step you have missed. Nothing to do with scanning or OCR specifically.
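If you want to see whether a file has accumulated incremental saves, here is a minimal sketch. It relies on the fact that each incremental update appended to a PDF ends with its own %%EOF marker, so counting those markers gives a rough revision count. The helper names `count_eof_markers` and `revision_end_offsets` are made up for illustration. Treat the result as an estimate: a linearized ("fast web view") file normally contains two markers even with no incremental saves, and the byte sequence could in principle appear inside stream data.

```python
import re
import sys
from typing import List

EOF_MARKER = re.compile(rb"%%EOF")

def count_eof_markers(data: bytes) -> int:
    """Count %%EOF markers in raw PDF bytes (rough proxy for saved revisions)."""
    return len(EOF_MARKER.findall(data))

def revision_end_offsets(data: bytes) -> List[int]:
    """Byte offset just past each %%EOF marker, roughly where each revision ends."""
    return [m.end() for m in EOF_MARKER.finditer(data)]

if __name__ == "__main__":
    # Usage: python count_revisions.py bloated.pdf saved_as_copy.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            data = f.read()
        ends = revision_end_offsets(data)
        print(f"{path}: {len(data)} bytes, {len(ends)} %%EOF marker(s)")
        previous = 0
        for number, end in enumerate(ends, start=1):
            # Size of each appended section shows how much every Save added.
            print(f"  section {number}: {end - previous} bytes")
            previous = end
```

A file that has been through many ordinary Saves should show many sections, while a copy written with a plain Save As should show one (or two if it is linearized).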
apangasa and Test Screen Name, thank you for bringing the difference between Save and Save As… to my attention. This may be my first experience with a program where Save As… means something other than simply making a copy, which can be given a different name and location.
However, now I am wondering whether something else altogether explains the bloating phenomenon.
The majority of the elements found by OCR Suspects are actually correct. As I've been going through all Suspects one at a time, whenever no correction is required, I've been clicking "Find Next", rather than "Accept and Find". Could this be triggering the "Replacing Font" phenomenon, and causing the bloating, when I Save?
Here's an example of an OCR Suspect that is correct, where I clicked Find Next: