My first ever post here so please be gentle!
I am trying to work out how to convert a scanned PDF document into a PDF/A-2b document with OCR.
I have thousands of scanned PDFs that I am compressing and converting to PDF/A-2b. I am using Actions within Adobe Acrobat X Pro. I have converted a few thousand so far with no problems.
However, what I now want to do is compress, OCR and convert to PDF/A-2b in one action, to produce a searchable, PDF/A-2b compliant file.
At the moment when I try to do this using Actions within Acrobat X Pro and using the searchable image output style it produces an error at the preflight stage. The error is: Character references .notdef glyph
I am assuming this is because the OCR text layer is picking up bits of text/rubbish that it cannot convert / embed into the document as a font.
When doing the same but using the ClearScan output style (which I don’t want to use anyway, as compression is disabled when using it), it also produces an error at the preflight stage. The error is: Width information for rendered glyphs is inconsistent
My question therefore is how do you convert an OCR’ed PDF to a PDF/A-2b compliant PDF?
Note: It doesn’t HAVE to be PDF/A-2b, PDF/A-2u would be fine as well (but same problems when trying to do this).
What am I doing wrong?
The complete action of the batched job I have configured is as follows:
Step 1 – Remove Hidden Information (everything checked).
Step 2 – Add Document Description (Title, Subject, Author, Keywords filled in).
Step 3 – Optimise Scanned PDF (JPEG2000 / JBIG2 Compression, Text Sharpening High, OCR – Searchable Image).
Step 4 – Preflight (Convert to PDF/A-2b)
Save To: Output Format – PDF Optimiser (compatible with Acrobat 8 / PDF 1.7, plus various other settings, too many to mention).
Is it even possible to have a PDF/A-2a/b/u compliant PDF that is searchable? I am assuming it is?!
Any help and assistance would be greatly appreciated!
This article may be useful.
As the "a" and "b" mean the same as in PDF/A-1 and "u" relates to Unicode mapping, I suspect that, as a practical matter, "b" is the level you can realistically attempt to obtain.
Trying to get "a" from OCR output is rather problematic.
Trying for "u" means you'd have to get the OCR output to properly map to Unicode.
That's not to say something workable is not possible if you are willing to spend the money to obtain one of the high-end third party applications.
Just a guess, but you might be facing the 2-ton load in a 1-ton pickup truck scenario, what with everything the Action is doing.
Are you back-fitting PDFs you have already processed to 2b?
Doing the OCR of such files after the fact might be contributing to the issue(s).
FWIW, on a small sample of PDFs of scanned textual content that had already been processed with Searchable Image (Exact), the Acrobat X Pro Preflight "Convert to PDF/A-2b (sRGB)" analyze and fix provided valid "2b".
I too am experiencing the exact same thing that you are. I can turn a scanned document into PDF/A-2b, but if I then use OCR to recognize text in the document, I can't turn that OCR'd document into PDF/A-2b. If I use Searchable Image (exact) I get the .notdef problem; and if I use ClearScan I get the Width information problem.
Is there any way to convert scanned OCR documents into PDF/A with Acrobat X?
Sorry for the lateness of this reply.
I managed to fix / solve my problem by doing something that in my humble opinion is illogical.
What I did, essentially, is change the order of the OCR and PDF/A steps.
Originally I had the Preflight (Convert to PDF/A-2b) as the very last step, as in my mind that is logical and where you would want to do it.
However, as mentioned, I changed the order so that the Preflight (Convert to PDF/A-2b) step comes BEFORE the Recognize Text (OCR) step, i.e. you OCR the document AFTER you convert it to PDF/A-2b. (I know, weird! How can it still be PDF/A-2b compliant? But it is!)
So my complete batched processing action is now as follows:
Step 1 - Remove Hidden Information (everything checked!)
Step 2 - Add Document Description
Step 3 - Preflight (Convert to PDF/A-2b)
Step 4 - Recognize Text (using OCR) (English UK, Searchable Image (Exact))
Save to (step 5 essentially) I have made it add "_Processed" to the original file name and used PDF Optimizer on the Output Format (loads of settings here, too many to mention).
This now compresses my PDFs by a good 50-60%, OCRs them and makes them fully PDF/A-2b compliant.
I have processed in the region of 3000 PDFs (another 5000 to go!) with no issues.
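For anyone keeping track of the two step orders outside Acrobat, here is a minimal Python sketch. It only models the ordering and the "_Processed" naming described above and does not drive Acrobat in any way. The function names `converts_before_ocr` and `processed_name` are made up for illustration.

```python
from pathlib import Path

# Step names copied from the two actions described in this thread.
# These lists only model the ordering; they are not an Acrobat API.
ORIGINAL_ORDER = [
    "Remove Hidden Information",
    "Add Document Description",
    "Optimise Scanned PDF (OCR, Searchable Image)",
    "Preflight (Convert to PDF/A-2b)",
]
REORDERED = [
    "Remove Hidden Information",
    "Add Document Description",
    "Preflight (Convert to PDF/A-2b)",
    "Recognize Text (OCR, Searchable Image (Exact))",
]


def converts_before_ocr(steps):
    """Return True if the PDF/A preflight step comes before the OCR step."""
    preflight = next(i for i, step in enumerate(steps) if "Preflight" in step)
    ocr = next(i for i, step in enumerate(steps) if "OCR" in step)
    return preflight < ocr


def processed_name(path):
    """Hypothetical helper: append "_Processed" to the original file name."""
    p = Path(path)
    return p.with_name(p.stem + "_Processed" + p.suffix)
```

With these definitions, `converts_before_ocr(REORDERED)` is true and `converts_before_ocr(ORIGINAL_ORDER)` is false, which is the whole difference between the failing and the working action.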
I hope this helps.
This problem still seems to exist!
I'm using Acrobat Pro 11.0.06.
Your "workaround" doesn't work for me (anymore?). I can't use OCR on a PDF/A – neither in an action nor manually without removing the PDF/A-compliance.
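One rough way to see whether a step has dropped the PDF/A claim is to look at the PDF/A identification entries (`pdfaid:part` and `pdfaid:conformance`) in the file's XMP metadata before and after OCR. The Python sketch below does this with a plain text search. It assumes the XMP packet is stored uncompressed in the file, which may not hold for every PDF, and the function names are made up for illustration. It only reports what the file claims. It is not a validator, so Preflight remains the real check.

```python
import re


def claimed_pdfa_level(xmp_text):
    """Return the claimed level, e.g. '2b', or None if no PDF/A claim is found."""

    def find(name):
        # XMP may use attribute form (pdfaid:part="2")
        # or element form (<pdfaid:part>2</pdfaid:part>).
        attr = re.search(r'pdfaid:%s\s*=\s*["\']([^"\']+)["\']' % name, xmp_text)
        elem = re.search(r'<pdfaid:%s>\s*([^<\s]+)\s*</pdfaid:%s>' % (name, name), xmp_text)
        match = attr or elem
        return match.group(1) if match else None

    part = find("part")
    if part is None:
        return None
    conformance = find("conformance") or ""
    return part + conformance.lower()


def claimed_pdfa_level_of_file(path):
    """Assumption: the XMP packet is readable as plain text in the raw file."""
    with open(path, "rb") as handle:
        return claimed_pdfa_level(handle.read().decode("latin-1"))
```

Running `claimed_pdfa_level_of_file` on a file before and after the OCR step would show whether the claim went from a value such as "2b" to nothing.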
The problem seems to be the use of .notdef-glyphs in Acrobat's OCR.
On some documents the OCR produces .notdef-glyphs (which it probably shouldn't in 2014). Those glyphs aren't allowed in PDF/A-2 and PDF/A-3 anymore.
There is a fixup in the Preflight PDF/A presets which is supposed to replace .notdef-glyphs, but it doesn't actually do anything. I even created a custom Preflight profile which should only replace .notdef-glyphs, but even this doesn't work!
That's why Preflight reconverts every single PDF via PostScript, thereby losing all the OCR-text.
PDF/A-1b still works with OCRed scans since PDF/A-1 still allows .notdef-glyphs. But it doesn't allow e.g. JPEG2000!
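If you are considering PDF/A-1b as a fallback, a quick heuristic for spotting the JPEG2000 restriction is to look for the `/JPXDecode` filter name, which is how PDF marks JPEG2000-compressed image streams. The Python sketch below is only a raw byte search, not a PDF parser, so it can miss unusual files or give a false positive. The function names are made up for illustration.

```python
def uses_jpeg2000(pdf_bytes):
    """Heuristic: True if the raw PDF data mentions the JPXDecode (JPEG2000) filter."""
    return b"/JPXDecode" in pdf_bytes


def file_uses_jpeg2000(path):
    """Read a PDF from disk and apply the same heuristic."""
    with open(path, "rb") as handle:
        return uses_jpeg2000(handle.read())
```

A file for which this returns True would need its images recompressed (for example to JPEG) before a PDF/A-1b conversion could succeed.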
I guess Acrobat shouldn't use .notdef-glyphs in its OCR anymore, and Adobe should also make the Preflight function that is supposed to replace them actually work!
I reported those problems with PDF/A conversions in May 2012 but Adobe didn't fix anything.
Is there anybody else still having these problems?
Or a new workaround?