
How do you convert a searchable (OCR) PDF to a PDF/A-2b compliant file?

Mar 30, 2012 8:41 AM

Tags: #ocr #searchable #actions #preflight #pdf/a

Hello all,

 

My first ever post here so please be gentle!

 

I am trying to work out how to convert a scanned PDF document into a PDF/A-2b document with OCR.

 

I have thousands of scanned PDFs that I am compressing and converting to PDF/A-2b. I am using Actions within Adobe Acrobat X Pro. I have converted a few thousand so far with no problems.

 

However, I now also want to OCR the files as part of the same job: compress, OCR and convert to PDF/A-2b, so that the result is a searchable, PDF/A-2b compliant file.

 

At the moment, when I try to do this using Actions within Acrobat X Pro with the Searchable Image output style, it produces an error at the Preflight stage. The error is: Character references .notdef glyph

 

I am assuming this is because the OCR text layer is picking up bits of text or rubbish that it cannot convert and embed into the document as a font.
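To make the Preflight message a little more concrete: in a PDF font, a character code that has no real glyph falls back to the special `.notdef` glyph. The sketch below only illustrates that idea. The encoding table, the sample codes and the `find_notdef_codes` helper are hypothetical and are not part of Acrobat or Preflight.

```python
from typing import Dict, List

NOTDEF = ".notdef"


def find_notdef_codes(text_codes: List[int], encoding: Dict[int, str]) -> List[int]:
    """Return the distinct character codes that resolve to the .notdef glyph.

    `encoding` maps a character code to a glyph name. Codes missing from the
    map are treated as .notdef, which is the usual fallback.
    """
    bad: List[int] = []
    for code in text_codes:
        if encoding.get(code, NOTDEF) == NOTDEF and code not in bad:
            bad.append(code)
    return bad


# Hypothetical encoding of the font used by the hidden OCR text layer.
encoding = {65: "A", 66: "B", 67: NOTDEF}
# Hypothetical character codes found in that text layer.
ocr_codes = [65, 66, 67, 200, 65, 200]

print(find_notdef_codes(ocr_codes, encoding))  # [67, 200]
```

Any code reported by a check like this would be one that the "Character references .notdef glyph" rule objects to.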

 

When doing the same but using the ClearScan output style (which I don’t want to use anyway, as compression is disabled when using it), it also produces an error at the Preflight stage. The error is: Width information for rendered glyphs is inconsistent
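This second message appears to be about font metrics: PDF/A expects the glyph widths declared in the PDF font dictionary to agree with the widths stored in the embedded font program. A rough sketch of such a comparison follows. The sample data, the `inconsistent_widths` name and the tolerance of 1 unit are assumptions for illustration, not what Preflight actually runs.

```python
from typing import Dict, List, Tuple


def inconsistent_widths(
    declared: Dict[str, float],
    embedded: Dict[str, float],
    tolerance: float = 1.0,  # assumed tolerance, in 1/1000 text-space units
) -> List[Tuple[str, float, float]]:
    """List glyphs whose declared width differs from the embedded font's width."""
    problems = []
    for glyph, width in declared.items():
        actual = embedded.get(glyph)
        if actual is not None and abs(width - actual) > tolerance:
            problems.append((glyph, width, actual))
    return problems


# Hypothetical widths: from the PDF font dictionary vs. the embedded font program.
declared = {"A": 722.0, "B": 667.0, "space": 250.0}
embedded = {"A": 722.0, "B": 610.0, "space": 250.5}

print(inconsistent_widths(declared, embedded))  # [('B', 667.0, 610.0)]
```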

 

My question therefore is how do you convert an OCR’ed PDF to a PDF/A-2b compliant PDF?

 

Note: It doesn’t HAVE to be PDF/A-2b, PDF/A-2u would be fine as well (but same problems when trying to do this).

 

What am I doing wrong?

 

The complete action of the batched job I have configured is as follows:

 

Step 1 – Remove Hidden Information (everything checked).

Step 2 – Add Document Description (Title, Subject, Author, Keywords filled in).

Step 3 – Optimised Scanned PDF (JPEG2000 / JBIG2 Compression, Text Sharpening High, OCR – Searchable Image)

Step 4 – Preflight (Convert to PDF/A-2b)

Save To: Output Format – PDF Optimiser (compatible 1.7 version (acrobat 8) + various settings here, too many to mention)

 

Is it even possible to have a PDF/A-2a/b/u compliant PDF that is searchable? I am assuming it is?!

 

Any help and assistance would be greatly appreciated!

 

Cheers

 

Dan

 

 
Replies
  • Mar 30, 2012 10:45 AM   in reply to Dan337

    This article may be useful.

    http://www.pdfa.org/2011/08/pdfa-%E2%80%93-a-look-at-the-technical-side/

     

    As "a" and "b" mean the same as in PDF/A-1, and "u" relates to Unicode mapping, I suspect that, as a practical matter, "b" is the level to aim for.

    Trying to get "a" from OCR output is rather problematic.

    Trying for "u" means you'd have to get the OCR output to properly map to Unicode.

    That's not to say something workable is not possible if you are willing to spend the money to obtain one of the high-end third party applications.
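As a compact summary of the conformance levels discussed above, here is a small sketch that parses a name such as "PDF/A-2b" and checks that the level exists for that part. The `parse_conformance` helper and its tables are hypothetical names written for this illustration. The level descriptions are paraphrases, not wording from the standard.

```python
import re
from typing import NamedTuple

# Paraphrased meaning of each conformance level.
LEVELS = {
    "b": "basic: reliable visual reproduction",
    "u": "level b plus Unicode mapping for all text",
    "a": "Unicode mapping plus tagged logical structure",
}

# PDF/A-1 defines only levels a and b; "u" was introduced with PDF/A-2.
ALLOWED = {1: {"a", "b"}, 2: {"a", "b", "u"}, 3: {"a", "b", "u"}}


class Conformance(NamedTuple):
    part: int
    level: str


def parse_conformance(name: str) -> Conformance:
    """Parse a name like 'PDF/A-2b' and validate the part/level combination."""
    match = re.fullmatch(r"PDF/A-(\d)([abu])", name.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a PDF/A conformance name: {name!r}")
    part, level = int(match.group(1)), match.group(2).lower()
    if level not in ALLOWED.get(part, set()):
        raise ValueError(f"PDF/A-{part} has no level {level!r}")
    return Conformance(part, level)


target = parse_conformance("PDF/A-2b")
print(target, "->", LEVELS[target.level])
```

With OCR output, "b" only asks for faithful visual reproduction, whereas "u" and "a" additionally depend on how well the recognized text maps to Unicode.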

     

    Be well...

     
  • Mar 30, 2012 2:26 PM   in reply to Dan337

    Just a guess, but you might be facing the "2-ton load in a 1-ton pickup truck" scenario, given everything the Action is doing.

    Are you going back and adding OCR to PDFs you have already converted to 2b?

    Doing the OCR after the fact might be contributing to the issue(s).

     

    FWIW, on a small sample of scanned text PDFs that had already been processed with Searchable Image (Exact), the Acrobat X Pro Preflight "Convert to PDF/A-2b (sRGB)" analyze and fix produced valid "2b" files.

     

    Be well...

     
  • May 7, 2012 12:16 AM   in reply to Dan337

    I too am experiencing the exact same thing that you are.  I can turn a scanned document into PDF/A-2b, but if I then use OCR to recognize text in the document, I can't turn that OCR'd document into PDF/A-2b.  If I use Searchable Image (exact) I get the .notdef problem; and if I use ClearScan I get the Width information problem.

     

    Is there any way to convert scanned OCR documents into PDF/A with Acrobat X?

     

    Thanks,

      John

     
  • Jun 11, 2012 1:39 PM   in reply to jwwiegley

    I also have this problem, and I'm a little confused that Acrobat X Pro seems unable to make its own OCR output compatible with PDF/A-2b.

    I'd be very thankful for any hints and workarounds!

     
  • Mar 1, 2014 3:58 AM   in reply to Dan337

    This problem still seems to exist!

     

    I'm using Acrobat Pro 11.0.06.

     

    Your "workaround" doesn't work for me (anymore?). I can't use OCR on a PDF/A – neither in an action nor manually without removing the PDF/A-compliance.

     

    The problem seems to be the use of .notdef-glyphs in Acrobat's OCR.

    On some documents the OCR produces .notdef-glyphs (which it probably shouldn't in 2014). Those glyphs aren't allowed in PDF/A-2 and PDF/A-3 anymore.

    There is a function in the Preflight PDF/A-presets which is supposed to replace .notdef-glyphs but doesn't do anything after all. I even created a custom Preflight profile which should only replace .notdef-glyphs but even this doesn't work!

    That's why Preflight reconverts every single PDF via PostScript, thereby losing all the OCR-text.

    PDF/A-1b still works with OCRed scans, since PDF/A-1 still allows .notdef-glyphs. But it doesn't allow JPEG2000, for example!
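The trade-off described here can be written down as a small lookup. This is only a sketch of the restrictions as reported in this thread. The `FORBIDDEN` table, the feature labels and the `viable_targets` helper are hypothetical names, and the real standards contain many more rules.

```python
from typing import List, Set

# Restrictions as described in this thread (far from the complete rule set).
FORBIDDEN = {
    "PDF/A-1b": {"jpeg2000"},       # JPEG2000 compression is not allowed
    "PDF/A-2b": {"notdef_glyphs"},  # references to .notdef are not allowed
}


def viable_targets(features: Set[str]) -> List[str]:
    """Return the targets whose restrictions the file's features do not hit."""
    return [
        target
        for target, banned in FORBIDDEN.items()
        if not (features & banned)
    ]


# OCR text layer containing .notdef references, no JPEG2000 images:
print(viable_targets({"notdef_glyphs"}))              # ['PDF/A-1b']
# The same OCR layer combined with JPEG2000-compressed scans:
print(viable_targets({"notdef_glyphs", "jpeg2000"}))  # []
```

The second case is the combination from the original Action (JPEG2000 compression plus Searchable Image OCR), which is why neither target works without changing one of the two.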

     

    I guess Acrobat shouldn't use .notdef-glyphs in its OCR anymore, and Adobe should also make the Preflight function that is supposed to replace them actually work!

     

    I reported those problems with PDF/A conversions in May 2012 but Adobe didn't fix anything.

     

    Is there anybody else still having these problems?

     

    Or a new workaround?

     

    Thank you!

     

    Best

     

    Bastian

     