I have a document management system that full-text indexes all of our documents, but if a PDF is simply a picture, the full-text indexing is useless unless I run OCR on the PDF, in essence adding text.
To make PDFs full-text searchable, I run an OCR process on an entire folder. I'd rather not run that process on PDFs that ALREADY have embedded text. Is there a way to identify whether a PDF has embedded text without opening it?
You can use a Preflight to check the PDFs via an Action (Acrobat X) or Batch Sequence (pre-Acrobat X).
The Preflight would use a Custom check.
One option is to check for "Can be mapped to Unicode".
To be searchable, the glyphs on the PDF's pages must map to Unicode.
Similarly, you can create a Preflight custom check that tests for "Invisible text objects".
"Invisible text objects" are text objects using text rendering mode 3 (invisible text).
Text rendering mode 3 (no glyph/font fill or stroke) is used for the output of OCR (Searchable Image and Searchable Image (Exact)).
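If you want to pre-sort a folder outside Acrobat, the same two ideas (is any text shown at all, and is it drawn in rendering mode 3) can be roughly approximated with a script. The following is only a minimal heuristic sketch in Python using the standard library; the names classify_pdf_bytes and pdfs_needing_ocr are made up for this illustration. It assumes page content streams are either uncompressed or Flate-compressed and the file is not encrypted, and it does not check whether glyphs map to Unicode, so treat its answer as a hint and not as a replacement for the Preflight checks.

```python
import re
import zlib
from pathlib import Path

# Body of each stream object; the lookbehind skips the "stream" inside "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A string or array operand followed by a text-showing operator (Tj or TJ).
SHOW_TEXT_RE = re.compile(rb"[)\]>]\s*T[jJ]\b")
# "3 Tr" selects text rendering mode 3 (invisible text), typical of OCR output.
INVISIBLE_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def _inflate(data):
    """Return Flate-decoded bytes, or the raw bytes if they are not Flate data."""
    try:
        # decompressobj tolerates the end-of-line bytes left after the stream data.
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return data


def classify_pdf_bytes(pdf_bytes):
    """Heuristically classify a PDF (hypothetical helper, not an Acrobat API).

    Returns "image-only", "ocr-text-layer", or "native-text".
    """
    has_text = False
    has_invisible = False
    for match in STREAM_RE.finditer(pdf_bytes):
        content = _inflate(match.group(1))
        if SHOW_TEXT_RE.search(content):
            has_text = True
        if INVISIBLE_RE.search(content):
            has_invisible = True
    if not has_text:
        return "image-only"
    return "ocr-text-layer" if has_invisible else "native-text"


def pdfs_needing_ocr(folder):
    """List the PDFs under folder in which no text-showing operator was found."""
    return [
        path
        for path in sorted(Path(folder).rglob("*.pdf"))
        if classify_pdf_bytes(path.read_bytes()) == "image-only"
    ]
```

Calling pdfs_needing_ocr on the folder you normally OCR would return only the files where no text was found. Files that use other stream filters or encryption may be misreported as "image-only", which at worst means they get OCRed as they are today.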
Acrobat Pro has two out-of-the-box Preflights that may also be of interest.
One uses a fix-up to embed fonts; the other also embeds fonts, including those used by text in rendering mode 3 (invisible text).