    Identifying PDF portfolios and OCRed PDFs




      We have an application with about 10,000 PDF file attachments.


      Many of those were run through OCR. Unfortunately, the users were not instructed well enough before doing so, and it now turns out that most of the resulting OCR text is of poor quality.


      Another issue: our application can do full-text search on the PDFs, but many of the files are PDF portfolios, which the full-text engine cannot "read". (Technically, I have been told by the programmer of the full-text search engine, PDF portfolios are NOT PDFs ;-) Silly, but I can't change that.)


      What I need help with is how to:


      identify image-based PDFs that have already been run through OCR, so that we can rerun OCR on them


      identify PDFs that are actually PDF portfolios, so that we can (maybe automatically, maybe manually) convert them to normal PDFs
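
      To illustrate the kind of check meant by the two points above, here is a minimal Python sketch using only the standard library. It rests on two assumptions that would need to be verified against our files: that a portfolio carries a /Collection entry in its document catalog (as the PDF specification describes), and that OCR tools usually add their text as invisible text, i.e. text rendering mode 3 ("3 Tr" in a content stream). The function names are made up for this example. It is a crude byte-level heuristic, not a real PDF parser: it will miss encrypted files and streams that use filters other than Flate, and it can produce false positives.

```python
import re
import zlib

# /Collection as a whole name, so /CollectionItem etc. do not match.
COLLECTION_KEY = re.compile(rb"/Collection(?![A-Za-z0-9])")
# "3 Tr" selects text rendering mode 3 (invisible text).
INVISIBLE_TEXT = re.compile(rb"(?<![0-9.])3\s+Tr(?![A-Za-z])")
# Raw bytes between the "stream" and "endstream" keywords.
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def _chunks(data: bytes):
    """Yield the raw file bytes, then every stream that zlib can inflate."""
    yield data
    for match in STREAM.finditer(data):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or uses another filter
        yield inflated


def looks_like_portfolio(data: bytes) -> bool:
    """Heuristic: a /Collection key appears somewhere in the file."""
    return any(COLLECTION_KEY.search(chunk) for chunk in _chunks(data))


def may_have_ocr_layer(data: bytes) -> bool:
    """Heuristic: some content stream switches to invisible text."""
    return any(INVISIBLE_TEXT.search(chunk) for chunk in _chunks(data))
```

      Each function takes the raw bytes of one file, for example the result of pathlib.Path(name).read_bytes(). Note that the second check only says whether a file seems to contain an OCR text layer at all; it says nothing about the quality of that text.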


      I don't expect a prebuilt solution.


      We would even pay someone to help us out here. The data in those PDFs is crucial for our whole enterprise.


      I have already tried some of the JavaScript APIs, but with no luck. Maybe there are other tools that can help us here?


      I am thankful for any pointers and help on this topic.