We have an application with about 10'000 PDF file attachments.
Many of those were run through OCR. Unfortunately, the users weren't instructed well enough before doing so, and it now turns out that most of the OCR text is of poor quality.
Another issue: our application can do full-text search on the PDFs, but many of the files are PDF portfolios, which the full-text engine cannot "read" (technically, I have been told by the full-text search engine programmer, PDF portfolios are NOT PDFs ;-) stupid, but I can't change that).
What I now require is help on how to:
identify PDFs with images which have been run through OCR, so that we can rerun OCR on those PDFs
identify PDFs which are actually PDF portfolios, so that we can (maybe automatically, maybe manually) convert them to normal PDFs
I don't expect any prebuilt solution; we would even pay someone to help us out here. The data within those PDFs is crucial for our whole enterprise.
I am thankful for any pointers and help on this topic.
Testing for whether OCR has been performed may be tough. Preflight can report on hidden text objects, but this probably wouldn't be useful to you.
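If you want to triage the files with a script first, here is a rough sketch of both checks in Python, using only the standard library. It rests on two assumptions that you should verify against a few of your own files: OCR tools usually write their text layer in text rendering mode 3 (invisible text, which appears as "3 Tr" in a page content stream), and a PDF portfolio carries a /Collection entry in its document catalog. The function names are made up for this example. It is a byte-level heuristic, not a real PDF parser, so it can miss encrypted files or streams with filters other than Flate, and it can occasionally give a false positive.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# "3 Tr" selects text rendering mode 3 (invisible text), typical of OCR layers.
_INVISIBLE_TEXT_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")
# A /Collection entry in the catalog marks a PDF portfolio.
_COLLECTION_RE = re.compile(rb"/Collection\b")


def _searchable_bytes(pdf_bytes):
    """Return the raw file plus every stream that inflates as Flate data."""
    parts = [pdf_bytes]
    for match in _STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            parts.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate-compressed (or damaged); skip this stream
    return b"\n".join(parts)


def has_invisible_text(pdf_bytes):
    """Heuristic: True if an invisible-text operator is found anywhere."""
    return _INVISIBLE_TEXT_RE.search(_searchable_bytes(pdf_bytes)) is not None


def looks_like_portfolio(pdf_bytes):
    """Heuristic: True if a /Collection name is found anywhere."""
    return _COLLECTION_RE.search(_searchable_bytes(pdf_bytes)) is not None


def classify(path):
    """Read one file and return both flags as a dict."""
    with open(path, "rb") as handle:
        data = handle.read()
    return {
        "ocr_text_layer": has_invisible_text(data),
        "portfolio": looks_like_portfolio(data),
    }
```

Calling classify() on each attachment gives you two lists to check by hand. For the final tool, a proper PDF library would be more reliable, since it can read the catalog and page contents directly instead of pattern matching on bytes.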
Thanks for your help.
I found someone who can write some Java code for me that does exactly what I require.