I have a PDF made from images of a set of scanned pages; I also have the text of those pages in a Word document. I could make a searchable PDF of the images by using Acrobat's OCR and correction tools, but this would be a long and imprecise process (since the text is in Old English and a number of the characters are unusual and not recognized correctly by the Acrobat OCR), and seems unnecessary since I have a good copy of the text in another file. Is there a way that I can take the image PDF and import the Word document and have the two merged to create a searchable image PDF?
Thanks to you both. Creating a PDF from the Word file would, indeed, be easy enough. There are, though, some non-textual elements in the original page images that I had hoped to keep, which is why I was looking to make a searchable image PDF. But if there is no way to "merge" my files, then creating a PDF from the Word file is what I will do.
I had great hopes for the hidden text layer Acrobat creates when OCR-ing text, but I never figured out how to get sufficient control over that text. In particular, I couldn't get corrections to work smoothly, which is always a challenge when OCR-ing mixed languages, not to mention handwritten Chinese. I've heard ABBYY FineReader might allow easier access to the hidden OCR text, but I decided to try making my own text layer when digitizing back issues of a scholarly journal, Early China. Testers found Reader's menu for swapping layers awkward, so I put a button for this on each page (actually two superimposed buttons, each visible in the appropriate layer). This example (3.5 MB) includes quite a few archaic Chinese characters and other snippets from the scans set in-line with text in the text layer, which is practical only because InDesign CS4's export to PDF is smart enough to re-use a single copy of an image that occurs repeatedly.
In fact, I used Acrobat's OCR to recover as much of the text as it could. The copy still required extensive fixing up, and re-setting it in InDesign was not trivial either. But the result does meet the goal of giving the reader both searchable text and a convenient way to check the original publication.
Thanks, David. It sounds as though you do, indeed, have a parallel--and much more complex--situation. Your experience seems to confirm that there is no easy way to do this, but also that I am not the only one who wishes that there were some easy and direct way to correct/replace the text layer.
Over the years there have been quite a few complaints about Acrobat's OCR, many relating to correcting OCR errors (a.k.a. "OCR suspects" in Acrobat terminology). Some difficulties seem to arise from misunderstandings of how Acrobat's OCR is supposed to work or how the capability has evolved. Software manuals tend to provide "cookbook recipes" demonstrating common (and invariably successful) situations rather than explaining how things work, which is what would let the reader address unusual situations. CTDave's 2012 "practical demonstrations" of Acrobat X OCR are a breath of fresh air; he concludes, "The OCR output exists to support search/find - no more, no less." Thank you, CTDave (a year late), and "be well". Next time I have to deal with this issue, that is where I'll start.
David
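For what it's worth, the hidden layer discussed above comes down to ordinary PDF text operators drawn in text rendering mode 3 (neither filled nor stroked), so the text can be searched but never shows over the scan. Below is a minimal sketch, in Python, of building such a content-stream fragment from known-good text. The function names, the font resource name `/F1`, and the coordinates are all assumptions made for illustration; the sketch also assumes the font uses WinAnsiEncoding, which Python's `cp1252` codec only approximates. It does not merge anything into an existing file: that step needs a PDF library, and the line positions would have to come from somewhere other than the Word document.

```python
def escape_pdf_text(text: str) -> bytes:
    """Encode text for a PDF literal string, escaping backslash and parentheses."""
    # Assumption: the font uses WinAnsiEncoding; cp1252 is close to it and
    # covers thorn, eth and ash. Characters outside it (e.g. wynn) raise
    # UnicodeEncodeError and would need an embedded font instead.
    raw = text.encode("cp1252")
    out = bytearray()
    for byte in raw:
        if byte in b"\\()":
            out += b"\\"
        out.append(byte)
    return bytes(out)


def invisible_text_ops(lines, font="/F1", size=10):
    """Build a content-stream fragment that places text invisibly.

    lines: iterable of (x, y, text), in PDF points from the bottom-left corner.
    font:  hypothetical resource name; it must exist in the page's resources.
    """
    parts = [b"BT", b"3 Tr", f"{font} {size} Tf".encode("ascii")]
    for x, y, text in lines:
        # Tm sets an absolute position for each line of text.
        parts.append(f"1 0 0 1 {x:g} {y:g} Tm".encode("ascii"))
        parts.append(b"(" + escape_pdf_text(text) + b") Tj")
    parts.append(b"ET")
    return b"\n".join(parts)


if __name__ == "__main__":
    # Hypothetical position for one line of text.
    print(invisible_text_ops([(72, 700, "Hwæt (we)")]))
```

Getting each line of text to sit over the matching line of the scan is the hard part, which is presumably why there is no one-click "merge" for this.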