Stephen-R-R

Acrobat creating searchable image PDF

Jan 19, 2013 11:43 AM

Tags: #acrobat #searchable_image

I have a PDF made from images of a set of scanned pages; I also have the text of those pages in a Word document.  I could make a searchable PDF of the images by using Acrobat's OCR and correction tools, but this would be a long and imprecise process (since the text is in Old English and a number of the characters are unusual and not recognized correctly by the Acrobat OCR), and seems unnecessary since I have a good copy of the text in another file.  Is there a way that I can take the image PDF and import the Word document and have the two merged to create a searchable image PDF?

 
Replies
  • Jan 19, 2013 12:03 PM   in reply to Stephen-R-R

    It might be easier to just create a new PDF from a reconstructed Word document.

     
  • Jan 19, 2013 1:54 PM   in reply to Bill@VT

    Yes, I agree that it would be much easier to create the PDF from the original document.

     
  • Jan 21, 2013 7:43 AM   in reply to Stephen-R-R

    You can click on any image in the PDF and save it as a separate image file.

    Then you could add the images back to the Word file.

     
  • Jan 22, 2013 9:40 AM   in reply to Stephen-R-R

    I had great hopes for the hidden text layer Acrobat creates when OCR-ing text, but I never figured out how to get sufficient control over that text.  In particular, I couldn't get corrections to work smoothly -- always a challenge when OCR-ing mixed languages, not to mention hand-written Chinese.  I've heard ABBYY FineReader might allow easier access to the hidden OCR text, but when digitizing back issues of a scholarly journal, Early China, I decided to make my own text layer instead.  Testers found Reader's menu for swapping layers awkward, so I put a button for this on each page -- actually two superimposed buttons, each visible only in the appropriate layer.  This example (3.5 MB) sets quite a few archaic Chinese characters and other snippets from the scans in-line with the text in the text layer, which is practical only because InDesign CS4's export to PDF is smart enough to re-use a single copy of an image that occurs repeatedly.

     

    In fact, I used Acrobat's OCR to recover as much of the text as it could.  The copy still required extensive fixing up, and re-setting it in InDesign was not trivial either.  But the result does meet the goal of giving the reader both searchable text and a convenient way to check the original publication.
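
For anyone curious what a hidden text layer amounts to inside the file: as far as I understand it, a searchable-image PDF typically draws the recognized text on top of the page image using the PDF text rendering mode 3 ("invisible"), so the text can be found and selected but is never painted. The sketch below only builds the content-stream operators for such a layer from known-good text (for example, lines pasted from a Word document). It is an illustration, not Acrobat's actual method: the function names are hypothetical, the font resource name /F1 is assumed to be defined in the page's resources, the coordinates are placeholders rather than real line positions, and characters outside the font's encoding (such as some Old English letters) would need an embedded font and suitable encoding that this sketch does not handle.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def invisible_text_ops(lines, font="/F1", size=10.0, x=72.0, y_top=720.0, leading=12.0):
    """Build content-stream operators that place `lines` as invisible text.

    font    -- resource name of a font (assumed to exist in the page resources)
    x,y_top -- placeholder position of the first line, in PDF points
    leading -- distance between successive baselines, in PDF points
    """
    ops = [
        "BT",                      # begin text object
        f"{font} {size:g} Tf",     # select font and size
        "3 Tr",                    # rendering mode 3: neither fill nor stroke
        f"{leading:g} TL",         # set leading used by T*
        f"{x:g} {y_top:g} Td",     # move to the first line's position
    ]
    for line in lines:
        ops.append(f"({escape_pdf_string(line)}) Tj")  # show one line of text
        ops.append("T*")                               # move down to the next line
    ops.append("ET")               # end text object
    return "\n".join(ops)


if __name__ == "__main__":
    print(invisible_text_ops(["First line of the page", "Second line (with parentheses)"]))
```

Getting the resulting stream onto each page, and lining the text up with the scanned lines so that search highlights land in roughly the right place, is the laborious part and would still need a PDF library or tool.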

     
  • Jan 27, 2013 9:58 AM   in reply to Stephen-R-R

    Over the years there have been quite a few complaints about Acrobat's OCR, many relating to correcting OCR errors (a.k.a. "OCR suspects" in Acrobat terminology).  Some difficulties seem to arise from misunderstandings of how Acrobat's OCR is supposed to work or how the capability has evolved.  Software manuals tend to provide "cookbook recipes" demonstrating common -- and invariably successful -- situations rather than explaining how things work, which is what would let the reader handle unusual situations.  CTDave's 2012 "practical demonstrations" of Acrobat X OCR are a breath of fresh air; he concludes, "The OCR output exists to support search/find -- no more, no less."  Thank you, CTDave (a year late), and "be well".  Next time I have to deal with this issue, that is where I'll start.

     

    David

     