2 Replies Latest reply on Feb 5, 2014 3:58 AM by JADarnell

    InDesign CC Javascript

    JADarnell Level 1

      Our company is importing a lot of PDFs into InDesign documents for further processing down the line.  I need to be able to scan the text that was imported with the PDF to identify locations for documentation and other possible requirements down the road.


      I checked the text frame count of an InDesign document that contained a PDF with lots of text (i.e. the PDF was PLACEd in the InDesign document), and Javascript said that the InDesign document had no text frames.


      Is there a way to access the text in an InDesign document that was imported as part of a PDF document?  If all you have is an example, that would be sterling.






        • 1. Re: InDesign CC Javascript
          [Jongware] Most Valuable Participant

          JA, some bad news and then some fairly good news (but, Spoiler Alert, ultimately not that good), and then some more bad news again.


          A PDF in InDesign is just another image, and, as with other images, you cannot 'read' text out of it. A PDF is not imported as a set of graphics, text frames, and bitmap images, but as a whole -- exactly as an Illustrator .AI file is imported only in its entirety, so you cannot edit text "in" it. That does not so much limit what you can do inside InDesign itself; it is in fact a show-stopper right here and now.
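
          To make that concrete, here is a minimal sketch of what a script can see of a placed PDF: the graphic object and the file it links to, but no text. It assumes the usual InDesign scripting DOM (a document's allGraphics collection, placed PDFs reporting a constructor name of "PDF", and an itemLink with a filePath); the function name listPlacedPdfPaths is made up for illustration.

```javascript
// Sketch: collect the linked file paths of placed PDFs in a document.
// Assumes the InDesign DOM shape: doc.allGraphics is a list of graphics,
// placed PDFs have constructor.name "PDF", and itemLink.filePath is the link.
// Written ES3-style (var, plain loops) so it also suits ExtendScript.
function listPlacedPdfPaths(doc) {
    var paths = [];
    var graphics = doc.allGraphics;
    for (var i = 0; i < graphics.length; i++) {
        var g = graphics[i];
        // Skip bitmaps, EPS files, and anything without a link.
        if (g.constructor.name === "PDF" && g.itemLink) {
            paths.push(String(g.itemLink.filePath));
        }
    }
    return paths;
}
```

          Inside InDesign you would presumably call it as listPlacedPdfPaths(app.activeDocument). The result tells you which PDF files are involved, which is about all the document itself can tell you.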


          The not-that-good 'good' news is that JavaScript is a full programming language. Ultimately, PDFs are only files; JavaScript can read files, and so JavaScript can read PDFs. That's good, right? Alas: not ... entirely. All text inside a PDF may be (and, mind, normally is) heavily compressed, and its data encoding may have been changed to allow for font subsetting. The specification of how to correctly read a PDF is freely available, thanks to Adobe, but reading one needs a complicated program. All of this comes down to: it's "doable" in JavaScript, but only for a very generous interpretation of "doable". (Also, it would not be fast by any interpretation. Even the word "slow", as in "Okay, I could live with 'slow'", does not sufficiently describe how slow it would be.)
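
          A rough way to see the problem for yourself, assuming you have already read the PDF's raw bytes into a string (in ExtendScript, presumably via a File object opened with binary encoding): search it for a phrase you know is on the page, and count how often the /FlateDecode filter name appears. The helper name inspectRawPdf is hypothetical, and this is only a diagnostic sketch, not a PDF parser.

```javascript
// Sketch: check whether a phrase appears literally in raw PDF data, and
// count occurrences of the /FlateDecode filter name (zlib-compressed streams).
// "raw" is assumed to be the file's bytes read into a string.
function inspectRawPdf(raw, phrase) {
    var count = 0;
    var pos = raw.indexOf("/FlateDecode");
    while (pos !== -1) {
        count++;
        pos = raw.indexOf("/FlateDecode", pos + 1);
    }
    return {
        phraseFound: raw.indexOf(phrase) !== -1,
        flateDecodeCount: count
    };
}
```

          On a typical PDF I would expect phraseFound to be false and flateDecodeCount to be well above zero. Even in an uncompressed stream the phrase may be split across several text-showing operators or remapped by a subset font, so a plain search can still miss it.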


          Is that it, then? No -- there are external libraries and programs to extract text from a PDF. It might be possible to build a toolchain, independent of InDesign, that examines a series of PDFs and writes out data of interest into a simple ASCII file. At that point, a relatively simple JavaScript could read that file in turn and act upon its contents inside InDesign.
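
          The InDesign half of such a toolchain could be as small as the sketch below. The file layout is purely my assumption: one line per hit, with the PDF name, the page number, and the extracted text separated by tabs. The function name parseExtractedText is likewise invented; the file contents would presumably come from an ExtendScript File object's read().

```javascript
// Sketch: parse lines of the assumed form  pdfName<TAB>page<TAB>text
// into records. The format is hypothetical; adapt it to whatever the
// external extraction tool actually writes. ES3-style for ExtendScript.
function parseExtractedText(contents) {
    var records = [];
    var lines = contents.split("\n");
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].replace(/\r$/, ""); // tolerate Windows line ends
        if (line === "") { continue; }
        var first = line.indexOf("\t");
        var second = first < 0 ? -1 : line.indexOf("\t", first + 1);
        if (second < 0) { continue; } // skip lines without two tabs
        records.push({
            pdfName: line.substring(0, first),
            page: parseInt(line.substring(first + 1, second), 10),
            text: line.substring(second + 1) // may itself contain tabs
        });
    }
    return records;
}
```

          Each record can then be matched against the placed PDFs in the document by file name, and the script can act on the page number and text from there.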


          Back to Bad: due to all the possible compression and subsetting optimizations applied to the plain text inside a PDF, there is no guarantee that any tool can actually read it -- up to and including Adobe Acrobat itself! If you work with PDFs on a regular basis, I'm sure you've encountered the odd one from which you simply cannot copy text, and all you get is random garbage.


          By far the easiest way to tackle this particular problem would be to entirely avoid post-processing PDFs. After all, they came from somewhere else -- it may well be easier to use the original data files for your purpose.

          • 2. Re: InDesign CC Javascript
            JADarnell Level 1

            Thanks for the bad news (grin).  At least it will save me a lot of time and perhaps a few ulcers as well.


            Now if I could only write some Javascript that knows how to shovel snow...