1 Reply Latest reply on May 15, 2013 9:00 AM by JADarnell

    (Acrobat XI) Using Javascript to scan PDF text

    JADarnell Level 1

      I am scanning text in a PDF and reading it into memory using doc.getPageNthWord().  Read from the file with the Mark 1 Mod 0 eyeball, the text looks like this:

       

      References

      Greenpeace. 2012. Safeway charts new course for

        sustainable tuna. www.greenpeace.org/usa/en/media-

        center/news-releases/Travis-Nichols.

      Monterey Bay Aquarium. 2012a. Wild seafood issue:

         Overfishing, www.montereybayaquarium.org/cr/cr_

         seafoodwatch/issues/wildseafood_overfishing.aspx.

       

      (Please note that I am unable to determine how the PDF produces the indent.  I am using three spaces here, but a hex editor... well, read on, please.)

       

      Using the above function (i.e., getPageNthWord()), the information comes in like this:

       

      Wild

      seafood

      issue:

      • Ł Ł • • Ł Ł • • Ł Ł • • Ł Ł •

      www.
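
      For context, the word-by-word read described above can be sketched roughly as follows. This is only an illustration: readPageWords and pageNum are hypothetical names, and the doc parameter stands in for the Acrobat Doc object. It assumes getPageNumWords() gives the word count for the page and that the third argument of getPageNthWord() is passed as false so punctuation stays attached, as in "issue:" above.

```javascript
// Minimal sketch: collect every word on one page of a Doc-like object.
// In Acrobat, `doc` would be the Doc object (for example `this` in a
// document-level script); here it is an ordinary parameter.
function readPageWords(doc, pageNum) {
  var words = [];
  var count = doc.getPageNumWords(pageNum);
  for (var i = 0; i < count; i++) {
    // Third argument false: do not strip punctuation and whitespace.
    words.push(doc.getPageNthWord(pageNum, i, false));
  }
  return words;
}
```

      Called as readPageWords(this, 0) from a document script, this would return the words of the first page in reading order, including any garbled ones.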

       

      Copied to a text editor via the clipboard, the same text looks like this:

       

      "Wild seafood issue:"

      ", OOHJDO              AVKLQJ.

       

      Can anyone suggest a way to get the text into a readable format?

       

      TIA!

      John

        • 1. Re: (Acrobat XI) Using Javascript to scan PDF text
          JADarnell Level 1

          I have some additional information.  When I copy the text to a programming editor, I see the latter transformation.  When I compare the text letter for letter, each letter in the cryptic version has been reduced in value by 29, with a few exceptions: periods and forward slashes do not seem to have been affected.  Additionally, the letters f and i, when placed together, are interpreted as a single character with a code of 195.
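
          The pattern described above suggests a simple reversal, sketched here under stated assumptions: add 29 back to each character code, leave periods and forward slashes alone, and expand code 195 to "fi". The name decodeShifted and the decision to leave spaces untouched are assumptions made for illustration; nothing in this thread confirms them, and other exceptions may exist.

```javascript
// Sketch of reversing the observed shift. Not a confirmed fix.
var SHIFT = 29;              // offset reported in the post
var FI_LIGATURE_CODE = 195;  // code reported for the "fi" pair
// "." and "/" are reported as unaffected; " " is an added assumption.
var UNSHIFTED = "./ ";

function decodeShifted(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var code = text.charCodeAt(i);
    if (code === FI_LIGATURE_CODE) {
      out += "fi";                              // expand the ligature
    } else if (UNSHIFTED.indexOf(ch) !== -1) {
      out += ch;                                // pass through unchanged
    } else {
      out += String.fromCharCode(code + SHIFT); // undo the shift
    }
  }
  return out;
}
```

          Applied to fragments of the pasted text, decodeShifted("OOHJDO") gives "llegal" and decodeShifted("VKLQJ.") gives "shing.", which is consistent with the shift of 29.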

           

          I have looked at possible XOR/OR bit fiddling conversions, but have not been able to figure out what is causing the transformation (or causing the transformation to fail).