1 Reply Latest reply on May 15, 2013 9:00 AM by JADarnell

    (Acrobat XI) Using Javascript to scan PDF text

    JADarnell Level 1

      I am scanning text in a PDF and reading it into memory using doc.getPageNthWord().  Read with the Mark 1 Mod 0 eyeball (that is, by eye), the text in the file looks like this:



      Greenpeace. 2012. Safeway charts new course for

        sustainable tuna. www.greenpeace.org/usa/en/media-


      Monterey Bay Aquarium. 2012a. Wild seafood issue:

         Overfishing, www.montereybayaquarium.org/cr/cr_



      (Please note that I am unable to determine how the PDF is causing the indent.  I am using three spaces, but a hex editor... well, read on, please.)
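
For reference, here is a minimal sketch of the kind of loop that reads a page's words this way. It assumes the standard Acrobat JavaScript Doc methods getPageNumWords() and getPageNthWord(); the helper name collectPageWords is hypothetical and is not taken from the original script.

```javascript
// Sketch only: collect every word on one page of a document.
// "doc" is expected to be an Acrobat Doc object (for example "this" in the console).
// collectPageWords is a hypothetical helper name, not part of the Acrobat API.
function collectPageWords(doc, pageNum) {
  var words = [];
  var count = doc.getPageNumWords(pageNum); // page numbers are 0-based
  for (var i = 0; i < count; i++) {
    // Passing false as the third argument (bStrip) keeps punctuation and
    // whitespace attached to each word instead of stripping them.
    words.push(doc.getPageNthWord(pageNum, i, false));
  }
  return words;
}
```

In the Acrobat console this could be called as collectPageWords(this, 0) to read the first page.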


      Using the above function (i.e., getPageNthWord()), the information comes in like this:





      • Ł Ł • • Ł Ł • • Ł Ł • • Ł Ł •



      Copied to a text editor via the clipboard, it looks like this:


      "Wild seafood issue:"

      ", OOHJDO              AVKLQJ.


      Can anyone suggest a possible solution to getting the text in a readable format?




        • 1. Re: (Acrobat XI) Using Javascript to scan PDF text
          JADarnell Level 1

          I have some additional information.  When I copy the text to a programming editor I see the latter transformation.  When I compare the text letter for letter, each letter in the cryptic version has had its character code reduced by 29, with a few exceptions: periods and forward slashes do not seem to have been affected.  Additionally, the letters f and i, when placed together, are interpreted as a single character with a code of 195.
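
The offset described above can be turned into a stopgap decoder. This is only a sketch of the observation, not an explanation of the cause. It assumes that every character other than periods, forward slashes, and whitespace was lowered by exactly 29, and that code 195 always stands for the "fi" pair. The name decodeShiftedText is hypothetical, leaving whitespace untouched is an assumption, and any other exceptions would need their own entries.

```javascript
// Sketch only: undo the observed "minus 29" shift.
// All constants below come from the observations in this thread, not from any API.
var OFFSET = 29;               // observed difference between real and garbled codes
var FI_LIGATURE_CODE = 195;    // observed code for the combined "fi"
var UNSHIFTED = { ".": true, "/": true, " ": true, "\t": true, "\n": true, "\r": true };

// decodeShiftedText is a hypothetical helper name.
function decodeShiftedText(garbled) {
  var out = "";
  for (var i = 0; i < garbled.length; i++) {
    var ch = garbled.charAt(i);
    var code = garbled.charCodeAt(i);
    if (code === FI_LIGATURE_CODE) {
      out += "fi";                              // expand the single ligature character
    } else if (UNSHIFTED[ch] === true) {
      out += ch;                                // characters that appear unaffected
    } else {
      out += String.fromCharCode(code + OFFSET); // add the offset back
    }
  }
  return out;
}
```

With the stray spaces removed from the clipboard sample, decodeShiftedText(",OOHJDO") gives "Illegal" and decodeShiftedText("\u00C3VKLQJ.") gives "fishing.", which is consistent with the offset of 29.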


          I have looked at possible XOR/OR bit fiddling conversions, but have not been able to figure out what is causing the transformation (or causing the transformation to fail).