0 Replies Latest reply on Apr 29, 2014 7:11 AM by JADarnell

    Parsing an Acrobat file

    JADarnell Level 1

      Hello fellow Javascripters:

       

         I have a document that for all intents and purposes looks like an index.  The format follows a layout similar to this:

        

         abc.............1,2,3   //  becomes records "abc 1", "abc 2", "abc 3"

         def......21, 22, 23   //  becomes records "def 21", "def 22", "def 23"

         ghi

            jkl.......101, 102   //   becomes records "ghi jkl 101", "ghi jkl 102"

            mno........39, 61   //   becomes records "ghi mno 39", "ghi mno 61"

            pqr.......190,129   //   becomes records "ghi pqr 190", "ghi pqr 129"

         stu...123,145,167,   //   becomes records "stu 123", "stu 145", "stu 167", "stu 181", "stu 182"

              .........181, 182

             

        

        

         I am required to write a script that will parse this document.  Further, I am required to recognize that "ghi" is a supercategory of

         "jkl", "mno", and "pqr", and also that "stu" is not a member of this supercategory.  On top of that, I am supposed to

         recognize that the line following "stu" is a spillover line that continues the page listing of "stu".

        

         There are certain tricks that I can use to determine some of this information.  For example, a line without a run of dots is likely a supercategory,

         and a line that starts with dots is likely a spillover line.  Finding an LF (char code 10) tells me that I have pulled in a full line.  But I have no clue

         how to determine where the list of subcategories belonging to a supercategory ends (the length of the line is no help).  If I could detect

         indentation, that would help a lot, but as far as I can tell the Doc method that reports words in an Acrobat document, this.getPageNthWord(0, n, false)

         (where 0 is the zero-based page number, n is the index of the word on the page, and false means the word is returned with its punctuation and

         white space rather than stripped), does not report white space at the beginning of a line.
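
         To make the tricks above concrete, here is a rough sketch of the logic I have in mind.  It works on lines that have already been pulled into strings, and it assumes each line comes with an "indent" number.  That number is hypothetical: it stands for exactly the information I cannot get from getPageNthWord.  The function name parseIndexLines and the { text, indent } shape are made up for illustration only.

```javascript
// Sketch only. "lines" is an array of { text, indent } objects.
// "indent" is a hypothetical value (for example the left edge of the line);
// getPageNthWord does not supply it.
function parseIndexLines(lines) {
    var records = [];
    var superCat = null;    // current supercategory, e.g. "ghi"
    var superIndent = 0;    // indent of the supercategory line
    var lastPrefix = null;  // prefix of the last entry, reused by spillover lines

    // Split a page list such as "1,2,3" and emit one record per page.
    function addPages(prefix, pageList) {
        var pages = pageList.split(",");
        for (var i = 0; i < pages.length; i++) {
            var p = pages[i].replace(/^\s+|\s+$/g, "");
            if (p !== "") records.push(prefix + " " + p);
        }
    }

    for (var i = 0; i < lines.length; i++) {
        var text = lines[i].text.replace(/^\s+|\s+$/g, "");
        var indent = lines[i].indent;
        if (text === "") continue;

        // Spillover line: starts with dots and continues the previous page list.
        var spill = /^\.{2,}\s*(.*)$/.exec(text);
        if (spill) {
            if (lastPrefix !== null) addPages(lastPrefix, spill[1]);
            continue;
        }

        // Ordinary entry: a name, a run of dots, then a page list.
        var entry = /^(.*?)\s*\.{2,}\s*(.*)$/.exec(text);
        if (entry) {
            // An entry indented no deeper than the supercategory ends it.
            if (superCat !== null && indent <= superIndent) superCat = null;
            lastPrefix = (superCat !== null ? superCat + " " : "") + entry[1];
            addPages(lastPrefix, entry[2]);
            continue;
        }

        // No run of dots at all: treat the line as a supercategory.
        superCat = text;
        superIndent = indent;
        lastPrefix = null;
    }
    return records;
}

var sample = [
    { text: "ghi", indent: 0 },
    { text: "jkl.......101, 102", indent: 1 },
    { text: "stu...123,145,", indent: 0 },
    { text: ".........181", indent: 1 }
];
var sampleRecords = parseIndexLines(sample);
// sampleRecords is ["ghi jkl 101", "ghi jkl 102", "stu 123", "stu 145", "stu 181"]
```

         Without a real indent value, the check that ends the supercategory has nothing to work with, and "stu" would wrongly come out as "ghi stu".  That is the part I am stuck on.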

        

         Is there a function that will help me determine format or layout in an Acrobat document?

        

         TIA!

        

         John