2 Replies Latest reply on Apr 6, 2013 7:36 AM by Maruthi80

    How can I find words that span from the end of one line to the next line in a PDF?

    Maruthi80

      I am using Adobe Acrobat X Pro for our form development and maintenance. I am writing an Acrobat JavaScript batch script that reads through all the words, runs a spell check, and reports the misspelled words in an Excel sheet. Because I am running this script in batch mode on more than 1000 PDFs, I am getting many words joined together. When I looked into those PDFs, all such words look fine on the page: the first word sits at the right margin at the end of a line, and the next word starts the next line. Since there is no space between them, they are extracted as a single word, hence the failure.

       

      I used wordf = this.getPageNthWordQuads(i,j) to get the begin and end coordinates of the word. When I look closely, the values form a rectangle, and that rectangle does not span across lines. I compared the coordinates of a regular word with those of a word that spans two lines, and both sets of coordinates look the same.

       

      I think I am screwed - I have 8000 such words and no clue how to separate them from the actual misspelled words.

       

      Please help. Let me know if there is any class or method I can call that will give me the end of line, or whether I need to go down to the next layer to find this split.

       

      addAnnot is somehow marking the words using these coordinates. Please help me understand how this works. Thanks.

       

       

      // dataLine, ck, cWord and csword are defined elsewhere in the full
      // batch script; that part is not shown in this post.
      // For all pages
      for (var i = 0; i < this.numPages; i++)
      {
          var pg = i + 1; // 1-based page number for the report
          var numWords = this.getPageNumWords(i);
          // For all the words on the page
          for (var j = 0; j < numWords; j++)
          {
              var word = this.getPageNthWord(i, j);
              // checkWord returns null when the word is spelled correctly
              var ckWord = spell.checkWord(word);
              var jn = 0;
              var ml = 2;      // 2 = spelled correctly
              var coords = ""; // coordinates are reported only for misspelled words
              if (ckWord != null)
              {
                  // Misspelled word found.
                  ml = 0;
                  var wordf = this.getPageNthWordQuads(i, j);
                  // Flatten the array of quads into a list of number strings
                  var st = wordf.toString().split(",");
                  if (cWord == csword)
                  {
                      jn = 1;
                  }
                  if (st[1] != st[3])
                  {
                      ml = 1;
                  }
                  // First eight coordinates, each offset by -8 as in the original report
                  for (var k = 0; k < 8; k++)
                  {
                      coords += "\t st[" + k + "] " + (parseInt(st[k], 10) - 8);
                  }
              }
              dataLine += "\r\n" + this.documentFileName
                  + "\t" + word
                  + "\t" + pg
                  + "\t" + j
                  + "\t" + ml
                  + "\t" + jn
                  + coords;
              ck = 1;
          }
      }

        • 1. Re: How can I find words that span from the end of one line to the next line in a PDF?
          try67 MVP & Adobe Community Professional

          If Acrobat is reading each word part as a separate word, you have a problem. The way I approached it in some of my tools was to check if a word ends with a hyphen, and if so, to check if it's the last on the line. If both conditions are true, combine it with the next word, on the next line. This is not foolproof, of course, as there are documents with columns or other structural elements that prevent this from working. Better than nothing, though...
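
The hyphen heuristic described above could be sketched roughly as follows. This is only an illustration under several assumptions: `collectWords`, `isLastOnLine` and `joinHyphenatedWords` are hypothetical helper names, not Acrobat API; "last on the line" is approximated by comparing the second value of each quad (a y coordinate) against a tolerance chosen by the caller; and `getPageNthWord` is called with its third argument (`bStrip`) set to false so that a trailing hyphen is not stripped before the check.

```javascript
// Hypothetical helper: reads every word on a page with its quads.
// `doc` is the Doc object (`this` in a batch sequence).
function collectWords(doc, page) {
    var words = [];
    var numWords = doc.getPageNumWords(page);
    for (var j = 0; j < numWords; j++) {
        words.push({
            // bStrip = false keeps punctuation, including a trailing hyphen
            text: doc.getPageNthWord(page, j, false),
            quads: doc.getPageNthWordQuads(page, j)
        });
    }
    return words;
}

// Hypothetical helper: treats a word as last on its line when the next
// word's first quad sits at a different height than this word's last quad.
function isLastOnLine(current, next, tolerance) {
    if (!current.quads || !current.quads.length || !next.quads || !next.quads.length) {
        return false;
    }
    var currentQuad = current.quads[current.quads.length - 1];
    var nextQuad = next.quads[0];
    return Math.abs(currentQuad[1] - nextQuad[1]) > tolerance;
}

// Hypothetical helper: joins "exam-" + "ple" when the hyphenated part
// ends a line; leaves every other word unchanged.
function joinHyphenatedWords(words, tolerance) {
    var result = [];
    for (var k = 0; k < words.length; k++) {
        var text = words[k].text.replace(/\s+$/, "");
        var next = words[k + 1];
        if (next && /-$/.test(text) && isLastOnLine(words[k], next, tolerance)) {
            result.push(text.slice(0, -1) + next.text.replace(/\s+$/, ""));
            k++; // the next word has been consumed
        } else {
            result.push(text);
        }
    }
    return result;
}
```

In a batch script this might be called as `joinHyphenatedWords(collectWords(this, i), 2)` for page `i`. As noted above, multi-column layouts can still defeat the line test, and punctuation kept by `bStrip = false` would have to be removed before spell checking.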

          However, it is also possible that Acrobat does see both parts as parts of the same word. In that case, getPageNthWordQuads() will return multiple quad arrays. As you know, that method returns an array of quad arrays. There's usually only one, but in principle there could be more... Something to check before giving up.
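
A minimal sketch of that check, assuming the return value has the shape described above (an array holding one 8-number quad per piece of the word). `spansMultipleLines` and `findSplitWords` are hypothetical helper names, not part of the Acrobat API.

```javascript
// Hypothetical helper: true when more than one quad was returned for a word.
function spansMultipleLines(quads) {
    return !!quads && quads.length > 1;
}

// Hypothetical helper: returns the indexes of all words on a page that
// have more than one quad. `doc` is the Doc object (`this` in a batch sequence).
function findSplitWords(doc, page) {
    var found = [];
    var numWords = doc.getPageNumWords(page);
    for (var j = 0; j < numWords; j++) {
        if (spansMultipleLines(doc.getPageNthWordQuads(page, j))) {
            found.push(j);
        }
    }
    return found;
}
```

Calling `findSplitWords(this, i)` inside the page loop would list the candidate word indexes for page `i`, which could then be reported separately from the genuinely misspelled words.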

          • 2. Re: How can I find words that span from the end of one line to the next line in a PDF?
            Maruthi80 Level 1

            You are awesome - I now understand why the quads are kept in an array. Acrobat can identify the text even if it spans more than one line: it treats each line as a rectangle, with eight values (four x,y pairs) given per item. So the annot can take them item by item and mark them as a single comment.
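
The marking step described here might look like the sketch below. It assumes the whole quads array is handed to `addAnnot` unchanged, so a word with several quads still ends up in one annotation; `highlightWord` is a hypothetical helper name, and the `"Highlight"` type is just one of the text markup types that could be used.

```javascript
// Hypothetical helper: highlights one word and attaches a note to it.
// `doc` is the Doc object (`this` in a batch sequence).
function highlightWord(doc, page, wordIndex, note) {
    return doc.addAnnot({
        page: page,
        type: "Highlight",
        // All quads for the word are passed together, one per rectangle
        quads: doc.getPageNthWordQuads(page, wordIndex),
        contents: note
    });
}
```

For example, `highlightWord(this, i, j, "Possible misspelling")` could be called from the spell check loop whenever `spell.checkWord` returns suggestions.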

             

            Thank you very much - you saved me from forward parsing and reverse parsing the combined words to identify the meaningful words.

            I will try the code and update my results here.