11 Replies Latest reply on Jul 4, 2016 10:43 PM by Jo_2013

    Find words (wild cards) using regular expression

    Jo_2013

      I am testing to see if words are present for revision 1 of a drawing title block.

       

      The script searches for the number 1 followed by a date, a title and 4 sets of initials.

       

      The number 1 is static; the date, title and initials are all wild cards, as they are different for each drawing.

       

      I am using regular expressions to match the words.

       

      The regular expression highlighted in blue will find the number 1 and date.

       

      The rest highlighted in orange is not matching the title and initials.

       

      If anyone can help with the regular expression that will be most appreciated.

       

      Once I have this working I will add form fields for the initials; at this stage I am just reporting to the console for testing.

       

      numWords = this.getPageNumWords(0);

      // number of words on page

      // loop through the words on page

      for (var j = 0; j < numWords-1; j++)

      { // get word pair to test

      ckWords = this.getPageNthWord(0, j) + ' ' + this.getPageNthWord(0, j + 1); // test words

      // check to see if word string 1 26.05.16 THE REINFORCEMENT REVISED MM SB AE GM string is present

       

      if (ckWords.match(/^1\s[0-9]{1,2}.[0-9]{1,2}.[0-9]{2}\s\w+(\s+\w+){1,7}/))

      {

      console.println(ckWords);

      }

      }

        • 1. Re: Find words (wild cards) using regular expression
          Test Screen Name Most Valuable Participant

          Looks as if what you are trying to match isn't necessarily exactly two words. If so you need much more challenging logic, especially if the title is multiple words; the reg.ex is not the problem.
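
          To see why a two-word test string cannot match, the pattern from the question can be run against both strings. This is only a minimal sketch in plain JavaScript: the sample strings are taken from the comment in the question, and the variable names are made up for illustration.

```javascript
// Pattern from the question, unchanged.
var pattern = /^1\s[0-9]{1,2}.[0-9]{1,2}.[0-9]{2}\s\w+(\s+\w+){1,7}/;

// What the loop in the question builds: two adjacent words.
var pair = "1 26.05.16";

// What the pattern needs: the whole revision row.
var row = "1 26.05.16 THE REINFORCEMENT REVISED MM SB AE GM";

var pairMatches = pattern.test(pair); // false: nothing follows the date
var rowMatches = pattern.test(row);   // true: title and initials are present

// Side note: an unescaped "." matches any character, not only a full stop,
// so this string also passes the date part of the pattern.
var looseMatches = pattern.test("1 26x05x16 THE REINFORCEMENT"); // true
```

          So the test string has to hold the whole row before the regular expression has a chance to match.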

          • 2. Re: Find words (wild cards) using regular expression
            gkaiseril MVP & Adobe Community Professional

            You need to see how Acrobat JavaScript handles the words you are looking for. Words are returned not in reading order but in the order in which the words or character strings were placed or plotted onto the page. That order is determined by the authoring program and by any edits made after the PDF was created; manual edits made after the conversion usually end up as the last words returned. Unless you know the number of words in the title, you might not be able to find the text reliably. Regular expressions have symbols for white space, so a RegExp can span several words, but how Acrobat JavaScript splits the words may be the biggest problem.
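
            One way to check the word order is to list every word on the page with its index. The helper below is a hypothetical sketch, not part of the Acrobat API: it only assumes an object with the getPageNumWords and getPageNthWord methods used in the question, and it returns the lines rather than printing them.

```javascript
// Hypothetical helper: lists every word on a page with its index, in the
// order the document object returns them. "doc" is the Doc object
// ("this" in the script from the question).
function listPageWords(doc, pageNum) {
  var lines = [];
  var numWords = doc.getPageNumWords(pageNum);
  for (var i = 0; i < numWords; i++) {
    // Brackets make stray white space inside a word visible.
    lines.push(i + ": [" + doc.getPageNthWord(pageNum, i) + "]");
  }
  return lines;
}

// In the Acrobat JavaScript console this could be used as:
// console.println(listPageWords(this, 0).join("\n"));
```

            If the title block words do not come out on consecutive indexes, no pattern applied to adjacent words will find them.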

            • 3. Re: Find words (wild cards) using regular expression
              Jo_2013 Level 1

              Thank you for your response. I have copied and pasted the text in an advanced search on the pdf and it looks like the words are split with spaces. Can you please modify the code with the RegExp so it will match multiple whole words? Your help will be most appreciated.

              • 4. Re: Find words (wild cards) using regular expression
                Test Screen Name Most Valuable Participant

                That's an interesting programming challenge for YOU to write, since you need to match an arbitrarily long sequence of words with backtracking. Personally I'd use a state table, but you can use other programming techniques.

                • 5. Re: Find words (wild cards) using regular expression
                  Jo_2013 Level 1

                  I need ckWords to hold between 5 and 12 words for the test.

                  The script below tests two words. Can you help me test a run of between 5 and 12 words?

                  ckWords = this.getPageNthWord(0, j) + ' ' + this.getPageNthWord(0, j + 1);

                  Your help will be greatly appreciated, thank you.

                  • 6. Re: Find words (wild cards) using regular expression
                    Test Screen Name Most Valuable Participant

                    The simplest approach is to write an inner loop which loops from 5 to 12 and for each of these has still another loop which loops to fetch and join words. That sounds as if it could work but it would be very slow indeed. Acceptable performance would require a much more complex solution with backtracking and perhaps a state machine. There has been a lot of published research on efficient search algorithms, showing it isn't just a question of following a simple template.
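
                    The nested-loop idea can be sketched roughly as below. Everything here is an assumption made for illustration: the function name findRow, the ROW pattern (a "1", a date, one or more title words, four two-letter initials) and the sample words. The words are taken from a plain array so the sketch runs outside Acrobat; on a real page they would first be collected with getPageNthWord.

```javascript
// Hypothetical helper: for every start index, try windows of minWords..maxWords
// words, join them with single spaces and test the pattern.
// "pattern" must not use the g flag, because test() would then keep state.
function findRow(words, pattern, minWords, maxWords) {
  for (var start = 0; start + minWords <= words.length; start++) {
    // Never read past the end of the word list.
    var longest = Math.min(maxWords, words.length - start);
    for (var len = minWords; len <= longest; len++) {
      var candidate = words.slice(start, start + len).join(" ");
      if (pattern.test(candidate)) {
        return { start: start, length: len, text: candidate };
      }
    }
  }
  return null; // no window matched
}

// Assumed pattern: "1", a date, at least one title word, four initials.
var ROW = /^1 \d{1,2}\.\d{1,2}\.\d{2} (?:\w+ )+[A-Z]{2} [A-Z]{2} [A-Z]{2} [A-Z]{2}$/;

var sample = ["REV", "1", "26.05.16", "APPROVED", "DRAWING",
              "MM", "SB", "AE", "GM", "SCALE"];
var hit = findRow(sample, ROW, 5, 12);
// hit.text is "1 26.05.16 APPROVED DRAWING MM SB AE GM" (start 1, length 8)
```

                    This is the slow but simple variant: it tests up to eight windows per start index, which may be acceptable for a single title block page.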

                    • 7. Re: Find words (wild cards) using regular expression
                      gkaiseril MVP & Adobe Community Professional

                      Look at the getPageNthWord method. It has an optional parameter "bStrip", which defaults to true if omitted. This parameter controls whether white space and punctuation are removed from the returned word or left in. Since you are concatenating multiple words, you might want to keep these additional characters and not assume the separator is a space.

                       

                      Again I would do some testing and see what this method considers a word and the order of how the words are pulled from the page. You may need to do some editing of the words on the page to see if the words are returned in reading order or not. Many programs will convert the text in reading order, but if one inserts new words between lines or words, those added words will not be returned in reading order.
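
                      If the words are fetched with bStrip set to false, each one may carry its own trailing white space, and the joined string can be normalised before it is tested. The sketch below is an assumption about what such raw words could look like; the sample values and the helper name joinRawWords are hypothetical, and real pages should be checked in the console first.

```javascript
// Hypothetical helper: concatenate raw words exactly as returned (no added
// separator), then collapse every run of white space to a single space and
// drop a leading or trailing space.
function joinRawWords(rawWords) {
  return rawWords.join("")
    .replace(/\s+/g, " ")
    .replace(/^ | $/g, "");
}

// Assumed raw words, each keeping whatever white space followed it on the page.
var raw = ["1 ", "26.05.16  ", "APPROVED\n", "DRAWING "];
var joined = joinRawWords(raw); // "1 26.05.16 APPROVED DRAWING"
```

                      Words that had no white space between them on the page stay joined, which is the point of not assuming a space.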

                      • 8. Re: Find words (wild cards) using regular expression
                        gkaiseril MVP & Adobe Community Professional

                        To find the words you are looking for, you need to have a string made up from 9 words on the page not two.

                         

                        Try something like this:

                         

                        numWords = this.getPageNumWords(0);

                        // number of words on page

                        // loop through the words on page

                        for (var j = 0; j < numWords - 8; j++)

                        { // get nine consecutive words to test, keeping their own white space and punctuation

                        ckWords = this.getPageNthWord(0, j, false) + this.getPageNthWord(0, j + 1, false) + this.getPageNthWord(0, j + 2, false) +

                        this.getPageNthWord(0, j + 3, false) + this.getPageNthWord(0, j + 4, false) + this.getPageNthWord(0, j + 5, false) +

                        this.getPageNthWord(0, j + 6, false) + this.getPageNthWord(0, j + 7, false) + this.getPageNthWord(0, j + 8, false); // test words

                        // check to see if word string 1 26.05.16 THE REINFORCEMENT REVISED MM SB AE GM string is present

                         

                        if (ckWords.match(/^1\s[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{2}\s\w+(\s+\w+){1,7}/))

                        {

                        console.println(ckWords);

                        }

                        }

                        • 9. Re: Find words (wild cards) using regular expression
                          Jo_2013 Level 1

                          Thank you for your help.

                          I have used the following script, which reports to the console when the row has a total of 9 words:

                          1 26.05.16 THE REINFORCEMENT REVISED MM SB AE GM

                          When the row has a total of 8 words, the script does not report to the console:

                          1 26.05.16 APPROVED DRAWING MM SB AE GM

                          I have added a word count to the script but am unsure how to get it to work in combination with getPageNthWord.

                          Your advice will be most appreciated, thank you.

                          var ckWords; // word pair to test

                          var count; // count number of words

                          numWords = this.getPageNumWords(0); // number of words on page

                          // loop through the words on page

                          for (var j = 0; j < numWords-1; j++) {

                          // get word pair to test

                          ckWords = this.getPageNthWord(0, j ) + ' ' + this.getPageNthWord(0, j + 1) + ' ' + this.getPageNthWord(0, j + 2) + ' ' + this.getPageNthWord(0, j + 3) + ' ' + this.getPageNthWord(0, j + 4) + ' ' + this.getPageNthWord(0, j + 5)  + ' ' + this.getPageNthWord(0, j + 6) + ' ' + this.getPageNthWord(0, j + 7)  + ' ' + this.getPageNthWord(0, j + 8);

                          if (ckWords.match(/^1\s\d{1,2}\.\d{1,2}\.\d{2}\s\w+(?:\s+\w+){1,8}\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})$/))

                          {

                          var count = ckWords.split(/\s+/).length;

                          console.println(ckWords + " " + count);

                          }

                          }

                          • 10. Re: Find words (wild cards) using regular expression
                            Bernd Alheit Adobe Community Professional & MVP

                            You can use something like this:

                             

                            ckWords = this.getPageNthWord(0, j ) + ' ' + this.getPageNthWord(0, j + 1) + ' ' + this.getPageNthWord(0, j + 2) + ' ' + this.getPageNthWord(0, j + 3) + ' ' + this.getPageNthWord(0, j + 4) + ' ' + this.getPageNthWord(0, j + 5)  + ' ' + this.getPageNthWord(0, j + 6);

                             

                            if (!ckWords.match(/^1\s\d{1,2}\.\d{1,2}\.\d{2}\s\w+(?:\s+\w+){1,8}\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})$/)) ckWords = ckWords + ' ' + this.getPageNthWord(0, j + 7);

                             

                            if (!ckWords.match(/^1\s\d{1,2}\.\d{1,2}\.\d{2}\s\w+(?:\s+\w+){1,8}\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})$/)) ckWords = ckWords + ' ' + this.getPageNthWord(0, j + 8);

                             

                            if (ckWords.match(/^1\s\d{1,2}\.\d{1,2}\.\d{2}\s\w+(?:\s+\w+){1,8}\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})$/))

                            {

                            ...

                            • 11. Re: Find words (wild cards) using regular expression
                              Jo_2013 Level 1

                              Thank you very much for your assistance; your help is very much appreciated.

                              I was able to match either 9 words together or 8 words together successfully as follows:

                              var ckWords8; // 8 words to test 0 based count

                              var ckWords9; // 9 words to test 0 based count

                              var count; // count number of words

                              numWords = this.getPageNumWords(0); // number of words on page

                              // loop through the words on page

                              for (var j = 0; j < numWords-1; j++)

                              { // get 8 words to test     

                              ckWords8 = this.getPageNthWord(0, j) + ' ' + this.getPageNthWord(0, j + 1) + ' ' + this.getPageNthWord(0, j + 2) + ' ' + this.getPageNthWord(0, j + 3) + ' ' + this.getPageNthWord(0, j + 4) + ' ' + this.getPageNthWord(0, j + 5)  + ' ' + this.getPageNthWord(0, j + 6) + ' ' + this.getPageNthWord(0, j + 7); // test words   

                               

                              ckWords9 = this.getPageNthWord(0, j) + ' ' + this.getPageNthWord(0, j + 1) + ' ' + this.getPageNthWord(0, j + 2) + ' ' + this.getPageNthWord(0, j + 3) + ' ' + this.getPageNthWord(0, j + 4) + ' ' + this.getPageNthWord(0, j + 5)  + ' ' + this.getPageNthWord(0, j + 6) + ' ' + this.getPageNthWord(0, j + 7) + ' ' + this.getPageNthWord(0, j + 8); // test words   

                               

                              if (ckWords8.match(/^1\s\d{1,2}\.\d{1,2}\.\d{2}\s\w+(?:\s+\w+){1,8}\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})$/))

                              {

                              var count = ckWords8.split(/\s+/).length;

                              console.println(ckWords8 + " " + count);

                              break;

                              }

                              else if (ckWords9.match(/^1\s\d{1,2}\.\d{1,2}\.\d{2}\s\w+(?:\s+\w+){1,8}\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})$/))

                              {

                              var count = ckWords9.split(/\s+/).length;

                              console.println(ckWords9 + " " + count);

                              break;

                              }

                              }
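
                              Since the next step mentioned at the start of the thread is to put the initials into form fields, the capture groups in the pattern can be used to pull the pieces out of a matched row. This is a hedged sketch: the groups around the date and the title are additions made here for illustration, and the name parseRow is hypothetical. The four initials groups are the ones already in the pattern above.

```javascript
// Same shape as the pattern in the thread, with two extra capture groups
// (an assumption added here): group 1 = date, group 2 = title.
// Groups 3 to 6 are the four sets of initials.
var ROW = /^1\s(\d{1,2}\.\d{1,2}\.\d{2})\s(\w+(?:\s+\w+){1,8})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})\s([A-Z]{2})$/;

// Hypothetical helper: returns the parts of a matched row, or null.
function parseRow(text) {
  var m = ROW.exec(text);
  if (!m) {
    return null;
  }
  return {
    date: m[1],
    title: m[2],
    initials: [m[3], m[4], m[5], m[6]]
  };
}

var parsed = parseRow("1 26.05.16 APPROVED DRAWING MM SB AE GM");
// parsed.date     is "26.05.16"
// parsed.title    is "APPROVED DRAWING"
// parsed.initials is ["MM", "SB", "AE", "GM"]
```

                              The same call works for the 8-word and 9-word strings, so either ckWords8 or ckWords9 could be passed in once it has matched. The form field names depend on the form, so none are shown here.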