7 Replies Latest reply on Sep 20, 2016 12:10 AM by try67

    Extract certain pages from a document based on key words

    forresth46081687

      Hi everyone,

       

      I am trying to extract pages from a large document based on certain keywords. If a keyword is found on a page, that page number is pushed to an array, and the array is then used to create a new document. However, my script behaves very inconsistently and does not seem able to create multiple new documents. Please note: I found almost all of this script online, written by someone else, and I am trying to adapt it to my purposes.

       

      // Iterates over all pages and find a given string and extracts all
      // pages on which that string is found to a new file.

      var pageArray = [];
      var pageA = [];

      var stringToSearchFor = "keyword1";
      var stringToSearch = "keyword2";

      for (var p = 0; p < this.numPages; p++) {
        // iterate over all words
        for (var n = 0; n < this.getPageNumWords(p); n++) {
          if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
          }
          else if (this.getPageNthWord(p, n) == stringToSearch) {
            pageA.push(p);
            break;
          }
        }
      }

      console.println("Test 2 of pageArray " + pageArray);

      if (pageArray.length > 0) {
        // extract all pages that contain the string into a new document
        var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
        for (var n = 0; n < pageArray.length; n++) {
          d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
          } );
          console.println(n + " pageArray " + pageArray)
        }

        // remove the first page
        d.deletePages(0);
      }

      if (pageA.length > 0) {
        // extract all pages that contain the string into a new document
        var q = app.newDoc();    // this will add a blank page - we need to remove that once we are done
        for (var n = 0; n < pageA.length; n++) {
          q.insertPages( {
            nPage: q.numPages-1,
            cPath: this.path,
            nStart: pageA[n],
            nEnd: pageA[n],
          } );
          console.println(n + " pageA " + pageA)
        }

        console.println(pageA)
        // remove the first page

      }

       

       

      Thanks!

       

      -Forrest

        • 1. Re: Extract certain pages from a document based on key words
          try67 MVP & Adobe Community Professional

          Is the issue that some pages that contain both words only appear in one of the final files?

           

          By the way, you're missing the command to delete the first page of the second file, after generating it.

          • 2. Re: Extract certain pages from a document based on key words
            forresth46081687 Level 1

            Thanks for the quick response. Unfortunately, no. I am using this script as part of a way to sort invoices, so the keyword I am searching for is the vendor's name, and two vendors' names will not appear on the same page.

             

            And thanks for pointing that out. I had done that deliberately as a troubleshooting step. Oddly enough, the script seems to work for certain words but not others, even though I can find both words by searching the document (Cmd+F). Very confusing.
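
One possible reason a name is found by Cmd+F but not by the script is that getPageNthWord returns a single word at a time and the == comparison is case-sensitive, so a vendor name made of several words, or written in different capitalization, would never equal one returned word. The sketch below is only an illustration under that assumption. The function names pageWords and pageContainsPhrase are hypothetical and not part of the original script, and the sketch assumes the name consists of plain words without punctuation.

```javascript
// Hypothetical helpers, not part of the original script.

// Collects the words of one page, lower-cased, in the order Acrobat reports them.
function pageWords(doc, pageIndex) {
    var words = [];
    var count = doc.getPageNumWords(pageIndex);
    for (var n = 0; n < count; n++) {
        words.push(String(doc.getPageNthWord(pageIndex, n)).toLowerCase());
    }
    return words;
}

// Returns true if the phrase (one or more words) occurs on the page as a
// run of consecutive words, ignoring case.
function pageContainsPhrase(doc, pageIndex, phrase) {
    var parts = phrase.toLowerCase().split(/\s+/);
    var target = [];
    for (var k = 0; k < parts.length; k++) {
        if (parts[k].length > 0) {
            target.push(parts[k]);   // skip empty pieces from leading/trailing spaces
        }
    }
    if (target.length === 0) {
        return false;
    }
    var words = pageWords(doc, pageIndex);
    for (var i = 0; i + target.length <= words.length; i++) {
        var match = true;
        for (var j = 0; j < target.length; j++) {
            if (words[i + j] !== target[j]) {
                match = false;
                break;
            }
        }
        if (match) {
            return true;
        }
    }
    return false;
}

// In the Acrobat console, with the invoice file open, something like:
// pageContainsPhrase(this, 0, "keyword1");
```

If the keywords really are single words with matching capitalization, this changes nothing, and the cause lies elsewhere.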

            • 3. Re: Extract certain pages from a document based on key words
              forresth46081687 Level 1

              I should also point out that I added the console.println() calls to check that the arrays have values, and both of them do. So I think the issue may have something to do with the newDoc creation?

              • 4. Re: Extract certain pages from a document based on key words
                try67 MVP & Adobe Community Professional

                You seem to be describing different kinds of issues. One is with the detection of the keywords, another with the extraction of the pages to the new file (if I understood correctly). These are unrelated issues. You should focus on each one of them separately and try to solve it.

                Start by disabling the extraction process. Print to the console the list of pages for each search term. If they are not correct, investigate further. If a page that is supposed to appear in the list doesn't, go back to that page and print out all the words in it, and try to find out what the issue is.

                This is how you debug code: You focus on a specific issue and eliminate causes until you find the cause of the problem, and then look for a solution for it. Then you move on to the next issue.
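
The two debugging steps described above (list the pages found for each search term, then print every word on a page that was missed) could be sketched roughly as follows. The function names findPagesWithWord and dumpPageWords are made up for this illustration. The third argument to getPageNthWord is bStrip, and passing false keeps punctuation and white space so they show up in the dump.

```javascript
// Hypothetical debugging helpers; no extraction is performed here.

// Returns the zero-based page numbers on which `keyword` appears as a whole word.
function findPagesWithWord(doc, keyword) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var count = doc.getPageNumWords(p);
        for (var n = 0; n < count; n++) {
            if (doc.getPageNthWord(p, n) == keyword) {
                pages.push(p);
                break;   // one hit is enough for this page
            }
        }
    }
    return pages;
}

// Returns every word on one page, one per line, wrapped in brackets so that
// stray spaces or punctuation are easy to see.
function dumpPageWords(doc, pageIndex) {
    var lines = [];
    var count = doc.getPageNumWords(pageIndex);
    for (var n = 0; n < count; n++) {
        lines.push(n + ": [" + doc.getPageNthWord(pageIndex, n, false) + "]");
    }
    return lines.join("\n");
}

// In the Acrobat console, with the original file open, something like:
// console.println("keyword1 pages: " + findPagesWithWord(this, "keyword1"));
// console.println("keyword2 pages: " + findPagesWithWord(this, "keyword2"));
// console.println(dumpPageWords(this, 3));   // a page that should have matched
```

Because each term is searched on its own pass, a page that happens to contain both terms shows up in both lists, which also makes that case easy to rule out.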

                • 5. Re: Extract certain pages from a document based on key words
                  try67 MVP & Adobe Community Professional

                  I'm seeing a potential bug in your code that might cause all kinds of strange behaviors and that will be very difficult to spot if you don't know to look for it.

                  You should not use the "this" keyword after you create a new document, as it will probably point to that document instead of to the original one. Instead you should keep a separate reference to the original file, something like this as the first line of your code:

                   

                  var originalDoc = this;

                   

                  Then replace all instances of "this" in your code with "originalDoc".
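
As a rough sketch of the same idea, the extraction step can be wrapped in a function that receives the original document as an argument, so it never relies on "this" after app.newDoc() has been called. The name extractPagesToNewDoc and the acroApp parameter are assumptions made for this illustration: acroApp stands for Acrobat's global app object, passed in only so the function can also be exercised outside Acrobat.

```javascript
// Hypothetical helper, not from the original script.
// acroApp:     Acrobat's global `app` object (passed in explicitly)
// sourceDoc:   the original document, captured before any new document exists
// pageNumbers: zero-based page numbers to copy
function extractPagesToNewDoc(acroApp, sourceDoc, pageNumbers) {
    if (pageNumbers.length === 0) {
        return null;                      // nothing to extract
    }
    var sourcePath = sourceDoc.path;      // read once, before newDoc() is called
    var target = acroApp.newDoc();        // starts with one blank page
    for (var i = 0; i < pageNumbers.length; i++) {
        target.insertPages({
            nPage: target.numPages - 1,   // insert after the current last page
            cPath: sourcePath,
            nStart: pageNumbers[i],
            nEnd: pageNumbers[i]
        });
    }
    target.deletePages(0);                // drop the blank page newDoc() created
    return target;
}

// In the Acrobat console, something like:
// var originalDoc = this;
// var d = extractPagesToNewDoc(app, originalDoc, pageArray);
// var q = extractPagesToNewDoc(app, originalDoc, pageA);
```

Using one function for both keywords also means the deletePages call cannot be left out of one of the two branches.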

                  • 6. Re: Extract certain pages from a document based on key words
                    forresth46081687 Level 1

                    Thanks again for the suggestion, try67! Unfortunately, I still cannot get the script to work. Sometimes it creates a new document for one of the words, but never for both, and it does not create either one consistently.

                     

                    // Iterates over all pages and find a given string and extracts all
                    // pages on which that string is found to a new file.

                    var pageArray = [];
                    var pageA = [];
                    var originalDoc = this;
                    var stringToSearchFor = "keyword1";
                    var stringToSearch = "keyword2";

                    for (var p = 0; p < originalDoc.numPages; p++) {
                      // iterate over all words
                      for (var n = 0; n < originalDoc.getPageNumWords(p); n++) {
                        if (originalDoc.getPageNthWord(p, n) == stringToSearchFor) {
                          pageArray.push(p);
                          break;
                        }
                        else if (originalDoc.getPageNthWord(p, n) == stringToSearch) {
                          pageA.push(p);
                          break;
                        }
                      }
                    }

                    console.println("Test 2 of pageArray " + pageArray);
                    console.println("Test 1 of pageA " + pageA);

                    if (pageArray.length > 0) {
                      // extract all pages that contain the string into a new document
                      var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
                      for (var n = 0; n < pageArray.length; n++) {
                        d.insertPages( {
                          nPage: d.numPages-1,
                          nStart: pageArray[n],
                          cPath: originalDoc.path,
                          nEnd: pageArray[n],
                        } );
                        console.println(n + " pageArray " + pageArray)
                      }

                      // remove the first page
                      d.deletePages(0);
                    }

                    if (pageA.length > 0) {
                      // extract all pages that contain the string into a new document
                      var q = app.newDoc();    // this will add a blank page - we need to remove that once we are done
                      for (var n = 0; n < pageA.length; n++) {
                        q.insertPages( {
                          nPage: q.numPages-1,
                          nStart: pageA[n],
                          cPath: originalDoc.path,
                          nEnd: pageA[n],
                        } );
                      }

                      console.println(pageA)
                    }

                    • 7. Re: Extract certain pages from a document based on key words
                      try67 MVP & Adobe Community Professional

                      To help you further I'll need to see the actual file.

                       
