2 Replies Latest reply on Sep 16, 2017 5:43 AM by jt77474

    Split large pdf on repeated text pattern, and save new pdf with custom filename

    jt77474

      I have Acrobat Pro DC

       

      I have a problem in my current organisation, which uses a very old-fashioned HR system for recruitment. Our HR system compiles one massive report of all the job applications for a recent post: the PDF is 1700+ pages long, containing distinct sections (of variable length) for over 200 applicants.

       

      I want to split this into one pdf per applicant, with the filename of each document being the applicant's name.

       

      For each new application, a consistently formatted divider page exists as follows:

       

      Applicant : Smith, John

       

      Vacancy ID : 15535

       

      The text 'Vacancy ID' only exists on these divider pages, so it can be used to identify where to split the document.

       

      The applicant's name, which occurs on a previous line, starts at character 10 and is variable length. In fact it can be acquired with getPageNthWord(page,3) and getPageNthWord(page,4)

       

      How easy would it be to create some javascript to run in an action which would do the following:

      1. Identify text "Vacancy ID"
      2. Split document at that point, saving the pages from the current page up to the page before the next instance of "Vacancy ID" (typically 5 pages, though not always)
      3. Extract applicant name from previous line
      4. Save individual pdf for each applicant, using applicant name

       

      Can this be done, or has it been done already? Thanks

        • 1. Re: Split large pdf on repeated text pattern, and save new pdf with custom filename
          try67 MVP & Adobe Community Professional

          If the pages are consistent and the text readable (ie, not part of a scanned image), then yes, it can most likely be done.

          I've developed many similar tools for my clients in the past, so if you wish to send me some sample pages (to try6767 at gmail.com) I'll be happy to let you know if I think it's doable or not, and if so, for how much.

          • 2. Re: Split large pdf on repeated text pattern, and save new pdf with custom filename
            jt77474 Level 1

            Thanks. Unfortunately I don't have a budget for this work so I figured it out myself. Here is the solution in case anyone else needs to do something similar. Obviously you will need to tweak the code for your scenario. I ran this in the JavaScript debugger console, following the instructions (e.g. select the code and press Ctrl+Enter) on this site: https://acrobatusers.com/tutorials/javascript_console

             

            In short, this script does the following:

            1. For each page in the document, look for the word "Vacancy" at word number 8 (word numbers are zero-based)
            2. If that exists, check that the next word (9) is "ID". This means we've found the text "Vacancy ID"
            3. Extract first name and surname from fixed positions on the same page
            4. Now continue through the document until we find the next instance of "Vacancy ID"
            5. Make a note of its page number (p2). The section to pass to the extractPages() function then runs from the current page to p2 - 1
            6. Finally, extract the last applicant's section, which runs to the end of the document
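
As a rough sketch, steps 1 to 3 above can be written as a small pure function. This is only an illustration: `parseDividerPage` and its `words` argument are hypothetical names, not part of the Acrobat API. In Acrobat the array would first have to be filled from `getPageNthWord`. The word positions (2, 3, 8, 9) are the ones that fit this particular report and will differ for other layouts.

```javascript
// Hypothetical helper: `words` is an array of the words on one page,
// in reading order (zero-based).
// Returns the applicant's names if the page is a divider page, otherwise null.
function parseDividerPage(words) {
  // Steps 1 and 2: "Vacancy" at position 8, followed by "ID" at position 9.
  if (words[8] !== "Vacancy" || words[9] !== "ID") {
    return null;
  }
  // Step 3: the report prints the surname first, then the first name.
  return { surName: words[2], firstName: words[3] };
}
```

A page with fewer than ten words simply fails the check and returns null, so ordinary pages need no special handling.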

             

            I'm sure there are lots of better ways of doing it, but this works for me, it took about an hour, and I didn't have to pay anyone (sorry try67). Also, someone else might be able to use this for free in future. Let me know if you have any problems and I'll try to help. I've never used JavaScript before but it doesn't seem to be too hard. Debugging in Acrobat, however, is AWFUL! Good luck.

             

            var firstName = ""
            var surName = ""
            var finalpage = 0
            var count = 0
            
            
            //For each page in document, check whether specific words meet criteria
            for (var p = 0; p < this.numPages; p++) {
            
            
              if (this.getPageNthWord(p, 8) == "Vacancy") {
                if (this.getPageNthWord(p, 9) == "ID") {
            
            
                  count++;
                  firstName = this.getPageNthWord(p, 3);
                  surName = this.getPageNthWord(p, 2);
                  finalpage = p;
            
            
                  //Find page position of next break point
                  for (var p2 = p + 1; p2 < this.numPages; p2++) {
                    if (this.getPageNthWord(p2, 8) == "Vacancy") {
                      if (this.getPageNthWord(p2, 9) == "ID") {
                        this.extractPages({
                          nStart: p,
                          nEnd: p2-1,
                          cPath: count + " " + firstName + " " + surName + ".pdf"
                        });
                        //Log the range actually extracted (the section ends on the page before the next divider)
                        console.println("Extracted " + firstName + " " + surName + " pp " + p + " to " + (p2 - 1));
                        break;
                      }
                    }
                  }
            
            
                }
              }
            
            
            }
            
            
            //Save final section: the last applicant runs to the end of the document.
            //Skipped if no divider page was found at all.
            if (count > 0) {
              this.extractPages({
                nStart: finalpage,
                nEnd: this.numPages - 1,
                cPath: count + " " + firstName + " " + surName + ".pdf"
              });
              console.println("Extracted " + firstName + " " + surName + " pp " + finalpage + " to " + (this.numPages - 1));
            }
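
One possible simplification, offered only as a sketch: collect the divider page numbers in a single pass, then work out every section's page range from that list, so the last applicant needs no separate step. `sectionRanges` is a hypothetical helper name, not an Acrobat function, and the Acrobat wiring in the trailing comment is an untested assumption.

```javascript
// Hypothetical helper: given the zero-based page numbers of the divider pages
// (in ascending order) and the total page count, return one {start, end}
// range per applicant.
function sectionRanges(dividerPages, numPages) {
  var ranges = [];
  for (var i = 0; i < dividerPages.length; i++) {
    var start = dividerPages[i];
    // A section ends on the page before the next divider,
    // or on the last page of the document for the final applicant.
    var end = (i + 1 < dividerPages.length) ? dividerPages[i + 1] - 1 : numPages - 1;
    ranges.push({ start: start, end: end });
  }
  return ranges;
}

// Assumed usage in the Acrobat console (untested):
//   var dividers = [];
//   for (var p = 0; p < this.numPages; p++) {
//     if (this.getPageNthWord(p, 8) == "Vacancy" && this.getPageNthWord(p, 9) == "ID") {
//       dividers.push(p);
//     }
//   }
//   var ranges = sectionRanges(dividers, this.numPages);
```

For example, with divider pages at 0, 5 and 9 in a 12-page document, this gives the ranges 0 to 4, 5 to 8 and 9 to 11. Each range could then be passed to extractPages() as nStart and nEnd.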