36 Replies Latest reply on Aug 2, 2017 9:52 AM by try67

    Grabbing text data from a pdf to use in javascript

    iu-user

      I need to be able to grab the invoice number from PDFs and add it to the filename.  The customer always sends their invoices in the same format.  Is there a way to get the text from the PDF and add it to the filename while resaving the document?

       

      I am using DC professional

        • 1. Re: Grabbing text data from a pdf to use in javascript
          try67 MVP & Adobe Community Professional

          Assuming this is "real" text and not an image of text then yes, it might be possible.

          However, it requires a way of identifying the invoice number, for example based on its format, location on the page or context, or a combination of these methods. Each one will require a different kind of script, though, and of course it will only work if the files are fairly consistent with each other.
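As a rough illustration of the format-based approach, the sketch below scans every word on a page and returns the first one that matches a pattern. The helper name findWordByFormat and the six-digit pattern are assumptions made for illustration (the pattern only mirrors sample numbers such as 105063 quoted in this thread); getPageNumWords and getPageNthWord are the Doc methods from the Acrobat JavaScript API.

```javascript
// Hypothetical helper: return the first word on a page that matches a pattern.
// "doc" is an Acrobat Doc object (for example "this" inside an Action script).
function findWordByFormat(doc, pageNum, pattern) {
    var count = doc.getPageNumWords(pageNum);
    for (var i = 0; i < count; i++) {
        // Third argument true = strip punctuation and white space from the word
        var word = doc.getPageNthWord(pageNum, i, true);
        if (pattern.test(word)) {
            return word;
        }
    }
    return null; // nothing on the page matched the expected format
}

// Assumed format: exactly six digits. Adjust to the real invoice layout.
var INVOICE_PATTERN = /^\d{6}$/;
```

Inside an Action this might be called as findWordByFormat(this, 0, INVOICE_PATTERN). A null result is a signal to skip the file rather than rename it with a wrong number.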

          • 2. Re: Grabbing text data from a pdf to use in javascript
            iu-user Level 1

            I have the x, y position of the text on the page.  It is real text that can be highlighted and the pdfs from this vendor are very consistent in their format.  I would like to grab the text (actually a number) and add it to the beginning of the filename.

            • 3. Re: Grabbing text data from a pdf to use in javascript
              try67 MVP & Adobe Community Professional

              OK, in that case it should be possible, but it's a tricky task. You will need to create a loop that iterates over all the words on the page (or in the entire file, if the text is not always on a specific page), gets each word's location on the page (using the getPageNthWordQuads method), and then compares it to the area where you expect the target text to be located. Definitely not a simple task if you don't have experience with Acrobat JS...
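The loop described above could be sketched roughly as follows. This is a minimal sketch, not a finished tool: the helper name wordsInRegion and the [left, bottom, right, top] rectangle layout are assumptions, and the rectangle must be given in the same page coordinates that getPageNthWordQuads reports. Each quad is treated as eight numbers (four x,y corner pairs), and only the first quad of each word is examined.

```javascript
// Hypothetical helper: collect the words whose centre point falls inside a
// target rectangle given as [left, bottom, right, top].
function wordsInRegion(doc, pageNum, rect) {
    var found = [];
    var count = doc.getPageNumWords(pageNum);
    for (var i = 0; i < count; i++) {
        var quads = doc.getPageNthWordQuads(pageNum, i);
        if (!quads || quads.length === 0) {
            continue; // no location data for this word
        }
        var q = quads[0]; // eight numbers: four x,y corner pairs
        var xs = [q[0], q[2], q[4], q[6]];
        var ys = [q[1], q[3], q[5], q[7]];
        // Use min/max so the result does not depend on the corner order
        var cx = (Math.min.apply(null, xs) + Math.max.apply(null, xs)) / 2;
        var cy = (Math.min.apply(null, ys) + Math.max.apply(null, ys)) / 2;
        if (cx >= rect[0] && cx <= rect[2] && cy >= rect[1] && cy <= rect[3]) {
            found.push(doc.getPageNthWord(pageNum, i, true));
        }
    }
    return found;
}
```

Testing the centre of each word against a slightly generous rectangle tolerates small shifts between files better than comparing exact coordinates.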

               

              I've developed many similar tools in the past so if you're interested in hiring someone to do it for you, for a small fee, feel free to contact me privately at try6767 at gmail.com.

              • 4. Re: Grabbing text data from a pdf to use in javascript
                iu-user Level 1

                So, you can't just point to the x-y position of the text, even if its page and position do not change from document to document?

                • 5. Re: Grabbing text data from a pdf to use in javascript
                  Karl Heinz Kremer Adobe Community Professional

                  If you know exactly where the text is, you can crop the page down to just that portion, and then iterate over all words in that area using Doc.getPageNthWord() (Acrobat DC SDK Documentation). You should then be able to extract just the text you are interested in. If you look through the archives and search for getPageNthWord, you should find a number of examples.

                  • 6. Re: Grabbing text data from a pdf to use in javascript
                    Karl Heinz Kremer Adobe Community Professional

                    Actually, I just realized that most of these examples are over at the old AcrobatUsers.com site. Take a look here: Reverse Crop With Javascript (JavaScript)

                    • 7. Re: Grabbing text data from a pdf to use in javascript
                      try67 MVP & Adobe Community Professional

                      You can, but it's not a trivial task. There's no command that says "give me the text in location x,y on page z"...

                      • 8. Re: Grabbing text data from a pdf to use in javascript
                        iu-user Level 1

                        So, I ran this script from an example - thanks.

                         

                        var PageText = "";
                        for (var j = 0; j < 30; j++) {
                            var word = this.getPageNthWord(1, j, false);
                            PageText += word;
                        }
                        app.alert(PageText);

                         

                        I found the text I need to be the 13th word on the page.  I can now just use the getPageNthWord function and assign a variable then insert the variable in a filename function to put the invoice number into the filename.

                         

                        Thank you I think I can muddle on now.

                         

                        I don't see a need for cropping or iterating over the whole document.  Am I wrong in this?
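A variation on the loop above that may make the word position easier to find: instead of a hard-coded limit of 30, it asks the document how many words the page has and labels each word with its index. The helper name listPageWords is an assumption; note that both the page number and the word number passed to getPageNthWord are zero-based, so page 1 is the second page of the file.

```javascript
// Hypothetical helper: list every word on a page together with its
// zero-based index, so the position of the target word can be read off.
function listPageWords(doc, pageNum) {
    var lines = [];
    var count = doc.getPageNumWords(pageNum);
    for (var i = 0; i < count; i++) {
        lines.push(i + ": " + doc.getPageNthWord(pageNum, i, true));
    }
    return lines;
}
```

From the console this could be run as console.println(listPageWords(this, 0).join("\n")) to print one word per line.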

                        • 9. Re: Grabbing text data from a pdf to use in javascript
                          try67 MVP & Adobe Community Professional

                          Are you sure the number will always be the 13th word on each page of each file? If so then you can do it like that...

                          • 10. Re: Grabbing text data from a pdf to use in javascript
                            Test Screen Name Most Valuable Participant

                            The 13th word rule might work for you, but it seems risky to me. Are you quite sure that every word there today will always be there? That there will never be another word? And that you won't get extra words (for example an extra space)?

                             

                            The "canonical" way to solve this is to use getPageNthWord and getPageNthWordQuads. The Quads give the location of a quadrilateral containing the word. You can't use the size exactly, nor the X,Y directly, but you could use some fuzzy logic to see if this information seems to be from about the right part of the page.

                            • 11. Re: Grabbing text data from a pdf to use in javascript
                              iu-user Level 1

                              A small sampling shows these documents to be fairly consistent and software generated.  Possibly a form that has been flattened or some other structured document. 

                               

                              I will go with this - and move on to tackling the problem of making this rename batches of 20 - 100 files at a time.  If the documents prove to be inconsistent, I will need to muddle through the more formal way - right now, down and dirty seems to be working and fits my time schedule.  I'm sorry if this proves to be anathema to those wholly invested in the process.  Thank you all for your help.  I may be back with batch renaming issues.

                              • 12. Re: Grabbing text data from a pdf to use in javascript
                                try67 MVP & Adobe Community Professional

                                If it works, that's all that matters...

                                • 13. Re: Grabbing text data from a pdf to use in javascript
                                  iu-user Level 1

                                  arrgh

                                  I've got it stamping and renaming files properly, and even using the 13th word in the filename.  But I am getting this error when it tries to execute this.saveAs: "exception in line 56 of function top level, script Batch:exec  Raise error: the file may be read only blah, blah, blah".  The path is good, and I have tried many different approaches, even a local path.

                                   

                                  Here is what I am working with:

                                   

                                   

                                  // Begin job
                                  if ( typeof global.counter == "undefined" || global.date_reply == null  ) {
                                      console.println("Begin Job Code");
                                      global.counter = 0;
                                      // Grab date from User to be stamped
                                      var dialogNumber = "Number of Files";
                                      global.FileCnt = app.response("Number of Files to be Processed:", dialogNumber);

                                      var dialogTitle = "Date Received";
                                      var defaultAnswer = util.printd("mm-dd", new Date());
                                      global.date_reply = app.response("Date Received:",
                                      dialogTitle, defaultAnswer);

                                  }

                                  // Main code to process each of the selected files
                                  try {
                                      global.counter++
                                      console.println("Processing File #" + global.counter);
                                      // insert batch code here.

                                      this.addWatermarkFromText({
                                          cText: "GHC Received " + global.date_reply,
                                          nTextAlign: app.constants.align.left,
                                          nHorizAlign: app.constants.align.left,
                                          nVertAlign: app.constants.align.bottom,
                                          nHorizValue: 1, nVertValue: 1,
                                          nFontSize: 8,});

                                      this.addWatermarkFromText({
                                          cText: "Finance Inbox",
                                          nTextAlign: app.constants.align.right,
                                          nHorizAlign: app.constants.align.right,
                                          nVertAlign: app.constants.align.bottom,
                                          nHorizValue: -4, nVertValue: 1,
                                          nFontSize: 8,
                                          aColor: ["G",.5]
                                      });

                                  } catch(e) {
                                      console.println("Batch aborted on run #" + global.counter);
                                      delete global.counter; // Try again, and avoid End Job code
                                      event.rc = false; // Abort batch
                                  }

                                  var pronmbr = getPageNthWord(0,13,false)
                                  var re = /\.pdf$/;
                                  var date_replace = global.date_reply.replace(/[?:\\/|<>"*]/g,"");
                                  var fname = this.documentFileName.replace(re,"_");


                                  var filename =  pronmbr + "ART INV" + date_replace + ".pdf";
                                  console.println(filename);

                                  // File path must be changed manually to correct directory
                                  this.saveAs("/O/1_invoice staging/" + filename);


                                  // End job
                                  if ( global.counter == global.FileCnt ) {
                                      console.println("End Job Code");
                                      // Insert endJob code here

                                      // Remove any global variables used in case user wants to run
                                      // another batch sequence using the same variables
                                      delete global.counter;
                                      delete global.date_reply;
                                      delete global.FileCnt;
                                  }

                                  • 14. Re: Grabbing text data from a pdf to use in javascript
                                    try67 MVP & Adobe Community Professional

                                    What's the full file-name that you're trying to use?

                                    • 15. Re: Grabbing text data from a pdf to use in javascript
                                      iu-user Level 1

                                      pronmbr + "ART INV" + date_replace + ".pdf";

                                       

                                      would be something like "105063 ART INV 08-01.pdf"

                                       

                                      with pronmbr being the 13th word, ART INV being inserted text and date_replace being the user date entered in the dialogue box.  I get an appropriate filename in the console screen with each error message.  One for each file batched - always the same error, but it saves as the original filename.

                                      • 16. Re: Grabbing text data from a pdf to use in javascript
                                        try67 MVP & Adobe Community Professional

                                        From what context are you running the code?

                                        Does it work if you only execute the saveAs command from the console with the full path, hard-coded into the code?

                                        • 17. Re: Grabbing text data from a pdf to use in javascript
                                          iu-user Level 1

                                          Part of an Action in Acrobat X Pro.  I took one I use that works and added the var pronmbr = getPageNthWord(0,13,false) command.

                                          Actually, I get an undefined error when I try the console:

                                          saveAs("/O/1_invoice staging/" test filename)

                                           

                                          undefined

                                          • 18. Re: Grabbing text data from a pdf to use in javascript
                                            try67 MVP & Adobe Community Professional

                                            "Undefined" is not an error message. It just means the code executed without returning any values.

                                            Do you see the file saved in the target folder?

                                            • 19. Re: Grabbing text data from a pdf to use in javascript
                                              iu-user Level 1

                                              sorry - no it is not saving to the target folder.

                                              • 20. Re: Grabbing text data from a pdf to use in javascript
                                                try67 MVP & Adobe Community Professional

                                                Can you post the exact code you're executing?

                                                • 21. Re: Grabbing text data from a pdf to use in javascript
                                                  iu-user Level 1

                                                   

                                                  [Full script reposted, identical to the one in reply 13 above.]

                                                  • 22. Re: Grabbing text data from a pdf to use in javascript
                                                    try67 MVP & Adobe Community Professional

                                                    No, I mean when you test just the saveAs command from the console window, what code did you execute, exactly?

                                                    • 24. Re: Grabbing text data from a pdf to use in javascript
                                                      Test Screen Name Most Valuable Participant

                                                      1. I don't like the look of trying to save as test. Even if it succeeds it will just be called test and won't automatically open in Acrobat. Try test.pdf.

                                                       

                                                      2. Are you able to save to the folder "O:\1_invoice staging" manually?

                                                      • 25. Re: Grabbing text data from a pdf to use in javascript
                                                        try67 MVP & Adobe Community Professional

                                                        You can't be executing the code, because it should have failed (because you didn't include the ".pdf" suffix).

                                                        To execute it you must first select it and then press Ctrl+Enter.

                                                        • 26. Re: Grabbing text data from a pdf to use in javascript
                                                          iu-user Level 1

                                                          Okay,

                                                          Adding the ".pdf" extension to the code makes it work in the console window.  So, executing that line with a named test file works.  I can't use the script line exactly because the filename contains one variable and a user-entered value (plus ".pdf").

                                                           

                                                          The error seems to me to be that the file is viewed as open -  "exception in line 56 of function top level, script Batch:exec  Raise error: the file may be read only ...."  or the pronmbr variable is not changing with the iteration through the selected files, so it thinks it is trying to save the exact same name again - maybe???  I'm at a loss.

                                                          • 27. Re: Grabbing text data from a pdf to use in javascript
                                                            Test Screen Name Most Valuable Participant

                                                            You can use app.alert to display the file name in a dialog (or console.println to write it to the console) and see what is going on as the script runs.

                                                            • 28. Re: Grabbing text data from a pdf to use in javascript
                                                              iu-user Level 1

                                                              Thanks, just tried this and app.alert brings up each filename correctly, I click OK and then I get the error message with no file save.

                                                              • 29. Re: Grabbing text data from a pdf to use in javascript
                                                                try67 MVP & Adobe Community Professional

                                                                Copy the actual file name that you see in the alert (or output it to the console, and then copy it from there) into your saveAs command and run it manually from the console. Does it work?

                                                                If a file with the same name exists it will simply be overwritten. However, if that file is open, locked or is set as read-only it will fail and an error message will appear.
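To see the actual failure instead of only the generic batch error, the save can be wrapped in try/catch and the exception logged together with the path. This is a hedged sketch: trySaveAs is a hypothetical helper name, and "log" is assumed to be any function that accepts a string.

```javascript
// Hypothetical helper: attempt the save and report why it failed, if it did.
// "doc" is an Acrobat Doc object; "log" is any function taking a string.
function trySaveAs(doc, path, log) {
    try {
        doc.saveAs(path);
        return true;
    } catch (e) {
        // Log the exact path too, so hidden or illegal characters can be spotted
        log("saveAs failed for " + path + ": " + e);
        return false;
    }
}
```

In Acrobat the logger could be passed as function (msg) { console.println(msg); }, for example trySaveAs(this, "/O/1_invoice staging/" + filename, function (msg) { console.println(msg); }).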

                                                                • 30. Re: Grabbing text data from a pdf to use in javascript
                                                                  iu-user Level 1

                                                                  this.saveAs("/O/1_invoice staging/" + "105119 ART INV 8-02" + ".pdf")

                                                                  does not work in console.

                                                                   

                                                                  Target folder does not have original files that are being batched or any other files.  Right now I can not get the console to repeat the test --  this.saveAs("/O/1_invoice staging/" + "test" + ".pdf")  -- which worked yesterday.  I am beginning to think console mode is unstable or I am not doing something right.

                                                                   

                                                                   

                                                                  CORRECTION TO THE ABOVE:  both examples of script ran and saved as expected.  My console was locked up, closed and reopened Acrobat and now these commands work fine.

                                                                  • 31. Re: Grabbing text data from a pdf to use in javascript
                                                                    iu-user Level 1

                                                                    I think I am on to something:

                                                                     

                                                                    When I run this in the console it saves just fine:

                                                                    var pronmbr = 105882
                                                                    var date_replace = "8-01";
                                                                    var filename =  pronmbr + " ART INV " + date_replace + ".pdf";
                                                                    console.println(filename);
                                                                    app.alert(filename, 3);
                                                                    this.saveAs("/O/1_invoice staging/" + filename)

                                                                    When I run this in the console, app.alert shows the right filename but it throws the error. (Does the getPageNthWord command hold onto the document in such a way as to make saveAs think the file is protected or read-only?)

                                                                    var pronmbr = getPageNthWord(0,13,false)
                                                                    var date_replace = "8-01";
                                                                    var filename =  pronmbr + " ART INV " + date_replace + ".pdf";
                                                                    console.println(filename);
                                                                    app.alert(filename, 3);
                                                                    this.saveAs("/O/1_invoice staging/" + filename)

                                                                    • 32. Re: Grabbing text data from a pdf to use in javascript
                                                                      try67 MVP & Adobe Community Professional

                                                                      Why are you specifying the third (last) parameter of getPageNthWord as false? That means it's not stripping any white-space characters from the word, which could mean you're including something like a line-break in the file-name, which is not allowed.

                                                                      Try printing out the filename like this:

                                                                      console.println(filename.toSource());

                                                                      This will help you find any unwanted characters that might be hiding in it...
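toSource() is a non-standard method, so here is a small alternative that does not depend on it. The helper name showHiddenChars is an assumption; it simply rewrites control characters such as line-breaks as visible \uXXXX escapes so they show up when the string is printed.

```javascript
// Hypothetical helper: make control characters visible,
// e.g. a line-break becomes the six visible characters \u000a.
function showHiddenChars(text) {
    var out = "";
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        if (code < 32 || code === 127) {
            var hex = code.toString(16);
            while (hex.length < 4) {
                hex = "0" + hex;
            }
            out += "\\u" + hex;
        } else {
            out += text.charAt(i);
        }
    }
    return out;
}
```

Printing showHiddenChars(filename) with console.println should then reveal any stray line-break or tab in the name.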

                                                                      • 33. Re: Grabbing text data from a pdf to use in javascript
                                                                        iu-user Level 1

                                                                        yes!!!

                                                                        (new String("105882 \n ART INV 8-01.pdf"))

                                                                         

                                                                        got a pesky \n in the filename.  So, change attribute to "true" and this will work?

                                                                        • 34. Re: Grabbing text data from a pdf to use in javascript
                                                                          try67 MVP & Adobe Community Professional

                                                                          Either that or make sure to remove any such characters from the string before using it in the file-name.
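For the second option, removing the characters by hand, a minimal sketch could look like this. The helper name cleanFileNamePart is an assumption, and the list of rejected characters is the same set already used for the date in the script earlier in this thread, plus control characters.

```javascript
// Hypothetical helper: make a string safe to use as part of a file name.
function cleanFileNamePart(text) {
    return String(text)
        .replace(/[\x00-\x1f\x7f]/g, "")  // drop control characters such as \n
        .replace(/[?:\\\/|<>"*]/g, "")    // drop characters not allowed in Windows file names
        .replace(/\s+/g, " ")             // collapse runs of white space
        .replace(/^\s+|\s+$/g, "");       // trim both ends
}
```

The file name would then be built as cleanFileNamePart(pronmbr) + " ART INV " + cleanFileNamePart(date_replace) + ".pdf", which works whether or not the extracted word still carries a trailing line-break.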

                                                                          • 35. Re: Grabbing text data from a pdf to use in javascript
                                                                            iu-user Level 1

                                                                            Just tested and retested this.  It works perfectly now.  thank you very much for your help!!

                                                                             

                                                                            true/false in Excel vlookup and other attributes is just the opposite.

                                                                            • 36. Re: Grabbing text data from a pdf to use in javascript
                                                                              try67 MVP & Adobe Community Professional

                                                                              Well, the name of that parameter is bStrip. So if you specify it as true the white-space characters are stripped. If you specify it as false, they are retained... This is all documented in the Acrobat JS API Reference. Anyway, glad to hear you were able to sort it out!
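Putting the fix together, the file name from this thread could be built as sketched below, with bStrip set to true. The helper name buildInvoiceFileName is an assumption; page 0 and word index 13 are the values used in the script above and will differ for other layouts.

```javascript
// Hypothetical helper: build the target file name from the word at index 13
// on the first page (page 0), asking Acrobat to strip white space.
function buildInvoiceFileName(doc, dateText) {
    // bStrip = true: punctuation and white space are removed from the word
    var invoiceNumber = doc.getPageNthWord(0, 13, true);
    return invoiceNumber + " ART INV " + dateText + ".pdf";
}
```

In the Action, the result would be appended to the folder path before the call to this.saveAs.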