13 Replies Latest reply on Nov 12, 2008 5:43 AM by Patrick Leckey

    How to Extract Text coordinates from PDF

      Hi,

      Can anyone tell me how to get text coordinates from a PDF document using VB or .NET? For example, if some text is written in a PDF document, how can I get the coordinates of that text? It's very urgent.

      Thanks in advance.
        • 1. Re: How to Extract Text coordinates from PDF
          I think you could use the Doc object's getPageNthWordQuads method to get the coordinates.

          Best regards
          Stanley
          • 2. Re: How to Extract Text coordinates from PDF
            Level 1
            Hi,

            Thanks for your reply, but I didn't understand. Please explain briefly.

            Thanks...
            • 3. Re: How to Extract Text coordinates from PDF
              gkaiseril MVP & Adobe Community Professional
              The method described is an Acrobat JavaScript method. You should have a similar method or function within the SDK, or you should be able to issue an Acrobat JavaScript command and get back the array of quads for the word.
              • 4. Re: How to Extract Text coordinates from PDF
                Thom Parker Level 3
                Acrobat JavaScript can be used from VB through the JSO object, which is part of the Acrobat IAC (Interapplication Communication) interface. You'll find all the documentation here:

                http://www.adobe.com/devnet/acrobat/

                There is an example of using doc.getPageNthWordQuads() in the Acrobat JavaScript Reference. It's Example #2 for the "doc.addLink()" function.

                Also take a look at this article; it covers coordinates in Acrobat JavaScript.

                http://www.acrobatusers.com/tech_corners/javascript_corner/tips/2006/page_bounds/

                Your best strategy is to write a Folder Level JavaScript function to acquire the coordinates you want. Concatenate all the coordinate info into a string that's returned from the function. VB easily handles strings returned from the JSO; more complex data types can be problematic. Use the JSO to call the function from the VB program.
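A minimal sketch of that strategy, under stated assumptions: the helper name `getWordCoordsAsString` and the tab/newline output format are hypothetical, chosen only for illustration. It uses the Doc methods getPageNumWords, getPageNthWord, and getPageNthWordQuads, and takes the Doc object as a parameter so it makes no assumption about what `this` refers to when it is called. How the function is exposed to and invoked from the JSO is not shown here.

```javascript
// Hypothetical folder-level helper (name and output format are assumptions).
// Returns one line per quad: "wordIndex<TAB>word<TAB>x1,y1,x2,y2,x3,y3,x4,y4".
function getWordCoordsAsString(doc, nPage) {
    var lines = [];
    var numWords = doc.getPageNumWords(nPage);
    for (var i = 0; i < numWords; i++) {
        var word = doc.getPageNthWord(nPage, i);
        // Array of quads; each quad is an array of eight numbers.
        var quads = doc.getPageNthWordQuads(nPage, i);
        for (var q = 0; q < quads.length; q++) {
            lines.push(i + "\t" + word + "\t" + quads[q].join(","));
        }
    }
    // A single string is the easiest thing for the VB side to receive.
    return lines.join("\n");
}
```

On the VB side, the returned string can then be split on newlines and tabs to recover each word and its coordinates.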
                • 5. Re: How to Extract Text coordinates from PDF
                  Level 1
                  Hello Thom Parker,

                  Many thanks for your message.
                  I have written many applications in VB, but this is my first time working with PDF, and I haven't used any SDK before. So I don't have a clear idea of how to use the Adobe SDK in VB to get coordinates. I also read the documentation, but I still didn't get the idea. Please give me some starting code showing how to use the SDK from VB to get coordinates. It will point my next step in the right direction.

                  Thanks...
                  • 6. Re: How to Extract Text coordinates from PDF
                    Is your purpose to use this information to encrypt something? That sounds cool. What is the answer to MG Balaji's issue?
                    • 7. Re: How to Extract Text coordinates from PDF
                      Bernd Alheit Adobe Community Professional & MVP
                      > What is the answer to MG Balaji's issue?

                      Did you read the messages?
                      • 8. Re: How to Extract Text coordinates from PDF
                        Level 1
                        Hi Maleck and Bernd,

                        We are building some applications that use PDF. In some of them we need coordinate information. The purpose is to highlight each word in a line. I tried some approaches in VB without using the SDK, but I didn't get a result, so I posted the request in this forum.
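For the highlighting goal described here, a minimal Acrobat JavaScript sketch, assuming the Doc object is passed in and using a hypothetical helper name `highlightWordsOnPage`. It hands each word's quads straight to a Highlight annotation through addAnnot. It covers every word on a page, so restricting it to a single line would need an extra filter on the y coordinates, which is not shown.

```javascript
// Hypothetical helper: adds one Highlight annotation per word on a page.
// Returns the number of annotations added.
function highlightWordsOnPage(doc, nPage) {
    var numWords = doc.getPageNumWords(nPage);
    var count = 0;
    for (var i = 0; i < numWords; i++) {
        // The quads returned for a word can be used directly as the
        // annotation's quads property.
        var quads = doc.getPageNthWordQuads(nPage, i);
        doc.addAnnot({ page: nPage, type: "Highlight", quads: quads });
        count++;
    }
    return count;
}
```

Inside Acrobat this could be tried from the JavaScript console as `highlightWordsOnPage(this, 0)`.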
                        • 9. Re: How to Extract Text coordinates from PDF
                          I am trying to use the getPageNthWordQuads information to determine if a word on the page is within a region that I am interested in.

                          I have limited knowledge of JavaScript and have been looking up text manipulation and array manipulation functions in an attempt to figure out how to separate the values that are returned from the Quads routine. The Adobe documentation indicates that the Quads function returns an array, but when I try to access one of the values in the array, it gives me the entire contents of the array as though it were a string. If I use the .length property to try to determine its length, it tells me the length is 1! I am obviously mishandling this reference, but I have yet to find any specific examples that work with the quads array the way I am trying to work with it....

                          Here is my code. I am running it against an open file in batch processing mode (maybe this has something to do with it)...
                          ----------------------------------------------------------
                          var sourceDoc = this;
                          // Region of interest: x from tx1 to tx4, y from ty4 to ty1
                          var tx1 = 492.5;
                          var ty1 = 761.5;
                          var tx4 = 563;
                          var ty4 = 726.2;
                          try {
                              // Visit every other page
                              for (var j = 0; j < this.numPages; j = j + 2) {
                                  var cnt = 0;
                                  var rcvrnum = "";
                                  cnt = sourceDoc.getPageNumWords(j);

                                  if (j == 0) {
                                      try {
                                          for (var i = 0; i < cnt; i++) {
                                              var quads = sourceDoc.getPageNthWordQuads(j, i);
                                              var x1 = quads[0];
                                              console.println("Page(" + j + "),Word(" + i + ") = " + sourceDoc.getPageNthWord({nPage: j, nWord: i}));
                                              console.println("Quads length is " + quads.length);
                                              console.println("X1 = " + x1);
                                              // Note: y1 is never assigned in this snippet
                                              if (x1 >= tx1 && x1 <= tx4 && y1 >= ty4 && y1 <= ty1) {
                                                  console.println("Q1 is good");
                                                  console.println("Page(" + j + "),Word(" + i + ") = " + sourceDoc.getPageNthWord({nPage: j, nWord: i}));
                                              }
                                          }
                                      } catch (e) { console.println("Aborted: " + e); }
                                  }
                              }
                          } catch (f) { console.println("Aborted: " + f); }


                          ----------------------------------------------------------
                          I have tried several variations of the code above to try to extract my values so that I can compare them, but to no avail. The above code outputs to the console the following...
                          ----------------------------------------------------------

                          Page(0),Word(0) = OTTO
                          Quads length is 1
                          X1 = 19.350006103515625,782.15087890625,126.51744079589844,782.15087890625,19.350006103515625, 721.5038452148438,126.51744079589844,721.5038452148438
                          Page(0),Word(1) =
                          Quads length is 1
                          X1 = 125.17047119140625,782.15087890625,153.91525268554688,782.15087890625,125.17047119140625, 721.5038452148438,153.91525268554688,721.5038452148438
                          ----------------------------------------------------------

                          and so on...
                          x1 becomes the entire output from the array, and yet I cannot perform a simple split on x1. If I try to split x1 into an array by splitting on the comma, I get the following error.

                          ------------------------------------------
                          Aborted: TypeError: x1.split is not a function
                          ------------------------------------------

                          Am I supposed to import some libraries or something?

                          Thanks for any help....

                          Kevin Ailes
                          • 10. Re: How to Extract Text coordinates from PDF
                            Level 1
                            Never mind. I figured out a solution to my problem. I knew that I wasn't handling the output of the getPageNthWordQuads method correctly, even though the documentation says it returns an array. So, for starters, I looked up a few text parsing routines for JavaScript and found that you can typically call .toString() on an array, and that appeared to work. I called getPageNthWordQuads and immediately converted the result to a string like so....

                            var q = sourceDoc.getPageNthWordQuads(j,i).toString();

                            Now, I know it is a string so I call the .split() method like so...

                            var qa = q.split(',');

                            and tada! I can now refer directly to any of the quad values, which are all x,y coordinates of the box that encloses whatever word you have referred to.

                            BTW, in order to begin comparing these quads as numbers, I need to use the parseInt() method like below....

                            diffx1 = parseInt(qa[0])-8;

                            The above gave me an X coordinate that allowed me to target a specific part of the page for a specific word so that I could be sure it was the word I wanted.
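One caution on the parseInt() step: parseInt() drops the fractional part, so a value such as 19.350006103515625 becomes 19. If the full precision matters, a small sketch using parseFloat() instead could look like this (the helper name `parseQuadString` is hypothetical):

```javascript
// Convert the toString() output "x1,y1,...,x4,y4" into an array of numbers.
function parseQuadString(s) {
    var parts = s.split(",");
    var nums = [];
    for (var i = 0; i < parts.length; i++) {
        // parseFloat keeps the fractional part and ignores leading spaces.
        nums.push(parseFloat(parts[i]));
    }
    return nums;
}
```

With this, `parseQuadString(q)[0] - 8` gives the same kind of offset x coordinate as above without truncating it first.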

                            I'm sure there are easier ways to do this, but I was able to make it work this way so I figured I'd share it with others since I didn't find too many examples of using quads for this type of thing.

                            Kevin
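For comparison, a sketch that skips the string round trip. The console output in the question (a length of 1, with quads[0] printing eight numbers, and split() not being a function) is consistent with getPageNthWordQuads returning an array of quads, where each quad is itself an array of eight numbers. Under that reading, quads[0][0] is the first x value and quads[0][1] the first y value. The helper names `quadBounds` and `wordInRegion` and the shape of the region object are hypothetical.

```javascript
// Bounding box of one quad (eight numbers: x1,y1,x2,y2,x3,y3,x4,y4).
// Using min/max avoids depending on the order of the four corners.
function quadBounds(quad) {
    var xs = [quad[0], quad[2], quad[4], quad[6]];
    var ys = [quad[1], quad[3], quad[5], quad[7]];
    return {
        left: Math.min.apply(null, xs),
        right: Math.max.apply(null, xs),
        bottom: Math.min.apply(null, ys),
        top: Math.max.apply(null, ys)
    };
}

// True when every quad of the word lies entirely inside the region.
// region is { left, right, bottom, top } in the same coordinate space.
function wordInRegion(quads, region) {
    for (var q = 0; q < quads.length; q++) {
        var b = quadBounds(quads[q]);
        if (b.left < region.left || b.right > region.right ||
            b.bottom < region.bottom || b.top > region.top) {
            return false;
        }
    }
    return quads.length > 0;
}
```

In the loop from the question this would be called as `wordInRegion(sourceDoc.getPageNthWordQuads(j, i), {left: 492.5, right: 563, bottom: 726.2, top: 761.5})`, using the region values from that post.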
                            • 11. Re: How to Extract Text coordinates from PDF
                              gkaiseril MVP & Adobe Community Professional
                              You might want to look at Automating redaction with Acrobat JavaScript by Thom Parker, http://www.acrobatusers.com/tutorials/2008/07/auto_redaction_with_javascript/ , or Automating placement of annotations, Converting coordinates in Acrobat by Thom Parker, http://www.acrobatusers.com/tutorials/2007/10/auto_placement_annotations/ . Both articles show the use of the little Matrix2D object and its "fromRotated" method.

                              One has to realize that a PDF's page space can be visually rotated, and that there are different page boxes that describe different parts of the page and have different sizes.
                              • 12. Re: How to Extract Text coordinates from PDF
                                Hi,

                                I am trying to annotate a PDF using iText, but the problem is that I couldn't find a viewer to annotate the PDF. I am new to this, so please help me out. I don't understand the above steps, but they seem promising.

                                regards
                                Navin
                                • 13. Re: How to Extract Text coordinates from PDF
                                  Patrick Leckey Level 3
                                  > I am trying to Annotate a pdf using itext

                                  You may want to ask on the iText forums. These forums are for help with using Adobe's Acrobat SDK and scripting in Acrobat.