By the way, your code does not extract the coordinates of the word you're
looking for, but of the mouse cursor...
Searching for "words" or text strings gets tricky since there are many types of characters that may or may not need to be accounted for. The biggest issue will be what is called white space: non-printable characters like space, new line, carriage return, horizontal tab, vertical tab, and form feed. Punctuation such as ";", ":", and "." also ends up attached to words.
You should review the getPageNthWord method and pay close attention to the "bStrip" parameter. I expect you will need to search for single words and then also write code to search for multiple word combinations.
Without the "bStrip" parameter, or with it set to "false", I get a sample output of:
0 word: |Word | length: 5
1 word: |30.24-| length: 6
2 word: |0 | length: 2
3 word: |California. | length: 12
4 word: |test | length: 5
5 word: |word
| length: 6
With the "bStrip" parameter set to "true", I get a sample output of:
0 word: |Word| length: 4
1 word: |30.24| length: 5
2 word: |0| length: 1
3 word: |California| length: 10
4 word: |test| length: 4
5 word: |word| length: 4
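Building on the stripped output above, here is a minimal sketch of how a single-word or multi-word search could be written. The function name findPhrase and its parameters are made up for illustration; only getPageNumWords and getPageNthWord come from the Acrobat API. It assumes the phrase is separated by white space and that the comparison should be case-sensitive.

```javascript
// Sketch only: findPhrase and its parameters are hypothetical names.
// `doc` is the Acrobat Doc object (pass `this` when running from the console).
function findPhrase(doc, pageNum, phrase) {
    // Trim, then split the phrase on white space: "test word" -> ["test", "word"]
    var target = phrase.replace(/^\s+|\s+$/g, "").split(/\s+/);
    var numWords = doc.getPageNumWords(pageNum);
    var hits = [];
    for (var i = 0; i + target.length <= numWords; i++) {
        var matched = true;
        for (var j = 0; j < target.length; j++) {
            // bStrip = true drops the trailing white space and punctuation
            if (doc.getPageNthWord(pageNum, i + j, true) != target[j]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            hits.push(i); // index of the first word of the match
        }
    }
    return hits;
}
```

Called as findPhrase(this, pageNum, "test word"), it returns the index of the first word of every match on that page. Because Acrobat splits on punctuation, a phrase such as "30.24-0" would have to be passed as "30.24 0" for this sketch to find it.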
Hi, thanks for the quick response. Do you think it's a good/right approach to get coordinates from the mouse location? If not, what is the preferred way to do so?
No, it's not. There's no relation between the mouse's location and the location of the word.
You need to use the getPageNthWordQuads method to get an array that defines the location(s) of the word on the page.
I have successfully extracted the coordinates using getPageNthWordQuads, but I have a problem with extracting special characters like "-".
For example, 13-jul-2011 is extracted as 3 words. Is there any possibility to extract quads with the special characters also included in a word?
Thanks in Advance,
The quad always includes the associated punctuation. Try this:
Find the index of a word that contains punctuation with this code:
len = getPageNumWords(pageNum);
for (var i = 0; i < len; i++)
    console.println(i + ": " + getPageNthWord(pageNum, i, false));
Run it in the console window.
Then run this code on the word that includes a comma or dash. In this case it's word number 3:
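qds = getPageNthWordQuads(pageNum, 3);
// Array indexes assumed: first quad, corner points 1 and 4 (x1, y1, x4, y4)
rect = [qds[0][0], qds[0][1], qds[0][6], qds[0][7]];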
You'll see that the added rectangle surrounds the punctuation
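For a value like 13-jul-2011 that comes back as several words, one possible approach is to merge the quads of the consecutive words into a single bounding rectangle. The sketch below is only an illustration: quadsToRect and wordRangeRect are hypothetical helper names, and it assumes each quad is an array of 8 numbers (four x/y pairs). It takes the min/max of all corners so it does not depend on the order in which the corners are listed.

```javascript
// Sketch only: quadsToRect and wordRangeRect are hypothetical helper names.

// Bounding box [left, bottom, right, top] of an array of quads, where each
// quad is assumed to be an array of 8 numbers (4 x/y pairs).
function quadsToRect(quads) {
    var xs = [], ys = [];
    for (var q = 0; q < quads.length; q++) {
        for (var k = 0; k < 8; k += 2) {
            xs.push(quads[q][k]);
            ys.push(quads[q][k + 1]);
        }
    }
    return [Math.min.apply(null, xs), Math.min.apply(null, ys),
            Math.max.apply(null, xs), Math.max.apply(null, ys)];
}

// Rectangle around words first..last on a page, for example the three words
// that make up "13-jul-2011". `doc` is the Acrobat Doc object.
function wordRangeRect(doc, pageNum, first, last) {
    var all = [];
    for (var i = first; i <= last; i++) {
        // getPageNthWordQuads returns an array of quads for word i
        all = all.concat(doc.getPageNthWordQuads(pageNum, i));
    }
    return quadsToRect(all);
}
```

In the console, something like this.addAnnot({type: "Square", page: pageNum, rect: wordRangeRect(this, pageNum, 3, 5)}) should then draw one rectangle around all three words, punctuation included. The word indexes 3 and 5 are placeholders for whatever the word listing reports.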
Thanks Thom. One more doubt: how do I export the output values to a text file (anywhere on the local drive)?
Following is my code for finding a text and generating the xy co-ordinates.
for (var p = 0; p < this.numPages; p++)
{
    var numWords = this.getPageNumWords(p);
    for (var i = 0; i < numWords; i++)
    {
        var ckWord = this.getPageNthWord(p, i, true);
        if (ckWord == "Adobe")
        {
            var q = this.getPageNthWordQuads(p, i);
            // Flatten the quad to "x1,y1,x2,y2,x3,y3,x4,y4" and split it
            var a = q.toString();
            var b = a.split(",");
            var x1 = b[0];
            var y1 = b[1];
            var x4 = b[6];
            var y4 = b[7];
        }
    }
}
My question is: how do I write the x and y values to a text file?
There used to be a way to do this with the "doc.exportDataObject" function, but it has since been restricted.
1. The easy way. Write the text data to a file attachment with the "doc.createDataObject" function. I do this a lot. Although the data is not written to the file system, it's very easy for the user to drag and drop it anywhere they want. And there is an advantage to having the data attached to the PDF where it was created.
2. Create a new PDF with the "Report" object. Write the xy text to the "Report", then save it as text. This writes the text data to a specific location on the file system.
There are several variations on this theme. For example, you could create a blank PDF with "app.newDoc" then add one large form field or text annotation and write all the text data to the field/annot. Then flatten and save as text.
3. There are other tricks if you can write a plug-in or an IAC app. I once wrote a VBA add-in for Excel that sucked data out of a PDF, then saved the Excel file.
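Option 1 above could look roughly like the following sketch. The helper names collectWordCoords and attachCoords, the file name, and the tab-separated layout are assumptions made for illustration; the only Acrobat calls used are numPages, getPageNumWords, getPageNthWord, getPageNthWordQuads and createDataObject.

```javascript
// Sketch only: collectWordCoords and attachCoords are hypothetical names.
// `doc` is the Acrobat Doc object (pass `this` when running from the console).

// One tab-separated line per hit: page, word index, then the 8 numbers of
// the word's first quad.
function collectWordCoords(doc, target) {
    var lines = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        for (var i = 0; i < numWords; i++) {
            if (doc.getPageNthWord(p, i, true) == target) {
                var quads = doc.getPageNthWordQuads(p, i);
                lines.push([p, i].concat(quads[0]).join("\t"));
            }
        }
    }
    return lines;
}

// Attach the collected lines to the PDF as a text file.
function attachCoords(doc, fileName, lines) {
    doc.createDataObject(fileName, lines.join("\r\n"));
}
```

Running attachCoords(this, "coords.txt", collectWordCoords(this, "Adobe")) from the console should leave a coords.txt attachment on the PDF, which the user can then save or drag wherever they like.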