If Acrobat is reading each part of the word as a separate word, you have a problem.
The way I approached it in some of my tools was to check if a word ends
with a hyphen, and if so, to check if it's the last on the line. If both
conditions are true, combine with the next word on the next line. This is
not foolproof, of course, as there are documents with columns or other
structural elements that prevent this from working. Better than nothing, though.
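
A minimal sketch of that hyphen check, for illustration only. The function names joinHyphenatedWords and startsNewLine and the 2-point tolerance are made up for this example. It assumes the second number in each quad is a y coordinate, and that a jump in y between two consecutive words means a new line. getPageNthWord() is called with bStrip set to false so the trailing hyphen is not stripped, which also means other punctuation stays attached to the words.

```javascript
// Hypothetical helper: true if word `next` sits on a different line than word `current`.
// Assumption: index 1 of a quad is a y coordinate; `tol` is in points.
function startsNewLine(doc, pageNum, current, next, tol) {
    var a = doc.getPageNthWordQuads(pageNum, current);
    var b = doc.getPageNthWordQuads(pageNum, next);
    var yCurrent = a[a.length - 1][1]; // last quad of the current word
    var yNext = b[0][1];               // first quad of the next word
    return Math.abs(yCurrent - yNext) > tol;
}

// Hypothetical helper: returns [{ text, wordIndexes }], joining a word that ends
// in a hyphen with the following word when that word starts a new line.
function joinHyphenatedWords(doc, pageNum, yTolerance) {
    var tol = (yTolerance === undefined) ? 2 : yTolerance;
    var count = doc.getPageNumWords(pageNum);
    var result = [];
    for (var i = 0; i < count; i++) {
        // bStrip = false keeps punctuation, so trim trailing white space ourselves
        var text = doc.getPageNthWord(pageNum, i, false).replace(/\s+$/, "");
        var indexes = [i];
        while (/-$/.test(text) && i + 1 < count &&
               startsNewLine(doc, pageNum, i, i + 1, tol)) {
            i++;
            var nextText = doc.getPageNthWord(pageNum, i, false).replace(/\s+$/, "");
            text = text.slice(0, -1) + nextText; // drop the hyphen, append the next part
            indexes.push(i);
        }
        result.push({ text: text, wordIndexes: indexes });
    }
    return result;
}
```

In the Acrobat console this would be called as joinHyphenatedWords(this, 0) for the first page. A genuinely hyphenated compound that happens to break at the end of a line gets joined as well, so it has the same limits as described above.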
However, it is also possible that Acrobat does see both parts as parts of
the same word. In that case, getPageNthWordQuads() will return multiple
quad arrays. As you know, that method returns an array of quad arrays.
There's usually only one, but in principle there could be more... Something
to check before giving up.
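
To show why the array of quads matters, here is a rough sketch that passes every quad returned for one or more words to a single Highlight annotation. The names collectQuads and highlightWords are invented for this example. It assumes the quads returned by getPageNthWordQuads() can be handed to addAnnot() unchanged.

```javascript
// Hypothetical helper: gather all quads for the given word indexes into one array.
// A word may contribute more than one quad, so concat rather than taking [0].
function collectQuads(doc, pageNum, wordIndexes) {
    var quads = [];
    for (var i = 0; i < wordIndexes.length; i++) {
        quads = quads.concat(doc.getPageNthWordQuads(pageNum, wordIndexes[i]));
    }
    return quads;
}

// Hypothetical helper: one Highlight annotation covering all the given words.
function highlightWords(doc, pageNum, wordIndexes, comment) {
    return doc.addAnnot({
        page: pageNum,
        type: "Highlight",
        quads: collectQuads(doc, pageNum, wordIndexes),
        contents: comment || ""
    });
}
```

For a single word, highlightWords(this, 0, [12], "check this") is enough. If the word wraps and Acrobat reports it as one word, all of its quads end up in the same annotation.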
You are awesome - I now understand why the quads are kept in an array. It can identify the text even if it spans more than one line: each line is identified as a rectangle, with eight coordinate values (four x,y pairs) per item, so the annotation can take the items one by one and mark them as a single comment.
Thank you very much - you saved me from having to forward-parse and reverse-parse the combined words to identify the meaningful ones.
I will try the code and update my results here.