I'm trying to create an accurate PDF highlight file as specified by Adobe in this technical note: http://www.adobe.com/devnet/pdf/pdfs/HighlightFileFormat.pdf
I have OCR text for my scanned documents, generated by the program I used to create the searchable PDFs. I tried calculating character offsets from that text, but it doesn't exactly match the text Adobe Reader extracts from the PDF, so my offsets are sometimes wrong. I've used PDFBox to generate the highlight file and found that it's very accurate, but it has to load each PDF and extract its text before calculating the offsets, which takes a lot of time (I want to display my PDFs with highlighted search hits when a user clicks on one of my search results).
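For reference, the character-offset approach looks roughly like the sketch below. It assumes the file layout described in the technical note (a `Body` element with `units=characters` and one `loc` entry per hit, with zero-based page numbers) and that `page_texts` already holds each page's text exactly as Reader counts it, which is the part that goes wrong with raw OCR text. The function name and parameters are hypothetical.

```python
def build_highlight_file(page_texts, term):
    """Return highlight-file text marking every occurrence of term.

    page_texts: list of strings, one per page (index 0 = first page).
    Assumption: each string matches Reader's own character stream.
    Matching is case-sensitive to keep offsets trivially correct.
    """
    if not term:
        raise ValueError("search term must not be empty")

    lines = ["<XML>", "<Body units=characters version=2>", "<Highlight>"]
    for page_index, text in enumerate(page_texts):
        start = text.find(term)
        while start != -1:
            # pos is the character offset on the page, len the hit length.
            lines.append(f"<loc pg={page_index} pos={start} len={len(term)}>")
            start = text.find(term, start + 1)
    lines += ["</Highlight>", "</Body>", "</XML>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_highlight_file(["foo bar foo", "no hits here"], "foo"))
```

Any mismatch between `page_texts` and what Reader sees (dropped spaces, split ligatures, reordered lines) shifts every later `pos` on that page, which is why the offsets drift.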
I'm exploring word offsets instead, even though the technical note warns against them because of a bug. The note says the bug is in Reader 3.01. Does anyone know whether it has been fixed in Reader 8.1?
Has Reader's word-finding algorithm been published? While testing word finding, I've found that punctuation is sometimes counted as a word and sometimes not. For example, a person's initial, written as a letter and a period ("A."), could be one word or two. Reader decides which with no rhyme or reason that I can see. Knowing the algorithm would be a great help.
I know I'll probably have to pre-extract accurate text using PDFBox, but I just wanted to play around with word finding. Any help would be greatly appreciated! Thanks.
The word-finding algorithm isn't published. It involves a great deal
of heuristic guesswork, by necessity, since a PDF does not reliably
store spaces. What Reader treats as a space, and hence a word
separator, is often just two consecutive glyphs positioned further
apart than the font's normal spacing would explain.
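
To make the idea concrete, here is a minimal sketch of gap-based word
splitting. It is not Reader's algorithm, which is unpublished; the
`Glyph` type, the `split_words` name, and the `gap_ratio` threshold are
all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the line
    width: float  # horizontal advance of the glyph


def split_words(glyphs, gap_ratio=0.3):
    """Group the glyphs of one line into words using horizontal gaps.

    A new word starts when the gap to the previous glyph exceeds
    gap_ratio times the previous glyph's width (an arbitrary threshold).
    """
    words = []
    current = ""
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    tight = [Glyph("A", 0.0, 7.0), Glyph(".", 7.5, 3.0)]
    loose = [Glyph("A", 0.0, 7.0), Glyph(".", 10.0, 3.0)]
    print(split_words(tight))  # ['A.']
    print(split_words(loose))  # ['A', '.']
```

With any threshold of this kind, the same "A." comes out as one word
or two depending only on how far the period sits from the letter, which
would explain why the results look arbitrary on scanned, OCR'd pages.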