I am scanning text in a PDF and reading it into memory using doc::getPageNthWord(). Reading the text from the file using the mark 1 mod 0 eyeball looks like this:
Greenpeace. 2012. Safeway charts new course for
sustainable tuna. www.greenpeace.org/usa/en/media-
Monterey Bay Aquarium. 2012a. Wild seafood issue:
(Please note that I have not been able to determine how the PDF produces the indent. I am using three spaces to represent it here, but a hex editor... well, read on, please.)
Using the above function (i.e., getPageNthWord()), the information comes in like this:
• Ł Ł • • Ł Ł • • Ł Ł • • Ł Ł •
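For context, a read loop of roughly the following shape pulls a page's words through getPageNthWord() and records the numeric code of every character, which makes a letter-for-letter comparison easier than eyeballing the glyphs. This is only a sketch: `readPageWords` and `pageIndex` are hypothetical names, and `doc` is assumed to be an Acrobat Doc object (for example `this` in the Acrobat JavaScript console).

```javascript
// Sketch only: `doc` is assumed to expose Acrobat's Doc methods
// getPageNumWords(nPage) and getPageNthWord(nPage, nWord, bStrip).
function readPageWords(doc, pageIndex) {
  var words = [];
  var count = doc.getPageNumWords(pageIndex);
  for (var i = 0; i < count; i++) {
    // bStrip = false keeps punctuation and whitespace attached to the word
    var word = doc.getPageNthWord(pageIndex, i, false);
    var codes = [];
    for (var j = 0; j < word.length; j++) {
      codes.push(word.charCodeAt(j)); // numeric code of each character
    }
    words.push({ word: word, codes: codes });
  }
  return words;
}
```

Printing the `codes` arrays next to the text that is visible on the page shows exactly which characters differ and by how much.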
Copying it to a text editor using the clipboard, it looks like this:
"Wild seafood issue:"
", OOHJDO AVKLQJ.
Can anyone suggest a way to get the text into a readable format?
I have some additional information. When I copy the text into a programming editor, I see the latter transformation. Comparing the text letter for letter, each character in the cryptic version has a code 29 lower than the original, with a few exceptions: periods and forward slashes do not seem to be affected. Additionally, the letters f and i, when adjacent, are read as a single character with character code 195.
I have looked at possible XOR/OR bit fiddling conversions, but have not been able to figure out what is causing the transformation (or causing the transformation to fail).
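The shift described above can be tried out with a short sketch like the one below. It simply adds 29 back to every character code, leaves periods and slashes alone, and expands the code-195 character to "fi". The names `decodeShifted`, `SHIFT`, `UNCHANGED` and `LIGATURES` are hypothetical, treating the space as unaffected is an assumption based on the clipboard sample, and the table of exceptions is almost certainly incomplete.

```javascript
// Sketch only: reverses the "minus 29" pattern observed in the copied text.
var SHIFT = 29; // offset observed when comparing letter for letter

// Characters that appear to come through unchanged.
// "." and "/" are from the observation; " " is an assumption.
var UNCHANGED = { ".": true, "/": true, " ": true };

// Character codes that stand for a ligature (195 was seen for "fi").
var LIGATURES = { 195: "fi" };

function decodeShifted(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    var ch = text.charAt(i);
    if (LIGATURES.hasOwnProperty(code)) {
      out += LIGATURES[code];                 // expand the ligature
    } else if (UNCHANGED.hasOwnProperty(ch)) {
      out += ch;                              // pass through untouched
    } else {
      out += String.fromCharCode(code + SHIFT); // undo the shift
    }
  }
  return out;
}
```

Applied to the fragment VKLQJ. from the clipboard sample, this yields "shing.", and OOHJDO yields "llegal". Any further exceptions that turn up can be added to the two lookup tables.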