-
1. Re: How is it that a searchable PDF text returns found words misspelled when the text is converted to an rtf file?
Test Screen Name May 31, 2014 1:53 AM (in response to WRBulmer)
I'm not sure what you are saying. OCR is unreliable, and if you don't correct it then the text will be wrong. This seems simple and unavoidable; what do you suggest instead?
-
3. Re: How is it that a searchable PDF text returns found words misspelled when the text is converted to an rtf file?
Test Screen Name May 31, 2014 7:45 AM (in response to WRBulmer)
Your reply was empty. (You cannot attach files)
-
4. Re: How is it that a searchable PDF text returns found words misspelled when the text is converted to an rtf file?
WRBulmer May 31, 2014 8:27 AM (in response to Test Screen Name)
Not sure what happened there, but I did not attach anything; I replied to the e-mail as per instructions. Anyway... Thank you for your prompt response.
Here's the thing,
I have downloaded a historical document (1936) that is in PDF format. There were no restrictions, it is searchable. No OCR was done on my end.
As an experiment, a "Find" was done for a keyword, and returned 10 results. There were no overlooked keywords.
The document was converted into RTF, and a search for the same keyword was done. It returned only 7 results.
A spellcheck showed that the remaining 3 were spelled incorrectly and therefore could not be recognised.
For some technical reason, the PDF search recognised all 10 words even though 3 of them, according to the rtf equivalent, were spelled incorrectly.
My question is why does that happen?
One would think that if 10 words were recognised in a PDF, they would all be spelled correctly in the rtf equivalent.
How is it that the rtf equivalent returns 3 misspelled words (and of course does not recognise them) when the PDF is blind to their misspellings?
I'm hoping that someone who understands how the PDF format is structured would be able to explain why this strange behaviour occurs.
Wayne
-
5. Re: How is it that a searchable PDF text returns found words misspelled when the text is converted to an rtf file?
Test Screen Name May 31, 2014 8:33 AM (in response to WRBulmer)
I can't fault your reasoning. I suggest a closer examination by selecting text and doing a copy/paste.
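One way to carry out that kind of closer examination is sketched below in Python. It is only an illustration under stated assumptions: the text is assumed to have been copied out of the PDF (or the RTF) and pasted into a string, the keyword `harbour` and the sample sentence are made up, and the function name `near_matches` and the 0.75 similarity cutoff are arbitrary choices, not anything from Acrobat. The sketch lists every word that resembles the keyword, says whether it matches exactly, and prints its code points so that look-alike characters (for example a zero in place of the letter o) become visible.

```python
import difflib
import re


def near_matches(text, keyword, cutoff=0.75):
    """Return (word, is_exact, codepoints) for each word resembling keyword.

    The cutoff is an arbitrary similarity threshold chosen for illustration.
    """
    results = []
    target = keyword.casefold()
    for raw in re.findall(r"\S+", text):
        # Drop surrounding punctuation so "word." compares as "word".
        token = raw.strip(".,;:!?()[]\"'")
        if not token:
            continue
        ratio = difflib.SequenceMatcher(None, token.casefold(), target).ratio()
        if ratio >= cutoff:
            # Show the underlying characters, not just how they are drawn.
            codepoints = " ".join(f"U+{ord(ch):04X}" for ch in token)
            results.append((token, token.casefold() == target, codepoints))
    return results


if __name__ == "__main__":
    # Hypothetical pasted text; replace with text copied from the real file.
    pasted = "The harbour was busy. The harb0ur wall; the harbonr master."
    for word, exact, codepoints in near_matches(pasted, "harbour"):
        label = "exact" if exact else "near miss"
        print(f"{word!r}: {label}: {codepoints}")
```

Running the same check on text pasted from the PDF and on text pasted from the RTF would show whether the 3 problem words already differ in the PDF's text layer or only after conversion.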
-
6. Re: How is it that a searchable PDF text returns found words misspelled when the text is converted to an rtf file?
WRBulmer May 31, 2014 8:40 PM (in response to Test Screen Name)
Thank you for your suggestion, TSN.
At the moment I'm more curious to know why, in technical terms, a PDF search returns a found word as if it were correctly spelled, whereas after conversion to rtf that same word comes back misspelled ... making it invisible to an rtf search. Once that is understood, we can take steps to fix it.
Anyone understand the intricacies of PDF structure out there??
W/
-
7. Re: How is it that a searchable PDF text returns found words misspelled when the text is converted to an rtf file?
Test Screen Name Jun 1, 2014 12:54 AM (in response to WRBulmer)
I do understand the intricacies of PDF structure. And I can tell you it's baffling. Hence my suggestion of a deeper investigation. Unless you can share the file publicly.
-
8. Re: How is it that a searchable PDF text returns found words misspelled when the text is converted to an rtf file?
lrosenth Jun 2, 2014 9:20 AM (in response to Test Screen Name)
I'll back up TSN - this isn't a PDF format question, but more about what is present in the file that is being used by search but not by "save as RTF". The only way to know is to examine the file. If you can post it, we can look.



