What options do you use for OCR?
Hi Bernd, the PDFs in question were scanned and OCR'd by a third party. I don't know what options they used. Are the options used for OCRing likely to make a difference for XML output?
I am having a similar issue, except the OCR was done in Adobe DC. Essentially, I have been correcting the OCR layer using the "correct recognized text" option. Because I am using a scan of an old document, the text "76014" might have been OCR'd as "7B014." The scanned document is 1,000 pages, so I have made numerous corrections like this. However, when I export the PDF, those changes to the OCR are not exported. Instead, the export still shows 7B014.
If I select all in Adobe, I can copy and paste the corrected OCR. But is there a way to export the corrected OCR to XML?
What do you get when you save as Word?
I get the same error in Word or in Excel. I think the issue is that when I export, if I uncheck "recognize text if needed," the PDF is exported as images without any OCR. If I do check that box, the changes I made to the OCR in Adobe are lost because the OCR is redone, which brings back the original errors.
I am new to Adobe so I may have made an error. I had a scanned document, used enhance, and then correct recognized text. In Adobe, the OCR is now correct. Any suggestions?
Try an alternative. Export an "uncorrected" PDF to a text editor (Word, whatever).
Use the text editor to do corrections.
Export the text editor's file to XML.
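For the last step, here is a minimal sketch of turning the corrected text into XML, assuming the text editor's file has been saved as plain text. The element names (`document`, `line`) and the function name `text_to_xml` are made up for illustration; they are not an Adobe or Word export format.

```python
import xml.etree.ElementTree as ET


def text_to_xml(text, root_tag="document"):
    """Wrap corrected plain text in a simple XML tree, one <line> per non-empty line.

    The tag names are hypothetical; adapt them to whatever schema you need.
    """
    root = ET.Element(root_tag)
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        element = ET.SubElement(root, "line", n=str(number))
        element.text = line.strip()  # special characters such as & are escaped on output
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    corrected = "Part number 76014\n\nSmith & Sons"
    print(text_to_xml(corrected))
```

The `n` attribute keeps the original line number, so a corrected value can be traced back to its place in the text file.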
Still not having any luck exporting the OCR into any kind of file at all: XML, Word, DOC, RTF, etc.
Is it even possible?
Depending on the settings, many PDF files combine the original scanned document (picture only) with invisible text (for searching and copy/paste). Anything that exports XML is quite likely to ignore the invisible text. Instead of XML export, see if there's a way to extract text that works for you. The simplest is to save as TXT.
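To illustrate what "invisible text" means at the file level: the PDF specification defines text rendering mode 3 (written `3 Tr` in a page's content stream) as text that is neither filled nor stroked, and OCR layers placed over a scanned image commonly use it. Below is a rough heuristic sketch, assuming you already have a page's decoded (uncompressed) content stream as bytes. `has_invisible_text` is a hypothetical helper, not part of Acrobat or any library.

```python
import re

# "3 Tr" sets text rendering mode 3 (invisible text).
# The lookbehind avoids matching operands such as "13 Tr".
INVISIBLE_TEXT = re.compile(rb"(?<![\d.])3\s+Tr\b")


def has_invisible_text(content_stream: bytes) -> bool:
    """Return True if the decoded content stream switches to invisible text.

    Heuristic only: it does not parse the stream, so a "3 Tr" that appears
    inside a string literal would also be reported.
    """
    return INVISIBLE_TEXT.search(content_stream) is not None


if __name__ == "__main__":
    ocr_layer = b"BT 3 Tr /F1 12 Tf (76014) Tj ET"
    visible = b"BT 0 Tr /F1 12 Tf (76014) Tj ET"
    print(has_invisible_text(ocr_layer))
    print(has_invisible_text(visible))
```

In real files the content stream is usually compressed, so it has to be decoded first before a check like this can see the operators.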
Test Screen Name, thanks for your reply. Saving as TXT doesn't yield any content at all.
I have found a solution, though - use Abbyy FineReader instead of Adobe Acrobat Pro to export.
It looks like OCR'd text in Adobe PDFs can only be exported by using whatever software generated the OCR in the first place.
Can you elaborate on the solution you found? Is it as simple as opening the PDF in Abbyy FineReader and choosing export?
(I do not currently have Abbyy FineReader, so I can't see for myself. I am deciding whether to purchase it for this specific reason, as I have cleaned up hundreds of pages using the invisible OCR in Adobe, which I am currently unable to export cleanly.)
Hi Alex, looks like I've made an erroneous assumption.
I thought AbbyyFR was using the existing OCR layer in OCR'd PDFs, but it turns out it was scanning them anew. So my assumption that Abbyy was reading the existing OCR is not correct. Drat, it made sense at the time.
Now to try to figure out how to uncorrect that "Correct Answer".
We apologize for the delayed response and for any inconvenience caused.
Please try the following steps:
1. Open the PDF file
2. Go to "Tools" -> "Enhance Scans"
3. Select "Recognize Text" -> "In this File" -> "Settings"
4. Select "Editable Text and Images" from the "Output" dropdown and click "OK"
5. Click on the "Recognize Text" button
6. Select "Recognize Text" from the menu again -> "Correct Recognized Text"
7. Make the corrections and save the PDF
8. Now, try exporting the saved file to any format you want (the corrected OCR'd text should be exported)
Please let us know if this helps.
Thanks and Regards,