13 Replies Latest reply on May 17, 2018 5:28 AM by MarkDickerson

    Export OCR in PDF to XML

    bernadetteh11333229 Level 1

      Is it possible to export the OCR in a PDF into an XML file?

      I've tried using File - Save As - XML (with various settings), but that doesn't save the OCR'd text.

      With Content Editing - Export, there is no XML option.

      OCR seems to be output with every other file type, except for XML.

      I've also tried saving as a Word document and then saving the resulting Word document as XML, but that doesn't work either, as it seems to turn images into XML. I just want the OCR'd text.

      I'm using Adobe Acrobat XI Pro.

        • 1. Re: Export OCR in PDF to XML
          Bernd Alheit Adobe Community Professional & MVP

          What options do you use for OCR?

          • 2. Re: Export OCR in PDF to XML
            bernadetteh11333229 Level 1

            Hi Bernd, the PDFs in question are scanned and OCR'd by a third party. I don't know what options they used. Are the options used for OCRing likely to make a difference for XML output?

            • 3. Re: Export OCR in PDF to XML
              alexw71856384 Level 1

              @Bernd Alheit

              I am having a similar issue, except OCR was done in Adobe DC. Essentially, I have been correcting the OCR layer using the "correct recognized text" option. Because I am using a scan of an old document, the text "76014" might have been OCR'd as "7B014." The scanned document is 1,000 pages, so I have made numerous corrections like this. However, when I export the PDF, those changes to the OCR are not exported. Instead, the export still shows 7B014.

              If I select all in Adobe, I can copy and paste the corrected OCR. But is there a way to export the corrected OCR to XML?

              • 4. Re: Export OCR in PDF to XML
                Bernd Alheit Adobe Community Professional & MVP

                What do you get when you save as Word?

                • 5. Re: Export OCR in PDF to XML
                  alexw71856384 Level 1

                  I get the same error in Word or in Excel. I think the issue is that when I export, if I uncheck "recognize text if needed," the PDF is exported as images without any OCR. If I do check that box, the changes I made to the OCR in Adobe are lost, because the OCR is redone and the original errors come back.

                  I am new to Adobe so I may have made an error. I had a scanned document, used enhance, and then correct recognized text. In Adobe, the OCR is now correct. Any suggestions?

                  • 6. Re: Export OCR in PDF to XML
                    CtDave Level 5

                    Try an alternative. Export an "uncorrected" PDF to a text editor (Word, whatever).

                    Use the text editor to do corrections.

                    Export the text editor's file to XML.
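
For the last step, here is a minimal Python sketch, assuming the corrected text has been saved as a plain-text file. The form-feed page separator and the element names `document`, `page` and `line` are assumptions made up for illustration; they are not an Adobe schema, so adjust them to whatever your export actually produces.

```python
import xml.etree.ElementTree as ET


def text_to_xml(text: str, page_sep: str = "\f") -> str:
    """Wrap plain text in a simple XML tree.

    page_sep and the element names are assumptions for illustration only.
    """
    root = ET.Element("document")
    for number, page_text in enumerate(text.split(page_sep), start=1):
        page = ET.SubElement(root, "page", n=str(number))
        for raw_line in page_text.splitlines():
            if raw_line.strip():  # skip blank lines
                ET.SubElement(page, "line").text = raw_line.strip()
    # ElementTree escapes &, < and > in element text when serializing.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "Invoice 76014\nTotal: 12 & 3\fSecond page"
    print(text_to_xml(sample))
```

This only preserves the text and a rough page/line structure. Layout, fonts and images are deliberately left out, which matches the "I just want the OCR'd text" requirement from the original question.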

                     

                    Be well...

                    • 7. Re: Export OCR in PDF to XML
                      bernadetteh11333229 Level 1

                      Still not having any luck exporting the OCR into any kind of file at all: XML, Word, DOC, RTF, etc.

                      Is it even possible?

                      • 8. Re: Export OCR in PDF to XML
                        Test Screen Name Most Valuable Participant

                        Depending on the settings, in many PDF files you have the original scanned document (picture only) combined with invisible text (for searching and copy/paste). Anything that exports XML is quite likely to ignore invisible stuff. Instead of XML export see if there's a way to extract text that works for you. Simplest is save as TXT.

                        • 9. Re: Export OCR in PDF to XML
                          bernadetteh11333229 Level 1

                          Test Screen Name, thanks for your reply. Saving as TXT doesn't yield any content at all.

                          I have found a solution, though: use Abbyy FineReader instead of Adobe Acrobat Pro to export.

                          It looks like OCR'd text in Adobe PDFs can only be exported by using whatever software generated the OCR in the first place.

                          • 10. Re: Export OCR in PDF to XML
                            alexw71856384 Level 1

                            Can you elaborate on the solution you found? Is it as simple as opening the PDF in Abbyy FineReader and choosing export?

                            (I do not currently have Abbyy FineReader, so I can't see for myself. I am deciding whether to purchase it for this specific reason, as I have cleaned up hundreds of pages using the invisible OCR in Adobe, which I am currently unable to export cleanly.)

                            • 11. Re: Export OCR in PDF to XML
                              bernadetteh11333229 Level 1

                              Hi Alex, looks like I've made an erroneous assumption.

                              I thought Abbyy FineReader was using the existing OCR layer in OCR'd PDFs, but it turns out it was scanning them anew. So my assumption that Abbyy was reading the existing OCR is not correct. Drat, it made sense at the time.

                              Now to try to figure out how to uncorrect that "Correct Answer".

                              • 12. Re: Export OCR in PDF to XML
                                girijaAgarwal Adobe Employee

                                Hi Bernadette/Alex,

                                 

                                We apologize for the delay in response and the inconvenience thus caused to you.

                                 

                                Please try the following steps:

                                1. Open the PDF file

                                2. Go to "Tools" -> "Enhance Scans"

                                3. Select "Recognize Text" -> "In this File" -> "Settings"

                                4. Select "Editable Text and Images" from the "Output" dropdown and Click "OK"

                                5. Click on the "Recognize Text" button

                                6. Select "Recognize Text" from the menu again -> "Correct Recognized text"

                                7. Make the corrections and save the PDF

                                8. Now, try exporting the saved file to any format you want (The corrected OCR'ed text should be exported)

                                 

                                Please let us know if this helps.

                                 

                                Thanks and Regards,

                                Girija

                                • 13. Re: Export OCR in PDF to XML
                                  MarkDickerson Level 1

                                  Hey Bernadette, Alex, Girija,

                                   

                                  I've attempted three different methods to export corrected OCR'ed text, with three different and ultimately unsatisfactory results.

                                   

                                  Method 1:

                                  1. Open the PDF file
                                  2. Go to "Tools" -> "Enhance Scans"
                                  3. Select "Recognize Text" -> "In this File" -> "Settings"
                                  4. Select "Searchable Image" from the "Output" dropdown and Click "OK"
                                  5. Click on the "Recognize Text" button
                                  6. Select "Recognize Text" from the menu again -> "Correct Recognized text"
                                  7. Make the corrections and save the PDF
                                  8. Export the saved file to .doc and .txt

                                  Result: I got the same results as alexw71856384. The exported text is uncorrected. I would guess that Test Screen Name is correct: the export is ignoring the invisible layer (which contains the corrections) and just re-OCRing the entire document.

                                   

                                  Method 2 (based on girijaAgarwal's suggestion):

                                  1. I started with my corrected OCR text from Method 1 (steps 1-7)
                                  2. Select "Editable Text and Images" from the "Output" dropdown and click "OK"
                                  3. Click on the "Recognize Text" button.
                                  4. This successfully converted my corrected OCR text from a Searchable Image to Editable Text and Images (see: Better PDF OCR. ClearScan is smaller, looks better)
                                  5. Export the saved file to .doc and .txt

                                  Result: girijaAgarwal, this was by far the worst option. I got an unusable mess: invisible characters/words, out of order etc.

                                   

                                  Method 3:

                                  1. I started with my corrected OCR text from Method 1 (steps 1-7)
                                  2. I used "Preflight" -> "Make OCR text visible" (detailed instructions: Hidden Gems in Acrobat DC: How to Optimize Hidden OCR Text | Adobe Blog)
                                  3. Open the Layers panel on the left to reveal the new layers.
                                  4. Toggle the 'Invisible text' layer to on, and the 'Visible page content' layer to off.
                                  5. Change the layer settings so that the 'Invisible text' layer always exports and the 'Visible page content' layer never exports
                                  6. Export the saved file to .doc (without images) and .txt

                                  Result: The good news is that this method exported the OCR'ed text with corrections. Unfortunately, it introduced new errors into the exported text, mainly missing spaces and extra spaces that weren't in Method 1's output or in the corrected OCR text in the PDF document.
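
As a rough way to deal with the spacing problems from Method 3, here is a small Python sketch (my own assumption about how one might post-process the export, not something Acrobat provides). Extra spaces can be collapsed mechanically. Missing spaces cannot be restored this way, but you can at least check the export against text copied out of the PDF with select all and copy/paste, to confirm that whitespace is the only difference. The function names are hypothetical.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace runs of spaces/tabs with one space and trim each line."""
    return "\n".join(
        re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()
    )


def differs_only_in_whitespace(exported: str, reference: str) -> bool:
    """True if both texts are identical once all whitespace is removed."""
    def strip_all(s: str) -> str:
        return re.sub(r"\s+", "", s)

    return strip_all(exported) == strip_all(reference)


if __name__ == "__main__":
    exported = "Part  number76014  shipped"    # text from the Method 3 export
    reference = "Part number 76014 shipped"    # text copied from the PDF
    print(collapse_spaces(exported))
    print(differs_only_in_whitespace(exported, reference))
```

If the check returns False, the export differs from the corrected text in more than spacing (for example, a "7B014" that should be "76014"), and it is worth diffing the two files before trusting the export.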