How does Adobe ExportPDF recognize text and support different languages?

Version 3

    Does Adobe ExportPDF support OCR (Optical Character Recognition)?

    Yes. Adobe ExportPDF supports optical character recognition (OCR) when you convert a PDF file to Word (.doc and .docx), Excel (.xlsx), or RTF (Rich Text Format). OCR converts images of text (such as scanned text) into editable characters, so that you can search, correct, and copy the text.


    When should I use OCR in Adobe ExportPDF?

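    Open your PDF file in Adobe Reader and try to select the text with your mouse. If you can select it, you likely don't need to apply OCR to your file when converting it with ExportPDF. If you cannot select it, the text is most likely stored as an image (for example, a scanned page), and OCR is needed to turn it into editable text.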
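
    The selectable-text check above can be expressed as a small rule of thumb. The sketch below is only an illustration: the function name likely_needs_ocr and the idea of passing in the text you were able to select on each page are assumptions made for this example, not part of Adobe ExportPDF.

```python
def likely_needs_ocr(page_texts):
    """Return True if at least one page has no selectable text.

    page_texts: one string per page, holding whatever text could be
    selected or copied from that page (empty if nothing is selectable).
    This is a hypothetical helper, not an Adobe ExportPDF function.
    """
    # A page with only whitespace counts as having no selectable text.
    return any(not text.strip() for text in page_texts)


if __name__ == "__main__":
    # Second page is a scan: nothing can be selected on it.
    print(likely_needs_ocr(["Quarterly report", ""]))
    # Text can be selected on every page.
    print(likely_needs_ocr(["Quarterly report", "Summary"]))
```

    A file that mixes typed pages and scanned pages still benefits from OCR, which is why the sketch flags the file as soon as a single page has no selectable text.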


    How do I set up Adobe ExportPDF to perform OCR when converting a PDF file?

    You don't need to do anything: OCR is enabled by default. You will usually see better results by leaving it enabled (see step 2 below). The steps for converting a file with OCR are:

    1. Log in to Adobe ExportPDF and choose a file type from the "Export PDF file to:" drop-down menu.
    2. Make sure the "Recognize scanned text in:" check box is selected (this is the default setting).
    3. Choose a language from the drop-down menu located under "Recognize scanned text in:".
    4. Select the PDF file that you want to convert, and click Open.
    5. Click Save to save the exported file.

    What elements in a PDF file are interpreted by the ExportPDF OCR engine?

    When OCR is enabled, Adobe ExportPDF performs OCR on PDF files that contain:

    • Images
    • Vector art
    • Hidden text
    • Text that cannot be interpreted because it was incorrectly encoded by the source application


    Can OCR be turned on/off in ExportPDF?

    Yes. If you do not want Adobe ExportPDF to convert images of text to editable/renderable text, consider disabling OCR. With OCR disabled, Adobe ExportPDF can process the PDF document more quickly, and the resulting Office file contains embedded images instead of editable/renderable text. Likewise, if the PDF you are converting contains a mix of formatted text and graphics, and you want to leave the graphics as they are, disable OCR. Finally, you can disable OCR if you find that Adobe ExportPDF does not interpret text correctly (because it was encoded incorrectly). The screen capture below shows how to enable or disable OCR by toggling the check box labeled "Recognize scanned text in:".



    What languages are supported for OCR by Adobe ExportPDF?

    Adobe ExportPDF supports the following languages for OCR:

    • English (US)
    • English (UK)
    • German
    • Spanish
    • French
    • Italian
    • Japanese



    By default, OCR uses the language selected in the My Information dialog box (see screen capture below).  



    The OCR engine uses the selected language to interpret the scanned text. Selecting the correct language improves the accuracy of the conversion, because the OCR engine uses language-specific dictionaries for conversion. For languages written in non-Latin scripts, such as Japanese, the OCR engine cannot interpret and convert the text unless you've selected the appropriate language.


    What if the language in my PDF is not supported for OCR by Adobe ExportPDF?

    If OCR is enabled, but the language in the PDF is not supported for OCR by Adobe ExportPDF, the Office file that results from the conversion will contain incorrectly interpreted text.
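
    The language rules described above (an explicit choice in the drop-down menu, otherwise the language from the My Information dialog box, and a fixed list of supported languages) can be summarized in a short sketch. Everything here is illustrative: the names SUPPORTED_OCR_LANGUAGES, resolve_ocr_language, and is_supported_for_ocr are assumptions made for this example and do not correspond to any Adobe ExportPDF interface.

```python
# Languages listed in this article as supported for OCR.
SUPPORTED_OCR_LANGUAGES = {
    "English (US)",
    "English (UK)",
    "German",
    "Spanish",
    "French",
    "Italian",
    "Japanese",
}


def resolve_ocr_language(selected, account_language):
    """Return the language OCR would use (hypothetical helper).

    selected: language chosen in the drop-down menu, or None if untouched.
    account_language: language set in the My Information dialog box.
    """
    # An explicit selection wins; otherwise fall back to the account setting.
    return selected if selected else account_language


def is_supported_for_ocr(document_language):
    """Return True if the document's language is in the supported list."""
    return document_language in SUPPORTED_OCR_LANGUAGES


if __name__ == "__main__":
    # No explicit choice: the account language is used.
    print(resolve_ocr_language(None, "German"))
    # A Japanese scan needs Japanese selected explicitly.
    print(resolve_ocr_language("Japanese", "English (US)"))
    # A language outside the list is likely to produce misread text.
    print(is_supported_for_ocr("Korean"))
```

    Running a check like is_supported_for_ocr before converting makes it easier to decide whether to leave OCR enabled or to disable it for a document whose language is not on the list.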