3 Replies Latest reply on Oct 27, 2016 3:25 AM by coldbreeze16

    Extracting text from exported PDF

    coldbreeze16

      When text in an Asian complex script (Unicode) is exported from InDesign CC to Adobe PDF, the text can't be extracted properly, although it displays fine. I have tested mostly with Indic scripts. If I open the PDF in any third-party PDF reader, the copied text is largely garbled: many letters are replaced by other letters or even by Latin characters. If I open it in Adobe Reader or Acrobat, most of the text copies fine, with occasional misplaced characters and unintended glyphs. The biggest problem in that case, though, is the extra white space that appears between letters and breaks the searchability of the PDF.
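
      The two symptoms described here (foreign characters and stray spaces) can be checked mechanically on text copied out of the PDF. The following is only a minimal Python sketch, assuming the text is expected to contain nothing but Odia (Unicode block U+0B00 to U+0B7F), whitespace, digits, punctuation and joiners; the function names are hypothetical and not part of any PDF tool.

```python
import unicodedata

# Odia (Oriya) Unicode block. Assumption: no other script is expected.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F
JOINERS = {0x200C, 0x200D}  # ZWNJ / ZWJ are legitimate in Indic text


def suspicious_chars(text):
    """Return (index, char, name) for characters that are not Odia,
    whitespace, punctuation, decimal digits, or joiners."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if ODIA_START <= cp <= ODIA_END or cp in JOINERS or ch.isspace():
            continue
        cat = unicodedata.category(ch)
        if cat.startswith("P") or cat == "Nd":
            continue
        hits.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


def stray_spaces(text):
    """Return indices of whitespace directly followed by a combining mark.
    A vowel sign or virama can never start a word, so such a space is
    almost certainly an extraction artifact."""
    return [
        i
        for i in range(len(text) - 1)
        if text[i].isspace() and unicodedata.category(text[i + 1]).startswith("M")
    ]
```

      Running both functions over text copied from different readers gives a rough count of how badly each reader's extraction differs from the source text.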

       

      Here is an example PDF containing the Universal Declaration of Human Rights in the Odia language. The document uses the Kalinga font, which has shipped with Windows since Vista.

      universal.pdf - Google Drive

      And here is the source InDesign file: universal.indd - Google Drive

      Also, I couldn't find anything unusual in the /ToUnicode table.
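
      For anyone who wants to inspect the /ToUnicode table themselves: once the CMap stream has been decompressed (for example by rewriting the file with `qpdf --qdf`), its `bfchar` and `bfrange` sections can be read with a few lines of Python. This is a minimal sketch, not a full CMap parser: it assumes well-formed hex strings, ignores the array form of `bfrange`, and `parse_tounicode` is a hypothetical helper name.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Map character codes to Unicode strings using the bfchar and
    bfrange sections of a decompressed ToUnicode CMap."""
    mapping = {}

    # bfchar entries: <code> <utf-16be target>
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be", "replace")

    # bfrange entries: <lo> <hi> <target for lo>; the last UTF-16 unit of
    # the target is incremented for each following code.
    # The array form (<lo> <hi> [<a> <b> ...]) is not handled here.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            base = bytes.fromhex(dst)
            last = int.from_bytes(base[-2:], "big")
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                raw = base[:-2] + ((last + offset) & 0xFFFF).to_bytes(2, "big")
                mapping[code] = raw.decode("utf-16-be", "replace")
    return mapping


sample = """
2 beginbfchar
<0003> <0B15>
<0020> <0B150B4D0B37>
endbfchar
1 beginbfrange
<0010> <0012> <0B2A>
endbfrange
"""
table = parse_tounicode(sample)
```

      In the made-up sample, code 0x0020 maps to three code points, which is how a single conjunct glyph is normally represented. Glyphs that have no entry at all, or whose entry does not match what they draw, are the ones a reader cannot copy correctly.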


      I asked this before on Super User, where people pointed out that it could be due to InDesign's handling rather than the PDF format itself: printing - Text in PDF turns gibberish on copying but displays fine - Super User

       

      I would be glad to get any suggestions.