9 Replies Latest reply on Sep 30, 2013 4:04 AM by Numediaweb

    Fix Arabic characters copy on a PDF made using InDesign CS

    Numediaweb Level 1

      Hi there,

       

      I'm trying to copy some text from a PDF that was created using InDesign CS. The text is in Arabic. I can read it correctly, but when I copy it, it turns into strange characters. What's confusing is that the Latin text copies correctly!

       

      You can view an extracted page from this PDF here: https://db.tt/XpLvmJE0

       

      Is the error related to the font, or to the old InDesign CS (codenamed Dragontail, October 2003)?

       

      Thanks for your help.

        • 1. Re: Fix Arabic characters copy on a PDF made using InDesign CS
          Michael Witherell Adobe Community Professional

          I would suspect that first font in the Document Properties listing of fonts. It would be nice if the fonts used were OpenType, and if you also had those same fonts installed.

          • 2. Re: Fix Arabic characters copy on a PDF made using InDesign CS
            Joel Cherney Adobe Community Professional & MVP

            Mike is right. It's an AXT font, which is associated with ArabicXT, a Quark XTension from ye olden days, pre-Unicode. InDesign was able to support AXT fonts up until recently, but they are extremely finicky and not easy to work with.

             

            So it's really not an InDesign problem at all. Those "strange characters" are just the funky encoding of the font. It's what pretty much anything outside of ASCII used to look like when the font dropped out.
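To see what is actually being copied, it can help to list the code points of the pasted text. This is only a minimal sketch, assuming PHP 7.2 or later with the mbstring extension; `dumpCodePoints` is a made-up helper name, and the sample string is a fragment of copied text that appears later in this thread.

```php
<?php
// Hypothetical helper: list the Unicode code points of a pasted string.
function dumpCodePoints(string $text): array
{
    // Split on character boundaries (the /u flag makes the pattern UTF-8 aware).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return array_map(function ($ch) {
        return sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
    }, $chars);
}

echo implode(' ', dumpCodePoints('»°ù«îÑJ')), PHP_EOL;
// U+00BB U+00B0 U+00F9 U+00AB U+00EE U+00D1 U+004A
```

Every value is a Latin or symbol code point, not an Arabic one, which is why the pasted text looks wrong even though the PDF displays correctly.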

             

            Because I specialize in the layout of complex scripts that have historically had strange encoding systems (like, you know, Khmer or Ge'ez or Tifinagh) I have quite a few converters lying around from the time when I regularly had to convert pre-Unicode encoding systems. But, because I was never really much of a Quark user, I'm not sure if I have a good AXT to Unicode converter. And the heyday of AXT fonts was so long ago that English-language Google searches don't turn up much.

             

            If you're a Mac user, then Arabic Genie looks like your best bet. But if you can search in Arabic I bet you will find quite a few without too much hassle. This kind of conversion is a task that many software developers cut their teeth on, so there are often free converters on the Web that will let you feed in Your Obsolete Proprietary Encoding and get out True Well-Formed Unicode.

             

            Usually. Proof it after you convert!

            • 3. Re: Fix Arabic characters copy on a PDF made using InDesign CS
              Numediaweb Level 1

              Thanks guys for your replies. Very helpful.

               

              @Mike
              I have both fonts installed on my Win 7 64-bit PC, but the text still looks bizarre once copied outside Acrobat. I even tried copying a portion of the text into InDesign CS6 itself, but it doesn't recognize the Arabic text either.

               

              @Joel

              What a coincidence! I'm working in the Tifinagh language domain too.

              I need to develop a solution to batch-copy this text (by converting the PDF to XML, then using PHP to store it in a MySQL database) and publish it on the open-source Berber dictionary project amawal.net.

               

              After searching further, I found this article about AXt font encoding quite interesting, but I don't have the FontLab or Python experience to decode it.

               

              Any ideas?

               

              Thanks

               

              PS: I contacted IRCAM (the publisher of this document) and asked if they could provide a document set in OpenType fonts.

              • 4. Re: Fix Arabic characters copy on a PDF made using InDesign CS
                Joel Cherney Adobe Community Professional & MVP

                I've done exactly two Tifinagh script jobs in my entire career - and both of 'em were "If you need an interpreter" stickers. However, I am the figure-out-complex-scripts guy at the nonprofit where I work, and what you have there is my stock in trade - an obsolete encoding system for a complex script that you want to drag into the 21st century, and automate the process.

                 

                If you understand that very well-written and comprehensive article, and you know enough about Arabic to feel confident about decoding it, and you know any programming language at all, you can probably write your own converter. The FontLab stuff is only for converting the fonts themselves. You don't need that - you only need to run a substitution on the glyphs themselves. I could do so - I mean, I've written similar converters for Khmer and Burmese, etc. in JavaScript for InDesign - but I am such an incredibly slow and clumsy scripter that it'd be better for me to just go out and buy the utility that does it for me automagically. (Unless I was actively wanting to bash my head against ExtendScript for the joy of solving a puzzle - which I do, sometimes, when I have lots of free time, but that's not how my life works right now.)
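A glyph substitution of this kind can be very small. Here is a rough sketch in PHP, not a finished converter: `substituteGlyphs` and `$legacyToUnicode` are hypothetical names, and the five table entries are an illustrative subset taken from the mapping posted later in this thread. A real converter needs the font's full table.

```php
<?php
// Illustrative subset of a legacy-to-Unicode table (assumed, not complete).
$legacyToUnicode = [
    'J' => 'ت',
    'Ñ' => 'ب',
    'î' => 'خ',
    '«' => 'ي',
    'ù' => 'س',
];

// strtr() with an array replaces every key it finds and leaves
// characters that are not in the table untouched.
function substituteGlyphs(string $text, array $table): string
{
    return strtr($text, $table);
}

// The result is still in the PDF's left-to-right glyph order,
// so a separate step has to restore Arabic's logical order.
echo substituteGlyphs('JÑ', $legacyToUnicode), PHP_EOL;
```

Because the substitution leaves unknown characters alone, gaps in the table show up as leftover Latin characters in the output, which makes proofing easier.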

                • 5. Re: Fix Arabic characters copy on a PDF made using InDesign CS
                  Joel Cherney Adobe Community Professional & MVP

                  Oh, heck! I said I'd just buy the app, but Arabic Genie is free! If you can extract the text, but don't have a Mac, I'd be happy to run the converter on it.

                  • 6. Re: Fix Arabic characters copy on a PDF made using InDesign CS
                    Numediaweb Level 1

                    "figure-out-complex-scripts guy... a complex script that you want to drag into the 21st century"

                     

                    You are right, Joel. Trying to bring this badly written AXt-font-based text to life is a headache!

                     

                    Since I'm a native speaker of both Arabic and Berber, I have no problems with the language. I tried a find-and-replace to change some of the characters to Arabic, but I think I'll need to come up with a better encoder.

                    I will use Font Reporter to get the character table of this AXt font (which uses ‘Mac Roman’ encoding); then I can probably run a better converter on it.

                     

                    Thank you, Joel, for offering to convert the text (especially since I don't have a Mac right now), but there are lots of documents and I don't want to bother you with that. However, if you have some time and want to have some fun with it (for old times' sake!), here's the document I'm working on right now; I just exported it to XML from this original PDF.

                     

                    Thanks for your time and help.

                    • 7. Re: Fix Arabic characters copy on a PDF made using InDesign CS
                      Numediaweb Level 1

                      Resolved!

                       

                      I got it working with some help from PHP

                       

                      1. utf8ToUnicodeCodePoints: turn each character into its \uXXXX escape (I use preg_replace_callback with both ord and json_encode)
                      2. remapCodePoints: map each escape to the Arabic character it stands for
                      3. then reverse the string to match Arabic's RTL direction

                      The code:

                       

                      $str = '»°ù«îÑJ';
                      echo utf8_strrev( utf8ToUnicodeCodePoints($str) ); // echoes; تبخيسـي

                      function utf8ToUnicodeCodePoints($str) {
                          if (!mb_check_encoding($str, 'UTF-8')) {
                              trigger_error('$str is not encoded in UTF-8, I cannot work like this');
                              return false;
                          }

                          return preg_replace_callback('/./u', function ($m) {
                              // ord() only looks at the first byte of the matched character
                              $ord = ord($m[0]);
                              $decoded = '';

                              // A first byte of 127 or less means a single-byte ASCII character
                              if ($ord <= 127) {
                                  $decoded = remapCodePoints( sprintf('\u%04x', $ord) );
                              }
                              // A first byte above 127 starts a multi-byte UTF-8 character;
                              // json_encode() turns it into an escape such as \u00bb
                              else {
                                  $json = trim(json_encode($m[0]), '"');
                                  $decoded = remapCodePoints($json);
                              }

                              // $decoded now holds the Arabic text mapped from that escape
                              return $decoded;
                          }, $str);
                      }

                       

                      // Match code points to new ones
                      function remapCodePoints($str)
                      {
                          /*
                            Total glyphs in AXtSAlwaBold: 108.
                          */
                          $unicode_table = array(
                              '\u03bc'=>'ك', '\u03a0'=>'ل',
                              '\u0020'=>' ', '\u0028'=>')', '\u0029'=>'(', '\u002c'=>'،',
                              '\u0041'=>'ء', '\u0042'=>'~', '\u0043'=>'', '\u0044'=>'', '\u0045'=>'', '\u0046'=>'ئ', '\u0047'=>'ا', '\u0048'=>'ب', '\u0049'=>'ة', '\u004a'=>'ت',
                              '\u004b'=>'ث', '\u004c'=>'ج', '\u004d'=>'ح', '\u004e'=>'خ', '\u004f'=>'د',
                              '\u0051'=>'ر', '\u0052'=>'ز', '\u0053'=>'س', '\u0054'=>'ش', '\u0055'=>'ص', '\u0056'=>'ض', '\u0057'=>'ط', '\u0058'=>'ظ', '\u0059'=>'ع', '\u005a'=>'غ',
                              '\u0061'=>'ف', '\u0062'=>'ق', '\u0063'=>'ك', '\u0064'=>'ل', '\u0065'=>'م', '\u0066'=>'ن', '\u0067'=>'ه', '\u0068'=>'و', '\u0069'=>'ى', '\u006a'=>'ي',
                              '\u006e'=>'َ', '\u006f'=>'ُ',
                              '\u0071'=>'ّ', '\u0072'=>'ْ', '\u0073'=>'َّ', '\u0074'=>'ُّ', '\u0075'=>'ِّ',
                              '\u00c4'=>'ئ', '\u00c9'=>'ا', '\u00d1'=>'ب', '\u00d6'=>'ب', '\u00dc'=>'ب', '\u00e1'=>'ة', '\u00e0'=>'ت', '\u00e2'=>'ت', '\u00e4'=>'ت', '\u00e3'=>'ث',
                              '\u00e5'=>'ث', '\u00e7'=>'ث', '\u00e9'=>'ج',
                              '\u00eb'=>'ح', '\u00ed'=>'ح', '\u00ee'=>'خ', '\u00ef'=>'خ', '\u00f3'=>'د', '\u00f2'=>'ذ', '\u00f4'=>'ر', '\u00f5'=>'ز', '\u00f9'=>'س', '\u00fb'=>'ش',
                              '\u00fc'=>'ص',
                              '\u2020'=>'ض', '\u00b0'=>'', '\u00a2'=>'', '\u00a3'=>'ط', '\u00a7'=>'ط', '\u2022'=>'ط', '\u00b6'=>'ظ', '\u00a9'=>'ع', '\u2122'=>'ع', '\u00b4'=>'ع',
                              '\u00a8'=>'غ', '\u00d8'=>'ف',
                              '\u221e'=>'ف', '\u00b1'=>'ف', '\u2264'=>'ق', '\u2265'=>'ق', '\u00a5'=>'ق', '\u00b5'=>'ك', '\u2202'=>'ك', '\u2211'=>'ك', '\u220f'=>'ل', '\u03c0'=>'ل',
                              '\u222b'=>'ل', '\u00aa'=>'م', '\u00ba'=>'م', '\u2126'=>'م', '\u00e6'=>'ن', '\u00f8'=>'ن',
                              '\u00bf'=>'ن', '\u00a1'=>'ه', '\u00ac'=>'ه', '\u0192'=>'و', '\u2248'=>'ى', '\u00ab'=>'ي', '\u00bb'=>'ي', '\u2026'=>'ي',
                              '\u2019'=>'لا',
                              '\u00d3'=>'لا'
                          );

                          // Code points missing from the table are dropped instead of raising an undefined-index notice
                          return isset($unicode_table[$str]) ? $unicode_table[$str] : '';
                      }

                       

                      // reverse the direction
                      function utf8_strrev($str){
                          preg_match_all('/./us', $str, $ar);
                          return join('',array_reverse($ar[0]));
                      }
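One thing to watch with the order of the steps: reversing after the mapping also reverses any replacement that is longer than one character, so the lam-alef entries ('لا') come out as alef-lam. A possible variant, sketched here under assumptions (PHP 7 or later for the `\u{...}` escapes; `reverseThenMap`, `mapThenReverse` and the two-entry `$table` are hypothetical and only mirror two entries of the table above), reverses the glyph order first and maps afterwards.

```php
<?php
// Two entries mirrored from the table above (illustrative only):
// U+2019 is the lam-alef ligature glyph, 'J' is teh.
$table = [
    "\u{2019}" => "\u{0644}\u{0627}", // lam + alef
    'J'        => "\u{062A}",         // teh
];

// Variant: flip the visual (left-to-right) glyph order into logical order,
// then substitute, so multi-character replacements keep their internal order.
function reverseThenMap(string $visual, array $table): string
{
    preg_match_all('/./us', $visual, $m);
    return strtr(implode('', array_reverse($m[0])), $table);
}

// Order used by the code above: substitute first, then reverse everything.
function mapThenReverse(string $visual, array $table): string
{
    preg_match_all('/./us', strtr($visual, $table), $m);
    return implode('', array_reverse($m[0]));
}

$visual = "\u{2019}J"; // ligature glyph followed by teh, as laid out left to right

var_dump(reverseThenMap($visual, $table) === "\u{062A}\u{0644}\u{0627}"); // bool(true): teh, lam, alef
var_dump(mapThenReverse($visual, $table) === "\u{062A}\u{0627}\u{0644}"); // bool(true): teh, alef, lam
```

With this order the parentheses would still need their swapped entries, since the glyph order is reversed either way. Whether the multi-character diacritic entries need the same treatment depends on how the font orders its marks, so proof the output.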

                      • 8. Re: Fix Arabic characters copy on a PDF made using InDesign CS
                        Joel Cherney Adobe Community Professional & MVP

                        Ha! I would have done something much like that in JavaScript. There are parts of PHP syntax that simply make no sense to me at all, but yup, that's how it works. 1000 thanks go to you for sharing your code.

                        • 9. Re: Fix Arabic characters copy on a PDF made using InDesign CS
                          Numediaweb Level 1

                          I chose PHP because I needed to parse XML and be able to send data to MySQL. Thanks to your hints about font encoding and this question on Stack Exchange, where they mentioned Font Reporter, I got the necessary tools to begin programming my first font decoder!

                          The original PDF was full of errors, even for the Latin characters, which made it hard to use regex to put every string block in its place. For example, "fi" was written "ﬁ" (they look the same, but trust me, they are different: the second "ﬁ" is only one character!).
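For anyone hitting the same ligature problem, one way to deal with it before running any regex is to expand the Latin ligature characters back into plain letters. This is a small sketch, assuming PHP 7 or later; `expandLigatures` and `LATIN_LIGATURES` are hypothetical names, and the list covers only the f-ligatures U+FB00 to U+FB04.

```php
<?php
// Latin f-ligatures (U+FB00..U+FB04) and their plain-letter spellings.
const LATIN_LIGATURES = [
    "\u{FB00}" => 'ff',
    "\u{FB01}" => 'fi',
    "\u{FB02}" => 'fl',
    "\u{FB03}" => 'ffi',
    "\u{FB04}" => 'ffl',
];

// Replace each ligature character with the letters it stands for.
function expandLigatures(string $text): string
{
    return strtr($text, LATIN_LIGATURES);
}

$word = "\u{FB01}nal"; // looks like "final", but starts with one ligature character

var_dump(mb_strlen($word, 'UTF-8'));          // int(4)
var_dump(expandLigatures($word) === 'final'); // bool(true)
```

If the intl extension is available, `Normalizer::normalize()` with the NFKC form should handle these and many other compatibility characters in one call, though it also rewrites things you may want to keep, so it is worth testing on a sample first.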

                          I posted the code so other people can use it and enjoy font reversing!