5 Replies Latest reply: Apr 16, 2013 12:51 AM by aqua100

    how to map from cid to unicode

    aqua100 Community Member

      Hello.

      I'm trying to convert CIDs to Unicode using the ToUnicode CMap.

      The toUnicode cmap I extracted is as follows:

       

      /CIDInit /ProcSet findresource begin
      12 dict begin
      begincmap
      /CIDSystemInfo
      << /Registry (Adobe)
      /Ordering (UCS) /Supplement 0 >> def
      /CMapName /Adobe-Identity-UCS def
      /CMapType 2 def
      1 begincodespacerange
      <0000> <FFFF>
      endcodespacerange
      35 beginbfchar
      <0F3B> <7528>
      <0CA1> <8AAD>
      <0F62> <5229>
      <034B> <3042>
      <034D> <3044>
      <0358> <304F>
      <027B> <3002>
      <027C> <FF0C>
      <035D> <3054>
      <035E> <3055>
      <0360> <3057>
      <0369> <3060>
      <0370> <3067>
      <0372> <3069>
      <0373> <306A>
      <0294> <30FC>
      <0374> <306B>
      <0378> <306F>
      <0388> <307F>
      <0394> <308B>
      <03A5> <30A9>
      <02CE> <FF0A>
      <03AF> <30B3>
      <03B5> <30B9>
      <03B9> <30BD>
      <03BB> <30BF>
      <03BF> <30C3>
      <03C4> <30C8>
      <03CD> <30D1>
      <03D1> <30D5>
      <03D2> <30D6>
      <03DA> <30DE>
      <03E8> <30EC>
      <03EF> <30F3>
      <08BC> <8A66>
      endbfchar
      endcmap CMapName currentdict /CMap defineresource pop end end


      I think the mapping process needs "beginbfrange" and "endbfrange",

      but the CMap above does not include them.

       

      There must be a way to map from CID to Unicode, because Preview (the Mac application) can search the same text.

      Please let me know what I am misunderstanding about the ToUnicode CMap.

        • 1. Re: how to map from cid to unicode
          Test Screen Name CommunityMVP

          I would think your app should be able to handle beginbfrange or beginbfchar.

          • 2. Re: how to map from cid to unicode
            MikelKlink Community Member

            You might want to look up ToUnicode CMaps in the standard. ISO 32000-1:2008 says in section 9.10.3, "ToUnicode CMaps":

             

            The CMap defined in the ToUnicode entry of the font dictionary shall follow the syntax for CMaps introduced in 9.7.5, "CMaps" and fully documented in Adobe Technical Note #5014, Adobe CMap and CIDFont Files Specification. Additional guidance regarding the CMap defined in this entry is provided in Adobe Technical Note #5411, ToUnicode Mapping File Tutorial. This CMap differs from an ordinary one in these ways:

             

            • The only pertinent entry in the CMap stream dictionary (see Table 120) is UseCMap, which may be used if the CMap is based on another ToUnicode CMap.
            • The CMap file shall contain begincodespacerange and endcodespacerange operators that are consistent with the encoding that the font uses. In particular, for a simple font, the codespace shall be one byte long.
            • It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.
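
            The mapping described in the last bullet can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a complete CMap parser: the names `parse_tounicode` and `_utf16be` are made up for this example, it relies on regular expressions instead of a real PostScript tokenizer, it ignores UseCMap and the codespace ranges, and it treats the single-destination form of bfrange as a plain integer increment. The bfrange entries in the sample are hypothetical; only the two bfchar entries come from the CMap posted above.

```python
import re

def _utf16be(hex_digits):
    """Decode hex digits such as '3042' as UTF-16BE text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")

def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} from bfchar and bfrange sections."""
    mapping = {}
    pair = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
    rng = re.compile(
        r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(?:<([0-9A-Fa-f]+)>|\[(.*?)\])",
        re.S,
    )

    # bfchar: each entry is <source code> <UTF-16BE destination>.
    for sec in re.finditer(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in pair.findall(sec.group(1)):
            mapping[int(src, 16)] = _utf16be(dst)

    # bfrange: <lo> <hi> <dst>  or  <lo> <hi> [<dst0> <dst1> ...].
    for sec in re.finditer(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, arr in rng.findall(sec.group(1)):
            lo, hi = int(lo, 16), int(hi, 16)
            if dst:
                # Simplification: add the offset to the whole destination value.
                base, width = int(dst, 16), len(dst)
                for code in range(lo, hi + 1):
                    mapping[code] = _utf16be(f"{base + code - lo:0{width}X}")
            else:
                # Array form: one explicit destination per code in the range.
                for i, item in enumerate(re.findall(r"<([0-9A-Fa-f]+)>", arr)):
                    mapping[lo + i] = _utf16be(item)
    return mapping

# Hypothetical sample: bfchar entries from the thread, bfrange entries invented.
sample = """
2 beginbfchar
<034B> <3042>
<0F3B> <7528>
endbfchar
2 beginbfrange
<0010> <0012> <0041>
<0020> <0021> [<0061> <D83DDE00>]
endbfrange
"""
table = parse_tounicode(sample)
```

            Keys are stored as integers for brevity; a stricter parser would also keep the byte length of each code so that one-byte and two-byte codes cannot collide.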
            • 3. Re: how to map from cid to unicode
              aqua100 Community Member

              Thanks for your reply.

              I have already implemented functions to handle both beginbfrange/endbfrange and beginbfchar/endbfchar.

               

              The CMap shown in my first question contains only beginbfchar/endbfchar.

              Some CIDs have no corresponding Unicode value.

              In that case, how should I map those CIDs to Unicode?
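
              A missing entry can mean either that the font's ToUnicode CMap really omits that code or that the map was read incompletely, so it helps to list the unmapped codes while decoding. The sketch below is one hypothetical way to do that in Python: `decode_codes` is a made-up name, the codes are assumed to be two bytes long (matching the <0000> <FFFF> codespace above), and the table is a hand-copied subset of the posted bfchar entries.

```python
def decode_codes(data, to_unicode):
    """Decode a byte string of 2-byte character codes.

    Returns (text, missing); unmapped codes become U+FFFD in the text
    and are collected in `missing` for inspection.
    """
    text, missing = [], []
    for i in range(0, len(data) - 1, 2):
        code = int.from_bytes(data[i:i + 2], "big")
        if code in to_unicode:
            text.append(to_unicode[code])
        else:
            text.append("\ufffd")
            missing.append(code)
    return "".join(text), missing

# Subset of the thread's bfchar entries; code 0x0001 is deliberately unmapped.
to_unicode = {0x034B: "\u3042", 0x034D: "\u3044"}
text, missing = decode_codes(bytes.fromhex("034B034D0001"), to_unicode)
```

              If `missing` contains codes that the PDF viewer can still search for, the extracted map is probably incomplete.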

              • 4. Re: how to map from cid to unicode
                Test Screen Name CommunityMVP

                Can you share the PDF? If not, what are the objects in the font definition?

                 

                Is the font installed on the Mac OS system?

                 

                Can Acrobat or Adobe Reader extract the text correctly (e.g. by Unicode copy/paste)?

                • 5. Re: how to map from cid to unicode
                  aqua100 Community Member

                  Thanks for your reply.

                   

                  I have to apologize.

                  I had a bug in the code that extracts the ToUnicode CMap from the PDF document.

                  The ToUnicode CMap in the PDF has two beginbfchar/endbfchar sections,

                  but my code extracted only one of them.

                  As a result, some CIDs had no corresponding Unicode value.

                   

                  Thanks for your advice.
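
                  The bug described here, reading only one beginbfchar/endbfchar section, can be illustrated with a short Python sketch. The names `bfchar_sections`, `SECTION` and `PAIR` are made up for this example, the regular expressions are a simplification of real CMap tokenizing, and the two-section sample is a hypothetical reduction built from entries in the posted CMap, not the actual stream from the PDF.

```python
import re

PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
SECTION = re.compile(r"(\d+)\s+beginbfchar(.*?)endbfchar", re.S)

def bfchar_sections(cmap_text):
    """Yield (declared_count, pairs) for every bfchar section in the CMap."""
    for match in SECTION.finditer(cmap_text):
        yield int(match.group(1)), PAIR.findall(match.group(2))

# Hypothetical CMap fragment with two bfchar sections.
cmap_text = """
2 beginbfchar
<034B> <3042>
<034D> <3044>
endbfchar
1 beginbfchar
<08BC> <8A66>
endbfchar
"""

sections = list(bfchar_sections(cmap_text))
all_pairs = [pair for _, pairs in sections for pair in pairs]

# What an extractor that stops at the first match would see.
first_only = PAIR.findall(SECTION.search(cmap_text).group(2))
```

                  Comparing each section's declared count with the number of pairs actually parsed is a cheap sanity check that no entries were dropped.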