2 Replies Latest reply on Jun 2, 2008 7:08 AM by (Mike_J_B)

    CMaps and ToUnicode CMaps

      I'm looking through a couple of documents: 5014.CIDFont_Spec.pdf and 5411.ToUnicode.pdf. My ultimate goal is basically text extraction that includes the positions of the characters.

      So in a CMap file the following sequences can appear:

      3 begincidrange
      <20> <7e> 1
      <8140> <817e> 633
      <8180> <81ac> 696

      2 beginbfrange
      <10FE> <10FF> <4E00>
      <1100> <1101> <4E02>

      The bfrange is supposed to do this mapping, which I believe maps CIDs to UTF-16BE:

      CID=4350 -> U+4E00
      CID=4351 -> U+4E01
      CID=4352 -> U+4E02
      CID=4353 -> U+4E03
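      To check my reading of bfrange, here is a rough Python sketch that expands the two ranges above. The function name expand_bfrange is just something I made up, and it assumes the destination is a single UTF-16BE string whose value goes up by one for each successive source value (so it does not handle the array form of bfrange, or a range whose last destination byte would overflow).

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one '<lo> <hi> <dst>' bfrange line into {source value: text}.

    Assumption: dst is one UTF-16BE string, incremented by one per step.
    """
    lo, hi = int(lo_hex, 16), int(hi_hex, 16)
    dst_len = len(dst_hex) // 2          # number of bytes in the destination
    dst_start = int(dst_hex, 16)
    mapping = {}
    for offset in range(hi - lo + 1):
        dst_bytes = (dst_start + offset).to_bytes(dst_len, "big")
        mapping[lo + offset] = dst_bytes.decode("utf-16-be")
    return mapping


# The two bfrange lines from the CMap fragment above.
table = {}
for lo, hi, dst in [("10FE", "10FF", "4E00"), ("1100", "1101", "4E02")]:
    table.update(expand_bfrange(lo, hi, dst))

for value, text in sorted(table.items()):
    print(f"{value} -> U+{ord(text):04X}")
```

      Running this should list the same four pairs as above (4350 through 4353 mapped to U+4E00 through U+4E03).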

      And the cidrange is supposed to map char codes to CIDs.
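      As I understand it, each cidrange line gives a first code, a last code, and the CID of the first code, with the rest following consecutively. A minimal sketch using the three ranges above (CID_RANGES and code_to_cid are my own hypothetical names, and I'm assuming the byte length of the code has to match the byte length used in the range):

```python
# (code length in bytes, first code, last code, CID of first code)
CID_RANGES = [
    (1, 0x20, 0x7E, 1),
    (2, 0x8140, 0x817E, 633),
    (2, 0x8180, 0x81AC, 696),
]


def code_to_cid(code, nbytes, ranges=CID_RANGES):
    """Return the CID for a character code, or None if no range covers it."""
    for length, lo, hi, first_cid in ranges:
        if length == nbytes and lo <= code <= hi:
            return first_cid + (code - lo)
    return None


print(code_to_cid(0x20, 1))     # first code of the 1-byte range
print(code_to_cid(0x8140, 2))   # first code of the second range
print(code_to_cid(0x817F, 2))   # falls in the gap between the 2-byte ranges
```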

      OK. Looking through section 5.9.1 of the PDF spec, a CID font can be mapped to Unicode in one of three ways:

      - use the ToUnicode map
      - use one of the encodings: MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding
      - map char code to CID using the font's CMap, and then CID to Unicode based on the registry-ordering CMap

      So I guess where I'm confused is that these fonts will have something like this:

      <002600520051004900550044005700480055> Tj

      And based on CMap rules I can parse this string into char codes. That makes sense for a begincidrange, because that converts char codes to CIDs. But if I have a ToUnicode CMap with beginbfrange, it is supposed to convert CIDs to Unicode.
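      For reference, this is roughly how I'm splitting the string into char codes. It is only a sketch and assumes a single 2-byte codespace range (something like <0000> <FFFF>), which is my assumption about this font rather than something shown in the fragments above; split_codes is a made-up helper name.

```python
def split_codes(hex_string, code_len=2):
    """Split the hex payload of a Tj string into fixed-width character codes.

    Assumption: every code is code_len bytes, read high-order byte first.
    """
    raw = bytes.fromhex(hex_string)
    if len(raw) % code_len:
        raise ValueError("string length is not a multiple of the code length")
    return [
        int.from_bytes(raw[i:i + code_len], "big")
        for i in range(0, len(raw), code_len)
    ]


codes = split_codes("002600520051004900550044005700480055")
print([f"{c:04X}" for c in codes])
```

      That gives nine 2-byte codes, starting with 0026 and 0052.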

      So my guess is that the hex in the Tj string is CIDs if we have a ToUnicode map, and char codes if we don't?
        • 1. Re: CMaps and ToUnicode CMaps
          Values in the "draw string" (TJ) operator are ALWAYS mapped first via the encoding of the font that is current at the time of drawing. From that encoding you either get the Unicode value implicitly (for the standard encodings or for certain font types) or explicitly (via the ToUnicode table, when present or otherwise required).

          There is a section of the PDF Reference that goes into detail on how to do text extraction. Consult that for more details.

          • 2. Re: CMaps and ToUnicode CMaps
            Hm... so you're saying that for a CID font, you always use the encoding to map char codes in the "draw string" to CIDs, and then you use the CMap to convert those to UTF-16BE?

            I'm looking at the PDF Reference v1.7, page 470, and it says:

            "If the font dictionary contains a ToUnicode CMap, use that CMap to convert the character code to Unicode"

            That sounds like it maps the "draw string" directly to UTF-16BE no matter what, but just in case it doesn't, the font also has this encoding:

            "Identity-H: The horizontal identity mapping for 2-byte CIDs; may be used with CIDFonts using any Registry, Ordering, and Supplement values. It maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, interpreted high-order byte first"
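            If I'm reading that right, with Identity-H the char code and the CID are the same number, so it would not matter whether the ToUnicode table is keyed by code or by CID. A small sketch of that reading, reusing the bfrange values from my first post; the show string <10FE1101> is a hypothetical example I made up, and identity_h and to_text are my own names.

```python
def identity_h(data):
    """Identity-H as quoted above: each 2-byte code is its own CID."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# ToUnicode entries expanded from the bfrange lines in the first post.
TO_UNICODE = {0x10FE: "\u4e00", 0x10FF: "\u4e01", 0x1100: "\u4e02", 0x1101: "\u4e03"}


def to_text(data, to_unicode=TO_UNICODE):
    """Look each 2-byte code up in the ToUnicode table; unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in identity_h(data))


shown = bytes.fromhex("10FE1101")   # hypothetical string: <10FE1101> Tj
print(identity_h(shown))            # the codes, which are also the CIDs
print([f"U+{ord(ch):04X}" for ch in to_text(shown)])
```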