
CMaps and ToUnicode CMaps

Jun 2, 2008 6:11 AM

I'm looking through a couple of documents: 5014.CIDFont_Spec.pdf and 5411.ToUnicode.pdf. My ultimate goal is text extraction that includes the positions of the characters.

So in a CMap file the following sequences can appear:

3 begincidrange
<20> <7e> 1
<8140> <817e> 633
<8180> <81ac> 696
endcidrange

2 beginbfrange
<10FE> <10FF> <4E00>
<1100> <1101> <4E02>
endbfrange

The bfrange is supposed to produce the following mapping, which I believe maps CIDs to UTF-16BE:

CID=4350 -> U+4E00
CID=4351 -> U+4E01
CID=4352 -> U+4E02
CID=4353 -> U+4E03
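
As a rough check of the expansion above, here is a minimal Python sketch that expands the two bfrange entries into individual mappings. The helper name `expand_bfrange` is made up for illustration, and it assumes each destination is a single BMP character, so incrementing the code point is equivalent to incrementing the last byte of the UTF-16BE value.

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one bfrange entry into {source code: Unicode string}.

    Hypothetical helper; assumes the destination is UTF-16BE and that
    the range does not overflow the last character.
    """
    lo, hi = int(lo_hex, 16), int(hi_hex, 16)
    # Decode the UTF-16BE destination to get the starting text.
    start = bytes.fromhex(dst_hex).decode("utf-16-be")
    mapping = {}
    for offset in range(hi - lo + 1):
        # Only the last character advances across the range.
        mapping[lo + offset] = start[:-1] + chr(ord(start[-1]) + offset)
    return mapping


# The two entries from the bfrange block above.
bf_ranges = [("10FE", "10FF", "4E00"), ("1100", "1101", "4E02")]

to_unicode = {}
for lo, hi, dst in bf_ranges:
    to_unicode.update(expand_bfrange(lo, hi, dst))

for code, text in sorted(to_unicode.items()):
    print(f"{code} (0x{code:04X}) -> U+{ord(text):04X}")
```

The keys here are simply the source values from the bfrange entries; the sketch does not decide whether they should be read as CIDs or as character codes.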

And the cidrange is supposed to map char codes to CIDs.
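
For comparison, a small sketch of the cidrange lookup using the three entries quoted earlier. The function name `code_to_cid` is hypothetical, and the sketch ignores codespace ranges (a real CMap also distinguishes one-byte from two-byte codes), so treat it as an illustration of the arithmetic only.

```python
# (first code, last code, first CID) for each cidrange entry quoted above.
cid_ranges = [
    (0x20, 0x7E, 1),
    (0x8140, 0x817E, 633),
    (0x8180, 0x81AC, 696),
]


def code_to_cid(code, ranges=cid_ranges):
    """Return the CID for a character code, or None if no range covers it."""
    for lo, hi, first_cid in ranges:
        if lo <= code <= hi:
            # CIDs run consecutively from the first CID of the range.
            return first_cid + (code - lo)
    return None


for code in (0x20, 0x8140, 0x8180):
    print(f"<{code:X}> -> CID {code_to_cid(code)}")
```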

OK. Looking through section 5.9.1 of the PDF spec, a CID font can be mapped to Unicode in one of three ways:

- a ToUnicode map
- one of the encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding
- mapping the char code to a CID using the font's CMap, and then the CID to Unicode based on the registry-ordering CMap

So I guess where I'm confused is that these fonts will have something like this:

<002600520051004900550044005700480055> Tj

And based on the CMap rules I can parse this string into char codes. That makes sense for a begincidrange, because that converts char codes to CIDs. But if I have a ToUnicode CMap with beginbfrange, it is supposed to convert CIDs to Unicode.
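
The parsing step can be sketched like this, assuming every code in the string is two bytes long (for example a codespace range of <0000> <FFFF>). That assumption is mine and is not stated by the font in question; `split_codes` is a made-up helper name.

```python
def split_codes(hex_string, code_len=2):
    """Split the hex operand of Tj into fixed-width character codes.

    Assumes every code is exactly `code_len` bytes; a real parser would
    consult the CMap's codespace ranges to find each code's length.
    """
    data = bytes.fromhex(hex_string)
    return [
        int.from_bytes(data[i:i + code_len], "big")
        for i in range(0, len(data), code_len)
    ]


codes = split_codes("002600520051004900550044005700480055")
print([f"{c:04X}" for c in codes])
```

Each resulting value is then the thing that gets looked up, whether in a cidrange or in a bfrange.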

So my guess is that the hex values in the Tj string are CIDs if we have a ToUnicode map, and char codes if we don't?
 
Replies
    Jun 2, 2008 6:14 AM   in reply to (Mike_J_B)
    Values in the "draw string" (TJ) operator are ALWAYS mapped first via the encoding of the font that is current at the time of drawing. From that encoding you get the Unicode value either implicitly (for the standard encodings or for certain font types) or explicitly (via the ToUnicode table, when present or otherwise required).

    There is a section of the PDF Reference that goes into detail on how to do text extraction. Consult that for more details.

    Leonard
     