And the cidrange is supposed to map char codes to CIDs.
Ok. So looking through section 5.9.1 of the PDF spec, a CID Font can be mapped to Unicode in one of three ways:
-use one of the encodings: MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding
-map charcode to CID using the font's CMap, and then CID to Unicode based on the registry-ordering CMap
-use the font's ToUnicode CMap, if it has one
So I guess where I'm confused is that these fonts will have something like this:
And based on CMap rules I can parse this string into char codes. That makes sense for a begincidrange, because that converts char codes to CIDs. But if I have a ToUnicode CMap with beginbfrange, it is supposed to convert CIDs to Unicode.
So my guess is that the hex in the Tj array is CIDs if we have a ToUnicode map, and char codes if we don't?
Values in the "draw string" (TJ) operator are ALWAYS mapped first via the encoding of the font that is current at the time of drawing. From that encoding you either get the Unicode value implicitly (for the standard encodings or for certain font types) or explicitly (via the ToUnicode table, when present or otherwise required).
There is a section of the PDF Reference that goes into detail on how to do text extraction. Consult that for more details.
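As a rough illustration of the explicit path, here is a minimal Python sketch of looking up character codes in ToUnicode-style tables. It assumes the codes have already been split out of the string, and it handles only the simple `<lo> <hi> <dst>` form of beginbfrange, where the last byte of the destination is incremented across the range. The names `decode_with_tounicode`, `bfchars`, and `bfranges` are hypothetical, and the table contents are made up for the example.

```python
def decode_with_tounicode(codes, bfchars, bfranges):
    """Map character codes to text using ToUnicode-style tables.

    bfchars:  dict of code -> str (one entry per beginbfchar line)
    bfranges: list of (lo, hi, dst) where dst is the UTF-16BE bytes for `lo`
    """
    out = []
    for code in codes:
        if code in bfchars:
            out.append(bfchars[code])
            continue
        for lo, hi, dst in bfranges:
            if lo <= code <= hi:
                # Offset the final byte of the destination by the distance
                # from the start of the range (assumes it does not overflow).
                raw = bytearray(dst)
                raw[-1] += code - lo
                out.append(raw.decode("utf-16-be"))
                break
        else:
            # No mapping found: emit the Unicode replacement character.
            out.append("\ufffd")
    return "".join(out)


# Made-up tables: code 0x0003 is a space, codes 0x24..0x3D map to 'A'..'Z'.
bfchars = {0x0003: " "}
bfranges = [(0x0024, 0x003D, bytes.fromhex("0041"))]

print(decode_with_tounicode([0x2B, 0x28, 0x2F, 0x2F, 0x32], bfchars, bfranges))
# prints HELLO
```

Note that the keys here are the character codes taken from the string, not CIDs; that matches the wording "convert the character code to Unicode" in the Reference.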
Hm.. so you're saying that for a CID Font, you always use the encoding to map char codes in the "draw string" to CIDs, and then you use the CMap to convert those to UTF-16BE?
I'm looking at PDF Reference v1.7, page 470, and it says:
"If the font dictionary contains a ToUnicode CMap, use that CMap to convert the character code to Unicode"
That sounds like it maps the "draw string" directly to UTF-16BE no matter what, but just in case it doesn't, the font also has this encoding:
"Identity-H: The horizontal identity mapping for 2-byte CIDs; may be used with CIDFonts using any Registry, Ordering, and Supplement values. It maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, interpreted high-order byte first"
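If I'm reading that right, with Identity-H the hex string just splits into 2-byte codes and each code is its own CID. Here is a minimal Python sketch of that reading, assuming the string has an even number of hex digits and the angle brackets are already stripped; `identity_h_cids` is a name I made up for illustration.

```python
def identity_h_cids(hex_string):
    """Split a hex string into 2-byte codes; under Identity-H, CID == code."""
    data = bytes.fromhex(hex_string)
    if len(data) % 2:
        raise ValueError("Identity-H expects a whole number of 2-byte codes")
    # High-order byte first, i.e. big-endian.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print(identity_h_cids("002B00280032"))
# prints [43, 40, 50]
```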