Values in the "draw string" (Tj/TJ) operators are ALWAYS mapped first via the encoding of the font that is current at the time of drawing. From that encoding you get the Unicode value either implicitly (for the standard encodings or for certain font types) or explicitly (via the ToUnicode table, when present or otherwise required).
There is a section of the PDF Reference that goes into detail on how to do text extraction. Consult that for more details.
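As a rough illustration of the "implicit vs. explicit" distinction for a simple (single-byte) font, here is a minimal Python sketch. The function name `decode_simple_font_text` and the `to_unicode` dict are made up for this example, and Python's `cp1252` codec is only a stand-in for WinAnsiEncoding (the two differ for a few codes), so treat it as a sketch under those assumptions rather than how any real PDF library does it.

```python
def decode_simple_font_text(raw, to_unicode=None):
    """Map single-byte character codes from a show-text string to Unicode.

    to_unicode: hypothetical dict {character code: Unicode string}, standing
    in for a parsed ToUnicode table. When a code is not covered, fall back to
    the implicit mapping of a standard encoding (cp1252 is used here as an
    approximation of WinAnsiEncoding; this is an assumption).
    """
    out = []
    for code in raw:  # iterating over bytes yields integer codes 0..255
        if to_unicode is not None and code in to_unicode:
            # Explicit mapping: the ToUnicode table says what this code means.
            out.append(to_unicode[code])
        else:
            # Implicit mapping: derive Unicode from the standard encoding.
            out.append(bytes([code]).decode("cp1252", errors="replace"))
    return "".join(out)
```

For example, `decode_simple_font_text(b"\x01\x02", {1: "A", 2: "\ufb01"})` uses only the explicit table, while `decode_simple_font_text(b"Hi")` falls back to the standard encoding.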
Hm... so you're saying that for a CIDFont, you always use the encoding to map char codes in the "draw string" to CIDs, and then you use the CMap to convert those to UTF-16BE?
I'm looking at PDF Reference v1.7, page 470, and it says:
"If the font dictionary contains a ToUnicode CMap, use that CMap to convert the character code to Unicode"
That sounds like it maps the "draw string" directly to UTF-16BE no matter what, but just in case it doesn't, the font also has this encoding:
"Identity-H: The horizontal identity mapping for 2-byte CIDs; may be used with CIDFonts using any Registry, Ordering, and Supplement values. It maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, interpreted high-order byte first."
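To make the two quoted passages concrete, here is a minimal Python sketch of how I read them. It assumes the string bytes have already been pulled out of the content stream and that the ToUnicode CMap has already been parsed into a plain dict keyed by character code with UTF-16BE byte values; `identity_h_cids`, `codes_to_text`, and that dict layout are hypothetical names for illustration, not any library's API.

```python
import struct


def identity_h_cids(raw):
    """Split a string into 2-byte codes, high-order byte first.

    Under Identity-H each 2-byte character code maps to the CID with the
    same value, so the list of codes is also the list of CIDs.
    """
    if len(raw) % 2:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [code for (code,) in struct.iter_unpack(">H", raw)]


def codes_to_text(raw, to_unicode):
    """Look up each 2-byte character code in a (hypothetical) parsed ToUnicode map.

    to_unicode: dict {character code: UTF-16BE bytes}. Codes with no entry
    become U+FFFD so that gaps stay visible instead of being dropped.
    """
    chars = []
    for code in identity_h_cids(raw):
        utf16be = to_unicode.get(code)
        chars.append(utf16be.decode("utf-16-be") if utf16be is not None else "\ufffd")
    return "".join(chars)
```

With this sketch, `identity_h_cids(b"\x00\x41\x01\x02")` yields the codes 0x0041 and 0x0102, and `codes_to_text` looks up those same character codes in the ToUnicode dict, which matches the "convert the character code to Unicode" wording quoted above.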