1. "I couldn't really find in either the PDF spec or the PDF/A standard what "not embedding a font" exactly meant at a technical level. I assumed it meant "leaving out the /FontFile2 tag and the font data it refers to", but apparently that is not entirely the case."
Almost exactly that, except it's FontFile, FontFile2, FontFile3.
2. Does PDF/A allow a non-embedded font?
3. To work, a non-embedded font must be renderable by the PDF viewer: for example, one of the base 14 fonts, or a font present on the system. In other cases, the viewer needs to be able to create a substitute using its substitution fonts and the information in the font descriptor.
4. The particular viewer Acrobat includes Latin1 substitution fonts and the option of CJK substitution fonts. That is all.
5. The phrase "Unicode font" has no particular meaning in PDF. I often find it used as a kind of wishful thinking for "a font where I can use any Unicode code point without having to worry about the details of PDF font management". But this is just wishful thinking. PDF was invented before Unicode and, while Unicode text extraction is supported by ToUnicode maps and for many items that use "text strings" like annotation contents and bookmarks, the PDF format has no concept or implementation of a "don't bother me with the details" Unicode font for page contents.
6. Identity-H has no meaning except for a specific font. If the font does not exist, it is utterly meaningless. Consider: in Identity-H you are saying that your code points are CID/GID entries for glyphs in a font. If you have no font, how could this be looked up? ToUnicode is not relevant in any way to the process of finding and displaying a font.
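To make points 1 and 6 concrete, here is a minimal Python sketch. It models a font descriptor as a plain dict, which is an assumption for illustration and not any PDF library's API; the helper names is_embedded and identity_h_codes are hypothetical. The first helper checks for the three keys that can point at an embedded font program, and the second shows how a string is read under Identity-H: two bytes per code, big-endian, with each code used directly as a CID.

```python
# Keys in a font descriptor that reference an embedded font program:
# FontFile (Type 1), FontFile2 (TrueType), FontFile3 (other, by Subtype).
EMBEDDED_FONT_KEYS = ("FontFile", "FontFile2", "FontFile3")


def is_embedded(font_descriptor: dict) -> bool:
    """Return True if the (dict-modelled) descriptor has any font program key."""
    return any(key in font_descriptor for key in EMBEDDED_FONT_KEYS)


def identity_h_codes(string_bytes: bytes) -> list:
    """Split a PDF string into 2-byte big-endian codes, as Identity-H does.

    Each code is a CID. Without the actual font there is nothing to
    look these numbers up in, which is why they are meaningless on their own.
    """
    if len(string_bytes) % 2 != 0:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [
        int.from_bytes(string_bytes[i:i + 2], "big")
        for i in range(0, len(string_bytes), 2)
    ]


if __name__ == "__main__":
    print(is_embedded({"FontName": "ABCDEF+Example", "FontFile2": "12 0 R"}))
    print(is_embedded({"FontName": "Helvetica"}))
    print(identity_h_codes(b"\x00\x41\x01\x02"))
```

Note that nothing in identity_h_codes consults a ToUnicode map; that map only matters for text extraction, not for locating glyphs.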
Re 2: The PDF/A-2 standard, section 6.2.11.4.1 (there's probably a similar section in PDF/A-1 and PDF/A-3) (abridged): "The font programs for all fonts used for rendering shall be embedded. A font is considered to be used if at least one of its glyphs is referenced from a content stream." Note 2: "A font referenced for use in text rendering mode 3 is not rendered and is thus exempt from the embedding requirement."
My output files only contain text using rendering mode 3.
Since no pixel will be influenced by the font I am using in any way, I am not interested in any kind of rendering, except insofar as each character in my text (and therefore probably each actually used glyph in the font) needs a width and height, so I can tell the PDF viewer which highlight rectangle to draw when the searched text is found. So, apart from that little niggle, I want to avoid font processing as much as PDF allows me to (which is very little, I understood that already).
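For readers unfamiliar with rendering mode 3, here is a rough Python sketch of the kind of content stream fragment involved. The function name invisible_text_ops and the font resource name F1 are hypothetical; the operators themselves (BT, Tr, Tf, Td, Tj, ET) are standard PDF text operators, and 3 Tr selects the mode in which glyphs are neither filled nor stroked. The codes are written as a hex string of two-byte values, as a composite font with a two-byte encoding would expect.

```python
def invisible_text_ops(font_name, font_size, x, y, codes):
    """Build a text object that places two-byte codes in rendering mode 3.

    font_name is the resource name of the font (hypothetical, e.g. "F1");
    codes is a sequence of integers in the range 0..0xFFFF.
    """
    for code in codes:
        if not 0 <= code <= 0xFFFF:
            raise ValueError("code out of two-byte range: %r" % (code,))
    hex_string = "".join("%04X" % code for code in codes)
    lines = [
        "BT",                                  # begin text object
        "3 Tr",                                # mode 3: neither fill nor stroke
        "/%s %s Tf" % (font_name, font_size),  # select font and size
        "%s %s Td" % (x, y),                   # move to the text position
        "<%s> Tj" % hex_string,                # show the (invisible) string
        "ET",                                  # end text object
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(invisible_text_ops("F1", 12, 72, 700, [0x0048, 0x0069]))
```

The glyph widths still matter even though nothing is painted, because the viewer uses them to work out the highlight rectangle for a search hit.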
I do wonder why the PDF/A standard allows something (non-embedding of fonts) which Acrobat only supports for fonts using Latin1 and possibly some CJK encodings (I didn't try those). At the same time, PDF/A is rather concerned about text mappings to Unicode, and it copies a number of implementation limits (such as the maximum of 8388607 indirect objects) straight from Acrobat. It thereby implicitly acknowledges both 1) the importance of a certain series of code points called Unicode, and 2) Acrobat as the main viewing implementation.
Anyway, I'm trying out the dummy font route now.
Perhaps some insight into how the PDF and PDF/A committees decided this. PDF/A was not inventing new things, but accommodating existing practice and trying to only forbid things that were a problem.
So, it was well known that OCR applications used non-embedded Latin1 fonts with Tr 3. It is clear this is not a rendering issue, and countless millions of such files existed, so it was allowed. It was not for the PDF/A committee to invent a new way to have non-embedded non-Latin1 fonts for Unicode OCR. That would be for the PDF committee to innovate. They (the PDF committee) have, however, moved ever further away from non-embedded fonts; the advantages of non-embedded fonts TO END USERS are pretty negligible, even though they make the developer's work easier.
As for /Identity-H: /Encoding is a required parameter, and something like /Identity-H is the easiest way to fulfil that requirement.
But it's a required parameter with meaning. Encoding is extremely relevant for both embedded and non-embedded fonts. So it must be right, not just there... and it cannot be right for a substitution font...
Embedding the font in this case is mostly a waste of bytes. They are cheap nowadays though.
Encoding is indeed very relevant, as Leonard Rosenthol explained in an answer to a related post which I found shortly after posting my question: "Using a CMap with a non-embedded font". The moment I read that, it was clear to me that my method wasn't going to work.
Well, I made it work using a dummy font containing only a .notdef glyph and a second glyph, empty but with a width and height. The code is mostly similar to what I posted above, except that the CIDToGIDMap now maps all 65536 two-byte codes to that single empty glyph. I could reduce the font (or rather, font-like data structure) to a size of only 280 bytes (176 bytes compressed), because according to the PDF spec only a small number of font tables are actually required, and particularly the cmap table can be left out. Actually, the character mapping data for the CIDToGIDMap compresses so well that I could have deflated it a second time to shave off some more bytes (I didn't).
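As a rough illustration of the mapping described above (not the actual code from this post), the following Python sketch builds a CIDToGIDMap stream body in which every two-byte code points at glyph index 1, then deflates it once and twice with zlib. The function name build_cid_to_gid_map and the choice of glyph index 1 for the empty glyph are assumptions; the layout of the map (two bytes per CID, high byte first) follows the PDF spec.

```python
import zlib

NUM_CODES = 65536  # every possible two-byte code


def build_cid_to_gid_map(gid=1):
    """Return a CIDToGIDMap body mapping all 65536 CIDs to one glyph index.

    Each entry is two bytes, big-endian, so the result is 131072 bytes.
    gid=1 assumes glyph 0 is .notdef and glyph 1 is the empty glyph.
    """
    return gid.to_bytes(2, "big") * NUM_CODES


raw = build_cid_to_gid_map()
deflated_once = zlib.compress(raw, 9)
deflated_twice = zlib.compress(deflated_once, 9)

if __name__ == "__main__":
    # Sizes depend on the zlib build, so they are printed rather than assumed.
    print("raw bytes:     ", len(raw))
    print("deflated once: ", len(deflated_once))
    print("deflated twice:", len(deflated_twice))
```

In a PDF, a doubly deflated stream would need a filter array listing FlateDecode twice so that the viewer undoes both passes.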
It's a bit of a hack, but it conforms to the specs and works fine in Adobe Acrobat and all other Windows and Unix PDF readers that I tried; the only one that failed was (a fairly old version of) Mac OS X Preview.
The total "overhead" is about 1300 bytes, and I also avoid a lot of processing for font subsetting. (Again, I don't object to extra development work, but I do want to save processor cycles and storage bytes that are totally unnecessary.)
It may be interesting to note that the people at Tesseract OCR independently developed essentially the same solution for their searchable PDFs, and also managed to tweak the values inside the font so it works with a few more PDF viewers that are rather picky about their input. See https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp and https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf (for more interesting details, look at the file history, the commits, and the discussion of the related issues).