9 Replies Latest reply on Sep 12, 2017 4:27 AM by gerbenvos

    How to create a PDF/A with invisible Unicode OCR text and no embedded font

    gerbenvos Level 1

      We have some code that converts TIFF (usually scanned and OCR'ed) to PDF. I am currently extending this code to:

      1. be PDF/A compliant (either PDF/A-1b "with Unicode" or PDF/A-2u),
      2. add Unicode text "behind" the image to make the PDF searchable, and
      3. while doing (2), avoid embedding any fonts (which is permitted in PDF/A as long as the text is invisible).

       

      I have succeeded in generating PDFs that pass the PDF/A preflight test in Adobe Acrobat 9 (we don't have a licence for a newer version right now), and that also conform to the other requirements above, with the one problem that when opening them in Acrobat (9) or Acrobat Reader (DC), I get the error "cannot find or create the font".

      [Screenshot: Acrobat error dialog, "Cannot find or create the font"]

      Currently, we use the following PDF code in our output:

       

      3 0 obj
      <</Type/FontDescriptor/FontName/DummyInvisibleMonospace/Flags 34
      /FontBBox[0 0 600 1000]/ItalicAngle 0/Ascent 1000/Descent -300
      /CapHeight 700/StemV 0/MissingWidth 600>>
      endobj
      4 0 obj
      <</Type/Font/Subtype/CIDFontType2/BaseFont/DummyInvisibleMonospace
      /CIDSystemInfo<</Registry(Adobe)/Ordering(UCS)/Supplement 0>>
      /FontDescriptor 3 0 R/DW 600/CIDToGIDMap/Identity>>
      endobj
      5 0 obj
      <</Length 374>>
      stream
      /CIDInit /ProcSet findresource begin
      12 dict begin
      begincmap
      /CIDSystemInfo 3 dict dup begin
      /Registry (Adobe) def
      /Ordering (UCS) def
      /Supplement 0 def
      end def
      /CMapName /Adobe-Identity-UCS def
      /CMapType 2 def
      1 begincodespacerange
      <0000> <ffff>
      endcodespacerange
      1 beginbfrange
      <0000> <ffff> <0000>
      endbfrange
      endcmap
      CMapName currentdict /CMap defineresource pop
      end
      end
      
      endstream
      endobj
      6 0 obj
      <</Type/Font/Subtype/Type0/Name/F1/BaseFont/DummyInvisibleMonospace
      /Encoding/Identity-H/ToUnicode 5 0 R/DescendantFonts[4 0 R]>>
      endobj
      7 0 obj
      <</Length 266193>>
      stream
      q 591.36 0 0 775.68 0 0 cm/I2 Do Q
      BT/F1 12 Tf 3 Tr
      % ...
      1.34762 0 0 0.88 308.64 576.624 Tm<0043004F004D00420049004E0045>Tj
      % ...
      ET
      endstream
      endobj
      8 0 obj
      <</Type/Page/Parent 1 0 R/MediaBox[0 0 591.36 775.68]
      /Resources<</XObject<</I2 2 0 R
      >>/Font<</F1 6 0 R>>/ProcSet[/PDF/Text/ImageB]>>/Contents 7 0 R>>
      endobj
      
      
      
      

       

      Replacing the fancy font name with that of one of the standard 14 fonts, say Courier, does not help.
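
      For readers following the content stream above: with /Identity-H and the identity ToUnicode mapping shown, each character is written as one two-byte code equal to its UTF-16 code unit. The following is a minimal Python sketch of that encoding, not the poster's actual code; the function name is hypothetical, and restricting input to the Basic Multilingual Plane is an assumption made to keep one code per character.

```python
def to_identity_h_hex(text: str) -> str:
    """Encode text as a PDF hex string of two-byte codes, one per character."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Assumption: stay inside the BMP so every character is exactly
            # one two-byte code covered by the <0000>..<ffff> codespace range.
            raise ValueError(f"character outside the BMP: {ch!r}")
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


print(to_identity_h_hex("COMBINE"))  # <0043004F004D00420049004E0045>
```

      The printed value is the same hex string that appears before the Tj operator in the posted stream.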

       

      Unfortunately, I have not been able to find or create any sample PDF files that show how to add invisible text without embedding a font. I could get it to work without a Unicode mapping, referring to non-embedded Courier and using WinAnsiEncoding. However, in order to use Unicode, I need to add some more font structures, so there is a ToUnicode mapping in accordance with the PDF/A standard, and then I cannot get rid of this warning, which obviously would make the PDF unusable for our clients.
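
      The Latin-only variant described above (non-embedded Courier with WinAnsiEncoding) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the poster's code: the helper names and the /F1 resource name are hypothetical, and Python's cp1252 codec is used as a stand-in for WinAnsiEncoding, which should agree for ordinary Latin text.

```python
# A simple font dictionary that names a standard font and embeds nothing.
FONT_OBJECT = b"<</Type/Font/Subtype/Type1/BaseFont/Courier/Encoding/WinAnsiEncoding>>"


def pdf_literal_string(text: str) -> bytes:
    """Encode text as a PDF literal string, escaping backslashes and parentheses."""
    # Raises UnicodeEncodeError for characters outside code page 1252,
    # which is exactly the limitation that motivates the Unicode route.
    raw = text.encode("cp1252")
    raw = raw.replace(b"\\", b"\\\\").replace(b"(", b"\\(").replace(b")", b"\\)")
    return b"(" + raw + b")"


def invisible_text(text: str, x: float, y: float, size: float = 12.0) -> bytes:
    """Build a text object that places `text` at (x, y) in rendering mode 3."""
    head = f"BT/F1 {size:g} Tf 3 Tr 1 0 0 1 {x:g} {y:g} Tm".encode("ascii")
    return head + pdf_literal_string(text) + b"Tj ET"


print(invisible_text("COMBINE", 308.64, 576.624))
```

      Any character that cp1252 cannot represent makes the encoder fail, which mirrors why this route does not extend to general Unicode OCR text.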

       

      I couldn't really find in either the PDF spec or the PDF/A standard what "not embedding a font" exactly meant at a technical level. I assumed it meant "leaving out the /FontFile2 tag and the font data it refers to", but apparently that is not entirely the case.

       

      If this turns out to be impossible, I could maybe embed a dummy font, like Adobe Blank but with (empty) glyphs that do have a width, but this really should not be necessary if I understand the standards correctly.

       

      Any help or at least useful pointers are appreciated.

       

      Edited by: Gerben Vos; split long lines in code.

       

      Edited by: Gerben Vos; add comment on Adobe-Blank-like font.

        • 1. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
          Test Screen Name Most Valuable Participant

          1. "I couldn't really find in either the PDF spec or the PDF/A standard what 'not embedding a font' exactly meant at a technical level. I assumed it meant 'leaving out the /FontFile2 tag and the font data it refers to', but apparently that is not entirely the case."

          Almost exactly that, except that it applies to all three keys: FontFile, FontFile2, and FontFile3.

           

          2. Does PDF/A allow a non-embedded font?

           

          3. To work, a non-embedded font must be renderable by the PDF viewer: for example, one of the base 14 fonts, or a font present on the system. In other cases, the viewer needs to be able to create a substitute using its substitution fonts and the information in the font descriptor.

           

          4. Acrobat, as one particular viewer, includes Latin1 substitution fonts and, optionally, CJK substitution fonts. That is all.

           

          5. The phrase "Unicode font" has no particular meaning in PDF. I often find it used as a kind of wishful thinking for "a font where I can use any Unicode code point without having to worry about the details of PDF font management". But this is just wishful thinking. PDF was invented before Unicode and, while Unicode text extraction is supported by ToUnicode maps and for many items that use "text strings" like annotation contents and bookmarks, the PDF format has no concept or implementation of a "don't bother me with the details" Unicode font for page contents.

           

          6. Identity-H has no meaning except for a specific font. If the font does not exist, it is utterly meaningless. Consider: in Identity-H you are saying that your code points are CID/GID entries for glyphs in a font. If you have no font, how could this be looked up? ToUnicode is not relevant in any way to the process of finding and displaying a font.

          • 2. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
            gerbenvos Level 1

            Re 2: The PDF/A-2 standard, section 6.2.11.4.1 (there's probably a similar section in PDF/A-1 and PDF/A-3) (abridged): "The font programs for all fonts used for rendering shall be embedded. A font is considered to be used if at least one of its glyphs is referenced from a content stream." Note 2: "A font referenced for use in text rendering mode 3 is not rendered and is thus exempt from the embedding requirement."

             

            My output files only contain text using rendering mode 3.
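
            A claim like this can be checked mechanically. The Python sketch below walks a content stream and reports any text-showing operator that runs while the rendering mode is not 3. It is a simplified illustration, not a validator: the function name is hypothetical, and the tokenizer assumes hex strings only (as in the posted stream), so literal strings in parentheses and inline images are not handled.

```python
import re

# Hex strings, names, array brackets, then everything else (numbers, operators).
TOKEN = re.compile(rb"<[0-9A-Fa-f\s]*>|/[^\s/<>\[\]()]*|[\[\]]|[^\s/<>\[\]()]+")
SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}


def visible_text_ops(content: bytes) -> list:
    """Return the text-showing operators executed with rendering mode != 3."""
    content = re.sub(rb"%[^\r\n]*", b"", content)  # drop comments
    mode = 0          # the default text rendering mode is 0 (fill)
    saved = []        # modes saved by q, restored by Q
    operands = []
    hits = []
    for match in TOKEN.finditer(content):
        tok = match.group()
        is_operator = tok[:1].isalpha() or tok in (b"'", b'"')
        if not is_operator:
            operands.append(tok)
            continue
        if tok == b"q":
            saved.append(mode)
        elif tok == b"Q" and saved:
            mode = saved.pop()
        elif tok == b"Tr" and operands:
            mode = int(operands[-1])
        elif tok in SHOW_OPS and mode != 3:
            hits.append(tok)
        operands.clear()
    return hits


sample = (b"q 591.36 0 0 775.68 0 0 cm/I2 Do Q\n"
          b"BT/F1 12 Tf 3 Tr\n"
          b"1.34762 0 0 0.88 308.64 576.624 Tm<0043004F004D00420049004E0045>Tj\n"
          b"ET\n")
print(visible_text_ops(sample))  # []
```

            An empty list means every Tj, TJ, ' or " in the stream ran in mode 3, which is the condition Note 2 relies on.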

             

            Since no pixel will be influenced by the font I am using in any way, I am not interested in any kind of rendering, except insofar as each character in my text (and therefore probably each actually used glyph in the font) needs a width and height, so I can tell the PDF viewer which highlight rectangle to draw when the searched text is found. So, apart from that little niggle, I want to avoid font processing as much as PDF allows me to (which is very little, I understood that already).
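
            To make the width-and-height point concrete: with a fixed glyph width, the text matrix alone can stretch an invisible word so that it covers its OCR bounding box. The Python sketch below shows one plausible way to derive such a matrix. It assumes the 600/1000 em width and 12 pt font size from the posted font objects; the function name and the 67.92 by 10.56 point word box are hypothetical, chosen so that the result reproduces the Tm line in the posted stream. It is not necessarily how the original converter computes it.

```python
def text_matrix(n_chars, box_x, box_y, box_w, box_h,
                font_size=12.0, glyph_width=600):
    """Return (a, b, c, d, e, f) for a Tm operator that fits a word to a box."""
    # Unscaled width of the word: every glyph advances glyph_width/1000 em.
    natural_width = n_chars * glyph_width / 1000 * font_size
    a = box_w / natural_width   # horizontal stretch to match the box width
    d = box_h / font_size       # vertical stretch to match the box height
    return (round(a, 5), 0, 0, round(d, 5), box_x, box_y)


# "COMBINE" has 7 characters; the box size is an assumed example value.
print(text_matrix(7, 308.64, 576.624, 67.92, 10.56))
```

            With these inputs the horizontal and vertical scale factors come out as 1.34762 and 0.88, matching the matrix in front of the "COMBINE" string above.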

             

            I do wonder why the PDF/A standard allows something (not embedding fonts) that Acrobat supports only for fonts using Latin1 and possibly some CJK encodings (I didn't try those). At the same time, PDF/A is rather concerned about text mappings to Unicode, and it copies a number of implementation limits (such as the maximum of 8388607 indirect objects) straight from Acrobat. It thereby implicitly acknowledges both 1) the importance of a certain series of code points called Unicode, and 2) Acrobat as the main viewing implementation.

             

            Anyway, I'm trying out the dummy font route now.

            • 3. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
              Test Screen Name Most Valuable Participant

              Perhaps some insight into how the PDF and PDF/A committees decided this will help. PDF/A was not inventing new things, but accommodating existing practice and trying to forbid only things that were a problem.

               

              So, it was well known that OCR applications used non-embedded Latin1 fonts with Tr 3. It is clear this is not a rendering issue, and countless millions of such files existed, so it was allowed. It was not for the PDF/A committee to invent a new way to have non-embedded non-Latin1 fonts for Unicode OCR. That would be for the PDF committee to innovate. They (the PDF committee) have, however, moved ever further away from non-embedded fonts; the advantages of non-embedded fonts TO END USERS are pretty negligible, even though they make the developer's work easier.

              • 4. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
                gerbenvos Level 1

                As for /Identity-H: /Encoding is a required parameter, and something like /Identity-H is the easiest way to fulfil that requirement.

                • 5. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
                  Test Screen Name Most Valuable Participant

                  But it's a required parameter with meaning. Encoding is extremely relevant for both embedded and non-embedded fonts. So it must be right, not just there... and it cannot be right for a substitution font...

                  • 6. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
                    gerbenvos Level 1

                    Embedding the font in this case is mostly a waste of bytes. They are cheap nowadays though.

                    • 7. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
                      gerbenvos Level 1

                      Encoding is indeed very relevant, as Leonard Rosenthol explained in an answer to this related post which I found shortly after posting my question: Using a CMap with a non-embedded font. At the moment I read that, it was clear to me that my method wasn't going to work.

                      • 8. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
                        gerbenvos Level 1

                        Well, I made it work using a dummy font containing only a .notdef glyph and a second glyph, empty but with a width and height. The code is mostly similar to what I posted above, except that the CIDToGIDMap now maps all 65536 two-byte codes to that single empty glyph. I could reduce the font (or rather, font-like data structure) to a size of only 280 bytes (176 bytes compressed), because according to the PDF spec only a small number of font tables are actually required, and particularly the cmap table can be left out. Actually, the character mapping data for the CIDToGIDMap compresses so well that I could have deflated it a second time to shave off some more bytes (I didn't).
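
                        The CIDToGIDMap described above can be sketched in a few lines of Python. A CIDToGIDMap stream holds one two-byte big-endian glyph index per CID, so mapping all 65536 codes to one glyph is a 131072-byte run of the same two bytes. The function name is hypothetical, and glyph index 1 is an assumption based on the description (glyph 0 being .notdef and the second glyph being the empty one).

```python
import zlib


def single_glyph_cid_to_gid_map(gid: int = 1) -> bytes:
    """Map every two-byte CID (0x0000..0xFFFF) to the same glyph index."""
    return gid.to_bytes(2, "big") * 65536


raw = single_glyph_cid_to_gid_map()
once = zlib.compress(raw, 9)    # what a single /FlateDecode filter would store
twice = zlib.compress(once, 9)  # the "deflate it a second time" idea

# Print the three sizes to compare; exact values depend on the zlib build.
print(len(raw), len(once), len(twice))
```

                        Storing the doubly compressed form would presumably require listing /FlateDecode twice in the stream's /Filter array, which trades a few bytes for an extra decoding pass in the viewer.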

                         

                        It's a bit of a hack, but it conforms to the specs and works fine in Adobe Acrobat and all other Windows and Unix PDF readers that I tried; the only one that failed was (a fairly old version of) Mac OS X Preview.

                         

                        The total "overhead" is about 1300 bytes, and I also save a lot of processing on font subsetting. (Again, I don't object to extra development work, but I do want to save processor cycles and storage bytes that are totally unnecessary.)

                        • 9. Re: How to create a PDF/A with invisible Unicode OCR text and no embedded font
                          gerbenvos Level 1

                          It may be interesting to note that the people at Tesseract OCR independently developed essentially the same solution for their searchable PDFs, and also managed to tweak the values inside the font so it works with a few more PDF viewers that are rather picky about their input. See https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp and https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf (for more interesting details, look at the file history, the commits, and the discussion of the related issues).