-
1. Re: Example font subset with non-unicode output
Test Screen Name Mar 20, 2014 5:57 AM (in response to OleK)
I don't really understand the question. ToUnicode is an add-on for text extraction. It has no connection to the other font parameters, and removing it will not affect PDF display in any way. It will only reduce the likelihood of good text extraction.
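As a rough illustration of the text-extraction role described above, the sketch below generates the `bfchar` section of a ToUnicode CMap, which pairs each character code with the Unicode text it should extract as. The function name `to_unicode_bfchar` and the sample mapping are assumptions made for this example; a complete ToUnicode stream also needs the surrounding CMap header, codespace range and trailer, which are omitted here.

```python
def to_unicode_bfchar(code_to_text):
    """Return the bfchar section of a ToUnicode CMap for 2-byte codes.

    code_to_text maps a character code (int, 0..0xFFFF) to the text a
    reader should extract for it. Hypothetical helper, not a full CMap.
    """
    lines = [f"{len(code_to_text)} beginbfchar"]
    for code in sorted(code_to_text):
        # The destination text is written as UTF-16BE hex digits.
        dst = code_to_text[code].encode("utf-16-be").hex().upper()
        lines.append(f"<{code:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Sample mapping (made up): codes 1 and 2 extract as "a" and "b".
print(to_unicode_bfchar({1: "a", 2: "b"}))
```

Nothing in this table changes how glyphs are drawn; it only affects what a reader reports when text is copied or searched.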
-
2. Re: Example font subset with non-unicode output
Test Screen Name Mar 20, 2014 5:59 AM (in response to Test Screen Name)
It also does not mean that the font, or the text to display, is in UTF-16 encoding. The font as declared will have text data in the page stream containing CIDs, not Unicode.
-
3. Re: Example font subset with non-unicode output
OleK Mar 20, 2014 7:47 AM (in response to Test Screen Name)
Hi,
Take the TTF font "Arial", for instance. This font has several cmap tables.
So let's say I decide to use the cmap for Unicode UCS-2.
Then I subset the font to only those characters which are used in the PDF.
Finally, the output (right after the Tf command) should be encoded as ISO-8859-1.
I want all of this in a composite Type0 font, not in a TrueType font.
What are the required/dependent objects? What does the PDF skeleton look like?
-
4. Re: Example font subset with non-unicode output
Test Screen Name Mar 20, 2014 7:57 AM (in response to OleK)
Sorry, I don't see what your requirement has to do with ToUnicode.
You would not be using the font's "cmap" table to display, you would use a "CMap" and these are not the same thing and do not map. You would not code UCS2 data in the page stream. You would be using Identity-H and encoding the data as CIDs (which is the same thing as GIDs in the font).
-
5. Re: Example font subset with non-unicode output
lrosenth Mar 20, 2014 7:59 AM (in response to OleK)
You appear to be mixing up a few things - which is why font subsetting in PDF is hard.
There are the original font's encodings (as found in cmaps, for example).
There is the encoding of the subset data itself (which for CID fonts is almost always a custom encoding).
There is what you put in your "draw string" commands.
In the process of subsetting, the embedded data is custom encoded and the values written in the draw string match that encoding (and look nothing like a standard encoding). For example, if you have a subset with 'a', 'b', 'c' and 'd' using IDs 1-4 respectively, and you wanted to draw the string 'bad', then you would write
(\2\1\4)Tj
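The same idea can be sketched in code. This is only an illustration, assuming a Type0 font with /Encoding /Identity-H, where each ID is written as two bytes, so the operand is emitted as a hex string rather than with one-byte escapes. The helper names `assign_subset_ids` and `show_text_operand` are invented for this example.

```python
def assign_subset_ids(used_chars):
    """Give each distinct character an ID in order of first appearance.

    IDs start at 1 because 0 is conventionally the .notdef glyph.
    """
    ids = {}
    for ch in used_chars:
        if ch not in ids:
            ids[ch] = len(ids) + 1
    return ids


def show_text_operand(text, ids):
    """Encode text as 2-byte codes in a PDF hex string followed by Tj."""
    return "<" + "".join(f"{ids[ch]:04X}" for ch in text) + "> Tj"


ids = assign_subset_ids("abcd")        # a=1, b=2, c=3, d=4
print(show_text_operand("bad", ids))   # codes 2, 1, 4
```

The bytes in the string are subset IDs, so they have no resemblance to ASCII or Unicode values for the letters being drawn.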
-
6. Re: Example font subset with non-unicode output
OleK Mar 20, 2014 8:32 AM (in response to lrosenth)
Hi lrosenth,
that clears up many things in my mind... thank you!
OK, so when I take Arial now and subset it, what should the PDF look like when using a /Type0 font (including the CIDFontType2 descendant, of course)?
The "draw string" was a mistake, I meant "Tj"
Thank you
-
7. Re: Example font subset with non-unicode output
lrosenth Mar 20, 2014 8:35 AM (in response to OleK)
As described in ISO 32000-1:2008
-
8. Re: Example font subset with non-unicode output
OleK Mar 20, 2014 8:40 AM (in response to lrosenth)
Mhm.. ok but what is wrong with this?
% Text output
6 0 obj
<< /Length 53 >> stream
BT 20.000 798.260 Td /F1 10.0 Tf (Lorem ipsum) Tj ET
endstream
endobj
7 0 obj
[5 0 R /Fit]
endobj
% Arial as a composite font
8 0 obj
<</Type /Font /Subtype /Type0 /BaseFont /AAAAAD+Arial /Encoding /Identity-H /DescendantFonts [9 0 R] >>
endobj
% DescendantFonts
9 0 obj
<</Type /Font /Subtype /CIDFontType2 /BaseFont /AAAAAD+Arial /CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >> /FontDescriptor 11 0 R /W [ 51 [556] 11380 [500] ] /CIDToGIDMap 10 0 R >>
endobj
% CID To GID mapping
10 0 obj
<< /Length 131072>>
stream
[...]
endstream
endobj
% font descriptor
11 0 obj
<< /Type /FontDescriptor /Flags 32 /FontName /AAAAAD+Arial /StemV 70 /Ascent 1854 /Descent -434 /FontBBox [-664 -324 2000 1039] /ItalicAngle 0.0 /FontFile2 12 0 R >>
endobj
% font subset of arial
12 0 obj
<< /Length1 32500 /Length 32500 >> stream
[...]
endstream
endobj
-
9. Re: Example font subset with non-unicode output
lrosenth Mar 20, 2014 8:43 AM (in response to OleK)
Without the actual font data - no idea what's wrong. You only provided a very limited bit of PDF.
-
10. Re: Example font subset with non-unicode output
OleK Mar 20, 2014 8:46 AM (in response to lrosenth)
Here you can find the complete pdf document
-
11. Re: Example font subset with non-unicode output
OleK Mar 20, 2014 9:27 AM (in response to OleK)
Well, I finally know what I am doing wrong.. but I have no idea how to fix it the way I want to...
As the PDF Reference describes:
If the TrueType font program is embedded, the Type 2 CIDFont dictionary
must contain a CIDToGIDMap entry that maps CIDs to the glyph indices for the
appropriate glyph descriptions in that font program.
Because I have 1-byte (UTF-8 or ANSI) text in the "() Tj" command, it does not display the glyphs properly.
When I change it so that UTF-16BE is inserted into "() Tj", it will work.. Guaranteed.
Now to the question! How do I arrange the CIDToGIDMap so that it will work for UTF-8 or any other 1-byte encoding?
-
12. Re: Example font subset with non-unicode output
Test Screen Name Mar 20, 2014 9:27 AM (in response to OleK)
I don't know whether the font data is valid or not, but you certainly are not passing CIDs to Tj. You are just passing 1-byte ANSI text.
-
13. Re: Example font subset with non-unicode output
Test Screen Name Mar 20, 2014 9:30 AM (in response to Test Screen Name)
Why not just use CIDToGIDMap Identity, and use CIDs? You seem determined to make the font fit what you want to put in Tj, rather than put in Tj what works with the font!
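For readers unsure what a CIDToGIDMap stream holds when it is not /Identity, here is a minimal sketch. It relies on the layout given in the PDF specification: the glyph index for CID c is stored as two bytes at offsets 2*c and 2*c + 1, high-order byte first. The function name `build_cid_to_gid_map` and the sample CID/GID pairs are assumptions for illustration only.

```python
def build_cid_to_gid_map(cid_to_gid, num_cids=65536):
    """Build the raw (uncompressed) bytes of a CIDToGIDMap stream.

    cid_to_gid maps CID -> glyph index in the embedded TrueType program.
    CIDs that are not listed map to glyph 0 (.notdef).
    """
    table = bytearray(2 * num_cids)
    for cid, gid in cid_to_gid.items():
        if not (0 <= cid < num_cids and 0 <= gid <= 0xFFFF):
            raise ValueError(f"CID {cid} or GID {gid} out of range")
        table[2 * cid] = gid >> 8          # high-order byte
        table[2 * cid + 1] = gid & 0xFF    # low-order byte
    return bytes(table)


# Made-up pairs: CID 51 -> glyph 3, CID 300 -> glyph 258.
stream = build_cid_to_gid_map({51: 3, 300: 258})
print(len(stream))  # 131072 bytes for a full 65536-entry table
```

With /CIDToGIDMap /Identity no stream is needed at all, because each CID is taken directly as the glyph index.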
-
14. Re: Example font subset with non-unicode output
OleK Mar 21, 2014 1:13 AM (in response to Test Screen Name)
Hi Test Screen Name,
that is exactly what I wanted, I guess.. I want to match CIDToGID, but for ANSI chars...
Currently the formula for CIDToGID in the PDF writer is the following, which is two-byte Unicode, I guess.
if ($this->isUnicode) {
    // Store the glyph index as two big-endian bytes at offset 2 * $char.
    if ($char >= 0 && $char < 0xFFFF && $glyphIndex) {
        $cidtogid[$char * 2] = chr($glyphIndex >> 8);
        $cidtogid[$char * 2 + 1] = chr($glyphIndex & 0xFF);
    }
}
Now I want to get it working for 1 byte, so that the output for "() Tj" contains ANSI text characters.
I don't know if this is even possible, or whether I am totally mixed up now.
-
15. Re: Example font subset with non-unicode output
OleK Mar 21, 2014 1:31 AM (in response to OleK)
I think I need a closer look at character codes, glyphs and CIDs. Does anyone know of good examples?
-
16. Re: Example font subset with non-unicode output
Test Screen Name Mar 21, 2014 1:38 AM (in response to OleK)
Stop trying to use ANSI characters in Tj. What makes you believe this is correct or valid?
-
17. Re: Example font subset with non-unicode output
OleK Mar 21, 2014 1:42 AM (in response to Test Screen Name)
But when I use the core fonts, like Helvetica, all the text written in Tj is ANSI, isn't it?
So when I now use a font which does not provide Unicode (like Arial - NOT ARIALUNI), why should I convert it into UTF-16BE?
-
18. Re: Example font subset with non-unicode output
Test Screen Name Mar 21, 2014 1:49 AM (in response to Test Screen Name)
And by the way, that code you posted in reply #14 looks like a good start for working with CIDs/GIDs. What makes you guess it is Unicode? I think you are perhaps focussing too much on Unicode; it is almost completely irrelevant to the task of writing the PDF, and Unicode is almost never written to page streams.
-
19. Re: Example font subset with non-unicode output
Test Screen Name Mar 21, 2014 1:52 AM (in response to OleK)
The byte values in a string are what the font requires, according to its Encoding (1-byte fonts) and combination of CMap/internal CIDs (2-byte fonts). Often you will see ANSI characters for 1-byte fonts, because the Encoding used happens to be WinAnsiEncoding or MacRomanEncoding, which are the same as ANSI for byte values < 127. But, so what? You are not writing a 1-byte font.
"So when I now use a font, which does not provide Unicode (like Arial - NOT ARIALUNI) why I should convert it into UTF-16BE?"
You should not. Why would you, and why do you think that Unicode is involved? Are you just assuming that when you see a 2-byte value that it is 2-byte Unicode? Never assume, read the spec!!
-
20. Re: Example font subset with non-unicode output
lrosenth Mar 21, 2014 1:58 AM (in response to Test Screen Name)
Just to add to this excellent explanation.
A CID font is a TWO BYTE FONT. So by choosing to use one, you are requiring the use of two bytes per glyph.
If you only want a single byte per glyph, use a simple TrueType font with a standard encoding (such as WinANSI).
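To see why one-byte text misbehaves in a two-byte font, the sketch below splits a byte string the way a reader would for an Identity-H font: two bytes at a time, high byte first. It is a hedged illustration only; `split_two_byte_codes` is a name invented here, and real readers handle a trailing odd byte in their own ways.

```python
def split_two_byte_codes(data):
    """Split string bytes into 2-byte big-endian codes.

    Returns (codes, leftover), where leftover holds a trailing odd byte
    that cannot form a complete code.
    """
    usable = len(data) - (len(data) % 2)
    codes = [(data[i] << 8) | data[i + 1] for i in range(0, usable, 2)]
    return codes, data[usable:]


codes, leftover = split_two_byte_codes(b"Lorem ipsum")
print([f"{c:04X}" for c in codes])  # 'L','o' pair up as 4C6F, and so on
print(leftover)                     # the final 'm' is left without a partner
```

Eleven bytes of text therefore become five unrelated codes plus a stray byte, none of which are the codes for the intended letters.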
-
21. Re: Example font subset with non-unicode output
olafdruemmer Mar 21, 2014 2:25 AM (in response to OleK)
Text encoding in PDF in essence is for glyph lookup only.
On one end of a processing chain, you have a character code in a text showing operator.
On the other end you have a glyph in a font.
PDF (and, by reference, several font specifications) defines several ways to specify, in the code of a PDF file, how character codes are linked, for a given font resource, to the glyphs in it. As there are a number of ways to do this, it can be very confusing for the novice. To be successful in this field you must fully understand the part of the PDF specification/standard that defines how to put text on a PDF page. If there are parts you do not understand yet - keep reading the spec until you understand it. There is no silver bullet or magic trick to do text in PDF without this.
It might be true, under certain circumstances, and in many cases coincidentally so, that character codes happen to be equivalent to single-byte text codes (as defined by the WinANSI, MacRoman and MacExpert encodings, and in addition through a Differences array and the use of predefined glyph names as defined by the AGL). But it is not smart to hope for this, especially as any of this could - for the purpose of text extraction - be overridden by a ToUnicode table (and a character that according to WinANSI represents "A" would have to be extracted as "B" if the ToUnicode table says so).
So in the interest of writing better code, forget about the notion that character codes have anything deterministic to do with text or Unicode code points, single byte or not. Character codes are just numbers that let you look up a glyph in a font if a certain procedure is followed...
-
22. Re: Example font subset with non-unicode output
OleK Mar 21, 2014 2:27 AM (in response to lrosenth)
Thank you both!!!
One final question to lrosenth...
When I simply use Subtype /TrueType, it requires FirstChar and LastChar to be set, right?
So, let's say I am using Arial as a subset font but with only two characters (single byte of course). The chars are "!" (U+0021) and " ﻼ" (U+FEFC).
So FirstCharacter is set to 33 and LastCharacter is set to 65276, correct?
Does this also mean I have to set the /Widths for all 65243 (65276 - 33)?
What would you suggest I should do in this case?
-
23. Re: Example font subset with non-unicode output
olafdruemmer Mar 21, 2014 2:44 AM (in response to OleK)
Try not to use or think the word "Unicode" or Unicode code points until you have managed to understand how PDF defines the relationship between character codes and glyphs. Unicode does not have any role here. FirstChar and LastChar relate only to the character codes in a Simple Font (where character codes have one byte). I have no idea what you mean by FirstCharacter and LastCharacter (try hard not to be sloppy with keywords!).
I think you do have a lot of reading ahead of you - just grab the PDF spec and read the part on text again. Try to focus on one approach (Simple Font / TrueType in your case), until you've got it.
-
24. Re: Example font subset with non-unicode output
Test Screen Name Mar 21, 2014 2:46 AM (in response to OleK)
No, this is proceeding from wrong assumptions.
For all single byte fonts, EVERY PDF construction deals with 256 values, range 0 to 255.
Let's look at what you say. You mention two Unicode values, and you continue to think that Unicode has some importance or meaning in PDF. No, it doesn't. Perhaps your characters are U+0021 and U+FEFC but the PDF does not care, and nor should you (when writing the PDF). I emphasise: if you are thinking "I will put this Unicode value somewhere in the PDF" and you are dealing with page streams/fonts, you are almost certainly not following the spec.
For a single byte font the mapping from the arbitrary codes in the PDF to the internal font structures has one and only one variable: the Encoding value. The Encoding value, if present (and the situations in which it can be omitted are specific and exact), is a shorthand representation for something which will always be an array of 256 names.
Since TrueType fonts don't contain names, the interpretation of this is complicated for embedded TrueType, and you must take care to follow exactly what the spec says. Other things MAY SOMETIMES WORK without being valid, but will probably fail in different software, so stick to the spec exactly.
There is no substitute for endless, deep and repetitive study of the PDF Reference. It will help if you can phrase your future questions in terms of "what does it mean in the specification when it says...". That will help keep you focussed. I have been working with the PDF specification for many, many years and I have found that whenever I assume anything, I am almost always wrong. Read the spec first, second and last.
Finally, don't be discouraged if you find this hard. Font embedding IS one of the very hardest things to master about PDFs, easily adding an order of magnitude of complexity to any project.
-
25. Re: Example font subset with non-unicode output
OleK Mar 21, 2014 3:11 AM (in response to Test Screen Name)
Thank you all!
I will go through the pdf spec again and try to ask the correct questions then...
For single byte fonts it already helped me to know that max 256 values are "allowed"
I really appreciate your comments
Thank you
-
26. Re: Example font subset with non-unicode output
Test Screen Name Mar 21, 2014 3:15 AM (in response to OleK)
Shortcut for 1-byte fonts.
All of these are 1-byte values with the same meaning:
- FirstChar, LastChar
- Indexes in Widths (offset by FirstChar)
- values in the byte strings passed to Tj etc.
- indexes into the 256 name array generated from Encoding
None applies to Type 0/multibyte fonts.
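The relationship in the list above can be sketched as follows. This is an illustration under stated assumptions, not a recipe from the spec: `simple_font_widths` is an invented helper, the width values are made up, and writing 0 for unused codes inside the range is a common convention rather than a requirement.

```python
def simple_font_widths(widths_by_code):
    """Derive FirstChar, LastChar and Widths for a simple (1-byte) font.

    widths_by_code maps a character code (0..255) to its glyph width in
    1/1000 text space units. Codes inside the range that are unused get
    width 0 in this sketch.
    """
    if not widths_by_code:
        raise ValueError("no character codes given")
    if any(not 0 <= code <= 255 for code in widths_by_code):
        raise ValueError("simple fonts only have codes 0..255")
    first, last = min(widths_by_code), max(widths_by_code)
    # Widths[i] belongs to code FirstChar + i, so it has last - first + 1 entries.
    widths = [widths_by_code.get(code, 0) for code in range(first, last + 1)]
    return first, last, widths


# Made-up widths for codes 33 and 35; code 34 is unused.
print(simple_font_widths({33: 278, 35: 556}))
```

Whatever glyphs a subset contains, each one has to be reachable through one of these 256 codes, so the Widths array can never have more than 256 entries.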




