-
1. Re: How to recover incomplete cmap?
lrosenth May 12, 2014 9:58 AM (in response to PigPigPig)
Printing to PDF is the worst way to get extractable text, since a printer doesn't need that information and the app won't send it.
Instead, just use the "Create Adobe PDF" button that is installed in Word for you by your Acrobat installation.
-
2. Re: How to recover incomplete cmap?
PigPigPig May 12, 2014 10:03 AM (in response to PigPigPig)
BTW, FontFile2 contains only the following tables:
JJEELB+SakkalMajalla ============================
[OS/2, b27ce97c, 2f48, 60]
[cvt , 26fe2829, bc, 210]
[fpgm, 415e9478, 2cc, 706]
[glyf, 21b04ea9, 2fa8, 360]
[head, ec84ea9a, 5ac4, 36]
[hhea, 13f00e17, 9d4, 24]
[hmtx, 98f2d5b4, 9f8, 2352]
[loca, 142726, 3308, 2358]
[maxp, adb03c7, 5aa4, 20]
[name, 6d3f9b26, 5660, 441]
[prep, 17b8ae00, 2d4c, 1f9]
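A listing like the one above can be reproduced by reading the sfnt table directory at the start of the FontFile2 stream. The sketch below is a minimal illustration, assuming the stream has already been decompressed into `data`; the function names `list_sfnt_tables` and `missing_tables` are made up for this example. Each entry in the listing appears to be tag, checksum, offset, length in hex (the offsets and lengths chain consistently), and there is no `cmap` entry among them.

```python
import struct

def list_sfnt_tables(data: bytes):
    """Return [(tag, checksum, offset, length)] from an sfnt table directory."""
    # Offset table: 4-byte sfnt version, then numTables as a big-endian uint16.
    _version, num_tables = struct.unpack(">4sH", data[:6])
    tables = []
    for i in range(num_tables):
        # Table records start at byte 12 and are 16 bytes each.
        start = 12 + 16 * i
        tag, checksum, offset, length = struct.unpack(
            ">4sIII", data[start:start + 16]
        )
        tables.append((tag.decode("latin-1"), checksum, offset, length))
    return tables

def missing_tables(data: bytes, wanted):
    """Return the tags from `wanted` that the font program does not contain."""
    present = {tag for tag, _, _, _ in list_sfnt_tables(data)}
    return [tag for tag in wanted if tag not in present]
```

Calling `missing_tables(data, ["cmap", "GSUB"])` on the decoded stream would show whether either table survived embedding.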
-
3. Re: How to recover incomplete cmap?
PigPigPig May 12, 2014 10:09 AM (in response to lrosenth)
I develop an app that extracts text from PDF files, and I can't control how my customers generate them. Do you mean there is no solution to my problem?
-
4. Re: How to recover incomplete cmap?
lrosenth May 12, 2014 11:48 AM (in response to PigPigPig)
Ah - misunderstood the question.
If the mappings aren't provided in the ToUnicode, then you COULD go back to the font (but doing so is outside the spec) or you fall back to the notdef glyph/encoding (as per spec).
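The lookup order described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `resolve_text` and `glyph_to_unicode` are hypothetical names, the character code is assumed to equal the glyph index (an Identity CIDToGIDMap), and treating a U+FFFD entry as "no mapping" is a choice made for this example, not something the spec prescribes. Glyph indices in an embedded subset may also not match those of an installed copy of the font, so the font-based guess can be wrong.

```python
REPLACEMENT = "\ufffd"

def resolve_text(code, to_unicode, glyph_to_unicode=None):
    """Map a character code to text, preferring the ToUnicode CMap."""
    mapped = to_unicode.get(code)
    if mapped is not None and mapped != REPLACEMENT:
        # ToUnicode gave a usable answer.
        return mapped
    if glyph_to_unicode is not None:
        # Outside the spec: guess from a reversed font cmap
        # (assumes code == glyph index).
        guess = glyph_to_unicode.get(code)
        if guess is not None:
            return guess
    # Nothing usable: report the replacement character.
    return REPLACEMENT
```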
-
5. Re: How to recover incomplete cmap?
PigPigPig May 12, 2014 12:32 PM (in response to lrosenth)
I thought I might be able to recover an incomplete cmap from the TrueType font tables stored in the FontDescriptor's FontFile2, such as GSUB and cmap, or from other PDF objects I don't know about. I now understand that this is impossible.
I used the "Create PDF" button to export a PDF file, and now the ToUnicode is incorrect: CIDs 0284 and 06B4 both map to the same Unicode value, U+FFFD. What happened?
BT
/P <</MCID 0 >>BDC
/C2_0 1 Tf
24 -0 0 24 513.84 764.04 Tm
<0284>Tj
0.495 0.592 Td
<0551>Tj
-0.168 -0.592 Td
<06B4>Tj
0.4 0.422 Td
<0551>Tj
-0.12 -0.422 Td
<024F>Tj
EMC
/Span <</MCID 1 >>BDC
/TT0 1 Tf
12 -0 0 12 511.9636 764.04 Tm
( )Tj
EMC
ET
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<024F> <FECB>
<0551> <064E>
<0284> <FFFD>
<06B4> <FFFD>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
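For anyone checking a ToUnicode CMap like this one programmatically, here is a minimal sketch that reads the `bfchar` entries and lists the codes mapped to U+FFFD. It assumes the CMap stream has already been decoded to text, handles only `beginbfchar` blocks (not `beginbfrange`), and the names `parse_bfchar` and `unmapped_codes` are made up for this example.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Return {code: text} for every bfchar entry in a ToUnicode CMap."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            # Destination strings in a ToUnicode CMap are UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def unmapped_codes(mapping):
    """Return the codes whose mapping contains U+FFFD, in ascending order."""
    return sorted(code for code, text in mapping.items() if "\ufffd" in text)

SAMPLE = """
4 beginbfchar
<024F> <FECB>
<0551> <064E>
<0284> <FFFD>
<06B4> <FFFD>
endbfchar
"""

if __name__ == "__main__":
    for code in unmapped_codes(parse_bfchar(SAMPLE)):
        print(f"{code:04X} has no usable Unicode mapping")
```

Run on the four entries above, this flags 0284 and 06B4, the two codes the question is about.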
-
6. Re: How to recover incomplete cmap?
lrosenth May 12, 2014 12:51 PM (in response to PigPigPig)
Most PDF production tools will not include the GSUB table in the embedded font, because it's not required by the spec and not used by the renderer. So you shouldn't expect to find it.
As for the U+FFFD mappings, I don't know - I'd need to see the original document and the final PDF to understand.
-
7. Re: How to recover incomplete cmap?
PigPigPig May 12, 2014 1:11 PM (in response to lrosenth)
The original document? I don't know whether you have the Sakkal Majalla font installed on your system; without it, the docx file won't display correctly. Can I attach a PDF and a Word file to a reply?
-
8. Re: How to recover incomplete cmap?
lrosenth May 12, 2014 2:31 PM (in response to PigPigPig)
Assuming you mean http://www.microsoft.com/typography/fonts/family.aspx?FID=375 - I can easily install it.
Just post the files on your favorite sharing site and post the links.
-
9. Re: How to recover incomplete cmap?
PigPigPig May 13, 2014 5:40 AM (in response to lrosenth)
Yes, that is the font I am using in my Word file. Please check the following files. Thanks.


