I have a question about how PDF stores Arabic letters. When I extracted Arabic letters from a PDF file, I saw a different code for each letter, not the original Unicode values for Arabic letters in the range 06xx. Why is that?
I would appreciate any answer to this question.
Arabic characters are stored in the same fashion as any other characters.
In order for text extraction to work, sufficient encoding information must be present. If text extraction fails, some of the encoding information is either missing, incomplete or incorrect.
Details are described in the PDF Reference. Be warned that character encoding is one of the more challenging aspects of PDF syntax.
Thanks, Olaf and Test Screen Name, for the replies, but I need more help.
I already have the page stream for a PDF file with Arabic text, and in it I see a different code for each Arabic character, such as 038F, 0396, 00AC and 03EA. These fall in the ranges of other scripts, not Arabic.
This is a sample of the page stream output for a one-page PDF file with Arabic text:
/F1 14.04 Tf
1 0 0 1 477.34 707.14 Tm
EMC EMC /P <</MCID 1/Lang (ar-EG)>> BDC BT
/F2 14.04 Tf
1 0 0 1 540.1 707.14 Tm
[( )] TJ
EMC /TagSuspect <</TagSuspect /Ordering >>BDC /P <</MCID 2/Lang (ar-EG)>> BDC BT
The Unicode range for Arabic characters is 0600 to 06FF.
Does PDF apply some transformation to the Arabic Unicode values to produce these results in the page stream?
If these codes are not Unicode, how can I get the Arabic characters from them?
I await your reply.
Thanks, Olaf, for this important information.
I know from searching that PDF maps Arabic characters using something called a 'CMap',
which maps each character code to its original Unicode value. My question is: where can I find this map, so that I can convert a character code to its original Unicode value?
As asked before – what parts of the specification are not clear to you? Everything you need to know is in there but we're happy to explain things more if you can pinpoint which parts are confusing.
Creating one for an existing document is quite difficult if not impossible – depends a LOT on what you are given and what extra information you have at your disposal.
Since you are a developer new to this area, I feel I should mention two particular problems people have:
1. They assume it will be an easy or quick project. No, it won't, and you have to spend days or weeks studying very small details of the specification and its references.
2. They assume it will be possible. There are many files for which text extraction is impossible. Here is a simple test: try to copy/paste from Acrobat. If it does not copy for Adobe with 20 years of development, it is fair to guess it won't extract for you.
The PDF specification covers the ToUnicode CMap only in general terms, without details and without explaining how to get this CMap.
I am working on extracting text from PDF using a Java library called iText. iText gets the page stream for each PDF page with Arabic text; the page stream has codes for the Arabic characters, and I want to get the corresponding Unicode values.
My questions are: where is the CMap?
Is the CMap in the page stream or not?
If not, where does PDF store the CMap?
Are there PDF files without a CMap?
I await your reply.
Thank you,
Then perhaps you should ask the iText people. They offer support for their library.
Short answer for you, however – you are going about doing text extraction ALL WRONG! The page stream is INCOMPLETE! In order to understand the page stream, you need to reference all sorts of other objects in the PDF, including Fonts, Encodings, CMaps and more.
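To illustrate why the page stream alone is not enough, here is a minimal Python sketch of the lookup chain from a font resource name such as /F1 to that font's ToUnicode entry. It models PDF objects as plain dicts, which is an assumption made purely for illustration: a real file needs a PDF parser, resources can be inherited from parent page tree nodes, and ToUnicode is an indirect reference to a stream. The function name, the text-showing operator in the sample content, and the font entries are all hypothetical.

```python
# Toy model: PDF objects as plain dicts (hypothetical data, not a real parser).
page = {
    # Illustrative content stream; the hex string is made up.
    "Contents": "BT /F1 14.04 Tf <038F0396> Tj ET",
    "Resources": {
        "Font": {
            # /F1 carries a ToUnicode entry, /F2 does not.
            "F1": {
                "Subtype": "Type0",
                "Encoding": "Identity-H",
                "ToUnicode": "(CMap stream text would go here)",
            },
            "F2": {"Subtype": "TrueType", "Encoding": "WinAnsiEncoding"},
        }
    },
}


def find_to_unicode(page, font_name):
    """Return the ToUnicode CMap text for a font resource name, or None.

    The name used by the Tf operator (for example F1) is only a key into
    the page's /Resources /Font dictionary; the mapping data lives in the
    font dictionary, not in the page stream.
    """
    fonts = page.get("Resources", {}).get("Font", {})
    font = fonts.get(font_name)
    if font is None:
        raise KeyError(f"font resource /{font_name} not found on this page")
    # None means this font has no ToUnicode CMap at all.
    return font.get("ToUnicode")


print(find_to_unicode(page, "F1") is not None)
print(find_to_unicode(page, "F2"))
```

A result of None corresponds to the case where the font has no ToUnicode CMap and other encoding information has to be used instead, if any exists.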
There are two kinds of CMap. Neither one is in the page stream. Both types may have a job in text extraction.
32000-1 says exactly where to find a ToUnicode CMap. Can you point out where the information becomes unclear?
Many files have no ToUnicode CMap, and most do not have the other kind either.
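As a rough illustration of what a ToUnicode CMap does once you have located it, the Python sketch below parses the bfchar and bfrange sections of a CMap and uses the result to decode two-byte character codes. The sample CMap is invented: the source codes are borrowed from the question above, but the Arabic values they map to are made up for the example and do not come from any real file. The function names are hypothetical, the parser assumes two-byte codes, and it ignores codespace ranges and other details the specification covers.

```python
import re

# Invented ToUnicode CMap; the target values are assumptions for illustration.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<038F> <0627>
<0396> <0644>
endbfchar
1 beginbfrange
<03E8> <03EA> <0633>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def utf16(hex_digits):
    """Decode a ToUnicode destination (UTF-16BE hex digits) to a string."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_to_unicode(cmap_text):
    """Build a dict from character code (int) to Unicode string."""
    mapping = {}
    # bfchar: one source code maps to one destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = utf16(dst)
    # bfrange: a run of codes maps to consecutive values or to an array.
    entry = HEX + r"\s*" + HEX + r"\s*(<[0-9A-Fa-f]+>|\[[^\]]*\])"
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(entry, block):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if dst.startswith("["):
                # Array form: one destination string per code in the range.
                for code, item in zip(codes, re.findall(HEX, dst)):
                    mapping[code] = utf16(item)
            else:
                # Single form: increment the last character for each code.
                base = utf16(dst[1:-1])
                for offset, code in enumerate(codes):
                    mapping[code] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping


def decode_hex_string(hex_string, mapping, code_bytes=2):
    """Decode a hex string of fixed-width codes; unknown codes become U+FFFD."""
    raw = bytes.fromhex(hex_string)
    chars = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)


mapping = parse_to_unicode(SAMPLE_CMAP)
print([hex(ord(c)) for c in decode_hex_string("038F039600AC03EA", mapping)])
```

Codes with no entry in the map, such as 00AC in this invented sample, come back as U+FFFD. That is the situation described above: when a file has no usable ToUnicode CMap, the codes cannot be turned back into Unicode this way.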
It's all in the PDF Specification (ISO 32000). You need to read and
understand the complete document. Just like with a legal document, every
single word is important in a specification. The information about
text/fonts/CMaps/ToUnicode tables is in chapter 9. Based on my experience,
reading this chapter once is not sufficient, you have to go back again and
again until you really understand how things work. Again, all the
information is there, and with the help of some sample PDF files, you
should be able to work your way through this pretty dry document.
As Test Screen Name already suggested, working with the Acrobat API is a
complicated job, not only do you need to learn about the API, you also need
a very good grounding in the PDF specification.
Again, based on my experience, you need to plan for about half a year to
get familiar with how the Acrobat API and PDF work. If you are not willing
or able to spend that much time, stop right now and let somebody who has
this experience finish your project. I see many PDF files that get created
by software that only "kind of" implements the PDF spec, and people are
usually not happy when these files don't work the way other PDF files do
(e.g. they don't print right, or when used with other PDF tools end up
corrupting other PDF files, ...)
Karl Heinz Kremer
PDF Acrobatics Without a Net
PDF Software Development, Training and More...