
Arabic letters inside pdf!

Jun 25, 2013 1:02 AM

Hello All,

Please, I have a question about how PDF stores Arabic letters. When I extracted Arabic letters from a PDF file, I saw a different code for each letter, not the original Unicode values for Arabic letters in the 06## range. Why?

Please, I need an answer to this question!

 
Replies
  • Currently Being Moderated
    Jun 25, 2013 3:05 AM   in reply to Mohammed_Mostafa

    Arabic characters are stored in the same fashion as any other characters.

     

    In order for text extraction to work, sufficient encoding information must be present. If text extraction fails, some of the encoding information is missing, incomplete, or incorrect.

     

    Details are described in the PDF Reference. Be warned that character encoding is one of the more challenging aspects of PDF syntax.

     

    Olaf

     
  • Currently Being Moderated
    Jun 25, 2013 3:12 AM   in reply to Mohammed_Mostafa

    And character codes in a page stream are not Unicode; that would be far too simple.

     
  • Currently Being Moderated
    Jun 25, 2013 5:07 AM   in reply to Mohammed_Mostafa

    Is there a ToUnicode CMap? The page stream is only the first of many complex steps; see the section on text extraction in ISO 32000-1.
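To make the role of a ToUnicode CMap concrete, here is a minimal sketch in Python. It assumes the CMap stream has already been located and decompressed, that character codes are two bytes wide, and that only the simple `bfchar` and `bfrange` forms appear (the array form of `bfrange` is not handled). The sample CMap and its code values are invented for illustration, and the helper names `parse_tounicode` and `decode_hex_string` are hypothetical.

```python
import re

# Invented sample: a ToUnicode CMap as it might look after decompression.
SAMPLE_CMAP = """\
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0392> <0633>
endbfchar
1 beginbfrange
<03A0> <03A2> <0644>
endbfrange
endcmap
"""

PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
TRIPLE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} from bfchar/bfrange sections."""
    mapping = {}
    # bfchar: each entry is <source code> <destination as UTF-16BE>.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in PAIR.findall(section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: <first code> <last code> <destination for first code>;
    # later codes in the range get consecutive destination values.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in TRIPLE.findall(section):
            first = int(dst, 16)
            width = len(dst) // 2
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                value = (first + offset).to_bytes(width, "big")
                mapping[code] = value.decode("utf-16-be")
    return mapping


def decode_hex_string(hex_operand, mapping, code_bytes=2):
    """Decode a hex string operand; codes without a mapping become U+FFFD."""
    raw = bytes.fromhex(hex_operand)
    chars = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)
```

With this invented CMap, the code 0x0392 in the page stream corresponds to U+0633, which is the kind of mismatch the original question describes: the stream holds font-specific codes, and the 06## values only appear after the mapping is applied.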

     
  • Currently Being Moderated
    Jun 25, 2013 6:47 AM   in reply to Mohammed_Mostafa

    The section I mentioned: which paragraphs are not clear?

     
  • Currently Being Moderated
    Jun 25, 2013 7:52 AM   in reply to Mohammed_Mostafa

    All of the details of where to find the ToUnicode CMap, and how to decode characters in a PDF in general, can be found in ISO 32000-1. You have a copy, yes?

     
  • Currently Being Moderated
    Jun 25, 2013 8:18 AM   in reply to Mohammed_Mostafa
     
  • Currently Being Moderated
    Jun 26, 2013 6:50 AM   in reply to Mohammed_Mostafa

    As asked before – what parts of the specification are not clear to you?  Everything you need to know is in there but we're happy to explain things more if you can pinpoint which parts are confusing.

     

    Creating one for an existing document is quite difficult if not impossible – depends a LOT on what you are given and what extra information you have at your disposal.

     
  • Currently Being Moderated
    Jun 26, 2013 7:04 AM   in reply to lrosenth

    For a developer new to this area, I feel I should mention two particular problems people have:

    1. They assume it will be an easy or quick project. No, it won't, and you have to spend days or weeks studying very small details of the specification and its references.

    2. They assume it will be possible. There are many files for which text extraction is impossible. Here is a simple test: try to copy/paste from Acrobat. If it does not copy for Adobe with 20 years of development, it is fair to guess it won't extract for you.

     
  • Currently Being Moderated
    Jun 26, 2013 7:39 AM   in reply to Mohammed_Mostafa

    Then perhaps you should ask the iText people.  They offer support for their library.

     

    Short answer for you, however – you are going about doing text extraction ALL WRONG! The page stream is INCOMPLETE! In order to understand the page stream, you need to reference all sorts of other objects in the PDF, including Fonts, Encodings, CMaps and more.
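As an illustration of why the page stream alone is not enough, the sketch below (Python) decodes the same hex string under two different fonts. Everything in it is invented for the example: the content stream, the font resource names `F1` and `F2`, and the per-font tables, which stand in for whatever the font objects referenced from the page's resources would supply. It also assumes two-byte codes and only handles hex strings shown with `Tj`.

```python
import re

# Hypothetical per-font code-to-Unicode tables. In a real file these would be
# derived from the font objects (for example their ToUnicode CMaps), not
# hard-coded.
FONT_TABLES = {
    "F1": {0x0003: " ", 0x0392: "\u0633"},
    "F2": {0x0003: " ", 0x0392: "\u0628"},
}

# Invented content stream: the same code <0392> is shown under two fonts.
CONTENT = b"BT /F1 12 Tf <0392> Tj /F2 12 Tf <0392> Tj ET"

# Matches either a font selection (/Name size Tf) or a hex string shown with Tj.
TOKEN = re.compile(rb"/(\w+)\s+[\d.]+\s+Tf|<([0-9A-Fa-f]+)>\s*Tj")


def extract(content, font_tables):
    """Decode Tj hex strings using the table of the currently selected font."""
    out = []
    current_font = None
    for match in TOKEN.finditer(content):
        if match.group(1):
            # A Tf operator: remember which font resource is now active.
            current_font = match.group(1).decode("ascii")
        else:
            raw = bytes.fromhex(match.group(2).decode("ascii"))
            table = font_tables.get(current_font, {})
            for i in range(0, len(raw), 2):  # assumption: 2-byte codes
                code = int.from_bytes(raw[i:i + 2], "big")
                out.append(table.get(code, "\ufffd"))
    return "".join(out)
```

Here the identical bytes `<0392>` come out as two different letters depending on the active font, and with no font information at all (`extract(CONTENT, {})`) every code falls back to U+FFFD.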

     
  • Currently Being Moderated
    Jun 26, 2013 8:19 AM   in reply to Mohammed_Mostafa

    There are two kinds of CMap. Neither one is in the page stream. Both types may have a job in text extraction.

     

    ISO 32000-1 says exactly where to find a ToUnicode CMap. Can you point out where the information becomes unclear?

     

    Many files have no ToUnicode CMap, and most do not have the other kind either.

     
  • Currently Being Moderated
    Jun 26, 2013 8:34 AM   in reply to Test Screen Name

    It's all in the PDF Specification (ISO 32000). You need to read and understand the complete document. Just like with a legal document, every single word is important in a specification. The information about text/fonts/CMaps/ToUnicode tables is in chapter 9. Based on my experience, reading this chapter once is not sufficient; you have to go back again and again until you really understand how things work. Again, all the information is there, and with the help of some sample PDF files, you should be able to work your way through this pretty dry document.

    As Test Screen Name already suggested, working with the Acrobat API is a complicated job: not only do you need to learn about the API, you also need a very good grounding in the PDF specification.

    Again, based on my experience, you need to plan for about half a year to get familiar with how the Acrobat API and PDF work. If you are not willing or able to spend that much time, stop right now and let somebody who has this experience finish your project. I see many PDF files that get created by software that only "kind of" implements the PDF spec, and people are usually not happy when these files don't work the way other PDF files do (e.g. they don't print right, or when used with other PDF tools end up corrupting other PDF files, ...)

     

     

     

    Karl Heinz Kremer
    PDF Acrobatics Without a Net
    PDF Software Development, Training and More...
    khk@khk.net
    http://www.khkonsulting.com

     

     

     

