Mohammed_Mostafa

Paragraph detection in pdf

Aug 22, 2013 12:52 AM

Tags: #pdf

Please, how can I detect paragraphs when reading a PDF file?

 
Replies
  • Aug 22, 2013 1:29 AM   in reply to Mohammed_Mostafa

    Hi Mohammed,

     

    PDF essentially does not have a concept of paragraphs - a page is just a two-dimensional, z-ordered arrangement of graphic objects - with the following exception: if someone created the PDF such that it is adequately tagged using the standard structure types (see chapter 14.8 of ISO 32000-1), then you could detect paragraphs by looking for the P tags in the document (or heading, list, table tags etc. accordingly).
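As a rough illustration of "looking for the P tags", here is a minimal sketch that walks a structure tree and collects the elements whose structure type is P. It assumes your PDF library has already handed you the tree as plain nested dicts and lists with resolved references; the function name `iter_struct_elems` and the hand-built sample tree are hypothetical, and only the /S and /K keys come from the PDF structure model.

```python
from typing import Any, Iterator


def iter_struct_elems(node: Any, wanted: str = "/P") -> Iterator[dict]:
    """Yield every structure element whose /S (structure type) equals `wanted`."""
    if isinstance(node, list):
        for kid in node:
            yield from iter_struct_elems(kid, wanted)
    elif isinstance(node, dict):
        if node.get("/S") == wanted:
            yield node
        # /K holds the kids: a dict, a list, or an integer marked-content ID.
        # Integers (and a missing /K) fall through both branches and are skipped.
        yield from iter_struct_elems(node.get("/K"), wanted)


# Hypothetical, hand-built tree standing in for a parsed /StructTreeRoot.
struct_tree_root = {
    "/Type": "/StructTreeRoot",
    "/K": {
        "/S": "/Document",
        "/K": [
            {"/S": "/H1", "/K": 0},
            {"/S": "/P", "/K": 1},
            {"/S": "/Sect", "/K": [{"/S": "/P", "/K": [2, 3]}]},
        ],
    },
}

paragraphs = list(iter_struct_elems(struct_tree_root))
# Marked-content IDs that tie each paragraph back to the page content.
mcids = [p["/K"] for p in paragraphs]
```

In a real file the structure types may be custom names mapped to the standard ones through the RoleMap, so a plain comparison against "/P" is only a starting point.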

     

    Olaf

     


    --

    Olaf Druemmer | Managing Director | callas software GmbH | Schoenhauser Allee 6/7 | 10119 Berlin

    Tel +49.30.4439031-0 | Fax +49.30.4416402 | o.druemmer@callassoftware.com | www.callassoftware.com

    Charlottenburg District Court, HRB 59615 | Managing Directors: Olaf Drümmer, Ulrich Frotscher

     
  • Aug 22, 2013 2:14 AM   in reply to Mohammed_Mostafa

    Extracting text is DIFFICULT.

     

    You can use guesswork and fuzzy logic as all text extraction software must do.

    On no account place characters into the PDF separately using coordinates from PDF! This cannot work because the layout methods will be different.

     

    You can sort the information received by coordinate to put it into a plausible reading order. You can then use fuzzy logic to guess where spaces might be, which are line breaks and which are paragraph breaks, whether there are columns, and anything else you want to guess and spend a lot of time programming, which the human eye does in half a second (a frustrating realisation).
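To make the guesswork concrete, here is a minimal sketch of one such heuristic: sort fragments by coordinate, group them into lines, and start a new paragraph wherever the vertical gap is clearly larger than the typical line spacing. Everything here is an assumption for illustration: the `Fragment` type, the `group_paragraphs` name, the tolerance values, and the premise that your extractor gives you each text run with its position on a single-column page.

```python
from statistics import median
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text run
    y: float   # baseline; PDF y coordinates grow upward
    text: str


def group_paragraphs(frags, line_tol=2.0, gap_factor=1.5):
    """Group fragments into lines, then split paragraphs at large vertical gaps."""
    # 1. Top-to-bottom ordering; x is only a tie-breaker at this stage.
    ordered = sorted(frags, key=lambda f: (-f.y, f.x))
    # 2. Fragments whose baselines nearly coincide belong to the same line.
    lines = []  # each entry: [baseline_y, [fragments]]
    for f in ordered:
        if lines and abs(lines[-1][0] - f.y) <= line_tol:
            lines[-1][1].append(f)
        else:
            lines.append([f.y, [f]])
    # 3. Within a line, read left to right.
    texts = [" ".join(f.text for f in sorted(fs, key=lambda f: f.x))
             for _, fs in lines]
    ys = [y for y, _ in lines]
    gaps = [upper - lower for upper, lower in zip(ys, ys[1:])]
    if not gaps:
        return [" ".join(texts)] if texts else []
    # 4. A gap well above the median line spacing is guessed to be a paragraph break.
    typical = median(gaps)
    paragraphs = [[texts[0]]]
    for gap, text in zip(gaps, texts[1:]):
        if gap > gap_factor * typical:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


# Made-up fragments, deliberately out of order.
sample = [
    Fragment(72, 664, "A larger gap"),
    Fragment(72, 700, "PDF has no"),
    Fragment(72, 652, "suggests a break."),
    Fragment(140, 700.5, "paragraphs;"),
    Fragment(72, 688, "only glyphs."),
]
result = group_paragraphs(sample)
```

For the sample above, `result` contains two paragraphs. The sketch ignores columns, indentation, font changes and hyphenation, which is exactly where the real effort goes, and a different threshold would be an equally valid guess.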

     
  • Aug 22, 2013 3:02 AM   in reply to Mohammed_Mostafa

    Yes, it is important to know it is difficult, but that is not a reason not to do it.

     

    One important point about the need for fuzzy logic which may be overlooked: it can mean that two valid programs for extracting PDF text can produce different results.

     

    One more note: if a file is tagged, PDF text extraction can become accurate and precise: this is the reason for tagging. A quality PDF text extractor will handle this case too.

     
  • Aug 22, 2013 4:18 AM   in reply to Mohammed_Mostafa

    You will have to read the PDF spec some more (chapter 14), and look at tagged PDF example files (which makes it easier to understand what chapter 14 is really about)!

     

    Starting points:

    • entry Marked in MarkInfo dict in Catalog set to true (please note that this key is in one case in chapter 14.7 in ISO 32000-1 incorrectly stated as "Markings" - this is a typo, and should say "MarkInfo" instead)
    • StructTreeRoot key in Catalog is present and contains suitable data structures (the tagging information is contained under this root entry, and connected to page content through marked content in the content streams, and sometimes OBJR references)
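The two starting points can be sketched as a quick check. This is only an illustration under assumptions: the catalog is taken to be available as a plain dict with references already resolved, and the name `looks_tagged` is hypothetical. Only the /MarkInfo, /Marked, /StructTreeRoot and /K keys come from the PDF spec.

```python
from typing import Any, Mapping


def looks_tagged(catalog: Mapping[str, Any]) -> bool:
    """Return True if the catalog claims tagging and has a non-empty structure tree."""
    # Starting point 1: /MarkInfo dictionary with /Marked set to true.
    mark_info = catalog.get("/MarkInfo") or {}
    marked = mark_info.get("/Marked") is True
    # Starting point 2: /StructTreeRoot present and holding at least one kid in /K.
    root = catalog.get("/StructTreeRoot")
    has_tree = isinstance(root, Mapping) and bool(root.get("/K"))
    return marked and has_tree


# Hypothetical catalogs for illustration.
tagged_catalog = {
    "/Type": "/Catalog",
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/Type": "/StructTreeRoot", "/K": [{"/S": "/P", "/K": 0}]},
}
untagged_catalog = {"/Type": "/Catalog"}
```

Passing this check says nothing about the quality of the tagging; it only tells you whether reading the structure tree is worth attempting before falling back to guesswork.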

     

    Olaf

     
  • Aug 22, 2013 4:24 AM   in reply to Mohammed_Mostafa

    We propose and motivate a novel task: paragraph segmentation. We discuss and compare this task with text segmentation and discourse parsing. We present a system that performs the task with high accuracy. A variety of features is proposed and examined in detail. The best models turn out to include lexical, coherence, and structural features.

     
  • Aug 22, 2013 4:44 AM   in reply to swayam44

    Um, why have you posted the abstract of Dmitriy Genzel's paper? Without copyright acknowledgement?

     