ehboym

Object references in the xref table are offset from the actual positions of the objects

Jul 20, 2013 6:07 AM

Tags: #javascript #parse #xref

Hi

 

I'm trying to parse a PDF file. The goal is to automatically mark sections of the document so that it is easier for a reader to spot the important ones.

 

I read the PDF file by making an AJAX call and storing the responseText in a variable.

 

I then find the reference to the xref table and try to get to the xref table using that reference.

 

The thing is that the startxref section at the end of the document shows "startxref 275815%%EOF", but the actual position of the xref table is different: "...indexOf('xref')" returns 265798 (a difference of 10017), and all the references in the xref table are offset by the same amount.
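
Roughly, this is the kind of lookup I am doing (a simplified sketch, not my exact code; raw is the responseText from the AJAX call):

    // Simplified sketch of the lookup; raw holds the whole PDF read via AJAX
    var raw = xhr.responseText;
    var sxPos = raw.lastIndexOf('startxref');          // keyword near the end of the file
    var offset = parseInt(raw.slice(sxPos + 9), 10);   // gives 275815

    console.log(raw.indexOf('xref'));                  // gives 265798 in this file
    console.log(raw.substr(offset, 20));               // not the start of the xref table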

 

Any hints on the source of that difference?

 

What am I doing wrong?

 

Thanks

 

Erez

 
Replies
  • Jul 20, 2013 6:19 AM   in reply to ehboym

    Are you sure you understand the complexity of your project?

     

    Regarding your specific question/issue: if you use a hex editor - what do you find at the offset indicated by startxref?

     

    Olaf

     

    PS: For a JavaScript-based implementation of a PDF parser / viewer, see pdf.js from Mozilla. It might give you an idea of what kind of project you are embarking on...

     
  • Jul 20, 2013 12:34 PM   in reply to ehboym

    While trying not to sound rude or arrogant, I have to say: RTFM. In other words: unless you read all the relevant sections of the PDF specification (get the ISO 32000-1 document from the Adobe website; it is the free-of-charge authoritative spec, 100% matching the official ISO version of the PDF standard, which is not free of charge), you will not be able to complete your project successfully. The fact that you are already stumbling over a comparatively simple aspect of how PDFs are constructed could be a pointer that you might want to revisit your decision to implement this from scratch.

     

    Olaf

     
  • Jul 21, 2013 5:38 AM   in reply to ehboym

    Make sure that you (and AJAX and JS) are treating the PDF as a STRICT BINARY BLOB.  If something is thinking that it's text and trying to convert line endings, then of course your offsets will be wrong.
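
    As a rough sketch (assuming the PDF's URL is in a variable called url, which is a placeholder), requesting an ArrayBuffer instead of reading responseText avoids any text conversion:

        // Sketch: fetch the PDF as raw bytes so no line-ending or charset conversion happens.
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, true);
        xhr.responseType = 'arraybuffer';       // ask for untouched bytes, not text
        xhr.onload = function () {
            var pdfBytes = new Uint8Array(xhr.response);
            // Byte offsets in pdfBytes now match the offsets written in the xref table.
            // handlePdf(pdfBytes);             // hypothetical next step
        };
        xhr.send();

    (The older trick of xhr.overrideMimeType('text/plain; charset=x-user-defined') and masking each charCodeAt() value with 0xff achieves the same thing if responseText has to be used.)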

     
  • Jul 21, 2013 1:03 PM   in reply to ehboym

    I should say that writing from scratch something to allow text markup (in the general case) is not less than six months' work for an experienced and fast programmer. An interesting project, but I suspect you may be underestimating the work required to:

    - parse the xrefs and consolidate

    - find the objects

    - traverse the pages tree

    - extract text from page contents and nested content streams, allowing for the complexity of text setting and encodings, supporting the different kinds of embedded fonts, ToUnicode etc.

    - implement fuzzy logic for text ordering and word detection

    - directly extract tagged text

    - and only then can you start the comparatively simple process of adding annotations, together with their content streams, updating xref or adding incremental updates.

     
  • Jul 22, 2013 10:26 AM   in reply to ehboym

    There is certainly much scope for simplification if you only have to work with one file, for a demonstration or a special-purpose tool.

     

    zlib is the right tool, but your method for finding the start of the stream data is oversimplified; there are specific rules for whitespace following the word "stream".
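
    For reference, the rule can be expressed in a few lines (a sketch only, assuming bytes is a Uint8Array of the raw file and streamKeywordPos is the byte offset where the "stream" keyword starts):

        // Sketch: where does the stream data actually begin after the "stream" keyword?
        // Per ISO 32000-1, "stream" shall be followed by CR LF or by a single LF,
        // and not by a CR alone.
        function streamDataStart(bytes, streamKeywordPos) {
            var p = streamKeywordPos + 'stream'.length;
            if (bytes[p] === 0x0d && bytes[p + 1] === 0x0a) return p + 2; // CR LF
            if (bytes[p] === 0x0a) return p + 1;                          // LF only
            throw new Error('no valid EOL after the "stream" keyword');
        }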

     

    You should not scan for "endstream"; that is a double check. You should read the stream length, which may be a direct or indirect object, and use that. Be sure only to use the exact stream length - your code would potentially include whitespace before endstream.
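
    A sketch of that, building on the streamDataStart helper above (length is assumed to be the /Length value, already resolved if it was an indirect reference):

        // Sketch: take exactly /Length bytes of stream data instead of searching for "endstream".
        function streamData(bytes, streamKeywordPos, length) {
            var start = streamDataStart(bytes, streamKeywordPos); // helper from the sketch above
            return bytes.subarray(start, start + length);         // exact bytes, no trailing EOL
        }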

     

    Also, you don't seem to be using the consolidated xref tables; it is just blind luck (or one particular file) if the first stream you find is a page content stream. It could be anything (e.g. a font, metadata, a function dictionary...), and bear in mind that Contents could be an array. Even assuming it is a stream, the word "stream" could appear in other contexts. DO NOT TRY TO FIND OBJECTS BY TEXT SEARCHING!!

     
  • Jul 22, 2013 10:43 AM   in reply to Test Screen Name

    And of course, if you still aren't setting binmode (did you deal with the original problem?), there's no way in the world that your zlib stream will decompress.

     
  • Jul 26, 2013 6:44 AM   in reply to ehboym

    Are you sure the section in question is Flate encoded?   There are MANY different compression/filter options used in PDF.
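
    As a sketch (assuming the stream dictionary has already been parsed into a plain object dict, and that a zlib implementation such as the pako library is available - both are assumptions, not something from your code):

        // Sketch: only inflate when the stream really is FlateDecode.
        function decodeStream(dict, data) {
            var filter = dict.Filter;               // may be absent, a single name, or an array of names
            if (filter === undefined) return data;  // no filter: the data is stored raw
            if (filter === 'FlateDecode') return pako.inflate(data);
            // Other filters exist: LZWDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode,
            // DCTDecode, CCITTFaxDecode, JBIG2Decode, JPXDecode... (filter arrays are not
            // handled in this sketch).
            throw new Error('unsupported or unhandled filter: ' + filter);
        }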

     
  • Jul 26, 2013 11:45 PM   in reply to ehboym

    I still don't see code to parse and skip the white space following "stream".

     
  • Jul 27, 2013 9:38 AM   in reply to ehboym

    There is always, by definition, white space to skip. It might occupy one or two bytes; in this case you are correct that it occupies one byte. In general, though, you have to inspect the data.

     

    I recommend you add debug code to dump the bytes you are about to decompress and verify that they actually match the bytes you see and expect in a hex editor.
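
    Something as simple as this will do (a sketch; streamBytes stands for whatever Uint8Array you are about to pass to the decompressor):

        // Sketch: hex-dump the first bytes you are about to decompress, so they can be
        // compared with what a hex editor shows at the same file offset.
        function hexDump(bytes, count) {
            var n = Math.min(count, bytes.length);
            var out = [];
            for (var i = 0; i < n; i++) {
                out.push(('0' + bytes[i].toString(16)).slice(-2)); // two hex digits per byte
            }
            return out.join(' ');
        }

        console.log(hexDump(streamBytes, 16)); // a raw zlib stream typically begins 78 01, 78 9c or 78 da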

     
