8 Replies Latest reply: Jan 17, 2014 2:57 AM by Test Screen Name

    PDF Stream Extent?

    tetleyforget Community Member

      Hello everyone.

       

      I'm trying to interpret exactly what is and isn't included in a PDF stream, and to date I am still confused. I'll paste a section of the ISO 32000-1 PDF reference below.

       

      I'm not sure, but these statements appear to contradict each other. 

       

      So I have a stream which specifies a length of 2215 bytes in its compressed form.

      There is a carriage return and a line feed at the start and end of the stream data falling between the 'stream' and 'endstream' keywords.

       

      So my data looks like this :  stream  CR LF Data Data Data CR LF endstream  Keep in mind that CR = Carriage Return and LF = Line feed

       

      Before I remove the CR and LF from each end of the data, the total size of the stream is 2217 bytes (between the 'stream' and 'endstream' keywords).  From the first paragraph below, it appears that I should read the data between the CR LF pairs at each end, which brings the compressed size down to 2213 bytes (not 2215 as the stream 'Length' specifies).

       

      If I follow the second paragraph, from Table 5 in relation to the stream Length, it appears that only the carriage return and line feed at the end of the stream are removed.  So the stream to be decompressed would look like this:  CR LF Data Data Data .  This does in fact match the Length specified for that stream, which is 2215 bytes.

       

      When decompressing a stream, what should and shouldn't be included?  Should I cut the CR and LF from the start, the end, or both?     Note the red bolded section below: "lie between the end-of-line marker" (I assume this means not inclusive).  It's like saying "stand between those two people" (this doesn't mean stand on those two people and centre yourself).   Yet the green bolded area in the second section doesn't mention the initial white space.

       

      Perhaps this is what it means.  The first whitespace character after the 'stream' keyword and the whitespace character preceding the 'endstream' keyword are ignored so the stream looks like this:

      Original Stream Data before removing whitespace:    CR LF Data Data Data CR LF

      Actual Stream data to be decompressed (whitespace removed):   LF Data Data Data CR


      That last option produces a stream of 2215 bytes as well.

       

       

      Thanks

       


       

      Under 'Stream Objects - General'

      The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes. There should be an end-of-line marker after the data and before endstream; this marker shall not be included in the stream length.

       

      AND

      From table 5 in relation to the stream Length.

      (Required) The number of bytes from the beginning of the line following the keyword stream to the last byte just before the keyword endstream. (There may be an additional EOL marker, preceding endstream, that is not included in the count and is not logically part of the stream data.) See 7.3.8.2, "Stream Extent", for further discussion.


        • 1. Re: PDF Stream Extent?
          lrosenth Adobe Employee

          Sounds like a broken PDF, since your interpretations of the spec are correct.

          • 2. Re: PDF Stream Extent?
            tetleyforget Community Member

            Thanks.. But which interpretation is correct?

            • 3. Re: PDF Stream Extent?
              lrosenth Adobe Employee

              Ignore the line ending (either one byte or two, depending) after stream and (if present) before endstream.

              • 4. Re: PDF Stream Extent?
                tetleyforget Community Member

                Thanks, I hate to be a pest, but depending on what?

                 

                If I chop two white space characters from the beginning the encoded stream is 2215 bytes.

                 

                If I chop two off the end the stream is 2215 bytes.

                 

                If I chop the very first whitespace character after 'stream' and the very last whitespace before endstream I get a 2215 byte stream.

                 

                So I would assume, from your answer, that the length of the stream must be maintained.   Therefore when you say "depending", I assume you mean that if I can only lose two bytes from the stream I have to take one from either side.  If I can lose 4 bytes to get the 2215 byte stream I can take two from either side of the stream (between stream and endstream).   What if I can take 3 bytes, then what?   This is why I need to be clear on which whitespace characters to delete, because the day will come when this issue will arise.

                 

                In my case I am still left with a whitespace character at the beginning and the end of the stream.  Is that acceptable?

                 

                Thanks

                • 5. Re: PDF Stream Extent?
                  Test Screen Name CommunityMVP

                  One approach to reading a stream is to skip the initial white space precisely by the rules of 32000-1, then read the number of bytes specified by Length. This will succeed with every valid PDF. I might then stop. If I wanted some degree of error checking I might then skip optional white space (any amount) and verify the presence of endstream. (So Length takes precedence over the precise location of endstream). Only if I was specifically writing a validator would I bother with the exact whitespace rules before endstream.
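The approach described above can be sketched in Python. This is only an illustration under stated assumptions, not a complete parser: the function name `read_stream_data`, its parameters, and the `PDF_WHITESPACE` constant are made up for this example, and the Length value is assumed to have been read from the stream dictionary already (as a direct number, not an indirect reference).

```python
# The six white-space bytes defined by ISO 32000-1: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def read_stream_data(buf: bytes, stream_kw_pos: int, length: int) -> bytes:
    """Return the `length` bytes of data for the stream whose `stream`
    keyword starts at offset `stream_kw_pos` in `buf` (hypothetical helper)."""
    pos = stream_kw_pos + len(b"stream")

    # Skip exactly one end-of-line marker: CR LF (two bytes) or LF (one byte).
    # A lone CR is not allowed after the stream keyword.
    if buf[pos:pos + 2] == b"\r\n":
        pos += 2
    elif buf[pos:pos + 1] == b"\n":
        pos += 1
    else:
        raise ValueError("stream keyword not followed by CR LF or LF")

    # Length decides where the data ends, not the position of endstream.
    data = buf[pos:pos + length]
    if len(data) != length:
        raise ValueError("file ends before Length bytes were read")

    # Optional sanity check: skip any amount of white space, expect endstream.
    rest = buf[pos + length:]
    if not rest.lstrip(PDF_WHITESPACE).startswith(b"endstream"):
        raise ValueError("endstream not found after Length bytes")
    return data
```

Note that nothing is trimmed from the data itself: the bytes returned are exactly the Length bytes after the leading end-of-line marker, whatever they contain.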

                   

                  The job of writing a validator is interesting and important but entirely different from the job of writing a real world PDF consumer. PDF files in their countless millions do not follow the precise rules of PDF. PDF creators tend to the "it opened in Acrobat and looked about right" test.

                   

                  Do not overlook the possibility of a damaged file that must be repaired (if you support repair; since Acrobat does, many people consider files that need repair as "valid"). In this case the new lines are likely to be messed up, and you must develop your own heuristics for stream reconstruction.

                  • 6. Re: PDF Stream Extent?
                    Test Screen Name CommunityMVP

                    Streams are more forgiving of slight errors in their Length than you might think.

                     

                    1. Consider the case of an uncompressed (no Filter) page stream where the extra new line is included in the Length rather than outside it. This will make no difference because it is just white space to be skipped.

                     

                    2. Some compressed data would be terminally damaged by incorrect data. But

                    (a) some PDF tools do not report an error; if they find the stream to be unreadable by the rules of the filter they simply stop reading as if at end of file. I don't agree with this but it is now such common practice I would have to think carefully about flouting it.

                    (b) some filter streams include an explicit end of stream marker. Many tools will stop reading, rather than verify nothing follows the marker.

                    (c) some tools will not care if there are too few or too many bytes for an image.

                     

                    This is not a reason for PDF creators to be sloppy, but a reason for PDF consumers to consider being forgiving.
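Point 2(b) can be illustrated with Python's standard zlib module. This sketch assumes the stream uses FlateDecode (the thread does not say which filter is involved), and the helper name `inflate_tolerant` is made up for the example. A zlib stream carries its own end marker, so a decompressor object stops there and reports any leftover bytes instead of failing.

```python
import zlib
from typing import Tuple


def inflate_tolerant(raw: bytes) -> Tuple[bytes, bytes]:
    """Decompress one zlib stream from `raw` and return
    (decompressed data, bytes left over after the zlib end marker)."""
    d = zlib.decompressobj()
    out = d.decompress(raw)
    return out, d.unused_data


content = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(content)

# A trailing CR LF wrongly counted in Length is harmless here: it lands
# in the leftover bytes and the data decompresses normally.
data, extra = inflate_tolerant(compressed + b"\r\n")

# A leading CR LF is different: the zlib header check fails immediately.
try:
    inflate_tolerant(b"\r\n" + compressed)
    leading_ok = True
except zlib.error:
    leading_ok = False
```

So, at least for this filter, extra bytes at the end of the data are tolerated, while the end-of-line marker after the stream keyword has to be skipped before decompression starts.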

                    • 7. Re: PDF Stream Extent?
                      tetleyforget Community Member

                      Thanks for that. Good detailed answer. 

                       

                      I suppose when reading ISO 32000-1, the paragraphs above contradict each other slightly.  The first paragraph seems to indicate skipping the initial whitespace, though the last paragraph doesn't mention that initial whitespace being relevant to the stream length. 

                       

                      But as you have indicated, the degree of adherence to standards is a little uneven at times so I'm just starting off with the standards and will move forward from there.

                       

                      It's easy enough to test for whitespace and count the bytes.  I've used a vector for this and have chopped off the whitespace characters where necessary. 

                       

                      I find the ISO 32000 standard a little unclear in this area.  For example I'll quote below:

                       

                      The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword

                       

                      The end-of-line marker can consist of a carriage return and a newline or just a newline character.  I interpret this as meaning that if a Carriage Return and Newline are present after the word 'stream' then those combined are the end-of-line marker in total and should be ignored.  This is because the data lies between this and endstream.  In all, I just want to know what it is really saying.

                       

                       

                       


                      • 8. Re: PDF Stream Extent?
                        Test Screen Name CommunityMVP

                        Yes, the end-of-line marker is either one byte (LF) or two bytes (CR LF) long, and this marker (of one or two bytes) is to be ignored in that context.