10 Replies. Latest reply: Mar 25, 2014 7:53 AM by tetleyforget

    PDF File with Run Length Encoding?

    tetleyforget Community Member

      Would anyone have a PDF file containing text streams compressed with run-length encoding?  I need one to test a decompressor.  I know they are probably rare and no longer used, which is why one is so difficult to find.

       

      If you do have one, could you attach it in a reply or refer me to it?

       

      Thanks

        • 1. Re: PDF File with Run Length Encoding?
          tetleyforget Community Member

          Maybe someone can suggest how to find one of these?  Is there some way of creating one with software?

           

          Thanks

          • 2. Re: PDF File with Run Length Encoding?
            Test Screen Name CommunityMVP

            I think it's almost inconceivable anyone would have made one, except for a test, since run-length encoding applied to text streams would only ever make them bigger.  You might find it applied to images. Indeed, Acrobat Distiller offers this as an option for compression of monochrome images. You could also write an Acrobat plug-in that compressed a text stream in this way. Both of course require purchase of Adobe software, but our hosts probably wouldn't see that as a bad thing.
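A test file can also be produced without Acrobat. The following is a minimal sketch, not a reference implementation: `run_length_encode` and `build_pdf` are hypothetical helper names, the encoder is a simple greedy one, and the one-page layout (Helvetica, US Letter MediaBox) is an assumption chosen only to keep the file small. It has not been checked against any particular viewer.

```python
def run_length_encode(data: bytes) -> bytes:
    """Greedy encoder for the PDF RunLengthDecode format (assumed helper)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run of identical bytes starting at i (at most 128).
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            # Repeat run: length byte 257 - run, then the byte to repeat.
            out += bytes([257 - run, data[i]])
            i += run
        else:
            # Literal run: collect bytes until a repeat starts or 128 are held.
            start = i
            i += 1
            while (i < n and i - start < 128
                   and not (i + 1 < n and data[i] == data[i + 1])):
                i += 1
            out.append(i - start - 1)  # length byte is count - 1
            out += data[start:i]
    out.append(128)  # end-of-data marker
    return bytes(out)


def build_pdf(content: bytes) -> bytes:
    """Wrap a content stream in a one-page PDF using /RunLengthDecode."""
    stream = run_length_encode(content)
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d /Filter /RunLengthDecode >>\nstream\n" % len(stream)
        + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    pdf = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(pdf))  # byte offset of "N 0 obj"
        pdf += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(pdf)
    pdf += b"xref\n0 %d\n" % (len(objects) + 1)
    pdf += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        pdf += b"%010d 00000 n \n" % off
    pdf += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(pdf)
```

Writing the bytes returned by `build_pdf(b"BT /F1 24 Tf 72 720 Td (Hello, RunLengthDecode) Tj ET")` to a file opened in binary mode should give a small test document. For that sample content the encoded stream comes out longer than the input, which matches the point about size made above.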

            • 3. Re: PDF File with Run Length Encoding?
              tetleyforget Community Member

              No worries..

               

              I'm in the process of building a decoder to parse PDF files compressed with several different encoding types. Only text though.

               

              Just reading ISO 32000-1:2008, Table 6, pasted below, on RunLengthDecode.  Note the reference to text.

               

              This assumes that I can't reprint the PDF (using a PDF writer) with FlateDecode for text streams, which seems to be the standard now.  I've included LZWDecode and FlateDecode, am still deciding on RunLengthDecode, and of course the ASCII85Decode filter, which will be combined with all three if necessary, or at least with Flate and LZW.

               

              Any other suggestions on that?   For decoding text streams mainly.

               

               

              RunLengthDecode (parameters: no): Decompresses data encoded using a byte-oriented run-length encoding algorithm, reproducing the original text or binary data (typically monochrome image data, or any data that contains frequent long runs of a single byte value).
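The byte-oriented format described in that table row is small enough to decode in a few lines. The sketch below follows my reading of ISO 32000-1: a length byte of 0 to 127 copies the next length + 1 bytes literally, a length byte of 129 to 255 repeats the next byte 257 - length times, and 128 marks end of data. The name `run_length_decode` is illustrative and does not come from any library.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode PDF RunLengthDecode data (octets in, octets out)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        length = data[i]
        i += 1
        if length == 128:
            # End-of-data marker: ignore anything that follows.
            break
        if length < 128:
            # Literal run: copy the next length + 1 bytes unchanged.
            count = length + 1
            chunk = data[i:i + count]
            if len(chunk) < count:
                raise ValueError("truncated literal run")
            out += chunk
            i += count
        else:
            # Repeat run: the next byte is repeated 257 - length times.
            if i >= n:
                raise ValueError("truncated repeat run")
            out += bytes([data[i]]) * (257 - length)
            i += 1
    return bytes(out)


# A literal run of 3 bytes, a repeat run of 4 bytes, then end of data.
sample = bytes([2]) + b"abc" + bytes([253]) + b"x" + bytes([128])
decoded = run_length_decode(sample)  # b"abcxxxx"
```

Because the decoder never inspects what the bytes mean, the same function applies whether the stream holds page content or image data.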

              • 4. Re: PDF File with Run Length Encoding?
                lrosenth Adobe Employee

                I don't know what you mean by "only text though".  PDF is a binary format not a text format.  There are certainly places that you can have a text string or a stream of text-like data, but the entire format is binary.

                • 5. Re: PDF File with Run Length Encoding?
                  tetleyforget Community Member

                  You are complicating a relatively simple question.  Please be reasonable.

                   

                  Yes, all files are ultimately represented in binary if you break them down.

                   

                  But streams (yes, ultimately binary) are configured to represent something.. Images, text etc... I really don't understand why you made that point.

                   

                  A stream representing text is casually referred to as a text stream in this case, or more correctly a string stream, or just a string.  Because we're talking about PDFs, yes, ultimately you can break this text stream down to 8-bit bytes.

                   

                  So, if the stream is meant to represent text as opposed to an image or something else, then what would be a realistic list of encoding types which could possibly be applied to it?  I even pasted the section from ISO 32000-1 above to show you why there may be some ambiguity.

                  • 6. Re: PDF File with Run Length Encoding?
                    Test Screen Name CommunityMVP

                    If you mean a page stream, the likely filters are LZW and Flate, optionally with ASCII85. It depends whether you want to support all theoretical PDFs or just those you're likely to find in the field. I guess you are saying that you only intend to decompress page streams (and presumably form XObjects), likely to extract text.

                     

                    I see no ambiguity in the statement. And I agree with lrosenth that talking about "only text" really does nothing to simplify your problem. All filters are simply implemented as binary octets in, binary octets out. If it is a page stream, it will usually contain only visible characters, but by no means always as the text strings passed to Tj (etc.) can contain arbitrary binary data, and there may be inline images.
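As a rough illustration of "binary octets in, binary octets out", a filter chain can be applied in the order the names appear in the stream's /Filter entry. This sketch uses only the Python standard library (`zlib`, `base64`); the names `decode_stream`, `DECODERS`, `ascii85_decode` and `flate_decode` are assumptions for the example, and LZWDecode and predictor parameters are deliberately left out.

```python
import base64
import zlib


def ascii85_decode(data: bytes) -> bytes:
    """ASCII85Decode: strip the optional '<~' and the '~>' end marker."""
    data = data.strip()
    if data.startswith(b"<~"):
        data = data[2:]
    end = data.find(b"~>")
    if end != -1:
        data = data[:end]
    return base64.a85decode(data)


def flate_decode(data: bytes) -> bytes:
    """FlateDecode without predictors: plain zlib decompression."""
    return zlib.decompress(data)


# Hypothetical registry; other filters would be added here.
DECODERS = {
    "ASCII85Decode": ascii85_decode,
    "FlateDecode": flate_decode,
}


def decode_stream(raw: bytes, filters) -> bytes:
    """Apply each named filter in order; every step maps bytes to bytes."""
    for name in filters:
        if name not in DECODERS:
            raise NotImplementedError(f"filter not supported: {name}")
        raw = DECODERS[name](raw)
    return raw


# Content with arbitrary binary bytes inside a string operand of Tj.
content = b"BT (\x00\xff binary ok) Tj ET"
encoded = base64.a85encode(zlib.compress(content)) + b"~>"
restored = decode_stream(encoded, ["ASCII85Decode", "FlateDecode"])
```

The round trip succeeds even though the string operand holds non-printable bytes, which is why restricting a decoder to "text" does not simplify the filter layer.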

                    • 7. Re: PDF File with Run Length Encoding?
                      lrosenth Adobe Employee

                      Since there is no such thing in ISO 32000, can you tell us what you mean by a "text stream"?   Do you mean a content stream, where the page content drawing instructions are?  Those can include arbitrary data (as mentioned by TSN) and not just drawing instructions.   Also, the instructions you find there aren't actual text - they are just references to glyphs in the font.  You would need to decode the font program streams as well if you wished to get actual text from a PDF.
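To make the glyph-reference point concrete, here is a toy sketch. It assumes single-byte character codes and a hand-written code-to-Unicode table; `codes_to_text` and `subset_map` are hypothetical, and in a real file the mapping would have to be recovered from the font's own data rather than typed in.

```python
def codes_to_text(codes: bytes, code_to_unicode: dict) -> str:
    """Map single-byte character codes to text via a per-font table."""
    # Unknown codes become U+FFFD so gaps in the table stay visible.
    return "".join(code_to_unicode.get(code, "\ufffd") for code in codes)


# Hypothetical subset font whose glyphs are numbered from 1.
subset_map = {1: "T", 2: "e", 3: "x", 4: "t"}

# The bytes shown by a Tj operator are codes, not readable characters.
shown = b"\x01\x02\x03\x04"
text = codes_to_text(shown, subset_map)  # "Text"
```

The decompressed bytes `\x01\x02\x03\x04` carry no readable text by themselves; the result depends entirely on the table associated with the selected font.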

                      • 8. Re: PDF File with Run Length Encoding?
                        tetleyforget Community Member

                        I think it's pretty clear what I meant.. ISO3200-1 (I've already mentioned it in a previous post)  but thanks for picking up that small insignificant fact.   You don't have to give me the ins and outs of binary, glyphs and complicating factors which detract from the question.  

                         

                        I've been parsing PDFs for years, but they have all been Flate or LZW decode, some with ASCII filters. All I asked is to ascertain whether text would be encoded in any other way, as I'm building on an existing automated process I have in place.

                         

                        By assuming that I'm stupid and complicating things by highlighting small deviations in my chosen vocabulary, you are not seriously addressing the issue at hand.

                         

                        In a slang way I'm obviously referring to a stream in a text object. The Tj operator treats each element (yes binary value) as a character code..

                         

                        Two posts above answers my question anyway..

                        • 9. Re: PDF File with Run Length Encoding?
                          Test Screen Name CommunityMVP

                          I'm sorry you think that small deviations from the definition are unimportant. (And by the way I apologise for using "page stream" when I misremembered "content stream".) You are asking people who have digested the 1000-page PDF specification (in some cases maybe even helped to edit it) about very specific and fine details from it. You must expect them to be very precise so that the answer can be accurate. The only way to deal with a specification like this is to be very pedantic indeed, so no apologies for that at all.

                          • 10. Re: PDF File with Run Length Encoding?
                            tetleyforget Community Member

                            I was talking to lrosenth..

                            Your post was rather helpful..