13 Replies Latest reply: Aug 19, 2013 1:47 AM by WCoder

    Decoding TJ on BT...ET block...

    WCoder Community Member

      Hi,

       

      I have a problem when I try to decode this kind of thing:

       

      [(\007\003\b\007\006)-275(\004\002\005\003\007)]TJ

      (\000)Tj

      (\024\020\004\b\000\017\032\035\030\034\000\025\021\001\033\036\027\r\024\020\004\b\000\017\032\035\030\034\000\025\021\001\033\036\027\000\000\004\b\002\f\002\003\t\000\000\004\b\r\b\t\000\000\023\026\031\030\000\004)Tj

      [(VOLUME)-600(TWO)-1801(ISSUE)-600(THREE)]TJ

       

      For the last one: [(VOLUME)-600(TWO)-1801(ISSUE)-600(THREE)]TJ

      the text is inside the ( ) and -600, -1801, -600 are the spacing adjustments between the words (according to the "PDF Reference").
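
A minimal Python sketch of how the operand of TJ could be split into strings and numeric adjustments. The function names and the -200 gap threshold are assumptions made for illustration only; the regular expression does not handle nested unescaped parentheses, which PDF strings may contain.

```python
import re

# A token is either a literal string "(...)" with backslash escapes, or a number.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[-+]?\d*\.?\d+")

def parse_tj_array(operand):
    """Split a TJ operand into string bodies and numeric adjustments."""
    items = []
    for tok in TOKEN.findall(operand):
        if tok.startswith("("):
            items.append(tok[1:-1])   # string body, escapes still undecoded
        else:
            items.append(float(tok))  # thousandths of a text-space unit
    return items

def join_words(items, gap_threshold=-200.0):
    """Heuristic (assumption): a large negative adjustment is read as a word gap."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item <= gap_threshold:
            out.append(" ")
    return "".join(out)

items = parse_tj_array("[(VOLUME)-600(TWO)-1801(ISSUE)-600(THREE)]")
text = join_words(items)  # "VOLUME TWO ISSUE THREE"
```

A number in the array is subtracted from the current position, so in horizontal writing a negative value moves the next glyph to the right and widens the gap. The strings here happen to be readable because this font's character codes coincide with ASCII.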

       

      For the others I have two questions. In one case I have an array [], and in the other I do not...

       

      The strings seem to use escape sequences (according to the "PDF Reference"):

      \n Line feed (LF)
      \r Carriage return (CR)
      \t Horizontal tab (HT)
      \b Backspace (BS)
      \f Form feed (FF)
      \( Left parenthesis
      \) Right parenthesis
      \\ Backslash
      \ddd Character code ddd (octal)
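
A minimal Python sketch of the escape table above, turning the text between ( and ) into character codes. The function name is an assumption, and the input is assumed to contain only Latin-1 characters. Note that \b is simply character code 8 here; it does not separate words.

```python
ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12, "(": 40, ")": 41, "\\": 92}

def decode_pdf_string(body):
    """Return the list of character codes (0-255) for a literal string body."""
    codes = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            codes.append(ord(ch))
            i += 1
            continue
        i += 1                        # skip the backslash
        if i >= len(body):
            break                     # trailing backslash: nothing follows
        nxt = body[i]
        if nxt in "01234567":
            j = i
            while j < len(body) and j < i + 3 and body[j] in "01234567":
                j += 1                # up to three octal digits
            codes.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt in ESCAPES:
            codes.append(ESCAPES[nxt])
            i += 1
        elif nxt in "\r\n":
            i += 1                    # backslash + end of line: line continuation
            if nxt == "\r" and i < len(body) and body[i] == "\n":
                i += 1
        else:
            codes.append(ord(nxt))    # unknown escape: the backslash is dropped
            i += 1
    return codes

codes = decode_pdf_string(r"\007\003\b\007\006")  # [7, 3, 8, 7, 6]
```

The result is a sequence of codes, not text yet. What each code means depends on the font's encoding or ToUnicode map.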

       

      My problem is how to convert \007\003\b\007\006 to "first word" \b "second word"?

       

      007 is supposed to be a character code in octal?!

       

      How do I convert it to ASCII or Unicode?

       

      Thanks,

       

      WCoder

        • 1. Re: Decoding TJ on BT...ET block...
          olafdruemmer Community Member

          Please read section "9.10 Extraction of Text Content" in the PDF Reference. Make sure you understand every bit of it (which might require reading a lot of other parts of the PDF Reference, and possibly other documents about fonts and encoding and ...) and then, if there is stuff you still do not understand, come back with specific questions.

           

          If you decide not to completely wrap your head around the stuff mentioned above your text extraction will never work well.

           

          Olaf

          • 2. Re: Decoding TJ on BT...ET block...
            WCoder Community Member

            Sorry, I already read it... and tried to re-read it... but I don't understand. I have already used the doc successfully for displaying images (with matrix transforms, clipping, CMYK to RGB color conversion using an ICC profile) and drawing vector elements...

             

            But I failed at decoding the text. My last hope is that \007 is the index of the glyph in the font, so first I am extracting the font from the stream; I found a tool to display it...

             

            If someone has a simpler explanation than the doc, you're welcome ^_^

            • 3. Re: Decoding TJ on BT...ET block...
              Test Screen Name MVP

              No, you need to read it more. There is no simple explanation because it is not simple! It is certainly not an index to the glyph in the font, and you must analyse ToUnicode, CMaps and/or Encodings for each character in the byte string (\007 is interpreted in the normal way because this is a string).

               

              Are there any specific points from 9.10 which we can help you clarify? We cannot substitute for a DEEP understanding of these issues, but perhaps we can help you to it.

              • 4. Re: Decoding TJ on BT...ET block...
                WCoder Community Member

                OK, I found it; now I need to find the explanation in the doc ~o~

                 

                \007 is the index in an array described here: /Differences[0/space/exclam 3/A/C/D/E/G/I/K/L/M/N/O/R/S/T/W/X/a/c/e/f/i/k/l/n/o/p/r/s/t/u]

                The first index is here: /FirstChar 0

                The last index is here: /LastChar 32

                Glyph widths are in the array described here: /Widths[333 250 0 570 535 607 500 606 286 608 482 804 622 659 571 572 500 856 572 499 429 500 285 250 501 269 533 518 554 338 465 304 537]
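
A minimal Python sketch of a width lookup with these values, assuming a simple font that is not Type 3 (so widths are in thousandths of a text-space unit; a Type 3 font scales by its FontMatrix instead). The function name and the missing_width parameter are assumptions for illustration.

```python
FIRST_CHAR = 0
LAST_CHAR = 32
WIDTHS = [333, 250, 0, 570, 535, 607, 500, 606, 286, 608, 482,
          804, 622, 659, 571, 572, 500, 856, 572, 499, 429, 500,
          285, 250, 501, 269, 533, 518, 554, 338, 465, 304, 537]

def glyph_width(code, missing_width=0.0):
    """Width of a character code in text-space units, before font size is applied."""
    if FIRST_CHAR <= code <= LAST_CHAR:
        # /Widths is indexed relative to /FirstChar.
        return WIDTHS[code - FIRST_CHAR] / 1000.0
    # Codes outside the range fall back to the descriptor's MissingWidth.
    return missing_width / 1000.0

w = glyph_width(7)  # entry 606 in the array, i.e. 0.606 text-space units
```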

                 

                -- Closed thread ---

                • 5. Re: Decoding TJ on BT...ET block...
                  Test Screen Name MVP

                  Just a minor clarification for anyone reading over this thread.

                   

                  \007 is an index into the encoding array. This is always an array of 256 characters. The /Differences array is used to modify the Encoding array after it is built from the standard name used, but Differences isn't the array itself.
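
A minimal Python sketch of that step, using the /Differences array quoted earlier in the thread. The base encoding here is a placeholder filled with .notdef, which is an assumption; a real one would come from the named standard encoding or from the font's built-in encoding.

```python
def apply_differences(base, differences):
    """Return a copy of a 256-entry encoding with a /Differences array applied."""
    encoding = list(base)
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item            # a number sets the code of the next name
        else:
            encoding[code] = item  # a name fills the current code...
            code += 1              # ...and the code then advances by one
    return encoding

# Placeholder base encoding (assumption): every code starts out undefined.
base = [".notdef"] * 256

# [0/space/exclam 3/A/C/D/E/G/I/K/L/M/N/O/R/S/T/W/X/a/c/e/f/i/k/l/n/o/p/r/s/t/u]
differences = [0, "space", "exclam", 3] + list("ACDEGIKLMNORSTWXacefiklnoprstu")

encoding = apply_differences(base, differences)
names = [encoding[c] for c in (16, 21, 30, 31)]  # ["T", "e", "s", "t"]
```

The result is a list of glyph names, one per code. Turning a glyph name into a Unicode character is a separate lookup (for example via the Adobe Glyph List).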

                   

                  When extracting text you should first look for a ToUnicode value, because if it is present it must be used in preference to the Encoding.
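
A minimal Python sketch of that lookup order. The sample CMap text is hypothetical, only bfchar sections are parsed (bfrange is omitted), and falling back to the Encoding per code when ToUnicode has no entry is an assumption rather than something stated above.

```python
import re

BFCHAR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Collect code -> Unicode string from beginbfchar...endbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR.findall(section):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def code_to_text(code, to_unicode, encoding_names, glyph_to_char):
    """Prefer ToUnicode; otherwise go through the encoding's glyph name."""
    if code in to_unicode:
        return to_unicode[code]
    name = encoding_names.get(code, ".notdef")
    return glyph_to_char.get(name, "\ufffd")  # U+FFFD marks an unknown character

# Hypothetical ToUnicode fragment, invented for this example.
SAMPLE_CMAP = """
2 beginbfchar
<07> <0047>
<03> <0041>
endbfchar
"""

to_unicode = parse_bfchar(SAMPLE_CMAP)
from_cmap = code_to_text(7, to_unicode, {}, {})
from_encoding = code_to_text(8, to_unicode, {8: "I"}, {"I": "I"})
```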

                  • 6. Re: Decoding TJ on BT...ET block...
                    WCoder Community Member

                    Another question: I have a problem interpreting widths from the glyphs and from the text.

                     

                    When I display text using FreeType, the font size is 1.0 and the generated text advance is around 0.5 to 0.6 for x and 0 for y in the text matrix frame. It is not exactly the same spacing as in Acrobat Reader, so I tried to use the information from text decoding. The value is 16, but when I transform [x=16; y=0] using the text matrix, the result is not good.

                     

                    I see in the document that for Type 3 I need to divide by 1000, but the result is not good...

                     

                    The text position is good, so my text matrix context is good. Why does transforming the text width using the text matrix not give the right value?

                    • 7. Re: Decoding TJ on BT...ET block...
                      Test Screen Name MVP

                      If you're using some other system to work out font metrics, rather than reading them directly from a font, be sure to work with very large characters. 1000 point is a good choice. Otherwise you get rounding errors.

                      • 8. Re: Decoding TJ on BT...ET block...
                        WCoder Community Member

                        I already had and solved this kind of problem: when I draw a font using FreeType with a size of one, it generates deformed glyphs...

                         

                        I found the formula in the document (9.4.4 - Text Space Details) to compute Trm and update Tm for vertical or horizontal display... The start position of each word is correct, but character spacing and word spacing don't seem to work (no effect on display)...

                         

                        I replaced Tc and Tw with the values read from the PDF, without conversion, but the space between characters doesn't change...
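
A minimal Python sketch of the horizontal case of that formula, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, followed by the translation of Tm. The function names and the sample numbers are assumptions for illustration. In a simple font, Tw is applied only to the single-byte character code 32, hence the is_space_code flag.

```python
def horizontal_advance(w0, tfs, tc=0.0, tw=0.0, th=1.0, tj=0.0, is_space_code=False):
    """Displacement tx in text-space units after showing one glyph.

    w0 is the glyph width already divided by 1000, tfs the font size,
    th the horizontal scaling as a fraction (Tz / 100), and tj a pending
    TJ adjustment in thousandths.
    """
    word_spacing = tw if is_space_code else 0.0  # Tw only for character code 32
    return ((w0 - tj / 1000.0) * tfs + tc + word_spacing) * th

def translate_tm(tm, tx, ty=0.0):
    """Tm' = [1 0 0; 0 1 0; tx ty 1] x Tm, with Tm given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = tm
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)

# Sample values (assumptions): font size 1, the real scale of 12 lives in Tm.
tm = (12.0, 0.0, 0.0, 12.0, 100.0, 700.0)
tx_plain = horizontal_advance(0.606, 1.0)
tx_spaced = horizontal_advance(0.606, 1.0, tc=0.05)
tm_plain = translate_tm(tm, tx_plain)
tm_spaced = translate_tm(tm, tx_spaced)
```

With a font size of 1 and the scale carried by Tm, a Tc of 0.05 is magnified by the matrix like any other text-space distance, so the two resulting matrices differ in their e component.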

                        • 9. Re: Decoding TJ on BT...ET block...
                          Test Screen Name MVP

                          Please share an example PDF (not a fragment of code).

                          • 10. Re: Decoding TJ on BT...ET block...
                            MikelKlink Community Member

                            WCoder wrote:

                             

                            I found the formula in the document (9.4.4 - Text Space Details) to compute Trm and update Tm for vertical or horizontal display... The start position of each word is correct, but character spacing and word spacing don't seem to work (no effect on display)...

                             

                            I replaced Tc and Tw with the values read from the PDF, without conversion, but the space between characters doesn't change...

                             

                            Then either you have a bug in your code (because the character and word spacing values obviously are influencing the Tm update) or the PDF explicitly positions each glyph individually.

                             

                            BTW, using integer arithmetic might be such a bug and would explain why distances don't change when testing with different 0.x character spacing values.
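
A minimal Python sketch of that failure mode. Both functions and the sample numbers are assumptions for illustration; the first one is wrong on purpose.

```python
def advance_truncating(w0, tfs, tc):
    """Buggy on purpose: each term is truncated to an integer before summing."""
    return int(w0 * tfs) + int(tc)

def advance_floating(w0, tfs, tc):
    """Same sum kept in floating point."""
    return w0 * tfs + tc

# Two different character spacing values below 1.0.
trunc_a = advance_truncating(0.606, 12.0, 0.2)
trunc_b = advance_truncating(0.606, 12.0, 0.7)
float_a = advance_floating(0.606, 12.0, 0.2)
float_b = advance_floating(0.606, 12.0, 0.7)
```

The truncating version returns the same advance for both spacing values, because int() discards the fractional part, while the floating-point version keeps them apart.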

                            • 11. Re: Decoding TJ on BT...ET block...
                              WCoder Community Member

                              Sorry, but what is BTW?

                              • 12. Re: Decoding TJ on BT...ET block...
                                MikelKlink Community Member

                                WCoder wrote:

                                 

                                Sorry, but what is BTW?

                                 

                                Excuse me for confusing you; BTW is short for "by the way" and is often used in chats.

                                • 13. Re: Decoding TJ on BT...ET block...
                                  WCoder Community Member

                                  ~o~ ... I use AGG, which works with doubles, but I already had an issue with FreeType: my PDF sample page uses a font size of 1, which causes glyph deformation, so I apply a scale factor to solve this (I generate a path from the glyph and draw the path)... so I don't think it's this kind of problem...

                                   

                                  Maybe a bug or a transform matrix problem...

                                   

                                  The AGG doc is not complete, the integration of FreeType in AGG is not really documented, and templates do not simplify code analysis...

                                   

                                  So I need to re-re-....re-re-read the doc and check my code ^_^