
Decoding TJ on BT...ET block...

Jun 11, 2013 8:46 AM

Tags: #text #decode #et #tj #bt

Hi,

 

I have a problem when I try to decode this kind of thing:

 

[(\007\003\b\007\006)-275(\004\002\005\003\007)]TJ

(\000)Tj

(\024\020\004\b\000\017\032\035\030\034\000\025\021\001\033\036\027\r\ 024\020\004\b\000\017\032\035\030\034\000\025\021\001\033\036\027\000\ 000\004\b\002\f\002\003\t\000\000\004\b\r\b\t\000\000\023\026\031\030\ 000\004)Tj

[(VOLUME)-600(TWO)-1801(ISSUE)-600(THREE)]TJ

 

For the last one: [(VOLUME)-600(TWO)-1801(ISSUE)-600(THREE)]TJ

the text is inside the ( ) and -600, -1801, -600 are the spacing adjustments between the words (according to the "PDF Reference").
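To make the structure of such an operand concrete, here is a minimal Python sketch that splits a TJ array into its strings and its numeric adjustments. The function name `parse_tj_array` is hypothetical, and the sketch assumes the operand is already available as text and contains only literal ( ) strings; hex strings written as < > are not handled. Per the PDF Reference, each number is in thousandths of a text space unit and is subtracted from the current position, so a negative value widens the gap.

```python
def parse_tj_array(operand: str):
    """Split a TJ operand such as '[(AB)-600(CD)]' into strings and floats.

    Hypothetical helper for illustration only: literal strings are returned
    with their escapes still undecoded, and hex strings are not supported.
    """
    items = []
    i, n = 0, len(operand)
    while i < n:
        ch = operand[i]
        if ch == '(':
            depth = 1
            i += 1
            start = i
            # Scan to the matching ')' while honouring nesting and escapes.
            while i < n and depth > 0:
                if operand[i] == '\\':
                    i += 2  # skip the backslash and the escaped character
                    continue
                if operand[i] == '(':
                    depth += 1
                elif operand[i] == ')':
                    depth -= 1
                i += 1
            items.append(operand[start:i - 1])
        elif ch in '+-.0123456789':
            start = i
            while i < n and operand[i] in '+-.0123456789':
                i += 1
            items.append(float(operand[start:i]))
        else:
            i += 1  # brackets and whitespace
    return items


print(parse_tj_array('[(VOLUME)-600(TWO)-1801(ISSUE)-600(THREE)]'))
```

A plain Tj operator takes a single string, so it behaves like a TJ array with one string and no adjustments.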

 

For the others I have two questions. In one case I have an array [ ] and in the other I do not...

 

The strings seem to use escape sequences (according to the "PDF Reference"):

\n Line feed (LF)
\r Carriage return (CR)
\t Horizontal tab (HT)
\b Backspace (BS)
\f Form feed (FF)
\( Left parenthesis
\) Right parenthesis
\\ Backslash
\ddd Character code ddd (octal)
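As an illustration of the escape table above, here is a minimal Python sketch that turns the body of a literal string (the part between the outer parentheses) into raw bytes. The name `decode_pdf_literal` is hypothetical. The result is a sequence of character codes, not readable text: mapping those codes to characters still depends on the font.

```python
def decode_pdf_literal(body: bytes) -> bytes:
    """Decode backslash escapes in a PDF literal string body.

    Sketch only: `body` excludes the outer parentheses, and normalisation of
    unescaped end-of-line markers is not implemented.
    """
    simple = {
        ord('n'): 0x0A, ord('r'): 0x0D, ord('t'): 0x09,
        ord('b'): 0x08, ord('f'): 0x0C,
        ord('('): 0x28, ord(')'): 0x29, ord('\\'): 0x5C,
    }
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:  # ordinary byte, copy as-is
            out.append(c)
            i += 1
            continue
        i += 1  # step over the backslash
        if i >= len(body):
            break  # lone trailing backslash: ignore it
        c = body[i]
        if c in simple:
            out.append(simple[c])
            i += 1
        elif 0x30 <= c <= 0x37:  # one to three octal digits
            value = 0
            j = i
            while j < len(body) and j < i + 3 and 0x30 <= body[j] <= 0x37:
                value = value * 8 + (body[j] - 0x30)
                j += 1
            out.append(value & 0xFF)  # high-order overflow is dropped
            i = j
        elif c in (0x0A, 0x0D):  # backslash + end of line: continuation
            i += 1
            if c == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:
            out.append(c)  # unknown escape: the backslash is ignored
            i += 1
    return bytes(out)


print(list(decode_pdf_literal(rb"\007\003\b\007\006")))
```

For the first string in the post this gives the five codes 7, 3, 8, 7, 6. Note that \b here is simply code 8, not a separator between two words.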

 

My problem is how to convert \007\003\b\007\006 to "first word" \b "second word"?

 

Is 007 supposed to be a character code in octal?!

 

How do I convert it to ASCII or Unicode?

 

Thanks,

 

WCoder

 
Replies
    Jun 11, 2013 8:58 AM   in reply to WCoder

    Please read section "9.10 Extraction of Text Content" in the PDF Reference. Make sure you understand every bit of it (which might require reading a lot of other parts of the PDF Reference, and possibly other documents about fonts and encodings) and then, if there is anything you still do not understand, come back with specific questions.

     

    If you decide not to completely wrap your head around the material mentioned above, your text extraction will never work well.

     

    Olaf

     
    Jun 11, 2013 10:03 AM   in reply to WCoder

    No, you need to read it more. There is no simple explanation because it is not simple! It is certainly not an index to the glyph in the font, and you must analyse ToUnicode, CMaps and/or Encodings for each character in the byte string (\007 is interpreted in the normal way because this is a string).

     

    Are there any specific points from 9.10 which we can help you clarify? We cannot substitute for a DEEP understanding of these issues, but perhaps we can help you to it.

     
    Jun 11, 2013 10:51 AM   in reply to WCoder

    Just a minor clarification for anyone reading over this thread.

     

    \007 is an index into the encoding array. This is always an array of 256 characters. The /Differences array is used to modify the Encoding array after it is built from the standard name used, but Differences isn't the array itself.
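A minimal Python sketch of that point, assuming the base encoding is already available as a list of 256 glyph names and that /Differences has been parsed into a flat list of integers and names. The function name `apply_differences` and the sample values are hypothetical. An integer sets the next code, and each following name is assigned to consecutive codes.

```python
def apply_differences(base, differences):
    """Return a copy of a 256-entry glyph-name table patched by /Differences.

    Sketch only: `base` is a list of 256 glyph names and `differences` is a
    flat list such as [7, "V", "O", 32, "space"].
    """
    table = list(base)
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item  # an integer sets the next code to patch
        else:
            if 0 <= code < 256:
                table[code] = item  # a name replaces the entry at that code
            code += 1  # the following name goes to the next code
    return table


# Made-up data for illustration: a base table with every code undefined.
base = [".notdef"] * 256
table = apply_differences(base, [7, "V", "O", 32, "space"])
print(table[7], table[8], table[32])
```

The byte \007 from the content stream is then looked up as `table[7]`, which yields a glyph name rather than a Unicode character.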

     

    When extracting text you should first look for a ToUnicode value, because if it is present it must be used in preference to the Encoding.
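As a rough sketch of that lookup order in Python, assuming a simple font with single-byte codes: `to_unicode` is the parsed ToUnicode CMap as a dict from code to string, `encoding_table` is a 256-entry list of glyph names, and `glyph_to_unicode` maps glyph names to characters (for instance built from a glyph list). All of these names are hypothetical, and falling back to the Encoding when ToUnicode lacks a code is a choice made for this sketch.

```python
REPLACEMENT = "\ufffd"  # used when a code cannot be mapped at all


def codes_to_text(codes, to_unicode=None, encoding_table=None,
                  glyph_to_unicode=None):
    """Map single-byte character codes to text, preferring ToUnicode.

    Sketch only: composite (multi-byte) fonts are not handled.
    """
    out = []
    for code in codes:
        # 1. ToUnicode wins whenever it has an entry for this code.
        if to_unicode is not None and code in to_unicode:
            out.append(to_unicode[code])
            continue
        # 2. Otherwise go through the encoding: code -> glyph name -> text.
        if encoding_table is not None and glyph_to_unicode is not None:
            name = encoding_table[code]
            if name in glyph_to_unicode:
                out.append(glyph_to_unicode[name])
                continue
        # 3. Nothing usable: emit a replacement character.
        out.append(REPLACEMENT)
    return "".join(out)


# Made-up mappings for illustration.
table = [".notdef"] * 256
table[8] = "L"
print(codes_to_text(bytes([7, 3, 8]), {7: "V", 3: "O"}, table, {"L": "L"}))
```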

     
    Aug 16, 2013 6:26 AM   in reply to WCoder

    If you're using some other system to work out font metrics, rather than reading them directly from a font, be sure to work with very large characters. 1000 point is a good choice. Otherwise you get rounding errors.

     
    Aug 19, 2013 12:22 AM   in reply to WCoder

    Please share an example PDF (not a fragment of code).

     
    Aug 19, 2013 12:35 AM   in reply to WCoder

    WCoder wrote:

     

    I found the formula in the document (9.4.4 - Text Space Details) to compute Trm and update Tm for vertical or horizontal display... The start position of each word is correct, but character spacing and word spacing don't seem to work (no effect on display)...

     

    I replaced Tc and Tw with the values read from the PDF, without conversion, but the space between characters doesn't change...

     

    Then either you have a bug in your code (because the character and word spacing values obviously are influencing the Tm update) or the PDF explicitly positions each glyph individually.

     

    BTW, using integer arithmetic might be such a bug and would explain why distances don't change when testing with different 0.x character spacing values.
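To illustrate, here is a minimal Python sketch of the horizontal displacement from section 9.4.4, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, computed in floating point. The function and parameter names are hypothetical. It assumes w0 is the glyph width already in text space units, Th is the horizontal scaling as a fraction (Tz / 100), and word spacing is only added for the single-byte space code 32.

```python
def horizontal_displacement(w0, tj, font_size, char_spacing, word_spacing,
                            h_scale=1.0, is_space=False):
    """Return tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th as a float.

    Sketch only: Tw is applied only when the glyph is the space code 32.
    """
    tw = word_spacing if is_space else 0.0
    return ((w0 - tj / 1000.0) * font_size + char_spacing + tw) * h_scale


# With Tc = 0.3 the advance grows by 0.3 units, as expected.
plain = horizontal_displacement(0.5, 0, 12.0, 0.0, 0.0)
spaced = horizontal_displacement(0.5, 0, 12.0, 0.3, 0.0)
print(plain, spaced)

# If Tc were truncated to an integer first, the spacing would vanish:
# int(0.3) is 0, so both calls would return the same advance.
truncated = horizontal_displacement(0.5, 0, 12.0, int(0.3), 0.0)
print(truncated == plain)
```

If the values are stored in integer variables anywhere along the way, any Tc or Tw below 1 becomes 0 and the glyph positions stop responding to it, which matches the symptom described.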

     
    Aug 19, 2013 1:30 AM   in reply to WCoder

    WCoder wrote:

     

    Sorry but what is BTW ?

     

    Sorry for confusing you; BTW is short for "by the way" and is often used in chats.

     