-
1. Re: Extra space getting added between two Tj's
Test Screen Name Sep 22, 2014 6:55 AM (in response to Sunny Boyka)
Are you using Acrobat to create this PDF? The Acrobat SDK? Do you believe Acrobat is displaying it wrong?
-
2. Re: Extra space getting added between two Tj's
Sunny Boyka Sep 23, 2014 12:17 AM (in response to Test Screen Name)
I have a PDF that was created using Acrobat. I am using my own logic to extract its text and write it into Notepad. Can you please tell me more about the text matrix?
My logic writes the text into Notepad without a space when the text matrix is 1 0 0 1.
So I want to know how to read this text matrix.
-
3. Re: Extra space getting added between two Tj's
Test Screen Name Sep 23, 2014 1:19 AM (in response to Sunny Boyka)
No, I don't think I understand, sorry.
Did you create the content stream? Or did Acrobat make it in the usual way (e.g. print to PDF, convert a file etc.)? It still isn't clear whether
- you are learning PDF creation and puzzled
- you are learning PDF parsing and puzzled
- Acrobat seems to be displaying a PDF wrongly
- Acrobat seems to be making a PDF wrongly
The assistance we can give you, or whether we refer you somewhere else, depends on this.
(You don't have to worry about how to read a text matrix specifically. You must apply the calculations in the PDF reference, which derive the text rendering matrix (Trm) from the text matrix, CTM and other graphics state parameters. This also describes how to do text positioning. Do not make any assumptions, just do the math. Don't forget to use text widths AS STORED IN THE PDF, not as found in the system or external font file).
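As a rough illustration of the calculation referred to above, the PDF reference defines Trm as the product of a text-parameter matrix (font size, horizontal scaling, rise), the text matrix Tm and the CTM. The following is a minimal Python sketch of that product, assuming matrices are held as six-number tuples (a, b, c, d, e, f). The function names, the font size and the CTM value are hypothetical and chosen only for the example.

```python
def concat(m1, m2):
    """Return m1 x m2 for PDF matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def text_rendering_matrix(tm, ctm, font_size, h_scale=1.0, rise=0.0):
    """Trm = [Tfs*Th 0 0; 0 Tfs 0; 0 Trise 1] x Tm x CTM."""
    params = (font_size * h_scale, 0.0, 0.0, font_size, 0.0, rise)
    return concat(concat(params, tm), ctm)


# Illustrative values only: a text matrix with a negative fourth component
# and a hypothetical CTM that also flips the y axis.
tm = (0.99, 0.0, 0.0, -1.0, 1409.0, 77.0)
ctm = (1.0, 0.0, 0.0, -1.0, 0.0, 792.0)
trm = text_rendering_matrix(tm, ctm, font_size=12.0)
print(trm)
```

In this made-up case the two negative components cancel, so the glyphs end up upright in device space. That is the sense in which a single entry of Tm has no meaning until it is multiplied through.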
-
4. Re: Extra space getting added between two Tj's
Sunny Boyka Sep 23, 2014 10:08 PM (in response to Sunny Boyka)
Actually, I am extracting text from a PDF that was created using Acrobat.
1 0 0 1 1409 77 Tm
[(Filer) -48.4848 (oS) -33.3333 (earchE) -47.4747 (xpres) -36.3636 (si) -44.4444 (on_bp) -37.3737 (:)] TJ
1 0 0 1 1827 77 Tm
[(bp)] TJ
The text getting extracted is "FileroSearchExpression_bp:bp"
0.99 0 0 -1 1409 77 Tm
[(Filer) -48.4848 (oS) -33.3333 (earchE) -47.4747 (xpres) -36.3636 (si) -44.4444 (on_bp) -37.3737 (:)] TJ
0.99 0 0 -1 1827 77 Tm
[(bp)] TJ
The text getting extracted is "FileroSearchExpression_bp: bp"
In the extracted text I am getting an extra space when the text matrix has -1 in it. So I want to know the significance of the -1.
-
5. Re: Extra space getting added between two Tj's
Test Screen Name Sep 24, 2014 1:56 AM (in response to Sunny Boyka)
Are you writing text extraction software that analyses the content streams?
Or using Acrobat to extract text?
Or using different software to extract text?
The effect of the -1, like the other elements, is to change how Trm is generated, and how the text spacing is calculated. It doesn't have a specific effect by itself. It will be combined with the CTM, which is set by cm and which you didn't mention; my guess is that the cm will also have a negative fourth component.
In all cases it's important to realise that spaces are inserted by guesswork and fuzzy logic and are often wrong. They may also depend on other factors including average text spacing elsewhere on the page.
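To show what this kind of guesswork can look like, here is a minimal Python sketch. It first computes the horizontal advance of a TJ array using the displacement formula from the PDF reference, then applies a simple gap threshold to decide whether a space belongs between two runs. The glyph widths, the font size, the threshold ratio and the function names are all hypothetical, and the sketch assumes unrotated horizontal text. It does not diagnose the file discussed in this thread.

```python
def tj_advance(items, widths, font_size, char_space=0.0, word_space=0.0,
               h_scale=1.0, default_width=500):
    """Total horizontal displacement of a TJ array, in text space units.

    items  : strings and numbers, as in [(ab) -100 (c)] TJ
    widths : glyph widths in 1/1000 units, as stored in the PDF font
    """
    total = 0.0
    for item in items:
        if isinstance(item, str):
            for ch in item:
                w0 = widths.get(ch, default_width) / 1000.0
                extra = word_space if ch == " " else 0.0
                total += (w0 * font_size + char_space + extra) * h_scale
        else:
            # A number moves the position by -Tj/1000 * Tfs, scaled by Th.
            total += (-item / 1000.0) * font_size * h_scale
    return total


def needs_space(prev_tm, prev_advance, next_tm, font_size, ratio=0.25):
    """Guess a space when the gap exceeds a fraction of the font size."""
    end_x = prev_tm[4] + prev_advance * prev_tm[0]
    gap = next_tm[4] - end_x
    return gap > ratio * font_size


# Hypothetical numbers: every glyph is 500/1000 wide, font size 10.
advance = tj_advance(["ab", -100, "c"], widths={}, font_size=10.0)
next_tm = (1.0, 0.0, 0.0, 1.0, 118.4, 77.0)
print(needs_space((1.0, 0.0, 0.0, 1.0, 100.0, 77.0), advance, next_tm, 10.0))
print(needs_space((0.99, 0.0, 0.0, 1.0, 100.0, 77.0), advance, next_tm, 10.0))
```

With these invented numbers the gap sits close to the threshold, so changing only the first component of the text matrix from 1 to 0.99 is enough to flip the decision. Any fixed threshold has borderline cases like this.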
-
6. Re: Extra space getting added between two Tj's
Sunny Boyka Sep 24, 2014 3:13 AM (in response to Test Screen Name)
Yes, I am writing text extraction software.
-
7. Re: Extra space getting added between two Tj's
Test Screen Name Sep 24, 2014 3:20 AM (in response to Sunny Boyka)
So you are doing all the calculations to obtain Trm and text spacing, using embedded Widths or a fixed table for the base 14 fonts, yes?
You say a space is extracted in a particular case, but do you mean that your software extracts a space, or some other software (e.g. Acrobat)?
-
8. Re: Extra space getting added between two Tj's
Sunny Boyka Sep 24, 2014 3:26 AM (in response to Test Screen Name)
So you are saying it depends on the concatenation operator, cm?
-
9. Re: Extra space getting added between two Tj's
Test Screen Name Sep 24, 2014 3:36 AM (in response to Sunny Boyka)
I am guessing, since you are only now asking about cm, you are NOT calculating Trm. This is very important. In turn this suggests you are just trying to pick the interesting details from page streams instead of fully understanding.
Text extraction is MUCH more difficult than many people at first suppose because of
- need to keep graphics state, state stacks, Q/q
- text in form XObjects
- width and spacing calculations
- text widths from different sources and defaults
- different font types including CID/CMap
- text potentially set out of order
- overlain, overprinted, subscript, superscript, vertical, hidden and angled text
- multiple layers of complexity in processing font extraction
If you attempt this without deep understanding of text and graphics models you will not do it right. A fast and experienced programmer already knowing the PDF standard deeply might be able to do a fair job in 6-12 months of coding.
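The first item in that list, keeping the graphics state across q/Q, can be sketched in a few lines of Python. This is a deliberately reduced illustration that tracks only the CTM, whereas a real interpreter saves and restores the whole graphics state. It also assumes a content stream whose operands are plain numbers separated by whitespace. The function names and the sample stream are hypothetical.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m1, m2):
    """Return m1 x m2 for PDF matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def track_ctm(content):
    """Return a list of (operator, CTM after that operator)."""
    ctm = IDENTITY
    stack = []
    operands = []
    history = []
    for token in content.split():
        try:
            operands.append(float(token))
            continue  # numeric operand, keep collecting
        except ValueError:
            pass  # not a number, so treat it as an operator
        if token == "q":
            stack.append(ctm)      # save the current state
        elif token == "Q":
            if stack:
                ctm = stack.pop()  # restore the saved state
        elif token == "cm" and len(operands) >= 6:
            # cm premultiplies: CTM' = M x CTM
            ctm = concat(tuple(operands[-6:]), ctm)
        history.append((token, ctm))
        operands = []
    return history


sample = "q 1 0 0 -1 0 792 cm q 2 0 0 2 10 20 cm Q Q"
for op, matrix in track_ctm(sample):
    print(op, matrix)
```

Any text operator met between the inner q and Q would have to use the doubly transformed CTM, and text after the final Q would use the identity again. An extractor that reads Tm without this bookkeeping is working from incomplete information.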
-
10. Re: Extra space getting added between two Tj's
Test Screen Name Sep 24, 2014 3:38 AM (in response to Test Screen Name)
Sorry, I must add to that list:
- spaces usually not present in page stream, derived by fuzzy logic
- no information on columns, margins, alignment as some expect to find
- no font styling
- different logic needed for tagged files
-
11. Re: Extra space getting added between two Tj's
Sunny Boyka Sep 24, 2014 3:56 AM (in response to Test Screen Name)
Thank you so much bro !!!


