1. Get hold of the PDF specifications. Don't worry, they're free.
2a. You have to use binary read functions (as per the PDF specification).
2c. Then you have to implement the PDF coordinate system, which is heavily based on PostScript-style matrix operations and supports several independent "layers" of transformations.
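To give a feel for steps 2a and 2c, here is a rough Python sketch, not a real parser. It assumes the whole file fits in memory; the function names are made up for illustration. It reads the header and the `startxref` offset as raw bytes, and it concatenates 6-number PDF matrices `[a b c d e f]` the way the `cm` operator does (new matrix times current matrix).

```python
import re

def pdf_version(data: bytes) -> str:
    """Return the version from a header such as b'%PDF-1.4' (assumed at byte 0)."""
    m = re.match(rb"%PDF-(\d+\.\d+)", data)
    if m is None:
        raise ValueError("no %PDF- header found")
    return m.group(1).decode("ascii")

def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    idx = data.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)", data[idx:])
    if m is None:
        raise ValueError("startxref is not followed by an offset")
    return int(m.group(1))

def concat(m1, m2):
    """Matrix product m1 x m2 for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )

def apply(m, x, y):
    """Transform the point (x, y): x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

IDENTITY = (1, 0, 0, 1, 0, 0)

# Equivalent of "1 0 0 1 10 20 cm" followed by "2 0 0 2 0 0 cm":
ctm = concat((1, 0, 0, 1, 10, 20), IDENTITY)   # translate by (10, 20)
ctm = concat((2, 0, 0, 2, 0, 0), ctm)          # then scale by 2
```

For a real file you would read the bytes with `open(path, "rb").read()`. For text, the text matrix is concatenated with the current transformation matrix in the same manner, which is where the "layers" come from.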
Oh, and since you ask about text:
Somehow I doubt it's worth all this trouble. Can't you just open the PDF in Illustrator?
There are PDFs that Illustrator doesn't work so well on (not counting the large class of bitmapped PDFs that it doesn't work on at all).
What an odd tool
It could be that this is what the OP was ultimately after (I still don't see what InDesign/scripting has to do with it). Its online demo shows how accurately it can work: a random PDF got converted to several thousand lines, one for each character in the PDF, all along the lines of
<span style="position:absolute; left:113px; top:180px; font-size:13px;">A</span>
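For illustration only, here is a small Python sketch of how one positioned character could become a line like that. This is a guess at the general idea, not the tool's actual code. The function name is hypothetical, and the "top" value is a crude approximation (page height minus baseline minus font size) to flip PDF's bottom-up y axis into HTML's top-down one.

```python
from html import escape

def char_to_span(ch, x, y, font_size, page_height):
    """Format one character as an absolutely positioned HTML span.

    x, y are PDF user-space coordinates of the baseline origin (y grows upward);
    page_height is needed to convert to HTML's top-down coordinates.
    """
    left = round(x)
    # Assumption: treat the glyph box as font_size tall, sitting on the baseline.
    top = round(page_height - y - font_size)
    return (
        f'<span style="position:absolute; left:{left}px; top:{top}px; '
        f'font-size:{round(font_size)}px;">{escape(ch)}</span>'
    )

line = char_to_span("A", 113, 599, 13, 792)
```

With a 792-point-high page, the call above yields the same shape of line as the demo output quoted here.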
Thanks for the tip; seems this Python scripter (!) has done all of the hard work I mentioned above. Definitely something to experiment with.
(On further examination: I don't think we'll ever know what the OP thinks of this.
He didn't follow up on even one of his twenty-something posts, and also didn't bother to declare any of them "answered" to his satisfaction.
Hard to please, eh? Some points would have been nice, too.)
Perhaps, though, no one has pointed out the point system to him. After all, it's not intuitively obvious that it matters if you don't spend a lot of time hanging out here...
A good way to get this sort of information out of a PDF would be to use Adobe's own PDFXML format (formerly Mars). This gives an archive that presents each page as a separate SVG file. It's much easier than trying to work with a PDF binary file.
However it's all gone a bit quiet on the Adobe Labs Mars pages with no update for Acrobat X... Perhaps it's just a dead-end.
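If you do have such an archive, pulling positioned text out of the per-page SVG files might look roughly like the Python sketch below. It assumes the archive is an ordinary ZIP container, that page files end in `.svg`, and that text sits in standard SVG `<text>` elements with `x`/`y` attributes. The internal layout of a real PDFXML file may differ, so treat the function name and the selection logic as hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def svg_text_items(archive):
    """Yield (member_name, x, y, text) for each <text> element in every .svg member.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.lower().endswith(".svg"):
                continue  # skip fonts, images, metadata and so on
            root = ET.fromstring(zf.read(name))
            for node in root.iter(SVG_NS + "text"):
                # itertext() also collects text nested in <tspan> children
                yield name, node.get("x"), node.get("y"), "".join(node.itertext())
```

The coordinates come back as attribute strings (or `None` if absent), so convert them with `float()` before doing any arithmetic.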