From a coder's high-level view, what is involved in finding a specified piece of text and replacing it with something else?
What are the main steps in doing this?
Thank you for your time.
It's a tad complicated.
1) Extract the object with the text you want. This may require you to read in the cross-reference (xref) table or xref stream.
2) In the unlikely event the text is simply in string objects, or has an /ActualText entry, extract the strings directly from between ( and )
3) In the more likely event that the object is compressed, decompress it. Usually this will use the FlateDecode filter, so you'll need access to a zlib function unless you're the type of masochist who wants to write one yourself.
4) If the stream's /DecodeParms specify a /Predictor, "unpredict" the data. See my other post about this.
5) Map the text to Unicode via a CMap, either embedded or standard. See Annex D of the PDF Specification for info on standard encodings. See section 9.10.3 of the PDF specification for more information on non-standard maps. (That will lead you to still other spec. docs.)
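As a rough illustration of steps 2 and 3, here is a minimal Python sketch, not a full parser. It assumes you already have the raw bytes of one object whose stream uses only /FlateDecode with no predictor, and that the text is shown with simple literal strings followed by the Tj operator. The function names inflate_stream and literal_strings are made up for this example. The bytes it returns are still character codes in the font's encoding, so step 5 still applies.

```python
import re
import zlib

def inflate_stream(obj_bytes: bytes) -> bytes:
    """Decompress the body of one FlateDecode stream object (hypothetical helper)."""
    # Stream data starts after the 'stream' keyword plus its end-of-line marker.
    m = re.search(rb"stream\r?\n(.*?)endstream", obj_bytes, re.DOTALL)
    if m is None:
        raise ValueError("no stream found in object")
    # decompressobj stops at the end of the zlib data, so the EOL that
    # precedes 'endstream' lands in unused_data instead of raising an error.
    return zlib.decompressobj().decompress(m.group(1))

def literal_strings(content: bytes) -> list:
    """Return the raw (...) operands of Tj operators.

    TJ arrays, hex strings <...> and nested unescaped parentheses are
    deliberately ignored to keep the sketch short.
    """
    pattern = rb"\(((?:\\.|[^\\()])*)\)\s*Tj"
    return re.findall(pattern, content, re.DOTALL)
```

A production tool would honour the /Length entry instead of searching for endstream, and would tokenize the content stream properly rather than using a regular expression.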
Thank you for the overview.
It sounds, from my newbie perspective, like there is a lot of reading in front of me.
Heck, I have never programmed anything against PDF or Acrobat.
I actually started to read the PDF reference and am now at the FlateDecode part.
It's the kind of document I have to read a couple of times... the marinating process will take time.
Section 3.4.3 tells me:
"The cross-reference table contains information that permits random access to indirect objects within the file so that the entire file need not be read to locate any particular object."
So I see now part of what you meant (at first it was all unknown terminology to me).
I read that, if I actually wanted to replace some text or graphics, I may have to update the xref tables.
Does that mean the physical byte offset is hard-coded for every object?
I think this is what they say:
"The format of an in-use entry is
nnnnnnnnnn ggggg n eol
nnnnnnnnnn is a 10-digit byte offset
ggggg is a 5-digit generation number
n is a literal keyword identifying this as an in-use entry
eol is a 2-character end-of-line sequence
The byte offset is a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object."
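To make the quoted layout concrete, here is a small Python sketch of reading and writing one such 20-byte entry. The names XrefEntry, parse_xref_entry and format_xref_entry are invented for illustration, and it assumes a classic xref table rather than an xref stream.

```python
from typing import NamedTuple

class XrefEntry(NamedTuple):
    offset: int       # bytes from the start of the file to the object
    generation: int   # generation number
    in_use: bool      # True for an 'n' entry, False for a free 'f' entry

def parse_xref_entry(line: bytes) -> XrefEntry:
    """Parse one fixed-width entry: 'nnnnnnnnnn ggggg n' plus a 2-character EOL."""
    if len(line) != 20:
        raise ValueError("an xref entry is exactly 20 bytes long")
    offset = int(line[0:10].decode("ascii"))
    generation = int(line[11:16].decode("ascii"))
    kind = line[17:18]
    if kind not in (b"n", b"f"):
        raise ValueError("entry type must be 'n' or 'f'")
    return XrefEntry(offset, generation, kind == b"n")

def format_xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Write an entry back out, zero-padded, using space + LF as the 2-character EOL."""
    kind = "n" if in_use else "f"
    return f"{offset:010d} {generation:05d} {kind} \n".encode("ascii")
```

Because every entry has the same width, changing an offset never shifts the table itself; only objects that come after an edited object move, and those are the ones whose offsets would need rewriting.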
Could you share, if you want, how you would go about recalculating all the offsets? (And is there more to do?)