    Does tagging a PDF for accessibility affect privacy?

      I have read a lot about "Enabling accessibility and Reflow" in Acrobat and I have some questions (more than the title might suggest).

      I converted a set of scanned book pages (JPEG files) into an accessible PDF. The initial size was 40,000 KB. After some adjustments I sanitized the document, as I usually do before publishing, and noticed that overlapping objects were being erased.

      I found it strange that a recently created file had overlapping objects, and after further investigation I realized that they had been created by the accessibility and reflow utility. Even stranger, the file size after sanitizing was 250,000 KB.

      So I have some questions about it:


      1. Why does the file size change so much after sanitizing?


      2. What consequences for privacy might those tags have, considering the original JPEG files didn't have any sensitive tags themselves?

      I tried to examine them tag by tag, but there are far too many to check one by one for personal information that might have been stored there without my knowledge.


      3. If I sanitize the document, am I losing accessibility and reflow, or is it just my impression?


      4. Just one off-topic question: I should apply OCR only after sanitizing, right?

          "Does tagging a PDF for accessibility affect privacy?"  -- No.

            I think you are doing quite a bit more work than you need to. Once you've scanned your document, there are only a couple of things you should check for, such as hidden text and unnecessary metadata. Once you've done that, save your document. Then do the OCR and add the tags.


            There's no need to go through the document tag by tag looking for hidden content or items that need sanitizing as Acrobat doesn't add anything that's private.


            Having overlapping content in a document is quite common actually. If you have an image on a page, for example, and crop it to allow for more text on the page, the part of the image that's cropped is considered hidden content. That's probably also why your file size increased, although I can't say that for sure.


            This is the workflow I recommend:


            1. Open your PDF document and look for metadata and hidden text, and clean that out.

            2. Save the file.

            3. Run OCR. Use ClearScan as the scan method for the best accuracy.

            4. Add tags.

            5. Save your document.
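
            Step 1 of the workflow above can be partly spot-checked outside Acrobat. The following is a minimal Python sketch, not an Acrobat feature: the function name find_plain_metadata and the sample bytes are made up for illustration. It only sees document-information values stored as uncompressed literal strings, so an empty result does not prove a file is clean.

```python
import re

# Document-information keys from the PDF specification that often carry
# personal or software-identifying data.
INFO_KEYS = (b"/Author", b"/Creator", b"/Producer", b"/Title", b"/Subject", b"/Keywords")


def find_plain_metadata(pdf_bytes: bytes) -> dict:
    """Return metadata visible in the raw, uncompressed bytes of a PDF.

    Heuristic only (an assumption of this sketch): values inside compressed
    object streams, hex strings, encrypted files, or strings containing
    unescaped nested parentheses are not detected.
    """
    found = {}
    for key in INFO_KEYS:
        # Match a literal string value such as: /Author (Jane Doe)
        pattern = re.escape(key) + rb"\s*\(((?:\\.|[^\\)])*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            name = key.decode("ascii").lstrip("/")
            found[name] = match.group(1).decode("latin-1")
    # An XMP metadata packet is an XML block that starts with this marker.
    found["has_xmp_packet"] = b"<?xpacket begin" in pdf_bytes
    return found


# Hypothetical in-memory sample standing in for a real file.
sample = b"%PDF-1.7\n1 0 obj\n<< /Author (Jane Doe) /Producer (Some Scanner) >>\nendobj\n"
print(find_plain_metadata(sample))
```

            To try it on a real file, read the bytes first, for example with open("book.pdf", "rb") where book.pdf is a placeholder name, and pass them to the function. Anything it reports is worth removing in Acrobat before publishing.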