I work for a university, and I often need to convert emails to PDFs because the emails sometimes contain information that needs to be redacted, such as student information. When the requested emails are pulled from several different accounts, the same email often appears in each account. Is there a way to remove the duplicates before I start to review and redact, so I don't have to keep making the same redactions over and over?
Thanks for any help.
This might be possible. Here is how I would approach this:
Acrobat's JavaScript cannot get access to the actual PDF content on a page; all you can do is iterate over all the "words" on a page. What Acrobat considers a word may not be identical to your interpretation in all cases. You could create a "checksum" for each page in your document and then look for pages that produce the same checksum. Depending on how you create this checksum, two different pages may end up with the same value, so you may still have to compare those pages word by word to make sure you are dealing with an exact duplicate. You would then mark each duplicate page as one that needs to be deleted and, in a final step, delete the marked pages starting from the end of the document, so that the page numbers of the pages still waiting to be deleted do not shift.
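To make those steps a bit more concrete, here is a rough sketch in Acrobat JavaScript. It relies on the Doc methods `getPageNumWords`, `getPageNthWord` and `deletePages`; the helper names (`pageChecksum`, `sameWords`, `findDuplicatePages`, `removeDuplicatePages`) and the simple string hash are my own illustration, not something built into Acrobat. Treat it as a starting point under those assumptions rather than a finished tool.

```javascript
// Hypothetical helper: a simple string hash over the page's words.
// Different pages can produce the same value, so a match is only a candidate.
function pageChecksum(words) {
  var text = words.join(" ");
  var hash = 5381;
  for (var i = 0; i < text.length; i++) {
    hash = ((hash * 33) ^ text.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Hypothetical helper: exact word-by-word comparison of two pages.
function sameWords(a, b) {
  if (a.length !== b.length) return false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// pagesWords: one array of words per page.
// Returns the 0-based indices of pages that repeat an earlier page.
function findDuplicatePages(pagesWords) {
  var seen = {};        // checksum -> indices of pages kept so far
  var duplicates = [];
  for (var p = 0; p < pagesWords.length; p++) {
    var words = pagesWords[p];
    if (words.length === 0) continue; // no extractable text: never treat as a duplicate
    var key = String(pageChecksum(words));
    var candidates = seen[key] || (seen[key] = []);
    var isDuplicate = false;
    for (var c = 0; c < candidates.length; c++) {
      if (sameWords(pagesWords[candidates[c]], words)) {
        isDuplicate = true;
        break;
      }
    }
    if (isDuplicate) duplicates.push(p);
    else candidates.push(p);
  }
  return duplicates;
}

// doc: an Acrobat Doc object (for example `this` in the JavaScript console).
function removeDuplicatePages(doc) {
  var pagesWords = [];
  for (var p = 0; p < doc.numPages; p++) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var i = 0; i < count; i++) {
      words.push(doc.getPageNthWord(p, i, true)); // true strips punctuation and whitespace
    }
    pagesWords.push(words);
  }
  var duplicates = findDuplicatePages(pagesWords);
  // Delete from the end so the remaining page numbers stay valid.
  for (var d = duplicates.length - 1; d >= 0; d--) {
    doc.deletePages(duplicates[d], duplicates[d]);
  }
  return duplicates;
}
```

With the combined PDF open, you would run `removeDuplicatePages(this)` from the JavaScript console, ideally on a copy of the file. Note that the match is per page and exact: a page only counts as a duplicate if Acrobat extracts exactly the same words in the same order, and pages with no extractable words (scanned images, for example) are left alone.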
If you need help with any of these steps, that's what I do for a living.