What is the best way to detect duplicate pages?
The pages I am dealing with are searchable images (a scanned image background with selectable text on top). In this case, any two pages that have the exact same background image are duplicates.
I only know how to get page text though, so I've been getting the text and hashing it, then checking for duplicate hashes. This works for the most part, but I fear running into two different pages with the exact same text.
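For reference, a minimal sketch of that text-based check, assuming it runs in Acrobat JavaScript where `doc` is the open document (for example `this` in the console). `numPages`, `getPageNumWords` and `getPageNthWord` are Doc members from the Acrobat JavaScript API; the helper names `pageText` and `findDuplicateTextPages` are made up for illustration. It keys on the full page text rather than a hash, so hash collisions are not a concern, but it still cannot tell apart two different pages with identical text.

```javascript
// Collect the words of one page into a single string.
// Assumes `doc` exposes getPageNumWords/getPageNthWord like an Acrobat Doc object.
function pageText(doc, pageNum) {
    var words = [];
    var count = doc.getPageNumWords(pageNum);
    for (var i = 0; i < count; i++) {
        // Third argument (bStrip) = true removes punctuation and white space.
        words.push(doc.getPageNthWord(pageNum, i, true));
    }
    return words.join(" ");
}

// Return groups of 0-based page numbers whose extracted text is identical.
// Hypothetical helper, not part of the Acrobat API.
function findDuplicateTextPages(doc) {
    var seen = {};   // text key -> list of page numbers
    var order = [];  // keys in first-seen order
    for (var p = 0; p < doc.numPages; p++) {
        // Prefix the key so page text can never clash with built-in property names.
        var key = "t:" + pageText(doc, p);
        if (!Object.prototype.hasOwnProperty.call(seen, key)) {
            seen[key] = [];
            order.push(key);
        }
        seen[key].push(p);
    }
    var groups = [];
    for (var i = 0; i < order.length; i++) {
        if (seen[order[i]].length > 1) {
            groups.push(seen[order[i]]);
        }
    }
    return groups;
}
```

In the Acrobat console, something like `findDuplicateTextPages(this)` would return the candidate groups, which could then be reviewed by eye before deleting anything.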
What about looking at the background image? If a PDF has multiple pages with the same background image, I assume it would store the image once and then just reference it from the pages? Is it possible to check duplicate pages this way?
Or does Acrobat have a built-in checking solution I haven't discovered? As always, any help is appreciated.
JS has no way to access data such as the images in a PDF file. It might be
possible to do it with a Preflight profile, but I'm not 100% sure about
OK, well, for the most part doing it by text works, but it sometimes flags things that aren't duplicates: for example, two copies of the same worksheet that were not filled out will have the exact same text, despite being completely different pages.