There is no "noindex" tag in PDF; what you need to do is prevent the search engine from indexing the file. The most straightforward method is to use a robots.txt file on your web server and then hope that the search engine's spider program actually honors the information in that file. In your case, that will not help, because you don't know in advance who is breaking the rules and making the files available. To prevent content extraction, you can assign a permissions (owner) password. To do that, open the PDF file in Acrobat, and then bring up the document properties dialog (Ctrl-D or Cmd-D, or via the menu item in the File menu). Then go to the Security tab and choose to add password security. Now make sure that "Enable copying of text, images and other content" is not enabled. This should prevent a well-behaved PDF indexer from accessing your content, but if somebody's software is not playing by the rules imposed by the PDF format, there is nothing you can do that would not also severely restrict the usefulness of the PDF documents.
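
To illustrate what "honoring robots.txt" means in practice, here is a minimal Python sketch of the check a well-behaved crawler would run before fetching a file. The `/pdfs/` directory, the `ExampleBot` user agent, and the `may_fetch` helper are assumptions made up for this example, not anything from your site or from a real search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: asks every crawler to stay out of /pdfs/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""

def may_fetch(url, user_agent="ExampleBot"):
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(may_fetch("https://example.com/pdfs/report.pdf"))
print(may_fetch("https://example.com/index.html"))
```

Note that nothing forces a crawler to run this check at all, which is exactly the limitation described above.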
Thanks. That sounds like a good option!
Maybe someday they'll add a 'noindex' option in PDFs. It seems like something that should be easy to implement in tags or other meta content that search engines can read.
The interesting question is who the "they" is who would do that. It wouldn't need any specific changes to PDF to add more metadata, but people would prefer to see something simple in the UI (or a simple tool). But how do you persuade all of the makers of indexing tools that this is a thing they want to do? Each indexing tool would need to invest in it separately. Adobe doesn't control PDF any more; that is done by ISO, and they can take years to change anything at all. Anyone could invent a tag, but would it help, or would it in fact give a false sense of security?
In fact it already exists as an HTTP tag: each PDF served comes with HTTP header data, outside the file itself. (HTML can have it both inside and outside.) But most web curators don't have the power to set this. Google invented noindex; they would be the people to persuade.
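
For anyone who does control the server, the HTTP-level tag referred to here is the `X-Robots-Tag` response header. The following is only a sketch of attaching it to PDF responses using Python's built-in `http.server`; the `robots_headers` helper, the `NoIndexPDFHandler` class, and the port number are hypothetical names chosen for this example, and a production site would normally set the header in its web server configuration instead.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

def robots_headers(path):
    """Return extra response headers for a request path (hypothetical helper)."""
    # Ignore any query string or fragment when checking the extension.
    path_only = path.split("?", 1)[0].split("#", 1)[0]
    if path_only.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}

class NoIndexPDFHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory, tagging PDFs as noindex."""

    def end_headers(self):
        # Add the header just before the header section is closed.
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()

def serve(port=8000):
    """Start the demo server (assumed port; runs until interrupted)."""
    ThreadingHTTPServer(("", port), NoIndexPDFHandler).serve_forever()
```

Calling `serve()` starts the demo server. As with robots.txt, the header is only a request: a crawler that ignores the rules will ignore this too.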
When a search engine ignores the robots.txt file it will also ignore this tag in a PDF file.
Yes, you are correct. That is why I started the post with "encourage search engines" not "prohibit." It seems the only way to truly block a search engine would be adding a password.
But I believe most mainstream search engines respect the noindex tag, so if such a tag could be added to PDFs (with ISO approval?), I believe they would also respect it.
Perhaps Google should be encouraged to respect a noindex tag in the metadata of a PDF, just like they do in HTML.
Thanks for all the great comments/suggestions.