9 Replies Latest reply on Feb 18, 2016 9:29 AM by mnx2

    Prevent search engines from crawling help output

    mnx2

      I'm trying to figure out how we can prevent our HTML webhelp from being crawled by search engines like Google. I found these instructions while digging through some of the discussions on this forum: "Stop search engine robots indexing Your private folders by 'robots.txt'" (Internet marketing Blog).

       

      However, I would expect that we could add some code to the project itself to stop search engines from crawling the help. We tried adding the following code to our master page, since the master page is applied to all topics, but the code did not survive output generation:

       

      <meta name="robots" content="NOINDEX, NOFOLLOW" />

       

      Does anyone know how we can prevent search engines from crawling our help?

        • 1. Re: Prevent search engines from crawling help output
          Willam van Weelden Adobe Community Professional & MVP

          Adding a meta tag and a robots.txt is only a courtesy. A search engine *may* decide to skip your site. But there is no guarantee.

           

          If you really don't want your content to be indexed, you have to cut off access to it. If you require authentication (for example, by using a .htaccess file; see "Htaccess Authentication" at Htaccess Tools), search engines can no longer reach your content and therefore cannot index it.

          • 2. Re: Prevent search engines from crawling help output
            deborahs68858966 Level 1

            I need to do this as well. There doesn't seem to be a way to do it within RoboHelp. Several sources have suggested Find/Replace to add the <meta> tag to each .htm file, which seems arduous and error-prone. Any other thoughts?

            • 3. Re: Prevent search engines from crawling help output
              Amebr Level 4


              A weird "feature" that might work for you.

               

              Make sure there is a Robohelp header section in your master page.

              Switch to HTML view and paste the meta code between the "?rh_region_start type=header" and "?rh_region_end type=header" tags.

              Save. RH automagically moves the code between the master page "head" tags.

               

              When you generate, the meta tag will be in each page, but not within the "head" tags. You will find it further down the page, just above the first content in the topic (e.g., the topic H1). I'm not sure whether the placement affects the webcrawlers, though.

              • 4. Re: Prevent search engines from crawling help output
                Deb Sauer Level 1

                Hi Amebr

                 

                Thanks for the info. I tried it, and it looks good at first: the meta tag moves up into the header section of the master page. But when I publish the webhelp, the tag is not in the <head> section of the .htm files. It is in the <body> section and appears as:

                 

                <div style="width: 100%; position: relative;" id="header">

                <meta name="robots" content="noindex, nofollow" />

                  <p>&#160;</p>

                </div>

                 

                When I look at the topics in the help, there is extra space at the top of the topic, above the breadcrumbs, so clearly something is there. But, it's not between the <head> and </head> tags in the .htm.

                 

                Too bad. That would have been easy.
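One way to confirm where the tag ends up in a generated file is to parse the page and record whether each robots meta tag sits inside the head. The following is a minimal sketch using only the Python standard library; the names `RobotsMetaLocator` and `robots_meta_locations` are hypothetical and are not part of RoboHelp.

```python
from html.parser import HTMLParser


class RobotsMetaLocator(HTMLParser):
    """Records whether each <meta name="robots"> tag is inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.locations = []  # one entry per robots meta tag found

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "head":
            self.in_head = True
        elif tag == "meta":
            name = dict(attrs).get("name") or ""
            if name.lower() == "robots":
                self.locations.append("head" if self.in_head else "outside head")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def robots_meta_locations(html: str) -> list:
    """Return "head" or "outside head" for every robots meta tag in html."""
    parser = RobotsMetaLocator()
    parser.feed(html)
    parser.close()
    return parser.locations
```

For the output shown above, where the tag lands inside the header div in the body, this sketch would report "outside head".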

                 

                This is what I did to get the meta tag in the right place:

                 

                1. Publish the help to a designated folder (as usual).
                2. In RH, select Edit -> Find and Replace in Files.
                3. Specify </head> in the Find what field.
                4. Specify <meta name="robots" content="noindex, nofollow"/> </head> in the Replace with field.
                5. Specify the folder with the published webhelp output in the Look in field.
                6. Select Text file types (*.htm ; *.html ; *.txt) in the Files of type field.
                7. Check the Include Subfolders option.
                8. Click Find Next, and then Replace All.
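The same Find/Replace could also be scripted so that it can be repeated after every publish. This is only a sketch under stated assumptions: the function names `add_robots_meta` and `tag_output_folder` are made up for illustration, the output files are assumed to use UTF-8 or another ASCII-compatible encoding, and the script should be run on a copy of the output first.

```python
from pathlib import Path

# The tag from the steps above, as bytes so file encodings are left untouched.
META_TAG = b'<meta name="robots" content="noindex, nofollow" />'


def add_robots_meta(html: bytes) -> bytes:
    """Insert META_TAG just before the first </head>, if not already there."""
    lower = html.lower()
    idx = lower.find(b"</head>")
    if idx == -1:
        return html  # no head section; leave the file alone
    if b'name="robots"' in lower[:idx]:
        return html  # head already has a robots tag; avoid duplicates
    return html[:idx] + META_TAG + b"\n" + html[idx:]


def tag_output_folder(folder: str) -> int:
    """Apply add_robots_meta to every .htm/.html file under folder.

    Returns the number of files that were changed.
    """
    changed = 0
    for path in Path(folder).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        original = path.read_bytes()
        updated = add_robots_meta(original)
        if updated != original:
            path.write_bytes(updated)
            changed += 1
    return changed
```

Calling `tag_output_folder` with the path of the published output folder (the equivalent of the Look in field with Include Subfolders checked) would process all subfolders. Running it a second time should change nothing, because files whose head already contains a robots tag are skipped.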

                 

                I chose to do the Find/Replace at the top level, that is, on the folder that contains all of the output (the resource folder, whdata folder, whgdata folder, etc.). This means that the meta tag is in all of the .htm files, not just the ones with the topic content. I don't think there's any harm in that.

                 

                Now I need to get the meta tag into the head section of the .htm files of the responsive HTML5 output from FrameMaker. Any thoughts on that?

                • 5. Re: Prevent search engines from crawling help output
                  Amebr Level 4

                  Yeah, as I said, not in the head, but I don't know enough about the web side to know how much of a problem that is/isn't.

                   

                  You can add the code into the screen layout although that can be a little hairy. It would need to go into every .slp file I believe. Willam might be able to offer more advice.

                  • 6. Re: Prevent search engines from crawling help output
                    deborahs68858966 Level 1

                    Thanks for the suggestions! Nice to have a place to knock ideas around.

                     

                    I put the meta tag into the head area of the Screen Layout for topics (Topic.slp). In RH HTML view, the tag is in the correct place. When I open Topic.slp in Notepad, it's in the correct place. But, when I generate the webhelp, it is inserted in the body as:

                     

                    <div style="width: 100%; position: relative;" id="header">

                      <p>&#160;</p>

                    <meta name="robots" content="noindex, nofollow" />

                    </div>

                     

                    Perhaps Willam van Weelden will have another idea.

                    • 7. Re: Prevent search engines from crawling help output
                      Amebr Level 4

                      Ah oops. I missed the bit about webhelp. The screen layouts are for Multiscreen or Responsive HTML5 output so updating them won't result in a change in webhelp. What you are seeing would be the code you added to the master page before.

                       

                      I don't know if you can update the webhelp skin in the same way as the screen layouts, sorry.

                      • 8. Re: Prevent search engines from crawling help output
                        Willam van Weelden Adobe Community Professional & MVP

                        The masterpage header won't work for this. Personally, I would also do a find and replace in the output. That's the fastest way.

                         

                        Just remember that search engines not indexing your site based on meta tags is a courtesy; the tags don't block bots. Only the nice guys such as Google will listen. Not even a robots.txt will block crawlers (for example, see "Learn about robots.txt files" in Search Console Help). If you really don't want unauthorised access, you have to force authentication on your server.
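To illustrate why robots.txt is only a courtesy: the check is something the crawler runs on itself, and nothing on the server enforces it. The sketch below uses Python's standard `urllib.robotparser` to show what a well-behaved crawler would do; the helper name `polite_crawler_may_fetch`, the sample rules, and the example.com URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def polite_crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a crawler that honors robots.txt would decide for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks first and then skips the page.
allowed = polite_crawler_may_fetch(
    ROBOTS_TXT, "Googlebot", "https://example.com/help/index.htm"
)
```

A crawler that never performs this check can still request the page and receive it, which is why authentication is the only real barrier.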

                        • 9. Re: Prevent search engines from crawling help output
                          mnx2 Level 1

                          Thanks! This helps confirm that my company needs to work on enforcing authentication, which we're trying to do.