Remember your robots.txt
As part of our website creation process, we use one or more servers for development. This keeps the live site running while we do our work and gives us a dev website on the Internet for testing purposes.
Because the development site is also on the web, it needs its own web address (e.g. dev.yourcompany.com). Many adjustments are needed in the configuration file to accommodate the development URL, especially when testing third-party connections. For example, ecommerce gateways don't like it when credit card information comes from a URL other than the one they expect.
Often, the robots.txt file gets lost in the shuffle. This file tells search engine crawlers which pages of a website they may crawl, and therefore which pages are eligible to be indexed and displayed in search results. It is an important element of a live website and, it turns out, an important element of your development site.
Naysayers will argue that this is not very important: a search engine needs to find the site before it can index it, which means there must be an incoming link, and there should be no incoming links to the dev site.
Mostly true, but consider this scenario. A website development project can often take months to complete. A marketing executive is giving a presentation and wants to showcase his new (under development) website, so he takes a screenshot and adds a link to his PowerPoint presentation. Of course, this presentation is provided to the conference organizers, who put it online. And now we're off to the races.
A search engine crawls the conference site and follows the link to the dev site. It looks for robots.txt and follows its instructions; if robots.txt is missing, it treats the whole site as open for indexing. In short order, the entire new site is indexed and being returned in searches.
Because the robots.txt file needs to be present on dev and different from its counterpart on live, we create a special development version and add a conditional pointer to the Apache configuration file.
First, the simple job of creating the "index nothing" robots.txt file:
User-agent: *
Disallow: /
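Before relying on these two lines, it can be reassuring to see how a rule-following crawler would read them. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name allowed_paths, the agent name ExampleBot, and the sample paths are illustrative assumptions, and an empty rule set stands in for a missing robots.txt.

```python
from urllib import robotparser

# The "index nothing" rules from the article.
DEVEL_ROBOTS = "User-agent: *\nDisallow: /\n"

# Hypothetical paths, chosen only for illustration.
SAMPLE_PATHS = ["/", "/products/widget.html", "/checkout"]


def allowed_paths(robots_text, paths, agent="ExampleBot"):
    """Return the paths a rule-following crawler would be allowed to fetch."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [path for path in paths if parser.can_fetch(agent, path)]


if __name__ == "__main__":
    # With the devel rules, nothing is allowed.
    print("devel rules:", allowed_paths(DEVEL_ROBOTS, SAMPLE_PATHS))
    # With no rules at all, everything is allowed.
    print("no rules:   ", allowed_paths("", SAMPLE_PATHS))
```

With the devel rules the list of allowed paths comes back empty, while with no rules every sample path is allowed. This only models crawlers that honor robots.txt; it says nothing about ones that ignore it.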
We save the "index nothing" file in a centralized place and call it devel-robots.txt. Then we add the following to our base custom-rewrites.conf file, which is used whenever we create a dev site:
<IfDefine DEVEL>
    Alias /robots.txt /centralized-place/devel-robots.txt
</IfDefine>
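The <IfDefine DEVEL> wrapper is what makes this safe when the live and development sites share configuration files: the alias only takes effect when Apache is started with the DEVEL parameter defined (for example, with -D DEVEL), so the live site keeps serving its own robots.txt.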
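Once the alias is in place, it is worth confirming that the dev site really serves the development file. Here is a rough sketch of such a check, assuming Python is available on the machine doing the testing; the function names directives, fetch_robots, serves_expected_rules, and check_site are hypothetical, and the comparison ignores comments, blank lines, and the case of field names.

```python
from urllib.request import urlopen


def directives(robots_text):
    """Reduce robots.txt text to (field, value) pairs, skipping comments and blanks."""
    pairs = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


def fetch_robots(base_url, timeout=10):
    """Download /robots.txt from the given site and return it as text."""
    url = base_url.rstrip("/") + "/robots.txt"
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def serves_expected_rules(served_text, expected_text):
    """True when both texts contain the same directives in the same order."""
    return directives(served_text) == directives(expected_text)


def check_site(base_url, expected_path):
    """Compare the robots.txt served at base_url with a local reference file."""
    with open(expected_path, encoding="utf-8") as handle:
        expected_text = handle.read()
    return serves_expected_rules(fetch_robots(base_url), expected_text)
```

For the example names used in this article, the call would look like check_site("https://dev.yourcompany.com", "/centralized-place/devel-robots.txt"). Running the same check against the live address should return False, since live is supposed to serve a different file.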