It’s easy to overlook the importance of robots.txt. However, this small text file is a critical component of your site’s technical SEO. Why? Because a robots.txt file is effectively a roadmap for search engine bots, telling them which pages they can, and cannot, access on your website.
What Is Robots.txt?
So, what is a robots.txt file, and why is it important? In essence, a robots.txt file supplies search engine bots with a set of instructions. The bots, such as web crawlers, read these instructions and manage their activities accordingly. The instructions typically boil down to one of two things: a bot either can or cannot crawl a given part of your website.
The bots themselves are automated computer programs, which means they follow the rules you set. Now you might be wondering: why do you need to manage the activities of web crawlers at all? One reason is to avoid indexing pages that aren’t intended for public view; another is to ensure your web server isn’t being overtaxed by crawl requests.
Below is a robots.txt example:
User-agent: *
Disallow: /*__*
Sitemap: https://www.yourwebsite.com/sitemap_index.xml
In plain terms, this example tells every bot (the asterisk wildcard) not to crawl any URL whose path contains a double underscore, and points them to the site’s XML sitemap.
How Does a Robots.txt File Function?
Now that you know what robots.txt is, the next step is understanding how it functions. As you can likely gather from the .txt extension, robots.txt is simply a text file with no HTML code involved. Nevertheless, a robots.txt file is still hosted like any other file on a web server.
If you don’t believe us, simply type in a website’s homepage URL and append “/robots.txt” to the end. Try it out for yourself; you can take a look at our robots.txt file here: https://www.clickintelligence.co.uk/robots.txt
Even though it’s there for everyone to see and access, users won’t stumble upon your website’s robots.txt file. Why? The answer is simple: there’s no reason to link to it on your site (unless you’re doing a demonstration as we did above!). Nevertheless, this file is often the first stop for crawler bots before they crawl the rest of your site’s content.
Crawler bots do this so they can follow your instructions and know which pages they may access. Note that if the robots.txt file contains contradictory rules, the bot follows the most specific (longest matching) one. Also, each subdomain requires its own robots.txt file.
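To illustrate that last point with a hypothetical set-up (the blog subdomain is invented for the example), a site with a blog on its own subdomain would need two separate files, each served from the root of its own host:

https://www.yourwebsite.com/robots.txt
https://blog.yourwebsite.com/robots.txt

The first file only governs crawling on www.yourwebsite.com; it has no effect on the blog subdomain, and vice versa.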
What Is a User Agent?
In the example of a robots.txt file above, you may have been confused by the part that says “User-agent: *”.
In a robots.txt file, the user-agent line allows you to aim instructions at specific agents, aka specific bots. As an example, you could decide to have a certain webpage appear in Bing’s search results, but not in Google’s. In this case, you can set a disallow command under “User-agent: Googlebot”, and the opposite under “User-agent: Bingbot”.
If the file uses the asterisk option (“User-agent: *”), the instructions that follow apply to all bots, not just one specific crawler you’ve singled out.
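As a rough, hypothetical sketch of that Google/Bing scenario (the page path is invented purely for illustration), the file could contain two groups:

User-agent: Googlebot
Disallow: /bing-only-page/

User-agent: Bingbot
Allow: /bing-only-page/

The first group tells Google’s crawler to stay away from that path, while the second explicitly permits Bing’s crawler to fetch it.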
How Does the Disallow Command Work?
When it comes to the Robots Exclusion Protocol (REP), one of the most prevalent commands is “Disallow”. In a robots.txt file, this directive tells bots that they shouldn’t access the specific webpage, or collection of webpages, listed after the command.
Is a disallowed page hidden completely? Not necessarily. A disallow rule only stops compliant bots from crawling the page; users who know how to navigate to it on your site can still view it, and its URL can even appear in search results if other sites link to it. More often, you’ll simply decide you don’t want a webpage crawled because it isn’t useful for your audience.
With the robots.txt disallow directive, there are various options available. The most common include the following (hypothetical examples of each are sketched after the list):
- Single webpage block: The disallow directive allows you to block a single webpage. For instance, this could be a blog post or an “about us” page.
- Full website block: If you don’t want your website to appear in search results, you can disallow the entire site. This is achieved with a “Disallow: /” command, as the “/” represents the root of your site and therefore every URL beneath it.
- Full access: What if you don’t want to disallow any of your web pages? In this situation, you can use an empty “Disallow:” command. That lets the bots know they can crawl your whole website.
- Block a directory: Rather than disallowing your full site, you can block individual directories. This is a more efficient approach when you want to block numerous pages in one go.
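As a rough illustration, here’s how each of those options might look. Treat these as alternative, hypothetical snippets rather than one single file, since some of the rules would contradict each other:

# Block a single page for all bots
User-agent: *
Disallow: /example-blog-post/

# Block the entire website
User-agent: *
Disallow: /

# Allow full access (nothing is disallowed)
User-agent: *
Disallow:

# Block a whole directory
User-agent: *
Disallow: /example-directory/

The paths are invented for the example; swap in the real pages or directories you want to block.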
What Other Commands Are There?
Even though it’s by far the most frequently used command, this doesn’t mean disallow is all alone. Here are two other commands you’ll commonly see in a robots.txt file:
Allow
Not much explanation is required with this command. Rather than disallowing a page from being available to bots, you can “Allow” a specific webpage or directory to be accessed. This is useful if you’re, for example, disallowing an entire directory, but you want to allow the bots to access a particular webpage within said directory.
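As a hypothetical sketch of that scenario (the directory and page names are invented), the two rules would sit together like this:

User-agent: *
Disallow: /example-directory/
Allow: /example-directory/public-page.html

The whole directory is off limits, but the single allowed page within it can still be crawled, because the more specific allow rule wins.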
Crawl-Delay
The crawl-delay command helps prevent bots from putting too much strain on your server. The delay makes a bot wait a specified number of seconds between requests. If you want bots to wait 12 seconds, this is how to enter the command:
Crawl-delay: 12
Note that while some other search engines recognise the crawl-delay directive, Google doesn’t. However, if required, you can alter the crawl frequency within Google Search Console.
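Crawl-delay sits inside a user-agent group like any other rule, so in practice the directive might look something like this (Bingbot is used purely as an example of a crawler that honours the setting):

User-agent: Bingbot
Crawl-delay: 12

Crawlers that don’t support the directive will simply ignore the line.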
Using the Robots.txt Tester
Do you want to ensure your robots.txt file is functioning as expected? The good news is that validating its crawl directives is an easy task, thanks to the robots.txt tester tool provided by none other than Google itself. You can find the robots.txt tester within Google Search Console, and Google also provides a handy rundown of what to do when utilising it.
As you might expect, there is a notable limitation with this tool: it can only test your robots.txt file against Google’s own web crawlers, as Google “cannot predict how other web crawlers interpret” the file.
Need Help?
Those with experience and a decent amount of knowledge should find using a robots.txt file and the robots.txt tester tool fairly simple, especially when the aim is to check SEO performance. However, not everyone’s an expert, and that’s okay. If you find your mind boggled by the process, particularly when it comes to wider SEO strategy, don’t panic; our gurus at Click Intelligence are here to help. Simply get in touch for assistance!