Robots.txt files explained

Team TypeStack

What is robots.txt?

A robots.txt file is a plain text file that contains rules telling search engine crawlers, such as Googlebot, Bingbot, and YandexBot, how to interact with your website. If a website has a robots.txt file, you can view its contents by going to the domain and adding "/robots.txt" to the end of the URL. Typically, you use a robots.txt file to keep search engines from crawling duplicate content on your website, which is common on eCommerce sites. If you don't need to restrict any part of your website, you don't need to worry about it.

Still, it's good practice for a website to have a robots.txt file, especially if you're trying to get your content to rank on search engines. If your site doesn't have one, crawlers generally assume they are allowed to crawl everything, so the file is how you tell them otherwise.

Why do you need a robots.txt file?

First, on eCommerce sites that let visitors search for products or filter them by category or attribute, every search, filter, category, or attribute creates additional pages on your website. Crawling them can use up a lot of your crawl budget, which means search engines may miss important pages because they're busy crawling pages that aren't as important.

A good example is Ikea. They are one of the largest furniture stores in the world, and there are tons of products on their website. If you look at their robots.txt file, you will find that they disallow search engines from crawling their filter and sorting pages, because there are so many of them.

Second, you can prevent search engines from crawling certain files on your website, such as images or PDFs. This is useful when those files serve as lead magnets, where you want to capture people's contact information before giving out the document. You definitely do not want people to be able to find that document through search engines.

Third, you can keep crawlers out of certain parts of your website by disallowing either a file path or URL parameters. Keep in mind that robots.txt is itself publicly readable and only asks crawlers to stay away, so it is not a substitute for password protection.

Fourth, you can specify a crawl delay to prevent your servers from being overloaded when crawlers load many pages of your website at once.

Finally, it is good practice to specify the location of your sitemaps in the robots.txt file so that crawlers can find them easily.

Robots.txt files - Language

Now that you know what we can do with the robots.txt file, let's understand the language of the search engine crawlers. The language, in technical terms, is called the robots.txt syntax.

The first thing you need to know is "User-agent:". This directive is used to call out specific search engine crawlers. When a search engine crawler finds your website, the first thing it does is look for your robots.txt file in the root folder of your website. It scans the text file to see if it is being called out. If it is, it reads the parts that apply to it.
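As a rough illustration of how a crawler picks the group that calls it out, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules, the `example.com` URLs, and the `can_crawl` helper are made up for this example and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Googlebot, one for every other crawler.
RULES = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /search/
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if the rules above let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot reads only its own group, so /search/ stays open to it.
googlebot_search = can_crawl("Googlebot", "https://example.com/search/chairs")
# Bingbot is not named, so it falls back to the "*" group.
bingbot_search = can_crawl("Bingbot", "https://example.com/search/chairs")
```

A crawler that finds a group with its own name follows that group and skips the "*" group, which is why the shared "/drafts/" rule has to be repeated in both groups.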

Next is the "Disallow:" rule, which tells the user agent not to crawl certain parts of the website. You can only put one path on each "Disallow:" line, which is why the Ikea robots.txt has so many disallow rules. The "Allow:" rule lets a crawler access a page or subfolder even though its parent page or subfolder is disallowed. Googlebot supports it, and so do most other major crawlers.

For example, in Rank Math's robots.txt file, we disallow all search engines from crawling the folder called "/wp-admin/", which sits in the root folder of the site.

However, we want to allow search engines to crawl one particular file within the parent folder we have disallowed, so that file gets its own "Allow:" rule.

The "Crawl-delay:" rule tells the crawler to wait at your site's doorstep for a number of seconds before loading and crawling pages of your site. Not every crawler honors it; Googlebot, for one, ignores this rule.
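The "/wp-admin/" exception described above can be sketched with Python's standard-library `urllib.robotparser`. The file name `allowed-file.php`, the `example.com` domain, and the `is_allowed` helper are placeholders invented for this example, not the actual contents of anyone's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; "allowed-file.php" is a placeholder file name.
# The Allow line comes first because Python's parser applies the first
# rule that matches, whereas Google applies the most specific rule.
RULES = """\
User-agent: *
Allow: /wp-admin/allowed-file.php
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(path: str) -> bool:
    """Check a path on the hypothetical site against the rules above."""
    return parser.can_fetch("AnyBot", "https://example.com" + path)


exception_ok = is_allowed("/wp-admin/allowed-file.php")  # the one exception
folder_blocked = is_allowed("/wp-admin/options.php")     # rest of the folder
```

Because parsers differ in how they resolve overlapping rules, listing the narrower "Allow:" line before the broader "Disallow:" line is a cautious way to get the same result from both kinds.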

Next, there is "Sitemap:", which tells search engine crawlers where your XML sitemap is located. Then there is "/", the file path separator. Used on its own, as in "Disallow: /", it means the entire website.
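To see how a program reads the crawl delay and sitemap location back out of a file, here is a small sketch with Python's standard-library `urllib.robotparser`. The rules and the `example.com` sitemap URL are invented for illustration, and `site_maps()` assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
RULES = """\
User-agent: *
Crawl-delay: 5
Disallow: /cart/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Seconds to wait between requests, or None if the group sets no delay.
delay = parser.crawl_delay("AnyBot")
# List of sitemap URLs, or None if the file names no sitemap.
sitemaps = parser.site_maps()
```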

Next we have "*", the wildcard that represents any sequence of characters. In other words, it stands for everything that could appear at that position in the URL.

For example, the Ikea robots.txt file disallows search engines from crawling every URL with a parameter that contains "filter", along with everything that comes after it. If you leave the "*" as a standalone character, it means everything. For example, "User-agent: *" calls out all user agents.

Next, everything that comes after the hash sign ("#") on a line is treated as a comment.

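And finally, the "$" marks the end of a URL, so a rule that ends in "$" only matches URLs that end exactly there.

For example, "Disallow: /solutions/$" blocks only the URL that ends at "/solutions/". Without the "$", "Disallow: /solutions/" blocks that path and every URL slug that comes after it.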
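The "*" and "$" characters can be sketched as a small matcher. Python's standard-library robots.txt parser does not expand wildcards in paths, so the `pattern_matches` helper below is a hand-written approximation of how crawlers such as Googlebot treat them, and the patterns in the usage lines are hypothetical rather than copied from any real site.

```python
import re


def pattern_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    "*" stands for any sequence of characters. A trailing "$" pins the
    match to the end of the URL; without it the pattern is a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn each "*" into ".*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None


# Wildcard: any URL whose query string starts with "filter=".
filter_hit = pattern_matches("/*?filter=", "/sofas?filter=red&sort=price")
# Prefix rule: matches the folder and everything below it.
prefix_hit = pattern_matches("/solutions/", "/solutions/pricing")
# "$" rule: matches only the URL that ends exactly at "/solutions/".
anchored_hit = pattern_matches("/solutions/$", "/solutions/pricing")
```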

A robots.txt file should be added to the website's top-level directory. As mentioned earlier, when a search engine crawler finds your website, the first thing it looks for is the robots.txt file in the root folder of your website. In the file manager of your web host, the robots.txt file should sit under your home directory, inside the "public_html" folder. If you have subdomains, each one should have its own folder named after that site, and that's where you should add its robots.txt file.
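Since crawlers only look in the root of each host, the address of the file that governs any page can be worked out from the page URL alone. The helper name `robots_txt_url` and the `example.com` addresses below are assumptions made for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain); drop the rest.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_txt_url("https://example.com/blog/post?x=1")
subdomain = robots_txt_url("https://shop.example.com/cart")
```

The two results differ, which reflects that a subdomain is governed by its own robots.txt file and not by the one on the main domain.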

A robots.txt file can have more than one group, and each group can consist of different rules. Each group has to begin with a "User-agent:" followed by the rules for that crawler when they are visiting your website. Every rule needs to be written in a separate line. You should not write all rules on one line, nor should you break up one directive into several lines. 

By default, we can assume that a user agent can crawl any page on your website unless you specifically ask them not to do so by adding a "Disallow:" rule followed by the file path or URL parameters. In general, everything is allowed. That's the reason web crawlers exist.

If you want to call out more crawlers, add another "User-agent:" line followed by its rules, and finish with the "Sitemap:" line. That's the structure of a robots.txt file.
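The structure described above can be sketched as a small generator that writes one group per user agent and puts the sitemap lines last. The `build_robots_txt` helper, the rules, and the `example.com` sitemap URL are all hypothetical, and the type hints assume Python 3.9 or newer.

```python
def build_robots_txt(groups: dict[str, list[str]], sitemaps: list[str]) -> str:
    """Assemble robots.txt text: one group per user agent, sitemaps last."""
    lines: list[str] = []
    for user_agent, rules in groups.items():
        lines.append(f"User-agent: {user_agent}")  # each group starts here
        lines.extend(rules)                        # one rule per line
        lines.append("")                           # blank line between groups
    lines.extend(f"Sitemap: {url}" for url in sitemaps)
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    {
        "*": ["Disallow: /wp-admin/"],
        "Bingbot": ["Disallow: /search/", "Crawl-delay: 5"],
    },
    ["https://example.com/sitemap.xml"],
)
print(text)
```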

The best way to discover errors in your robots.txt is the robots.txt testing tool by Google. First, make sure that you're logged into the Google profile that manages the Search Console for your website, then select the property, and you will see the robots.txt file of your website. Copy the contents of your robots.txt file, paste them into the tester, and hit Submit. You can then ask Google to update its information by clicking Submit. Once done, refresh the page and you should see the changes.

If you have made a mistake, for example a missing colon, you will see a warning message on the line where you made the mistake. This is a great way to find errors in the syntax of your robots.txt.
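For a quick local check before you reach for Google's tool, the same kind of per-line warning can be approximated in a few lines. This is only a rough sketch and not how Google's tester works; the `find_syntax_problems` helper and its short list of known directives are assumptions made for this example.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "crawl-delay", "sitemap"}


def find_syntax_problems(text: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for lines that look malformed."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        directive = line.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive {directive!r}"))
    return problems


sample = "User-agent: *\nDisallow /private/\nDisalow: /tmp/\n# a comment\n"
problems = find_syntax_problems(sample)
```

With the sample above, the helper flags line 2 for the missing colon and line 3 for the misspelled directive.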

It can seem intimidating at first but it becomes rather easy if you take some time to understand it. All you need is just to understand the language search engines are speaking.