robots.txt: advanced tips
You’ve mastered the basics of the robots.txt file, but it’s time to dig a little deeper! A robots.txt file is mainly used to guide search engine robots, with rules that block or allow access to certain parts of your site.
Although the easiest way to use the robots.txt file is to block robots from entire directories, several advanced features give you more precise control over how your site is crawled and indexed.
Here are five tips for those who want to take their bot management a little further.
The “Crawl-delay” directive
Suppose you operate a large website with a high frequency of updates; let’s say it’s a news site. Every day, you post dozens of new articles to your home page. Because of the constant stream of updates, search engine robots are crawling your site continuously, putting a heavy load on your servers.
The robots.txt file gives you an easy way to address this: the “Crawl-delay” directive. This directive instructs robots to wait a certain number of seconds between requests. For example:
User-agent: Bingbot
Crawl-delay: 10
One advantage of this directive is that it lets you cap the number of URLs crawled per day on large sites. If you set your crawl delay to 10 seconds, as in the example above, Bingbot would crawl a maximum of 8,640 pages per day (60 seconds x 60 minutes x 24 hours / 10 seconds = 8,640). Unfortunately, not all search engines (or robots in general) honor this directive; the most notable holdout is Google, whose crawlers ignore “Crawl-delay” entirely.
Filter a character string
Wildcard filtering allows you to match patterns of characters within URLs.
This can be very useful, especially when you need to control bots’ access to certain types of files or URL patterns. It allows finer control than blocking entire directories, and saves you from having to list each URL you want to block individually.
The simplest form uses the wildcard character (*). For example, the following directive blocks all subdirectories beginning with “private” for the Google bot:
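A minimal sketch (the “private” folder name is just an illustration):

User-agent: Googlebot
Disallow: /private*/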
You can match the end of a string using the dollar sign ($). The following, for example, would block all URLs ending in “.asp”:
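For example, applied here to all bots:

User-agent: *
Disallow: /*.asp$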
Another example: to block all URLs that contain the question mark character (?), use the following rule:
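For instance:

User-agent: *
Disallow: /*?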
You can also use this technique to block robots from specific file types; in this example, .gif files:
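A sketch:

User-agent: *
Disallow: /*.gif$

Here the asterisk matches any run of characters in the path, and the dollar sign anchors the match to the end of the URL.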
The “Allow” directive
If you’ve read this far, you probably know the “disallow” directive. A lesser-known directive is “allow”. As you can guess, the “allow” directive works in the opposite way to the “disallow” directive: instead of blocking robots, you specify the paths that designated robots can access.
This can be useful in a number of cases. For example, let’s say you have banned an entire section of your site, but still want robots to crawl a specific page in that section.
In the following example, the Googlebot is only authorized to access the website’s “google” directory:
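Something along these lines (paths illustrative):

User-agent: Googlebot
Disallow: /
Allow: /google/

The blanket “Disallow: /” rule blocks everything, and the more specific “Allow” line carves out an exception for the “google” directory.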
The “Noindex” directive
Unlike the “disallow” directive, the “noindex” directive does not prevent search engine robots from visiting your site. It does, however, prevent search engines from indexing your pages.
Good to know: it will also remove already-indexed pages from the index. This has obvious advantages, for example when a page containing sensitive information needs to disappear from search engine results pages.
Note that “noindex” in robots.txt was only ever unofficially supported by Google, and never by Bing; Google announced that it would stop honoring it as of September 1, 2019.
You can combine the “disallow” and “noindex” directives to prevent pages from being both crawled and indexed by robots:
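For example (the page path is hypothetical):

User-agent: *
Disallow: /sensitive-page/
Noindex: /sensitive-page/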
Declare your XML sitemap
Another essential tool for optimizing your site is the XML sitemap, especially if you want search engine robots to find and index your pages!
For a bot to discover all of your pages, it helps enormously if it can find your XML sitemap first.
To ensure that search engine robots find your XML sitemap, you can add its location to your robots.txt file:
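For example (replace the URL with your own sitemap location):

Sitemap: https://www.example.com/sitemap.xml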