robots.txt – stop sign for search engine bots?


Search engine bots (also known as robots, spiders or user agents) crawl the web every day looking for new content. Their mission is to analyze and index websites. Before the crawlers begin their work, however, they first check the robots.txt file. The underlying set of rules, the so-called Robots Exclusion Standard, was first published in 1994 and regulates the behavior of search engine bots on websites.

Unless otherwise stated, the bots can crawl your website unhindered. Creating a robots.txt file can help keep certain pages or individual elements out of the crawlers' view. In this article you will learn in which cases it makes sense to create a robots.txt file and what you should pay attention to when generating and testing it.

When a search engine bot reaches your website, it aims to crawl as many pages and as much content as possible. With the correct instructions in the robots.txt file, the search bots can be told which content is relevant. Sensitive data can be protected, non-public directories can be excluded or test environments can be temporarily hidden.

One thing up front: there is no guarantee that crawlers will adhere to the prohibitions defined in the robots.txt file! The instructions are only guidelines and cannot enforce any particular behavior on the part of the crawler. Hackers and scrapers in particular will not be stopped by the robots.txt file.

Nevertheless, experience shows that at least the best-known and most popular search engines such as Google or Bing adhere to the regulations. You can read a detailed article on how crawling works in our blog post “Crawling – the spider on the move on your website”.

Moreover, blocking in the robots.txt file is not necessarily the method of choice if indexing by search engines is to be prevented. In particular, pages that are well linked from elsewhere can still appear in the results lists even though they are blocked. In this case, the affected pages or files should instead be protected with the meta robots tag “noindex”.
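For reference, the noindex instruction is a standard HTML tag that is placed in the <head> section of the page that should stay out of the index, for example:

<meta name="robots" content="noindex">

Note that the crawler must be able to access the page in order to read this tag, so the same page should not also be blocked in robots.txt.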

The robots.txt file is always located in the root directory of your website. It can be called up as follows: enter the URL of the website in the browser address bar and add /robots.txt at the end of the domain. If no file is available yet, there are various options for correctly creating and testing the robots.txt file.
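With a placeholder domain (replace example.com with your own), the call looks like this:

https://www.example.com/robots.txt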

 

Create robots.txt – the right syntax is important

There are now numerous free tools and generators on the web that can be used to create the robots.txt file automatically. If you prefer to do without a generator, you can create the file yourself in a plain text editor or with the help of the Google Search Console. The Google Search Console can also be used afterwards to test whether the robots.txt file was created correctly.

Each robots.txt file consists of one or more so-called “records”. Every record is made up of two parts, and the correct syntax must be used for the defined rules to take effect.

First, the respective crawler is addressed with the “User-agent” directive. In the second part, the rules for that bot are introduced with the “Disallow” instruction. If a page or an element is not blocked via Disallow, the user agent will crawl all content by default.

The basic scheme of a record looks like this:

User-agent: *

Disallow:

Another option is the “Allow” directive, which, in contrast to Disallow, explicitly permits crawling:

User-agent: *

Allow: /

These instructions allow all crawlers to access all resources. The asterisk (wildcard) stands for every crawler. All instructions are processed from top to bottom. When creating the rules, keep in mind that paths are case-sensitive. In the following example, Googlebot is prohibited from accessing only a single subpage of the website:

User-agent: Googlebot

Disallow: /unterseite.html
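Because paths are case-sensitive, a variant with a capitalized file name (used here purely for illustration) would not block /unterseite.html:

Disallow: /Unterseite.html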

Each “User-agent” line addresses only one search engine bot at a time. If different bots are to receive different rules, an additional block is required in the robots.txt file. It follows the same basic scheme and is separated from the previous block by a blank line:

User-agent: Bingbot

Disallow: /directory1/

User-agent: Googlebot

Disallow: /shop/

In this example, Bingbot is not allowed to access /directory1/, while Googlebot is prohibited from crawling the shop of the site. Alternatively, if several bots should receive the same instructions, their User-agent lines can be noted directly below one another:

User-agent: Bingbot

User-agent: Googlebot

Disallow: /shop/

 

Overview of the most famous search engine crawlers:

Googlebot (Google)

Bingbot (Microsoft Bing)

Slurp (Yahoo)

DuckDuckBot (DuckDuckGo)

Baiduspider (Baidu)

YandexBot (Yandex)

Exclude website content correctly

As the last examples show, specifications for files and directories begin with a slash “/” after the domain, followed by the path. A Disallow rule works as a prefix match: with a slash at the end of the directory name, the directory and everything inside it, including its subdirectories, is blocked. If you leave out the slash at the end, the rule additionally matches everything whose path merely begins with the same string, for example similarly named files or directories.
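As a short sketch, assume a hypothetical /private directory (the path is only an example):

User-agent: *

Disallow: /private/

This blocks /private/ and everything below it, such as /private/docs/. Written without the trailing slash,

Disallow: /private

would additionally block paths like /private.html or /private-archive/.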

The original protocol did not provide a way to explicitly allow individual pages or elements. However, the major crawlers also understand Allow, which can be used to release subdirectories or files within otherwise blocked directories:

User-agent: *

Disallow: /images/

Allow: /images/public/

To block certain file types (e.g. PDFs or images), it is recommended to add a “$” to the end of the pattern. This signals that the URL must end at this point and no further characters may follow:

User-agent: *

Disallow: /*.gif$

It is also recommended to include a reference to the sitemap in the robots.txt file. You can read why it makes sense to use a sitemap for your website and how to set it up correctly in our blog post “The perfect sitemap”. A simple additional line is sufficient to add the sitemap:

User-agent: *

Disallow:

Sitemap: https://NameWebseite.com/sitemap.xml

 

Test the robots.txt file correctly

After creation, the robots.txt file should be tested, because even the smallest errors can lead to it being disregarded by the crawlers. The Google Search Console can be used to check whether the syntax of the file is correct. The robots.txt tester evaluates your file exactly as the Googlebot would and shows whether the corresponding files have been properly blocked. All you have to do is enter the URL of one of your pages in the text field at the bottom of the tool. The tester can even be used before you upload the file to the root directory: simply paste the directives into the input mask.

Once no more error messages are displayed, you can upload the robots.txt file to the root directory of the website. At the time of writing, the robots.txt tester from the old version of the Google Search Console was used for this example; the tester is not yet available in the new version of the GSC.

 

Conclusion:

The robots.txt file provides the general framework for search engine bots and can help to hide pages or individual files from the crawlers. However, there is no guarantee that the rules will be observed. If you want to be on the safe side and prevent a page from being indexed, you should also use the “noindex” meta tag. When creating the robots.txt file, it is essential to use the correct syntax. Tools such as the Google Search Console can be used to check the file.

Author Bio – Vishal Garg has several years of experience in digital marketing. With expertise in advanced marketing and promotional strategies, he has helped numerous brands establish their online niche with his out-of-the-box internet marketing strategies and lead generation capabilities. Currently, he runs a successful digital marketing company in Jaipur.

 
