Robots.txt :: Technical SEO :: The.Domain.Name

Thanks to Varvy.com - https://varvy.com/robottxt.html

Guide to the robots.txt file

Updated: April 29th 2016

What is a robots.txt file

The robots.txt file is a simple text file placed on your web server that tells web crawlers like Googlebot whether they should access a file.

Basic robots.txt examples

Here are some common robots.txt setups (they will be explained in detail below).

Allow full access

User-agent: *
Disallow:

Block all access

User-agent: *
Disallow: /

Block one folder

User-agent: *
Disallow: /folder/

Block one file

User-agent: *
Disallow: /file.html
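
The four setups above can be tried out without touching a live site. The following is a minimal sketch using Python's standard urllib.robotparser module; the user agent name "ExampleBot" and the www.example.com URLs are placeholders chosen for illustration, not part of the original guide.

```python
from urllib.robotparser import RobotFileParser

# The four basic setups from this section, as lists of robots.txt lines.
SETUPS = {
    "allow full access": ["User-agent: *", "Disallow:"],
    "block all access": ["User-agent: *", "Disallow: /"],
    "block one folder": ["User-agent: *", "Disallow: /folder/"],
    "block one file": ["User-agent: *", "Disallow: /file.html"],
}

# Placeholder URLs used to probe each setup.
URLS = [
    "https://www.example.com/index.html",
    "https://www.example.com/folder/page.html",
    "https://www.example.com/file.html",
]


def allowed_urls(lines, user_agent="ExampleBot"):
    """Return the probe URLs that the given robots.txt lines allow."""
    parser = RobotFileParser()
    parser.parse(lines)
    return [url for url in URLS if parser.can_fetch(user_agent, url)]


for name, lines in SETUPS.items():
    print(name, "->", allowed_urls(lines))
```

Each setup should allow exactly the URLs you would expect from its description: everything, nothing, everything outside /folder/, and everything except /file.html.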

Why should you learn about robots.txt?

Improper usage of the robots.txt file can hurt your ranking
The robots.txt file controls how search engine spiders see and interact with your webpages
This file is mentioned in several of the Google guidelines
This file, and the bots it interacts with, are fundamental parts of how search engines work

Tip: To see if your robots.txt is blocking any important files used by Google, use the Google guidelines tool.

Search engine spiders

The first thing a search engine spider like Googlebot looks at when it is visiting a page is the robots.txt file.

It does this because it wants to know if it has permission to access that page or file. If the robots.txt file says it can enter, the search engine spider then continues on to the page files.

If you have instructions for a search engine robot, the robots.txt file is the way to give them.
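
The "check robots.txt first, then visit the page" order described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Googlebot is actually implemented: the PoliteCrawler class, the "ExampleBot" name, and the fake_loader function are all hypothetical, and the loader stands in for a real download of the site's robots.txt so the sketch runs offline.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteCrawler:
    """Consults a host's robots.txt rules before visiting any page on it."""

    def __init__(self, user_agent, load_robots_lines):
        self.user_agent = user_agent
        # load_robots_lines(host) returns the robots.txt lines for that host.
        # It is a hypothetical hook; a real crawler would download the file.
        self.load_robots_lines = load_robots_lines
        self.rules_by_host = {}

    def may_visit(self, url):
        host = urlsplit(url).netloc
        if host not in self.rules_by_host:
            # First contact with this host: read its rules once and keep them.
            parser = RobotFileParser()
            parser.parse(self.load_robots_lines(host))
            self.rules_by_host[host] = parser
        return self.rules_by_host[host].can_fetch(self.user_agent, url)


def fake_loader(host):
    # Stand-in for a network fetch of the host's robots.txt.
    return ["User-agent: *", "Disallow: /private/"]


crawler = PoliteCrawler("ExampleBot", fake_loader)
print(crawler.may_visit("https://www.example.com/index.html"))      # True
print(crawler.may_visit("https://www.example.com/private/a.html"))  # False
```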

Priorities for your website

There are three important things that any webmaster should do when it comes to the robots.txt file.

Determine if you have a robots.txt file
If you have one, make sure it is not harming your ranking or blocking content you don't want blocked
Determine if you need a robots.txt file

Determining if you have a robots.txt

You can check from any browser. The robots.txt file is always located in the same place on any website, so it is easy to determine if a site has one. Just add "/robots.txt" to the end of a domain name as shown below.

www.yourwebsite.com/robots.txt

If you have a file there, it is your robots.txt file. You will either find a file with words in it, find a file with no words in it, or not find a file at all.
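
The same check can be scripted. Below is a small Python sketch that builds the robots.txt address for a site and names the three outcomes described above. The function names are made up for this example, the default to HTTPS is an assumption, and the status handling is simplified: it only treats a 404 response as "no file" and ignores other error codes.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(site):
    """Return the robots.txt address for the site that `site` belongs to."""
    if "://" not in site:
        site = "https://" + site  # assumption: use HTTPS when no scheme is given
    parts = urlsplit(site)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def describe(status_code, body):
    """Name the outcome for an HTTP status code and response body."""
    if status_code == 404:
        return "no robots.txt file"
    if not body.strip():
        return "robots.txt file with no words in it"
    return "robots.txt file with words in it"


print(robots_url("www.yourwebsite.com"))
print(robots_url("https://www.yourwebsite.com/blog/post.html"))
print(describe(200, "User-agent: *\nDisallow:\n"))
```

Both calls to robots_url print https://www.yourwebsite.com/robots.txt, because the file always lives at the root of the domain no matter which page you start from.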

Determine if your robots.txt is blocking important files

You can use the Google guidelines tool, which will warn you if you are blocking certain page resources that Google needs to understand your pages.

If you have access and permission, you can use Google Search Console to test your robots.txt file (the tool is not public and requires a login).

To be sure your robots.txt file is not blocking anything you want crawled, you need to understand what it is saying. We cover that below.
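
If you keep a list of the files that matter for your pages (stylesheets, scripts, images), you can also run a rough check yourself. The sketch below uses Python's standard urllib.robotparser; the robots.txt content, the URL list, and the blocked_urls helper are invented for the example, and the standard-library parser may not match Googlebot's behavior in every edge case.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text blocks for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt that blocks a folder holding page resources.
robots_txt = """\
User-agent: *
Disallow: /assets/
"""

important = [
    "https://www.example.com/index.html",
    "https://www.example.com/assets/site.css",
    "https://www.example.com/assets/app.js",
]

for url in blocked_urls(robots_txt, "Googlebot", important):
    print("blocked:", url)
```

Here the stylesheet and the script are reported as blocked, which is the kind of result that deserves a second look.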

Do you need a robots.txt file?

You may not even need to have a robots.txt file on your site. In fact it is often the case you do not need one.

Reasons you may want to have a robots.txt file:

You have content you want blocked from search engines
You are using paid links or advertisements that need special instructions for robots
You want to fine tune access to your site from reputable robots
You are developing a site that is live, but you do not want search engines to index it yet
It helps you follow certain Google guidelines in some situations
You need some or all of the above, but do not have full access to your webserver and how it is configured

Each of the above situations can be controlled by other methods. However, the robots.txt file is a good central place to take care of them, and most webmasters have the ability and access required to create and use a robots.txt file.

Reasons you may not want to have a robots.txt file:

Not having one is simple and error free
You do not have any files you want or need to be blocked from search engines
You do not find yourself in any of the situations listed in the above reasons to have a robots.txt file

It is okay to not have a robots.txt file.

When you do not have a robots.txt file the search engine robots like Googlebot will have full access to your site. This is a normal and simple method that is very common.

How to make a robots.txt file

If you can type or copy and paste, you can also make a robots.txt file.

The file is just a text file, which means that you can use notepad or any other plain text editor to make one. You can also make them in a code editor. You can even "copy and paste" them.

Instead of thinking "I am making a robots.txt file", just think "I am writing a note". They are pretty much the same process.
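
If you would rather create the file from a script than from a text editor, a few lines of Python are enough. This is a minimal sketch: the write_robots_txt helper is made up for the example, and a temporary directory stands in for your real web root so that nothing on your server is touched.

```python
import tempfile
from pathlib import Path


def write_robots_txt(directory, lines):
    """Write the given lines to a robots.txt file inside `directory`."""
    path = Path(directory) / "robots.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# A temporary directory stands in for the web root in this example.
with tempfile.TemporaryDirectory() as web_root:
    path = write_robots_txt(web_root, ["User-agent: *", "Disallow:"])
    written = path.read_text(encoding="utf-8")

print(written)
```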

What should the robots.txt say?

That depends on what you want it to do.

All robots.txt instructions result in one of the following three outcomes:

Full allow: All content may be crawled.
Full disallow: No content may be crawled.
Conditional allow: The directives in the robots.txt determine the ability to crawl certain content.

Let's explain each one.

Full allow - all content may be crawled

Most people want robots to visit everything in their website. If this is the case with you, and you want the robot to index all parts of your site, there are three options to let the robots know that they are welcome.

1) Do not have a robots.txt file

If your website does not have a robots.txt file then this is what happens...

A robot like Googlebot comes to visit. It looks for the robots.txt file. It does not find it because it isn't there. The robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.

2) Make an empty file and call it robots.txt

If your website has a robots.txt file that has nothing in it then this is what happens...

A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. There is nothing to read, so the robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.

3) Make a file called robots.txt and write the following two lines in it...

User-agent: *
Disallow:

If your website has a robots.txt with these instructions in it then this is what happens...

A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. It reads the first line. Then it reads the second line. The robot then feels free to visit all your web pages and content because this is what you told it to do (I explain this below).
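
Options 2 and 3 can be compared directly. The sketch below feeds an empty file and the two-line file to Python's standard urllib.robotparser and asks whether a page may be crawled. The can_crawl helper, the "ExampleBot" name, and the URL are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, user_agent="ExampleBot"):
    """Report whether the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


url = "https://www.example.com/any/page.html"
empty_file = []                                 # option 2: nothing to read
explicit_allow = ["User-agent: *", "Disallow:"]  # option 3: the two lines above

print(can_crawl(empty_file, url))      # True
print(can_crawl(explicit_allow, url))  # True
```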

Full disallow - no content may be crawled

Warning: This means that Google and other search engines will not crawl your webpages, so their content will not be indexed or displayed in search results.

To block all reputable search engine spiders from your site you would have these instructions in your robots.txt:

User-agent: *
Disallow: /

It is not recommended to do this as it will result in none of your web pages being indexed.

The robots.txt instructions and their meanings

Here is an explanation of what the different words mean in a robots.txt file

User-agent

User-agent:

The "User-agent" part is there to specify directions to a specific robot if needed. There are two ways to use this in your file.

If you want to tell all robots the same thing, you put a "*" after "User-agent:". It would look like this...

User-agent: *

The above line is saying "these directions apply to all robots".

If you want to tell a specific robot something (in this example Googlebot) it would look like this...

User-agent: Googlebot

The above line is saying "these directions apply to just Googlebot".
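
Both forms can appear in the same file, each followed by its own directions. The sketch below, which uses Python's standard urllib.robotparser, shows a robot picking the directions addressed to it. The combined file, the "/private/" folder, and the "ExampleBot" name are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: Googlebot gets its own directions, everyone else is blocked.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://www.example.com/index.html"
googlebot_ok = parser.can_fetch("Googlebot", page)
other_ok = parser.can_fetch("ExampleBot", page)

print("Googlebot:", googlebot_ok)   # True, only /private/ is off limits for it
print("ExampleBot:", other_ok)      # False, the "*" directions block everything
```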

Disallow:

The "Disallow" part is there to tell the robots what folders or files they should not look at. This means that if, for example, you do not want search engines to crawl the photos on your site, you can place those photos into one folder and exclude it.

Let's say that you have put all these photos into a folder called "photos". Now you want to tell search engines not to index that folder.

Here is what your robots.txt file should look like in that scenario:

User-agent: *
Disallow: /photos

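The above two lines of text in your robots.txt file would keep robots from visiting your photos folder. The "User-agent: *" part is saying "this applies to all robots". The "Disallow: /photos" part is saying "don't visit or index my photos folder". Note that the rule matches any path that begins with "/photos", so a page such as "/photos.html" would be blocked too; writing "Disallow: /photos/" limits the rule to the folder.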
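
To see which paths a "Disallow: /photos" line actually covers, you can test a few against it. This sketch uses Python's standard urllib.robotparser; the file names and the "ExampleBot" user agent are invented for the example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /photos"])

results = {}
for path in ["/photos/beach.jpg", "/photos.html", "/index.html"]:
    url = "https://www.example.com" + path
    results[path] = parser.can_fetch("ExampleBot", url)
    print(path, "allowed" if results[path] else "blocked")
```

The photo inside the folder is blocked, and so is "/photos.html", because the rule is compared against the start of the path. Only "/index.html" remains allowed.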

Googlebot specific instructions

The robot that Google uses to crawl pages for its search index is called Googlebot. It understands a few more instructions than the most basic robots.

In addition to "User-agent" and "Disallow", Googlebot also uses the "Allow" instruction.

Allow

Allow:

The "Allow:" instruction lets you tell a robot that it is okay to see a file in a folder that has been "Disallowed" by other instructions. To illustrate this, let's take the above example of telling the robot not to visit or index your photos. We put all the photos into one folder called "photos" and we made a robots.txt file that looked like this...

User-agent: *
Disallow: /photos

Now let's say there is a photo called mycar.jpg in that folder that you want Googlebot to index. With the "Allow:" instruction, we can tell Googlebot to do so. It would look like this...

User-agent: *
Disallow: /photos
Allow: /photos/mycar.jpg

This would tell Googlebot that it can visit "mycar.jpg" in the "photos" folder, even though the "photos" folder is otherwise excluded.
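
The reason the Allow line wins here is that Google applies the most specific matching rule, meaning the one with the longest path, and prefers Allow when two rules are equally specific. The Python sketch below is a simplified illustration of that idea, not Google's actual code: the is_allowed function is hypothetical and it ignores the "*" and "$" wildcards. Python's standard-library parser applies rules in file order instead, so it is not used here.

```python
def is_allowed(rules, path):
    """Decide access using the most specific (longest) matching rule.

    `rules` is a list of (directive, prefix) pairs for one user-agent group.
    Wildcards (* and $) are deliberately not handled in this sketch.
    """
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value or a non-matching prefix restricts nothing
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, the less restrictive Allow wins.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/photos"), ("Allow", "/photos/mycar.jpg")]

print(is_allowed(rules, "/photos/mycar.jpg"))  # True
print(is_allowed(rules, "/photos/beach.jpg"))  # False
print(is_allowed(rules, "/index.html"))        # True
```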

Testing your robots.txt file

To find out if an individual page is blocked by robots.txt you can use this technical SEO tool which will tell you if files important to Google are being blocked and also display the content of the robots.txt file.

Key concepts

If you use a robots.txt file, make sure it is being used properly
An incorrect robots.txt file can block Googlebot from indexing your page
Ensure you are not blocking files that Google needs to understand and rank your pages
