
Sunday, November 23, 2014

Robots Exclusion Protocol

The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention for advising cooperating web crawlers and other web robots about accessing all or part of a website that is otherwise publicly viewable.

Web Robots (also known as Web Wanderers, Crawlers, or Spiders), are programs that traverse the Web automatically. Search engines such as Google use them to index the web content, spammers use them to scan for email addresses, and they have many other uses.

The standard was proposed by Martijn Koster while working for Nexor in February 1994.

It is great when search engines frequently visit your site and index your content, but there are often cases when indexing parts of your online content is not what you want. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you'd rather have the printing version excluded from crawling; otherwise you risk incurring a duplicate-content penalty. Also, if you have sensitive data on your site that you do not want the world to see, you will prefer that search engines not index those pages (although in that case the only sure way to keep sensitive data out of the index is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, style sheets and JavaScript from indexing, you need a way to tell spiders to keep away from those items.

One way to tell search engines which files and folders on your Web site to avoid is to use the Robots meta tag. But since not all search engines read meta tags, the Robots meta tag can simply go unnoticed. A better way to inform search engines of your wishes is to use a robots.txt file.

What is Robots.txt?

Robots.txt is the common name of a text file that is uploaded to a Web site's root directory, where search robots request it before crawling the rest of the site.

Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but generally search engines obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection); putting up a robots.txt file is something like putting a note “Please, do not enter” on an unlocked door – you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why we say that if you have really sensitive data, it is naïve to rely on robots.txt to protect it from being indexed and displayed in search results.

The Details

The /robots.txt convention is a de facto standard, and is not owned by any standards body. There are two historical descriptions:
  • the original 1994 A Standard for Robot Exclusion document,
  • a 1997 Internet Draft specification A Method for Web Robots Control.

In addition there are external resources:
  • HTML 4.01 specification, Appendix B.4.1,
  • Wikipedia - Robots Exclusion Standard.

Where to Place?

The location of robots.txt is very important. It must be in the main directory; otherwise user agents (search engines) will not be able to find it – they do not search the whole site for a file named robots.txt. Instead, they look only in the main directory (i.e. http://mydomain.com/robots.txt), and if they don't find it there, they simply assume that the site does not have a robots.txt file and therefore index everything they find along the way. So, if you don't put robots.txt in the right place, do not be surprised when search engines index your whole site.
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under https://example.com:8080/ or https://example.com/.
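
As a rough illustration of what "one origin" means here, the short sketch below (an illustration added for clarity, not part of the original text; the example.com URLs and the robots_url name are placeholders) shows how a crawler would derive the robots.txt address for a page: keep the scheme and the host:port, and replace the path with /robots.txt.

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Keep only the scheme and host:port (the origin); drop path, query and fragment.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com/photos/mycar.jpg"))    # http://example.com/robots.txt
print(robots_url("https://a.example.com:8080/index.html"))  # https://a.example.com:8080/robots.txt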

Structure of a Robots.txt File?

The structure of a robots.txt file is pretty simple (and not very flexible) – it is an open-ended list of user agents and of disallowed and allowed files and directories.


Basically, the syntax is as follows:

User-agent:
Disallow:
Allow:

“User-agent:” names the search engine crawlers a record applies to; “Disallow:” and “Allow:” list the files and directories to be excluded from or included in indexing, respectively. In addition to “User-agent:”, “Disallow:” and “Allow:” entries, you can include comment lines – just put the # sign at the beginning of the line.
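
To make the role of these directives concrete, here is a minimal sketch of how a well-behaved crawler consults them before fetching a page, using Python's standard-library robots.txt parser. The domain example.com and the agent name "MyCrawler" are placeholders, not anything from this article.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()   # download and parse the site's robots.txt

# Ask before fetching; "MyCrawler" is the token the bot sends in its User-Agent header.
if rp.can_fetch("MyCrawler", "http://example.com/photos/mycar.jpg"):
    print("robots.txt allows this URL")
else:
    print("robots.txt disallows this URL")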

User-agent:

The "User-agent" part is there to specify directions to a specific robot if needed. There are two ways to use this in your file.

If you want to tell all robots the same thing, you put a "*" after "User-agent:". It would look like this...

User-agent: *

If you want to tell a specific robot something (in this example Googlebot) it would look like this... (this line is saying "these directions apply to just Googlebot")

User-agent:Googlebot


Disallow:

The "Disallow" part is there to tell the robots what folders they should not look at. This means that if, for example you do not want search engines to index the photos on your site then you can place those photos into one folder and exclude it.
Lets say that you have put all these photos into a folder called "photos". Now you want to tell search engines not to index that folder.

User-agent:*
Disallow:/photos

The above two lines of text in your robots.txt file would keep robots from visiting your photos folder. The "User-agent *" part is saying "this applies to all robots". The "Disallow: /photos" part is saying "don't visit or index my photos folder".
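
If you want to double-check rules like these programmatically, the same standard-library parser can be fed the file directly. The sketch below is only an illustration (example.com and "AnyBot" are placeholders); it confirms that /photos is blocked while the rest of the site is not.

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /photos
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "http://example.com/photos/holiday.jpg"))  # False: under /photos
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))          # True: not under /photos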

Allow:

The "Allow:" instructions lets you tell a robot that it is okay to see a file in a folder that has been "Disallowed" by other instructions.

To illustrate this, let's take the above example of telling the robot not to visit or index your photos. We put all the photos into one folder called "photos" and we made a robots.txt file that looked like this...

User-agent:*
Disallow:/photos

Now let's say there was a photo called mycar.jpg in that folder that you want Googlebot to index. With the Allow: instruction, we can tell Googlebot to do so; it would look like this...

User-agent:*
Disallow:/photos
Allow:/photos/mycar.jpg

This would tell Googlebot that it can visit "mycar.jpg" in the "photos" folder, even though the "photos" folder is otherwise excluded.
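
Here is a similar sketch (again just an illustration; example.com is a placeholder) for the Allow example. One caveat, stated as an assumption about parser behaviour rather than a rule from this article: Googlebot resolves Allow/Disallow conflicts by the most specific (longest) matching rule, whereas Python's urllib.robotparser applies rules in file order, so for that parser the Allow line has to appear before the Disallow line it carves an exception out of.

import urllib.robotparser

rules = """\
User-agent: *
Allow: /photos/mycar.jpg
Disallow: /photos
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://example.com/photos/mycar.jpg"))  # True: explicitly allowed
print(rp.can_fetch("Googlebot", "http://example.com/photos/other.jpg"))  # False: still disallowed

Putting Allow lines before the Disallow they refine is a safe habit, because order-sensitive parsers and longest-match parsers then agree on the result.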


Example

User-agent:FreeFind
Disallow:/mysite/test/
Disallow:/mysite/cgi-bin/post.cgi?action=reply
Disallow:/a


In this example the following addresses would be ignored by the spider:

http://adomain.com/mysite/test/index.html
http://adomain.com/mysite/cgi-bin/post.cgi?action=reply&id=1
http://adomain.com/mysite/cgi-bin/post.cgi?action=replytome
http://adomain.com/abc.html


and the following ones would be allowed:

http://adomain.com/mysite/test.html
http://adomain.com/mysite/cgi-bin/post.cgi?action=edit
http://adomain.com/mysite/cgi-bin/post.cgi
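
To sanity-check the two lists, the sketch below (an added illustration, not part of the FreeFind example) runs the same record through Python's standard-library parser. Note that Disallow values are plain prefixes, so they also match query strings and longer names such as /abc.html.

import urllib.robotparser

rules = """\
User-agent: FreeFind
Disallow: /mysite/test/
Disallow: /mysite/cgi-bin/post.cgi?action=reply
Disallow: /a
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

ignored = [
    "http://adomain.com/mysite/test/index.html",
    "http://adomain.com/mysite/cgi-bin/post.cgi?action=reply&id=1",
    "http://adomain.com/mysite/cgi-bin/post.cgi?action=replytome",
    "http://adomain.com/abc.html",                  # "/a" is a prefix of "/abc.html"
]
allowed = [
    "http://adomain.com/mysite/test.html",          # "/mysite/test/" is not a prefix of it
    "http://adomain.com/mysite/cgi-bin/post.cgi?action=edit",
    "http://adomain.com/mysite/cgi-bin/post.cgi",
]

for url in ignored:
    assert not rp.can_fetch("FreeFind", url), url
for url in allowed:
    assert rp.can_fetch("FreeFind", url), url
print("the parser agrees with the lists above")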


It is also possible to use an "allow" in addition to disallows. For example:

User-agent:FreeFind
Disallow:/cgi-bin/
Allow:/cgi-bin/ultimate.cgi
Allow:/cgi-bin/forumdisplay.cgi


Example demonstrating multiple user-agents:

User-agent:googlebot                       # all Google services
Disallow:/private/                         # disallow this directory

User-agent:googlebot-news                  # only the news service
Disallow:/                                 # disallow everything

User-agent:*                               # any robot
Disallow:/something/                       # disallow this directory
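
The sketch below (an added illustration; example.com is a placeholder) probes this example with Python's standard-library parser. The point to notice is that a robot named by a specific record obeys only that record; the "*" record applies only to robots not matched elsewhere.

import urllib.robotparser

rules = """\
User-agent: googlebot
Disallow: /private/

User-agent: googlebot-news
Disallow: /

User-agent: *
Disallow: /something/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://example.com/private/x"))    # False: its own record blocks /private/
print(rp.can_fetch("Googlebot", "http://example.com/something/x"))  # True: the "*" record does not apply to it
print(rp.can_fetch("OtherBot",  "http://example.com/something/x"))  # False: the "*" record applies

One caveat: for overlapping tokens such as googlebot and googlebot-news, Google picks the most specific matching record, while simpler parsers (including this one) may just take the first record whose token matches, so results for Googlebot-News can differ between tools.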

Traps of a Robots.txt File?

When you start making complicated files – i.e. you decide to allow different user agents access to different directories – problems can start if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradictory directives. Typos are misspelled user agents or directories, missing colons after User-agent and Disallow, and so on. Typos can be tricky to find, but in some cases validation tools help.

The more serious problem is with logical errors. For instance:

User-agent:*
Disallow:/temp/
User-agent:Googlebot
Disallow:/images/
Disallow:/temp/
Disallow:/cgi-bin/

The above example is from a robots.txt that allows all agents to access everything on the site except the /temp directory, and then adds a second record with more restrictive terms for Googlebot. According to the standard, a robot should obey the record that names it specifically and fall back to the "*" record only when no specific record exists, so a compliant crawler would apply the Googlebot record here. Some crawlers, however, simply read the file from the top and stop at the first record that matches them; such a crawler would see that all user agents (including Googlebot) are allowed everything except /temp/, stop there, and index /images/ and /cgi-bin/ – which you thought you had told it not to touch. The structure of a robots.txt file is simple, but serious mistakes can still be made easily: when you mix a "*" record with agent-specific records, repeat every shared rule inside each specific record (as this example does for /temp/), so the outcome does not depend on how a given crawler reads the file.
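
As a concrete reference point (an added illustration, not a claim from the original text), Python's standard-library parser is one of the parsers that select the record by user agent, so it applies the Googlebot record here:

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://example.com/images/a.png"))  # False: the Googlebot record wins
print(rp.can_fetch("SomeBot",   "http://example.com/images/a.png"))  # True: only /temp/ is blocked for others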


Tools to Generate and Validate a Robots.txt File?

Given the simple syntax of a robots.txt file, you can always read it yourself to see if everything is OK, but it is much easier to use a validator, like this one: http://tool.motoricerca.info/robots-checker.phtml. These tools report common mistakes, like missing hyphens, slashes or colons, which if not detected compromise your efforts. For instance, if you have typed:

User agent:*
Disallow:/temp/

this is wrong because there is no hyphen between “User” and “agent”, so the first line is not a valid directive.
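
To give a rough idea of what such a checker looks for, here is a toy sketch (purely illustrative – it is not one of the tools mentioned above and only catches line-level problems): it flags any non-comment line that is not a "Field: value" pair with a known field name.

import re

KNOWN_FIELDS = {"user-agent", "disallow", "allow"}   # real files may also use e.g. Sitemap, Crawl-delay
LINE_RE = re.compile(r"^([A-Za-z-]+)\s*:\s*(.*)$")

def check_robots(text):
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        content = line.split("#", 1)[0].strip()      # drop comments and surrounding whitespace
        if not content:
            continue                                 # blank or comment-only line
        match = LINE_RE.match(content)
        if not match or match.group(1).lower() not in KNOWN_FIELDS:
            problems.append((lineno, line))
    return problems

bad = "User agent:*\nDisallow:/temp/\n"              # the broken example above
for lineno, line in check_robots(bad):
    print("line", lineno, "cannot be parsed:", repr(line))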


In those cases when you have a complex robots.txt file – i.e. you give different instructions to different user agents, or you have a long list of directories and subdirectories to exclude – writing the file manually can be a real pain. But do not worry – there are tools that will generate the file for you. What is more, there are visual tools that allow you to point and select which files and folders are to be excluded. But even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For instance, the Server-Side Robots Generator offers a dropdown list of user agents and a text box for you to list the files you don't want indexed. Honestly, it is not much of a help unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories, but it is better than nothing.

Can I block just bad robots?

In theory yes; in practice, no. If the bad robot obeys /robots.txt and you know the name it scans for in the User-Agent field, then you can create a section in your /robots.txt to exclude it specifically. But almost all bad robots ignore /robots.txt, making that pointless.

If the bad robot operates from a single IP address, you can block its access to your web server through server configuration or with a network firewall.

If copies of the robot operate at lots of different IP addresses, such as hijacked PCs that are part of a large botnet, then it becomes more difficult. The best option then is to use advanced firewall rules that automatically block access from IP addresses that make many connections; but that can hit good robots as well as bad robots.
