This post is more advanced and will help you to better understand the robots.txt file.
Website owners use the robots.txt file to give instructions about their site to web robots. This mechanism is known as “The Robots Exclusion Protocol”, or REP.
The process looks like this: a robot (crawler) wants to visit a page URL. Before it does so, it first checks whether a robots.txt file exists in the root folder of that website (the top-level directory of your web server). Example: http://www.example.com/robots.txt
If the robot can’t find a file called robots.txt, it assumes it is allowed to visit all pages of the website, so it proceeds to visit the page it wanted. If that robot is a search engine crawler, it will next look for a robots meta tag in the HTML code of that page before taking any action.
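To make the lookup step concrete, here is a minimal Python sketch of how a crawler could derive the robots.txt location from any page URL. The function name robots_txt_url and the sample page URL are made up for illustration; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page used only for illustration.
print(robots_txt_url("http://www.example.com/blog/post.html"))
# prints: http://www.example.com/robots.txt
```

Whatever page the robot is after, the file it checks first always sits at the root of the same host.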
A simple robots.txt file contains just two lines: “User-agent: *” followed by “Disallow: /”.
“User-agent” refers to the identity of the robot. A “*” means “all”, so a “User-agent: *” line applies to all web robots. However, you can replace * with the name of a specific bot, such as Googlebot, Googlebot-Image, bingbot, msnbot, etc. For a more complete list of available bot names, see the Robots Database.
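As a rough illustration of targeting a single bot, the sketch below feeds a two-line rule set to Python's standard urllib.robotparser module and asks which robots may fetch a page. The page URL is hypothetical, and Python's parser is only one possible implementation; real crawlers may match bot names differently.

```python
from urllib.robotparser import RobotFileParser

# Rules that apply to one named bot only.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.example.com/photo.html"  # hypothetical page
image_bot_allowed = parser.can_fetch("Googlebot-Image", page)
bing_allowed = parser.can_fetch("bingbot", page)

print(image_bot_allowed)  # False: the named bot is blocked
print(bing_allowed)       # True: no rule group applies to bingbot
```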
“Disallow” restricts robots’ access to the path specified after the “:”. The slash “/” means the website root directory. So a “Disallow: /” rule under “User-agent: *” denies all web robots access to all website pages. If you delete the “/” and leave the value empty, robots will have access to all website pages.
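The difference between “Disallow: /” and an empty “Disallow:” can be checked with a short Python sketch based on the standard urllib.robotparser module. The helper is_allowed, the bot name ExampleBot and the page URL are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # slash: whole site is off limits
ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is off limits

page = "http://www.example.com/about.html"  # hypothetical page
print(is_allowed(BLOCK_ALL, page))  # False
print(is_allowed(ALLOW_ALL, page))  # True
```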
Disallowing access to a directory automatically disallows access to all the files and folders inside it. However, you can allow access to individual files inside a blocked directory by adding an “Allow” directive, followed by the path to the file, including its name and extension.
For example, a “Disallow: /images/” rule combined with an “Allow: /images/logo.png” rule denies bots access to the “images” folder, but still allows access to the “logo.png” file inside it.
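Here is a small Python sketch of that Allow/Disallow combination, again using the standard urllib.robotparser module. One caveat: Python's parser applies the first rule that matches, so the Allow line is listed first here, whereas crawlers such as Googlebot pick the most specific matching rule regardless of order. The file names other than logo.png and the bot name ExampleBot are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Allow is listed first because urllib.robotparser uses the first match.
ROBOTS_TXT = """\
User-agent: *
Allow: /images/logo.png
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.example.com"
logo_allowed = parser.can_fetch("ExampleBot", base + "/images/logo.png")
banner_allowed = parser.can_fetch("ExampleBot", base + "/images/banner.png")
other_allowed = parser.can_fetch("ExampleBot", base + "/about.html")

print(logo_allowed)    # True: explicitly allowed file
print(banner_allowed)  # False: everything else under /images/ is blocked
print(other_allowed)   # True: no rule matches, so access is granted
```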
It is very important to put each rule on a separate line. Otherwise, robots will not read the file properly. Also, use only lower case for the robots.txt file name (not Robots.txt, ROBOTS.TXT, Robots.TXT or other combinations). “.txt” is just the extension of the file, so make sure your file is not named “robots.txt.txt”.
The robots.txt file is publicly accessible to anyone who visits http://example.com/robots.txt in a web browser, so do not use this file to hide confidential information. Also, malicious bots can simply ignore the content of the robots.txt file, so do not rely on this file alone to protect vulnerable parts of your website.
If you have created a Sitemap to list the most important pages on your site, you can point the bots to it by adding a “Sitemap:” line, followed by the full URL of the Sitemap, at the end of the robots.txt file.
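As a sketch of how a robot might pick up the Sitemap reference, the Python snippet below parses a sample file with the standard urllib.robotparser module and reads the Sitemap URLs back through site_maps(), which is available from Python 3.8. The sitemap URL and the rule above it are placeholders, not values from a real site.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules with a Sitemap reference at the end of the file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

sitemaps = parser.site_maps()  # list of URLs, or None if no Sitemap line exists
print(sitemaps)
# prints: ['http://www.example.com/sitemap.xml']
```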
Google Webmaster Tools includes a tool for testing robots.txt files that shows you whether the file blocks Google crawlers from specific URLs on your website. It’s accessible at https://www.google.com/webmasters/tools/robots-testing-tool
While Google, Bing and other major search engines won’t crawl the content blocked by robots.txt, a blocked URL can still appear in search results, for example with anchor text taken from links that point to the site. To keep a page out of the SERP completely, use the HTML robots meta tag with a “noindex” value. Keep in mind that the crawler has to be able to fetch the page to see that tag, so do not block the same page in robots.txt.
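To show what a crawler looks for in the page itself, here is a minimal Python sketch that extracts the directives from a robots meta tag using the standard html.parser module. The class name RobotsMetaParser and the sample HTML are made up for illustration; real crawlers handle many more cases.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


# Sample page used only for illustration.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body>Hello</body></html>"
)

meta = RobotsMetaParser()
meta.feed(PAGE)
print("noindex" in meta.directives)  # True: the page asks not to be indexed
```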
Why use a robots.txt file?
Use a robots.txt file to keep selected folders, files or pages from being crawled and shown in the SERP. You can also stop search engines from indexing the images on your website.
For example, the robots.txt file that WordPress generates after installation disallows the “/wp-admin/” directory for all robots.
This means that search engines should not access http://example.com/wp-admin/ or show this page in the SERP. You can restrict access to other files and folders in the same way.
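The effect of such a rule can be sketched in Python with the standard urllib.robotparser module. The rule set below is reduced to the single /wp-admin/ restriction described here, so treat it as an assumption; the file WordPress actually generates may contain additional lines. The post URL and the bot name ExampleBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal rule set; a real WordPress file may include more lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

admin_allowed = parser.can_fetch("ExampleBot", "http://example.com/wp-admin/")
post_allowed = parser.can_fetch("ExampleBot", "http://example.com/hello-world/")

print(admin_allowed)  # False: the admin area is off limits
print(post_allowed)   # True: regular pages stay crawlable
```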