In this episode of Path TV, Avinash Conda and Michael Stearne of Path Interactive discuss the robots.txt file and its importance to your website.
So what is robots.txt, and how does it affect search engine crawlers? To understand robots.txt, it’s important to first understand what a robot is. Avinash describes it as a program that crawls your website so its pages can be indexed, uncovering what exactly your website is all about. A robots.txt file is simply a text file that tells the robot what to crawl and what not to crawl, which in turn shapes what ends up in search engines. If there is something on your website that you simply do not want a search engine to crawl, you can use your robots.txt file to tell robots not to crawl those specific pages.
You may not want some pages on your website, such as pages with sensitive data or a thank-you page, to be indexed by search engines. In that case, you’ll want to set up your robots.txt file accordingly, signaling search engine bots not to crawl those pages.
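To make the idea concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt contents, the paths, the `example.com` domain, and the robot name `AnyBot` are all made up for illustration; they are not taken from the episode.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: every robot ("*") is asked to stay away
# from a thank-you page and an admin section. The paths are illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thank-you.html
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that we already hold in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved robot asks this question before requesting each page.
print(parser.can_fetch("AnyBot", "https://example.com/thank-you.html"))  # False
print(parser.can_fetch("AnyBot", "https://example.com/products.html"))   # True
```

The file only states a request. It is up to each robot to check it, as the sketch does, before fetching a page.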
Stay tuned for the next episode of Path TV, in which Avinash and Michael will discuss the structure of a robots.txt file.
0:13 Michael Stearne – Hello and welcome to another exciting episode of Path TV.
0:15 I am your co-host Michael Stearne and this is my co-host Avinash Conda. And today we are discussing robots.txt.
0:31 Avinash Conda – All right.
0:33 Michael Stearne – So what is it? What’s it for?
0:35 Avinash Conda – Yeah. First thing: what are robots? Well, they are all the same thing: robots, spiders, crawlers.
0:41 They are nothing but a program, a script which search engines and a few other sites write to come to your website and look at all your pages.
0:52 They take a list of all the pages, which is called crawling: coming to your website and actually crawling your website.
0:59 To index them in their indexes. Mostly for search engines’ sake, but there are a few other sites which do that too. So those are robots.
1:09 Michael Stearne – They can go through every page in the site, page by page, crawling around the site as if they were a regular visitor, a real visitor, going from page to page.
1:19 Avinash Conda – Yeah, that’s a robot. So what’s a robots.txt?
1:22 It’s a text file, obviously, .txt, which tells the robot what to crawl and what not to crawl.
1:30 So it’s not only for security reasons; it’s also a preference for people who don’t want some of their pages indexed by search engines, a thank-you page or something.
1:41 Michael Stearne – Some administrative section or private section of a site.
1:44 Avinash Conda – Private sections. Some people actually have passwords on their website, all that data they don’t want crawled and indexed. So those can actually be blocked in robots.txt.
1:54 It’s a file which can tell any robot: just crawl this part and not this part. Or you can actually name the robot, let it be the Google robot, and say, for this search engine, please crawl all this, and you can actually block some of the robots from crawling it.
2:14 For example, maybe your website doesn’t want any of its images to be indexed in Google at all. Then it can simply say: Google image robot, do not crawl my images.
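As a rough illustration of naming a specific robot, the Python sketch below gives one group of rules to Google's image crawler (its user-agent token is `Googlebot-Image`) and another to everyone else. The `/images/` and `/private/` paths, the `example.com` domain, and the robot name `OtherBot` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group addressed to a named robot,
# and a catch-all group ("*") for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

image_url = "https://example.com/images/logo.png"

# The named image robot is asked to stay out of /images/ ...
print(parser.can_fetch("Googlebot-Image", image_url))  # False
# ... while a robot with a different name falls under the "*" group,
# which says nothing about /images/.
print(parser.can_fetch("OtherBot", image_url))         # True
```

A robot uses the group that matches its own name if there is one, and the `*` group otherwise.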
2:29 Michael Stearne – Which is a great thing when the search engines play by the rules. Every search engine, when it sends a robot or crawler to your website, looks for this special file at yourdomain.com/robots.txt.
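Because the file always sits at the root of the site, a crawler can work out where to look from any page address. The small Python sketch below shows one way to do that; the function name `robots_txt_url` and the sample page address are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/post.html?page=2"))
# https://yourdomain.com/robots.txt
```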
2:52 Avinash Conda – Yeah, there is actually a protocol called the Robots Exclusion Protocol, which most search engines, I cannot say all, and most robots actually follow.
3:03 Except perhaps some malware, which is a different school altogether, but yeah, most of them follow the protocol.
3:10 Michael Stearne – So that is one little problem: Google and Bing and Yahoo and Lycos, all these different search engines, will follow these rules, but if there is some kind of spammy search engine just crawling your site for data mining or whatever, it is not going to follow them anyway, so you have to take other security measures.
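The point that the rules are voluntary can be sketched in Python as follows. The two functions, the robot name `PoliteBot`, and the URLs are all hypothetical; the sketch only decides which pages each kind of crawler would request and does not fetch anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page list, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

URLS = [
    "https://example.com/",
    "https://example.com/private/accounts.html",
]


def polite_crawl_list(robots_txt, user_agent, urls):
    """Pages a rule-following robot would request: it checks robots.txt first."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def rude_crawl_list(urls):
    """Pages a robot that ignores robots.txt would request: all of them."""
    return list(urls)


print(polite_crawl_list(ROBOTS_TXT, "PoliteBot", URLS))
# ['https://example.com/']
print(rude_crawl_list(URLS))
# ['https://example.com/', 'https://example.com/private/accounts.html']
```

Nothing in robots.txt stops the second kind of crawler, which is why private pages still need real protection such as a login.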
3:33 Avinash Conda – And you might have guessed can help.
3:39 Michael Stearne – On the next episode Avinash will write a WebCrawler.
3:43 Avinash Conda – No I wouldn’t. Why would I?
3:46 Michael Stearne – You are not? I thought we were gonna program a web crawler onscreen using the Google robot.
3:52 Avinash Conda – Not the Google robot, but a random small crawler just to crawl something. I will try.
4:00 Michael Stearne – Maybe, maybe not. But thanks for watching. Avinash will do the sign-off. This is Michael Stearne from Path TV. Thank you, and Avinash Conda.
4:15 Avinash Conda – Namaste. Bye.
4:16 Michael Stearne – No. No. No.
4:19 Avinash Conda – What?
4:20 Michael Stearne – Let’s go.
4:21 Avinash Conda – Let’s go.