You may not have realized it - I hadn't - but there are gangs of unruly robots charging around the place, night and day.
Once they've found your house, this horde will not knock. Well-behaved robots are supposed to check and obey your robots.txt. Many don't.
They will enter not only by the front door, but through every window, down your chimney and through cracks in the walls. And they won't go away. They'll seek out and enter every room and crawl-space they can find or they once heard about. Not once but over and over, for ever.
These robots are tricky. You can't always tell who they are. Visitors to your house are supposed to identify themselves with a distinctive user agent string. The robots' strings are far too numerous to be a practical guide. And some robots pretend to be people. Only their behaviour may give them away.
To learn more, see GooGle:robots
Would a MeatBall:SurgeProtector help?
Absolutely. Nice name for it.
Would a meta-tag in standard_wiki_header help?
<meta name="robots" content="noindex,nofollow">
Thanks, I wasn't aware of that. It may do some good for some robots.
Hmm, what I really want is to let e.g. GooGle? index content without getting lost or following backlinks or reparent links.
'Zope's acquisition feature makes this tricky to do reliably, unfortunately.'
So what we really need is a robots.txt somewhat along these lines:
    # robots.txt
    User-agent: *     # Or googlebot, or whatever
    Disallow: /backlinks
    Disallow: */map
    # eof
isn't it?
Alas not supported. Disallow lines can only contain URL prefixes.
Could this be achieved by adding the meta-noindex-nofollow stuff to the header of */{backlinks,map} pages?
Yes, this should work.
... provided the robots are well-behaving, and you did call them unruly. ;-)
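For concreteness, here is a minimal sketch of that idea in Python (the helper name and the view list are made up for illustration, this is not ZWiki's actual API): emit noindex,nofollow only for the expensive views, so ordinary pages stay indexable. The standard_wiki_header would then call something like this helper instead of hard-coding one meta tag for every page.

    # Hypothetical helper, not ZWiki's actual API: emit the robots meta tag
    # only for the expensive views, so ordinary pages stay indexable.
    NOINDEX_VIEWS = ('backlinks', 'map', 'diff')

    def robotsMetaTag(view_name):
        """Return a robots meta tag appropriate for the given view."""
        if view_name in NOINDEX_VIEWS:
            return '<meta name="robots" content="noindex,nofollow">'
        return '<meta name="robots" content="index,follow">'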
- the problem with that approach is that what you really want is to keep robots from requesting those pages at all, because the robots themselves are likely to generate more peak load than users following those links later. --BillSeitz Oct'01
- I'm aware that right now (0.9.5pre1) all pages have INDEX,NOFOLLOW tags. Unfortunately this keeps the entire wiki from being indexed (unless you submit each page separately to a robot, which is a pain and generally not liked by the robot operators)
- a different approach might be to hide map and backlinks features behind buttons triggering POST actions, since robots generally ignore those.
- yet another approach would be to write a function like isARobot or isABrowser based on the Agent request header param, then use that within map/backlinks pages to suppress the entire searching process for those Agents. Of course this assumes that you're dealing with a properly-identified Agent. But in my experience the real agents are the most common sources of nasty load. If you wanted to refine the process, you could flag IP#s of requestors, but that (a) adds more code slowing down the whole site, and (b) doesn't catch users with dumb software if they're on dynamic IP (unless you block a whole range which then locks out other users on the same network). Tough calls.
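As a rough sketch of that isARobot idea (the function name, the robot list and the REQUEST usage here are illustrative assumptions, not ZWiki's actual code), one could match the User-Agent header against known crawler names:

    # Illustrative sketch only: guess whether a request comes from a
    # (properly identified) robot by inspecting its User-Agent header.
    KNOWN_ROBOTS = ('googlebot', 'slurp', 'msnbot', 'crawler', 'spider')

    def isARobot(REQUEST):
        agent = REQUEST.get_header('User-Agent') or ''
        agent = agent.lower()
        return any(name in agent for name in KNOWN_ROBOTS)

    # map/backlinks code could then bail out early:
    #   if isARobot(REQUEST):
    #       return 'Sorry, this view is not served to robots.'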
BTW, the RobotExclusionStandard? can be found at http://info.webcrawler.com/mak/projects/robots/norobots.html
See also: TheSpamProblem, TheVandalProblem, ModThrottle
Take special care if your ZMI is exposed to robots!
Comments:
FlorianKonnertz, 2002/12/04 15:44 GMT (via web):
My site suffered from heavy traffic overload today, most of it caused by googlebot. I don't use the meta-tag, because I want to have my pages indexed. What can I do in addition to the robots.txt suggested above? Is this enough? I use 0.11; backlinks and map pages are no longer used. I can't do anything else, can I?
BTW, is this what is meant by #341, "a googlebot or other crawl repeatedly crashes zwiki.org"?
Yes, #351 "zwiki.org's zope server leaks memory (and hangs/restarts when it reaches quota)" (memory leakage) and #226 "on freebsd, python/zope may crash repeatably when browsing diffs or saving certain pages" (frequent crashes/restarts) are also relevant. --SM
FlorianKonnertz, 2002/12/06 20:06 GMT (via web):
I see. Thanks, Simon. Meanwhile our server is recovering from its hyperactive guest. See you again in four weeks, Mr. Googlebot? ;-)
SimonMichael, 2002/12/06 22:41 GMT (via web):
Oh, does he come by every four weeks?
Florian, are you running on freebsd, as a matter of interest? Do you experience #226, "on freebsd, python/zope may crash repeatably when browsing diffs or saving certain pages"? (lots of aieee error code 10 & 11's in your event log).
FlorianKonnertz, 2003/01/20 06:40 GMT (via web):
Simon: The GooGle? website says that it takes up to 4-6 weeks for new websites to get indexed automatically, and I think I read elsewhere that he comes by every four weeks on average.
I use Linux-2.4.20, Zope-2.5.1;
We will upgrade our server soon, so all robots are invited to come ;-)
hide some functions behind buttons -- Tue, 10 Feb 2004 08:56:04 -0800 reply
If you have certain functions like backlinks, parents, etc. triggered by a button doing a POST, rather than a link, that will keep robots away from those functions... --BillSeitz
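A minimal sketch of what that could look like (the method and attribute names here are assumptions, not ZWiki's real API): serve a small form on GET, and only run the expensive search on POST, which most robots never issue.

    # Illustrative sketch: plain GETs just get a button, the real work
    # only happens for POST requests (which robots generally don't send).
    BUTTON_FORM = '''
    <form method="post" action="%(url)s/backlinks">
      <input type="submit" value="Show backlinks">
    </form>
    '''

    def backlinks(self, REQUEST):
        if REQUEST.get('REQUEST_METHOD', 'GET') != 'POST':
            return BUTTON_FORM % {'url': REQUEST['URL1']}
        return self.renderBacklinks()   # hypothetical expensive search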
... -- Mon, 05 Apr 2004 06:20:07 -0700 reply
Just checking my robots.txt again and rereading this page... Here's the current robots protocol: http://www.robotstxt.org/wc/exclusion-admin.html (or did I miss the link above?). I'm trying it now inside Plone with:
    User-agent: *
    Allow: /
    Disallow: /external_edit
    Disallow: /recycle_bin/
    Disallow: /IssueTracker
    Disallow: /FilterIssues
    Disallow: /backlinks
    Disallow: /diff
    Disallow: /sendto_form
    Disallow: /subscribeform
    Disallow: /login_form
    Disallow: /mail_password_form
    Disallow: /search_form
    Disallow: /enabling_cookies
update of robots.txt -- Mon, 05 Apr 2004 06:21:52 -0700 reply
BTW, how often does Google check whether robots.txt has changed?
update of robots.txt --simon, Mon, 05 Apr 2004 09:15:36 -0700 reply
Google appears to check often.. at the beginning of each crawl I think.
useful links --Simon Michael, Fri, 11 Mar 2005 21:50:22 -0800 reply
Jens Vagelpohl wrote:
This is the spec: http://www.robotstxt.org/wc/norobots.html
Here is a robots.txt validator:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
Here's a funny one: someone collected all the reckless/useless user agents for exclusion:
http://www.searchenginegenie.com/Dangerous-user-agents.htm
This one explains Slurp-specific extensions:
my personal robots.txt -- jensn -- Sun, 10 Apr 2005 02:20:41 -0700 reply
This is what my robots.txt looks like for my wiki. Feel free to comment on it. Maybe we could "offer" a general robots.txt here, which fits most ZWiki installations:
Somehow the wildcards (stars, "*") were not being rendered in the robots.txt below (maybe someone with more knowledge of zwiki syntax could fix this).
    # only Yahoo Slurp supports "Crawl-Delay"
    # see http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
    User-agent: Slurp
    Crawl-Delay: 120
    Disallow:

    # only google supports wildcards (*) and end-markers ($)
    User-agent: googlebot
    Disallow: /*/map
    Disallow: /*/subscribeform
    Disallow: /*/diff
    Disallow: /*/backlinks
    Disallow: /*/editform

    # default for all others
    # not sure about the semantics of this
    # I hope it does not mean "/diff" but that all files called "diff" will not be crawled
    User-agent: *
    Disallow: map
    Disallow: subscribeform
    Disallow: diff
    Disallow: backlinks
    Disallow: editform
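If you want a quick local check of such a robots.txt, Python's standard robotparser module can help, though it only implements the basic prefix rules (it ignores wildcards and Crawl-Delay), so treat it as a rough check only; the URLs below are made up for illustration.

    # Rough local check with the standard library robot parser.
    # Note: it ignores the wildcard and Crawl-Delay extensions.
    try:
        from urllib.robotparser import RobotFileParser   # Python 3
    except ImportError:
        from robotparser import RobotFileParser          # Python 2

    rp = RobotFileParser()
    rp.set_url('http://example.org/robots.txt')          # your wiki's robots.txt
    rp.read()
    for url in ('http://example.org/FrontPage',
                'http://example.org/FrontPage/backlinks'):
        print(url, rp.can_fetch('googlebot', url))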
Using "Conditional HTTP GET" will help with Google --betabug, Thu, 01 Mar 2007 10:16:18 +0000 reply
The new code that handles "Conditional HTTP GET" for Zwiki (currently in darcs, soon in a .tgz file near you!) will likely also lessen the load due to Google runs. Google does "If-modified-since" requests, and handing it 304 replies with no content when there is no change reduces traffic a lot. Only sometimes will some of the googlebots get old content again unconditionally - maybe when they have a mess in their own database.
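Roughly, the conditional GET logic amounts to something like this (a sketch under assumed accessor names, not the actual Zwiki code): compare If-Modified-Since with the page's last edit time and answer 304 with no body when nothing has changed.

    # Sketch only: the accessor names (last_edit_time, render) are assumptions.
    import time, calendar
    from email.utils import parsedate

    def render_if_modified(self, REQUEST, RESPONSE):
        last_mod = self.last_edit_time()                 # seconds since epoch
        ims = REQUEST.get_header('If-Modified-Since')
        if ims:
            parsed = parsedate(ims.split(';')[0])        # ignore old "; length=" suffix
            if parsed and calendar.timegm(parsed) >= int(last_mod):
                RESPONSE.setStatus(304)                  # Not Modified, send no body
                return ''
        RESPONSE.setHeader('Last-Modified',
                           time.strftime('%a, %d %b %Y %H:%M:%S GMT',
                                         time.gmtime(last_mod)))
        return self.render()                             # full page render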