You may not have realized it - I hadn't - but there are gangs of unruly robots charging around the place, night and day.

Once they've found your house, this horde will not knock. Well-behaved robots are supposed to check and obey your robots.txt. Many don't.

They will enter not only by the front door, but through every window, down your chimney and through cracks in the walls. And they won't go away. They'll seek out and enter every room and crawl-space they can find or once heard about. Not once, but over and over, forever.

These robots are tricky. You can't always tell who they are. Visitors to your house are supposed to identify themselves with a distinctive user agent string. The robots' strings are far too numerous to be a practical guide. And some robots pretend to be people. Only their behaviour may give them away.

To learn more, see GooGle:robots


Would a MeatBall:SurgeProtector help?

Absolutely. Nice name for it.

Would a meta-tag in standard_wiki_header help?

<meta name="robots" content="noindex, nofollow">

Thanks, I wasn't aware of that. It may do some good for some robots.

Hmm, what I really want is to let e.g. GooGle? index content without getting lost or following backlinks or reparent links.

'Zope's acquisition feature makes this tricky to do reliably, unfortunately.'

So what we really need is a robots.txt somewhat along these lines:

 # robots.txt
 User-agent: *  # Or googlebot, or whatever
 Disallow: */backlinks
 Disallow: */map
 # eof

isn't it?

Alas not supported. Disallow lines can only contain URL prefixes.
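
(One partial workaround, which comes up again further down this page: Googlebot goes beyond the standard and does understand * wildcards in Disallow lines, so something along these lines should work for Google at least. An untested sketch:)

 # non-standard; honoured by Googlebot only
 User-agent: Googlebot
 Disallow: /*/backlinks
 Disallow: /*/map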

Could this be achieved by adding the meta-noindex-nofollow stuff to the header of */{backlinks,map} pages?

Yes, this should work.

... provided the robots are well-behaved, and you did call them unruly. ;-)
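
For the record, here is roughly how such a conditional tag might look in standard_wiki_header. This is an untested DTML sketch, and it assumes REQUEST is available in the header's namespace, as it normally is:

 <dtml-if expr="_.string.find(REQUEST['URL'], '/backlinks') != -1 or _.string.find(REQUEST['URL'], '/map') != -1">
 <meta name="robots" content="noindex,nofollow">
 </dtml-if>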

BTW, the RobotExclusionStandard? can be found at http://info.webcrawler.com/mak/projects/robots/norobots.html

See also: TheSpamProblem, TheVandalProblem, ModThrottle

Take special care if your ZMI is exposed to robots!
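
For example (a sketch only, and no substitute for proper access control), you can at least ask crawlers to stay away from the management screens:

 User-agent: *
 Disallow: /manage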


Comments:

FlorianKonnertz, 2002/12/04 15:44 GMT (via web):
My site suffered a heavy traffic overload today, most of it caused by googlebot. I don't use the meta-tag because I want to have my pages indexed. What can I do in addition to the robots.txt suggested above? Is this enough? I use 0.11; backlinks and map pages are not used any more. I can't do anything else, can I?

FlorianKonnertz

BTW, is this what's meant by #341, "a googlebot or other crawl repeatedly crashes zwiki.org"?

Yes. #351, "zwiki.org's zope server leaks memory (and hangs/restarts when it reaches quota)" (memory leakage), and #226, "on freebsd, python/zope may crash repeatably when browsing diffs or saving certain pages" (frequent crashes/restarts), are also relevant. --SM

FlorianKonnertz, 2002/12/06 20:06 GMT (via web):
I see. Thanks, Simon. Meanwhile our server is recovering from its hyperactive guest. See you again in four weeks, Mr. googlebot? ;-)

SimonMichael, 2002/12/06 22:41 GMT (via web):
Oh, does he come by every four weeks?

Florian, as a matter of interest, are you running on freebsd? Do you experience #226, "on freebsd, python/zope may crash repeatably when browsing diffs or saving certain pages" (lots of aieee error code 10 & 11's in your event log)?

FlorianKonnertz, 2003/01/20 06:40 GMT (via web):
Simon: the GooGle? website says that it takes up to 4-6 weeks for new websites to get indexed automatically, and I think I read elsewhere that it comes by every four weeks on average.

I use Linux-2.4.20, Zope-2.5.1;

We will upgrade our server soon, so all robots are invited to come ;-)

subtopics:


comments:

hide some functions behind buttons -- Tue, 10 Feb 2004 08:56:04 -0800 reply
If you have certain functions like backlinks, parents, etc. triggered by a button doing a POST, rather than a link, that will keep robots away from those functions... --BillSeitz
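
For illustration, a minimal sketch (the page name and markup are hypothetical, not ZWiki's actual templates); a crawler that only follows GET links will never trigger this:

 <form method="post" action="SomePage/backlinks">
  <input type="submit" value="backlinks">
 </form>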

... -- Mon, 05 Apr 2004 06:20:07 -0700 reply
Just checked my robots.txt again and reread this page. Here's the current robots protocol: http://www.robotstxt.org/wc/exclusion-admin.html (or did I miss the link above?). I'm trying it now inside plone with

User-agent: *
Allow: /
Disallow: /external_edit
Disallow: /recycle_bin/
Disallow: /IssueTracker
Disallow: /FilterIssues
Disallow: /backlinks
Disallow: /diff
Disallow: /sendto_form
Disallow: /subscribeform
Disallow: /login_form
Disallow: /mail_password_form
Disallow: /search_form
Disallow: /enabling_cookies

update of robots.txt -- Mon, 05 Apr 2004 06:21:52 -0700 reply
BTW, how often does google check whether robots.txt has changed?

update of robots.txt --simon, Mon, 05 Apr 2004 09:15:36 -0700 reply
Google appears to check often.. at the beginning of each crawl I think.

useful links --Simon Michael, Fri, 11 Mar 2005 21:50:22 -0800 reply
Jens Vagelpohl wrote:

This is the spec:

http://www.robotstxt.org/wc/norobots.html

Here is a robots.txt validator:

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

Here's a funny one: someone collected all the reckless/useless user agents for exclusion:

http://www.searchenginegenie.com/Dangerous-user-agents.htm

This one explains Slurp-specific extensions:

http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html

my personal robots.txt -- jensn -- Sun, 10 Apr 2005 02:20:41 -0700 reply
This is what my robots.txt looks like for my wiki. Feel free to comment on it. Maybe we could "offer" a general robots.txt here, which fits most ZWiki installations:

Somehow the wildcards (stars, "*") are not rendered in the robots.txt below (maybe someone with more knowledge of zwiki syntax could fix this).

# only Yahoo Slurp supports "Crawl-Delay"
# see http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
User-agent: Slurp
Crawl-Delay: 120
Disallow: 

# only google supports wildcards (*) and end-markers ($)
User-agent: googlebot
Disallow: /*/map
Disallow: /*/subscribeform
Disallow: /*/diff
Disallow: /*/backlinks
Disallow: /*/editform

# default for all others
# not sure about the semantics of this
# I hope it does not mean just "/diff", but that all files called "diff" will not be crawled
User-agent: *
Disallow: map
Disallow: subscribeform
Disallow: diff
Disallow: backlinks
Disallow: editform

Using "Conditional HTTP GET" will help with Google --betabug, Thu, 01 Mar 2007 10:16:18 +0000 reply
The new code that handles "Conditional HTTP GET" for Zwiki (currently in darcs, soon in a .tgz file near you!) will likely also lessen the load due to Google runs. Google does "If-modified-since" requests, and handing it 304 replies with no content when there is no change reduces traffic a lot. Only sometimes will some of the googlebots get old content again unconditionally - maybe when they have a mess in their own database.
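
Roughly, the exchange looks like this (illustrative only; the page name, date and headers are made up):

 GET /SomePage HTTP/1.1
 Host: zwiki.org
 If-Modified-Since: Thu, 22 Feb 2007 10:00:18 GMT

 HTTP/1.1 304 Not Modified

The 304 reply carries no body, so the bot only re-fetches the full page when it has actually changed.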