Zwiki.org has had reliability problems. This project aims to fix them, making Zwiki more reliable and scalable at the same time.
See also the #232 frequent errors while browsing zwiki.org meta-issue and MemoryUsage.
Needs an update
Symptoms
For zwiki.org users (and zwiki admins in general): here's an overview of known symptoms and what they may mean.
- Proxy Error
This is apache telling you that the zope server is down. It should be back up soon, usually within 15 seconds. See below for some known zope-crashing issues.
- it may have crashed because of a zope or python crashing bug, or because of exceeding available memory, due to something you just did (see if you can reproduce it reliably)
- it may have crashed earlier, or been shut down to reclaim memory, and failed to restart
- a restart may be in progress to upgrade or install development code
- DisconnectedError or MemoryError or "The object at http://zwiki.org/zwiki has an empty or missing docstring"
Due to a large number of pages and high MemoryUsage, zope cannot allocate enough memory to complete the operation. We would like it to restart automatically when this happens, but right now it may not; we are experimenting with ZopeOutOfMemoryKiller and AutoLance.
- "page has missing docstring" error
Happens when you try to view a page as it's being changed. Try again and it should work.
- traceback mentioning Localizer
Seems to be due to ZODB conflict errors, cause unknown. Try again and it should work.
- traceback mentioning ZWikiPage, or a correct page url saying not found
Simon may have installed some broken development code. Try again in a few minutes (and report on GeneralDiscussion).
- traceback mentioning "max recursion limit exceeded" when you change a page
Zwiki's regular expressions break down when pages get very large. We don't see this in practice; if you see it on zwiki.org, please report it on GeneralDiscussion.
- site not responding (but can ping zwiki.org)
- May be another symptom of the memory problem above. Zope should restart soon, try again in a few minutes.
- May be due to some slow operations tying up all available threads; only 2 threads are running at present (to conserve memory). The site may respond slowly or not at all until some operations complete. Slow operations include:
- renaming a page
- rebuilding the site catalog (several minutes, rare)
- upgradeAll operation in progress after zwiki upgrade (several minutes, rare)
- someone using the print method to print a lot of pages at once ? (unknown)
- site not responding (can't ping)
- connectivity problems ? try pinging other sites
- name resolution problem ? see if you can ping 216.17.130.20. May be a temporary outage of your dns provider, or zwiki.org's nameservers, or due to a change in the dns records (should propagate within a day).
- server down ? due to Imeme problems or maintenance..
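The checks in the list above can be walked through in order. Here is a minimal Python sketch of that triage, assuming you have already gathered the observations by hand (ping results, and the HTTP status if there was any response). The function name `diagnose`, its parameters, and the mapping of apache's Proxy Error page to HTTP status 502 are assumptions for illustration, not part of the zwiki.org setup.

```python
def diagnose(other_sites_ok, ip_ok, name_ok, http_status):
    """Map manual observations to the likely cause from the symptom list.

    other_sites_ok: can you ping some unrelated site?
    ip_ok:          can you ping 216.17.130.20?
    name_ok:        can you ping zwiki.org by name?
    http_status:    HTTP status returned by zwiki.org, or None if no response.
    """
    if not other_sites_ok:
        return "connectivity problem on your side"
    if not ip_ok:
        return "server down or unreachable (hosting problems or maintenance)"
    if not name_ok:
        return "name resolution problem (dns outage or record change)"
    if http_status is None:
        return "zope not responding (memory problem or slow operations)"
    if http_status == 502:  # assumed status of apache's Proxy Error page
        return "zope server down, should restart soon"
    return "site is responding"


# Example: the ip answers pings but the name does not resolve.
print(diagnose(other_sites_ok=True, ip_ok=True, name_ok=False, http_status=None))
```

The order of the tests matters: each one only means something once the earlier, more basic checks have passed.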
Problems
- #351 zwiki.org's zope server leaks memory (and hangs/restarts when it reaches quota)
find out what causes leakage (make LeakFinder work)
make sure our normal peak memory usage is within quota (MemoryUsage)
- #226 on freebsd, python/zope may crash repeatably when browsing diffs or saving certain pages
this site's python has been patched so it won't crash
- #395 zwiki's regular expressions may fail with large pages/sites
Zwiki.org seems free from this at present. Most likely to occur when GeneralDiscussion grows huge.
- #341 a googlebot or other crawl repeatedly crashes zwiki.org
bots crawl the wiki and trigger the other problems. All bots except google are currently forbidden in robots.txt; specific problem bots can be blocked in apache httpd.conf. see TheRobotProblem
- (not on this site: error code 136/SIGFPE - can happen after adding all zope objects (not just zwiki pages) to a wiki's catalog. A zope bug.)
- multithreading - site can hang up easily due to long-running operations and only 2 threads ?
we are running only 2 threads to conserve memory
- images are not cached as much as they should be ?
apache caches images from zope, but browsers don't appear to be caching images from apache
- the site favicon needs improving - transparency problem ?
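The robots.txt policy mentioned under #341 (all bots except google forbidden) can be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt text below is an assumed reconstruction of such a policy, not a copy of zwiki.org's actual file, and the user-agent names are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy: allow Googlebot everywhere, forbid every other bot.
ASSUMED_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def bot_allowed(user_agent, url, robots_txt=ASSUMED_ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(bot_allowed("Googlebot", "http://zwiki.org/FrontPage"))
print(bot_allowed("SomeOtherBot", "http://zwiki.org/FrontPage"))
```

This only models bots that honour robots.txt; a crawler that ignores it still has to be blocked in apache httpd.conf, as noted above.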
Things I'd like to know:
- what's current zope memory usage ?
no working top on this system, so:
while true; do ps aux |egrep 'games.*z2'|egrep -v egrep; sleep 2; done
- when does memory usage increase ?
see MemoryUsage
- when does memory leak ?
http://zwiki.org/Control_Panel/DebugInfo/manage_main
http://zwiki.org/Control_Panel/LeakFinder/manage_usage:
Traceback (innermost last):
  Module ZPublisher.Publish, line 98, in publish
  Module ZPublisher.mapply, line 88, in mapply
  Module ZPublisher.Publish, line 39, in call_object
  Module Shared.DC.Scripts.Bindings, line 252, in __call__
  Module Shared.DC.Scripts.Bindings, line 283, in _bindAndExec
  Module App.special_dtml, line 174, in _exec
  Module DocumentTemplate.DT_With, line 61, in render
  Module DocumentTemplate.DT_Util, line 201, in eval
   - __traceback_info__: REQUEST
  Module <string>, line 0, in ?
  Module Products.LeakFinder.LeakFinder, line 240, in manage_getSample
  Module Products.LeakFinder.LeakFinder, line 175, in getControlledRefcounts
  Module Products.LeakFinder.LeakFinder, line 188, in resetCache
TypeError: function takes at most 2 arguments (3 given)
- what happens before a leak ?
see #351 zwiki.org's zope server leaks memory (and hangs/restarts when it reaches quota)
- when does memory hit the quota ?
see testzope.log left by the ZopeOutOfMemoryKiller. Now, possibly grep for AutoLance in events.log, or watch the zopemem monitor script in a shell window. Otherwise we are relying on user reports. Would be great if AutoLance could be patched for freebsd.
- what should be the normal maximum memory usage for my server ?
27M (base zope usage) + (P (number of pages) * 20K (average page size) + overhead) * 2 (threads). see MemoryUsage
- when is the site down ?
uptime: http://uptime.openacs.org/uptime/reports.tcl?monitor_id=5180 detects http outages >15 minutes
netcraft: http://uptime.netcraft.com/up/graph?site=http%3A//zwiki.org/ , shows host uptime (pingable)
- when and why does zope crash/restart ?
event.log and control panel's uptime will tell when. As of 2003/08/08, zope should stay up except when memory usage grows too large, the server deadlocks, and (hopefully) AutoLance restarts it.
- what happens before a crash ?
should be no crashes right now
- when does zope stop responding but fail to restart ?
when we exceed our memory quota, sometimes; AutoLance may or may not prevent this. No way to detect this at present, except through user reports.
- when are we left with the storage server running but no zeo client ?
Imeme's 124.testzope monitor script detects this - but not always, why ? How often does it run ?
- which operations take a long time ?
http://zwiki.org/server-status
http://zwiki.org/Control_Panel/CallProfiler/configureForm
- AllPages
- search field and SearchPage
- changing a large page (due to text formatting and pre-linking)
- renaming
- processing a mailin (in addition to the above ? curl process shows heavy cpu usage for some reason)
- the experimental /print method, called on a high node
- rebuilding the catalog
- packing the zodb
- when did someone see one of these errors ?
/var/log/apache/httpd-error.log
webalizer's Hits by Response Code
- how badly do these affect other users ?
- what's our hit rate ?
http://zwiki.org/server-status
webalizer reports at zwiki.org/logs/zwiki.org (robots keep out) (/var/log/imeme/zwiki.org)
- when are the webalizer logs updated ?
nightly
- when are we being hit by a bot ?
when it shows up in Z2.log or httpd_access.log
when bots and/or large traffic from a single host show up in the webalizer report
- when are these things changing suddenly ?
- how are these things changing over time ?
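The memory questions above (current usage, and the expected maximum of 27M base plus pages times 20K plus overhead, for each of 2 threads) can be combined into a quick check. A minimal Python sketch, assuming `ps aux` prints RSS in kilobytes in its sixth column and that K and M are 1024-based units; the function names, the sample ps line, and the page count are made up for illustration.

```python
BASE_ZOPE_MB = 27.0   # base zope usage, from the estimate above
AVG_PAGE_KB = 20.0    # average page size, from the estimate above


def expected_max_mb(pages, overhead_mb=0.0, threads=2):
    """27M + (pages * 20K + overhead) * threads, in megabytes."""
    per_thread_mb = pages * AVG_PAGE_KB / 1024.0 + overhead_mb
    return BASE_ZOPE_MB + per_thread_mb * threads


def rss_mb(ps_aux_line):
    """Extract resident memory (MB) from one line of `ps aux` output."""
    fields = ps_aux_line.split()
    return int(fields[5]) / 1024.0   # column 6 is RSS, assumed to be in KB


# Hypothetical sample line and page count, for illustration only.
sample = "games 1234 0.0 12.5 81920 65536 ?? S 9:00AM 1:23.45 python z2.py"
budget = expected_max_mb(pages=1024)
print("using %.1fM of an expected maximum of %.1fM" % (rss_mb(sample), budget))
```

Feeding it the lines printed by the ps loop above would show how close zope is to the estimate; the overhead term is left at zero here because its real size is not known.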
Information sources
- ps
- zope & apache logs
- webalizer logs
- apache server-status
- zope control panel
- zope error_log
- uptime
- netcraft
- here's a list of functional SiteTests
http://zope.org/Members/tseaver/Projects/HighlyAvailableZope/FrontPage
Discussion: see GeneralDiscussion