Submitted by : simon at: 2003-10-26T21:31:27+00:00 (14 years ago)
Name :
Category : Severity : Status :
Optional subject :  
Optional comment :

This is really two issues: zwiki.org's zope is leaking memory (1), and it stops responding when it reaches this server's 100Mb hard memory quota (2).

Currently I am working around this by restarting the server at five minutes to the hour with a cron job. This can happen in as little as a second or three but on this server it normally takes about 15 seconds (3) (due to various conflict errors & cruft in the zodb ?). During this time a visitor will see Proxy Error; I see this quite often. Also if a visitor happens to be the first hit after a restart there will be a ~10s delay while zope initializes it's zodb cache. To avoid this I have a another cron job doing a get at three minutes to the hour.

What's leaking ? Could be more than one thing.


JustinClift wrote, 2002/11/05 19:33 GMT (via web):

Is this a Zwiki, Zope, OS, etc problem?

Am concerned if this means we'll have to setup a cronjob for the server that Zwiki is installed on too.

Regards and best wishes,

Justin Clift


JustinClift, 2002/11/12 03:44 GMT (via web):
Is it practical to create a robots.txt that completely turns off Robot searching activity for the Zwiki site whilst the memory problems are occuring?

Also, the new search engine (www.teoma.com) are looking for pointers with things. If you've got suggestions for how robots should handle Zwiki's, then they're likely to listen closely. :)


Simon Michael, 2002/11/12 20:53 GMT (via mail):
Yes, it was set up this way until a month or two ago, thanks for reminding me. I'm interested in hardening zwiki in these situations but if it gets bad again I'll do that, until there's something new to try with this freebsd python issue.


Simon Michael, 2002/11/12 20:54 GMT (via mail):

> until there's something new to try with this freebsd python issue.

I meant: until we get some new solution for this memory leakage issue, as well as the freebsd python crashing issue.


JustinClift, 2002/11/14 09:17 GMT (via web):
Is this something that the PatchPython213 page addresses?


SimonMichael, 2002/11/14 20:41 GMT (via web):
Yes (if this = the crashing issue), but it didn't work for me. See #226 on freebsd, python/zope may crash repeatably when browsing diffs or saving certain pages.

SimonMichael, 2002/11/26 03:26 GMT (via web):
LeakFinder? is a new product that may help here, but it doesn't yet work for me (incompatible with my zope's PickleCache?).

I sometimes see 10-15Mb leaked when a mailin is being processed (not always). I suspect that simultaneous mailins/conflict errors may be part of the problem.

SimonMichael, 2002/12/03 21:24 GMT (via web):
We were getting a lot of memory errors due to another google crawl. I've disallowed google in robots.txt until this situation has improved.

SimonMichael, 2002/12/03 22:05 GMT (via web):
But also I see a rogue "SuperBot?" at 61.70.73.53 is repeatedly crawling the entire wiki at about one page per second.

This one appears to just be accessing just the normal page urls - I would think it's someone trying to take a snapshot from AllPages or something like that. I wouldn't expect this to trigger memory leaks. It's possible that simply accessing a lot of pages fast is growing zope's cache too large, and I did increase it to a larger target size recently (8000). I'll try reducing it again.

I also tried denying this host in apache's httpd.conf but haven't got it working yet.

PieterB, 2002/12/04 11:25 GMT (via web):
Simon, have a look at http://httpd.apache.org/docs/howto/auth.html#allowdeny Also look at: http://www.evolt.org/article/Using_Apache_to_stop_bad_robots/18/15126/ for more info on controlling bat robots....

PieterB, 2002/12/04 11:30 GMT (via web):
And also: http://evolt.org/article/Stopping_Spambots_II_The_Admin_Strikes_Back/18/21392/index.html for blocking bad bots' IP's...

SimonMichael, 2002/12/11 19:47 GMT (via web):
Usually when we hit the quota, a MemoryError? appears in the logs and the user will see either a Memory Error, Client Disconnected error, or the server will appear to be hung. Other pages may continue to work. The server will remain in this state until the ZEO connection fails, then the client dies and is restarted:

 2002-12-11T17:35:15 ERROR(200) zrpc:31994 Error caught in asyncore
 Traceback (innermost last):
  Module ZServer.medusa.asyncore, line 94, in poll
  Module ZServer.medusa.asyncore, line 395, in handle_read_event
  Module ZEO.zrpc.smac, line 144, in handle_read
  Module ZEO.zrpc.connection, line 198, in message_input
  Module ZEO.zrpc.log, line 60, in short_repr
 MemoryError

SimonMichael, 2002/12/12 00:29 GMT (via web):

I'm now using Control Panel -> Debug Information -> Cache extreme detail, and finding it fascinating. (!)

Try it some time, you'll see all your wiki pages in there. As expected, on the first page render zwiki tends to load all of a wiki's pages into cache, as it figures out freeform links etc, and if your target cache size is large enough it will do just that. (And if it isn't, presumably your pages would render more slowly.)

By rough calculation, when the above happens I see the zope process grow from ~25M to ~50M, and there are 1500 pages, so each zwiki page I have cached requires about 16K of memory on average. That sounds right.

So, if it's mainly serving wiki pages I might expect zwiki.org's zope process to run at about 25Mb (overhead, misc objects) + 16K * my target cache size. Or not, since some number of cached objects will not be zwiki pages. But take it as a rule of thumb. Then presumably any further growth comes from leaks.

This site's TCS has been 4000 so the rule of thumb would predict 89Mb. That is, perhaps coincidentally, the sort of number it wanted to be at earlier today. I'll reduce the TCS and just for interest, I'll try and fill this out to get a picture of what normal memory use looks like:

         memory
         usage
 target  after       after           after
 cache   TestPage    RecentChanges   TestPage
 size    render      render          edit

 4000                                89?
 2000    55                          55
 1000    47          54              54

2002/12/12 02:29 GMT (via web):
Big progress

Ahh, but there's more (why do I get the feeling this is all written down beautifully somewhere).. each server/zodb thread has it's own cache, so you could have multiple complete copies of your wiki in memory. New threads up to the maximum (4 by default) are created only when needed due to simultaneous web connections, then they are kept around for reuse. This is a big clue - what looked like intermittent memory leaks on zwiki.org were actually normal thread/cache growth when simultaneous hits happened. Two together would happen pretty quickly, three once in a while and four rarely. Depending on my target cache size, three or four threads would push us over quota.

After some experimentation while watching top, ab, logs, debug info and cache parameters - noting that caches can grow large momentarily, and that 500 is too small to serve the wiki at normal speed, it seems that a good setting for the present site (1500 pages) to run well within the 100Mb limit is 2 threads, target cache size 3000 each. So far zope seems stable at 82M with the complete wiki loaded in both caches and room to spare.

Dare I say it ? Welcome to the new era of tuned zwiki.org ? And I'll see this process again in the morning ?

That's 93592.

DOH! Forgot to turn off the hourly cron killer. That's 181.

SimonMichael, 2002/12/12 16:36 GMT (via web):
Woohoo! :

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
  181 games      2  20    98M 98132K poll    29:58  2.69%  2.69% python

  * Zope version: (Zope 2.6.0 (source release, python 2.1, linux2), python 2.1.3, freebsd4)
  * Python version: 2.1.3 (#1, Dec 11 2002, 10:14:59) [GCC 2.95.3 [FreeBSD] 20010315 (release)]
  * System Platform: freebsd4
  * SOFTWARE_HOME: /usr/local/zope/2.6.0/lib/python
  * INSTANCE_HOME: /usr/local/zope/instance
  * CLIENT_HOME: /usr/local/zope/instance/var
  * Process ID: 181 (155698176)
  * Running for: 13 hours 40 min 7 sec

SimonMichael, 2002/12/12 16:48 GMT (via web):
If you know zope tuning and can suggest improvements, please do.

SimonMichael, 2002/12/13 17:39 GMT (via web):
It inched up past quota overnight:

 PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 181 games     74  20   102M   100M RUN     62:24  5.81%  5.81% python

 Zope Version  (Zope 2.6.0 (source release, python 2.1, linux2), python 2.1.3,
               freebsd4)                                                    
 Python        2.1.3 (#1, Dec 11 2002, 10:14:59) [GCC 2.95.3 [FreeBSD] 20010315
 Version       (release)]     
 System        freebsd4                             
 Platform                                   
 SOFTWARE_HOME /usr/local/zope/2.6.0/lib/python
 ZOPE_HOME     /usr/local/zope/2.6.0
 INSTANCE_HOME /usr/local/zope/instance      
 CLIENT_HOME   /usr/local/zope/instance/var
 Process Id    181 (155698176)                
 Running For   1 day 14 hours 24 min 6 sec    

I've reduced TCS to 2500.

watchdog script set up --SimonMichael, 2003/02/24 08:00 GMT
We are frequently growing past memory quota once again. I reduced target cache size to 2000 today. Also a cron job is now checking zope's memory usage every minute and should restart it when it's over quota.

repost from GD, for the record --Simon Michael, 2003/02/27 17:32 GMT
zwiki-wiki@zwiki.org (SimonMichael) writes:
> We are being hammered by "Webreaper" from 200.131.37.198 right now, ignoring
> all robot directives. (And the occasional genteel probe from googlebot).
>
> The "out of memory killer" - a new cron job run every minute which restarts
> zope if memory usage has grown too close to quota - seems to mostly be working
> well. Here's the log from the last couple of days::
>
> 2003/02/25 03:01:01 GMT *** zope too big, restarting
> 2003/02/25 07:07:00 GMT *** zope too big, restarting
> 2003/02/25 15:07:00 GMT *** zope too big, restarting
> 2003/02/25 20:00:10 GMT *** zope too big, restarting
> 2003/02/26 05:54:00 GMT *** zope too big, restarting
> 2003/02/26 08:17:00 GMT *** zope too big, restarting
> 2003/02/26 08:27:00 GMT *** zope too big, restarting
> 2003/02/26 09:05:00 GMT *** zope too big, restarting
> 2003/02/26 15:11:00 GMT *** zope too big, restarting
> 2003/02/26 15:59:01 GMT *** zope too big, restarting
> 2003/02/26 19:40:00 GMT *** zope too big, restarting
> 2003/02/27 09:30:04 GMT *** zope too big, restarting
> 2003/02/27 13:38:00 GMT *** zope too big, restarting
> 2003/02/27 15:44:00 GMT *** zope too big, restarting
>
> I've noticed this often restarting zope amazingly fast - 10 seconds! It's just
> doing an orderly stop & start of the ZEO client. I think in the past I've
> relied on the imeme emergencyzopestart.sh which kills all.
>
> So, although this is a workaround for the unknown memory leak(s), it should
> mean you're far less likely to see site errors due to #351 zwiki.org's zope server leaks memory (and hangs/restarts when it reaches quota). The only
> fly in the ointment is that the OOM killer failed to restart zope last time,
> after 10 minutes, not sure why.

leakfinder help ? --SimonMichael, 2003/02/27 17:35 GMT
By the way, to tackle the leakage problem LeakFinder? would be really helpful, but it's never worked for me. It complains about wrong number of arguments in cPickleCache or some such. Post if you have any clues.


comments:

AutoLance? --SimonMichael, Tue, 05 Aug 2003 17:39:26 +0000 reply
Cool, this might be better than my home-grown script: http://www.zope.org/Members/mcdonc/Products/AutoLance

AutoLance? --SimonMichael, Tue, 05 Aug 2003 17:47:22 +0000 reply
:

 Product to restart the Zope server when:

 - the asyncore socket_map size meets or exceeds a user-defined maximum
   size and at the same time the select loop times out.  This situation
   indicates a deadlock, perhaps caused by irreconcilable database
   connection consumption or leakage.

 OR

 - the process memory size in bytes exceeds the value of the ZOPE_MEMLIMIT
   variable size.  Limiting the memory size prevents swapping or
   other undesirable behavior.  When ZOPE_MEMLIMIT envvar is set to 0,
   this limit is not enforced.  This feature only works on Linux platform
   currently.

Dang it! Still, this sounds much better.

resolved, optimization continues --SimonMichael, Tue, 19 Aug 2003 06:04:01 +0000 reply
This has been resolved, for now, by running just one thread on zwiki.org so we stay within our memory quota. This may cause occasional long delays.. but not often as far as I can see. With this issue closed, the site is now basically crash- and hang- free. Optimization work continues on MemoryUsage.

property change --SimonMichael, Tue, 19 Aug 2003 06:04:35 +0000 reply
Status: open => closed