web gateway memory grows without bound under load #891

Open
opened 2010-01-10 06:16:08 +00:00 by zooko · 6 comments
zooko commented 2010-01-10 06:16:08 +00:00
Owner

I watched as two allmydata.com web gateways slowly grew to multiple GB of RAM while consuming max CPU. I kept watching until their behavior killed my ssh session. Fortunately I left a `flogtool tail` running, so we got to capture one's final minutes. It looks to me like a client is able to initiate jobs faster than the web gateway can complete them, and the client kept this up at a steady rate until the web gateway died.
tahoe-lafs added the code-frontend-web, critical, defect, 1.5.0 labels 2010-01-10 06:16:08 +00:00
tahoe-lafs added this to the undecided milestone 2010-01-10 06:16:08 +00:00
zooko commented 2010-01-10 06:18:37 +00:00
Author
Owner

**Attachment** dump.flog.bz2 (86911 bytes) added: `flogtool tail --save-as=dump.flog` of the final minutes of the web gateway's life
zooko commented 2010-01-10 06:26:09 +00:00
Author
Owner

**Attachment** dump-2.flog.bz2 (32391 bytes) added: Another `flogtool tail --save-as=dump-2.flog` run, which *overlaps* with the previous one (dump.flog) but has different contents...
zooko commented 2010-01-10 06:28:56 +00:00
Author
Owner

So while I was running `flogtool tail --save-as=dump.flog` I started a *second* tail, like this: `flogtool tail --save-as=dump-2.flog`. Here is the result of that second tail, which confusingly doesn't seem to contain a contiguous subset of the first, although maybe I'm just reading it wrong.
tahoe-lafs modified the milestone from undecided to 1.7.0 2010-02-27 09:07:13 +00:00
tahoe-lafs modified the milestone from 1.7.0 to soon 2010-06-16 03:58:49 +00:00
warner commented 2010-06-19 18:16:05 +00:00
Author
Owner

Incidentally, the best way to grab logs from a doomed system like this is to get the target node's logport.furl (from `BASEDIR/private/logport.furl`) and then run the `flogtool tail` command from another computer altogether. That way the flogtool command isn't competing with the doomed process for memory. You might have done it this way... it's not immediately obvious to me.

I'll take a look at the logs as soon as I can.
zooko commented 2010-06-21 20:35:48 +00:00
Author
Owner

No, I ran `flogtool tail` on the same system. If I recall correctly the system had enough memory available; it was just that the Python process was approaching its 3 GB limit (a per-process VM limit which I forget the reason for).
warner commented 2012-05-23 00:14:29 +00:00
Author
Owner

Hm, assuming we can reproduce this after two years, and assuming there's no bug causing pathological memory leaks, what would be the best sort of fix? We could impose an arbitrary limit on the number of parallel operations that the gateway is willing to perform. Or (on some OSes) have it monitor its own memory usage and refuse new operations when the footprint grows above a certain threshold. Both seem a bit unclean, but might be practical.

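To make the first option concrete, here is a minimal sketch assuming Twisted (which the gateway already uses); the helper name `run_limited`, the thresholds, and the RSS check are illustrative assumptions, not an existing Tahoe-LAFS API:

```python
# Hypothetical sketch, not Tahoe-LAFS code: cap the number of in-flight gateway
# operations with a DeferredSemaphore, and refuse new work once the process's
# memory footprint passes a configurable threshold.
import resource

from twisted.internet import defer

MAX_PARALLEL_OPERATIONS = 50       # arbitrary cap (first option above)
MAX_RSS_BYTES = 2 * 1024 ** 3      # refuse new work above ~2 GiB (second option)

_op_semaphore = defer.DeferredSemaphore(MAX_PARALLEL_OPERATIONS)


def _rss_bytes():
    # ru_maxrss is the peak resident set size, reported in KiB on Linux; this is
    # only a rough, OS-dependent proxy for current memory usage.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024


def run_limited(operation, *args, **kwargs):
    """Run 'operation' (a callable returning a Deferred) under the cap.

    Callers beyond the cap queue on the semaphore instead of each holding a
    whole upload/download's state in memory at once.
    """
    if _rss_bytes() > MAX_RSS_BYTES:
        return defer.fail(RuntimeError("gateway overloaded: memory threshold exceeded"))
    return _op_semaphore.run(operation, *args, **kwargs)
```

`DeferredSemaphore.run` queues excess callers rather than rejecting them, so the cap turns unbounded memory growth into bounded latency; the RSS check is the cruder of the two ideas and, as noted, only workable on some OSes.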