web gateway memory grows without bound under load #891

Open
opened 2010-01-10 06:16:08 +00:00 by zooko · 6 comments
zooko commented 2010-01-10 06:16:08 +00:00
Owner

I watched as two allmydata.com web gateways slowly grew to multiple GB of RAM while consuming max CPU. I kept watching until their behavior killed my ssh session. Fortunately I left a `flogtool tail` running, so we got to capture one's final minutes. It looks to me like a client is able to initiate jobs faster than the web gateway can complete them, and the client kept this up at a steady rate until the web gateway died.
tahoe-lafs added the code-frontend-web, critical, defect, 1.5.0 labels 2010-01-10 06:16:08 +00:00
tahoe-lafs added this to the undecided milestone 2010-01-10 06:16:08 +00:00
zooko commented 2010-01-10 06:18:37 +00:00
Author
Owner

**Attachment** dump.flog.bz2 (86911 bytes) added: `flogtool tail --save-as=dump.flog` of the final minutes of the web gateway's life
zooko commented 2010-01-10 06:26:09 +00:00
Author
Owner

**Attachment** dump-2.flog.bz2 (32391 bytes) added: Another `flogtool tail --save-as=dump-2.flog` run, which *overlaps* with the previous one (dump.flog) but has different contents...
zooko commented 2010-01-10 06:28:56 +00:00
Author
Owner

So while I was running `flogtool tail --save-as=dump.flog` I started a *second* tail, like this: `flogtool tail --save-as=dump-2.flog`. Here is the result of that second tail, which confusingly doesn't seem to contain a contiguous subset of the first, although maybe I'm just reading it wrong.
tahoe-lafs modified the milestone from undecided to 1.7.0 2010-02-27 09:07:13 +00:00
tahoe-lafs modified the milestone from 1.7.0 to soon 2010-06-16 03:58:49 +00:00
warner commented 2010-06-19 18:16:05 +00:00
Author
Owner

Incidentally, the best way to grab logs from a doomed system like this is to get the target node's logport.furl (from `BASEDIR/private/logport.furl`) and then run the `flogtool tail` command from another computer altogether. That way the flogtool command isn't competing with the doomed process for memory. You might have done it this way... it's not immediately obvious to me.

I'll take a look at the logs as soon as I can.
zooko commented 2010-06-21 20:35:48 +00:00
Author
Owner

No, I ran `flogtool tail` on the same system. If I recall correctly the system had enough memory available; it was just that the Python process was approaching its 3 GB limit (a per-process VM limit which I forget the reason for).
warner commented 2012-05-23 00:14:29 +00:00
Author
Owner

Hm, assuming we can reproduce this after two years, and assuming there's no bug causing pathological memory leaks, what would be the best sort of fix? We could impose an arbitrary limit on the number of parallel operations that the gateway is willing to perform. Or (on some OSes) have it monitor its own memory usage and refuse new operations when the footprint grows above a certain threshold. Both seem a bit unclean, but might be practical.

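To make the first option concrete, here is a minimal sketch assuming Twisted (which the gateway already uses); the helper name `run_limited`, the thresholds, and the RSS check are illustrative assumptions, not an existing Tahoe-LAFS API:

```python
# Hypothetical sketch, not Tahoe-LAFS code: cap the number of in-flight gateway
# operations with a DeferredSemaphore, and refuse new work once the process's
# memory footprint passes a configurable threshold.
import resource

from twisted.internet import defer

MAX_PARALLEL_OPERATIONS = 50       # arbitrary cap (first option above)
MAX_RSS_BYTES = 2 * 1024 ** 3      # refuse new work above ~2 GiB (second option)

_op_semaphore = defer.DeferredSemaphore(MAX_PARALLEL_OPERATIONS)


def _rss_bytes():
    # ru_maxrss is the peak resident set size, reported in KiB on Linux; this is
    # only a rough, OS-dependent proxy for current memory usage.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024


def run_limited(operation, *args, **kwargs):
    """Run 'operation' (a callable returning a Deferred) under the cap.

    Callers beyond the cap queue on the semaphore instead of each holding a
    whole upload/download's state in memory at once.
    """
    if _rss_bytes() > MAX_RSS_BYTES:
        return defer.fail(RuntimeError("gateway overloaded: memory threshold exceeded"))
    return _op_semaphore.run(operation, *args, **kwargs)
```

`DeferredSemaphore.run` queues excess callers rather than rejecting them, so the cap turns unbounded memory growth into bounded latency; the RSS check is the cruder of the two ideas and, as noted, only workable on some OSes.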