reducing memory footprint in share reception #97

Open
opened 2007-08-10 02:29:40 +00:00 by warner · 5 comments
warner commented 2007-08-10 02:29:40 +00:00
Owner

One source of large memory footprint is the known issue with twisted.web's POST handling, addressed in #29. This ticket is not about that.

The memory footprint as reported by 'make check-memory' for uploading large files directly from disk is currently about 29MB. This seems acceptable for now. This test does not allow the uploading node to hold its own shares.

The "upload-self" series of tests does allow the node to hold its own shares, and this increases the memory footprint considerably (it seems to stabilize at 59MB). I have two theories on this, both of which center on Foolscap's behavior:

  • Foolscap might not be freeing the arguments of inbound method calls quickly enough: perhaps the arguments are involved in a reference cycle of some sort. This could cause the shares to be kept around for a while, gradually accumulating memory footprint.
  • Foolscap's loopback broker (which is used when you ask the Tub to connect to itself) might have some similar misbehavior, causing the tokens that normally get sent over the wire to be retained longer than expected.

The first theory should be exercised by writing a variant of the check-memory test which, instead of asking the isolated process-under-test to send files, has it receive shares, and watches its memory usage as it receives large numbers of shares of various sizes. If the first theory is correct, then I'd expect its memory usage to be high when given small numbers (10s) of large (1MB) shares, since gc isn't going to kick in until we've seen hundreds of objects get allocated and freed. Large numbers (1000s) of small (10-100B) shares ought to trigger gc before the overhead gets too high.
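
A receive-side memory test needs a cheap way to sample the footprint of the process-under-test after each batch of shares. Here is a minimal sketch (not the existing check_memory.py code; `push_shares` and `node_pid` are hypothetical test helpers):

```python
import os

def get_rss_bytes(pid=None):
    """Return the resident set size of `pid` (default: this process) in bytes."""
    pid = os.getpid() if pid is None else pid
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # /proc reports kB
    raise RuntimeError("VmRSS not found in /proc/%d/status" % pid)

# Hypothetical usage: push batches of shares at the node-under-test and
# record its footprint after each batch.
# for count, size in [(10, 1000000), (1000, 100)]:
#     push_shares(count, size)
#     print(count, size, get_rss_bytes(node_pid))
```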

If that theory doesn't hold up, then work on the second theory should start by replacing the code in `broker.LoopbackTransport.write` with a queue.append. The theory is that the naive `eventually(self.peer.dataReceived, data)` puts dozens of calls (plus the shares themselves) on the eventual-send queue, and then `foolscap.eventual._SimpleCallQueue._turn` isn't freeing the queue until all events have been processed. The second change to try out is to modify that `_turn` method to pop events off the queue as they are processed, to minimize the lifetime of those shares.
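
To make the second theory concrete, here is a hedged sketch of both proposed changes. This is not the actual foolscap source; the class and method names come from the ticket, but the internals (`_outbound`, `self._events`, the event-tuple layout) are assumptions:

```python
from foolscap.eventual import eventually

class LoopbackTransport:
    """Sketch: queue outbound chunks and drain them in one eventual call,
    instead of scheduling eventually(self.peer.dataReceived, data) per chunk."""
    def __init__(self, peer):
        self.peer = peer
        self._outbound = []        # assumed buffer, drained on a later turn
        self._flush_pending = False

    def write(self, data):
        self._outbound.append(data)
        if not self._flush_pending:
            self._flush_pending = True
            eventually(self._flush)

    def _flush(self):
        self._flush_pending = False
        while self._outbound:
            # pop as we go, so each chunk becomes garbage as soon as the
            # peer has consumed it
            self.peer.dataReceived(self._outbound.pop(0))

# Corresponding idea for _SimpleCallQueue._turn (again an assumption about
# its internals): pop each event off the queue before running it, so the
# call's arguments -- which may be whole shares -- can be freed as soon as
# that call returns, rather than when the whole queue has been processed.
def _turn(self):
    while self._events:
        cb, args, kwargs = self._events.pop(0)
        cb(*args, **kwargs)
```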

I also have some suspicions about the way that nested functions work: do they capture the state of all local variables, or just the ones that are referenced in their body? Is there some introspection trick (like `locals()` or `dir()` or something) that would obligate Python to keep all local variables alive? Either way, we need to be aggressive about using `del` to get rid of shares/blocks the moment they are no longer necessary.
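
As a small illustration of the `del` discipline (hypothetical method and helper names, not existing Tahoe code): drop local references to large buffers as soon as they have been handed off, instead of letting them live until the whole function returns.

```python
def _encode_and_push(self, segment):
    blocks = self._encoder.encode(segment)   # hypothetical encoder call
    del segment                              # plaintext no longer needed here
    d = self._push_blocks(blocks)            # hypothetical sending helper;
                                             # its callbacks keep what they need
    del blocks                               # drop this frame's reference too
    return d
```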

Also, Deferreds do not do tail recursion, so any chains that involve things like `defer.succeed` will cause `d.addCallback` to run synchronously, and can result in very long call stacks (hundreds of frames long). This means that all the variables that were expected to go out of scope might stick around for a surprisingly long time, and when their lifetimes overlap, that increases the memory footprint. Having methods always return a Deferred (and using `defer.succeed` when the method is in fact synchronous) is an important pattern, because it improves flexibility later: "Premature synchronization is the root of all evil (in an event-driven world)". To balance these concerns, I think that `foolscap.eventual.fireEventually` can be used as a "reactor turn barrier", to ensure that the call chain is cut short. Alternatively, maybe we should investigate Deferred and see how it might be made to implement tail recursion properly.
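
A sketch of the "reactor turn barrier" idea (only `fireEventually` is real foolscap API; the other names are hypothetical): chaining a `fireEventually()` between steps forces everything added after it to run on a later reactor turn, which cuts the synchronous call stack short and lets the earlier frames' locals be collected.

```python
from foolscap.eventual import fireEventually

def process_segments(segments):
    d = fireEventually()    # start the chain on a fresh reactor turn
    for seg in segments:
        # seg=seg binds the current value; see the closure discussion above
        d.addCallback(lambda ign, seg=seg: handle_one_segment(seg))  # hypothetical
        d.addCallback(lambda ign: fireEventually())  # turn barrier between segments
    return d
```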

Another direction that might be promising would be to implement custom foolscap Slicers and Unslicers for blocks. I think we'd have to write the shares to disk during encoding, but if we did that then a Slicer could read just a small amount of data from disk each time the transport buffer opened up (using the producer/consumer model). This could effectively push the memory footprint off to disk.
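
The general shape of that producer/consumer approach, sketched with Twisted's pull-producer interface rather than the actual foolscap Slicer API (which is not shown here): read one small chunk from disk each time the transport has room, so only one chunk of the share is in memory at a time.

```python
from zope.interface import implementer
from twisted.internet.interfaces import IPullProducer

@implementer(IPullProducer)
class ShareFileProducer:
    """Feed a share file to a consumer (e.g. a transport) in small chunks."""
    CHUNK_SIZE = 64 * 1024

    def __init__(self, filename, consumer):
        self._f = open(filename, "rb")
        self._consumer = consumer
        consumer.registerProducer(self, streaming=False)

    def resumeProducing(self):
        # called by the consumer whenever its buffer has drained
        data = self._f.read(self.CHUNK_SIZE)
        if data:
            self._consumer.write(data)
        else:
            self._consumer.unregisterProducer()
            self._f.close()

    def stopProducing(self):
        self._f.close()
```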

I think that in the long run, we'll be able to closely model the memory footprint that each concurrent upload represents (probably as a linear function of `min(filesize, max_segment_size)`), and then we can build an upload scheduler that sequences the upload work to keep our memory footprint below some configured ceiling. I.e. you can do many small uploads at once, but only a certain number of bigger ones; and once a file gets larger than 1MB it doesn't matter how big it is, because we're only processing a single segment at a time.
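
A sketch of such a scheduler (the cost model and all names are assumptions, not existing Tahoe code): estimate each upload's footprint from `min(filesize, max_segment_size)` and only start an upload while the sum of the estimates stays under the configured ceiling.

```python
from collections import deque

class UploadScheduler:
    def __init__(self, ceiling_bytes, max_segment_size=1024*1024,
                 per_upload_overhead=1024*1024, expansion=3.3):
        self.ceiling = ceiling_bytes
        self.max_segment_size = max_segment_size
        self.overhead = per_upload_overhead  # fixed per-upload cost (assumed)
        self.expansion = expansion           # rough factor for encoding buffers (assumed)
        self.in_use = 0
        self.waiting = deque()

    def _estimate(self, filesize):
        # memory cost is linear in the segment size, not the file size
        seg = min(filesize, self.max_segment_size)
        return self.overhead + int(seg * self.expansion)

    def add_upload(self, filesize, start_upload):
        """start_upload() must return a Deferred that fires when the upload is done."""
        self.waiting.append((self._estimate(filesize), start_upload))
        self._maybe_start()

    def _maybe_start(self):
        while self.waiting and self.in_use + self.waiting[0][0] <= self.ceiling:
            cost, start_upload = self.waiting.popleft()
            self.in_use += cost
            d = start_upload()
            d.addBoth(self._upload_done, cost)

    def _upload_done(self, result, cost):
        self.in_use -= cost
        self._maybe_start()
        return result
```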

tahoe-lafs added the code, major, defect, 0.4.0 labels 2007-08-10 02:29:40 +00:00
tahoe-lafs added this to the undecided milestone 2007-08-10 02:29:40 +00:00
warner commented 2007-08-11 01:43:55 +00:00
Author
Owner

There are some useful notes about simultaneous-peer share pushing and the way that foolscap uses memory in #29, even though the ticket as a whole is focusing on the web-POST problem that was just resolved.

tahoe-lafs removed the code label 2007-08-14 18:55:18 +00:00
tahoe-lafs changed title from reducing memory footprint to reducing memory footprint in share reception 2007-08-16 17:49:15 +00:00
warner commented 2007-08-16 17:56:32 +00:00
Author
Owner

One data point: I observed the test framework process for the 100MB check_memory.py 'upload' test to hit 600MB. This framework process holds 4 or 5 nodes and is receiving all of the shares emitted by the program-under-test. It is the program-under-test whose memory footprint is stable at 29MB.

This was bad enough that I had to disable the 100MB test. The buildslave that hosts this test is on a somewhat memory-constrained VM, and that 600MB RSS caused the rest of the machine to thrash and choke.

tahoe-lafs added this to the undecided milestone 2008-03-21 22:32:12 +00:00
tahoe-lafs modified the milestone from eventually to undecided 2008-06-01 21:02:50 +00:00
zooko commented 2009-05-04 17:46:02 +00:00
Author
Owner

If you love this ticket (#97), then you might like tickets #54 (port memory usage tests to windows), #419 (pycryptopp uses up too much RAM), #478 (add memory-usage to stats-provider numbers), and #277 (our automated memory measurements might be measuring the wrong thing).

tahoe-lafs added the code label 2009-12-04 05:13:55 +00:00
davidsarah commented 2011-07-23 02:07:57 +00:00
Author
Owner

warner wrote:

> I also have some suspicions about the way that nested functions work: do they capture the state of all local variables, or just the ones that are referenced in their body?

You are right to be suspicious: in CPython closures keep a reference to the entire enclosing frame. See Owen S.'s answer at http://stackoverflow.com/questions/3145893/how-are-closures-implemented-in-python.

BTW, this approach not only leads to memory leaks, but also affects the semantics for captured mutable variables. And since changing that would be an incompatible change, it's not likely that either the memory leaks or the semantics will be fixed.
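
A small illustration of the captured-variable semantics mentioned above (not from the ticket): closures capture the variable itself, not its value at definition time, so callbacks created in a loop all see the loop variable's final value unless the value is bound explicitly.

```python
callbacks = [lambda: i for i in range(3)]
print([f() for f in callbacks])      # [2, 2, 2] -- often a surprise

# binding the current value via a default argument gives the expected result
callbacks = [lambda i=i: i for i in range(3)]
print([f() for f in callbacks])      # [0, 1, 2]
```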

davidsarah commented 2012-08-11 01:49:15 +00:00
Author
Owner

> Alternatively, maybe we should investigate Deferred and see how it might be made to implement tail recursion properly.

Trunk [now depends on](https://tahoe-lafs.org/trac/tahoe-lafs/changeset/26dad5044b33675f967200e85261f612613c6766/git/) a version of Twisted that solves the [Deferred tail recursion problem](https://twistedmatrix.com/trac/ticket/411).
