memory usage in MDMF publish #1513

Open
opened 2011-08-28 22:28:07 +00:00 by warner · 8 comments
warner commented 2011-08-28 22:28:07 +00:00
Owner

I did a 'tahoe push --mdmf --mutable-type=mdmf foo' of a 210MB file. The client process swelled to 1.15GB RSS, making my entire system pretty unresponsive. The publish eventually succeeded, and the memory usage went back to normal.

I'm guessing that either there's a design problem in which it's trying to upload all segments in parallel, or there's a failure in the Pipeline code such that it's holding all shares in memory at the same time.

Since MDMF is supposed to make it possible to work with large files, I think the memory usage should be similar to CHK files: capped at a small constant times the segsize.

It would be nice to fix this for 1.9, but since MDMF is still experimental, I'm willing to ship without it.
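For comparison, here is a minimal sketch (not Tahoe's actual publish code) of the segment-at-a-time pattern that keeps CHK upload memory bounded; `encode_segment` and `push_blocks` are hypothetical stand-ins for the erasure coder and the per-server write calls.

```python
SEGSIZE = 128 * 1024  # Tahoe's default max segment size is 128 KiB

def publish_streaming(f, encode_segment, push_blocks):
    """Hypothetical streaming publish loop: only one segment (plus its
    encoded blocks) is in memory at once, so peak usage is roughly a
    small constant times SEGSIZE rather than the whole file."""
    segnum = 0
    while True:
        segment = f.read(SEGSIZE)
        if not segment:
            break
        blocks = encode_segment(segment)  # k-of-N blocks for this segment only
        push_blocks(segnum, blocks)       # sent and released before the next read
        segnum += 1
```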

tahoe-lafs added the
code-mutable
minor
defect
1.9.0a1
labels 2011-08-28 22:28:07 +00:00
tahoe-lafs added this to the 1.9.0 milestone 2011-08-28 22:28:07 +00:00
warner commented 2011-08-28 22:41:19 +00:00
Author
Owner

Hm, there's a tension between reliability and memory footprint here. When making changes, we want each share to atomically jump from version1 to version2, without being left in any intermediate state. But that means all of the changes need to be held in memory and applied at the same time.

When we're jumping from "no such share" to version1, those changes are the entire file. The data needs to be buffered somewhere. If we were allowed to write one segment at a time to the server's disk, then a server failure or lost connection would leave us in an intermediate state, where the share only had a portion of version1, which would effectively be a corrupt share.

I can think of a couple of ways to improve this:

  • special-case the initial share creation: give the client an API to incrementally write blocks to the new share, and either allow the world to see the incomplete share early, or put the partial share in a separate incoming/ directory and figure out a way to only make it visible to the client that's building it.
  • create an API to build a new version of the share one change at a time, then a second API call to finalize the change (and make the new version visible to the world). It might look something like the immutable share-building API (see the sketch after this list):
    • edithandle = share.start_editing()
    • edithandle.apply_delta(offset, newdata)
    • edithandle.finish()
    • edithandle.abort()
    • finish() is the test-and-set operation: it might fail if some other writer has completed their own start_editing()/apply_delta()/finish() sequence faster.
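A rough interface sketch of the edit-handle API proposed in the second bullet; the names follow the bullets above, while everything else (classes, docstrings) is hypothetical.

```python
class EditHandle:
    """Accumulates changes to one mutable share; nothing becomes visible
    to readers until finish() succeeds."""

    def apply_delta(self, offset, newdata):
        """Stage a write of newdata at the given offset in the pending version."""

    def finish(self):
        """Test-and-set: atomically publish the staged version, or fail if
        another writer completed its own start_editing()/apply_delta()/finish()
        sequence first."""

    def abort(self):
        """Discard all staged changes without publishing anything."""


class MutableShare:
    def start_editing(self):
        """Begin building the next version of this share; returns an EditHandle."""
```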

If we're willing to tolerate the disk footprint, we could increase reliability against server crashes by making start_editing() create a full copy of the old share in a sibling directory (like incoming/, not visible to anyone but the edithandle). Then apply_delta() would do normal write()s to the copy, and finish() would atomically move the copy back into place. Everything in the incoming/ directory would be deleted at startup, and the temp copies would also be deleted when the connection to the client was lost. This would slow down updates for large files (since a lot of data would need to be shuffled around before the edit could begin), and would consume more disk (twice the size of the share), but would allow edits to be spread across separate messages, which reduces the client's memory requirements. It would also reduce share corruption caused by the server being bounced during a mutable write.
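A hedged sketch of the copy-then-atomic-rename idea above, assuming a staging directory on the same filesystem as the share (which is what makes os.replace() atomic); the class and the omitted test-and-set check are illustrative, not the real storage-server code.

```python
import os
import shutil

class StagedShareEdit:
    """Illustrative server-side edit: copy the share aside, write deltas
    into the copy, then atomically swap it back into place."""

    def __init__(self, share_path, staging_dir):
        self.final_path = share_path
        self.tmp_path = os.path.join(staging_dir, os.path.basename(share_path))
        shutil.copy2(share_path, self.tmp_path)  # full copy: extra disk and I/O up front

    def apply_delta(self, offset, newdata):
        with open(self.tmp_path, "r+b") as f:
            f.seek(offset)
            f.write(newdata)

    def finish(self):
        # The real finish() would first perform the test-and-set version check.
        # os.replace() is an atomic rename within one filesystem, so readers
        # see either the old version or the new one, never a partial write.
        os.replace(self.tmp_path, self.final_path)

    def abort(self):
        os.remove(self.tmp_path)
```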

davidsarah commented 2011-08-28 23:14:38 +00:00
Author
Owner

Replying to warner:

  • create an API to build a new version of the share one change at a time, then a second API call to finalize the change (and make the new version visible to the world). It might look something like the immutable share-building API:
    • edithandle = share.start_editing()
    • edithandle.apply_delta(offset, newdata)
    • edithandle.finish()
    • edithandle.abort()
    • finish() is the test-and-set operation: it might fail if some other writer has completed their own start_editing()/apply_delta()/finish() sequence faster.

I prefer this option: it allows the client to apply the deltas to all servers and confirm that those operations succeed, and only then send finish to all servers. But note that there needs to be an edithandle.truncate(new_size) operation, or alternatively .finish(new_size).
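A sketch of the two-phase client flow described here, including the truncate operation; the server objects and their methods are the hypothetical API from the earlier comment, not an existing Tahoe interface.

```python
def publish_new_version(servers, deltas, new_size):
    # Phase 1: stage the edit on every server and confirm the writes landed.
    handles = []
    for server in servers:
        handle = server.start_editing()
        for offset, newdata in deltas:
            handle.apply_delta(offset, newdata)
        handle.truncate(new_size)  # or, alternatively, pass new_size to finish()
        handles.append(handle)

    # Phase 2: only after every server has acknowledged the staged deltas do
    # we send finish().  A test-and-set failure here means another writer won
    # the race; the remaining handles should be aborted and the publish retried.
    for handle in handles:
        handle.finish()
```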

davidsarah commented 2011-09-22 22:06:50 +00:00
Author
Owner

There are some memory usage measurements on the duplicate #1523. Particularly concerning is that there seems to be a rather large memory leak; it's not just high transient memory usage.

tahoe-lafs added
major
and removed
minor
labels 2011-09-22 22:06:50 +00:00
zooko commented 2011-09-22 23:09:59 +00:00
Author
Owner

Let's call it a memory "leak" if doing some operation repeatedly results in progressively greater memory usage, such that if you do that operation enough times it will use up all the memory in your system. Let's not call it a memory "leak" if it uses up way too much RAM. Note that last time I heard, CPython never releases memory back to the operating system: http://www.evanjones.ca/memoryallocator/

It sounds to me like there is a major problem here, which is that Tahoe-LAFS uses up way too much memory. I don't see evidence that there is a "leak" per se, and I don't consider it to be a major problem that CPython never releases memory back to the operating system.
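A small sketch of the test this distinction suggests: run the same operation repeatedly and sample RSS after each run. Steady growth across runs points to a leak, while a single plateau points to high but bounded transient usage that CPython may simply not hand back to the OS. The /proc reading is Linux-specific and the helper names are made up.

```python
def current_rss_kb():
    """Return the process's resident set size as reported by the kernel (Linux-only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # field is reported in kB
    return 0

def sample_rss_across_runs(operation, iterations=5):
    """Run `operation` repeatedly (e.g. publish the same MDMF file) and
    record RSS after each run; a monotonically rising series suggests a
    genuine leak rather than transient bloat."""
    samples = []
    for _ in range(iterations):
        operation()
        samples.append(current_rss_kb())
    return samples
```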

zooko commented 2011-09-22 23:10:37 +00:00
Author
Owner

We need to document this in 1.9 release's docs/performance.rst if it isn't fixed.

davidsarah commented 2011-09-23 00:11:06 +00:00
Author
Owner

Subject to fragmentation issues, CPython does return memory to the OS: http://bugs.python.org/issue1123430. I tried to test whether uploading a second file resulted in the same additional memory usage (suggesting a leak) or less (suggesting that not returning memory is part of the problem), but couldn't complete the test because my machine became unresponsive. I'll try again when I have more free memory.

Note that it's RSS that we're measuring, not virtual memory. Memory pages that aren't being used shouldn't be counted in RSS (eventually).

warner commented 2011-10-13 17:07:00 +00:00
Author
Owner

not making it into 1.9

tahoe-lafs modified the milestone from 1.9.0 to 1.10.0 2011-10-13 17:07:00 +00:00
zooko commented 2013-08-09 18:18:49 +00:00
Author
Owner

This isn't going to make it into [1.11.0]. I think it requires a deep change. Ultimately I think it actually requires end-to-end two-phase-commit (#1755)!

Let's see, does the docs/performance.rst already document this issue? source:trunk/docs/performance.rst?rev=514fb096be50464ce78933f4db48db4de40e7265#publishing-an-a-byte-mutable-file. Yes! Good.

tahoe-lafs modified the milestone from 1.11.0 to eventually 2013-08-09 18:18:49 +00:00
Reference: tahoe-lafs/trac-2024-07-25#1513