pipeline upload segments to make upload faster #392


Closed

opened 2008-04-23 18:57:16 +00:00 by warner · 6 comments

warner commented

2008-04-23 18:57:16 +00:00

In ticket #252 we decided to reduce the max segment size from 1MiB to 128KiB. But this caused in-colo upload speed to drop by at least 50%.

We should see if we can pipeline two segments for upload, to get back the extra round-trip times that we lost with having more segments.

It's also possible that some of the slowdown is just from the extra overhead of computing more hashes, but I suspect the turnaround time more than the hashing overhead.

We need to do something similar for download too, since the download speed was reduced drastically by the segsize change too.
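A rough way to see why more segments can hurt: if each segment costs one un-pipelined round trip, the turnaround term grows with the segment count. The sketch below is only a back-of-the-envelope model. `upload_time`, the 10 MB/s bandwidth, and the 5 ms turnaround are hypothetical illustration values, not measurements from this ticket, and the model ignores erasure-coding expansion and hashing cost.

```python
import math

KiB = 1024
MiB = 1024 * KiB


def upload_time(file_size, seg_size, bandwidth, rtt):
    """Estimate seconds to upload when every segment waits one round trip.

    file_size and seg_size are in bytes, bandwidth in bytes/second,
    rtt in seconds. Hypothetical model, not Tahoe's actual uploader.
    """
    segments = math.ceil(file_size / seg_size)
    return file_size / bandwidth + segments * rtt


if __name__ == "__main__":
    # Assumed link parameters, chosen only for illustration.
    for seg_size in (1 * MiB, 128 * KiB):
        t = upload_time(10 * MiB, seg_size, bandwidth=10e6, rtt=0.005)
        print(f"segsize={seg_size // KiB}KiB: {t:.3f}s")
```

With these assumed numbers the transfer term is identical for both segment sizes; only the per-segment turnaround term changes.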


warner added the p/major, t/enhancement, v/1.0.0 labels 2008-04-23 18:57:16 +00:00

warner added this to the eventually milestone 2008-04-23 18:57:16 +00:00

warner self-assigned this 2008-04-23 18:57:16 +00:00

warner commented

2008-05-14 18:09:52 +00:00

Author

Oh, and I just thought of the right place to do this too: in the `WriteBucketProxy`. It should be allowed to keep a Nagle-like cache of write vectors, and send them out in a batch when the cache gets larger than some particular size (that will coalesce small writes into a single call, reducing the number of round trips). In addition, it should be allowed to have multiple calls outstanding if the total amount of data that it has sent (and therefore might be in the transport buffer) is below some amount, say 128KiB. If k=3, then that should allow three segments to be on the wire at once, mitigating the slowdown due to round trips. As long as the window size is at least bandwidth*RTT, this should keep the pipe full.
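The coalescing half of that idea can be sketched as follows. This is a minimal illustration, not code from `WriteBucketProxy`: the class name `WriteCoalescer`, the `send_batch` callable, and the 16KiB default threshold are all hypothetical.

```python
class WriteCoalescer:
    """Buffer (offset, data) write vectors and send them as one batch.

    Hypothetical helper: send_batch is any callable that accepts a list
    of (offset, data) tuples, e.g. a wrapper around one remote call.
    """

    def __init__(self, send_batch, threshold=16 * 1024):
        self._send_batch = send_batch
        self._threshold = threshold   # flush once more than this many bytes are cached
        self._pending = []
        self._pending_bytes = 0

    def write(self, offset, data):
        """Cache one write; flush when the cache grows past the threshold."""
        self._pending.append((offset, data))
        self._pending_bytes += len(data)
        if self._pending_bytes > self._threshold:
            self.flush()

    def flush(self):
        """Send everything cached so far in a single call, if anything is cached."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._pending_bytes = 0
            self._send_batch(batch)
```

A caller would issue its small writes through `write()` and call `flush()` once at the end, so several small writes share one round trip instead of paying for one each.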


warner commented

2008-06-01 22:09:39 +00:00

Author

#320 is related, since the storage-server protocol changes we talked about would make it easier to implement the pipelining.

warner commented

2009-04-15 19:23:12 +00:00

Author

Attachment pipeline.diff (14391 bytes) added

patch to add pipelining to immutable upload


warner commented

2009-04-15 20:10:20 +00:00

Author

So, using the attached patch, I added pipelined writes to the immutable
upload operation. The `Pipeline` class allows up to 50KB in the pipe
before it starts blocking the sender. Specifically, the calls to
`WriteBucketProxy._write` return `defer.succeed` until there is more
than 50KB of unacknowledged data in the pipe, after which they return regular
Deferreds until some of those writes get retired. A terminal `flush()`
call causes the Upload to wait for the pipeline to drain before it is
considered complete.
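To make the blocking rule concrete, here is a minimal sketch of the same behaviour. It is not the code in pipeline.diff: it uses `asyncio` from the standard library rather than Twisted Deferreds so that it runs without extra dependencies, and the method names `add` and `_retire`, the `gauge` attribute, and the `demo` helper are assumptions made for illustration.

```python
import asyncio


class Pipeline:
    """Let writes proceed without waiting until too much data is unacknowledged."""

    def __init__(self, capacity=50_000):
        self.capacity = capacity
        self.gauge = 0                 # bytes sent but not yet acknowledged
        self._tasks = set()
        self._room = asyncio.Event()

    def _retire(self, task, size):
        # Called when one write has been acknowledged (or has failed).
        self._tasks.discard(task)
        self.gauge -= size
        if self.gauge <= self.capacity:
            self._room.set()

    async def add(self, size, send, *args):
        """Start send(*args); only wait if the pipe is now over capacity."""
        self.gauge += size
        task = asyncio.ensure_future(send(*args))
        self._tasks.add(task)
        task.add_done_callback(lambda t: self._retire(t, size))
        if self.gauge > self.capacity:
            self._room.clear()
            await self._room.wait()

    async def flush(self):
        """Wait for every outstanding write to be acknowledged."""
        if self._tasks:
            await asyncio.gather(*self._tasks)


async def demo():
    pipe = Pipeline(capacity=50_000)
    acks = []

    async def fake_send(n):
        await asyncio.sleep(0.01)      # stand-in for one network round trip
        acks.append(n)

    await pipe.add(20_000, fake_send, 0)
    await pipe.add(20_000, fake_send, 1)
    before_block = list(acks)          # nothing acknowledged yet
    await pipe.add(20_000, fake_send, 2)   # 60_000 > capacity, so this waits
    after_block = list(acks)
    await pipe.flush()
    return before_block, after_block, sorted(acks), pipe.gauge


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the first two writes return immediately, the third has to wait for at least one acknowledgement, and `flush()` returns only once the pipe is empty.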

A quick performance test (in the same environments that we do the buildbot
performance tests on: my home DSL line and tahoecs2 in colo) showed a
significant improvement in the DSL per-file overhead, but only about a 10%
improvement in the overall upload rate (for both DSL and colo).

Basically, the 7 writes used to write a small file (header, segment 0,
crypttext_hashtree, block_hashtree, share_hashtree, uri_extension, close) are
all put on the wire together, so they take bandwidth plus 1 RTT instead of
bandwidth plus 7 RTT. The savings of 6 RTT appear to be worth about 1.8
seconds over my DSL line. (My ping time to the servers is about 11ms, but
then there's kernel/python/twisted/foolscap/tahoe overhead on top of that.)
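As a sanity check on that estimate, the arithmetic can be redone from the "upload per-file time" figures of the two DSL runs reported in this comment (2.179s without pipelining, 0.515s with). The variable names are illustrative only, and the per-file figures give a somewhat smaller saving than the rounded 1.8 seconds quoted above.

```python
# Seven writes per small file: header, segment 0, three hash trees,
# uri_extension, close. Pipelining collapses them into one round trip.
writes_per_small_file = 7
rtts_saved = writes_per_small_file - 1

# "upload per-file time" from the DSL runs in this comment, in seconds.
per_file_before = 2.179
per_file_after = 0.515

ping = 0.011  # reported ping time to the servers, in seconds

saved = per_file_before - per_file_after
effective_round_trip = saved / rtts_saved

print(f"saved per file: {saved:.3f}s")
print(f"effective cost per round trip: {effective_round_trip * 1000:.0f}ms")
print(f"ratio to ping time: {effective_round_trip / ping:.0f}x")
```

The effective per-round-trip cost comes out far above the raw ping time, which fits the remark about software overhead on top of the network latency.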

For a larger file, pipelining might increase the utilization of the wire,
particularly if you have a "long fat" pipe (high bandwidth but high latency).
However, with 10 shares going out at the same time, the wire is probably
pretty full already: the ratio of interest is `segsize*N/k/BW / RTT`. You send
N blocks for a single segment at once, then you wait for all the replies to
come back, then generate the next blocks. If the time it takes to send a
single block is greater than the server's turnaround time, then N-1 responses
will be received before the last block is finished sending, so you've only
got one RTT of idle time (while you wait for the last server to respond).
Pipelining will fill this last RTT, but my guess is that this isn't much of a
help, and that something else is needed to explain the performance hit we saw
in colo when we moved to smaller segments.
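The ratio of interest can be evaluated directly. The sketch below assumes 128KiB segments with k=3 and N=10; the bandwidth and RTT values are hypothetical placeholders (no wire bandwidth was measured here), and `send_to_rtt_ratio` and `idle_fraction` are names invented for this illustration.

```python
KiB = 1024


def send_to_rtt_ratio(seg_size, n, k, bandwidth, rtt):
    """segsize*N/k/BW / RTT: time to send one segment's N blocks, in RTTs."""
    send_time = seg_size * n / k / bandwidth
    return send_time / rtt


def idle_fraction(ratio):
    """Share of each segment cycle spent idle if exactly one RTT is wasted."""
    return 1.0 / (1.0 + ratio)


if __name__ == "__main__":
    # Hypothetical links: (name, bytes/second, effective RTT in seconds).
    links = [("slow uplink", 80_000, 0.3), ("fast LAN", 10e6, 0.005)]
    for name, bandwidth, rtt in links:
        ratio = send_to_rtt_ratio(128 * KiB, n=10, k=3, bandwidth=bandwidth, rtt=rtt)
        print(f"{name}: ratio={ratio:.1f}, idle={idle_fraction(ratio):.1%}")
```

When the ratio is large, the single idle RTT per segment is a small share of the cycle, which is the case where pipelining has little left to recover.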

DSL no pipelining:

TIME (startup): 2.36461615562 up, 0.719145059586 down
TIME (1x 200B): 2.38471603394 up, 0.734190940857 down
TIME (10x 200B): 21.7909920216 up, 8.98366594315 down
TIME (1MB): 45.8974239826 up, 5.21775698662 down
TIME (10MB): 449.196600914 up, 34.1318571568 down
upload per-file time: 2.179s
upload speed (1MB): 22.87kBps
upload speed (10MB): 22.37kBps

DSL with pipelining:

TIME (startup): 0.437352895737 up, 0.185742139816 down
TIME (1x 200B): 0.493880987167 up, 0.202013969421 down
TIME (10x 200B): 5.15211510658 up, 2.04516386986 down
TIME (1MB): 43.141931057 up, 2.09753513336 down
TIME (10MB): 416.777194977 up, 19.6058299541 down
upload per-file time: 0.515s
upload speed (1MB): 23.46kBps
upload speed (10MB): 24.02kBps

The in-colo tests showed roughly the same improvement to upload speed, but
very little change to the per-file time. The RTT there is much shorter (ping
time is about 120us), which might explain the difference. But I think the
slowdown lies elsewhere. Pipelining shaves about 30ms off each file, and
increases the overall upload speed by about 10%.

colo no pipelining:

TIME (startup): 0.29696393013 up, 0.0784759521484 down
TIME (1x 200B): 0.285771131516 up, 0.0790619850159 down
TIME (10x 200B): 3.23165798187 up, 0.849181175232 down
TIME (100x 200B): 31.7827451229 up, 8.95765590668 down
TIME (1MB): 1.00738477707 up, 0.347244977951 down
TIME (10MB): 7.12743496895 up, 2.9827849865 down
TIME (100MB): 70.9683670998 up, 25.6454920769 down
upload per-file time: 0.318s
upload per-file times-avg-RTT: 83.833386
upload per-file times-total-RTT: 20.958347
upload speed (1MB): 1.45MBps
upload speed (10MB): 1.47MBps
upload speed (100MB): 1.42MBps

colo with pipelining:

TIME (startup): 0.262734889984 up, 0.0758249759674 down
TIME (1x 200B): 0.271718025208 up, 0.0812950134277 down
TIME (10x 200B): 2.80361104012 up, 0.838641881943 down
TIME (100x 200B): 28.4790999889 up, 9.36092710495 down
TIME (1MB): 0.853738069534 up, 0.337486028671 down
TIME (10MB): 6.6658270359 up, 2.67381596565 down
TIME (100MB): 64.6233050823 up, 26.5593090057 down
upload per-file time: 0.285s
upload per-file times-avg-RTT: 77.205647
upload per-file times-total-RTT: 19.301412
upload speed (1MB): 1.76MBps
upload speed (10MB): 1.57MBps
upload speed (100MB): 1.55MBps
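For reference, the relative gain can be recomputed from the raw upload TIME lines of the four runs above. This is only a convenience sketch: `speedup_percent` is a hypothetical helper, and it compares whole-transfer times, so it will not match the "upload speed" lines exactly, since those appear to exclude the per-file overhead.

```python
# (label, upload seconds without pipelining, upload seconds with pipelining),
# copied from the TIME lines of the four runs above.
runs = [
    ("DSL 1MB", 45.8974239826, 43.141931057),
    ("DSL 10MB", 449.196600914, 416.777194977),
    ("colo 1MB", 1.00738477707, 0.853738069534),
    ("colo 10MB", 7.12743496895, 6.6658270359),
    ("colo 100MB", 70.9683670998, 64.6233050823),
]


def speedup_percent(before, after):
    """Throughput gain for the same payload, as a percentage."""
    return (before / after - 1.0) * 100.0


if __name__ == "__main__":
    for label, before, after in runs:
        print(f"{label}: {speedup_percent(before, after):.1f}% faster")
```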

I want to run some more tests before landing this patch, to make sure it's
really doing what I thought it should be doing. I'd also like to improve the
automated speed-test to do a simple TCP transfer to measure the available
upstream bandwidth, so we can compare tahoe's upload speed against the actual
wire.

