Batch sizes when uploading immutables are hardcoded #3787

Open
opened 2021-09-02 19:28:19 +00:00 by itamarst · 4 comments
itamarst commented 2021-09-02 19:28:19 +00:00
Owner

Updated issue description: there is a single hardcoded value for batching (formerly known as pipelining) immutable uploads, and it might be better for that value to be dynamic, or at least higher.


Initial issue description:

The Pipeline class was added in #392, but I really don't understand the reasoning.

It makes a bit more sense if you replace the word "pipeline" with "batcher" when reading the code, but I still don't understand why round-trip time is improved by this approach.

tahoe-lafs added the unknown, normal, task, n/a labels 2021-09-02 19:28:19 +00:00
tahoe-lafs added this to the HTTP Storage Protocol milestone 2021-09-02 19:28:19 +00:00
itamarst commented 2021-10-04 13:27:33 +00:00
Author
Owner

Brian provided this highly detailed explanation:


If my dusty memory serves, the issue was that uploads have a number of
small writes (headers and stuff) in addition to the larger chunks
(output of the erasure coding). Also, the "larger" chunks are still
pretty small. And the code that calls _write() is going to wait for the
returned Deferred to fire before starting on the next step. So the
client will send a tiny bit of data, wait a roundtrip for it to be
accepted, then start on the next bit, wait another roundtrip, etc. This
limits your network utilization (the percentage of your continuous
upstream bandwidth that you're actually using): the wire is sitting idle
most of the time. It gets massively worse with the round trip time.

The general fix is to use a windowed protocol that optimistically sends
lots of data, well in advance of what's been acknowledged. But you don't
want to send too much, because then you're just bloating the transmit
buffer (it all gets held up in the kernel, or in the userspace-side
socket buffer). So you send enough data to keep X bytes "in the air",
unacked, and each time you see another ack, you send out more. If you
can keep your local socket/kernel buffer from ever draining to zero,
you'll get 100% utilization of the network.

IIRC the Pipeline class was a wrapper that attempted to do something
like this for a RemoteReference. Once wrapped, the caller doesn't need
to know about the details, it can just do a bunch of tiny writes, and
the Deferred it gets back will lie and claim the write was complete
(i.e. it fires right away), when in fact the data has been sent but not
yet acked. It keeps doing this until the sent-but-not-acked data exceeds
the size limit (looks like 50kB, OMG networks were slow back then), at
which point it waits to fire the Deferreds until something actually gets
acked. Then, at the end, to make sure all the data really *did* get
sent, you have to call .flush(), which waits until the last real call's
Deferred fires before firing its own returned Deferred.
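
To make that concrete, here is a minimal sketch of a wrapper with roughly that behaviour, assuming Twisted Deferreds and a Foolscap-style `callRemote`; the `BatchingWriter` name and the details are illustrative, not the actual `Pipeline` code:

```python
from twisted.internet import defer

class BatchingWriter:
    """Illustrative sketch: fire callers' Deferreds immediately until the
    amount of sent-but-unacked data exceeds a size limit, then apply
    backpressure by holding callers until something gets acked."""

    def __init__(self, remote, limit=50_000):
        self._remote = remote      # object with a callRemote-style write
        self._limit = limit        # max unacked bytes "in the air"
        self._unacked = 0          # bytes sent but not yet acked
        self._waiting = []         # caller Deferreds held back by backpressure
        self._outstanding = []     # Deferreds for writes still in flight

    def write(self, offset, data):
        d = self._remote.callRemote("write", offset, data)
        self._unacked += len(data)
        self._outstanding.append(d)
        d.addBoth(self._acked, len(data))
        if self._unacked <= self._limit:
            # Lie optimistically: claim the write is done so the caller
            # proceeds without waiting a round trip.
            return defer.succeed(None)
        # Over the limit: make the caller wait until an earlier write is acked.
        waiter = defer.Deferred()
        self._waiting.append(waiter)
        return waiter

    def _acked(self, result, size):
        self._unacked -= size
        while self._waiting and self._unacked <= self._limit:
            self._waiting.pop(0).callback(None)
        return result

    def flush(self):
        # Fires only once every real write has actually completed.
        return defer.DeferredList(self._outstanding)
```

The key trick is the early callback: callers see an already-fired Deferred until the unacked byte count crosses the limit, at which point backpressure kicks in.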

So it doesn't reduce the number of round trips, but it reduces the
waiting for them, which should increase utilization significantly.

Or, it would, if the size limit were appropriate for the network speed.
There's a thing in TCP flow control called "bandwidth delay product"[1],
I forget the details, but I think the rule is that bandwidth times round
trip time is the amount of unacked data you can have outstanding "on the
wire" without 1: buffering anything on your end (consumes memory, causes
bufferbloat) or 2: letting the pipe run dry (reducing utilization). I'm
pretty sure the home DSL line I cited in that ticket was about 1.5Mbps
upstream, and I bet I had RTTs of 100ms or so, for a BxD of 150 kbits, or
roughly 20kB. These days I've got gigabit fiber, and maybe 50ms latency, for a
BxD of 6MB.
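
For reference, the arithmetic behind those two figures (a quick sketch; the numbers are the ones quoted above):

```python
def bandwidth_delay_product(bandwidth_bits_per_s, rtt_s):
    """Unacked bytes that must be in flight to keep the pipe full."""
    return bandwidth_bits_per_s * rtt_s / 8  # bits -> bytes

# Old home DSL: 1.5 Mbps upstream, ~100 ms RTT -> 18750 bytes, i.e. ~20 kB
print(bandwidth_delay_product(1.5e6, 0.100))
# Gigabit fiber, ~50 ms RTT -> 6_250_000 bytes, i.e. ~6 MB
print(bandwidth_delay_product(1e9, 0.050))
```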

As the comments say, we're overlapping multiple shares during the same
upload, so we don't need to pipeline the full 6MB, but I think if I were
using modern networks, I'd increase that 50kB to at least 500kB and
maybe 1MB or so. I'd want to run upload-speed experiments with a couple
of different networking configurations (apparently there's a macOS thing
called "Network Link Conditioner" that simulates slow/lossy network
connections) to see what the effects would be, to choose a better value
for that pipelining depth.

And of course the "right" way to do it would be to actively track how
fast the ACKs are returning, and somehow adjust the pipeline depth until
the pipe was optimally filled. Like how TCP does congestion/flow
control, but in userspace. But that sounds like way too much work.
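
That adaptive approach could look roughly like TCP's additive-increase/multiplicative-decrease applied to the pipeline depth. The following is only an illustrative sketch of the idea, not anything that exists in the codebase:

```python
class AdaptiveDepth:
    """Illustrative AIMD sketch: grow the pipeline depth while acks come
    back promptly, shrink it when ack latency suggests the pipe is full."""

    def __init__(self, initial=50_000, floor=50_000, ceiling=2_000_000):
        self.depth = initial
        self._floor = floor
        self._ceiling = ceiling

    def on_ack(self, ack_rtt, baseline_rtt):
        if ack_rtt <= baseline_rtt * 1.5:
            # Acks are prompt: additively increase the window.
            self.depth = min(self.depth + 50_000, self._ceiling)
        else:
            # Acks are lagging: the buffer is bloating, so back off.
            self.depth = max(self.depth // 2, self._floor)
```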
itamarst commented 2021-10-04 14:01:02 +00:00
Author
Owner

From the above we can extract two problems:

  1. A need for backpressure.
  2. `_write()` waiting for its Deferred to fire before continuing. If there were no need for backpressure, this would simply be bad. Given that backpressure is necessary, this might be acceptable; or perhaps there is a better mechanism.

So step 1 is probably figuring out how to implement backpressure.
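
One possible shape for that backpressure is Twisted's `DeferredSemaphore`, which bounds the number of write RPCs in flight; the names and the limit below are illustrative only:

```python
from twisted.internet import defer

# Allow at most N write RPCs in flight; further callers wait
# (asynchronously) until a slot frees up -- that wait is the backpressure.
MAX_IN_FLIGHT = 4
_semaphore = defer.DeferredSemaphore(MAX_IN_FLIGHT)

def bounded_write(remote, offset, data):
    # run() acquires the semaphore, makes the call, and releases the
    # semaphore when the returned Deferred fires.
    return _semaphore.run(remote.callRemote, "write", offset, data)
```

This bounds the number of outstanding writes rather than the number of bytes; a byte-based window (like the existing 50kB limit) would need a slightly different primitive, but the principle is the same: callers only proceed once earlier writes have been acknowledged.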

itamarst commented 2021-10-04 14:13:42 +00:00
Author
Owner

Instead of hardcoding buffer size, we could...

  1. Figure out latency by sending an HTTP echo to the server.
  2. Start with some reasonable batch buffer size.
  3. Keep increasing the buffer size until the latency from sending a batch is higher than the minimal expected latency from step 1, which implies we've hit the bandwidth limit (a rough sketch follows below).
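
A rough sketch of that probing loop, assuming a hypothetical async `echo()` for the HTTP echo and an `upload_batch(size)` that sends one batch and returns once the server acknowledges it (neither exists today):

```python
import time

async def timed(call, *args):
    """Return how long an async call takes, in seconds."""
    start = time.monotonic()
    await call(*args)
    return time.monotonic() - start

async def pick_batch_size(echo, upload_batch, start=50_000, ceiling=2_000_000):
    """Grow the batch size until a batch takes noticeably longer than the
    baseline round trip, which suggests we've hit the bandwidth limit."""
    # Step 1: baseline latency from a few HTTP echoes.
    samples = [await timed(echo) for _ in range(3)]
    baseline = min(samples)

    # Steps 2-3: start small and grow until latency clearly exceeds baseline.
    size = start
    while size < ceiling:
        elapsed = await timed(upload_batch, size)
        if elapsed > baseline * 2:  # the threshold is a guess; needs tuning
            break
        size *= 2
    return size
```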
itamarst commented 2022-11-23 15:19:43 +00:00
Author
Owner

Once #3939 is fixed, the Pipeline class will no longer be used. However, there will still be a batching mechanism via `allmydata.immutable.layout._WriteBuffer`, which suffers from basically the same issue of having a single hardcoded number that isn't necessarily adapted to network conditions.

So this still should be thought through based on the discussion above; I'm changing the summary and description accordingly.

tahoe-lafs modified the milestone from HTTP Storage Protocol to HTTP Storage Protocol v2 2022-11-23 15:19:43 +00:00
tahoe-lafs changed title from Is the use of Pipeline for write actually necessary? to Batch sizes when uploading immutables are hardcoded 2022-11-28 16:15:33 +00:00