analyze mutable-file retrieval performance: count roundtrips #304

Closed
opened 2008-02-08 03:07:51 +00:00 by warner · 4 comments
warner commented 2008-02-08 03:07:51 +00:00
Owner

I'd like to spend some time analyzing the control flow in mutable.Retrieve .
My original hope was to pull down the whole contents of the file in a single
round-trip, but it turned out to be trickier than I'd thought. I think we're
currently doing more like 3 or 4 RTT.

Directory traversal would be considerably faster with fewer roundtrips.

Or would it? Finding out how much of the time is spent waiting for the
network would be a great starting point. The first step would be to implement
a "Retrieval Results" object (similar to the immutable-file "Upload Results"
object I just wrote this week), into which we could throw timing information
during retrieval. Then find a good way to display this information: maybe in
a separate box at the bottom of the webish directory display (if some debug
checkbox were selected).
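
For concreteness, here is a minimal sketch of what such a "Retrieval Results" object could look like; the class name, attributes, and methods are hypothetical placeholders for illustration, not the eventual Tahoe API:

```python
import time

class RetrieveResults:
    """Hypothetical container for per-retrieval timing data."""
    def __init__(self):
        self.timings = {}        # phase name -> elapsed seconds
        self.per_server = {}     # peerid -> list of per-query elapsed times
        self._started = {}

    def start(self, phase):
        self._started[phase] = time.time()

    def finish(self, phase):
        self.timings[phase] = time.time() - self._started.pop(phase)

    def add_query_time(self, peerid, elapsed):
        self.per_server.setdefault(peerid, []).append(elapsed)
```

The webish debug box would then only need to render `timings` and `per_server` for the most recent retrieval.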

tahoe-lafs added the
minor
task
0.7.0
labels 2008-02-08 03:07:51 +00:00
tahoe-lafs added this to the undecided milestone 2008-02-08 03:07:51 +00:00
warner commented 2008-02-08 03:20:44 +00:00
Author
Owner

Note that the only call we ever make to the storage servers is "readv", so
the issue is simpler than with immutable files (in which there is a call to
get_buckets, followed by some number of 'read' calls, and foolscap does not
yet do promise pipelining).

One potential improvement is to avoid doing more than one readv per server.
This means reading enough data on the first call that we don't need to ask
for more. To do this, we must make a guess as to a reasonable amount of data
to ask for: we need the pubkey from one peer, the signed version info from
k+epsilon peers, and full share data from 'k' peers.
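
As a rough sketch of that sizing decision (the constants and helper below are illustrative guesses, not values from the code):

```python
SDMF_MAX_SIZE = 1 * 1000 * 1000   # SDMF caps files at about 1MB, so a read
                                  # this large covers any share
HEADER_GUESS = 2000               # rough guess: pubkey + signed version info

def initial_read_size(wants_full_share):
    # full-share servers get one oversized read covering the whole share;
    # the extra 'epsilon' servers only need the header region
    return SDMF_MAX_SIZE if wants_full_share else HEADER_GUESS
```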

The other potential improvement is to exploit as much parallelism as
possible. There are plenty of tradeoffs here. If we knew 'k' ahead of time,
we would know how many signed-version-info queries we need.

Hm, maybe something like this:

  • make a guess at 'k', using the same defaults we'd use for publishing a new
    mutable file. For DirectoryNodes, the container should provide a
    hint, so that if dirnodes use 1-of-5 instead of 3-of-7, our guess for k is
    '1'
  • make sure that readv() correctly handles oversized read requests. I'm 90%
    sure that it does, but I want to double-check
  • ask 'k' servers for the full SDMF 1MB limit, and an extra epsilon
    servers for the signed-version-info+pubkey data.
  • The pubkey is checked from whatever share arrives first.
  • If we learn that 'k' is higher, immediately send out additional full-size
    queries.
  • If we get back NAKs from the servers, send out an extra query for each
    NAK.

If we're correct (or high) about 'k', and all the servers have shares, then
this will do the job in a single RTT. If we underestimated 'k', we'll need
2 RTT. If some servers did not have shares, we'll need about 2 RTT (or
perhaps a fraction more, depending upon the spread of response times).
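
To make that control flow concrete, here is a deliberately synchronous, much-simplified simulation of the plan; real retrieval is parallel and Deferred-based, and none of the helper names below come from the Tahoe code:

```python
def simulate_retrieve(servers, k_guess, epsilon, query):
    """query(server, full_size) -> dict with optional keys:
         'k'     : the real k announced in the signed version info
         'share' : share data, present when full_size was requested and held
       An empty dict models a NAK (the server has no share)."""
    shares, rtts, k = [], 0, k_guess
    # first wave: full-size reads from k_guess servers, headers from epsilon more
    pending = [(s, i < k_guess) for i, s in enumerate(servers[:k_guess + epsilon])]
    spare = servers[k_guess + epsilon:]
    while pending and len(shares) < k:
        rtts += 1
        followups = []
        for server, full in pending:
            answer = query(server, full)
            if not answer:
                if spare:                       # NAK: one replacement query
                    followups.append((spare.pop(0), full))
                continue
            k = max(k, answer.get("k", k))      # maybe our guess for k was low
            if "share" in answer:
                shares.append(answer["share"])
        # if k turned out to be larger, fire additional full-size queries
        full_followups = sum(1 for _, f in followups if f)
        while len(shares) + full_followups < k and spare:
            followups.append((spare.pop(0), True))
            full_followups += 1
        pending = followups
    return shares, rtts
```

With a correct (or high) guess for 'k' the loop finishes after one wave, matching the single-RTT claim above; underestimating 'k' or hitting a NAK costs one extra wave.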

warner commented 2008-02-13 23:44:25 +00:00
Author
Owner

I just did some dirnode-size analysis, and put the results in
docs/dirnodes.txt. The summary is that each child adds about 336 bytes to
the dirnode, and therefore 112 bytes to a k=3-encoded share. The SDMF format
stores about 935 bytes of pubkey+sig+hashes, followed by share data, followed
by 1216 bytes of encprivkey. So if we fetch 2kB of data on the initial read,
we should be able to retrieve a 9-entry dirnode. If we fetch 3kB, we should
be able to get 18 entries, or 7 entries plus the encprivkey. 4kB would get us
27 entries, or 16 plus the encprivkey.
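
The arithmetic behind those estimates, using the approximate sizes quoted above:

```python
# Approximate sizes quoted in this comment, not exact on-disk values.
HEADER = 935          # pubkey + signature + hashes
ENCPRIVKEY = 1216     # encrypted private key at the end of the share
PER_CHILD = 336 // 3  # ~112 bytes of share data per dirnode entry at k=3

def entries_within(read_size, want_privkey=False):
    budget = read_size - HEADER - (ENCPRIVKEY if want_privkey else 0)
    return budget // PER_CHILD

print(entries_within(2000))                               # -> 9
print(entries_within(3000), entries_within(3000, True))   # -> 18 7
print(entries_within(4000), entries_within(4000, True))   # -> 27 16
```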

warner commented 2008-02-14 01:16:45 +00:00
Author
Owner

I did an experiment to examine the effects of changing the Publish initial
read_size. Using 2048-bit keys and the same speed-test setup we use on the
buildbot, I got:

  • colo: (2048-bit keys)
    • 1kB (default)
      • create per-file time: 1.772s/2.806s
      • upload: 184ms/185ms
    • 3kB
      • create: 1.705s/2.299s
      • upload: 184ms/185ms
  • DSL:
    • 1kB (default)
      • create: 3.317s/3.117s
      • upload: 338ms/327ms
    • 3kB
      • create: 4.124s/4.151s
      • upload: 324ms/323ms

I was expecting the 3kB case to upload files one RTT faster (about 16ms on
DSL, 3ms in colo). I can't see any evidence of such a speedup, although a
single RTT is small enough to be lost in the noise here.

We need to investigate mutable-file behavior more closely. As pointed out
above, the real first step will be to add "Retrieval Results" / "Publish
Results" with detailed timing data.

Incidentally (for completeness), an extra experiment I did with 522-bit keys
is bogus, because the whole share is smaller, and a 1kB read is probably
enough to get everything. The results of that experiment were:

  • colo: (522-bit keys)
    • 1kB (default)
      • create per-file time: 152ms/142ms
      • upload: 119ms/123ms
    • 3kB
      • create: 135ms/136ms
      • upload: 117ms/119ms
  • DSL:
    • 1kB (default)
      • create: 262ms/279ms
      • upload: 210ms/212ms
    • 3kB
      • create: 256ms/292ms
      • upload: 211ms/207ms
warner commented 2008-03-05 03:30:19 +00:00
Author
Owner

I added some retrieval-side timing data to the new webish "status" page,
which now includes how long each query took. Analyzing the results revealed
that the Retrieve code was doing a refetch to grab the encrypted private key,
which of course really isn't useful there (you only need the private key to
create a new version of the file, so Publish wants it more than Retrieve). As
a result, the default initial-read size of 2000 bytes was causing us to do
two round-trips, even for small directories.

I changed the code [#2245] to have separate unpack_share() and
unpack_share_data() methods. The former raises NeedMoreDataError if it
can't get the encrypted private key; the latter only raises
NeedMoreDataError if it can't get the share data (and is silent about the
encprivkey). Retrieve now uses unpack_share_data() exclusively.

The result is one round trip instead of two for directories with fewer than
about 9 or 10 entries, allowing them to complete in roughly half the time.
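
A simplified illustration of the unpack_share() / unpack_share_data() split; the offset names and structure here are a sketch, not the actual layout code:

```python
class NeedMoreDataError(Exception):
    """Raised when the prefetched bytes don't cover the fields we need."""
    def __init__(self, needed_bytes):
        Exception.__init__(self, needed_bytes)
        self.needed_bytes = needed_bytes

def _check_share(data, offsets, need_encprivkey):
    # unpack_share() behaves like need_encprivkey=True (Publish wants the
    # key); unpack_share_data() like need_encprivkey=False (Retrieve doesn't)
    end = offsets["EOF"] if need_encprivkey else offsets["enc_privkey"]
    if len(data) < end:
        raise NeedMoreDataError(end)
    share_data = data[offsets["share_data"]:offsets["enc_privkey"]]
    encprivkey = data[offsets["enc_privkey"]:end] if need_encprivkey else None
    return share_data, encprivkey
```

Since a small dirnode's share data sits entirely before the encprivkey, the 2000-byte initial read now satisfies unpack_share_data() on the first try.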

tahoe-lafs modified the milestone from undecided to 0.9.0 (Allmydata 3.0 final) 2008-03-13 17:09:56 +00:00
tahoe-lafs added the
fixed
label 2008-03-13 17:51:31 +00:00
zooko closed this issue 2008-03-13 17:51:31 +00:00
Reference: tahoe-lafs/trac-2024-07-25#304