download: tolerate lost or missing servers #287

Open
opened 2008-01-26 00:25:30 +00:00 by warner · 33 comments

I don't have a failing unit test to prove it, but I'm fairly sure that the
current code will abort a download if one of the servers we're using is lost
during the download. This is a problem.

A related problem is that downloads will run at the rate of the slowest used
peer, and we may be able to get significantly faster downloads by using one
of the other N-k available servers. For example, if you have most of your
servers in colo, but one or two are distant, then a helper which is also in
colo might prefer to pull shares entirely from in-colo machines.

The necessary change should be to keep a couple of extra servers in reserve,
such that used_peers is a list (sorted by preference/speed) with some
extra members, rather than a minimal set of exactly 'k' undifferentiated
peers.

If a block request hasn't completed within some "reasonable" amount of time
(say, 2x the time of the other requests?), we should move the slow server to
the bottom of the list and make a new query for that block (using a server
that's not currently in use but which appears at a higher priority than the
slowpoke). If the server was actually missing (and it just takes TCP a while
to decide that it's gone), it will eventually go away (and the query will
fail with a DeadReferenceError), in which case we'll remove it from the list
altogether (which is what the current code does, modulo the newly-reopened
#17 bug).
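
As a rough illustration of that scheme (a minimal sketch with invented names, not
code from the Tahoe tree), the preference-ordered list and the "reasonable amount
of time" check might look like this:

```python
# Hypothetical sketch only -- ServerList and is_overdue are illustrative names
# and do not appear in the Tahoe codebase.

class ServerList:
    """Keep more than k servers, ordered by preference (fastest first)."""

    def __init__(self, servers, k):
        self.servers = list(servers)  # preference order, best first
        self.k = k

    def active(self):
        # the k servers we are currently pulling blocks from
        return self.servers[:self.k]

    def reserves(self):
        # extra servers held back, still in priority order
        return self.servers[self.k:]

    def demote(self, slowpoke):
        # a block request took too long: move the slow server to the bottom
        self.servers.remove(slowpoke)
        self.servers.append(slowpoke)

    def drop(self, dead_server):
        # the query actually failed (e.g. DeadReferenceError): forget it entirely
        self.servers.remove(dead_server)


def is_overdue(elapsed, other_request_times, factor=2.0):
    """The 'reasonable amount of time' guess: roughly 2x the other requests."""
    if not other_request_times:
        return False
    return elapsed > factor * max(other_request_times)
```

A real implementation would also have to decide what to do when there are no
completed requests to compare against; the sketch simply declines to declare
anything overdue in that case.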

Without this, many of the client downloads in progress when we bounce a
storage server will fail, which would be pretty annoying for the clients.

(#798 is Brian's downloader rewrite)

warner added the
c/code-encoding
p/major
t/defect
v/0.7.0
labels 2008-01-26 00:25:30 +00:00
warner added this to the 0.9.0 (Allmydata 3.0 final) milestone 2008-01-26 00:25:30 +00:00
zooko modified the milestone from 0.9.0 (Allmydata 3.0 final) to undecided 2008-03-08 02:54:11 +00:00
Author

I think I've identified three main problems:

  • if any servers are silently partitioned (e.g. a laptop that's been
    suspended, any connection that TCP hasn't realized is gone yet) when a
    download starts, peer-selection will not complete until that connection is
    finally abandoned
  • if an active server is silently partitioned during a download, the
    download will stall until TCP gives up on them. At that point, the
    download ought to resume, using some other server (if a server is actively
    lost during a download, such that TCP gives us a connectionLost, then the
    download should immediately switch to a different server; however, I need
    to test this more carefully).
  • the download will be as slow as the slowest active server.

The task is to fix download to:

  • start downloading segments as soon as peer-selection finds 'k' shares.
    get_buckets responses that arrive after segment download has begun should
    be added to the alternates list.
  • if any response takes more than, say, three times as long as the longest
    response for that segment, move the slow server to the bottom of the
    alternates list and start fetching a different share.

Basically the download process must turn into a state machine. Each known
share has a state (which hashes have been fetched, which block queries are
outstanding). The initial peer-selection process causes shares to be added to
the known list.
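
To make that concrete, here is a minimal sketch of such a state machine, using
invented names (ShareState, DownloadStateMachine) rather than anything from the
actual downloader:

```python
# Sketch only: these classes are illustrative, not from the Tahoe source tree.
from dataclasses import dataclass, field

@dataclass
class ShareState:
    server_id: str
    share_num: int
    hashes_fetched: set = field(default_factory=set)       # hash-tree nodes we already have
    outstanding_blocks: dict = field(default_factory=dict)  # segnum -> time the request was sent
    demoted: bool = False                                    # pushed to the bottom of the list

class DownloadStateMachine:
    def __init__(self, k):
        self.k = k
        self.known_shares = []  # peer-selection adds entries; the list may keep growing

    def add_share(self, share):
        # a get_buckets response that arrives after segment download has begun
        # is simply appended as an alternate
        self.known_shares.append(share)

    def shares_in_use(self):
        # the k highest-priority shares that have not been demoted
        return [s for s in self.known_shares if not s.demoted][:self.k]
```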

warner changed title from download needs to be tolerant of lost peers to download: tolerate lost/missing peers 2008-08-27 01:33:34 +00:00
Author

#193 and #253 are probably related to this one

The Allmydata.com production grid experienced this problem today, when the storage server "prodtahoe7" failed in such a way that the other nodes kept waiting indefinitely for answers to their foolscap queries to that server. At least, we think that is why the downloads hung until we turned off prodtahoe7. However, I don't understand why the downloads continued to hang after the prodtahoe7 machine was powered off, until the clients that were using prodtahoe7 (in this case the webapi nodes) were restarted.

Shouldn't the absence of prodtahoe7 at the IP level have triggered the TCP connections to break the next time the clients tried to send packets, which should have triggered the foolscap connection to break, which should have triggered the download to abort?

Ah! But then even if that happened and that download were aborted, would the next download try to use prodtahoe7 storage server nodes again, and if it did, would it wait for a long time for a TCP connection attempt?

Anyway, we need to investigate in the logs of today's events to see exactly why the webapi nodes had to be restarted, after prodtahoe7 was gone, before they would start working again.

Author

It looks like prodtahoe7 had a RAID controller failure, or possibly several
simultaneous disk failures, and got stuck in a weird way: TCP connections
probably remained alive, but the Tahoe storage nodes were not responding to
queries. This is pretty close to the "silent connection loss" case, but
worse: TCP keepalives wouldn't tell you the connection was dead, because the
prodtahoe7 kernel was still running and responding with ACKs. So the fix
described above should improve application behavior in yesterday's prodtahoe7
problem, as well as in the more common close-the-laptop-and-walk-away
problem.

For download, this fix means a tradeoff between the setup work (i.e. hash
tree fetching) needed to start using a new share, against how long we want to
wait to distinguish between a slow server and a stuck one. I don't know what
sort of heuristic we should use for this: we must take into account slow
links and large segments, and remember that parallel segment requests will be
competing with each other.
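
One possible heuristic -- offered only as an assumption, not as what the code
does -- is to derive the "stuck" threshold from recently observed block-fetch
times, with an absolute floor so that slow links and large segments don't
trigger false alarms:

```python
# Illustrative heuristic only; Tahoe does not necessarily use this rule.
import statistics

def stuck_threshold(recent_block_times, factor=3.0, floor_seconds=10.0):
    """Return how long to wait before treating a block request as stuck.

    recent_block_times: durations (seconds) of recently completed block fetches.
    A slow link or a large segment raises the median, so the threshold scales
    with it automatically.
    """
    if not recent_block_times:
        return floor_seconds
    return max(floor_seconds, factor * statistics.median(recent_block_times))
```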

For upload, this is another time-vs-work tradeoff, but slightly trickier. If
we give up on the server early, during peer selection, then the consequences
are minor: we may put the share on a non-ideal server, such that the eventual
downloading client will have to search further around the ring to find the
share. If we are forced to give up on the server late, we must either give up
on that share (i.e. the file is now unhealthy, with perhaps 9 shares instead
of 10), or restart the upload from the beginning, or spend memory (or disk)
on holding all shares so that we have something to give to the replacement
server. Of these choices, I think I prefer giving up on the share (and
scheduling a re-upload, or a repair if the original data is not available in
a non-streaming place).

See also #193 and #253 and #521.

zooko changed title from download: tolerate lost/missing peers to download: tolerate lost/missing servers 2009-12-12 04:45:15 +00:00

I've observed this happening quite a lot on the allmydata.com prod grid. I haven't yet figured out exactly which server is responding strangely or what that server is doing wrong, but exactly one of the (currently) 89 servers on the prod grid fails to respond to the do-you-have-shares query and causes downloads to hang. Restarting the gateway node causes it to start downloading correctly, which means that whichever server it is that is behaving badly either doesn't connect to the gateway after the gateway restarts, or it behaves better after it has reconnected to the gateway.

This probably also affects upload, as mentioned in comment:365182, but there seems to be no separate ticket for that. (#782 is possibly relevant but not confirmed to have the same cause.)

We should probably have a test that simulates a hanging server, and/or a server that disconnects.
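
One way such a test double could be built (a hypothetical sketch; these names are
not from the test code that was eventually written) is to wrap the storage-server
reference so that remote calls can be made to hang forever or to fail as if the
connection had dropped:

```python
# Hypothetical test helper; the class and attribute names are assumptions.
from twisted.internet import defer
from foolscap.api import DeadReferenceError

class HangableServerWrapper:
    """Proxy a storage server so a test can simulate hangs and disconnects."""

    def __init__(self, original):
        self.original = original
        self.hung = False
        self.dead = False

    def callRemote(self, methname, *args, **kwargs):
        if self.dead:
            # simulate a server that was actively lost
            return defer.fail(DeadReferenceError("simulated disconnect"))
        if self.hung:
            # simulate a silent partition: the Deferred never fires
            return defer.Deferred()
        return self.original.callRemote(methname, *args, **kwargs)
```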

daira changed title from download: tolerate lost/missing servers to download/upload: tolerate lost or missing servers 2009-12-27 00:57:30 +00:00
Author

I just created #873 for the upload case. Both are important, but I'd like to leave this ticket specific for the download case: the code paths and necessary implementation details are completely different.

warner changed title from download/upload: tolerate lost or missing servers to download: tolerate lost or missing servers 2009-12-27 04:52:06 +00:00

Many of the problems that I've observed which I thought were a case of this ticket have actually turned out to be a case of #928 (start downloading as soon as you know where to get K shares). That is: it was not the case that a server failed and got into a hung state during a download. (I never could understand how this problem could be so common if it required this particular timing!) Instead it was the case that if a server failed and got into a hung state then all subsequent downloads would hang. This was happening quite a lot on the allmydata.com prod grid recently because servers were experiencing MemoryError and then going into this state.

I think that the original post was slightly imprecise. I think that download would correctly fail-over if a server disconnected during download (or if the server returned an error or if it dropped the TCP connection), but it would hang if the server stayed connected but didn't answer the requests at all. In fact, until the fix for #928 was committed, downloads would hang if there was such a stuck server on the grid at all, even if that server had been in its stuck state since before the download began and even if that server didn't have any of the shares that the download needed!

Okay, so the fix for #928 has been committed to trunk, which means that downloads now proceed even if there is a stuck server on the grid. But with the current version (changeset:ea3954372a06a36c), the download proceeds without knowing about all the shares that are out there, and the downloader currently ignores information about shares which arrives late.

Here is a patch in unified diff form which fixes this -- making download accept and use information that arrives after "stage 4" of download has begun -- and which also has incomplete changes to the unit tests to deterministically exercise this case.

Attachment p1.diff.txt (9941 bytes) added

Here is a version of my patch in which there is a new test named test_failover_during_stage_4. The intent of this test is:

1. Set servers 3 through 9 to the hung state.
2. Start the download.
3. As soon as stage 4 of download is reached, which means that the client got responses to get_buckets from servers 0, 1, and 2, unhang server 3 and cause server 2 to have a corrupted share.
4. Assert that the download completes successfully.

Oh, writing that makes me realize that server 2 might as well have the share corrupted before the download starts!

I'm not sure whether the current implementation of the test will unhang server 3 before the downloader finishes downloading all the shares from servers 0, 1, and 2. My intent is to test the case where the downloader does hear back from a new server after stage 4 has begun but before stage 4 has ended. I definitely do not want to add a delay to the downloader once it runs out of buckets in the hope that another bucket will come in. Brian is considering such tricky tactics for his post-1.6 downloader rewrite, but that's out of scope for this.

David-Sarah is currently implementing a method used in this patch named _corrupt_share_in.

Attachment p2.diff.txt (9881 bytes) added

Okay, here's a version of the tests which I think is correct, except that it doesn't have a "corrupt a share" method yet (David-Sarah is contributing that).

Attachment p3.diff.txt (9914 bytes) added

Attachment p4.diff.txt (12652 bytes) added

Attachment p4a.diff.txt (12642 bytes) added

Okay here is a complete version including tests. Thanks to David-Sarah for helping with the tests. Please review! (It is okay for David-Sarah to be the reviewer even though they helped with the tests.)

Attachment accept-late-buckets.darcspatch.txt (60208 bytes) added

Attachment accept-late-buckets2.darcspatch.txt (60852 bytes) added

That patch wasn't pyflakes-clean (unused variables in the test code); here is one that is:

Attachment davidsarah-current-tree.diff.txt (6334 bytes) added

Attachment accept-late-buckets3.darcspatch.txt (62039 bytes) added

There were a couple of bugs in that one. Here's one with no bugs in it!

Attachment accept-late-buckets4.darcspatch.2.txt (61102 bytes) added

The precondition checks that we added while debugging cause the code to fail under some tests, because in those tests the object is a fake ReadBucketProxy rather than a real one, so the precondition rejects it. This patch is just like accept-late-buckets3.darcspatch.txt except without those two checks.

Attachment fast-servers-first-0.darcspatch.txt (76475 bytes) added

Here is a patch which adds a new feature: remember the order servers answered and use the first servers first. Tests by David-Sarah.
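
The core of the idea can be sketched in a few lines (illustrative names only; this
is not the code in the attached patch):

```python
# Sketch only, not the code in fast-servers-first-0.darcspatch.txt.
import time

class ResponseOrder:
    """Remember how quickly each server answered the initial query."""

    def __init__(self):
        self._start = time.time()
        self._elapsed = {}  # server_id -> seconds until its reply

    def got_response(self, server_id):
        self._elapsed.setdefault(server_id, time.time() - self._start)

    def preference_order(self):
        # servers that answered first come first; non-responders are absent
        # and would only be consulted if they show up later
        return sorted(self._elapsed, key=self._elapsed.get)
```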

Committed changeset:3e4342ecb3625899 which makes it so that downloaders accept late-arriving shares and use them. Thanks to David-Sarah for help especially with the test!

Attachment fast-servers-first-1.darcspatch.txt (62592 bytes) added

fast-servers-first-1.darcspatch.txt doesn't pass the new test that David-Sarah wrote for it: allmydata.test.test_hung_server.HungServerDownloadTest.test_use_first_servers_to_reply, and also it causes this test to go from pass to fail:

```
allmydata.test.test_mutable
  Problems
    test_publish_all_servers_bad ... Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/new-preserve-order/src/allmydata/test/common_util.py", line 71, in done
    (which, expected_failure, res))
twisted.trial.unittest.FailTest: test_publish_all_servers_bad was supposed to raise <class 'allmydata.mutable.common.NotEnoughServersError'>, not get '<MutableFileNode 33b7b20 RW 2qvahvls>'
[FAIL]

===============================================================================
[FAIL]: allmydata.test.test_mutable.Problems.test_publish_all_servers_bad

Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/new-preserve-order/src/allmydata/test/common_util.py", line 71, in done
    (which, expected_failure, res))
twisted.trial.unittest.FailTest: test_publish_all_servers_bad was supposed to raise <class 'allmydata.mutable.common.NotEnoughServersError'>, not get '<MutableFileNode 33b7b20 RW 2qvahvls>'
```

Okay, I plan to release v1.6 without further work on this "use the fastest servers first" patch. Brian is going to completely rewrite the downloader after v1.6 -- hopefully this patch will inform his rewrite or serve as a benchmark to run against his new downloader.

zooko modified the milestone from eventually to 1.7.0 2010-02-27 06:41:39 +00:00
Author

FYI, #798 is the new downloader. It's coming along nicely. Almost passes a test or two.

If you like this ticket, you might also like the "Brian's New Downloader" bundle of tickets: #605 (two-hour delay to connect to a grid from Win32, if there are many storage servers unreachable), #800 (improve alacrity by downloading only the part of the Merkle Tree that you need), #809 (Measure how segment size affects upload/download speed.), #798 (improve random-access download to retrieve/decrypt less data), and #448 (download: speak to as few servers as possible).

Brian's New Downloader is now planned for v1.8.0.

zooko modified the milestone from 1.7.0 to 1.8.0 2010-05-08 22:49:00 +00:00

New Downloader is in 1.8, but I'm unclear to what extent it addresses this ticket. I think it's a partial fix for immutable downloads, is that right?

Author

The #798 new downloader (at least in the form that will probably appear in
tahoe-1.8.0) addresses some but not all of this ticket.

  • servers which disconnect during download: these ought to be handled
    perfectly: new servers will be located and spun up, necessary hashes will
    be retrieved, and the download should continue without a hitch
  • servers which are in a stuck state (e.g. a silent disconnect) before the
    download begins will be tolerated: DYHB requests to them will stall, but
    other servers will be queried, and the download proper will begin as soon
    as enough shares are located. There is a hard-coded 10 second timeout, and
    DYHB queries which are not answered within this time will be replaced with
    a new query. The downloader will allow 10 non-overdue queries to be
    outstanding at any given time.
  • servers which enter a stuck state after the DYHB query has been answered
    are not yet handled well. There is code to react to an "OVERDUE"
    state (by switching to new shares), but there is not yet any code to
    actually declare an OVERDUE state (I couldn't settle on a reasonable
    heuristic to distinguish between a stuck server and one that is merely
    slow).

The goals described in this ticket's description are still desirable:

  • have a list of peers, sorted by "goodness" (probably speed)
  • when a server hasn't responded in a while, move it to the bottom of the
    list
  • keep a couple of extra shares in reserve, to quickly fill in for a server
    that gets stuck

So we should at least keep this ticket open until the new downloader is
capable of declaring an OVERDUE state and thus becomes tolerant to servers
that get stuck after the DYHB queries. And probably the criteria for closing
it should be the implementation of the scheme where we have a list of shares
sorted by responsiveness.
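
For the missing piece, one possible shape (invented names; not the OVERDUE
machinery that exists in the #798 downloader) would be a periodic scan over the
outstanding block requests, flagging any that have exceeded whatever
slow-versus-stuck threshold is eventually chosen, so the existing reaction code
can switch to a reserve share:

```python
# Invented names; this is not the OVERDUE machinery from the #798 downloader.
import time

def find_overdue(outstanding, threshold_seconds, now=None):
    """Return the request ids that should be declared OVERDUE.

    outstanding maps request_id -> the time the block request was sent.
    threshold_seconds would come from whatever slow-vs-stuck heuristic is
    eventually chosen (see the earlier comments on this ticket).
    """
    now = time.time() if now is None else now
    return [req_id for req_id, started in outstanding.items()
            if now - started > threshold_seconds]
```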

zooko modified the milestone from 1.8.0 to eventually 2010-08-15 06:18:04 +00:00

Reference: tahoe-lafs/trac#287