upload needs to be tolerant of lost peers #17

Closed
opened 2007-04-28 19:10:27 +00:00 by warner · 4 comments
warner commented 2007-04-28 19:10:27 +00:00
Owner

When we upload a file, we can tolerate not having enough peers (or those peers not offering enough space), based upon a threshold named "shares_of_happiness". We want to place 100 shares by default, and as long as we can find homes for at least 75 of them, we're happy.

But in the current source:src/allmydata/encode.py, if any of those peers go away while we're uploading, the entire upload fails. (Worse yet, the failure is not reported properly: there are a lot of unhandled errbacks in the Deferreds in there.)

encode.Encoder._encoded_segment needs to be changed to count failures rather than allowing them to kill off the whole segment (and thus the whole file). When the encode/upload process finishes, it needs to return both the roothash and a count of how many shares were successfully placed, so that the enclosing upload.py code can decide whether it's done or whether it needs to try again.

At the moment, since we're bouncing storage nodes every hour to deal with the silent-lost connection issues, any upload that is in progress at :10 or :20 or :30 is going to fail in this fashion.
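As a rough illustration of the change being proposed here (not the actual encode.py code), here is a minimal Twisted sketch in which a lost peer costs us only its own share: failures are counted, and a segment fails outright only if we drop below shares_of_happiness. The names `send_segment`, `put_block`, and `landlords` are assumptions for this sketch.

```
# Illustrative sketch only: send one segment's blocks to every remaining
# peer, dropping peers that fail instead of letting one failure kill the
# whole segment (and thus the whole file).
from twisted.internet import defer

SHARES_OF_HAPPINESS = 75  # minimum number of placed shares we will accept


class NotEnoughPeersError(Exception):
    pass


def send_segment(landlords, blocks):
    """landlords: dict mapping shareid -> remote bucket writer (assumed API)."""
    outstanding = []
    for shareid, landlord in list(landlords.items()):
        d = landlord.put_block(blocks[shareid])  # hypothetical remote call

        def _drop(failure, shareid=shareid):
            # This peer is gone: forget it, but keep the segment going.
            landlords.pop(shareid, None)

        d.addErrback(_drop)
        outstanding.append(d)

    def _count(results):
        placed = len(landlords)
        if placed < SHARES_OF_HAPPINESS:
            raise NotEnoughPeersError("only %d shares still placed" % placed)
        return placed  # upload.py can use this to decide whether to retry

    return defer.DeferredList(outstanding).addCallback(_count)
```

When the whole encode finishes, the same count could be handed back alongside the roothash, which is what the enclosing upload.py code needs in order to decide whether it is done or should try again.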

tahoe-lafs added the major and defect labels 2007-04-28 19:10:27 +00:00
tahoe-lafs added the code label 2007-04-28 19:15:36 +00:00
tahoe-lafs added the critical label and removed the major label 2007-05-04 05:15:12 +00:00
warner commented 2007-05-30 00:48:14 +00:00
Author
Owner

Oh, I think it gets worse. From some other tests I was doing, it looks like if you lose **all** peers, the upload process goes into an infinite loop and slowly consumes more and more memory.

warner commented 2007-06-06 19:50:05 +00:00
Author
Owner

changeset:6bb9debc166df756 and changeset:f4c048bbeba15f51 should address this: now we keep going as long as we can still place 'shares_of_happiness' shares (which defaults to 75, in our 25-of-100 encoding). Log messages are generated whenever a peer is dropped, to indicate how close we are to giving up.

If we lose so many peers that we go below shares-of-happiness, the upload fails with a NotEnoughPeersError exception.
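As a very rough sketch (assumed names and log wording, not the code from those changesets), the drop-a-peer path might look like this: log how close we are to the threshold, and raise NotEnoughPeersError only once we dip below it.

```
# Illustrative sketch only: drop a lost peer, report how many landlords
# remain, and give up only when we fall below shares_of_happiness.
import logging

log = logging.getLogger("allmydata.upload")  # assumed logger name


class NotEnoughPeersError(Exception):
    pass


class Encoder:
    def __init__(self, landlords, shares_of_happiness=75):
        self.landlords = dict(landlords)  # shareid -> remote bucket writer
        self.shares_of_happiness = shares_of_happiness

    def _remove_shareholder(self, why, shareid):
        # Usable directly as a Deferred errback: 'why' receives the Failure.
        self.landlords.pop(shareid, None)
        remaining = len(self.landlords)
        log.warning("lost peer for share %d (%s): %d landlords left, "
                    "need %d to stay happy",
                    shareid, why, remaining, self.shares_of_happiness)
        if remaining < self.shares_of_happiness:
            raise NotEnoughPeersError(
                "only %d shares can still be placed, which is below the "
                "shares_of_happiness threshold of %d"
                % (remaining, self.shares_of_happiness))
```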

tahoe-lafs added the fixed label 2007-06-06 19:50:05 +00:00
warner closed this issue 2007-06-06 19:50:05 +00:00
warner commented 2008-01-26 00:15:28 +00:00
Author
Owner

Oops, it turns out that there is still a problem: if the peer is quietly lost before the upload starts, then the initial `storage.WriteBucketProxy.start` call (which writes a bunch of offsets into the remote share) will fail with some sort of connection-lost error (either when TCP times out, or when the storage server reconnects and replaces the existing connection). Failures in this particular method call are not caught in the same way as later failures, and any such failure will cause the upload to fail.

The task is to modify `encode.Encoder.start():213` where it says:

```
        for l in self.landlords.values():
            d.addCallback(lambda res, l=l: l.start())
```

to wrap the start() calls in the same kind of drop-that-server-on-error code that all the other remote calls use (a sketch of this follows below).

This might be the cause of #193 (if the upload was stalled waiting for the lost peer's TCP connection to close), although I kind of doubt it. It might also be the cause of #253.
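A minimal sketch of what that wrapping might look like, assuming a hypothetical `_remove_shareholder(failure, shareid)` helper that implements the usual drop-that-server-on-error behaviour (not necessarily the name used in encode.py):

```
# Illustrative sketch only: give each landlord's start() call the same
# drop-that-server-on-error handling as the later block writes, so a
# silently-lost peer costs us only its own share, not the whole upload.
from twisted.internet import defer


def start_landlords(self):
    d = defer.succeed(None)
    for shareid, l in list(self.landlords.items()):
        def _start(res, l=l, shareid=shareid):
            d2 = l.start()  # writes the offset table into the remote share
            # Hypothetical helper: drop this landlord on failure and check
            # the shares_of_happiness threshold, instead of failing the
            # whole upload.
            d2.addErrback(self._remove_shareholder, shareid)
            return d2
        d.addCallback(_start)
    return d
```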

tahoe-lafs added the code-encoding, major, 0.7.0 labels and removed the code, critical, fixed labels 2008-01-26 00:15:28 +00:00
tahoe-lafs modified the milestone from 0.3.0 to 0.9.0 (Allmydata 3.0 final) 2008-01-26 00:15:28 +00:00
warner reopened this issue 2008-01-26 00:15:28 +00:00
warner commented 2008-01-28 19:24:50 +00:00
Author
Owner

Fixed, in changeset:4c5518faefebc1c7. I think that's the last of them, so I'm closing out this ticket again.

tahoe-lafs added the fixed label 2008-01-28 19:24:50 +00:00
warner closed this issue 2008-01-28 19:24:50 +00:00
Reference: tahoe-lafs/trac-2024-07-25#17