tahoe-lafs/trac

UCWE when mapupdate gives up too early, then server errors require replacement servers #893

New issue

Open

opened 2010-01-10 07:27:51 +00:00 by warner · 11 comments

warner commented

2010-01-10 07:27:51 +00:00

mapupdate(MODE_WRITE) wants to find all shares, so they can all
be updated. Ideally, the shares are concentrated at the beginning
of the permuted peerlist, on the first N servers.
to avoid traversing the whole grid, mapupdate(MODE_WRITE) has a
heuristic: if we've seen enough shares, and we've also seen a
span of contiguous servers (in permuted order) that tell us they
do not have a share, then we stop the search. The size of this
span was intended to be N+epsilon (where 'epsilon' is a tradeoff
between performance and safety, and is set to k). Unfortunately
1.5.0 had a bug, and the span size was set to k+epsilon instead.

In the Incident:

mapupdate sent out 10 queries, but decided to finish before all
the responses had returned, because it found a boundary early.
I'm not entirely sure how the sharemap was shaped, but it looks
like it stopped with a span of 3 servers ("found our boundary,
11000"), where k=3 and N=10, and returned a sharemap with 8
shares in it, some doubled up (I think there were 5 servers
involved)
about 150ms after mapupdate finished, one more response came back
(from w6o6, with two shares, but was ignored
Publish starts, and sends out updates to 7 servers.
Unfortunately, two of these (both owned by secorp) experienced
"Permission Denied" errors when attempting to write out the new
shares, suggesting a configuration error (maybe the tahoe node
process is owned by the wrong user)
so Publish falls over to new servers. Some of the new servers it
picks suffer from the same error, so the fallover process repeats
a few times.
finally, it falls over to the server w6o6, and sends it a
new share, thinking that w6o6 has no shares (because the
servermap was not updated to include w6o6's late response).
w6o6 then responds with a writev failure, because it contains
shares that the test vector did not expect, causing a UCWE error
most of the shares were updated, so the write may have actually
happened, even though UCWE was raised.

The biggest problem with this failure is that it is persistent. We
don't record any information that would tell a subsequent operation
to look further for existing shares, so exactly the same thing will
happen the next time we try to modify or repair the directory.

If secorp's servers weren't throwing errors, then I think the
condition would eventually fix itself: new shares would be placed on
his servers, bridging the span of servers without shares, and then
later mapupdate calls would keep going until they'd really seen all
of the shares.

Recently (changeset:eb1868628465a243, 08-Dec-2009) I fixed mapupdate to use N+epsilon
instead of k+epsilon. But the incident report suggests that it
stopped with a span of only 3 no-share servers. Looking more closely
at the code, I think it only waits for a span of epsilon (not
k+epsilon or N+epsilon), and that changeset:eb1868628465a243 changed something
different. I don't know if the thing that was changed would have
prevented this issue or not. It's possible that this is a
manifestation of #547 (mapupdate triggers on a false boundary), or
#549, or one of the other problems described in #546.

In general, we need to query more servers. But even if we increase
the span size or epsilon or whatever, there will always be a weird
situtation that could be handled better if we queried more servers.
We'd like to have something more adaptive: if the code hits UCWE
because it didn't try hard enough, then it should try harder next
time.

How should we deal with this? We need something to persist from one
operation on a given mutable filenode to the next, some sort of hint
that says "Hey, last time we were surprised, so next time you should
look further". Or something that tells us that we learned about
shares on servers X+Y+Z, and so the next time we do a mapupdate, we
shouldn't consider it complete until we've gotten responses from
those servers (in addition to any others that we might decide to
query).

The most natural place to keep this state would be on the mutable
filenode instance. This would help with UCWE that occurs inside a
modify() call, because the same filenode is used for each retry, but
in general filenodes are pretty short-lived. We don't want to keep
the mutable filenode around in RAM forever. Maybe a LRU cache that
keeps filenodes around for a few minutes, so that users who
experience UCWE and retry the operation can benefit from recent
history.

A storage protocol that included "where-are-the-other-shares" hints
(#599) would help: this would improve the reliability of mapupdate,
since the persistent information would be kept on the storage
servers, next to the shares.

A publish process which rebalanced the shares (#232, or
#661/#543/#864) might help, by filling in the gaps, except that here
the gap was caused by a batch of servers all suffering from the same
configuration problem.

The right answer probably lies in having UCWE triggering an
immediate repair, and having repair fill in the gaps. But it'd be
nice if there were a way to stash some information on the shares
before the gaps that let later operations know that they should look
past the gap.

The Incidents that were reported in #877 (by Zooko, referring to mutable write errors experienced by nejucomo) indicate a thorny problem that is distinct from both #877 (caused by a reentrancy error) and #540 (caused by a logic bug that affects small grids where a publish wraps around the peer circle). Here's the setup: * mapupdate(MODE_WRITE) wants to find all shares, so they can all be updated. Ideally, the shares are concentrated at the beginning of the permuted peerlist, on the first N servers. * to avoid traversing the whole grid, mapupdate(MODE_WRITE) has a heuristic: if we've seen enough shares, and we've also seen a span of contiguous servers (in permuted order) that tell us they do not have a share, then we stop the search. The size of this span was intended to be N+epsilon (where 'epsilon' is a tradeoff between performance and safety, and is set to k). Unfortunately 1.5.0 had a bug, and the span size was set to k+epsilon instead. In the Incident: * mapupdate sent out 10 queries, but decided to finish before all the responses had returned, because it found a boundary early. I'm not entirely sure how the sharemap was shaped, but it looks like it stopped with a span of 3 servers ("found our boundary, 11000"), where k=3 and N=10, and returned a sharemap with 8 shares in it, some doubled up (I think there were 5 servers involved) * about 150ms after mapupdate finished, one more response came back (from `w6o6`, with two shares, but was ignored * Publish starts, and sends out updates to 7 servers. Unfortunately, two of these (both owned by secorp) experienced "Permission Denied" errors when attempting to write out the new shares, suggesting a configuration error (maybe the tahoe node process is owned by the wrong user) * so Publish falls over to new servers. Some of the new servers it picks suffer from the same error, so the fallover process repeats a few times. * finally, it falls over to the server `w6o6`, and sends it a new share, thinking that w6o6 has no shares (because the servermap was not updated to include w6o6's late response). * w6o6 then responds with a writev failure, because it contains shares that the test vector did not expect, causing a UCWE error * most of the shares were updated, so the write may have actually happened, even though UCWE was raised. The biggest problem with this failure is that it is persistent. We don't record any information that would tell a subsequent operation to look further for existing shares, so exactly the same thing will happen the next time we try to modify or repair the directory. If secorp's servers weren't throwing errors, then I think the condition would eventually fix itself: new shares would be placed on his servers, bridging the span of servers without shares, and then later mapupdate calls would keep going until they'd really seen all of the shares. Recently (changeset:eb1868628465a243, 08-Dec-2009) I fixed mapupdate to use N+epsilon instead of k+epsilon. But the incident report suggests that it stopped with a span of only 3 no-share servers. Looking more closely at the code, I think it only waits for a span of epsilon (not k+epsilon or N+epsilon), and that changeset:eb1868628465a243 changed something different. I don't know if the thing that was changed would have prevented this issue or not. It's possible that this is a manifestation of #547 (mapupdate triggers on a false boundary), or #549, or one of the other problems described in #546. In general, we need to query more servers. But even if we increase the span size or epsilon or whatever, there will always be a weird situtation that could be handled better if we queried more servers. We'd like to have something more adaptive: if the code hits UCWE because it didn't try hard enough, then it should try harder next time. How should we deal with this? We need something to persist from one operation on a given mutable filenode to the next, some sort of hint that says "Hey, last time we were surprised, so next time you should look further". Or something that tells us that we learned about shares on servers X+Y+Z, and so the next time we do a mapupdate, we shouldn't consider it complete until we've gotten responses from those servers (in addition to any others that we might decide to query). The most natural place to keep this state would be on the mutable filenode instance. This would help with UCWE that occurs inside a modify() call, because the same filenode is used for each retry, but in general filenodes are pretty short-lived. We don't want to keep the mutable filenode around in RAM forever. Maybe a LRU cache that keeps filenodes around for a few minutes, so that users who experience UCWE and retry the operation can benefit from recent history. A storage protocol that included "where-are-the-other-shares" hints (#599) would help: this would improve the reliability of mapupdate, since the persistent information would be kept on the storage servers, next to the shares. A publish process which rebalanced the shares (#232, or #661/#543/#864) might help, by filling in the gaps, except that here the gap was caused by a batch of servers all suffering from the same configuration problem. The right answer probably lies in having UCWE triggering an immediate repair, and having repair fill in the gaps. But it'd be nice if there were a way to stash some information on the shares before the gaps that let later operations know that they should look past the gap.

warner added the

labels 2010-01-10 07:27:51 +00:00

warner added this to the undecided milestone 2010-01-10 07:27:51 +00:00

warner commented

2010-01-10 07:37:03 +00:00

Author

looking at the changeset:eb1868628465a243 change more closely, I think it could be improved. The change increases self.num_peers_to_query, but that's only actually used by MODE_READ. The finish-criteria code for MODE_WRITE (ServermapUpdater._check_for_done) should not let the process finish unless we've received answers from at least self.num_peers_to_query, just like MODE_READ (or if we've run out of servers to query).

The gap size was always defined to be epsilon, but the minimum num_peers_to_query value was intended to make sure we'd looked far enough.

looking at the changeset:eb1868628465a243 change more closely, I think it could be improved. The change increases `self.num_peers_to_query`, but that's only actually used by MODE_READ. The finish-criteria code for MODE_WRITE (`ServermapUpdater._check_for_done`) should not let the process finish unless we've received answers from at least `self.num_peers_to_query`, just like MODE_READ (or if we've run out of servers to query). The gap size was always defined to be epsilon, but the minimum `num_peers_to_query` value was intended to make sure we'd looked far enough.

warner commented

2010-01-10 07:43:21 +00:00

Author

I don't know if this is really a 1.6.0 or critical issue, but since #877 from whence it spawned was both, I guess I should set it to the same. Feel free to downgrade or defer it. I don't think we can find a good long-term solution in the 1.6.0 timeframe, but maybe the num_peers_to_query change could fit in.

warner added

p/critical

and removed

p/major

labels 2010-01-10 07:43:21 +00:00

warner modified the milestone from undecided to 1.6.0

2010-01-10 07:43:21 +00:00

zooko commented

2010-01-10 17:58:31 +00:00

Maybe the uploader should have been unsurprised to find shares on w6o6. "Surprises" are when a server told you something during the mapupdate and then told you something different during the next phase. If you didn't ask a server during the mapupdate, then you shouldn't be surprised what he says in the next phase.

I guess this is because I think of surprises as indicating UncoordinatedWriteError, rather than indicating "the world of storage servers are in a stranger state than you assumed".

What do you think of that, Brian? If you already decided to upload version V+1 of your file, and then during the process you find some previously unused storage servers which are holding shares <= V then you should just unsurprisedly resend your write request updated to say "Yes I know you have version U, and please replace it with this version V+1.". Now if you find previously unused servers that have a version > V+1 or a version ==V and with a different root hash then you should stop (but not with UncoordinatedWriteError).

Maybe the uploader should have been unsurprised to find shares on `w6o6`. "Surprises" are when a server told you something during the mapupdate and then told you something different during the next phase. If you didn't ask a server during the mapupdate, then you shouldn't be surprised what he says in the next phase. I guess this is because I think of surprises as indicating UncoordinatedWriteError, rather than indicating "the world of storage servers are in a stranger state than you assumed". What do you think of that, Brian? If you already decided to upload version `V+1` of your file, and then during the process you find some previously unused storage servers which are holding shares `<= V` then you should just unsurprisedly resend your write request updated to say "Yes I know you have version U, and please replace it with this version V+1.". Now if you find previously unused servers that have a version `> V+1` or a version `==V` and with a different root hash then you should stop (but not with UncoordinatedWriteError).

zooko commented

2010-01-10 18:01:13 +00:00

I think extending the finish criteria for MODE_WRITE (ServermapUpdater._check_for_done) would be a good improvement (it would have solved nejucomo's problem) and is probably feasible for Tahoe-LAFS v1.6.

I think extending the finish criteria for MODE_WRITE (`ServermapUpdater._check_for_done`) would be a good improvement (it would have solved nejucomo's problem) and is probably feasible for Tahoe-LAFS v1.6.

warner commented

2010-01-10 21:07:41 +00:00

Author

I think that discovering a version (=V+1,!=newroothash) or (>V+1) should fire UCWE.. sounds to me like the exact definition of UCWE. Why do you think this should use a different error?

I'm not sure what it should do if it sees (=V,!=oldroothash). This indicates that a UCWE occurred in the past, and now there are two equal-version contenders for the file. Ideally the earlier write should have noticed this and performed an immediate repair. For the current publish, it should stop, but I'll agree that UCWE might not be the most accurate way to report it. Dunno.

hm. Yeah, I suppose that if we were writing out version V+1, then it would be safe to overwrite any version (<V) or (=V,=oldroothash) that we saw. This would require either a clever test vector or some code to receive the "test failed, so no write" response and send out a new request with a new test vector. Not trivial, but not too hard. I think that discovering a version (=V+1,!=newroothash) or (>V+1) should fire UCWE.. sounds to me like the exact definition of UCWE. Why do you think this should use a different error? I'm not sure what it should do if it sees (=V,!=oldroothash). This indicates that a UCWE occurred in the past, and now there are two equal-version contenders for the file. Ideally the earlier write should have noticed this and performed an immediate repair. For the current publish, it should stop, but I'll agree that UCWE might not be the most accurate way to report it. Dunno.

zooko commented

2010-01-10 23:22:29 +00:00

hm. Yeah, I suppose that if we were writing out version V+1, then it would be safe to overwrite any version (<V) or (=V,=oldroothash) that we saw. This would require either a clever test vector or some code to receive the "test failed, so no write" response and send out a new request with a new test vector.

Hm, the server doesn't currently support an operation that will apply a test-and-set if the share is present or else just apply the set if the share is absent, does it? source:src/allmydata/storage/server.py?rev=4118#L475 Oh look! It does! "# compare the vectors against an empty share, in which all reads return empty strings" So the current test vector should already work for this case, right? The client just needs to send a test vector saying if the server has a version less than the client's version then overwrite.

I think that discovering a version (=V+1,!=newroothash) or (>V+1) should fire UCWE.. sounds to me like the exact definition of UCWE. Why do you think this should use a different error?

Nejucomo expostulated: "UncoordinatedWriteError! I'm the only person in the universe who knows this write cap!"

And he was right -- there was never an uncoordinated write event. The client can't tell for sure, but if you never saw a server claim that it had version V and then claim that it had version V+u, then you should perhaps blame your problem on random network problems or bugs sooner than on UncoordinatedWriteError. Could we have a related exception named something like MultipleVersionsExist of which UncoordinatedWriteError is the subclass that we use when we know that either some other client just now wrote when we were writing or else some server is doing a "roll-forward" in order to make it look like some other client has done so?

I'm not sure what it should do if it sees (=V,!=oldroothash). This indicates that a UCWE occurred in the past, and now there are two equal-version contenders for the file. Ideally the earlier write should have noticed this and performed an immediate repair. For the current publish, it should stop, but I'll agree that UCWE might not be the most accurate way to report it.

This is the same case as server's V > client's V, right? So having the client send a testv which was < V should handle both cases.

> hm. Yeah, I suppose that if we were writing out version V+1, then it would be safe to overwrite any version (<V) or (=V,=oldroothash) that we saw. This would require either a clever test vector or some code to receive the "test failed, so no write" response and send out a new request with a new test vector. Hm, the server doesn't currently support an operation that will apply a test-and-set if the share is present or else just apply the set if the share is absent, does it? source:src/allmydata/storage/server.py?rev=4118#L475 Oh look! It does! "`# compare the vectors against an empty share, in which all reads return empty strings`" So the current test vector should already work for this case, right? The client just needs to send a test vector saying if the server has a version less than the client's version then overwrite. > I think that discovering a version (=V+1,!=newroothash) or (>V+1) should fire UCWE.. sounds to me like the exact definition of UCWE. Why do you think this should use a different error? Nejucomo expostulated: "UncoordinatedWriteError! I'm the only person in the universe who knows this write cap!" And he was right -- there was never an uncoordinated write event. The client can't tell for *sure*, but if you never saw a server claim that it had version `V` and then claim that it had version `V+u`, then you should perhaps blame your problem on random network problems or bugs sooner than on `UncoordinatedWriteError`. Could we have a related exception named something like `MultipleVersionsExist` of which `UncoordinatedWriteError` is the subclass that we use when we *know* that either some other client just now wrote when we were writing or else some server is doing a "roll-forward" in order to make it look like some other client has done so? > I'm not sure what it should do if it sees (=V,!=oldroothash). This indicates that a UCWE occurred in the past, and now there are two equal-version contenders for the file. Ideally the earlier write should have noticed this and performed an immediate repair. For the current publish, it should stop, but I'll agree that UCWE might not be the most accurate way to report it. This is the same case as server's V > client's V, right? So having the client send a testv which was `< V` should handle both cases.

zooko commented

2010-01-13 23:53:39 +00:00

So there are three proposals for code changes in this ticket.

The finish-criteria code for MODE_WRITE (ServermapUpdater._check_for_done) should not let the process finish unless we've received answers from at least self.num_peers_to_query, just like MODE_READ (or if we've run out of servers to query).
When sending a share to a server that you do not have any information about in your servermap, then go ahead and include a test vector saying "This share is intended to overwrite any share with version less than V.". Brian: does that sound like it could cause some weird problem? It sounds pretty safe to me.
Invent a new exception named MultipleVersionsExisted and raise that instead of UncoordinatedWriteError if you didn't see actual evidence of a server changing its store between your mapupdate and your write. I earlier proposed that UncoordinatedWriteError could be a subtype of MultipleVersionsExisted, but I'm not sure whether it should be a subtype or just a separate type. Either way is fine with me.

So there are three proposals for code changes in this ticket. 1. The finish-criteria code for MODE_WRITE (`ServermapUpdater._check_for_done`) should not let the process finish unless we've received answers from at least `self.num_peers_to_query`, just like MODE_READ (or if we've run out of servers to query). 2. When sending a share to a server that you do *not* have any information about in your servermap, then go ahead and include a test vector saying "This share is intended to overwrite any share with version less than V.". Brian: does that sound like it could cause some weird problem? It sounds pretty safe to me. 3. Invent a new exception named `MultipleVersionsExisted` and raise that instead of `UncoordinatedWriteError` if you didn't see actual evidence of a server changing its store between your mapupdate and your write. I earlier proposed that `UncoordinatedWriteError` could be a subtype of `MultipleVersionsExisted`, but I'm not sure whether it should be a subtype or just a separate type. Either way is fine with me.

zooko commented

2010-01-14 00:02:41 +00:00

This might be related to #899, newly reported by Kyle Markley and Andrej Falout.

zooko modified the milestone from 1.6.0 to 1.7.0

2010-01-28 15:04:36 +00:00

daira commented

2010-02-15 19:35:24 +00:00

Not sure if we can fix this for 1.6.1, but it's definitely a candidate.

daira modified the milestone from 1.7.0 to 1.6.1

2010-02-15 19:35:24 +00:00

zooko commented

2010-02-16 05:32:09 +00:00

We're bumping this out of v1.6.1 just because we're not 100% sure that proposals 2 and 3 from comment:375412 are actually easy and safe.

We're bumping this out of v1.6.1 just because we're not 100% sure that proposals 2 and 3 from [comment:375412](/tahoe-lafs/trac/issues/893#issuecomment-375412) are *actually* easy and safe.

zooko modified the milestone from 1.6.1 to 1.7.0

2010-02-16 05:32:16 +00:00

zooko commented

2010-05-26 14:43:29 +00:00

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

Rows
Columns