Incomplete ServerMap triggers UncoordinatedWriteError upon mutable Publish #1795

Open
opened 2012-07-25 03:04:43 +00:00 by jean · 2 comments
Owner

This error has been seen in the wild while working on the Tamias system, which uses tahoe-lafs as a storage layer. It shows up much more often in our testing environment, where many clients connect to and leave the network at high frequency (client nodes, not storage nodes).

Before overwriting a mutable file, the client builds a servermap using MODE_WRITE. This mode does not query all servers; it stops querying once 'epsilon' consecutive servers have stated that they do not have a share. When this happens ('hit boundary' in the log), the servermap is considered done if all servers to the left of the boundary have answered.
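
As a rough illustration of that termination rule, here is a minimal Python sketch. The names (EPSILON, update_is_done) and the answer encoding are made up for this example and do not correspond to the actual ServermapUpdater code; only the boundary logic is the point.

```python
EPSILON = 3  # stop after this many consecutive "no share" answers (illustrative value)

def update_is_done(answers):
    """answers: per-server results in permuted order.
    None = not answered yet, True = has at least one share,
    False = answered and holds no shares."""
    consecutive_empty = 0
    for index, answer in enumerate(answers):
        consecutive_empty = consecutive_empty + 1 if answer is False else 0
        if consecutive_empty >= EPSILON:
            # "hit boundary": the update is only considered done if every
            # server to the left of the boundary has already answered
            return all(a is not None for a in answers[:index + 1])
    return False

# The boundary is hit in both cases, but an unanswered server to its left
# keeps the update running.
print(update_is_done([True, True, False, False, False]))  # True
print(update_is_done([True, None, False, False, False]))  # False
```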

In some corner cases, all of those servers have answered, but the timing is such that a server is marked as having a share while its share information has not been processed yet. Because there are several concurrent calls to check_for_done, one of them may decide that the servermap update can stop running, which prevents the processing of that last share.
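
A minimal reconstruction of that ordering using Twisted Deferreds, which is what the servermap update is built on. Every name here (ToyUpdate, _process_share, and so on) is hypothetical rather than actual Tahoe-LAFS code; only the sequence of events matters: the server is recorded as good before its share is processed, so a check_for_done that runs in between stops the update and the late share is discarded.

```python
from twisted.internet import defer


class ToyUpdate:
    def __init__(self):
        self.good_servers = set()   # servers believed to hold shares
        self.shares = {}            # server -> processed share info
        self.done = False           # set by check_for_done

    def _process_share(self, raw, server):
        if self.done:
            # update already stopped: the share is silently dropped,
            # leaving a partial servermap
            return None
        self.shares[server] = raw
        return raw

    def check_for_done(self):
        # simplified: the epsilon boundary was hit and every server to the
        # left of it appears to have answered, so stop the update
        self.done = True


update = ToyUpdate()
answer = defer.Deferred()
answer.addCallback(update._process_share, "server-5")

update.good_servers.add("server-5")   # marked as good as soon as the answer arrives...
update.check_for_done()               # ...a concurrent check stops the update...
answer.callback(b"share data")        # ...and the share is processed too late

print(update.good_servers)            # {'server-5'}
print(update.shares)                  # {}  -> the partial servermap
```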

This results in a partial servermap. When the Publish operation starts, it may select the last server - the one missing from the servermap - as a candidate for the missing share. It then issues a testv that checks for the absence of a share. That testv fails because a share is in fact present, and an UncoordinatedWriteError (UCW) is triggered.
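
Conceptually, the failing check looks like the sketch below. The real operation is the storage server's test-and-set call on the slot; the vector shapes here are an approximation for illustration, not the exact wire format.

```python
def apply_testv(existing_share, testv):
    """existing_share: bytes already held by the server (b'' if no share).
    testv: list of (offset, length, operator, specimen) checks."""
    for offset, length, operator, specimen in testv:
        data = existing_share[offset:offset + length]
        if operator == "eq" and data != specimen:
            return False
    return True

# Publish, believing the share is absent, asserts "nothing at offset 0".
absence_testv = [(0, 1, "eq", b"")]

print(apply_testv(b"", absence_testv))                 # True: write proceeds
print(apply_testv(b"existing share", absence_testv))   # False: UCW is raised
```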

This can be seen in the attached log starting from event 6750: the boundary is found at 6898, and 6899 stops the servermap update. Event 6900 shows the partial servermap, and events 6903 and 6904 show the processing of the last share being filtered out because the servermap update has already been stopped. Events 6918 and 6920 show the servermap before (partial) and after (unfortunately choosing the 'hidden' server whose answer was discarded). This leads to the eventual UCW at event 6955, triggered by the failed testv at 6953.

In our testing environment, we use the attached workaround, which moves the addition to the good_servers list to the very bottom of the DeferredList that is built per server. This is expected to cause problems when servers hold multiple shares, but it is only a temporary fix anyway.
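
For reference, the idea behind the workaround in the same toy terms as the race sketch above (again hypothetical names; the real change is the attached 904-byte diff): marking the server as good becomes the last callback of its per-server Deferred chain, so a check_for_done that runs before the share has been processed does not yet count the server as answered and keeps the update alive.

```python
from twisted.internet import defer


class ToyUpdate:
    def __init__(self, expected_servers):
        self.expected = set(expected_servers)
        self.good_servers = set()
        self.shares = {}
        self.done = False

    def _process_share(self, raw, server):
        if not self.done:
            self.shares[server] = raw
        return server

    def _mark_good(self, server):
        # workaround: only record the server once its share is in the map
        self.good_servers.add(server)
        self.check_for_done()
        return server

    def check_for_done(self):
        # simplified completion rule: every expected server accounted for
        if self.good_servers == self.expected:
            self.done = True


update = ToyUpdate(["server-5"])
answer = defer.Deferred()
answer.addCallback(update._process_share, "server-5")
answer.addCallback(update._mark_good)    # good_servers updated last

update.check_for_done()                  # premature check: good_servers still empty
answer.callback(b"share data")           # share processed, then server marked good
print(update.done, update.shares)        # True {'server-5': b'share data'}
```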

tahoe-lafs added the
code-mutable
normal
defect
1.9.2
labels 2012-07-25 03:04:43 +00:00
tahoe-lafs added this to the undecided milestone 2012-07-25 03:04:43 +00:00
Author
Owner

Attachment changeset_r9db2f65ebb8eaa4f6094f2f99eff928ba285f5f5.diff (904 bytes) added

workaround

Author
Owner

Attachment ucw_text_transcript.log (14887 bytes) added

Transcript of the incident report

tahoe-lafs added
major
and removed
normal
labels 2012-07-25 04:25:27 +00:00
tahoe-lafs modified the milestone from undecided to 1.10.0 2012-07-25 04:25:27 +00:00
tahoe-lafs modified the milestone from 1.10.0 to 1.11.0 2012-09-04 16:59:49 +00:00