tahoe-lafs/trac

reduce write-collision clobbering: use 'eq' instead of 'le' in test+set #347

Closed

opened 2008-03-13 00:53:57 +00:00 by warner · 2 comments

warner commented

2008-03-13 00:53:57 +00:00

Over lunch today, as we discussed uncoordinated writes in mutable files,
Zooko astutely identified a significant problem: two clients who write to a
mutable file at the same time have a high probability of losing one writer's
changes. Specifically, writers Alice and Bob write at the same time, and much
of the time Alice will believe that her changes have been successfully
placed, Bob will see an UncoordinatedWriteError, and the file will wind
up with Bob's changes but not Alice's. We think there is an equal probability
that Bob will see the UCW but the file will have Alice's changes (but not
Bob's), in which case the usual retry response to a UCW will result in a file
with both changes.

The general problem is that our design goals have been shifting around. When
we first designed the SMDF update algorithm, we assumed that the Prime
Coordination Directive (i.e. "don't do that") was enough, and that bad
behavior in the face of uncoordinated writes was excusable. Therefore our
goal was to maximize the health of some version of the file (not
necessarily the most recent one) when a UCW occurred. However, the past few
months have shown that coordinating writes is a hassle, and developers of the
next few layers above mutable files would be much happier if tahoe were to
handle this requirement for them. In addition, we defined a number of
fallback strategies, all of which serve to improve the behavior when UCW
occurs.

At some point, we started thinking that these fallbacks were a suitable
replacement for write coordination. This is wrong.

We designed the fallbacks to improve file health, not to obtain the
consistency/atomic-update characteristics that developers would like to see
in mutable files. Specifically it would be nice if there were a reliable
test-and-set operation for mutable files, such that you could be sure that
either 1) your version wins, or 2) your version lost. But our fallbacks were
not designed to do this.

The specific problem is that our testv-and-readv-and-writev calls are being
made with a test vector comparison operator that instructs the storage server
to apply the write if the old checkstring is "le" (less than or equal to) the
new one. The sequence number is at the MSB end of the checkstring, followed
by the root hash. The most common case is for the new checkstring to have a
seqnum that is one higher than the old one, and for the old roothash to be
the same as the one that we saw on that server during our survey pass. This
case is accepted by the test vector, so the write goes through. (Note that we
compare the old roothash after the testv+writev returns, rather than putting
it into the test vector.)
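To make the comparison concrete, here is a minimal sketch of the 'le' test. The byte layout (8-byte big-endian seqnum followed by a 32-byte root hash) and the names `make_checkstring` and `le_test_passes` are assumptions for illustration, not the actual share format or storage-server API.

```python
import struct

def make_checkstring(seqnum: int, roothash: bytes) -> bytes:
    # Assumed layout: seqnum at the MSB end, then the root hash.
    return struct.pack(">Q32s", seqnum, roothash)

def le_test_passes(old_checkstring: bytes, new_checkstring: bytes) -> bool:
    # The write is applied when the old checkstring is <= the new one.
    # Python compares bytes lexicographically, so the seqnum dominates
    # and the root hash only breaks ties between equal seqnums.
    return old_checkstring <= new_checkstring

v0 = make_checkstring(0, b"\x55" * 32)
v1 = make_checkstring(1, b"\x01" * 32)
print(le_test_passes(v0, v1))  # True: seqnum went from 0 to 1
```

Because the seqnum leads, a lower root hash does not matter when the seqnum increases; the root hash only decides the outcome when two writers pick the same seqnum.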

However, if some other writer Bob has raced ahead of us/Alice, they will have
written a new version of the share into place: say they replaced the previous
v0 with v1b, and now we're trying to write our v1a into the same place. The
roothash that we retrieve will be different than what we surveyed (since it
is v1b instead of v0), so we'll signal UCW when we're done. But 50% of the
time, our v1a roothash will be higher than Bob's v1b roothash, and our write
vectors will be applied despite the UCW. In this case, we will overwrite
Bob's data, and Bob (who saw no errors and has retired the write) will have
gone away, resulting in the loss of Bob's changes.

In the other 50% of the time, Bob's v1b roothash will be higher than our v1a
roothash, which means that our write vectors will be rejected, preserving
Bob's changes. Alice will see the UCW and retry, merging her changes with
Bob's, and everybody will be happy.

So the net result is that overlapping writes have a 50% chance of losing one
set of changes.
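The race can be sketched for a single share as follows. The `Share` class, the checkstring layout, and `race_le` are hypothetical stand-ins for the real server and publish code, and the sketch detects UCW by comparing the whole old checkstring rather than just the roothash.

```python
import struct

def checkstring(seqnum, roothash):
    # Assumed layout: 8-byte big-endian seqnum, then 32-byte root hash.
    return struct.pack(">Q32s", seqnum, roothash)

class Share:
    """One share slot on a hypothetical storage server."""
    def __init__(self, cs):
        self.checkstring = cs

    def test_and_write_le(self, new_cs):
        # Apply the write if old <= new; always report what was there.
        old = self.checkstring
        if old <= new_cs:
            self.checkstring = new_cs
        return old

def race_le(alice_roothash, bob_roothash):
    v0 = checkstring(0, b"\x00" * 32)
    v1a = checkstring(1, alice_roothash)
    v1b = checkstring(1, bob_roothash)
    share = Share(v0)
    bob_saw = share.test_and_write_le(v1b)    # Bob raced ahead
    alice_saw = share.test_and_write_le(v1a)  # Alice surveyed v0 earlier
    winner = "alice" if share.checkstring == v1a else "bob"
    # Each writer signals UCW if the old value is not what they surveyed.
    return winner, alice_saw != v0, bob_saw != v0

# Alice's roothash is higher: she sees UCW but clobbers Bob anyway.
print(race_le(b"\xff" * 32, b"\x01" * 32))  # ('alice', True, False)
# Bob's roothash is higher: Alice sees UCW and Bob's share survives.
print(race_le(b"\x01" * 32, b"\xff" * 32))  # ('bob', True, False)
```

In the first call Bob sees no error, yet his version is gone by the time Alice retries, which is the change-losing half of the 50/50 split.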

We're guessing that we can fix this by changing the test vector to be (eq
oldseqnum oldroothash) instead of (le newseqnum newroothash). This would give
the same UCW function (i.e. any combinations that would have resulted in a
UCW in the 'le' case will also result in a UCW in the 'eq' case), but should
refrain from doing the share writes in all of the UCW cases instead of only a
subset. This would result in Alice's writes leaving Bob's alone. In the 50%
change-losing case, Alice would see the UCW as before (and Bob would not),
but v1b would still be in place, allowing Alice to retry and merge without
losing the data.
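A minimal sketch of the proposed 'eq' test for a single share, under the same caveats: the `Share` class, the checkstring layout, and `race_eq` are hypothetical, not the real storage-server interface.

```python
import struct

def checkstring(seqnum, roothash):
    # Assumed layout: 8-byte big-endian seqnum, then 32-byte root hash.
    return struct.pack(">Q32s", seqnum, roothash)

class Share:
    """One share slot on a hypothetical storage server."""
    def __init__(self, cs):
        self.checkstring = cs

    def test_and_write_eq(self, expected_old, new_cs):
        # Apply the write only if the share still holds what we surveyed.
        old = self.checkstring
        if old == expected_old:
            self.checkstring = new_cs
        return old

def race_eq(alice_roothash, bob_roothash):
    v0 = checkstring(0, b"\x00" * 32)
    v1a = checkstring(1, alice_roothash)
    v1b = checkstring(1, bob_roothash)
    share = Share(v0)
    bob_saw = share.test_and_write_eq(v0, v1b)    # Bob raced ahead
    alice_saw = share.test_and_write_eq(v0, v1a)  # rejected: share is v1b
    winner = "alice" if share.checkstring == v1a else "bob"
    return winner, alice_saw != v0, bob_saw != v0

# Whichever roothash is higher, Bob's share stays and Alice sees the UCW.
print(race_eq(b"\xff" * 32, b"\x01" * 32))  # ('bob', True, False)
print(race_eq(b"\x01" * 32, b"\xff" * 32))  # ('bob', True, False)
```

The UCW signal is the same as with 'le', but the write is withheld in both orderings, so Alice can retry and merge on top of v1b.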

The reason we didn't make this choice earlier is that, without recovery,
not overwriting the shares will leave them in a less healthy state than
overwriting them with a convergent version. Basically it is easier for
everybody to agree on which version should be reinforced by having a strong
total ordering on those versions, which the 'le' case provides. Without this,
different clients will be requesting different things, and there will be a
higher chance of the update phase finishing with a variety of shares (and in
this case, more variety means less robust).
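The robustness tradeoff can be illustrated with a toy model of four servers that each receive both writes in some order. Everything here (the layout, `final_shares`, the arrival orders) is an assumption made for illustration; real publish behavior involves more state than this.

```python
import struct

def checkstring(seqnum, roothash):
    # Assumed layout: 8-byte big-endian seqnum, then 32-byte root hash.
    return struct.pack(">Q32s", seqnum, roothash)

V0 = checkstring(0, b"\x00" * 32)
V1A = checkstring(1, b"\xaa" * 32)  # Alice's version
V1B = checkstring(1, b"\xbb" * 32)  # Bob's version (higher roothash)

def final_shares(arrival_orders, rule):
    """arrival_orders: per server, the order in which the two writes land."""
    shares = []
    for order in arrival_orders:
        current = V0
        for new in order:
            if rule == "le":
                accept = current <= new   # total ordering: highest wins
            else:
                accept = current == V0    # 'eq': both writers surveyed V0
            if accept:
                current = new
        shares.append(current)
    return shares

orders = [(V1A, V1B), (V1B, V1A), (V1A, V1B), (V1B, V1A)]
le_result = final_shares(orders, "le")
eq_result = final_shares(orders, "eq")
print(len(set(le_result)))  # 1: every server converges on V1B
print(len(set(eq_result)))  # 2: first arrival wins, so versions are mixed
```

With 'le' every server ends up on the same version regardless of arrival order; with 'eq' the shares split between the two versions, which is the reduced health that recovery would need to repair.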

If we assume that recovery will be performed, then this loss of robustness
isn't too serious, and switching to the 'eq' form seems like a better idea.

We need to analyze this carefully first. We spent several weeks going back
and forth on this design when we first made it, so a change like this is
deserving of more discussion. Also, we need to reevaluate the algorithm in
light of our shifting design goals: if we want mutable files to give good
update properties without coordination, then we must keep that in mind when
we review the design.

warner added the c/code-encoding labels 2008-03-13 00:53:57 +00:00

warner added this to the undecided milestone 2008-03-13 00:53:57 +00:00

warner commented

2008-03-25 22:50:33 +00:00

This issue is making us nervous, so we'd like to see it be a priority for a soon-after-1.0 release.

warner modified the milestone from undecided to 1.1.0

2008-03-25 22:50:33 +00:00

warner commented

2008-04-24 23:30:16 +00:00

The new mutable-file refactoring (moving to 'servermaps') now does eq instead of le, so this issue is resolved. The recovery code tracked in #272 may wish to have 'le' instead, but it seems better to avoid the confusing write in the first place and let recovery deal with the slightly reduced file health instead.

warner added

and removed

c/code-encoding

labels 2008-04-24 23:30:16 +00:00

warner closed this issue

2008-04-24 23:30:16 +00:00


Reference: tahoe-lafs/trac#347
