handle disk-full situations properly #426


Closed

opened 2008-05-30 00:35:10 +00:00 by warner · 2 comments

warner commented

2008-05-30 00:35:10 +00:00

We need to add code that implements a "min-free-space=" disk usage model.
Specifically, you should be able to tell a Tahoe node that it must refuse new
leases if its remaining disk space is less than some threshold.

We need this in place before any of the allmydata.com prodnet storage servers
get close to running out of space, because otherwise the out-of-space error
raised during a write() call will interact badly with the client's upload
algorithm: worst case, the upload will fail, but unless the client restarts,
the servers will claim that the upload is still in progress, and the client
won't try to use other servers.

The plan is to have a config file of some sort that specifies a minimum free
space. The server will use 'df' or its Python equivalent to measure the free
space in storage/ before each allocate_bucket() call, and if the free space
minus the request size is below this threshold, the lease request will be
rejected.

We also need to make sure that mutable file lease requests can be rejected
properly.
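
A minimal sketch of the free-space check described above, assuming the stdlib `shutil.disk_usage()` serves as the "Python equivalent" of 'df'. The function name `accepts_lease` and its parameters are hypothetical, not part of the Tahoe code; `free_bytes` exists only so the rule can be exercised without touching a real disk.

```python
import shutil


def accepts_lease(storage_dir, request_size, min_free_space, free_bytes=None):
    """Return True if a lease of request_size bytes may be accepted.

    free_bytes may be supplied for testing; otherwise it is measured
    with shutil.disk_usage(), which reports what 'df' would show.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(storage_dir).free
    # Reject when serving the request would leave less than the threshold.
    return free_bytes - request_size >= min_free_space
```

A server would call something like this before each allocate_bucket() and return a rejection instead of a bucket writer when it yields False.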

warner added the

labels 2008-05-30 00:35:10 +00:00

warner added this to the 1.1.0 milestone 2008-05-30 00:35:10 +00:00

warner self-assigned this 2008-05-30 00:35:10 +00:00

warner commented

2008-06-02 06:03:13 +00:00

Author

code review:

Server side 1.0 Behavior

In 1.0, if the server were to run out of room, or if the partition it is
using for NODEDIR/storage/ were to be mounted read-only, or if
NODEDIR/storage/ were chmoded o-r, then:

* remote_allocate_buckets(), remote_renew_lease(), and remote_cancel_lease()
  would raise an IOError exception
* remote_slot_testv_and_readv_and_writev() would raise IOError instead
  of writing (if the test vectors did not match, it would return the usual
  non-exception response)
* BucketWriter.remote_write would also raise IOError, if the
  allocate_buckets succeeded but we ran out of space later.

If NODEDIR/readonly_storage exists, then:

* remote_allocate_buckets() would return the usual non-exception
  response (i.e. an empty 'bucketwriters' dict), indicating that the lease
  is rejected
* remote_renew_lease() and remote_cancel_lease() would succeed
* remote_slot_testv_and_readv_and_writev() would succeed.

Our proposal for 1.1 is to transform the IOError that is triggered by writes
to a full or readonly filesystem into a well-defined remote exception, and to
react to NODEDIR/readonly_storage by raising the same IOError. In addition,
we plan to add a "df"-based reserved-space threshold, and if this plus the
size of all current reservations is exceeded, to raise the same IOError.
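
As an illustration of what "transform the IOError into a well-defined remote exception" could look like, here is a sketch under stated assumptions: `StorageFullError` and `write_share` are hypothetical names, and the set of errno values treated as "full or readonly" is a guess rather than something the ticket specifies.

```python
import errno


class StorageFullError(Exception):
    """Hypothetical well-defined error a server could report to clients."""


# errno values that suggest a full, read-only, or unwritable filesystem.
# Which codes belong here is an assumption made for this sketch.
_NO_SPACE_ERRNOS = {errno.ENOSPC, errno.EROFS, errno.EACCES}


def write_share(do_write, data):
    """Call do_write(data), translating disk-full style errors."""
    try:
        return do_write(data)
    except OSError as e:  # IOError is an alias of OSError in Python 3
        if e.errno in _NO_SPACE_ERRNOS:
            raise StorageFullError(str(e)) from e
        raise  # unrelated I/O errors propagate unchanged
```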

http://allmydata.org/pipermail/tahoe-dev/2008-May/000630.html contains some
relevant discussion, as well as some API plans for post-1.1

So the requirement is that all supported client versions must tolerate an
exception during write.

Client side 1.0 Behavior

In 1.0, if an immutable upload receives an exception during
allocate_buckets(), a log.UNUSUAL message is logged ("got error during peer
selection"), but otherwise the peer selection code will proceed normally. If
an exception is received during share write, another log.UNUSUAL message is
logged, and the shareholder is dropped. However, since this takes place after
peer selection, no new shareholder will be found to take their place, and
that share will not be uploaded, resulting in a slightly unhealthy file
(fewer than N shares present). If this happens to enough shares, the
shares_of_happiness threshold will not be met, and the upload will fail.
Since uploads do not automatically abandon their partial shares, the
server will still see a non-zero reference count for the BucketWriter
object, so the partial share data will remain in
NODEDIR/storage/shares/incoming/, and therefore it is likely that the next
allocate_buckets() call will fail. However, the partial shares in incoming/
will cause allocate_buckets to believe that someone else is currently
uploading those shares, and the client will treat them as "alreadygot", which
means it will not attempt to find new (better) homes for them. So, worst case,
the first upload will fail, the second upload will appear to succeed, but the
file will not actually be retrievable from the grid. Badness.

If a mutable publish receives an exception during
remote_slot_testv_and_readv_and_writev, the unfortunate DeferredList created
by Publish._send_shares() will fill with (False, Failure) pairs, and the lack
of code to detect this condition means that the publish will appear to
succeed when in fact the file is still in its original state.
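
The missing check can be sketched as follows. A Twisted DeferredList fires with a list of (success, result) pairs; to stay self-contained, this sketch works on plain tuples rather than real Deferreds, and the helper name `split_results` is hypothetical.

```python
def split_results(results):
    """Separate a list of (success, value) pairs into two lists.

    Returns (successes, failures), so that a caller can notice that
    some share writes failed instead of treating the publish as done.
    """
    successes = [value for (ok, value) in results if ok]
    failures = [value for (ok, value) in results if not ok]
    return successes, failures
```

A publish that gets back any failures could then retry those shares elsewhere or report an error, rather than appearing to succeed.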

Client side 1.1 (current trunk) Behavior

The immutable upload code in 1.1 is the same as in 1.0. A storage server
which discovers that it is full after allocate_buckets will cause silent
failures.

The mutable upload code in 1.1 is new. The servermap update phase has no way
to ask if the server will accept a new share or not, but the publish phase
uses a full peerlist, and will fall back to later peers if earlier ones have
problems. The IOError will cause a log.UNUSUAL event to be recorded, but
otherwise peer-selection will work correctly. Since mutable share writes are
performed by a single remote_slot_testv_and_readv_and_writev call (instead of
being broken up into allocate, write, and close calls like immutable shares),
they are not vulnerable to the problems that will occur with immutable files
and late exceptions. Mutable file publish in the face of IOError will require
multiple roundtrips, though, since we must wait until the publish phase to
determine which peers will help. I expect this to make the publish phase
require 2 RTT instead of 1, bringing the total from 2 to 3.

Out-of-space exceptions for initial mutable-file creation should be tolerated
well; however, out-of-space during subsequent modification calls is a problem.
The client will detect the error and find another server to put the new
(larger) share on, but it does not then remove the old (smaller) share from
the server that raised IOError. As a result, the old version will still be
there, and once this happens to several primary servers, rollback will occur
(i.e. the first few k+epsilon shares that the client sees will be old ones,
so it won't see the later version).
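
A deliberately simplified model of that rollback, assuming a reader that looks only at the first k+epsilon shares in peer order and believes the highest sequence number it finds there. The real servermap logic is more involved; `version_seen` is a hypothetical name.

```python
def version_seen(seqnums_in_peer_order, k, epsilon):
    """Return the highest sequence number among the first k+epsilon shares."""
    window = seqnums_in_peer_order[: k + epsilon]
    return max(window)
```

With k=3 and epsilon=1, a grid whose first four servers still hold stale shares, `[1, 1, 1, 1, 2, 2, 2]`, yields version 1, even though version 2 exists further down the peer list.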

Necessary Changes

We need to reduce the chance that (immutable) allocate_buckets will succeed
but a later write() call will fail, since that will cause significant
problems. Likewise, assuming that we can't get rid of all 1.0 clients for a
while, we need to reduce the chance that mutable r_s_t_a_r_a_w() will get an
exception.

To do this, we should set NODEDIR/readonly_storage on storage servers that
are getting close to full (say, with about 10GB to spare). That will cause
allocate_buckets() to start rejecting shares, avoiding failures in write().
readonly_storage does not yet affect r_s_t_a_r_a_w(), so clients will
continue to write mutable shares to the somewhat-readonly servers.

The next phase is to get rid of all the 1.0 peers, to avoid the bad behavior
that occurs when they experience an exception during publish.

Then we can change the server-side storage code in trunk to partially respect
NODEDIR/readonly_storage by rejecting new mutable shares (raising an
exception) but have it continue to accept modifications of existing shares.
This will allow 1.1 clients to behave well, while still avoiding the problems
that occur when 1.1 clients get errors while modifying existing shares.

Next, we change the client-side mutable publish code in trunk to be able to
move shares (specifically give it the ability to delete shares). This
requires new server-side methods. Publish should respond to an out-of-space
error by locating a server which can hold the share, uploading it there,
then deleting the old one. Once all clients are able to do this, it will
become safe to allow servers to raise an out-of-space exception in
r_s_t_a_r_a_w.
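
The "move the share" behavior might look roughly like this. Everything here is illustrative: `FakeServer` is an in-memory stand-in, and `OutOfSpace`, `publish_share`, and the `delete` method are hypothetical names for the new client and server capabilities the paragraph calls for.

```python
class OutOfSpace(Exception):
    """Hypothetical out-of-space error raised by a server."""


class FakeServer:
    """In-memory stand-in for a storage server (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.shares = {}

    def write(self, sharenum, data):
        # Space used by every share except the one being overwritten.
        used = sum(len(d) for n, d in self.shares.items() if n != sharenum)
        if used + len(data) > self.capacity:
            raise OutOfSpace()
        self.shares[sharenum] = data

    def delete(self, sharenum):
        self.shares.pop(sharenum, None)


def publish_share(sharenum, data, current, candidates):
    """Write to `current`; on OutOfSpace, move the share elsewhere.

    Returns the server that ends up holding the new share.
    """
    try:
        current.write(sharenum, data)
        return current
    except OutOfSpace:
        pass
    for server in candidates:
        try:
            server.write(sharenum, data)
        except OutOfSpace:
            continue
        # Delete the stale copy only after the new one is stored.
        current.delete(sharenum)
        return server
    raise OutOfSpace("no server could hold share %d" % sharenum)
```

Deleting the old share only after the new write succeeds means a failed move leaves the previous version in place rather than losing the share.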

Then, we can change the server-side code to fully respect readonly_storage,
except that we need to change its meaning: something more like "stop getting
bigger". The new flag must allow the deletion of mutable shares, and could
possibly allow modifications to them as long as the shares do not get bigger.
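
One possible reading of the "stop getting bigger" rule, written as a pure decision function. This is a sketch of the policy as described here, not existing server code; `may_write_mutable` and its parameters are hypothetical, and a deletion is modelled as a write with `new_size` of 0.

```python
def may_write_mutable(readonly, share_exists, old_size, new_size):
    """Decide whether a mutable-share write is allowed.

    readonly     -- True when the "stop getting bigger" flag is set
    share_exists -- True when the share is already stored on this server
    old_size     -- current size of the share in bytes (0 if absent)
    new_size     -- size after the write (0 models a deletion)
    """
    if not readonly:
        return True
    if not share_exists:
        return False  # no new shares while the flag is set
    # Existing shares may shrink, stay the same size, or be deleted.
    return new_size <= old_size
```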

Eventually, we want the server to pay attention to its free space (the 'df'
reserved threshold) and reject allocation requests when they would cause this
threshold to be exceeded.


warner commented

2008-06-04 01:02:32 +00:00

Author

Ok, so the plan is:

* set readonly_storage on the prodnet storage servers when we have, say, 20GB left on each
* that will stop immutable shares, but mutable shares will continue to arrive
* get rid of all the 1.0 clients out there
* modify the servers to raise an exception when creating a new mutable share (when
  readonly_storage is set), but continue to allow modifications to existing mutable shares.
  This will reduce the rate of inbound data to practically nothing, and if we make this change
  with perhaps 10GB left, we can probably survive in this state for years.

Later, we'll be overhauling the storage API to handle all of this better. We'll probably deploy that change through the introducer (so that 1.1 clients will see different storage objects than newer clients).

So, we don't really need to make any changes to 1.1 to make it behave well according to this plan, so I'm closing this ticket.

