handle out-of-disk-space condition #871

Open

opened 2009-12-25 21:04:32 +00:00 by zooko · 12 comments

zooko commented

2009-12-25 21:04:32 +00:00

How does a Tahoe-LAFS node handle it when it runs out of disk space? This happens somewhat frequently with the allmydata.com nodes because they are configured to keep about 10 GB of space free (in order to allow updates to mutable shares, using the `reserved_space` configuration). When someone uses the storage servers as a web gateway (they are all configured to serve as web gateways), sometimes the download cache fills up the remaining 10 GB and causes the download to fail. The cache then doesn't get cleaned up, and whenever the node runs it gets out-of-disk-space problems such as being unable to open the `twistd.log` file. I will open another ticket about the fact that the cache isn't getting cleaned up, but this ticket is about making the Tahoe-LAFS node fail gracefully and with a useful error message when there is no disk space.

zooko added the

labels 2009-12-25 21:04:32 +00:00

zooko added this to the undecided milestone 2009-12-25 21:04:32 +00:00

warner commented

2009-12-26 02:59:09 +00:00

heh, I think you mean "fail gracefully without an error message".. where would the message go? :)

More seriously though, this is a tricky situation. A lot of operations can continue to work normally. We certainly want storage server reads to keep working, and these should never require additional disk space. Many client operations should work: full immutable downloads are held entirely in RAM (since we do streaming downloads and pause the process until the HTTP client accepts each segment), and small uploads are entirely RAM. Large uploads (either mutable or immutable) cause twisted.web to use a tempfile, and random-access immutable downloads currently use a tempfile. All mutable downloads are RAM based, as are all directory operations.

I suppose that when the log stops working due to a full disk, it would be nice if we could connect via 'flogtool tail' and find out about the node's predicament. The easiest way to emit a message that will be retrievable a long time later is to emit one at a very high severity level. This will trigger an incident, which won't be writable because the disk is full, so we need to make sure foolscap handles that reasonably.

I hesitate to suggest something so complex, but perhaps we should consider a second form of reserved-space parameter, which is applied to the non-storage-server disk consumers inside the node. Or maybe we could track down the non-storage-server disk consumers and make them all obey the same reserved-space parameter that the storage server tries to obey. With this sort of feature, the node would fail sort-of gracefully when the reserved-space limit was exceeded, by refusing to accept large uploads or perform large random-access downloads that would require more disk space. We'd have to decide what sorts of logging would be subject to this limit. Maybe a single Incident when the threshold was crossed (which would be logged successfully, using some of the remaining space), would at least put notice of impending space exhaustion on the disk where it could be found by operators later.
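
One way to picture the second suggestion, where the non-storage-server disk consumers obey the same limit, is a small guard around tempfile creation. This is only a sketch: `InsufficientSpaceError` and `open_guarded_tempfile` are hypothetical names, not Tahoe-LAFS APIs, and it assumes Python 3's `shutil.disk_usage` is an acceptable way to measure free space.

```python
import shutil
import tempfile


class InsufficientSpaceError(Exception):
    """Raised when a write would eat into the reserved space (hypothetical)."""


def open_guarded_tempfile(directory, expected_size, reserved_space):
    """Open a tempfile in `directory` only if `expected_size` bytes fit
    without pushing free space below `reserved_space`."""
    free = shutil.disk_usage(directory).free
    if free - expected_size < reserved_space:
        raise InsufficientSpaceError(
            "refusing to use %d bytes: only %d free, %d reserved"
            % (expected_size, free, reserved_space))
    return tempfile.TemporaryFile(dir=directory)
```

A large upload or random-access download would call this before spooling to disk and turn `InsufficientSpaceError` into a refusal. The check is advisory: another process can still fill the disk between the check and the write, so the write itself also has to tolerate failure.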


daira commented

2009-12-27 04:10:46 +00:00

If the fix causes operations other than share upload to respect the `reserved_space` setting, then there should still be enough space to log the failure. (There can be a slightly smaller `reserved_space` limit for writing to the logfile.)

warner commented

2009-12-27 04:58:03 +00:00

well, unless the full disk is the result of some other non-Tahoe process altogether, which is completely ignorant of tahoe's reserved_space concept. Gotta plan for the worst..

zooko commented

2009-12-27 21:18:03 +00:00

So your idea is to make all of the node operations respect the `reserved_space` parameter except for the logging operations, and then add a (high severity) log message showing that the `reserved_space` limit has been reached? That sounds good. Oh, yeah but as you say, what should the node do when there isn't any disk space? What would be complicated about triggering a very high-severity incident when an out-of-disk-space condition is detected? That sounds straightforward to me, and as far as I understand foolscap, an investigator who later connected with a `flogtool tail` would then see that high-severity incident report, right?

warner commented

2009-12-27 21:55:03 +00:00

Yes to all of that. I hadn't been thinking of two separate messages, but maybe that makes sense.. one when `reserved_space` is exceeded the first time, another when, hm, well when `disk_avail==0` (or `disk_avail<REALLYSMALL`) but since we'd already be guarding all our writes with `reserved_space`, I don't know where exactly we'd be checking for the second threshold.

Anyways, the requirement on Foolscap is that its "Incident Reporter" (which is the piece that tries to write the .flog file into BASEDIR/logs/incidents/) must survive the out-of-disk condition without breaking logging or losing the in-RAM copy of the incident. As long as that incident is in RAM, a `flogtool tail` process should see it later. (I just added [foolscap#144](http://foolscap.lothar.com/trac/ticket/144) to track this requirement).

The only other thing I'd want to think about is how to keep the message (or messages) from being emitted over and over. The obvious place to put this message would be in the storage server (where it tests disk-space-remaining against `reserved_space`) and the cachey thing (where we're going to add code to do the same). Should there be a flag of some sort to say "only emit this message once"? But, if something resolves the too-full condition and then, a month later, it gets full again, would we want the message to be re-emitted?

It almost seems like we'd want a switch that the operator resets when they fix the overfull condition, sort of like the "Check Engine" light on your car's dashboard that stays on until the mechanic fixes everything. (or, assuming your mechanic is a concurrency expert and has a healthy fear of race conditions, they'll turn off the light and *then* fix everything).

Maybe the rule should be that if you see this incident, you should do something to free up space, and then restart the node (to reset the flag).
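
The requirement on the incident reporter can be illustrated with a toy recorder. This is not Foolscap's code or API: `IncidentRecorder` and its methods are made up for illustration, and it writes plain text rather than .flog files. It only shows the shape of the behaviour being asked for, namely that a failed disk write must not discard the in-RAM copy.

```python
import collections
import os


class IncidentRecorder:
    """Toy stand-in for an incident reporter (hypothetical, not Foolscap)."""

    def __init__(self, incident_dir, max_in_ram=10):
        self.incident_dir = incident_dir
        # Bounded buffer so that incidents stay available to a later observer.
        self.in_ram = collections.deque(maxlen=max_in_ram)
        self.write_failures = 0

    def record(self, name, text):
        # Keep the in-RAM copy first, so that nothing below can lose it.
        self.in_ram.append((name, text))
        path = os.path.join(self.incident_dir, name + ".txt")
        try:
            with open(path, "w") as f:
                f.write(text)
        except OSError:
            # Disk full (ENOSPC), missing directory, etc.: count it and carry on.
            self.write_failures += 1
            return False
        return True
```

The return value lets the caller notice that the on-disk copy is missing without treating that as fatal.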

daira commented

2009-12-27 23:27:06 +00:00

The flag should be reset when the free space is observed to be above a threshold (`reserved_space` plus constant) when we test it. I think there's no need to poll the free space -- testing it when we are about to write something should be sufficient. There's also no need to remember the flag across restarts.

warner commented

2009-12-27 23:49:15 +00:00

So, something like this?:

```
did_message = False
def try_to_write():
    global did_message  # assigned below, so it must be declared global
    if freespace() > reserved_space+hysteresis_constant:
        did_message = False
    if freespace() > reserved_space:
        do_write()
    else:
        if not did_message:
            did_message = True
            log_message()
```

daira commented

2009-12-28 00:09:39 +00:00

Replying to [warner](/tahoe-lafs/trac/issues/871#issuecomment-375021):

> So, something like this? [...]

Yes. Nitpicks:

* it should only make the OS call to get free space once.
* this shouldn't be duplicated code, so it would have to return a boolean saying whether to do the write rather than actually doing it.
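
Folding both nitpicks into the earlier pseudocode might give something like the following. It is a sketch under assumptions: `SpaceGuard`, `get_free_space`, and `hysteresis` are illustrative names, and the free-space query is passed in as a callable so that the caller decides how to ask the OS.

```python
import logging

log = logging.getLogger(__name__)


class SpaceGuard:
    """Decides whether a write may proceed; logs once per low-space episode."""

    def __init__(self, get_free_space, reserved_space, hysteresis):
        self.get_free_space = get_free_space  # callable returning free bytes
        self.reserved_space = reserved_space
        self.hysteresis = hysteresis
        self.did_message = False  # deliberately not persisted across restarts

    def may_write(self):
        free = self.get_free_space()  # the only free-space query per check
        if free > self.reserved_space + self.hysteresis:
            self.did_message = False  # space recovered: re-arm the message
        if free > self.reserved_space:
            return True
        if not self.did_message:
            self.did_message = True
            log.error("free space %d is at or below reserved_space %d",
                      free, self.reserved_space)
        return False
```

Each writer would then do `if guard.may_write(): do_write()`, sharing one guard (built with, say, `lambda: shutil.disk_usage(path).free`) so that the message is emitted once per episode rather than once per consumer.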

daira commented

2010-01-16 00:47:16 +00:00

"Making all of the node operations respect the `reserved_space` parameter" includes #390 ('readonly_storage' and 'reserved_space' not honored for mutable-slot write requests).

daira modified the milestone from undecided to 1.7.0

2010-02-01 20:01:18 +00:00

daira modified the milestone from 1.7.0 to 1.6.1

2010-02-15 19:53:10 +00:00

zooko commented

2010-02-16 05:24:16 +00:00

Bumping this from v1.6.1 because it isn't a regression and we have other tickets to do in v1.6.1.

zooko modified the milestone from 1.6.1 to 1.7.0

2010-02-16 05:24:16 +00:00

zooko commented

2010-05-15 03:55:49 +00:00

This is a "reliability" issue, meaning that it is one of those things that developers can get away with ignoring most of the time because most of the time they aren't encountering the conditions which cause this issue to arise. Therefore, it's the kind of ticket that I value highly so that we don't forget about it and allow users to suffer the consequences. But, v1.7 is over and I'm moving this to "eventually" instead of to v1.8 because I'm not sure of the priority of this ticket vs. the hundreds of other tickets that I'm not looking at right now, and because I don't want the "bulldozer effect" of a big and growing pile of tickets getting pushed from one Milestone to the next. :-)
