Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information #755

Open

opened 2009-07-11 21:00:14 +00:00 by zooko · 28 comments

zooko commented

2009-07-11 21:00:14 +00:00

If I do a deep-check on a directory, I start getting results reported on the web page showing the files and subdirectories within that directory. Reloading (or waiting for the automatic self-reload) shows more and more results, until one of the subdirectories is unrecoverable, in which case the web page containing the deep-check results is replaced with a web page saying only this:

```
UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.
```

To close this ticket, make it so that I can still see all the other results that have already been generated, plus further results about other files and subdirectories that haven't yet been checked, even while there is an unrecoverable subdirectory present.

I'm using the current trunk: 1.4.1-r3982.

Brian: are you willing to take this ticket?

zooko added the

labels 2009-07-11 21:00:14 +00:00

zooko added this to the 1.5.0 milestone 2009-07-11 21:00:14 +00:00

warner was assigned by zooko

2009-07-11 21:00:14 +00:00

warner commented

2009-07-11 23:28:09 +00:00

yeah, I'll work on this. Basically traversal failures during a deep-check or deep-repair operation should increment a counter and move on, instead of throwing an exception and stopping the walker. I don't know if I can finish it in time for 1.5.0 though.
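A minimal sketch of the "count it and move on" behaviour described above. It is not Tahoe's actual walker: `Node`, `UnrecoverableError`, and `deep_check` are hypothetical stand-ins, and the real traversal is asynchronous, which this synchronous version ignores.

```python
class UnrecoverableError(Exception):
    """Hypothetical stand-in for allmydata's UnrecoverableFileError."""


class Node:
    """Hypothetical tree node; `children` is None for plain files."""

    def __init__(self, name, children=None, broken=False):
        self.name = name
        self.children = children
        self.broken = broken

    def list_children(self):
        # Reading an object fails if its shares cannot be retrieved.
        if self.broken:
            raise UnrecoverableError(self.name)
        return self.children or []


def deep_check(root):
    """Walk the tree, recording failures instead of aborting on them."""
    report = {"checked": [], "unrecoverable": []}
    stack = [(root.name, root)]
    while stack:
        path, node = stack.pop()
        try:
            children = node.list_children()
        except UnrecoverableError:
            # Record *which* object failed, then keep walking the rest.
            report["unrecoverable"].append(path)
            continue
        report["checked"].append(path)
        for child in children:
            stack.append((path + "/" + child.name, child))
    return report


tree = Node("root", [
    Node("a.txt"),
    Node("lost-dir", children=[Node("hidden.txt")], broken=True),
    Node("docs", children=[Node("b.txt")]),
])
report = deep_check(tree)
```

In this sketch the report still lists `root/a.txt` and everything under `root/docs`, while `root/lost-dir` appears under `unrecoverable` with its path, so the user learns which object was lost. Its children cannot be visited, because the directory itself is unreadable.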

warner added

c/code-dirnodes

and removed

c/code-frontend-web

labels 2009-07-11 23:28:09 +00:00

zooko commented

2009-07-15 05:24:36 +00:00

Author

This isn't really a blocker for v1.5.0.

zooko modified the milestone from 1.5.0 to eventually

2009-07-15 05:24:36 +00:00

zooko commented

2009-08-11 13:54:27 +00:00

Author

[On the mailing list](http://allmydata.org/pipermail/tahoe-dev/2009-August/002593.html) Ludo reported:

```
$ tahoe deep-check
ERROR: UnrecoverableFileError(no recoverable versions)
[Failure instance: Traceback: <class 'allmydata.mutable.common.UnrecoverableFileError'>: no recoverable versions
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/base.py:757:runUntilCurrent
/nix/store/nk39m80fi7ll7460713djzw3qzwgb4kr-python-foolscap-0.4.2/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py:26:_turn
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:243:callback
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:312:_startRunCallbacks
--- <exception caught here> ---
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:328:_runCallbacks
/nix/store/yj6q079b58rfnnf8g70ib5vaah6gxlhq-tahoe-1.5.0/lib/python2.5/site-packages/allmydata_tahoe-1.5.0-py2.5.egg/allmydata/mutable/filenode.py:312:_once_updated_download_best_version
```

Is this an example of the issue in this ticket?

By the way, see also #583 (repairer: test cancel, upload failure, download failure).

zooko commented

2009-12-31 16:16:49 +00:00

Author

I just got bitten by this bug again. I have a directory (on the volunteergrid) that has an unrecoverable subdirectory in it. When I do a deep check in the WUI, it shows useful information about the other contents of the directory until it reaches that subdirectory, at which point I lose the other information. Also, the resulting error message doesn't tell me any identifying information about *which* file or directory was unrecoverable!

```
UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.
```

daira modified the milestone from eventually to 1.7.0

2010-02-02 03:08:44 +00:00

zooko commented

2010-02-14 20:36:22 +00:00

Author

This is persistently causing problems for me. I have several important directory structures in which some of the directories or files are sometimes unrecoverable. I really need to be able to see information about the rest of them even at these times. Raising priority to `critical` to remind myself that I really care about this.

zooko added

p/critical

and removed

p/major

labels 2010-02-14 20:36:22 +00:00

daira modified the milestone from 1.7.0 to 1.6.1

2010-02-15 18:51:02 +00:00

daira commented

2010-02-15 19:50:05 +00:00

Unifying this with #880; this ticket now covers both CLI and WUI.

daira changed title from ~~if there is an unrecoverable subdirectory, the web deep-check report loses other information~~ to if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information

2010-02-15 19:50:05 +00:00

zooko commented

2010-02-16 04:16:22 +00:00

Author

This might be too ambitious to finish for v1.6.1. I would like to get v1.6.1 released this coming weekend of 2010-02-20 so that people who have started packaging or deploying v1.6.0 have the option of quickly upgrading to v1.6.1 before their packages/deployments of v1.6.0 spread too far.

However, I'm leaving it in the Milestone v1.6.1 for now because I don't object to fixing it in v1.6.1.

zooko commented

2010-02-22 05:04:34 +00:00

Author

We're not going to fix this in time for v1.6.1. Hopefully in time for v1.7.0!

zooko modified the milestone from 1.6.1 to 1.7.0

2010-02-22 05:04:34 +00:00

zooko modified the milestone from 1.7.0 to eventually

2010-05-16 23:40:04 +00:00

daira modified the milestone from eventually to soon

2010-05-17 02:15:24 +00:00

daira commented

2010-10-28 23:14:39 +00:00

This is one of our more commonly encountered usability problems, so I think it should be a priority for 1.9.0.

daira modified the milestone from soon to 1.9.0

2010-10-28 23:14:39 +00:00

francois commented

2010-11-01 11:12:49 +00:00

Owner

I'm willing to try to fix this bug.

francois commented

2010-11-20 23:42:41 +00:00

Owner

Attachment 755-fix-for-review.diff (4524 bytes) added


francois commented

2010-11-20 23:44:34 +00:00

Owner

The patch [attachment:755-fix-for-review.diff](/tahoe-lafs/trac/attachments/000078ac-5234-2fa4-40fb-1e63d44e96a7) is how I intend to fix this bug. The associated tests are still being worked on.

francois commented

2010-11-21 22:46:52 +00:00

Owner

Attachment patch-755.darcs.diff (30858 bytes) added


francois commented

2010-11-21 22:48:49 +00:00

Owner

The patch [attachment:patch-755.darcs.diff](/tahoe-lafs/trac/attachments/000078ac-5234-2fa4-40fb-43c067161683) contains the fix for this issue and associated tests.

daira self-assigned this 2011-01-01 21:19:51 +00:00

daira modified the milestone from 1.9.0 to 1.8.2

2011-01-06 00:31:29 +00:00

warner commented

2011-01-07 05:31:20 +00:00

Good patch! I like the approach of making `filenode.check_and_repair()` signal inability to repair by returning `CheckAndRepairResults.repair_successful=False` instead of by throwing an exception. A few things I'd like to see changed:

* we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to `healthy=False recoverable=False needs_rebalancing=False`, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results. In particular, we shouldn't scare users into thinking that repair failures of "scratched" files (unhealthy but recoverable) indicate unrecoverable files: this makes benign things like `UnhappinessError` look like data loss. This should be fixed in both mutable and immutable files.
* the newly-enabled test in `test_repairer.Repairer.test_harness` (which previously got a `self.shouldFail()`) should be slightly enhanced to check the return value of `check_and_repair()`. We should verify that it has `crr.repair_attempted=True`, `crr.repair_successful=False`, and `crr.post_repair_results.recoverable=False`
* we should add a similar test for mutable files that have had 8 shares deleted. There's something awfully close in `test_mutable.Repair.test_unrepairable_1share` .. it should be changed to use `self._fn.check_and_repair()` instead of `self._fn.repair()`. To be honest, I'm not sure why that test was passing before, because from what I can tell it should have been behaving the same way as immutable repair on an unrecoverable file.
* it's probably worth checking the code coverage when we exercise `test_mutable` and make sure the new code is getting run
* do we have any tests that confirm deep-repair on a tree with an unrecoverable file (or directory) makes it through to the end without an errback? We probably do but I'd like to be sure.. probably something in `test_deepcheck` exercises this.
* I see `test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()` asserts that an unrecoverable dirnode causes the traversal to halt. Is this what we want? Is this ticket about making sure an unrecoverable *file* doesn't halt a deep-repair, or about an unrecoverable *dirnode*? (broken dirnodes are more significant than files, because it means you've probably lost access to even more data). We certainly want the deep-traversal to keep going and repair more things, but we also need to make sure the user learns about the dead dirnode.

Otherwise, looks great! With those few changes we can land this one for 1.8.2!
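As an illustration of the "return a result instead of raising" approach and of the assertions requested for the test, here is a small self-contained sketch. The classes are simplified, hypothetical stand-ins that only borrow attribute names from this thread; they are not the real `allmydata` classes, and leaving the post-repair results equal to the pre-repair results on failure is just one of the policies under discussion.

```python
class CheckResults:
    """Hypothetical stand-in carrying two of the flags discussed above."""

    def __init__(self, healthy, recoverable):
        self.healthy = healthy
        self.recoverable = recoverable


class CheckAndRepairResults:
    """Hypothetical stand-in; attribute names follow the review comments."""

    def __init__(self, pre_repair_results):
        self.pre_repair_results = pre_repair_results
        # Until a repair succeeds, the best estimate is the pre-repair state.
        self.post_repair_results = pre_repair_results
        self.repair_attempted = False
        self.repair_successful = False
        self.repair_failure = None


def check_and_repair(pre, repair):
    """Attempt repair of an unhealthy object; report failure in the result."""
    crr = CheckAndRepairResults(pre)
    if pre.healthy:
        return crr
    crr.repair_attempted = True
    try:
        crr.post_repair_results = repair()
        crr.repair_successful = True
    except Exception as exc:
        # Record the failure instead of propagating it to the caller.
        crr.repair_failure = exc
    return crr


def failing_repair():
    raise RuntimeError("not enough shares to repair")


# An unrecoverable file: repair is attempted, fails, and no exception escapes.
crr = check_and_repair(CheckResults(healthy=False, recoverable=False),
                       failing_repair)
assert crr.repair_attempted is True
assert crr.repair_successful is False
assert crr.post_repair_results.recoverable is False
```

Because the failure is stored on the result, a caller walking many objects can inspect `repair_failure` and carry on with the next object.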

daira commented

2011-01-07 05:55:02 +00:00

Replying to [warner](/tahoe-lafs/trac/issues/755#issuecomment-372830):

> Good patch! I like the approach of making `filenode.check_and_repair()` signal inability to repair by returning `CheckAndRepairResults.repair_successful=False` instead of by throwing an exception.

+1

> A few things I'd like to see changed:
>
> * we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to `healthy=False recoverable=False needs_rebalancing=False`, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results.

If there's a failure, then we don't know whether the file is healthy, recoverable or needs rebalancing. Shouldn't unknown fields simply be missing from the results?

(Note: `needs_rebalancing=False` is not pessimistic.)

> * I see `test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()` asserts that an unrecoverable dirnode causes the traversal to halt. Is this what we want? Is this ticket about making sure an unrecoverable *file* doesn't halt a deep-repair, or about an unrecoverable *dirnode*?

I thought it was both.

francois commented

2011-01-15 16:20:04 +00:00

Owner

Thanks for the review! My comments are inline.

Replying to [warner](/tahoe-lafs/trac/issues/755#issuecomment-372830):

> * we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to `healthy=False recoverable=False needs_rebalancing=False`, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results.

I agree with what davidsarah said in comment:23: it is difficult to know the actual status when an exception was raised during the check operation. However, it seems that simply removing the fields from the results would necessitate other changes, because I guess that many parts of the code expect them to be present.

What would you think about setting healthy to its value before the repair (most likely `False`) and the other fields to `None`? Something along those lines?

```
def _repair_error(f):
    prr = CheckResults(cr.uri, cr.storage_index)
    prr.data = copy.deepcopy(cr.data)
    prr.set_healthy(crr.pre_repair_results.is_healthy())
    prr.set_recoverable(None)
    prr.set_needs_rebalancing(None)
    crr.post_repair_results = prr
    crr.repair_successful = False
    crr.repair_failure = f
    return crr
```

> * the newly-enabled test in `test_repairer.Repairer.test_harness` (which previously got a `self.shouldFail()`) should be slightly enhanced to check the return value of `check_and_repair()`. We should verify that it has `crr.repair_attempted=True`, `crr.repair_successful=False`, and `crr.post_repair_results.recoverable=False`

Good point, will be done in the next patch.

> * we should add a similar test for mutable files that have had 8 shares deleted. There's something awfully close in `test_mutable.Repair.test_unrepairable_1share` .. it should be changed to use `self._fn.check_and_repair()` instead of `self._fn.repair()`.

Will be done in the next patch.

> To be honest, I'm not sure why that test was passing before, because from what I can tell it should have been behaving the same way as immutable repair on an unrecoverable file.

I don't know either; I will try to look into this in detail.

> * it's probably worth checking the code coverage when we exercise `test_mutable` and make sure the new code is getting run

I don't remember how the code coverage infrastructure in the build system actually works. Could you tell me which command I should run?

> * do we have any tests that confirm deep-repair on a tree with an unrecoverable file (or directory) makes it through to the end without an errback? We probably do but I'd like to be sure.. probably something in `test_deepcheck` exercises this.

This is what I think calling `do_web_stream_check()` inside `DeepCheckWebBad.test_bad()` should be doing, isn't it?

> * I see `test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()` asserts that an unrecoverable dirnode causes the traversal to halt. Is this what we want? Is this ticket about making sure an unrecoverable *file* doesn't halt a deep-repair, or about an unrecoverable *dirnode*? (broken dirnodes are more significant than files, because it means you've probably lost access to even more data). We certainly want the deep-traversal to keep going and repair more things, but we also need to make sure the user learns about the dead dirnode.

Yes, the traversal must continue in both cases. I was under the impression that unrecoverable immutable files were already supported, and I understand this issue as being about unrecoverable dirnodes.

warner commented

2011-01-17 09:21:03 +00:00

Replying to [francois]comment:24:

Thanks for the review! My comments are inline.

Replying to warner:

we usually repair files that are unhealthy but recoverable. If
repair fails, the file should still be recoverable. The
post-repair-results are pessimistically being set to healthy=False
recoverable=False needs_rebalancing=False, when it's probably (and
sometimes certainly) more accurate to copy these values from the
pre-repair-results.

I agree with what davidsarah said in comment:23, it is difficult to
know the actual status when an exception was raised during the check
operation. However, it seems that simply removing the fields from the
results would necessitate other changes because I guess that many
parts of the code except them to be present.

What would you think about setting healthy to its value before the
repair (most likely False) and other fields to None?
Something along those lines?
  def _repair_error(f):
    prr = [CheckResults](wiki/CheckResults)(cr.uri, cr.storage_index)
    prr.data = copy.deepcopy(cr.data)
    prr.set_healthy(crr.pre_repair_results.is_healthy())
    prr.set_recoverable(None)
    prr.set_needs_rebalancing(None)
    crr.post_repair_results = prr
    crr.repair_successful = False
    crr.repair_failure = f
    return crr

Ok, but set_recoverable() and set_needs_rebalancing() should
be copied from the pre-repair values too. For immutable files it's
certainly the case that repair cannot make things any worse, so if the
file was recoverable before repair, it will be recoverable afterwards
too. For mutable files, it's fuzzier, but once we get #1209 fixed, then
repair that doesn't involve UCWE collisions or multiple versions should
be strictly an improvement too. I think set_needs_rebalancing() is
roughly the same.

My big concern is doing a deep-repair while you're missing a few
servers: all files are missing a few shares, so they aren't healthy and
we try to repair them, but you're missing too many servers to
successfully meet the servers-of-happiness threshold, so repair fails.
On every single file. All the files are actually recoverable, but the
post-repair results suggest that they are not. What I want to avoid is
the deep-repair summary message telling users that 4000 out of 4000
files are now unrecoverable and scaring the socks off them.

it's probably worth checking the code coverage when we exercise
test_mutable and make sure the new code is getting run

I don't remember how the code coverage infrastructure in the build
system actually works. It would be very kind of you if you tell me
which command I should run?

I usually do 'make quicktest-coverage', but I think "python setup.py trial --coverage" (or perhaps "python setup.py trial --coverage --test-suite test_mutable" to be a bit more selective) should do the
same. That will create a .coverage file with the raw data. "make coverage-output", or following the commands listed in that section of
the Makefile, will give you an HTML summary with color-coded source
lines.

do we have any tests that confirm deep-repair on a tree with an
unrecoverable file (or directory) makes it through to the end
without an errback? We probably do but I'd like to be sure..
probably something in test_deepcheck exercises this.

This is what I think calling do_web_stream_check() inside
DeepCheckWebBad.test_bad() should be doing, isn't it?

I think that's mostly correct: it looks like set_up_damaged_tree()
creates a root directory with 8 files (half mutable, half immutable),
some of which are unrecoverable. But 1: do_web_stream_check()
doesn't attempt repair, merely deep-check, and 2: there are no
directories in that root, only files. Adding an unrecoverable directory
is the important bit, since I think deep-repair and deep-check have
enough common code paths that exercising deep-check is sufficient. (note
that I think the 'broken' directory set up there is not used by
do_web_stream_check()).

I see
test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()
asserts that an unrecoverable dirnode causes the traversal to
halt. Is this what we want? Is this ticket about making sure an
unrecoverable file doesn't halt a deep-repair, or about an
unrecoverable dirnode? (broken dirnodes are more significant
than files, because it means you've probably lost access to even
more data). We certainly want the deep-traversal to keep going and
repair more things, but we also need to make sure the user learns
about the dead dirnode.

Yes, the traversal must continue in both cases. I was under the
impression that unrecoverable immutable files were already supported
and I understand this issue as being about unrecoverable direnodes.

Yeah, do_web_stream_check() should cover the
unrecoverable-immutable-file case (well, unless there's a difference in
behavior between a web-based t=stream-deep-check and an internal
dirnode-based dirnode.start_deep_check(), which is worth testing).
So I agree, unrecoverable dirnodes is the important thing to check.

So my hunch here is that we should add an unrecoverable directory to the
'root' tree created in set_up_damaged_tree(), and adjust the
counters to match, and then maybe we should get rid of the 'broken' tree
and do_deepcheck_broken().


warner commented

2011-01-17 09:22:59 +00:00

BTW, if we get a patch for this on monday, I'll review and land it, and it'll be in 1.8.2. If it's not ready by monday or tuesday, then we may need to push it out until after 1.8.2. I want to make sure we get at least a few days of testing on this, since it's kind of invasive.

francois commented

2011-01-17 20:47:28 +00:00

Owner

Replying to warner:

> BTW, if we get a patch for this on monday, I'll review and land it, and it'll be in 1.8.2. If it's not ready by monday or tuesday, then we may need to push it out until after 1.8.2. I want to make sure we get at least a few days of testing on this, since it's kind of invasive.

I guess that it's going to have to wait until after 1.8.2 because spare time in the coming week looks pretty scarce.


tahoe-lafs modified the milestone from 1.8.2 to 1.9.0

2011-01-17 20:47:41 +00:00

daira commented

2011-07-16 20:49:20 +00:00

This needs some work to address the comments and to be rebased to trunk, but has a good chance of getting into 1.9.

daira commented

2011-08-02 15:43:50 +00:00


I have a patch in progress that builds on [attachment:patch-755.darcs.diff](/tahoe-lafs/trac/attachments/000078ac-5234-2fa4-40fb-43c067161683) and fixes the review comments, including skipping unrecoverable directories and including information that they've been skipped in the output. It's not ready for 1.9 though.

daira modified the milestone from 1.9.0 to 1.10.0

2011-08-02 15:44:19 +00:00

daira commented

2012-08-22 02:12:52 +00:00


I'll try to find the patch mentioned in [comment:372838](/tahoe-lafs/trac/issues/755#issuecomment-372838), but if I haven't done so in two weeks, it can be assumed that I've lost it.

daira commented

2013-04-26 02:00:35 +00:00

#1955 was a duplicate.

daira commented

2014-11-19 07:26:39 +00:00

#2337 was a duplicate.

zooko changed title from ~~if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information~~ to Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information

2015-02-03 17:43:33 +00:00

daira commented

2016-01-14 17:45:29 +00:00

Kyle Markley wrote on tahoe-dev:

> When `tahoe deep-check --repair` encounters a file it can't repair, it stops without reporting anything about what file gave it trouble.
>
> What do I do about this? I rerun, this time with `-v`, so I get a listing of what files it is working on. From that list I can often infer which file had the error. Assuming I still have the original file, the corrective action is to `tahoe put` the file. Then I can restart the deep-check.
>
> But in a directory tree with thousands of files, that takes forever! Instead, I can restart the deep-check in a subdirectory closer to the previous failure. But this is a lot of tedious work.
>
> I wish that `tahoe deep-check` would:
>
> 1. Report which file is unrepairable.
> 2. Not stop at the first error, but continue and report all errors upon completion.
>
> When an unrepairable file is an immutable directory, what corrective action should be taken? I have resorted to modifying the directory by creating an empty file, performing a `tahoe backup`, and then continuing the `deep-check --repair`. But I cannot then remove the empty file, because that would cause the next backup to point to the original (unrepaired) directory. Can this be improved?
>
> I wish that `tahoe backup` could be combined with `tahoe deep-check --repair`. The behavior would be like deep-check, but if any file is unrepairable yet exists in the local filesystem at the corresponding path, upload it. And for bonus points this should guarantee happiness, not just healthiness. Or, it would be almost as good if deep-check would update the backup database so the next invocation of `tahoe backup` would re-upload the appropriate files and directories.
>
> Essentially, I struggle with the fact that "`tahoe backup`" completes successfully without guaranteeing the recoverability of files it claims to have backed up. The backup database is out-of-sync with the healthiness of files on the grid, and there is no way to bring them in-sync. Sure, I can delete the backup database, but I don't want to pointlessly re-upload all the healthy files.
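The two wishes in the quoted message (name the object that failed, and keep going until the whole tree has been visited) can be sketched as below. This is a toy model, not how deep-check is implemented: the tree is a plain nested dict, and `deep_check`, `Unrecoverable`, and the `check` callback are hypothetical names introduced only for illustration.

```python
class Unrecoverable(Exception):
    """Raised by the (hypothetical) checker when an object cannot be read."""


def deep_check(tree, check, path=""):
    """Walk a nested-dict tree and return (checked_paths, failures).

    A dict value is a directory; anything else is a file. A failing
    object is recorded and skipped instead of aborting the traversal.
    """
    checked, failures = [], []
    for name, node in sorted(tree.items()):
        child = f"{path}/{name}"
        try:
            check(child)
        except Unrecoverable as err:
            # Remember what failed and why, then move on. An unreadable
            # directory cannot be descended into, so its subtree is skipped.
            failures.append((child, str(err)))
            continue
        checked.append(child)
        if isinstance(node, dict):
            sub_checked, sub_failures = deep_check(node, check, child)
            checked.extend(sub_checked)
            failures.extend(sub_failures)
    return checked, failures


def fake_check(path):
    # Pretend one directory and one file have lost too many shares.
    if path in {"/docs", "/photos/b.jpg"}:
        raise Unrecoverable("insufficient good shares")


tree = {
    "docs": {"a.txt": "file"},
    "photos": {"a.jpg": "file", "b.jpg": "file"},
    "z.txt": "file",
}
checked, failures = deep_check(tree, fake_check)
```

Here `failures` ends up listing both `/docs` and `/photos/b.jpg`, while `/photos/a.jpg` and `/z.txt` are still checked, which is the behavior the message asks for.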

daira removed their assignment 2016-01-14 17:45:29 +00:00

daira self-assigned this 2016-01-14 17:45:29 +00:00

tlhonmey commented

2017-03-09 18:55:14 +00:00

Owner

Kyle: It won't have to re-upload all the healthy files. The deduplication algorithm will find that the data for any unchanged files is already available and will re-use whatever shares it can. It'll just take a bit longer to run because it'll have to scan and encode every file.

Meanwhile: I just lost a bunch of stuff because I didn't know about this issue and assumed a `deep-check --repair --add-lease` cronjob would take care of things. One file near the beginning of the directory structure got damaged somehow, so neither repair nor leasing was done on the rest, and by the time I came back to check on it, chunks had expired and been deleted and I have to re-upload everything, which will take about a month.

This bug has been open for almost 8 years, and I see a patch for it in the discussion thread... If it's not going to be fixed in the next release, I recommend adding a warning about it to the documentation so new users don't do something stupid like expect the repair operation to behave in a sane manner.

As a work-around, I use:

```
tahoe manifest alias: | cut -d" " -f 1 | xargs -L1 -P5 tahoe check --add-lease --repair
```

This, of course, requires time and CPU to start a separate instance of the tahoe program for every data object being checked, so going over the entire directory takes days instead of hours, but at least it actually works.
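For readers who prefer to drive the same work-around from Python, here is a rough equivalent of the one-liner above. It is a sketch under assumptions: it takes the text printed by `tahoe manifest` as input and assumes, as the `cut` step implies, that the first space-separated field of each line is the object's cap. The function names are invented for this example; only the `tahoe check --add-lease --repair` invocation comes from the comment.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def caps_from_manifest(manifest_text):
    """Return the first space-separated field of each non-empty line,
    roughly what `cut -d" " -f 1` does in the one-liner."""
    return [
        line.split(" ", 1)[0]
        for line in manifest_text.splitlines()
        if line.strip()
    ]


def run_check(cap, tahoe="tahoe"):
    """Run one `tahoe check --add-lease --repair` and return
    (cap, exit code, captured stdout)."""
    proc = subprocess.run(
        [tahoe, "check", "--add-lease", "--repair", cap],
        capture_output=True,
        text=True,
    )
    return cap, proc.returncode, proc.stdout


def check_all(caps, runner=run_check, workers=5):
    """Check every cap with up to `workers` concurrent processes
    (the counterpart of `xargs -P5`); a failure on one cap does not
    stop the others. Results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, caps))
```

A typical use would be `check_all(caps_from_manifest(manifest_output))`, then inspecting each returned tuple. Like the shell version, this still starts one `tahoe` process per object, so it is no faster; it only makes the per-object results easier to collect.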

tlhonmey commented

2018-08-21 21:48:05 +00:00

Owner

Ok, so `tahoe manifest` also gives up on the first error it encounters, it just only encounters errors on damaged directories. But it will still bite you hard if you are actually stupid enough to rely on it.

So I've resorted to the following bash script:

```
#!/bin/bash
tahoe="/home/tahoe/tahoe/bin/tahoe"
THREADS=5
FAILEDLOG="/tmp/failed.txt"

recurser() {
  CHILDREN=""
  echo "checking directory: $1"
  # Give it 5 minutes before continuing, to let the grid come back up if this
  # is a connection failure. This prevents the entire script from finishing
  # as failures if the network connection goes down.
  $tahoe check --add-lease "$1" || $tahoe check --add-lease --repair "$1" || sleep 5m
  local ITEM
  for ITEM in $($tahoe ls -F "$1"); do
    echo "checking: ${1}${ITEM}"
    # `ls -F` marks directories with a trailing slash; recurse into them.
    echo "$ITEM" | grep "/" > /dev/null && echo "  Is a directory..." && recurser "${1}${ITEM}"
    # Check in the background; repair only if the output does not say healthy,
    # and log the path if the repair fails too.
    ( $tahoe check --add-lease "${1}${ITEM}" | grep -n10 healthy || $tahoe check --repair --add-lease "${1}${ITEM}" || echo "${1}${ITEM}" >> $FAILEDLOG ) &
    # Record the PID of the background job ($!, not $?) to count running jobs.
    CHILDREN="$! $CHILDREN"
    if [[ $(echo "$CHILDREN" | wc -w) == "$THREADS" ]]; then
      wait
      CHILDREN=""
    fi
  done
}

echo "If it blows up immediately when passed a URI make sure you end it with a /"
recurser "$1"
```

The careful observer will notice that this script calls `check --add-lease` first and then only calls `--repair` if that returns an error. This is due to another bug in the `--repair` functionality which I will be filing shortly.

Is making deep-check note the unrepairable nodes, but then continue to check the rest of the tree really that difficult? I wouldn't think the average user should have to resort to writing their own tools to avoid cascade failures of the storage system...

If you guys want to bundle this tool or some clone or variant thereof into your packages you are more than welcome to do so. We need something to actually keep people's data safe until this bug is fixed.

Edit: Oh for Pete's Sake! `tahoe check` exits with a 0 even when the checked objects are unhealthy, so I have to scan the output myself to assess it. I sense that at some point I'm going to need to rewrite this in Python or something and use the REST API. Hopefully that's at least somewhat sane...
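As a starting point for the Python/REST rewrite mentioned in the edit, the sketch below builds a check request URL and reads health from the JSON reply instead of relying on an exit code. Treat every detail as an assumption to verify against the web API documentation for your Tahoe-LAFS version: the `t=check`, `output=JSON`, `repair`, and `add-lease` query parameters and the `results`/`healthy` fields are recalled from that documentation, the node URL is only an example, and `check_url` and `is_healthy` are names made up here. No request is actually sent.

```python
import json
from urllib.parse import quote, urlencode


def check_url(node_url, cap, repair=False, add_lease=False):
    """Build the URL for a check request (to be sent with POST)."""
    params = {"t": "check", "output": "JSON"}
    if repair:
        params["repair"] = "true"
    if add_lease:
        params["add-lease"] = "true"
    # Percent-encode the cap so its colons cannot be misread as URL syntax.
    return f"{node_url.rstrip('/')}/uri/{quote(cap, safe='')}?{urlencode(params)}"


def is_healthy(response_text):
    """Return True only if a plain (non-repair) check reply says healthy.

    A missing field counts as unhealthy, so a malformed reply is never
    mistaken for success.
    """
    data = json.loads(response_text)
    return bool(data.get("results", {}).get("healthy", False))


url = check_url("http://127.0.0.1:3456/", "URI:CHK:abc", add_lease=True)
```

Deciding success from the parsed reply rather than the process exit status sidesteps the exit-code problem described in the edit, and one long-running client avoids starting a new `tahoe` process per object.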