tahoe-lafs/trac

assertion failure #1689


Closed

opened 2012-03-21 13:47:30 +00:00 by jg71 · 14 comments

jg71 commented

2012-03-21 13:47:30 +00:00

Owner

tahoe deep-check -v --repair --add-lease tahoe:

'<root>': healthy
done: 1 objects checked
 pre-repair: 1 healthy, 0 unhealthy
 0 repairs attempted, 0 successful, 0 failed
 post-repair: 1 healthy, 0 unhealthy

when 9 out of 9 storage servers are available.

The error I have found is reproducible by just stopping one storage
server and running once more:
tahoe deep-check -v --repair --add-lease tahoe:

ERROR: AssertionError()
"[Failure instance: Traceback: <type 'exceptions.AssertionError'>: "
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:563:upload
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:661:_do_serialized
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:298:addCallback
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:287:addCallbacks
--- <exception caught here> ---
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:545:_runCallbacks
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:661:<lambda>
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:689:_upload
/usr/lib64/python2.6/site-packages/allmydata/mutable/publish.py:404:publish

I can see from the second capture (attached flogtool.tail-2.txt) that tahoe connects exactly to that storage
server that has been stopped: connectTCP to ('256.256.256.256', 66666)

There were no incident report files.

Versions used locally:

Nevow: 0.10.0, foolscap: 0.6.3, setuptools: 0.6c11, Twisted: 11.1.0, zfec: 1.4.22, zbase32: 1.1.3, pyOpenSSL: 0.13, simplejson: 2.3.2, mock: 0.7.2, argparse: 1.2.1, pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958, pyutil: 1.8.4, zope.interface: 3.8.0, allmydata-tahoe: allmydata-tahoe-1.9.0-94-gcef646c, pycrypto: 2.5, pyasn1: 0.0.13

all 9 storage servers use these:
Nevow: 0.10.0, foolscap: 0.6.3, setuptools: 0.6c11, Twisted: 11.1.0, zfec: 1.4.22, pycrypto: 2.4.1, zbase32: 1.1.3, pyOpenSSL: 0.13, simplejson: 2.3.2, mock: 0.7.2, argparse: 1.2.1, pyutil: 1.8.4, zope.interface: 3.8.0, allmydata-tahoe: 1.9.1, pyasn1: 0.0.13, pycryptopp: 0.5.29

Might this be related to https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1648 ?


tahoe-lafs added the

labels 2012-03-21 13:47:30 +00:00

tahoe-lafs added this to the undecided milestone 2012-03-21 13:47:30 +00:00

jg71 commented

2012-03-21 13:49:03 +00:00

Author

Owner

Attachment flogtool.tail-2.txt.gz (5967 bytes) added


zooko added

p/critical

and removed

p/major

labels 2012-03-21 14:39:52 +00:00

jg71 commented

2012-03-21 15:31:23 +00:00

Author

Owner

Placing a single file in that directory is sufficient to hide the error. Even a LIT cap works.

warner commented

2012-03-29 23:45:19 +00:00

I've been able to reproduce the error locally. It's weird: the CLI command returns the AssertionError, but nothing is logged by the node. Usually it's the other way around: the node notices an error, logs it, recovers by using some alternate share, and the CLI command doesn't see anything wrong.

The assertion error is on the `assert self._privkey` line, which means that the repair's Publish operation is trying to use a filenode that never got a privkey. This can happen when the filenode was created from a readcap, or when we have a writecap but were unable (or maybe just forgot) to fetch the privkey earlier.

If it depends upon the contents of the directory, then maybe it's really depending upon the length of the share: if something were fetching a portion of the share that usually contains the privkey, but didn't for really short shares (empty directories), that might explain it. I'll look earlier in the repair process to see where the privkey is usually fetched, and if there's any code to complain if that fails.
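As a rough illustration of the failure mode described above (not the actual allmydata code), the sketch below models a filenode whose private key was never populated and a publish step that asserts on it. The class `MutableFileNodeSketch` and the helper `try_publish` are hypothetical names invented for this example.

```python
class MutableFileNodeSketch:
    """Hypothetical stand-in for a mutable filenode; not the real allmydata class."""

    def __init__(self):
        # The signing key starts out unknown. It is only filled in if some
        # earlier step fetched it from a share and handed it to the node.
        self._privkey = None

    def populate_privkey(self, privkey):
        self._privkey = privkey

    def publish(self, new_contents):
        # Same shape as the failing check: publishing needs the signing key.
        assert self._privkey
        return "published %d bytes" % len(new_contents)


def try_publish(node, data):
    """Report the outcome roughly the way the CLI did (assumed formatting)."""
    try:
        return node.publish(data)
    except AssertionError as e:
        return "ERROR: %r" % (e,)


if __name__ == "__main__":
    node = MutableFileNodeSketch()
    print(try_publish(node, b"dir contents"))   # no privkey yet

    node.populate_privkey("fake-signing-key")
    print(try_publish(node, b"dir contents"))   # privkey present
```

A bare `assert` carries no message, which is consistent with the CLI showing only `AssertionError()` and nothing more descriptive.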

warner commented

2012-03-30 00:32:14 +00:00

More findings:

* doing a non-deep `tahoe check --repair` gets the same failure, but gets a webapi 500 Internal Server Error (but successfully repairs the file, I think).
* the Retrieve half of the repair is doing a mapupdate with MODE_READ (which doesn't try to fetch the privkey), but it needs to be using MODE_WRITE (to force a privkey fetch)
* the mapupdate does a few queries that do, in fact, get the privkey, but the responses arrive too late for the Publish to use them.

Maybe there's a race condition, and larger files allow mapupdate to win the race and get the privkey in time for Publish to use it, but small (empty) files allow the Publish to start too early. The MODE_READ is definitely wrong, though: not only does it not bother to get the privkey in that case, it also doesn't search as hard to find all shares (MODE_WRITE tries harder, to avoid leaving any share behind at an old version).
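The suspected race can be sketched with a small asyncio simulation. This is only a model under stated assumptions: the real code uses Twisted Deferreds, and `Node`, `fetch_privkey`, `mapupdate`, and `repair` here are hypothetical names. The mode constants are placeholder strings. In the "read" mode the map update returns without waiting for the outstanding privkey response; in the "write" mode it waits.

```python
import asyncio

# Placeholder labels; the real constants live in the Tahoe-LAFS source.
MODE_READ = "MODE_READ"
MODE_WRITE = "MODE_WRITE"


class Node:
    def __init__(self):
        self.privkey = None


async def fetch_privkey(node, delay):
    # Models a storage-server response that carries the privkey but
    # only arrives after `delay` seconds.
    await asyncio.sleep(delay)
    node.privkey = "fake-signing-key"


async def mapupdate(node, mode, delay):
    pending = asyncio.ensure_future(fetch_privkey(node, delay))
    if mode == MODE_WRITE:
        # Assumed behaviour: do not finish until the privkey is in hand.
        await pending
    # In MODE_READ we return right away; the response is still in flight.
    return pending


async def repair(mode, delay=0.01):
    node = Node()
    pending = await mapupdate(node, mode, delay)
    # "Publish" starts here and needs the privkey immediately.
    publish_can_sign = node.privkey is not None
    await pending  # let the late response land so nothing is left dangling
    return publish_can_sign


if __name__ == "__main__":
    print(MODE_READ, asyncio.run(repair(MODE_READ)))
    print(MODE_WRITE, asyncio.run(repair(MODE_WRITE)))
```

In this model the read-mode repair always loses the race, whereas the thread suggests the real outcome depended on timing and share size. The sketch shows only why waiting for the privkey before publishing removes the dependence on timing.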


daira added

c/code-mutable

and removed

c/unknown

labels 2012-03-30 01:03:15 +00:00

daira modified the milestone from undecided to 1.9.2

2012-03-30 01:03:15 +00:00

warner commented

2012-03-31 07:00:15 +00:00

Attachment fix-1689.diff (13002 bytes) added

fix and test for the bug


warner commented

2012-03-31 20:12:18 +00:00

OK, changeset:2b8a312c and changeset:5bae4a1b should fix that (and include a test that fails without the fix). To make the test fail, I had to introduce some extra delays into the fake storage-server response (so the privkey would arrive too late to help the Publish succeed). I'm still trying to understand why the filesize matters, but this basic issue should be solved now.

daira commented

2012-04-01 01:22:46 +00:00

On line 1066 of servermap.py as changed by changeset:5bae4a1b:

if self.mode == (MODE_CHECK, MODE_REPAIR):

See the bug? :-)


jg71 commented

2012-04-01 01:32:03 +00:00

Author

Owner

I applied fix-1689.diff and the error is gone. But I noticed the following while running "tahoe deep-check -v --repair --add-lease alias:"

- if one storage server is shut down, every dir cap (incl. root) is marked unhealthy and repaired, as are all files, except lit caps which are shown healthy
- if all servers are online, by default all dir caps are marked unhealthy and repaired but not quite:

done: 4 objects checked
 pre-repair: 3 healthy, 1 unhealthy
 1 repairs attempted, 1 successful, 0 failed
 post-repair: 3 healthy, 1 unhealthy

The unhealthy one is the dir root cap. Another alias which includes a directory shows the same behaviour:

done: 69 objects checked
 pre-repair: 0 healthy, 69 unhealthy
 69 repairs attempted, 69 successful, 0 failed
 post-repair: 67 healthy, 2 unhealthy

Noteworthy is the fact that lit caps always show up as healthy, if all servers are there or one missing.


daira commented

2012-04-01 01:41:54 +00:00

Replying to jg71:

> Noteworthy is the fact that lit caps always show up as healthy, if all servers are there or one missing.

That's as expected; a LIT cap can't be unhealthy, since the cap itself has the data.
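To make that concrete, here is a minimal sketch of a literal cap that carries its own bytes. It assumes the cap is the prefix `URI:LIT:` followed by lowercase, unpadded base32 of the data; the helper names `make_lit_cap` and `read_lit_cap` are hypothetical and this is not Tahoe-LAFS's own encoder.

```python
import base64

LIT_PREFIX = "URI:LIT:"


def make_lit_cap(data: bytes) -> str:
    # Encode the file contents directly into the cap string
    # (assumed format: lowercase base32 with the '=' padding stripped).
    b32 = base64.b32encode(data).decode("ascii").lower().rstrip("=")
    return LIT_PREFIX + b32


def read_lit_cap(cap: str) -> bytes:
    # Reading needs no storage server: the data is recovered from the cap.
    if not cap.startswith(LIT_PREFIX):
        raise ValueError("not a LIT cap")
    b32 = cap[len(LIT_PREFIX):].upper()
    b32 += "=" * (-len(b32) % 8)  # restore the padding base32 decoding expects
    return base64.b32decode(b32)


if __name__ == "__main__":
    cap = make_lit_cap(b"hello")
    print(cap)
    print(read_lit_cap(cap))
```

Because no shares exist on any server, there is nothing for a checker to find missing, whichever servers are online.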


warner commented

2012-04-01 22:43:36 +00:00

Replying to davidsarah:

> On line 1066 of servermap.py as changed by changeset:5bae4a1b:
>
> if self.mode == (MODE_CHECK, MODE_REPAIR):
>
> See the bug? :-)

Oops.. good catch! I just pushed the fix, in changeset:470acbf1
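For readers who did not spot it: comparing a single mode value to a tuple with `==` can never be true, so the branch is dead code. The sketch below uses placeholder string constants (the real values are in the Tahoe-LAFS source) and assumes the intended check was a membership test; the exact contents of changeset:470acbf1 are not shown in this thread.

```python
# Placeholder values for illustration only.
MODE_CHECK = "MODE_CHECK"
MODE_REPAIR = "MODE_REPAIR"
MODE_READ = "MODE_READ"


def buggy(mode):
    # A string is never equal to a 2-tuple, so this is always False.
    return mode == (MODE_CHECK, MODE_REPAIR)


def intended(mode):
    # Presumed intent: true when the mode is either of the two.
    return mode in (MODE_CHECK, MODE_REPAIR)


if __name__ == "__main__":
    for m in (MODE_CHECK, MODE_REPAIR, MODE_READ):
        print(m, buggy(m), intended(m))
```

A branch that is never taken and a branch that no test exercises look the same from the test suite's point of view.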


zooko commented

2012-04-02 00:03:01 +00:00

changeset:470acbf1e1d0a525 apparently didn't cause any test to go red→green. Maybe that code is redundant?

warner was unassigned by zooko

2012-04-05 16:22:37 +00:00

zooko self-assigned this 2012-04-05 16:22:37 +00:00

jg71 commented

2012-04-05 17:07:00 +00:00

Author

Owner

I tested allmydata-tahoe-1.9.0-127-g4e93f77 (commit 4e93f77289) and ran the same tests again. I cannot reproduce the error anymore and no weirdness occurs. Looks like it is fixed.

tahoe-lafs added the

r/fixed

label 2012-04-05 18:35:02 +00:00

jg71 closed this issue

2012-04-05 18:35:02 +00:00

daira commented

2012-04-05 19:46:16 +00:00

Before it was edited, comment:389083 said:

> I tested latest git and ran the same tests again. At first it looked ok, the weirdness was gone (that suddenly all dir caps show up as unhealthy). But after having re-run the repair job a few times with one storage node down, some aliases always show up completely healthy, and others show the weird behaviour again; all files are shown healthy but for those few aliases all dir caps are shown unhealthy, get repaired, and still show unhealthy while their repair was claimed to be successful:

*** test3
        '<root>': not healthy
         repair successful
        'test.txt': healthy
        done: 2 objects checked
         pre-repair: 1 healthy, 1 unhealthy
         1 repairs attempted, 1 successful, 0 failed
         post-repair: 1 healthy, 1 unhealthy

*** test4
        '<root>': not healthy
         repair successful
        done: 1 objects checked
         pre-repair: 0 healthy, 1 unhealthy
         1 repairs attempted, 1 successful, 0 failed
         post-repair: 0 healthy, 1 unhealthy

The fact that caps were shown to be unhealthy even though they had been successfully repaired sounds like ticket #766.


jg71 commented

2012-04-05 20:18:13 +00:00

Author

Owner

That was due to a PEBKAC on my part earlier today which I fixed. Afterwards I ran the tests twice to make sure. Therefore I think that the https://tahoe-lafs.org/trac/tahoe-lafs/ticket/766 chase is a false alarm caused by my original comment #16.

Sorry for the noise.

