servers-responding list is wrong #1739

Closed
opened 2012-05-16 22:40:01 +00:00 by warner · 7 comments
warner commented 2012-05-16 22:40:01 +00:00
Owner

I ran into a bug, hunted it down, and squashed it. Here's the
description and the patch:

If a server did not respond to the pre-repair filecheck, but did
respond to the repair, that server was not correctly added to the
RepairResults.data["servers-responding"] list. (This resulted from
a buggy usage of DictOfSets.union() in filenode.py).

In addition, servers to which filecheck queries were sent, but did
not respond, were incorrectly added to the servers-responding list
anyawys. (This resulted from code in the checker.py not paying
attention to the 'responded' flag).

The first bug was neatly masked by the second: it's pretty rare to
have a server suddenly start responding in the one-second window
between a filecheck and a subsequent repair, and if the server was
around for the filecheck, you'd never notice the problem. I only
spotted the smelly code while I was changing it for IServer cleanup
purposes.

I added coverage to test_repairer.py for this. Trying to get that
test to fail before fixing the first bug is what led me to discover
the second bug. I also had to update test_corrupt_file_verno, since
it was incorrectly asserting that 10 servers responded, when in
fact one of them throws an error (but the second bug was causing it
to be reported anyways).

I ran into a bug, hunted it down, and squashed it. Here's the description and the patch: If a server did not respond to the pre-repair filecheck, but did respond to the repair, that server was not correctly added to the `RepairResults.data["servers-responding"]` list. (This resulted from a buggy usage of `DictOfSets.union()` in filenode.py). In addition, servers to which filecheck queries were sent, but did not respond, were incorrectly added to the servers-responding list anyawys. (This resulted from code in the checker.py not paying attention to the 'responded' flag). The first bug was neatly masked by the second: it's pretty rare to have a server suddenly start responding in the one-second window between a filecheck and a subsequent repair, and if the server was around for the filecheck, you'd never notice the problem. I only spotted the smelly code while I was changing it for IServer cleanup purposes. I added coverage to test_repairer.py for this. Trying to get that test to fail before fixing the first bug is what led me to discover the second bug. I also had to update test_corrupt_file_verno, since it was incorrectly asserting that 10 servers responded, when in fact one of them throws an error (but the second bug was causing it to be reported anyways).
tahoe-lafs added the
code
minor
defect
1.9.1
labels 2012-05-16 22:40:01 +00:00
tahoe-lafs added this to the soon milestone 2012-05-16 22:40:01 +00:00
warner commented 2012-05-16 22:46:01 +00:00
Author
Owner

ugh, the nginx bug (#1581) prevented me from uploading a diff. Look at the github pull-request in https://github.com/tahoe-lafs/tahoe-lafs/pull/9 instead.

ugh, the nginx bug (#1581) prevented me from uploading a diff. Look at the github pull-request in <https://github.com/tahoe-lafs/tahoe-lafs/pull/9> instead.
zooko commented 2012-05-16 23:06:15 +00:00
Author
Owner

will review

will review
tahoe-lafs modified the milestone from soon to 1.9.2 2012-05-16 23:06:15 +00:00
davidsarah commented 2012-05-16 23:36:01 +00:00
Author
Owner

+1 on removing DictOfSets.union; I was looking at using DictOfSets for something and also found that method confusing. I confirmed that the patches remove all uses.

Would d['servers-responding'] = sorted(list(servers_responding)) be better, for determinism? (I know the previous code also wasn't deterministic here.)

I think the overloading of .broken as an int or boolean will not work as intended for the boolean case, because:

$ python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> broken = True
>>> isinstance(broken, int)
True
>>> broken -= 1
>>> broken
0

Otherwise looks good.

+1 on removing DictOfSets.union; I was looking at using DictOfSets for something and also found that method confusing. I confirmed that the patches remove all uses. Would `d['servers-responding'] = sorted(list(servers_responding))` be better, for determinism? (I know the previous code also wasn't deterministic here.) I think the overloading of `.broken` as an int or boolean will not work as intended for the boolean case, because: ``` $ python Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56) [GCC 4.4.5] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> broken = True >>> isinstance(broken, int) True >>> broken -= 1 >>> broken 0 ``` Otherwise looks good.
tahoe-lafs added
normal
and removed
minor
labels 2012-05-16 23:36:01 +00:00
warner commented 2012-05-16 23:41:43 +00:00
Author
Owner

wow, good catch, I thought booleans were distinct from ints for sure. Thanks!

Yeah, determinism is helpful. I had to add similar sorted() calls in the webapi code (in particular in places where we switch from a set, to a JSON list) so the tests can match against something stable. I'll move that up to the code that creates servers-responding.

wow, good catch, I thought booleans were distinct from ints for sure. Thanks! Yeah, determinism is helpful. I had to add similar `sorted()` calls in the webapi code (in particular in places where we switch from a set, to a JSON list) so the tests can match against something stable. I'll move that up to the code that creates servers-responding.
davidsarah commented 2012-05-16 23:48:43 +00:00
Author
Owner

Would d['servers-responding'] = sorted(list(servers_responding)) be better, for determinism?

Actually the list() isn't necessary; sorted(servers_responding) does the same thing.

> Would `d['servers-responding'] = sorted(list(servers_responding))` be better, for determinism? Actually the `list()` isn't necessary; `sorted(servers_responding)` does the same thing.
warner commented 2012-05-17 00:02:43 +00:00
Author
Owner

Landed in changeset:cc366903 and changeset:9acf5bee.

Landed in changeset:cc366903 and changeset:9acf5bee.
tahoe-lafs added the
fixed
label 2012-05-17 00:02:43 +00:00
warner closed this issue 2012-05-17 00:02:43 +00:00
davidsarah commented 2012-05-17 00:19:31 +00:00
Author
Owner

Also applied to 1.9.2: [5500/1.9.2], [5499/1.9.2]

Also applied to 1.9.2: [5500/1.9.2], [5499/1.9.2]
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: tahoe-lafs/trac-2024-07-25#1739
No description provided.