introducer client: connection count is wrong, !VersionedRemoteReference needs EQ #653
Reference: tahoe-lafs/trac#653
Zandr noticed that the prodnet webapi nodes are reporting 111 known servers (which is correct) and 135 connected servers (which is not). The root problem is that the IntroducerClient uses object identity (specifically RemoteReference identity) to manage the set of current connections. Specifically, `IntroducerClient._connections` is a set of (peerid, service_name, rref) tuples, and the insertion/removal code assumes that 'rref' values are comparable (it uses `_connections.add()` and assumes that duplicates will be removed by the set class).

Unfortunately, the recent introduction of VersionedRemoteReference wrappers breaks this. The 135-111=24 extra ghost servers are actually cases of two different VersionedRemoteReference objects wrapping the same original RemoteReference instance.

I'm not quite sure how to fix this. My first inclination is to stop using VersionedRemoteReference wrappers, and instead assign `.versions` attributes to the regular RemoteReference instances to achieve the versioning functionality. However, I believe that VersionedRemoteReference also serves another purpose: it transforms all remote exceptions (everything returned by callRemote, incidentally including local schema Violations) into a special form which is easily recognizable. This is not as simple to achieve without a wrapper.

Another approach would be to implement `__eq__` and `__hash__` (and also `__ne__`? I'm never sure..) on VersionedRemoteReference to pass through to the original. A third approach would be to maintain a singleton weak dict of some sort, which would help us make sure that we have at most one VersionedRemoteReference per RemoteReference.

I prefer the first (no-wrapper) approach, but it may take some doing.
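The pass-through approach described above could be sketched like this (a minimal sketch with hypothetical attribute names, not the actual Tahoe/foolscap code):

```python
class VersionedRemoteReference:
    """Hypothetical sketch: wrap a RemoteReference but delegate
    equality and hashing to the wrapped object, so that two wrappers
    around the same reference collapse to one set member."""

    def __init__(self, rref, versions):
        self.rref = rref
        self.versions = versions

    def __eq__(self, other):
        if isinstance(other, VersionedRemoteReference):
            return self.rref == other.rref
        return NotImplemented

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result

    def __hash__(self):
        return hash(self.rref)


# Two wrappers around the same underlying reference now count as one
# set entry, which is what the connection-counting code assumes.
underlying = object()  # stands in for a real foolscap RemoteReference
a = VersionedRemoteReference(underlying, {"v": 1})
b = VersionedRemoteReference(underlying, {"v": 1})
assert a == b
assert len({a, b}) == 1
```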
Which object is creating a new instance of VersionedRemoteReference when there already exists one for that RemoteReference? Maybe we should consider this a bug in that object. This is very close to your "singleton, weak dict" approach, but it has a narrower scope: it is clear who is supposed to be weakly tracking which remote refs.

P.S. Yes, I know it is very likely a class that I wrote which is responsible...
It's probably the introducer client, specifically `RemoteServiceConnector._got_service`, which reacts to the reconnector-driven connection event by calling `get_versioned_remote_reference`. The problem is the separate manager object (`IntroducerClient.add_connection`), which doesn't ever see the real RemoteReference, just the wrapper. Maybe we could pass back both, and have it use the real RemoteReference as a dict key and put the wrapper in the value. It used to use a dictionary, but I replaced it with a set because RemoteReferences compare cleanly and because that made the code easier to read.

I suspect that I rewrote all of this code in my signed-announcements branch (#466), so maybe what I ought to do is examine that branch and copy code from it.
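The pass-back-both idea could be sketched as follows (hypothetical names; the real code used a set, and the signed-announcements branch may differ):

```python
# Hypothetical sketch: key the connection table by the real
# RemoteReference (which compares cleanly), and store the versioned
# wrapper as the value, so duplicate wrappers cannot pile up.
connections = {}  # (peerid, service_name, real_rref) -> wrapper

def got_service(peerid, service_name, real_rref, wrapper):
    # the real rref is part of the dict key, so a second wrapper
    # around the same rref replaces rather than duplicates the entry
    connections[(peerid, service_name, real_rref)] = wrapper

real_rref = object()  # stands in for a real RemoteReference
got_service("peer1", "storage", real_rref, "wrapper-A")
got_service("peer1", "storage", real_rref, "wrapper-B")
assert len(connections) == 1
```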
I think this is fixed by foolscap >= 0.4.0.

Probably yes, but I haven't written a unit test to confirm it yet. I'll close this ticket when I've done that.
Hm. http://testgrid.allmydata.org:3567 currently says:
And its version numbers are:
The source:NEWS file says that this ticket -- #653 -- has been fixed. But perhaps it hasn't?
the testgrid is now behaving properly - but there is still no test case.
Per #748#comment:-1 -- I upgraded and rebooted the testgrid web gateway server. However, the previous version that was running already had foolscap v0.4.1, so there is apparently some issue, still present in that version of foolscap, which results in more connections than servers.
FYI, this should have been fixed by changeset:1863aee0aae1cc02, which depended upon a feature added to foolscap-0.4.0 (i.e. foolscap itself does not fix the issue.. it required a tahoe-side change).
I guess something may still be broken. We'll probably have to wait for it to become weird again and then look at the introducerclient's internal state.
Attachment incident-2009-07-16-002846-pypgfay.flog.bz2 (78470 bytes) added
Attachment incident-2009-07-16-002829-gc4xv5y.flog.bz2 (79877 bytes) added
:-( I guess I shouldn't have rebooted it then. Does that mean we should go ahead and boot this issue out of the v1.5 Milestone? I have an idea: let's add an assertion in the code that the number of connected storage servers is <= the number of known storage servers. Perhaps even a stronger, more detailed assertion if that's feasible.
Here's what evidence I can gather about the problem that was exhibited.
I upgraded and restarted testgrid.allmydata.org:3567 at 2009-07-16 16:17:18.835Z (on testgrid.allmydata.org's clock). There was nothing that looked too unusual in the `twistd.log` that day. There are two incidents reported, attached, from that day: `incident-2009-07-16-002829-gc4xv5y.flog.bz2` and `incident-2009-07-16-002846-pypgfay.flog.bz2`.

Here is a foolscap log-viewer web service showing each of those logfiles: http://testgrid.allmydata.org:10000/ and http://testgrid.allmydata.org:10001 . I have a hard time learning what I want to know from these logfiles. What I want to know (at least at first) is mostly about temporal coincidence. For starters, I'd like to be sure that these incidents occurred before I rebooted and upgraded the server, not after. However, timestamps such as "# 19:19:47.489 [23537270]: UNUSUAL excessive reactor delay ({'args': (25.734021902084351,), 'format': 'excessive reactor delay (%ss)', 'incarnation': ('\xf5\x16\x1dl\xb2\xf5\x85\xf9', None), 'num': 23537270, 'time': 1247710787.4896951, 'level': 23}s)" don't tell me what day it was nor what timezone the timestamp is in. Checking the status of http://foolscap.lothar.com/trac/ticket/90 suggests that the timestamps might be in PDT, which is UTC-7. If that's the case then ... no, some of the (causally) earliest events in the log are from 18:01. Perhaps they were from 2009-07-15? Argh, I give up. Please tell me how to understand the timing of events in foolscap incident report files. I updated http://foolscap.lothar.com/trac/ticket/90 to plead for fully-qualified, unambiguous timestamps.
The triggering incident is "download failure", but scarier to me is that there was a 26 second reactor delay.
Oookay, I read the relevant source code, and the miscounting of the number of connected storage servers was fixed between changeset:35b3f7f426c193cf (the version that exhibited the problem) and current HEAD. However, I'm not sure whether that also means that whatever caused the failures on TestGrid was fixed. Read on.

Here is the path of the code that shows how "Connected to %s of %s known storage servers" was produced at version changeset:35b3f7f426c193cf:

source:src/allmydata/web/welcome.xhtml@3897#L55
source:src/allmydata/web/root.py@3897#L240
source:src/allmydata/introducer/client.py@3897#L277

Here is how it is produced today:

source:src/allmydata/web/welcome.xhtml@3997#L55
source:src/allmydata/web/root.py@3997#L240
source:src/allmydata/storage_client.py@3997#L41

The old way could potentially accumulate lots of tuples of (serverid, servicename, rref) for the same serverid and servicename, if new connections were established to the same serverid but the old connection was not lost, i.e. `notifyOnDisconnect()` wasn't getting called. The new way cannot possibly have an inconsistency between the number of known storage servers and the number of connected storage servers, since both are computed from the same dict: the known servers are all the items of the dict, and the connected storage servers are the ones that have an rref.

Brian: what do you think about `notifyOnDisconnect()` not getting called even while new connections to the same foolscap peer are being established? That's the only explanation I can see for the observed miscounting on the web gateway that was running allmydata-tahoe changeset:35b3f7f426c193cf and foolscap 0.4.1.

Hm, note that at least one of the storage servers was in a messed-up state in which it had incurred MemoryErrors, so it might have been failing to participate in some sort of foolscap "good-bye" negotiation, thus preventing the older connections to it from being cleaned up. This reminds me of these mysterious warnings in the incident reports (I've seen a lot of this kind of message in foolscap logs):
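The single-dict invariant described above can be illustrated with a minimal sketch (hypothetical names, not the actual storage_client.py code):

```python
class ServerDescriptor:
    """Hypothetical stand-in for NativeStorageClientDescriptor:
    tracks at most one remote reference per server."""
    def __init__(self, serverid):
        self.serverid = serverid
        self.rref = None  # set on connect, cleared on disconnect

servers = {}  # serverid -> ServerDescriptor

def add_server(serverid):
    servers.setdefault(serverid, ServerDescriptor(serverid))

def got_connection(serverid, rref):
    # a reconnect simply overwrites the old rref; entries cannot
    # pile up the way (serverid, servicename, rref) tuples could
    servers[serverid].rref = rref

def counts():
    known = len(servers)
    connected = sum(1 for d in servers.values() if d.rref is not None)
    return connected, known

add_server("s1"); add_server("s2"); add_server("s3")
got_connection("s1", object())
got_connection("s1", object())  # replacement connection: no ghost entry
got_connection("s2", object())
assert counts() == (2, 3)  # connected can never exceed known
```

Because both numbers come from the same dict, the "135 connected of 111 known" inconsistency is impossible by construction.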
(flogtool note: until we get foolscap#90 resolved, you can use "flogtool dump --verbose" to get the raw seconds-since-epoch timestamps out of the event logs, which may give you more confidence in comparing them across servers. There is not yet something similar for "flogtool web-viewer", unfortunately. Also note that incident files always store seconds-since-epoch, so any conversion to a string time format is being done by the "flogtool" program that is processing the incident file.)
Yeah, one of the improvements in changeset:8df15e9f30a3bda7 was to use a single data structure to keep track of storage servers, instead of two separate ones. Seemed more robust.

I'd have to look more closely at the old code to figure out how `notifyOnDisconnect` could be confused. It gets fired when the TCP connection is lost, so there's no requirement that the remote side participate in any shutdown-time protocol, but it does get fired through an eventual-send, so there could conceivably have been a race condition or two.

(Incidentally, I removed those "Tub.connectorFinished" WEIRD messages in foolscap-0.4.2, the current release.. they were causing more problems than they were worth.)
I spoke with Brian on IRC this morning before his gallivanting schedule, and he said that it was unlikely that this problem will recur with the current versions of foolscap and of Tahoe-LAFS. I agree, and am going to bump this ticket out of the v1.5 Milestone.
Note that we can't actually close this ticket yet, because there isn't a test for the original issue (duplicate entries in `IntroducerClient._connections` because RemoteReferences don't compare equal when they denote the same underlying TCP connection to the same peer), much less for the newly discussed issue (duplicate RemoteReferences because the peer is somehow setting up new RemoteReferences while `notifyOnDisconnect()` is not getting called for the old ones).

However, I'm fairly confident that this is (a) not a regression versus Tahoe-LAFS v1.4.1, and (b) not a critical bug. (In fact, it might already be fixed in current trunk.)
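A unit test for the original issue might look roughly like this (a hypothetical sketch: it simulates wrappers that define no equality, as the pre-fix VersionedRemoteReference effectively behaved):

```python
class UnequalWrapper:
    """Simulates the old wrapper behavior: no __eq__/__hash__, so two
    wrappers of the same reference are distinct set members."""
    def __init__(self, rref):
        self.rref = rref

underlying = object()  # stands in for one real RemoteReference

# simulate two announcements arriving for the same live connection
connections = set()
for _ in range(2):
    connections.add(("peer1", "storage", UnequalWrapper(underlying)))

# the miscount the ticket describes: one real server, but the
# connection set reports two "connected" entries
assert len(connections) == 2
```

A regression test would assert the opposite (`len == 1`) against the real connection-tracking code.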
hm, I do hope we'll be able to close this ticket eventually, without reverting back to the old code, analyzing the problem, fixing it, then discarding the fix to go back to the new code :-). Let's consider closing this as INVALID after a couple of months of correct operation of the testgrid.
The only part that I'm not sure about is the way that multiple RemoteReferences piled up. I.e., I'm confident that there is not a bug in Tahoe-LAFS with regard to this issue, but I'm not confident that there isn't a bug in foolscap about it. Note that we didn't change the version of foolscap on the testgrid webapi (it is still foolscap 0.4.1).

Maybe if I write down what I know about it in this ticket, that will help me understand it better.
Could you summarize what was changed that made me think (in comment:371001) that foolscap v0.4.0 fixed the original issue of having multiple RemoteReferences over the same TCP connection to the same peer?

Whatever it was, it didn't fix the problem that was demonstrated using Tahoe-LAFS changeset:35b3f7f426c193cf and foolscap 0.4.1, in which there were many more tuples of (serverid, servicetype, rref) than there were servers. This suggests that foolscap 0.4.1 must have been creating new RemoteReferences and not calling back the `notifyOnDisconnect()` method to clean up old RemoteReferences to the same peers.

Now, Tahoe-LAFS at changeset:1192b61dfed62a49 no longer keeps tuples of (peerid, servicetype, rref); instead it keeps a dict mapping from peerid to `NativeStorageClientDescriptor`, where a `NativeStorageClientDescriptor` has at most one RemoteReference.

When foolscap calls `NativeStorageClientDescriptor._got_connection()`, the `NativeStorageClientDescriptor` eventually sets `self.rref` to reference the new RemoteReference (see storage_client.py line 218: source:src/allmydata/storage_client.py@20090717050709-66853-80eda37caf0df376be79f45cbed728999b68a843#L218). If its `self.rref` already had a reference to a RemoteReference, it would be overwritten by the new one, so the problem of lots of redundant RemoteReferences piling up certainly can't happen in Tahoe-LAFS anymore.

But if foolscap is generating redundant RemoteReferences and passing them to Tahoe-LAFS, could this cause other problems? For example, I see that `NativeStorageClientDescriptor` calls `rref.notifyOnDisconnect(self._lost)`. If foolscap is generating redundant rrefs, then maybe `self._lost` will eventually be called by one of the old ones that is no longer referenced by the `NativeStorageClientDescriptor`, and if so, the current rref will get spuriously removed?

Maybe we should add some sanity checking to `NativeStorageClientDescriptor`, such as asserting that when its `_got_connection()` is called its `self.rref` must currently be `None`, or asserting that when its `self._lost()` method is called the current value of `self.rref` is the same as it was when the `notifyOnDisconnect()` was registered. In the interests of a stable v1.5, maybe we could make it so that if these checks fail it does something "safe": for instance, if `self._lost()` is called when `self.rref` is not the same thing that it was when the `notifyOnDisconnect()` was registered, it leaves `self.rref` alone (as well as logging a high-scariness incident).

What do you think?
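The defensive check proposed above could be sketched like so (hypothetical class and method names; a sketch of the idea, not the actual Tahoe-LAFS code):

```python
class StorageDescriptor:
    """Hypothetical sketch of the guarded-disconnect idea: only clear
    self.rref if the disconnecting reference is the one currently held;
    a stale callback from a superseded rref is ignored."""
    def __init__(self):
        self.rref = None

    def got_connection(self, rref):
        self.rref = rref
        # in real code, the specific rref would be bound into the
        # callback, e.g.: rref.notifyOnDisconnect(self._lost, rref)

    def _lost(self, disconnected_rref):
        if disconnected_rref is not self.rref:
            # stale notification from an old connection: leave the
            # current rref alone (and log a high-scariness incident)
            return
        self.rref = None

d = StorageDescriptor()
old, new = object(), object()
d.got_connection(old)
d.got_connection(new)   # replacement connection arrives first
d._lost(old)            # then the old connection's notification fires
assert d.rref is new    # the live connection is not clobbered
d._lost(new)
assert d.rref is None
```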
Brian: I hope you get some time to read my comment and consider whether my analysis is right, whether it shows that there is still a bug in foolscap 0.4.1, and whether, if so, that bug might still cause problems for Tahoe-LAFS.
Zooko and I talked and did some more analysis. Based on that, we think there's a high probability of a foolscap bug (still present in the latest 0.4.2) that causes notifyOnDisconnect to sometimes not get called, probably triggered by "replacement connections" (i.e. where NAT table expiries or something cause an asymmetric close, one side reconnects, and the other side must displace an existing-but-really-dead connection with the new inbound one).
The tahoe code was rewritten to reduce the damage caused by this sort of thing. We could change it further, to remove the use of notifyOnDisconnect altogether, with two negative consequences:
The first problem would be annoying, so I think we're going to leave tahoe alone for now. I'll add a note to the foolscap docs to warn users about the notifyOnDisconnect bug, and encourage people not to rely upon it in environments where replacement connections are likely.
I created #816 (don't rely on notifyOnDisconnect()). We can close this ticket as fixed by the rewrite of the server management between changeset:35b3f7f426c193cf and changeset:1192b61dfed62a49. (Therefore the first release with this bug fixed was v1.5.0.)
Brian, I was just looking at a tahoe-lafs v1.4.1 storage server, and it has an incident report triggered by this:
I remember seeing this WEIRDness a lot in the past, and in fact I think you might have downgraded it so that it is no longer considered weird and stops generating so many incident reports. Anyway, I was wondering whether the hypothesis that `notifyOnDisconnect()` sometimes doesn't fire when it is expected to would cast light on this weirdness. Thanks!

Maybe. Yeah, I remember seeing this particular message an awful lot, and couldn't figure it out, and decided that it was pretty harmless, so I downgraded it. I vaguely recall thinking that it was related to multiple simultaneous connections, in which the losing connection attempts were somehow being doubly-removed from the list of remaining attempts.
I suppose a direction to investigate might be to see what happens to those connectors when the primary connection finally closes. Or if the path that triggers the WEIRD warning might also somehow clobber the notifyOnDisconnect handler.