introducer client: connection count is wrong, !VersionedRemoteReference needs EQ #653
Reference: tahoe-lafs/trac#653
Zandr noticed that the prodnet webapi nodes are reporting 111 known servers (which is correct) and 135 connected servers (which is not). The root problem is that the IntroducerClient uses object identity (specifically RemoteReference identity) to manage the set of current connections. Specifically, `IntroducerClient._connections` is a set of (peerid, service_name, rref) tuples, and the insertion/removal code assumes that 'rref' values are comparable (it uses `_connections.add()` and assumes that duplicates will be removed by the set class).

Unfortunately, the recent introduction of VersionedRemoteReference wrappers breaks this. The 135-111=24 extra ghost servers are actually cases of two different VersionedRemoteReference objects wrapping the same original RemoteReference instance.

I'm not quite sure how to fix this. My first inclination is to stop using VersionedRemoteReference wrappers, and instead assign `.versions` attributes to the regular RemoteReference instances to achieve the versioning functionality. However, I believe that VersionedRemoteReference also serves another purpose: it transforms all remote exceptions (everything returned by callRemote, incidentally including local schema Violations) into a special form which is easily recognizable. This is not as simple to achieve without a wrapper.

Another approach would be to implement `__eq__` and `__hash__` (and also `__ne__`? I'm never sure..) on VersionedRemoteReference to pass through to the original. A third approach would be to maintain a singleton weak dict of some sort, which would help us make sure that we have at most one VersionedRemoteReference per RemoteReference.

I prefer the first (no-wrapper) approach, but it may take some doing.
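The pass-through approach described above could be sketched like this (a minimal sketch with hypothetical attribute names, not the actual Tahoe/foolscap code):

```python
class VersionedRemoteReference:
    """Hypothetical sketch: wrap a RemoteReference but delegate
    equality and hashing to the wrapped object, so that two wrappers
    around the same reference collapse to one set member."""

    def __init__(self, rref, versions):
        self.rref = rref
        self.versions = versions

    def __eq__(self, other):
        if isinstance(other, VersionedRemoteReference):
            return self.rref == other.rref
        return NotImplemented

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result

    def __hash__(self):
        return hash(self.rref)


# Two wrappers around the same underlying reference now count as one
# set entry, which is what the connection-counting code assumes.
underlying = object()  # stands in for a real foolscap RemoteReference
a = VersionedRemoteReference(underlying, {"v": 1})
b = VersionedRemoteReference(underlying, {"v": 1})
assert a == b
assert len({a, b}) == 1
```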
Which object is creating a new instance of VersionedRemoteReference when there already exists one for that RemoteReference? Maybe we should consider this a bug in that object. This is very close to your "singleton, weak dict" approach, but it has a narrower scope: it is clear who is supposed to be weakly tracking which remote refs.

P.S. Yes, I know it is very likely a class that I wrote which is responsible...
It's probably the introducer client, specifically `RemoteServiceConnector._got_service`, which reacts to the reconnector-driven connection event by calling `get_versioned_remote_reference`. The problem is the separate manager object (`IntroducerClient.add_connection`), which doesn't ever see the real RemoteReference, just the wrapper. Maybe we could pass back both, and have it use the real RemoteReference as a dict key and put the wrapper in the value. It used to use a dictionary, but I replaced it with a set because RemoteReferences compare cleanly and because that made the code easier to read.

I suspect that I rewrote all of this code in my signed-announcements branch (#466), so maybe what I ought to do is examine that branch and copy code from it.
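The pass-back-both idea could be sketched as follows (hypothetical names; the real code used a set, and the signed-announcements branch may differ):

```python
# Hypothetical sketch: key the connection table by the real
# RemoteReference (which compares cleanly), and store the versioned
# wrapper as the value, so duplicate wrappers cannot pile up.
connections = {}  # (peerid, service_name, real_rref) -> wrapper

def got_service(peerid, service_name, real_rref, wrapper):
    # the real rref is part of the dict key, so a second wrapper
    # around the same rref replaces rather than duplicates the entry
    connections[(peerid, service_name, real_rref)] = wrapper

real_rref = object()  # stands in for a real RemoteReference
got_service("peer1", "storage", real_rref, "wrapper-A")
got_service("peer1", "storage", real_rref, "wrapper-B")
assert len(connections) == 1
```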
I think this is fixed by foolscap >= 0.4.0.

Probably yes, but I haven't written a unit test to confirm it yet. I'll close this ticket when I've done that.
Hm. http://testgrid.allmydata.org:3567 currently says:
And its version numbers are:
The source:NEWS file says that this ticket -- #653 -- has been fixed. But perhaps it hasn't?
the testgrid is now behaving properly - but there is still no test case.
Per #748#comment:-1 -- I upgraded and rebooted the testgrid web gateway server. However, the previous version that was running already had foolscap v0.4.1, so there is apparently some issue, still present in that version of foolscap, which results in more connections than servers.
FYI, this should have been fixed by changeset:1863aee0aae1cc02, which depended upon a feature added to foolscap-0.4.0 (i.e. foolscap itself does not fix the issue.. it required a tahoe-side change).
I guess something may still be broken. We'll probably have to wait for it to become weird again and then look at the introducerclient's internal state.
Attachment incident-2009-07-16-002846-pypgfay.flog.bz2 (78470 bytes) added
Attachment incident-2009-07-16-002829-gc4xv5y.flog.bz2 (79877 bytes) added
:-( I guess I shouldn't have rebooted it then. Does that mean we should go ahead and boot this issue out of the v1.5 Milestone? I have an idea: let's add an assertion in the code that the number of connected storage servers is <= the number of known storage servers. Perhaps even a stronger, more detailed assertion if that's feasible.
Here's what evidence I can gather about the problem that was exhibited.
I upgraded and restarted testgrid.allmydata.org:3567 at 2009-07-16 16:17:18.835Z (on testgrid.allmydata.org's clock). There was nothing that looked too unusual in the `twistd.log` that day. There are two incidents reported, attached, from that day: `incident-2009-07-16-002829-gc4xv5y.flog.bz2` and `incident-2009-07-16-002846-pypgfay.flog.bz2`.

Here is a foolscap log-viewer web service showing each of those logfiles: http://testgrid.allmydata.org:10000/ and http://testgrid.allmydata.org:10001 . I have a hard time learning what I want to know from these logfiles. What I want to know (at least at first) is mostly about temporal coincidence. For starters, I'd like to be sure that these incidents occurred before I rebooted and upgraded the server, not after. However, timestamps such as "# 19:19:47.489 [23537270]: UNUSUAL excessive reactor delay ({'args': (25.734021902084351,), 'format': 'excessive reactor delay (%ss)', 'incarnation': ('\xf5\x16\x1dl\xb2\xf5\x85\xf9', None), 'num': 23537270, 'time': 1247710787.4896951, 'level': 23}s)" don't tell me what day it was nor what timezone the timestamp is in. Checking the status of http://foolscap.lothar.com/trac/ticket/90 suggests that the timestamps might be in PDT, which is UTC-7. If that's the case then ... no, some of the (causally) earliest events in the log are from 18:01. Perhaps they were from 2009-07-15? Argh, I give up. Please tell me how to understand the timing of events in foolscap incident report files. I updated http://foolscap.lothar.com/trac/ticket/90 to plead for fully-qualified, unambiguous timestamps.
The triggering incident is "download failure", but scarier to me is that there was a 26 second reactor delay.
Oookay, I read the relevant source code, and the miscounting of the number of connected storage servers was fixed between changeset:35b3f7f426c193cf (the version that exhibited the problem) and current HEAD. However, I'm not sure whether that also means that whatever caused the failures on TestGrid was fixed. Read on.

Here is the path of the code that shows how "Connected to %s of %s known storage servers" was produced at version changeset:35b3f7f426c193cf:

source:src/allmydata/web/welcome.xhtml@3897#L55
source:src/allmydata/web/root.py@3897#L240
source:src/allmydata/introducer/client.py@3897#L277

Here is how it is produced today:

source:src/allmydata/web/welcome.xhtml@3997#L55
source:src/allmydata/web/root.py@3997#L240
source:src/allmydata/storage_client.py@3997#L41

The old way could potentially accumulate lots of tuples of (serverid, servicename, rref) for the same serverid and servicename, if new connections were established to the same serverid but the old connection was not lost, i.e. `notifyOnDisconnect()` wasn't getting called. The new way cannot possibly have an inconsistency between the number of known storage servers and the number of connected storage servers, since both are computed from the same dict: the known servers are all the items of the dict, and the connected storage servers are the ones that have an rref.

Brian: what do you think about `notifyOnDisconnect()` not getting called even while new connections to the same foolscap peer are being established? That's the only explanation I can see for the observed miscounting on the web gateway that was running allmydata-tahoe changeset:35b3f7f426c193cf and foolscap 0.4.1.

Hm, note that at least one of the storage servers was in a messed-up state in which it had incurred MemoryErrors, so it might have been failing to participate in some sort of foolscap "good-bye" negotiation, thus preventing the older connections to it from being cleaned up. This reminds me of these mysterious warnings in the incident reports (I've seen a lot of this kind of message in foolscap logs):
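The single-dict invariant described above can be illustrated with a minimal sketch (hypothetical names, not the actual storage_client.py code):

```python
class ServerDescriptor:
    """Hypothetical stand-in for NativeStorageClientDescriptor:
    tracks at most one remote reference per server."""
    def __init__(self, serverid):
        self.serverid = serverid
        self.rref = None  # set on connect, cleared on disconnect

servers = {}  # serverid -> ServerDescriptor

def add_server(serverid):
    servers.setdefault(serverid, ServerDescriptor(serverid))

def got_connection(serverid, rref):
    # a reconnect simply overwrites the old rref; entries cannot
    # pile up the way (serverid, servicename, rref) tuples could
    servers[serverid].rref = rref

def counts():
    known = len(servers)
    connected = sum(1 for d in servers.values() if d.rref is not None)
    return connected, known

add_server("s1"); add_server("s2"); add_server("s3")
got_connection("s1", object())
got_connection("s1", object())  # replacement connection: no ghost entry
got_connection("s2", object())
assert counts() == (2, 3)  # connected can never exceed known
```

Because both numbers come from the same dict, the "135 connected of 111 known" inconsistency is impossible by construction.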
(flogtool note: until we get foolscap#90 resolved, you can use "flogtool dump --verbose" to get the raw seconds-since-epoch timestamps out of the event logs, which may give you more confidence in comparing them across servers. There is not yet something similar for "flogtool web-viewer", unfortunately. Also note that incident files always store seconds-since-epoch, so any conversion to a string time format is being done by the "flogtool" program that is processing the incident file.)
Yeah, one of the improvements in changeset:8df15e9f30a3bda7 was to use a single data structure to keep track of storage servers, instead of two separate ones. Seemed more robust.

I'd have to look more closely at the old code to figure out how `notifyOnDisconnect` could be confused. It gets fired when the TCP connection is lost, so there's no requirement that the remote side participate in any shutdown-time protocol, but it does get fired through an eventual-send, so there could conceivably have been a race condition or two.

(Incidentally, I removed those "Tub.connectorFinished" WEIRD messages in foolscap-0.4.2, the current release.. they were causing more problems than they were worth.)
I spoke with Brian on IRC this morning before his gallivanting schedule, and he said that it was unlikely that this problem will recur with the current versions of foolscap and of Tahoe-LAFS. I agree, and am going to bump this ticket out of the v1.5 Milestone.
Note that we can't actually close this ticket yet, because there isn't a test for the original issue (duplicate entries in `IntroducerClient._connections` because RemoteReferences don't compare equal when they denote the same underlying TCP connection to the same peer), much less for the newly discussed issue (duplicate RemoteReferences because the peer is somehow setting up new RemoteReferences while `notifyOnDisconnect()` is not getting called for the old ones).

However, I'm fairly confident that this is (a) not a regression versus Tahoe-LAFS v1.4.1, and (b) not a critical bug. (In fact, it might already be fixed in current trunk.)
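A unit test for the original issue might look roughly like this (a hypothetical sketch: it simulates wrappers that define no equality, as the pre-fix VersionedRemoteReference effectively behaved):

```python
class UnequalWrapper:
    """Simulates the old wrapper behavior: no __eq__/__hash__, so two
    wrappers of the same reference are distinct set members."""
    def __init__(self, rref):
        self.rref = rref

underlying = object()  # stands in for one real RemoteReference

# simulate two announcements arriving for the same live connection
connections = set()
for _ in range(2):
    connections.add(("peer1", "storage", UnequalWrapper(underlying)))

# the miscount the ticket describes: one real server, but the
# connection set reports two "connected" entries
assert len(connections) == 2
```

A regression test would assert the opposite (`len == 1`) against the real connection-tracking code.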
hm, I do hope we'll be able to close this ticket eventually, without reverting back to the old code, analyzing the problem, fixing it, then discarding the fix to go back to the new code :-). Let's consider closing this as INVALID after a couple of months of correct operation of the testgrid.
The only part that I'm not sure about is the way that multiple RemoteReferences piled up. I.e., I'm confident that there is not a bug in Tahoe-LAFS with regard to this issue, but I'm not confident that there isn't a bug in foolscap about it. Note that we didn't change the version of foolscap on the testgrid webapi (it is still foolscap 0.4.1).

Maybe if I write down what I know about it in this ticket, that will help me understand it better.
Could you summarize what was changed that made me think (in comment:371001) that foolscap v0.4.0 fixed the original issue of having multiple RemoteReferences over the same TCP connection to the same peer?

Whatever it was, it didn't fix the problem that was demonstrated using Tahoe-LAFS changeset:35b3f7f426c193cf and foolscap 0.4.1, in which there were many more tuples of (serverid, servicetype, rref) than there were servers. This suggests that foolscap 0.4.1 must have been creating new RemoteReferences and not calling back the `notifyOnDisconnect()` method to clean up old RemoteReferences to the same peers.

Now, Tahoe-LAFS at changeset:1192b61dfed62a49 no longer keeps tuples of (peerid, servicetype, rref); instead it keeps a dict mapping from peerid to `NativeStorageClientDescriptor`, where a `NativeStorageClientDescriptor` has at most one RemoteReference.

When foolscap calls `NativeStorageClientDescriptor._got_connection()`, the `NativeStorageClientDescriptor` eventually sets `self.rref` to reference the new RemoteReference (see storage_client.py line 218: source:src/allmydata/storage_client.py@20090717050709-66853-80eda37caf0df376be79f45cbed728999b68a843#L218). If its `self.rref` already had a reference to a RemoteReference, it would be overwritten by the new one, so the problem of lots of redundant RemoteReferences piling up certainly can't happen in Tahoe-LAFS anymore.

But if foolscap is generating redundant RemoteReferences and passing them to Tahoe-LAFS, could this cause other problems? For example, I see that `NativeStorageClientDescriptor` calls `rref.notifyOnDisconnect(self._lost)`. If foolscap is generating redundant rrefs, then maybe `self._lost` will eventually be called by one of the old ones that is no longer referenced by the `NativeStorageClientDescriptor`, and if so, the current rref will get spuriously removed?

Maybe we should add some sanity checking to `NativeStorageClientDescriptor`, such as asserting that when its `_got_connection()` is called its `self.rref` must currently be `None`, or asserting that when its `self._lost()` method is called the current value of `self.rref` is the same as it was when the `notifyOnDisconnect()` was registered. In the interests of a stable v1.5, maybe we could make it so that if these checks fail it does something "safe": for instance, if `self._lost()` is called when `self.rref` is not the same thing that it was when the `notifyOnDisconnect()` was registered, it leaves `self.rref` alone (as well as logging a high-scariness incident).

What do you think?
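The defensive check proposed above could be sketched like so (hypothetical class and method names; a sketch of the idea, not the actual Tahoe-LAFS code):

```python
class StorageDescriptor:
    """Hypothetical sketch of the guarded-disconnect idea: only clear
    self.rref if the disconnecting reference is the one currently held;
    a stale callback from a superseded rref is ignored."""
    def __init__(self):
        self.rref = None

    def got_connection(self, rref):
        self.rref = rref
        # in real code, the specific rref would be bound into the
        # callback, e.g.: rref.notifyOnDisconnect(self._lost, rref)

    def _lost(self, disconnected_rref):
        if disconnected_rref is not self.rref:
            # stale notification from an old connection: leave the
            # current rref alone (and log a high-scariness incident)
            return
        self.rref = None

d = StorageDescriptor()
old, new = object(), object()
d.got_connection(old)
d.got_connection(new)   # replacement connection arrives first
d._lost(old)            # then the old connection's notification fires
assert d.rref is new    # the live connection is not clobbered
d._lost(new)
assert d.rref is None
```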
Brian: I hope you get some time to read my comment and consider whether my analysis is right, whether it shows that there is still a bug in foolscap 0.4.1, and whether, if so, that bug might still cause problems for Tahoe-LAFS.
Zooko and I talked and did some more analysis. Based on that, we think there's a high probability of a foolscap bug (still present in the latest 0.4.2) that causes notifyOnDisconnect to sometimes not get called, probably triggered by "replacement connections" (i.e. where NAT table expiries or something cause an asymmetric close, one side reconnects, and the other side must displace an existing-but-really-dead connection with the new inbound one).
The tahoe code was rewritten to reduce the damage caused by this sort of thing. We could change it further, to remove the use of notifyOnDisconnect altogether, with two negative consequences:
The first problem would be annoying, so I think we're going to leave tahoe alone for now. I'll add a note to the foolscap docs to warn users about the notifyOnDisconnect bug, and encourage people not to rely upon it in environments where replacement connections are likely.
I created #816 (don't rely on notifyOnDisconnect()). We can close this ticket as fixed by the rewrite of the server management between changeset:35b3f7f426c193cf and changeset:1192b61dfed62a49. (Therefore the first release with this bug fixed was v1.5.0.)
Brian, I was just looking at a tahoe-lafs v1.4.1 storage server, and it has an incident report triggered by this:
I remember seeing this WEIRDness a lot in the past, and in fact I think you might have downgraded it so that it is no longer considered weird and stops generating so many incident reports. Anyway, I was wondering whether the hypothesis that `notifyOnDisconnect()` sometimes doesn't fire when it is expected to would cast light on this weirdness. Thanks!

Maybe. Yeah, I remember seeing this particular message an awful lot, and couldn't figure it out, and decided that it was pretty harmless, so I downgraded it. I vaguely recall thinking that it was related to multiple simultaneous connections, in which the losing connection attempts were somehow being doubly-removed from the list of remaining attempts.
I suppose a direction to investigate might be to see what happens to those connectors when the primary connection finally closes. Or if the path that triggers the WEIRD warning might also somehow clobber the notifyOnDisconnect handler.