"cannot convert float NaN to integer" in next_power_of_k, during upload via helper #1418
While performing a backup just a few minutes ago, I got the following exception:
The remote exception was:
I have previously performed successful backups with this precise setup; no configuration has been changed and no servers have been restarted recently. The storage node has ample available space.
I tried running the backup several times and the error occurs each time. However, when running a second time, the file on which the first run failed is reported as uploaded and the second run fails on the next file. Running a check on the reported cap gives:
Without using --verify
Both client and server nodes are running 1.8.2 on Debian stable. Sorry for the big pastes, if there is any other information you would like please let me know.
I'm currently running an fsck on the storage disks. Maybe disk corruption would have this effect?
No, the fsck runs came back clean (I run two storage nodes on the computer).
It seems as though `mathutil.next_power_of_k(num_segments, 2)` was probably called with `num_segments = NaN`. (The next power of 2 is also undefined for negative numbers and infinity, but those give different exception messages.)
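A minimal sketch of the failure mode being described (plain CPython behaviour, not Tahoe code): `math.log()` propagates NaN, and `int()` then raises exactly the error in this ticket's title.

```python
import math

num_segments = float("nan")           # the suspected bad input
x = math.log(num_segments, 2) + 0.5   # math.log() propagates NaN rather than raising
int(x)                                # ValueError: cannot convert float NaN to integer
```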
Attachment preconditions-about-ticket1418.diff (3698 bytes) added
Dear rycee: thank you for the good bug report! I scanned through the code shown by the stack trace and didn't see the problem. Along the way I added preconditions which should identify the problem more precisely. Could you please apply attachment:preconditions-about-ticket1418.diff to your 1.8.2 source tree and reproduce the problem? Thank you very much!
I wonder if it could stem from a configuration change you've made in `tahoe.cfg`, perhaps to the "max segment size" setting. If so, this patch will tell us that this is the problem.

With zooko's patch I get pretty much the same thing as earlier (do I need to give Python some special flags to enable the preconditions?):
Please note that upload.py:926 now appears in the stack, compared to upload.py:924 in the initial stack, so the patch was really applied! I checked this many times to convince myself that I'm not going insane :-)
After a bit of further investigation it seems like the problem only occurs when I'm using a helper. When uploading directly from my client node it works perfectly fine. When I realized this I also realized that it is possible that since upgrading the helper node to 1.8.2, I have only used the client without helper! I did successfully use the helper when it was running 1.8.1 though. I don't know if this helps you narrow down the problem.
Disabling the use of helper and uploading a random file gives:
which is good. However, when I enable the use of a helper and restart the client node I get:
which is bad.
Attachment tahoe.cfg (391 bytes) added
The tahoe.cfg of my storage/helper node (the non-helper nodes just have enabled = false in the [helper] section).
Attachment tahoe-client.cfg (340 bytes) added
My client config with helper enabled.
"cannot convert float NaN to integer" in next_power_of_kto "cannot convert float NaN to integer" in next_power_of_k, during upload via helperReplying to rycee:
No, the [`precondition()`](source:trunk/src/allmydata/util/assertutil.py?annotate=blame&rev=4329) function runs the tests regardless of any flags. (There is a flag in Python to disable `assert`, but `precondition()` doesn't use `assert` or respect that flag.)

Oh, I guess my patch was written under the mistaken assumption that this NaN was arising because one of the inputs to `math.log()` was not an int or long. I'll need to think harder to provide a more useful diagnostics patch, probably starting by re-reading davidsarah's comment:384951.

Wait a minute, are you saying that `tahoe cp` reports `Success: files copied` even though the helper encountered this uncaught exception!? That seems unlikely. Which process encounters the uncaught exception, and in which step of the transcript above?

Attachment diagnostics-about-NaN-from-npok.diff (1105 bytes) added
Okay, this patch will definitely give actually useful diagnostics! attachment:diagnostics-about-NaN-from-npok.diff Please try it out.
Oh, by the way, are you applying these diagnostics patches to the helper source code, too? And restarting the helper? The code in the stack trace would be running in the helper, not in the client, I believe.
Yes, I have applied the patches you've given on the helper node, not the client node. With the new patch I did indeed get some output but being a Python novice I feel more confused, not less. The stacktrace says:
and in the node's twistd.log I found
In the python REPL on the same computer I get
This, together with the `Success: files copied`, makes me quite confused. Feels like the NaN error is a decoy put under our noses while the real problem slips quietly into the night.

I also tried creating a completely pristine 1.8.2 build on my helper node, and now it also fails in the same way as the cp did, i.e., it claims success and the CHK caps reported by a verbose backup are OK when running `check` but fail when running `check --verify`. For example, backup says:

and checking gives:
I will attach the result of a `diff -ur` between the version giving the exception (including zooko's patches from this bug) and the pristine version. Note that the version giving the exception contains edits of mine that change minimum_cycle_time in `crawler.py` and `expirer.py`, but I have had those changes without trouble for a long time (many versions ago).

Oh, finally: since `next_power_of_k_alt` returns a sensible result, I tried making `next_power_of_k` return that value in my build that produces exceptions. Running a backup then proceeds in the same way as in the pristine version, i.e., the client reports success but `check --verify` fails.
diff -ur between build throwing NaN exception and pristine build
Yes, this is getting weird. It looks like `next_power_of_k_math` raised this exception when its inputs were `n=30, k=2`, but when you tried it yourself in the REPL, the same calls to `math.log()` on the same computer worked. It could be a failure in your CPU's floating point support, but I would have expected that to be sporadic or permanent, rather than failing every time under `tahoe` and working every time under the REPL! I wonder if executing `tahoe` is somehow changing the floating point mode of your CPU...

Maybe there's a bug in `next_power_of_k_math()`. Could you please try something like:

A next step for me is to try to understand why the upload is succeeding but the verify is then failing. I guess this error is happening during write. If the error is causing it to get an incorrect value for `effective_segments`:

Then it might write the data out incorrectly. If the file you are uploading was previously uploaded, then deduplication will prevent your gateway from uploading a new copy. This would explain why changing it to use `next_power_of_k_alt()` and then uploading and verifying previously written files got the same failure-to-verify. Oh wait, did you upload a new random file instead of a previously uploaded file when you did the experiment that showed `next_power_of_k_alt` had the same problem? Gah! If you already tried that, then perhaps there are two bugs here -- the NaN exception and a different bug that is corrupting files on write.

Anyway, it wouldn't make sense for this NaN exception to result in an incorrect value of `effective_segments`, when we can clearly see that what results is an uncaught exception!

Here are a couple of ideas you could try:
1. Run `next_power_of_k()` (the version that uses `_math` and then uses `_alt` if `_math` raised an exception) in a tight loop, possibly with multiple processes doing it, and leave those running and reporting whether they get any exceptions (see the sketch after this list).
2. Edit the code to use `next_power_of_k_alt` exclusively and `mv` your entire `storage` directory aside, or create an entirely separate storage server and introducer for testing, and upload and then verify a random file. (If you haven't already done this.)

Two more ideas:

3. Run `trial --until-failure allmydata.test` and just leave it running.
4. memtest86 :-)
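A hedged sketch of idea 1, assuming `next_power_of_k` lives in `allmydata.util.mathutil` as the traceback suggests:

```python
# run this in several processes at once and leave them going;
# any exception from next_power_of_k gets reported immediately
from allmydata.util.mathutil import next_power_of_k

calls = 0
while True:
    calls += 1
    try:
        next_power_of_k(30, 2)  # the same inputs seen in the failing run
    except Exception, e:
        print "failed after %d calls: %r" % (calls, e)
        break
```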
I don't have physical access to the machine, but I ran memtester for quite some time and it didn't show anything. Neither did some stress testing of `math.log`. But running the trial thingy was good: I noticed that `allmydata.test.test_cli.Backup.test_backup` failed with an exception on the `x = int(math.log(n, k) + 0.5)` line. And I found something that seems rather curious to me: putting `print math.log(10)` first in `next_power_of_k` and running `test_backup` gives:

Yeah, that's right. Running `math.log(10)` gives an exception, and since it's only run once I can't imagine it's from the CPU being overloaded or anything. Further:

If I change `math.log(10)` to `math.log10(10)` it still gives the error. But – and hang on to your hat – if I use `math.sqrt(10)` it works; not only does the sqrt work, but the `math.log(n, k)` a few lines down does as well! With `next_power_of_k` defined as reconstructed below, the unit test passes!
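A hedged reconstruction of that workaround; only the `math.sqrt(10)` warm-up call is from the comment above, and the rest of the body is an assumption pieced together from the `x = int(math.log(n, k) + 0.5)` line quoted earlier and the function's name:

```python
import math

def next_power_of_k(n, k):
    math.sqrt(10)  # inexplicably makes the math.log() below stop faulting
    if n == 0:
        x = 0
    else:
        x = int(math.log(n, k) + 0.5)  # the line that raised the NaN error
    r = k ** x
    if r >= n:
        return r          # k**x is already >= n
    return k ** (x + 1)   # the rounded log under-shot; go one power higher
```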
Now, while this elegant fix removes the exception during backup, the uploaded files unfortunately still fail `check --verify`.

Anyway, obviously something strange happens with some floating point operations and, like you said, some floating point mode is probably being altered when running tahoe but not otherwise.
I also tried adding a print of `effective_segments` right after its assignment. It turns out that this variable is set to 1 when I try doing a `cp` of a 1 kB file, and to 128 with a 10 MB file. I don't know if these values are OK, but regardless, it appears that `next_power_of_k` is OK to call sometimes but not other times.

Could you please run this single unit test that fails -- `allmydata.test.test_cli.Backup.test_backup` -- with the `math.log(10)` in place, under valgrind? http://valgrind.org
Valgrind is very good. You'll need to use the Python suppressions file with it:
http://svn.python.org/projects/python/trunk/Misc/valgrind-python.supp
Another idea suggested by Riastradh on IRC is to turn on floating point exceptions: http://docs.python.org/library/fpectl.html
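For reference, a minimal sketch of that fpectl approach (it only works if the interpreter was compiled with `--with-fpectl`, which is an assumption about this Debian build):

```python
import fpectl
fpectl.turnon_sigfpe()  # from here on, a hardware FP fault raises FloatingPointError

import math
print math.log(10)      # ~2.302585 on a healthy machine; FloatingPointError otherwise
```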
Could you please attach the output of:
and
I'll attach the show-tool-versions output as a file (it's a bit messy).
Attachment show-tool-versions-output.txt (11059 bytes) added
Output from show-tool-versions
Attachment valgrind-output.txt (84317 bytes) added
Output from valgrind.
About the `fpectl` library: I tried adding `fpectl.turnon_sigfpe()` to the beginning of the `test_backup` method; this gave me:

which seems … interesting.
I don't know how to associate this with a source location, though, so I dunno if it corresponds to the `math.log(10)` or some other piece of code.

Sigh. The Valgrind Python suppressions file that you used is apparently not working. Perhaps it was written for a 64-bit architecture and has a bunch of "8"s in it which need to be changed to "4"s to make it suppress all of those false positives in attachment:valgrind-output.txt?
All of those "Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 4" reports emitted from `PyObject_Free` or `PyObject_Realloc` are false positives. You can generate new suppression entries to put into a suppressions file by running valgrind with suppression generation turned on, as sketched below.

You might also want to report a bug to the distributor of your valgrind-python.supp file. If you do, please let me know the URL to the ticket.
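A hedged sketch of such a run; `--gen-suppressions` and `--suppressions` are standard valgrind options, while the exact trial invocation is an assumption:

```sh
valgrind --suppressions=valgrind-python.supp --gen-suppressions=all \
    python bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup
```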
Hm, judging from http://docs.python.org/library/fpectl.html, any Python code (in the CPython interpreter or in an extension module) which does floating point math is supposed to wrap the calls to floating point functions in macros named `PyFPE_START_PROTECT` and `PyFPE_END_PROTECT` in order to trap the exception and raise a nice Python exception. Apparently this hasn't happened!

Hopefully you can get a run under valgrind without all the noise from false positives. Maybe whatever problem ends in the floating point exception starts with a memory mis-usage that valgrind can report.
Jason Riedy on identi.ca wrote: "If the 30L somehow is pushed on a stack as an integer, the resulting corruption likely would produce a NaN." -- http://identi.ca/notice/77048991
As suggested by Riastradh in #tahoe-lafs, I ran it under strace. Using `strace -f python bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup` I got a 3.3 MB file, which I can attach if requested. The last 20 lines are:

After some help I also managed to catch the SIGFPE in gdb. Since I don't have debug symbols in my Python (simply the version found in Debian Squeeze), I don't know how useful it is. Anyway, I pasted the stack trace to http://hpaste.org/48228
Riastradh suggested that the cause could be a cosmic ray flipping a bit in memory, especially after seeing strange output from `info float` in gdb. I tried flushing the file cache, rebooting, reinstalling Python 2.6, uninstalling Python 2.5 (in case it was somehow interfering), rebuilding tahoe, and building tahoe 1.8.1. I still get an exception from `x = int(math.log(n, k) + 0.5)`. Note, I also get an exception from my newly built 1.8.1. I have previously used 1.8.1 reliably as a helper, so I suppose the problem might be caused by some newer version of a library pulled in during `python setup.py build`?

Replying to rycee:
Yes, Tahoe-LAFS itself is written in pure Python. I suspect this problem is due to memory corruption or some other incorrect C/C++ code, so it is probably due to one of the dependencies. I really want to track down which one it is, so please try to continue maintaining a "paper trail" of what you changed at each step of this. Thanks!
To see which dependencies have native code in them, look at this table: http://tahoe-lafs.org/source/tahoe-lafs/deps/tahoe-lafs-dep-eggs/README.html Every column there is a library that uses native (C/C++) code.
If I were you, my next step would be to clear the false positives out of valgrind (as I described in comment:384968). valgrind will either identify the problem, or if it gives a clean report then we can be confident that there is no branching on uninitialized memory, writing to memory out of bounds, or other of the memory errors that valgrind checks for.
If you're not interested in doing that then say so, and let's think of some other approaches to track down this bug.
Oh, I have an idea. Turn on tracing -- http://www.doughellmann.com/PyMOTW/trace/ -- then you'll know what line of source code triggered the FP exception. (Caveat: the Python process might buffer the output and then die due to the FP exception before it flushes the line telling us the actual Python code which came immediately before the exception. So turn on the `PYTHONUNBUFFERED=1` env var or pass `-u` to `python`.)
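A hedged sketch of that tracing run; the stdlib `trace` module and `PYTHONUNBUFFERED` are standard, but the exact invocation is an assumption:

```sh
# unbuffered, so the last traced line isn't lost when SIGFPE kills the process
PYTHONUNBUFFERED=1 python -u -m trace --trace \
    bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup
```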
Turns out Debian stable has a python2.6-dbg package. I installed that and caught the SIGFPE when running the `allmydata.test.test_upload.EncodingParameters.test_upper_limit_on_readonly_queries` test (it also fails, and runs quicker than the `test_backup` test). Anyway, I get:

The failing line in `intobject.c` is `return PyFloat_FromDouble((double)(v->ob_ival));` so presumably it is the cast to double that fails. But since `ob_ival = 40000` I can't see any reasonable reason why. The failing line in tahoe is `self._status.set_progress(0, float(bytes_read)/self._size)` in `immutable/upload.py`.

If I have time then I'll try the valgrind thingy, but at the moment I'm pretty much ready to chalk it all up to hexed hardware.
It can't be hexed hardware, because it is reproducible (at least mostly?) and it doesn't happen with any other executables, nor with other Python experiments. Say, here's an idea: my best guess is that one of the native-code dependencies (hopefully not zfec or pycryptopp -- the ones I'm responsible for!) is doing something bad. You could test this by running some complicated floating point math, then importing each native code dependency (see comment:23), and then running the floating point math again.
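A hedged sketch of that experiment; the module list is an assumption based on the dependency table linked above:

```python
import math

def fp_probe():
    # a couple of the operations seen failing in this ticket
    try:
        return (math.log(10), math.log(30, 2))
    except Exception, e:
        return ("EXCEPTION", str(e))

print "baseline:", fp_probe()
for mod in ["zfec", "pycryptopp", "OpenSSL", "simplejson"]:
    try:
        __import__(mod)
    except ImportError:
        print mod, "not installed"
        continue
    print "after importing %s:" % mod, fp_probe()
```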
Also the tracing approach that I suggested in comment:384974 is quite likely to give us useful information. If you're not comfortable with all the Python coding suggested on http://www.doughellmann.com/PyMOTW/trace/ then I can whip up a script that'll do it for you. :-)
Chad Austin had a suggestion on twitter:
Did you try asserting on the x87 and SSE control words periodically? I put that assertion after every test in our test runner.
https://twitter.com/#!/chadaustin/status/84508454357696514
I asked Chad Austin on twitter how to detect the x87 and SSE control words. He replied:
On Windows with Python 2.6, this seems to work:
The meanings of the control registers: http://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz/
Kludgy but more cross-platform way to get the FP environment (assuming `sizeof(fenv_t) <= 100`): use ctypes to call `fegetenv()` and dump the raw buffer, which on my system gives an opaque string of bytes.
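A minimal sketch of that kludge, reconstructed from the calls referenced in the next paragraph (`libm.fegetenv(b)`, `b.raw`); the library name and buffer size are assumptions:

```python
import ctypes

libm = ctypes.CDLL("libm.so.6")        # assumption: glibc's libm
b = ctypes.create_string_buffer(100)   # assuming sizeof(fenv_t) <= 100
libm.fegetenv(b)                       # snapshot the floating point environment
print repr(b.raw)                      # opaque bytes; compare snapshots before/after a test
```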
To decode the structure, see http://linux.die.net/include/bits/fenv.h. Note that we probably only need to detect when it changes during a unit test. For that, I would try hacking the `run` method of `TestCase` in `twisted.trial.unittest` so that it calls `libm.fegetenv(b)` before and after running the test code (the `result.startTest(self)` line) and prints the values of `b.raw` if they are different.

There was a bug in the Linux kernel that would cause floating point state to be wrong "on 32-bit linux on modern CPUs":
http://article.gmane.org/gmane.linux.kernel/1258780
comment:384966 says that rycee's machine was in 32-bit mode. rycee's [/attachment/ticket/1418/show-tool-versions-output.txt show-tool-versions output] says that the kernel was 2.6.32-5 (from Debian). I wasn't able to tell from the kernel patch comments when this bug was introduced into the kernel.
Anyway, since rycee appears to have lost interest in debugging this, I suggest we close this ticket as unreproducible until rycee or someone else who can reproduce it reappears.
Yeah, sorry about not being active on this bug. I've been happy with using my tahoe grid without helper so I haven't felt very motivated to dig around.
zooko: The Linux bug you linked seems unrelated, since the CPU in this computer is an old Pentium III-M, 1.2 GHz, and the bug was for newer, presumably 64-bit, CPUs.
But I have interesting news: I just tried downloading and building Tahoe 1.9.1. When I ran `allmydata.test.test_cli.Backup.test_backup` it was successful! I thought it might be related to some glibc update or something, so I also downloaded both 1.8.2 and 1.8.3, built them, and ran the same test. They both failed in the same way as described in my previous comments. So I assume that some change between 1.8.3 and 1.9.0 must have fixed my problem.

Just to double-check that 1.9.1 actually fixes my problem, I enabled the helper functionality and tried to upload through it; indeed it was successful.
rycee: oh boy! You've managed to reproducibly isolate a way to make the bug happen or not happen! Now here's a question: can you run `git bisect` and get it to automatically run the unit tests between each version change so that it finds exactly what patch makes the difference (a sketch follows)? Or do the equivalent manually -- try different versions and narrow down exactly what causes the problem?

Honestly, I'm still skeptical that anything we do in our 100% Python code could cause this bug, but perhaps something about the sequence of instructions that we're sending through the CPU accidentally avoids or accidentally triggers it...
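A hedged sketch of automating that with `git bisect run`; the tag names are assumptions, and the good/bad sense is inverted because we are hunting the commit that *fixed* the test:

```sh
git bisect start
git bisect good allmydata-tahoe-1.8.3   # test still fails here
git bisect bad  allmydata-tahoe-1.9.1   # test passes here
# exit 0 marks a revision "good" (test fails), nonzero marks it "bad" (test passes)
git bisect run sh -c \
    '! python setup.py test -s allmydata.test.test_cli.Backup.test_backup'
```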
Also I am still willing to help you use some diagnostic tools like valgrind or the fp env from comment:384979...
Also you might want to try applying that one patchset from linux to change the way it manages floating point state during context switches and see if that makes the problem go away. I don't know for sure that the linux bug doesn't apply on an old Pentium III -- it was just something from, IIRC, gregkh's announcement that mentioned "newer CPUs".
Ok, gave bisect a try and it told me that the issue was fixed in commit 095efbf7a5f8ee375a81136f87ba3943ec995fc6. This change seems to make Tahoe depend on pycryptopp version 0.5.20 or later on my machine; before this change it was satisfied with version 0.5.14 or later.
On my Debian Squeeze installation, the python-pycryptopp package is of version 0.5.17-1+b1, satisfying the dependency before davidsarah's change but not after. Finding this out, I of course tried uninstalling the python-pycryptopp Debian package and rebuilding a pristine Tahoe 1.8.2 source tree. The failing test now happily passes!
About the kernel patch. I'm reluctant to try this since I have no physical access to the machine and rely on it being up stably.
Aha. It seems quite likely that one of the bugs fixed between pycryptopp ~~0.5.14~~ 0.5.17-1+b1 and 0.5.20 could have been responsible. I doubt it has anything to do with the kernel bug.

I guess we can raise the lower bound to pycryptopp version 0.5.17-1+b1, since that's what my Debian box was using when the test was failing. I don't know whether it is worth trying to determine the problem in pycryptopp…
Anyway, thank you David-Sarah for making the commit 095efbf7a5f8ee375a81136f87ba3943ec995fc6 that got us closer to the underlying cause :-)
Note that it's possible that the Debian cryptopp was broken or miscompiled or included a bad Debian-specific patch, rather than this being a fix between upstream pycryptopp 0.5.17 and 0.5.20.
(Can you tell I don't really trust the Debian packaging policies or practices? We'd have a lot fewer problems if they would just follow upstream rather than picking out individual patches without really understanding them.)
Oh, also note that the Debian pycryptopp links against a non-embedded Crypto++, so we actually have no idea what version of Crypto++ that is (unless rycee can tell what it was).
I considered closing this ticket as fixed, since it seems relatively certain from the bisection that this is a fixed pycryptopp or Crypto++ issue. However, since we don't know whether Debian added a bad patch to their pycryptopp or Crypto++, I'll leave it open until we can determine that.
Yes, it is also possible that there was a bug in upstream pycryptopp itself. rycee: can you experiment with swapping in the old Debian version of pycryptopp versus the old upstream version of pycryptopp, maybe?
Replying to zooko:
One reason I thought that less likely is that I reviewed all the patches between 0.5.17 and 0.5.20 and didn't see any fixes that could have caused this bug.
Yes, that's a good test.
I added 0.5.17 as an upper bound in _auto_deps.py and running setup.py build now prints
and it goes on to spend quite some time building the Crypto++ library. After finishing, this build successfully runs the previously failing test even though it's Tahoe version 1.8.2. If I then replace the `_pycryptopp.so` file inside this Tahoe tree with the corresponding file from the Debian package, the error comes back. The Debian version of pycryptopp is dynamically linked against Crypto++ version 5.6.0-6.

So, finally, leaving Debian's `_pycryptopp.so` in place, I tried downloading and building a dynamic library from upstream Crypto++ 5.6.0 and 5.6.1 and substituting Debian's copy. Curiously, the test failed with 5.6.0 and worked with 5.6.1…

Okay, this is very good bug isolation. I didn't quite follow the last bit though -- is the bug in upstream Crypto++ 5.6.0 and not upstream Crypto++ 5.6.1? Or is it in Debian Crypto++ 5.6.0 and not Debian Crypto++ 5.6.1?
By the way, we have a "quick startup self-test" built into pycryptopp, which is an attempt to catch the most egregious of these sorts of failures at run-time. I wonder if it would be possible to make that quick startup self-test more rigorous so that it would catch this one. I guess I'll be better able to tell once we really isolate this bug.
Sorry about the vagueness. I used upstream `cryptopp560.zip` and `cryptopp561.zip` downloaded from http://www.cryptopp.com/. The 5.6.0 makefile didn't have a target for a shared library, so I copied that build target from 5.6.1; this is the only change I made to them. Building the `cryptest.exe` program and running it with `./cryptest.exe v` didn't show any problems. Similarly, there were no errors reported for `./cryptest.exe tv TestVectors/all.txt`.

Hm, well okay, so we're pretty sure there is a bug in Crypto++ 5.6.0 that is fixed in Crypto++ 5.6.1, right? Maybe we could let it go at that, since it is apparently fixed. The steps to reproduce are:
1. Build Crypto++ 5.6.0 as a shared library (a .so).
2. Build pycryptopp with the `--disable-embedded-cryptopp` option so that it will link against the .so generated in step 1 instead of using its own bundled copy of the Crypto++ source code.
3. Run `python setup.py test -s allmydata.test.test_cli.Backup.test_backup` (see the sketch after this list).
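A hedged sketch of those steps as shell commands; the archive name, directory names, and shared-library target are assumptions (as noted above, the 5.6.0 makefile needs the `libcryptopp.so` target copied in from 5.6.1):

```sh
# 1. build Crypto++ 5.6.0 as a shared library
unzip cryptopp560.zip -d cryptopp560 && cd cryptopp560
make libcryptopp.so && sudo make install

# 2. build pycryptopp against that .so instead of its embedded copy
cd ../pycryptopp
python setup.py build --disable-embedded-cryptopp

# 3. run the previously failing test from the Tahoe 1.8.2 tree
cd ../allmydata-tahoe-1.8.2
python setup.py test -s allmydata.test.test_cli.Backup.test_backup
```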
On your machine this 100% reproducibly triggers the problem, and swapping in Crypto++ 5.6.1 for 5.6.0 makes it stop happening?
Okay, one more thing: how do I know I don't have the same bug in the copy of Crypto++ that is bundled inside pycryptopp? I don't think we've upgraded the embedded copy of Crypto++ to a newer version of Crypto++ since this patch [91f454b47f66636fffde68e8a8bf4364bed7516e]:

That was four years ago, before Crypto++ 5.6.0 was released. Since then we've just been cherry-picking patches from upstream Crypto++. So if this bug was present in the version that we copied, and if we didn't realize that we needed to cherry-pick the patch that fixes it, then it could be present in the embedded copy of Crypto++ inside pycryptopp.
Could you please replace step 2 above with building pycryptopp while omitting the `--disable-embedded-cryptopp` flag, to force it to use its embedded version of Crypto++? I think you've probably done that already in this process, but if you could do it again now just to be sure, that would make me feel better.
--disable-embedded-cryptopp
and once without. I then ranpython setup.py test
on both. For the build that used the embedded version of Crypto++ this finished without any errors. For the one using the dynamically linked Crypto++ (v 5.6.0) it ended in failure and I'll make an attachment containing the output of the failed run.Note, when I used
--disable-embedded-cryptopp
it was built against the Debian version of Crypto++ 5.6.0 since I don't know how to convincesetup.py
to look for headers and libraries in a non-standard directory. It still fails in the same way, though, if I use LD_LIBRARY_PATH to force it to load the upstream Crypto++ 5.6.0. Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.I'm not entirely sure how to use a custom pycryptopp in a Tahoe build so I didn't try to use my two pycryptopp builds in Tahoe but I suppose the pycryptopp test is telling enough. Also, when I run the Tahoe build now (once I uninstalled the Debian pycryptopp) it downloads and builds pycryptopp using the embedded Crypto++; and as previously seen those builds are doing just fine.
Anyway, I think it's safe to say that the Crypto++ embedded in pycryptopp doesn't exhibit the problem I encountered.
Attachment pycryptopp-dynlink-test-failure.txt (58050 bytes) added
Output from pycryptopp test when built with --disable-embedded-cryptopp
Replying to rycee:
This is almost sufficient to convince me that this is a fixed bug in Crypto++ (despite it not being clear which change fixed it). However, we haven't excluded the possibility that the problem interacts with the dynamic linking, i.e. that it fails in non-embedded mode.
It sounds like we should open a new (pycryptopp) ticket for that, since at some point Debian or Ubuntu will upgrade to 5.6.1 and then it may fail with the same error.
I agree to close this ticket as "invalid".
About the undefined symbol error: it sounds like rycee compiled it against the Crypto++ 5.6.0 headers provided by Debian, and then used `LD_LIBRARY_PATH` to force it to load the Crypto++ 5.6.1 dynamic link library. That's not a safe thing to do, and nobody, or at least not us, can do anything about that giving undefined symbol errors. Am I right, or am I confused?

Replying to [davidsarah]comment:49:
I'm a bit curious, however, as to why the `cryptest.exe` program from Crypto++ 5.6.0 didn't fail; maybe the failing test from pycryptopp could be ported to a test in `cryptest.exe`?

However, curiously enough, when I used Debian's pycryptopp – which is dynamically linked against Debian's Crypto++ – I was able to change out the `.so` file from 5.6.0 to 5.6.1 without problem. Doing this is how I found out that 5.6.1 worked. I suppose it has something to do with the patches and how the Debian pycryptopp and Crypto++ packages are built.

Replying to [rycee]comment:51:
The bug seems to affect the floating point state, and it's quite plausible that `cryptest.exe` doesn't test for that. A regression test would indeed be nice -- if you're sufficiently motivated then I suggest asking on the Crypto++ mailing list.

Yes, as Zooko said, this is an invalid test.
Ah, so if that worked then it is not the dynamic linking. Okay, let's stop worrying about this one.
Okay, so we think this is a bug that was introduced into Crypto++ after we stopped updating pycryptopp's embedded copy of Crypto++ to track upstream Crypto++, and was then released in Crypto++ v5.6.0, and then was fixed in Crypto++ v5.6.1. It would be nice if we knew more precisely what bug was introduced and what patch fixed it, or if we at least had a reliable regression test that would tell us if we later introduce this bug into pycryptopp, but I'm not sure how to go about that and it might not be worth the effort.
So, shall we close this ticket now?
This bug should certainly be closed as invalid or something similar. I may explore Crypto++ a bit further as time and interest permits but will raise such a discussion in a more suitable forum.
Thanks for your help in tracking this down. It's been a tricky issue and I'm happy it resolved itself so pleasantly.
Okay, thanks a lot for your help!