"cannot convert float NaN to integer" in next_power_of_k, during upload via helper #1418

Closed
opened 2011-06-14 11:13:29 +00:00 by rycee · 59 comments
rycee commented 2011-06-14 11:13:29 +00:00
Owner

While performing a backup just a few minutes ago, I got the following exception:

Traceback (most recent call last):
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/support/bin/tahoe", line 9, in <module>
    load_entry_point('allmydata-tahoe==1.8.2', 'console_scripts', 'tahoe')()
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/runner.py", line 113, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/runner.py", line 99, in runner
    rc = cli.dispatch[command](so)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/cli.py", line 540, in backup
    rc = tahoe_backup.backup(options)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
allmydata.scripts.common_http.HTTPError: Error during file PUT: 500 Internal Server Error

The remote exception was:

Traceback (most recent call last):
Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/foolscap-0.6.1-py2.6.egg/foolscap/call.py", line 674, in _done
    self.request.complete(res)
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/foolscap-0.6.1-py2.6.egg/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 361, in callback
    self._startRunCallbacks(result)
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 924, in locate_all_shareholders
    num_segments, n, k, desired)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 225, in get_shareholders
    None)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 88, in make_write_bucket_proxy
    num_share_hashes, uri_extension_size_max, nodeid)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 35, in next_power_of_k
    x = int(math.log(n, k) + 0.5)
exceptions.ValueError: cannot convert float NaN to integer

I have previously performed successful backups with this precise setup; no configuration has been changed and no servers have been restarted recently. The storage node has ample available space.

I tried running the backup several times and the error occurs each time. However, on a second run the file on which the first run failed is reported as already uploaded, and the second run then fails on the next file. Running a check on the reported cap gives:

$ tahoe check --verify --raw URI:CHK:2pqpbe...
{
 "results": {
  "needs-rebalancing": false, 
  "count-shares-expected": 3, 
  "healthy": false, 
  "count-unrecoverable-versions": 1, 
  "count-shares-needed": 1, 
  "sharemap": {}, 
  "count-recoverable-versions": 0, 
  "servers-responding": [
   "5yea4my3w3frgp524lgthrb7rdd6frtr", 
   "44g5kkgwulzrrrntdzci7jtt5rgt6nuo", 
   "bzyf23mghgxycnr34pdkqdmybnevf4ks"
  ], 
  "count-good-share-hosts": 0, 
  "count-wrong-shares": 0, 
  "count-shares-good": 0, 
  "count-corrupt-shares": 3, 
  "list-corrupt-shares": [
   [
    "bzyf23mghgxycnr34pdkqdmybnevf4ks", 
    "barr7kgra3pst6icbtnmxrggca", 
    0
   ], 
   [
    "44g5kkgwulzrrrntdzci7jtt5rgt6nuo", 
    "barr7kgra3pst6icbtnmxrggca", 
    1
   ], 
   [
    "5yea4my3w3frgp524lgthrb7rdd6frtr", 
    "barr7kgra3pst6icbtnmxrggca", 
    2
   ]
  ], 
  "recoverable": false
 }, 
 "storage-index": "barr7kgra3pst6icbtnmxrggca", 
 "summary": "Not Healthy: 0 shares (enc 1-of-3)"
}

Without using --verify:

$ tahoe check --raw URI:CHK:2pqp...
{
 "results": {
  "needs-rebalancing": false, 
  "count-shares-expected": 3, 
  "healthy": true, 
  "count-unrecoverable-versions": 0, 
  "count-shares-needed": 1, 
  "sharemap": {
   "0": [
    "bzyf23mghgxycnr34pdkqdmybnevf4ks"
   ], 
   "1": [
    "44g5kkgwulzrrrntdzci7jtt5rgt6nuo"
   ], 
   "2": [
    "5yea4my3w3frgp524lgthrb7rdd6frtr"
   ]
  }, 
  "count-recoverable-versions": 1, 
  "servers-responding": [
   "5yea4my3w3frgp524lgthrb7rdd6frtr", 
   "44g5kkgwulzrrrntdzci7jtt5rgt6nuo", 
   "bzyf23mghgxycnr34pdkqdmybnevf4ks"
  ], 
  "count-good-share-hosts": 3, 
  "count-wrong-shares": 0, 
  "count-shares-good": 3, 
  "count-corrupt-shares": 0, 
  "list-corrupt-shares": [], 
  "recoverable": true
 }, 
 "storage-index": "barr7kgra3pst6icbtnmxrggca", 
 "summary": "Healthy"
}

Both client and server nodes are running 1.8.2 on Debian stable. Sorry for the big pastes; if there is any other information you would like, please let me know.

tahoe-lafs added the
c/unknown
p/major
t/defect
v/1.8.2
labels 2011-06-14 11:13:29 +00:00
tahoe-lafs added this to the undecided milestone 2011-06-14 11:13:29 +00:00
rycee commented 2011-06-14 11:16:29 +00:00
Author
Owner

I'm currently running a fsck on the storage disks. Maybe a disk corruption would have this effect?

No, the fsck runs came back clean (I run two storage nodes on the computer).


It seems as though mathutil.next_power_of_k(num_segments, 2) was probably called with num_segments = NaN. (The next power of 2 is also undefined for negative numbers and infinity, but those give different exception messages.)

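For reference, that exact ValueError is what CPython raises whenever int() is handed a NaN; a minimal demonstration (a hypothetical REPL session, not from the affected machine):

>>> int(float('nan'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot convert float NaN to integer

so the expression math.log(n, k) + 0.5 at mathutil.py line 35 must be evaluating to NaN.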

Attachment preconditions-about-ticket1418.diff (3698 bytes) added


Dear rycee: thank you for the good bug report! I scanned through the code shown by the stack trace and didn't see the problem. Along the way I added preconditions which should identify the problem more precisely. Could you please apply attachment:preconditions-about-ticket1418.diff to your 1.8.2 source tree and reproduce the problem? Thank you very much!

I wonder if it could stem from a configuration change you've made in tahoe.cfg, perhaps to the "max segment size" setting. If so, this patch will tell us that this is the problem.

rycee commented 2011-06-22 08:55:30 +00:00
Author
Owner

With zooko's patch I get pretty much the same thing as earlier (do I need to give python some special flags to enable the preconditions?):

[snip]
--- <exception caught here> ---
  File "/home/rycee/allmydata-tahoe-1.8.2.foo/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 926, in locate_all_shareholders
    num_segments, n, k, desired)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 225, in get_shareholders
    None)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 88, in make_write_bucket_proxy
    num_share_hashes, uri_extension_size_max, nodeid)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 35, in next_power_of_k
    x = int(math.log(n, k) + 0.5)
exceptions.ValueError: cannot convert float NaN to integer

Please note that upload.py:926 now appears in the stack, compared to upload.py:924 in the initial stack, so the patch really was applied! I checked this many times to convince myself that I'm not going insane :-)

After a bit of further investigation it seems like the problem only occurs when I'm using a helper. When uploading directly from my client node it works perfectly fine. When I realized this, I also realized that it is possible that, since upgrading the helper node to 1.8.2, I have only used the client without the helper! I did successfully use the helper when it was running 1.8.1 though. I don't know if this helps you narrow down the problem.

Disabling the use of helper and uploading a random file gives:

$ dd if=/dev/urandom of=rnd0 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.256792 s, 4.1 MB/s
$ tahoe cp rnd0 priv:
Success: files copied
$ tahoe check priv:rnd0
Summary: Healthy
 storage index: fpeoa7vqrcrjny354zmru3gqda
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify priv:rnd0
Summary: Healthy
 storage index: fpeoa7vqrcrjny354zmru3gqda
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0

which is good. However, when I enable the use of a helper and restart the client node I get:

$ dd if=/dev/urandom of=rnd1 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.256289 s, 4.1 MB/s
$ tahoe cp rnd1 priv:
Success: files copied
$ tahoe check priv:rnd1
Summary: Healthy
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify priv:rnd1
Summary: Not Healthy: 0 shares (enc 1-of-3)
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 0 (encoding is 1-of-3)
 wrong-shares: 0
 corrupt shares:
  server bzyf23mghgxycnr34pdkqdmybnevf4ks, SI rzlouseev5hbrte62facntsj4i, shnum 2
  server 44g5kkgwulzrrrntdzci7jtt5rgt6nuo, SI rzlouseev5hbrte62facntsj4i, shnum 0
  server 5yea4my3w3frgp524lgthrb7rdd6frtr, SI rzlouseev5hbrte62facntsj4i, shnum 1

which is bad.

rycee commented 2011-06-22 09:01:52 +00:00
Author
Owner

Attachment tahoe.cfg (391 bytes) added

The tahoe.cfg of my storage/helper node (the non-helper nodes just have enabled=false in the [helper] section).

rycee commented 2011-06-22 09:04:06 +00:00
Author
Owner

Attachment tahoe-client.cfg (340 bytes) added

My client config with helper enabled.

daira added
c/code
and removed
c/unknown
labels 2011-06-22 09:29:49 +00:00
daira changed title from "cannot convert float NaN to integer" in next_power_of_k to "cannot convert float NaN to integer" in next_power_of_k, during upload via helper 2011-06-22 09:29:49 +00:00

Replying to rycee:

With zooko's patch I get pretty much the same thing as earlier (do I need to give python some special flags to enable the preconditions?)

No, the precondition() function (source:trunk/src/allmydata/util/assertutil.py) runs the tests regardless of any flags. (There is a flag in Python to disable assert, but precondition() doesn't use assert or respect that flag.)
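
For illustration, such a helper can be as simple as the following sketch (an assumed shape for exposition, not the actual assertutil code):

def precondition(test, *extra):
    # A plain function call runs unconditionally; unlike the `assert`
    # statement, it is not stripped out by `python -O`.
    if not test:
        raise AssertionError("precondition failed", extra)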

Oh, I guess my patch was written under the mistaken assumption that this NaN was arising because one of the inputs to math.log() was not an int or long. I'll need to think harder to provide a more useful diagnostics patch, probably starting by re-reading davidsarah's comment:384951.

However, when I enable the use of a helper and restart the client node I get:

$ dd if=/dev/urandom of=rnd1 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.256289 s, 4.1 MB/s
$ tahoe cp rnd1 priv:
Success: files copied
$ tahoe check priv:rnd1
Summary: Healthy
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify priv:rnd1
Summary: Not Healthy: 0 shares (enc 1-of-3)
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 0 (encoding is 1-of-3)
 wrong-shares: 0
 corrupt shares:
  server bzyf23mghgxycnr34pdkqdmybnevf4ks, SI rzlouseev5hbrte62facntsj4i, shnum 2
  server 44g5kkgwulzrrrntdzci7jtt5rgt6nuo, SI rzlouseev5hbrte62facntsj4i, shnum 0
  server 5yea4my3w3frgp524lgthrb7rdd6frtr, SI rzlouseev5hbrte62facntsj4i, shnum 1

which is bad.

Wait a minute, are you saying that tahoe cp reports Success: files copied even though the helper encountered this uncaught exception!? That seems unlikely. Which process encounters the uncaught exception and in which step of the transcript above?


Attachment diagnostics-about-NaN-from-npok.diff (1105 bytes) added


Okay, this patch will definitely give actually useful diagnostics! attachment:diagnostics-about-NaN-from-npok.diff Please try it out.


Oh, by the way, are you applying these diagnostic patches to the helper source code, too? And restarting the helper? The code in the stack trace would be running in the helper, not in the client, I believe.

rycee commented 2011-06-23 08:56:45 +00:00
Author
Owner

Yes, I have applied the patches you've given on the helper node, not the client node. With the new patch I did indeed get some output, but being a Python novice I feel more confused, not less. The stack trace says:

  File "/home/rycee/allmydata-tahoe-1.8.2-bug1418/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 926, in locate_all_shareholders
    num_segments, n, k, desired)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 225, in get_shareholders
    None)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 88, in make_write_bucket_proxy
    num_share_hashes, uri_extension_size_max, nodeid)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 49, in next_power_of_k
    return next_power_of_k_math(n, k)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 35, in next_power_of_k_math
    x = int(math.log(n, k) + 0.5)
exceptions.ValueError: ('cannot convert float NaN to integer', 30L, 2, 32)
]'>

and in the node's twistd.log I found

2011-06-23 08:19:10+0200 [Negotiation,1,46.10.48.88] XXX n: 30 :: <type 'long'>, k: 2 :: <type 'int'>, next_power_of_k_alt: 32

In the python REPL on the same computer I get

$ python
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.log(30L, 2)
4.9068905956085187
>>> int(math.log(30L, 2) + 0.5)
5
>>> 2**5
32

This, together with the Success: files copied, makes me quite confused. Feels like the NaN error is a decoy put under our noses while the real problem slips quietly into the night.

I also tried creating a completely pristine 1.8.2 build on my helper node and now it also fails in the same way as the cp did, i.e., it claims success and the CHK caps reported by a verbose backup are OK when running check but fail when running check --verify. For example, backup says

/home/rycee/photos.new/2011/05/22/IMGP4679.JPG -> URI:CHK:k3fpasihz7g7ogsrmbywdfgdy4:jlcyer5z43nuuvdm72qyqlw2eq5uyjubpxey25gfdizidmcdlnrq:1:3:3877153

and checking gives

$ tahoe check URI:CHK:k3fpasihz7g7ogsrmbywdfgdy4:jlcyer5z43nuuvdm72qyqlw2eq5uyjubpxey25gfdizidmcdlnrq:1:3:3877153
Summary: Healthy
 storage index: 7sxgsu3edgy43at77tuskouuay
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify URI:CHK:k3fpasihz7g7ogsrmbywdfgdy4:jlcyer5z43nuuvdm72qyqlw2eq5uyjubpxey25gfdizidmcdlnrq:1:3:3877153
Summary: Not Healthy: 0 shares (enc 1-of-3)
 storage index: 7sxgsu3edgy43at77tuskouuay
 good-shares: 0 (encoding is 1-of-3)
 wrong-shares: 0
 corrupt shares:
  server bzyf23mghgxycnr34pdkqdmybnevf4ks, SI 7sxgsu3edgy43at77tuskouuay, shnum 2
  server 44g5kkgwulzrrrntdzci7jtt5rgt6nuo, SI 7sxgsu3edgy43at77tuskouuay, shnum 0
  server 5yea4my3w3frgp524lgthrb7rdd6frtr, SI 7sxgsu3edgy43at77tuskouuay, shnum 1

I will attach the result of a diff -ur between the version giving the exception (including zooko's patches from this bug) and the pristine version. Note that the version giving the exception contains edits of mine that change minimum_cycle_time in crawler.py and expirer.py, but I have had those changes in place without trouble for a long time (many versions).

Oh, finally. Since next_power_of_k_alt returns a sensible result, I tried making next_power_of_k return that value in my build that produces exceptions. Running a backup then proceeds in the same way as the pristine version, i.e., the client reports success but check --verify fails.
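
For context, an integer-only next-power computation avoids floating point entirely; here is a minimal sketch of what an _alt-style variant could look like (an assumption for illustration, not the actual mathutil code):

def next_power_of_k_int(n, k):
    # Multiply up instead of calling math.log(), so no floating-point
    # state can affect the result.
    assert n >= 0 and k > 1
    p = 1
    while p < n:
        p *= k
    return p

With this, next_power_of_k_int(30, 2) returns 32, matching the next_power_of_k_alt value in the twistd.log line above.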

rycee commented 2011-06-23 08:58:49 +00:00
Author
Owner

Attachment 1.8.2.diff (17244 bytes) added

diff -ur between build throwing NaN exception and pristine build


Yes, this is getting weird. It looks like next_power_of_k_math raised this exception when its inputs were n=30, k=2, but when you tried it yourself in the REPL, the same calls to math.log() on the same computer worked. It could be a failure in your CPU's floating point support, but I would have expected that to be sporadic or permanent, rather than to fail every time under tahoe and work every time under the REPL! I wonder if executing tahoe is somehow changing the floating point mode of your CPU...

Maybe there's a bug in next_power_of_k_math(). Could you please try something like:

HACK zompu:~/playground/tahoe-lafs/trunk$ PYTHONPATH=src python
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from allmydata.util import mathutil
>>> mathutil.next_power_of_k(30, 2)
32
>>> mathutil.next_power_of_k_math(30, 2)
32
>>> mathutil.next_power_of_k_alt(30, 2)
32

A next step for me is to try to understand why the upload succeeds but verification then fails. I guess this error is happening during the write. If the error is causing it to get an incorrect value for effective_segments:

  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)

Then it might write the data out incorrectly. If the file you are uploading was previously uploaded then the deduplication will prevent your gateway from uploading a new copy. This would explain why changing it to use next_power_of_k_alt() and then uploading and verifying previously written files got the same failure-to-verify. Oh wait, did you upload a new random file instead of a previously uploaded file, when you did the experiment that showed next_power_of_k_alt had the same problem? Gah! If you already tried that, then perhaps there are two bugs here -- the NaN exception and a different bug that is corrupting files on write.

Anyway it wouldn't make sense for this NaN exception to result in an incorrect value of effective_segments, when we can clearly see that what results is an uncaught exception!

Here are a couple of ideas you could try:

  1. Run next_power_of_k() (the version that uses _math and then uses _alt if _math raised an exception) in a tight loop, possibly with multiple processes doing it, and leave those running and reporting if they got any exceptions. (See the sketch after this list.)

  2. Edit the code to use next_power_of_k_alt exclusively and mv your entire storage directory aside, or create an entirely separate storage server and introducer for testing, and upload and then verify a random file. (If you haven't already done this.)

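For idea 1, the stress loop could look something like this (a minimal sketch; it assumes being run with PYTHONPATH=src so that allmydata is importable):

import itertools
from allmydata.util import mathutil

# Hammer the suspect function and report the first failure, if any.
for i in itertools.count():
    try:
        mathutil.next_power_of_k(30, 2)
    except Exception, e:  # Python 2 syntax, matching this environment
        print "iteration %d failed: %r" % (i, e)
        break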

Two more ideas:
3. Run trial --until-failure allmydata.test and just leave it running.
4. memtest86 :-)

rycee commented 2011-06-23 19:36:08 +00:00
Author
Owner

I don't have physical access to the machine but I ran memtester for quite some time and it didn't show anything. Neither did some stress testing of math.log. But running the trial thingy was good: I noticed that allmydata.test.test_cli.Backup.test_backup failed with an exception on the x = int(math.log(n, k) + 0.5) line. And I found something that seems rather curious to me: putting print math.log(10) first in next_power_of_k and running test_backup gives

...
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 32, in next_power_of_k
    print math.log(10)
exceptions.ValueError: math domain error

Yeah, that's right. Running math.log(10) gives an exception, and since it's only run once I can't imagine it's from the CPU being overloaded or anything. Further:

$ python
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.log(10)
2.3025850929940459

If I change math.log(10) to math.log10(10) it still gives the error. But – and hang on to your hat – if I use math.sqrt(10) it works: not only does the sqrt work, but the math.log(n, k) a few lines down does as well! With next_power_of_k defined

def next_power_of_k(n, k):
    hmmmm = math.sqrt(10)
    if n == 0:
        x = 0
    else:
        x = int(math.log(n, k) + 0.5)
    if k**x < n:
        return k**(x+1)
    else:
        return k**x

the unit test passes!

Now, while this elegant fix removes the exception during backup, the uploaded files unfortunately still fail check --verify.

Anyway, obviously something strange happens with some floating point operations and like you said, some floating point mode is probably being altered when running tahoe but not otherwise.

I also tried adding a print of effective_segments right after its assignment. It turns out that this variable is set to 1 when I cp a 1 kB file and to 128 with a 10 MB file. I don't know if these values are OK, but regardless, it appears that next_power_of_k is OK to call sometimes but not otherwise.

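For what it's worth, those two values look plausible. Assuming the default maximum segment size of 128 KiB (an assumption about this configuration), the arithmetic works out:

>>> 1024 <= 131072               # a 1 kB file fits in a single segment
True
>>> 10 * 1024 * 1024 / 131072    # a 10 MB file needs 80 segments
80
>>> 2 ** 7                       # next power of 2 at or above 80
128

so effective_segments values of 1 and 128 are exactly what next_power_of_k should produce for those files.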

Could you please run this single unit test that fails -- allmydata.test.test_cli.Backup.test_backup -- with the math.log(10) in place, under valgrind?

http://valgrind.org

Valgrind is very good. You'll need to use the Python suppressions file with it:

http://svn.python.org/projects/python/trunk/Misc/valgrind-python.supp

Another idea suggested by Riastradh on IRC is to turn on floating point exceptions: http://docs.python.org/library/fpectl.html


Could you please attach the output of:

$ python misc/build_helpers/show-tool-versions.py 

and

$ ./bin/tahoe --version
rycee commented 2011-06-23 19:59:57 +00:00
Author
Owner

I'll attach the show-tool-versions output as a file (it's a bit messy).

$ ./bin/tahoe --version
allmydata-tahoe: 1.8.2,
foolscap: 0.6.1,
pycryptopp: 0.5.17,
zfec: 1.4.10,
Twisted: 10.2.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.6.6,
platform: Linux-debian_6.0.1-i686-32bit_ELF,
pyOpenSSL: 0.10,
simplejson: 2.1.6,
pycrypto: 2.1.0,
pyasn1: unknown,
mock: 0.6.0,
sqlite3: 2.4.1 [sqlite 3.7.3],
setuptools: 0.6c16dev3
rycee commented 2011-06-23 20:00:36 +00:00
Author
Owner

Attachment show-tool-versions-output.txt (11059 bytes) added

Output from show-tool-versions

rycee commented 2011-06-23 20:01:05 +00:00
Author
Owner

Attachment valgrind-output.txt (84317 bytes) added

Output from valgrind.

rycee commented 2011-06-23 20:12:03 +00:00
Author
Owner

About the fpectl library: I tried adding fpectl.turnon_sigfpe() to the beginning of the test_backup method, which gave me

$ bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup
allmydata.test.test_cli
  Backup
    test_backup ... Fatal Python error: Unprotected floating point exception

which seems … interesting.

I don't know how to associate this with a source location though, so I dunno if it corresponds to the math.log(10) or some other piece of code.


Sigh. The Valgrind Python suppressions file that you used is apparently not working. Perhaps it was written for a 64-bit architecture and has a bunch of "8" in it which need to be changed to "4" to make it suppress all of those false positives in attachment:valgrind-output.txt?

All of those "Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 4" emitting from PyObject_Free or PyObject_Realloc are false positives. You can generate new suppression entries to put into a suppressions file by executing

valgrind --gen-suppressions=yes bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup

You might also want to report a bug to the distributor of your valgrind-python.supp file. If you do, please let me know the URL to the ticket.
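
For reference, a hand-written suppression entry for the PyObject_Free false positives would look roughly like this (the entry name is arbitrary):

{
   pyobject-free-false-positive
   Memcheck:Cond
   fun:PyObject_Free
}

with a corresponding Memcheck:Value4 entry for the "Use of uninitialised value of size 4" reports.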

Sigh. The Valgrind Python suppressions file that you used is apparently not working. Perhaps it was written for 64-bit architecture and has a bunch of "8" in it which need to be changed to "4" to make it suppress all of those false positives in [attachment:valgrind-output.txt](/tahoe-lafs/trac/attachments/000078ac-e3ed-3963-c28f-bae4c5724209) ? All of those "Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 4" emitting from `PyObject_Free` or `PyObject_Realloc` are false positives. You can generate new suppression entries to put into a suppressions file by executing ``` valgrind --gen-suppressions=yes bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup ``` You might also want to report a bug to the distributor of your valgrind-python.supp file. If you do, please let me know the URL to the ticket.

Hm, judging from http://docs.python.org/library/fpectl.html , any Python code (in the CPython interpreter or in an extension module) which does floating point math is supposed to wrap the calls to floating point functions in macros named PyFPE_START_PROTECT and PyFPE_END_PROTECT in order to trap the exception and raise a nice Python exception. Apparently this hasn't happened!

Hopefully you can get a run under valgrind without all the noise from false positives. Maybe whatever problem ends in the floating point exception starts with a memory misuse that valgrind can report.

Jason Riedy on identi.ca wrote: "If the 30L somehow is pushed on a stack as an integer, the resulting corruption likely would produce a NaN." -- http://identi.ca/notice/77048991

rycee commented 2011-06-23 20:55:41 +00:00
Author
Owner

As suggested by Riastradh in #tahoe-lafs, I ran it under strace. Using strace -f python bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup I got a 3.3MB file which I can attach if requested. The last 20 lines are

[pid  3482] gettimeofday({1308861861, 391151}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391248}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391487}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391770}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391924}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 392020}, NULL) = 0
[pid  3482] --- SIGFPE (Floating point exception) @ 0 (0) ---
[pid  3482] rt_sigaction(SIGFPE, {0xb77028d0, [], 0}, {0xb77028d0, [], 0}, 8) = 0
[pid  3482] write(2, "Fatal Python error: Unprotected "..., 57) = 57
[pid  3482] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
[pid  3482] tgkill(3482, 3482, SIGABRT) = 0
[pid  3482] --- SIGABRT (Aborted) @ 0 (0) ---
Process 3481 resumed
Process 3482 detached
[pid  3487] +++ killed by SIGABRT +++
<... waitpid resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGABRT}], 0) = 3482
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigaction(SIGINT, {SIG_DFL, [], 0}, {0x8110720, [], 0}, 8) = 0
exit_group(-6)                          = ?
rycee commented 2011-06-23 21:29:16 +00:00
Author
Owner

After some help I also managed to catch the SIGFPE in gdb. Since I don't have debug symbols in my python (simply the version found in Debian Squeeze), I don't know how useful it is. Anyway, I pasted the stack trace to http://hpaste.org/48228

rycee commented 2011-06-24 06:48:28 +00:00
Author
Owner

Riastradh suggested that the cause could be a cosmic ray flipping a bit in memory, especially after seeing strange output from info float in gdb (http://hpaste.org/48231). I tried flushing the file cache, rebooting, reinstalling python 2.6, uninstalling python 2.5 (in case it was somehow interfering), rebuilding tahoe, and building tahoe 1.8.1. I still get an exception from x = int(math.log(n, k) + 0.5). Note, I also get an exception from my newly built 1.8.1. I have previously used 1.8.1 reliably as a helper, so I suppose the problem might be caused by some newer version of a library pulled in during python setup.py build?


Replying to [rycee](/tahoe-lafs/trac/issues/1418#issuecomment-384972):

> I have previously used 1.8.1 reliably as a helper, so I suppose the problem might be caused by some newer version of a library pulled in during `python setup.py build`?

Yes, Tahoe-LAFS itself is written in pure Python. I suspect this problem is due to memory corruption or some other incorrect C/C++ code, so it is probably due to one of the dependencies. I really want to track down which one it is, so please try to continue maintaining a "paper trail" of what you changed at each step of this. Thanks!

To see which dependencies have native code in them, look at this table: <http://tahoe-lafs.org/source/tahoe-lafs/deps/tahoe-lafs-dep-eggs/README.html> Every column there is a library that uses native (C/C++) code.

If I were you, my next step would be to clear the false positives out of valgrind (as I described in [comment:384968](/tahoe-lafs/trac/issues/1418#issuecomment-384968)). valgrind will either identify the problem, or if it gives a clean report then we can be confident that there is *no* branching on uninitialized memory, writing to memory out of bounds, or other of the memory errors that valgrind checks for.

If you're not interested in doing that then say so and let's think of some other approaches to track down this bug.


Oh, I have an idea. Turn on tracing -- <http://www.doughellmann.com/PyMOTW/trace/> -- then you'll know what line of source code triggered the FP exception. (Caveat: the Python process might buffer the output and then die from the FP exception before it flushes the line naming the Python code that ran immediately before the exception. So set the `PYTHONUNBUFFERED=1` env var or pass `-u` to `python`.)

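For example, something along these lines might work -- a minimal sketch, assuming the stdlib `trace` module can drive the `bin/tahoe` script directly (run it with `python -u` so the output is unbuffered):

```
import sys, trace

# trace=1 prints every line as it executes; count=0 disables coverage
# counting -- the last line printed before the SIGFPE names the culprit
tracer = trace.Trace(trace=1, count=0)
sys.argv = ['bin/tahoe', 'debug', 'trial',
            'allmydata.test.test_cli.Backup.test_backup']
tracer.run("execfile('bin/tahoe')")   # assumes bin/tahoe is a Python script
```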
rycee commented 2011-06-24 15:49:39 +00:00
Author
Owner

Turns out Debian stable has a python2.6-dbg package. I installed that and caught the SIGFPE when running the allmydata.test.test_upload.EncodingParameters.test_upper_limit_on_readonly_queries test (it also fails and runs quicker than the test_backup test). Anyway, I get

```
Program received signal SIGFPE, Arithmetic exception.
0x0807de7c in int_float (v=0xa65a168) at ../Objects/intobject.c:945
945	../Objects/intobject.c: No such file or directory.
	in ../Objects/intobject.c
(gdb) info float
  R7: Valid   0x400e9c42000000000000 +40002                     
  R6: Valid   0x400e9c42000000000000 +40002                     
  R5: Zero    0x00000000000000000000 +0                         
  R4: Valid   0x40029000000000000000 +9                         
  R3: Zero    0x00000000000000000000 +0                         
  R2: Zero    0x00000000000000000000 +0                         
  R1: Special 0xffff6edb1408f6ce3bbe Unsupported
=>R0: Special 0xffffe649a0c99e4d8714 QNaN

Status Word:         0xc2e1   IE             PE   ES   SF      C1    C3
                       TOP: 0
Control Word:        0x1372      DM       UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0x045a
Instruction Pointer: 0x73:0x0807de79
Operand Pointer:     0x7b:0x0a65a170
Opcode:              0xdb40
(gdb) where
#0  0x0807de7c in int_float (v=0xa65a168) at ../Objects/intobject.c:945
#1  0x08062c78 in PyNumber_Float (o=40000) at ../Objects/abstract.c:1749
#2  0x0807d277 in float_new (type=0x822cf20, args=(40000,), kwds=0x0)
    at ../Objects/floatobject.c:1657
#3  0x080accbd in type_call (type=0x822cf20, args=(40000,), kwds=0x0)
    at ../Objects/typeobject.c:726
#4  0x0806232a in PyObject_Call (func=<type at remote 0x822cf20>, arg=(40000,), 
    kw=0x0) at ../Objects/abstract.c:2492
#5  0x080e016b in do_call (f=
    Frame 0xac1249c, for file /home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py, line 1331, in _got (params=(3, 1, 10, 40002), k=3, happy=1, n=10, segsize=40002, f=<cStringIO.StringI at remote 0xa5c83f8>, enckey_hasher=<_SHA256d_Hasher(_digest=None, h=<_sha256.SHA256 at remote 0xb73bd700>, truncate_to=16) at remote 0x9c5f4cc>, BLOCKSIZE=65536, bytes_read=40000, data='datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata...(truncated), throwflag=0)
    at ../Python/ceval.c:3968
Quit
(gdb) print v
$1 = (PyIntObject *) 0xa65a168
(gdb) print *v
$2 = {ob_refcnt = 2, ob_type = 0x822da20, ob_ival = 40000}
```

The failing line in `intobject.c` is `return PyFloat_FromDouble((double)(v -> ob_ival));`, so presumably it is the cast to double that fails. But since `ob_ival = 40000` I can't see any plausible reason why. The failing line in Tahoe is `self._status.set_progress(0, float(bytes_read)/self._size)` in `immutable/upload.py`.

If I have time, then I'll try the valgrind thingy but at the moment I'm pretty much ready to chalk it all up to hexed hardware.

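For what it's worth, the gdb dump above is consistent with some native code having unmasked FPU exceptions behind Python's back: the control word `0x1372` has the invalid-operation mask bit (bit 0) clear, so operations that would normally just produce a NaN trap instead. A minimal sketch of that failure mode (hypothetical, Linux/glibc only; `feenableexcept` is a GNU extension and the `FE_INVALID` value is taken from x86 `bits/fenv.h`):

```
from ctypes import cdll
from ctypes.util import find_library

libm = cdll.LoadLibrary(find_library('m'))
FE_INVALID = 0x01   # x86 value from bits/fenv.h

# unmask the "invalid operation" exception, as a buggy library might
libm.feenableexcept(FE_INVALID)

# inf - inf raises "invalid operation": instead of returning NaN, the
# process now dies with SIGFPE, much like the helper did
print float('inf') - float('inf')
```

Once the mask bit is cleared, even an innocent-looking conversion like `float(bytes_read)` can trap if the x87 register stack is also in a bad state.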

It can't be hexed hardware, because it is reproducible (at least mostly?) and it doesn't happen with any other executables, nor with other Python experiments. Say, here's an idea. My best guess is that one of the native-code dependencies (hopefully not zfec or pycryptopp -- the ones I'm responsible for!) is doing something bad. You could test this by running some complicated floating point math, then importing each native code dependency (see comment:23) and then running the floating point math again.

Also, the tracing approach that I suggested in [comment:384974](/tahoe-lafs/trac/issues/1418#issuecomment-384974) is quite likely to give us useful information. If you're not comfortable with all the Python coding suggested on <http://www.doughellmann.com/PyMOTW/trace/> then I can whip up a script that'll do it for you. :-)

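A sketch of that import-by-import check (hypothetical; the module list is a guess at which dependencies carry native code -- substitute the columns from the dependency table linked above):

```
import math

def fp_sanity():
    # the two operations observed to fail in this ticket
    assert float(40000) == 40000.0
    assert int(math.log(40000, 3) + 0.5) == 10

fp_sanity()
for mod in ['zfec', 'pycryptopp', 'OpenSSL', 'Crypto']:  # illustrative list
    __import__(mod)
    fp_sanity()   # a failure here points at the module just imported
    print mod, 'ok'
```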

Chad Austin had a suggestion on twitter:

> Did you try asserting on the x87 and SSE control words periodically? I put that assertion after every test in our test runner.

<https://twitter.com/#!/chadaustin/status/84508454357696514>


I asked Chad Austin on twitter how to detect the x87 and SSE control words. He replied:

On Windows with Python 2.6, this seems to work:

```
from ctypes import *
x87_control_word = c_uint()
sse_control_word = c_uint()
print 'success?', cdll.msvcr90.__control87_2(0, 0, pointer(x87_control_word), pointer(sse_control_word))
print 'x87', hex(x87_control_word.value)
print 'sse', hex(sse_control_word.value)
```

The meanings of the control registers: <http://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz/>


Kludgy but more cross-platform way to get the FP environment (assuming `sizeof(fenv_t) <= 100`):

```
from ctypes import cdll, create_string_buffer
from ctypes.util import find_library
libm = cdll.LoadLibrary(find_library('m'))
b = create_string_buffer(100)
assert libm.fegetenv(b) == 0
b.raw
```

which on my system gives

```
'\x7f\x03\xff\xff\x00\x00\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xa0\x1f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
```

To decode the structure, see <http://linux.die.net/include/bits/fenv.h>. Note that we probably only need to detect when it changes during a unit test. For that, I would try hacking the `run` method of `TestCase` in `twisted.trial.unittest` so that it calls `libm.fegetenv(b)` before and after running the test code (the `result.startTest(self)` line) and prints the values of `b.raw` if they are different.

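Here is a rough sketch of that `TestCase.run` hack, written as a monkey-patch instead of editing Twisted in place (untested, and it assumes the 100-byte bound on `sizeof(fenv_t)` from above):

```
from ctypes import cdll, create_string_buffer
from ctypes.util import find_library
from twisted.trial import unittest

libm = cdll.LoadLibrary(find_library('m'))

def fenv():
    b = create_string_buffer(100)   # assumes sizeof(fenv_t) <= 100
    assert libm.fegetenv(b) == 0
    return b.raw

_original_run = unittest.TestCase.run

def run_checking_fenv(self, result):
    before = fenv()
    rv = _original_run(self, result)
    after = fenv()
    if before != after:
        # name the test during which the FP environment changed
        print 'FP environment changed during', self.id()
        print repr(before)
        print repr(after)
    return rv

unittest.TestCase.run = run_checking_fenv
```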

There was a bug in the Linux kernel that would cause floating point state to be wrong "on 32-bit linux on modern CPUs":

<http://article.gmane.org/gmane.linux.kernel/1258780>

[comment:384966](/tahoe-lafs/trac/issues/1418#issuecomment-384966) says that rycee's machine was in 32-bit mode. rycee's [show-tool-versions output](/attachment/ticket/1418/show-tool-versions-output.txt) says that the kernel was 2.6.32-5 (from Debian). I wasn't able to tell from the kernel patch comments when this bug was introduced into the kernel.

Anyway, since rycee appears to have lost interest in debugging this, I suggest we close this ticket as unreproducible until rycee or someone else who can reproduce it reappears.

rycee commented 2012-02-28 17:23:20 +00:00
Author
Owner

Yeah, sorry about not being active on this bug. I've been happy with using my tahoe grid without helper so I haven't felt very motivated to dig around.

zooko: The Linux bug you linked seems unrelated, since the CPU in this computer is an old Pentium III-M, 1.2 GHz, and the bug was for newer, presumably 64-bit, CPUs.

But I have interesting news: I just tried downloading and building Tahoe 1.9.1. When I ran `allmydata.test.test_cli.Backup.test_backup` it was successful! I thought it might be related to some glibc update or something, so I also downloaded both 1.8.2 and 1.8.3, built them, and ran the same test. They both failed in the same way as described in my previous comments. So I assume that some change between 1.8.3 and 1.9.0 must have fixed my problem.

Just to double check that 1.9.1 actually fixes my problem, I enabled the helper functionality and tried to upload through it; indeed, it was successful.


rycee: oh boy! You've managed to reproducibly isolate a way to make the bug happen or not happen! Now here's a question: can you run `git bisect` and get it to automatically run the unit tests in between each version change so that it finds exactly what patch makes the difference? Or do the equivalent manually -- try different versions and narrow down exactly what causes the problem?

Honestly, I'm still skeptical that anything we do in our 100% Python code could cause this bug, but perhaps something about the sequence of instructions that we're sending through the CPU accidentally avoids or accidentally triggers it...

Also, I am still willing to help you use some diagnostic tools like valgrind or the fp env from [comment:384979](/tahoe-lafs/trac/issues/1418#issuecomment-384979)...

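A sketch of automating that bisection with `git bisect run` (a hypothetical helper script; because we are hunting the commit that *fixed* the failure, the usual good/bad labels get inverted: mark an old failing release as "good", a new passing one as "bad", and exit non-zero when the test passes):

```
#!/usr/bin/env python
# invoke as: git bisect run python bisect_helper.py
import subprocess, sys

if subprocess.call(['python', 'setup.py', 'build']) != 0:
    sys.exit(125)   # 125 asks git bisect to skip an unbuildable revision

rc = subprocess.call(['bin/tahoe', 'debug', 'trial',
                      'allmydata.test.test_cli.Backup.test_backup'])
sys.exit(1 if rc == 0 else 0)   # inverted: a passing test counts as "bad"
```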

Also you might want to try applying that one patchset from linux to change the way it manages floating point state during context switches and see if that makes the problem go away. I don't know for sure that the linux bug doesn't apply on an old Pentium III -- it was just something from, IIRC, gregkh's announcement that mentioned "newer CPUs".

rycee commented 2012-02-28 18:49:58 +00:00
Author
Owner

Ok, gave bisect a try and it told me that the issue was fixed in commit [095efbf7a5f8ee375a81136f87ba3943ec995fc6](https://github.com/warner/tahoe-lafs/commit/095efbf7a5f8ee375a81136f87ba3943ec995fc6). This change makes Tahoe depend on pycryptopp version 0.5.20 or later on my machine; before this change it was satisfied with version 0.5.14 or later.

On my Debian Squeeze installation, the python-pycryptopp package is of version 0.5.17-1+b1, satisfying the dependency before davidsarah's change but not after. Having found this out, I of course tried uninstalling the python-pycryptopp Debian package and rebuilding a pristine Tahoe 1.8.2 source tree. The failing test now happily passes!

About the kernel patch: I'm reluctant to try it, since I have no physical access to the machine and rely on it staying up.


Aha. It seems quite likely that one of the bugs fixed between pycryptopp ~~0.5.14~~ 0.5.17-1+b1 and 0.5.20 could have been responsible. I doubt it has anything to do with the kernel bug.

rycee commented 2012-02-28 19:06:56 +00:00
Author
Owner

I guess we can raise the lower bound to exclude pycryptopp version 0.5.17-1+b1, since that's what my Debian box was using when the test was failing. I don't know whether it is worth trying to determine the problem in pycryptopp…

Anyway, thank you David-Sarah for making the commit 095efbf7a5f8ee375a81136f87ba3943ec995fc6 that got us closer to the underlying cause :-)


Note that it's possible that the Debian cryptopp was broken or miscompiled or included a bad Debian-specific patch, rather than this being a fix between upstream pycryptopp 0.5.17 and 0.5.20.

(Can you tell I don't really trust the Debian packaging policies or practices? We'd have a lot fewer problems if they would just follow upstream rather than picking out individual patches without really understanding them.)


Oh, also note that the Debian pycryptopp links against a non-embedded Crypto++, so we actually have no idea what version of Crypto++ that is (unless rycee can tell what it was).


I considered closing this ticket as fixed, since it seems relatively certain from the bisection that this is a fixed pycryptopp or Crypto++ issue. However, since we don't know whether Debian added a bad patch to their pycryptopp or Crypto++, I'll leave it open until we can determine that.


Yes, it is also possible that there was a bug in upstream pycryptopp itself. rycee: can you experiment with swapping in the old Debian version of pycryptopp versus the old upstream version of pycryptopp, maybe?


Replying to [zooko](/tahoe-lafs/trac/issues/1418#issuecomment-384990):

> Yes, it is also possible that there was a bug in upstream pycryptopp itself.

One reason I thought that less likely is that I reviewed [all the patches between 0.5.17 and 0.5.20](https://tahoe-lafs.org/trac/pycryptopp/log/trunk/?action=stop_on_copy&mode=stop_on_copy&rev=726&stop_rev=674&limit=100&verbose=on) and didn't see any fixes that could have caused this bug.

> rycee: can you experiment with swapping in the old Debian version of pycryptopp versus the old upstream version of pycryptopp, maybe?

Yes, that's a good test.

rycee commented 2012-02-29 07:19:39 +00:00
Author
Owner

I added 0.5.17 as an upper bound in `_auto_deps.py` and running `python setup.py build` now prints

```
…
Searching for pycryptopp>=0.5.14,<=0.5.17
Reading http://tahoe-lafs.org/source/tahoe-lafs/deps/tahoe-dep-sdists/
Reading http://tahoe-lafs.org/source/tahoe-lafs/deps/tahoe-lafs-dep-eggs/
Reading http://pypi.python.org/simple/pycryptopp/
Reading http://allmydata.org/source/pycryptopp
Reading http://allmydata.org/trac/pycryptopp
Reading http://tahoe-lafs.org/trac/pycryptopp
Reading http://allmydata.org/source/zfec
Best match: pycryptopp 0.5.17
Downloading http://pypi.python.org/packages/source/p/pycryptopp/pycryptopp-0.5.17.tar.gz#md5=b3d19e7203531f8bd241ae58062f99e4
Processing pycryptopp-0.5.17.tar.gz
Running pycryptopp-0.5.17/setup.py -q bdist_egg --dist-dir /tmp/easy_install-SlyAet/pycryptopp-0.5.17/egg-dist-tmp-2wdZ7B
…
```

and it goes on to spend quite some time building the Crypto++ library. After finishing, this build successfully runs the previously failing test even though it's Tahoe version 1.8.2. If I then replace the `_pycryptopp.so` file from inside this Tahoe tree with the corresponding file from the Debian package, the error comes back. The Debian version of pycryptopp is dynamically linked against [crypto++ version 5.6.0-6](http://packages.debian.org/squeeze/libcrypto++8).

So, finally, leaving Debian's `_pycryptopp.so` in place, I tried downloading and building a dynamic library from upstream crypto++ 5.6.0 and 5.6.1 and substituting Debian's copy. Curiously, the test failed with 5.6.0 and worked with 5.6.1…


Okay, this is very good bug isolation. I didn't quite follow the last bit though -- the bug is in *upstream* Crypto++ 5.6.0 and not *upstream* Crypto++ 5.6.1? Or is it in Debian Crypto++ 5.6.0 and not Debian Crypto++ 5.6.1?

By the way, we have a "quick startup self-test" built into pycryptopp, which is an attempt to catch the most egregious of these sorts of failures at run-time. I wonder if it would be possible to make that quick startup self-test more rigorous so that it would catch this one. I guess I'll be better able to tell once we really isolate this bug.

rycee commented 2012-03-01 06:16:03 +00:00
Author
Owner

Sorry about the vagueness. I used upstream `cryptopp560.zip` and `cryptopp561.zip` downloaded from <http://www.cryptopp.com/>. The 5.6.0 makefile didn't have a target for a shared library, so I copied that build target from 5.6.1; this is the only change I made to them. Building the `cryptest.exe` program and running it with `./cryptest.exe v` didn't show any problems. Similarly, there were no errors reported for `./cryptest.exe tv TestVectors/all.txt`.


Hm, well okay, so we're pretty sure there is a bug in Crypto++ 5.6.0 that is fixed in Crypto++ 5.6.1, right? Maybe we could let it go at that, since it is apparently fixed. The steps to reproduce are:

  1. Get and build Crypto++ 5.6.0
  2. Build pycryptopp with the `--disable-embedded-cryptopp` option so that it will link against the .so generated in step 1 instead of using its own bundled copy of Crypto++ source code
  3. Run `python setup.py test -s allmydata.test.test_cli.Backup.test_backup`

On your machine this 100% reproducibly triggers the problem, and swapping in Crypto++ 5.6.1 for 5.6.0 makes it stop happening?

Okay, one more thing: how do I know I don't have the same bug in the copy of Crypto++ that is bundled inside pycryptopp? I don't think we've upgraded the embedded copy of Crypto++ to a newer version of Crypto++ since this patch [91f454b47f66636fffde68e8a8bf4364bed7516e]:

```
commit 91f454b47f66636fffde68e8a8bf4364bed7516e
Author: weidai <weidai>
Date:   Sat Feb 2 00:40:27 2008 -0800
```

That was four years ago, before Crypto++ 5.6.0 [was released](http://cryptopp.com/#news). Since then we've just been cherry-picking patches from upstream Crypto++. So if this bug was present in the version that we copied, and if we didn't realize that we needed to cherry-pick the patch that fixes it, then it could be present in the embedded copy of Crypto++ inside pycryptopp.

Could you please redo step 2 above, this time omitting the `--disable-embedded-cryptopp` flag to force pycryptopp to use its embedded version of Crypto++? I think you've probably done that already in this process, but if you could do it again now just to be sure, that would make me feel better.

rycee commented 2012-03-01 20:55:57 +00:00
Author
Owner

Ok, I downloaded pycryptopp version 0.5.29 and did two builds with it, once with `--disable-embedded-cryptopp` and once without. I then ran `python setup.py test` on both. For the build that used the embedded version of Crypto++ this finished without any errors. For the one using the dynamically linked Crypto++ (v 5.6.0) it ended in failure, and I'll make an attachment containing the output of the failed run.

Note, when I used `--disable-embedded-cryptopp` it was built against the Debian version of Crypto++ 5.6.0, since I don't know how to convince `setup.py` to look for headers and libraries in a non-standard directory. It still fails in the same way, though, if I use `LD_LIBRARY_PATH` to force it to load the upstream Crypto++ 5.6.0. Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.

I'm not entirely sure how to use a custom pycryptopp in a Tahoe build, so I didn't try to use my two pycryptopp builds in Tahoe, but I suppose the pycryptopp test is telling enough. Also, when I run the Tahoe build now (once I uninstalled the Debian pycryptopp) it downloads and builds pycryptopp using the embedded Crypto++; and as previously seen those builds are doing just fine.

Anyway, I think it's safe to say that the Crypto++ embedded in pycryptopp doesn't exhibit the problem I encountered.

rycee commented 2012-03-01 20:57:17 +00:00
Author
Owner

**Attachment** pycryptopp-dynlink-test-failure.txt (58050 bytes) added

Output from pycryptopp test when built with `--disable-embedded-cryptopp`


Replying to [rycee](/tahoe-lafs/trac/issues/1418#issuecomment-384996):

> Note, when I used `--disable-embedded-cryptopp` it was built against the Debian version of Crypto++ 5.6.0, since I don't know how to convince `setup.py` to look for headers and libraries in a non-standard directory. It still fails in the same way, though, if I use `LD_LIBRARY_PATH` to force it to load the upstream Crypto++ 5.6.0.

This is almost sufficient to convince me that this is a fixed bug in Crypto++ (despite it not being clear which change fixed it). However, we haven't excluded the possibility that the problem interacts with the dynamic linking, i.e. that it only fails in non-embedded mode.

> Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.

It sounds like we should open a new (pycryptopp) ticket for that, since at some point Debian or Ubuntu will upgrade to 5.6.1 and then it may fail with the same error.


I agree to close this ticket as "invalid".

About the undefined symbol error, it sounds like rycee compiled it against the Crypto++ 5.6.0 headers provided by Debian, and then used `LD_LIBRARY_PATH` to force it to load the Crypto++ 5.6.1 dynamic link library. That's not a safe thing to do, and nobody, or at least not us, can do anything about that giving undefined symbol errors. Am I right, or am I confused?

rycee commented 2012-03-02 10:13:45 +00:00
Author
Owner

Replying to davidsarah (comment:49):

> Replying to rycee:
> [...]
> This is almost sufficient to convince me that this is a fixed bug in Crypto++ (despite it not being clear which change fixed it). However, we haven't excluded the possibility that the problem interacts with the dynamic linking, i.e. that it only fails in non-embedded mode.

Ok, so I tried a new way. I unpacked pycryptopp twice. In one of the trees I replaced the `embeddedcryptopp` directory with a pristine upstream Crypto++ 5.6.0, and in the other I replaced `embeddedcryptopp` with a pristine upstream Crypto++ 5.6.1. I then did a `python setup.py build` followed by `python setup.py test` in each of the two trees. The pycryptopp statically built with 5.6.0 failed the test (in the same way as the previously attached test log) while the build using Crypto++ 5.6.1 ran all tests successfully.

I'm a bit curious, however, as to why the `cryptest.exe` program from Crypto++ 5.6.0 didn't fail; maybe the failing test from pycryptopp could be ported to a test in `cryptest.exe`?

> > Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.
>
> It sounds like we should open a new (pycryptopp) ticket for that, since at some point Debian or Ubuntu will upgrade to 5.6.1 and then it may fail with the same error.

It's like [zooko said](/tahoe-lafs/trac/issues/1418#issuecomment-384998). Since I don't know how to make `setup.py` use a custom header directory, it built against the standard Debian Crypto++ 5.6.0 headers.

However, curiously enough, when I used Debian's pycryptopp -- which is dynamically linked against Debian's Crypto++ -- I *was* able to swap out the `.so` file from 5.6.0 to 5.6.1 without problem. Doing this is how I found out that 5.6.1 worked. I suppose it has something to do with the patches and how the Debian pycryptopp and Crypto++ packages are built.


Replying to rycee (comment:51):

> Replying to davidsarah (comment:49):
> [...]
> I'm a bit curious, however, as to why the `cryptest.exe` program from Crypto++ 5.6.0 didn't fail; maybe the failing test from pycryptopp could be ported to a test in `cryptest.exe`?

The bug seems to affect the floating point state, and it's quite plausible that `cryptest.exe` doesn't test for that. A regression test would indeed be nice -- if you're sufficiently motivated then I suggest asking on the Crypto++ mailing list.

> > > Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.
> >
> > It sounds like we should open a new (pycryptopp) ticket for that, since at some point Debian or Ubuntu will upgrade to 5.6.1 and then it may fail with the same error.
>
> It's like zooko said. Since I don't know how to make `setup.py` use a custom header directory, it built against the standard Debian Crypto++ 5.6.0 headers.

Yes, as Zooko said, this is an invalid test.

> However, curiously enough, when I used Debian's pycryptopp -- which is dynamically linked against Debian's Crypto++ -- I *was* able to swap out the `.so` file from 5.6.0 to 5.6.1 without problem. Doing this is how I found out that 5.6.1 worked. I suppose it has something to do with the patches and how the Debian pycryptopp and Crypto++ packages are built.

Ah, so if that worked then it is not the dynamic linking. Okay, let's stop worrying about this one.


Okay, so we think this is a bug that was introduced into Crypto++ after we stopped updating pycryptopp's embedded copy of Crypto++ to track upstream Crypto++, and was then released in Crypto++ v5.6.0, and then was fixed in Crypto++ v5.6.1. It would be nice if we knew more precisely what bug was introduced and what patch fixed it, or if we at least had a reliable regression test that would tell us if we later introduce this bug into pycryptopp, but I'm not sure how to go about that and it might not be worth the effort.

So, shall we close this ticket now?

rycee commented 2012-03-06 09:03:11 +00:00
Author
Owner

This bug should certainly be closed as invalid or something similar. I may explore Crypto++ a bit further as time and interest permits but will raise such a discussion in a more suitable forum.

Thanks for your help in tracking this down. It's been a tricky issue and I'm happy it resolved itself so pleasantly.


Okay, thanks a lot for your help!

zooko added the
r/invalid
label 2012-03-06 11:55:58 +00:00
zooko closed this issue 2012-03-06 11:55:58 +00:00
r/wontfix
r/worksforme
t/defect
t/enhancement
t/task
v/0.2.0
v/0.3.0
v/0.4.0
v/0.5.0
v/0.5.1
v/0.6.0
v/0.6.1
v/0.7.0
v/0.8.0
v/0.9.0
v/1.0.0
v/1.1.0
v/1.10.0
v/1.10.1
v/1.10.2
v/1.10a2
v/1.11.0
v/1.12.0
v/1.12.1
v/1.13.0
v/1.14.0
v/1.15.0
v/1.15.1
v/1.2.0
v/1.3.0
v/1.4.1
v/1.5.0
v/1.6.0
v/1.6.1
v/1.7.0
v/1.7.1
v/1.7β
v/1.8.0
v/1.8.1
v/1.8.2
v/1.8.3
v/1.8β
v/1.9.0
v/1.9.0-s3branch
v/1.9.0a1
v/1.9.0a2
v/1.9.0b1
v/1.9.1
v/1.9.2
v/1.9.2a1
v/cloud-branch
v/unknown
Reference: tahoe-lafs/trac#1418