UploadUnhappinessError with available storage nodes > shares.happy #1791

Open
opened 2012-07-06 23:32:42 +00:00 by gyver · 15 comments
gyver commented 2012-07-06 23:32:42 +00:00
Owner

The error happened with 1.9.1 too. I just upgraded to 1.9.2 and fixed some files/dirs that 1.9.1 couldn't repair reliably, hoping the following problem would go away too (it didn't).

There are some peculiarities in my setup: I use USB disks connected to a single server, so all storage nodes run on the same server, although each is physically on a disk that can easily be taken elsewhere to increase the durability of the whole storage. At the time of failure there were 7 such storage nodes in my setup and my whole store was fully repaired on these 7 nodes; all the content is/was uploaded with
shares.needed = 4
shares.happy = 6
shares.total = 6
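
With these parameters an upload is only "happy" when the shares end up spread over at least shares.happy distinct servers. A minimal sketch of that condition (not the actual Tahoe-LAFS code; the real metric is a maximum bipartite matching, which reduces to counting distinct servers when each share sits on exactly one server):

```python
# Sketch of the "servers of happiness" condition, assuming each share
# is placed on exactly one server (function name is illustrative, not
# the Tahoe-LAFS API).

def is_happy(share_to_server, happy):
    """share_to_server: dict mapping share number -> server id."""
    distinct_servers = set(share_to_server.values())
    return len(distinct_servers) >= happy

# 6 shares on 6 distinct servers satisfies happy=6; 6 shares where two
# of them land on the same server yields only 5 distinct servers and fails.
placement = {0: "zp6jpfeu", 1: "pa2myijh", 2: "pa2myijh",
             3: "omkzwfx5", 4: "wo6akhxt", 5: "ughwvrtu"}
print(is_happy(placement, happy=6))  # False: only 5 distinct servers
```

So having 7 connected servers is necessary but not sufficient; the placement itself has to spread the shares out.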

Although 7 >= 6, I get this error when trying to tahoe cp a new file:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/foolscap/call.py", line 677, in _done
    self.request.complete(res)
  File "/usr/lib64/python2.7/site-packages/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback
    self._startRunCallbacks(result)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 553, in _got_response
    return self._loop()
  File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 404, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 566, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

I recently found out about flogtool, so I ran it on the client node (which is one of the 7 storage nodes, btw). I only pasted the last part, from CHKUploader (I can attach the whole log if need be):

01:04:01.314 L20 []#2339 CHKUploader starting
01:04:01.314 L20 []#2340 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2c9b5a8>
01:04:01.314 L20 []#2341 creating Encoder <Encoder for unknown storage index>
01:04:01.314 L20 []#2342 file size: 4669394
01:04:01.314 L10 []#2343 my encoding parameters: (4, 6, 6, 131072)
01:04:01.314 L20 []#2344 got encoding parameters: 4/6/6 131072
01:04:01.314 L20 []#2345 now setting up codec
01:04:01.348 L20 []#2346 using storage index k5ga2
01:04:01.348 L20 []#2347 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting
01:04:01.363 L10 []#2348 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=()
01:04:01.372 L10 []#2349 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,)
01:04:01.375 L20 []#2350 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe
01:04:01.377 L10 []#2351 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=()
01:04:01.381 L10 []#2352 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=()
01:04:01.404 L10 []#2353 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,)
01:04:01.405 L25 []#2354 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
01:04:01.407 L20 []#2355 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
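
The merged= map at the end of the selector log is the key detail: all 6 shares were placed, but sh1 and sh2 both sit on pa2myijh. A quick sketch of reading that map off the log line (the parsing is my own, not a Tahoe-LAFS tool):

```python
# Parse the "merged=..." share map from the flogtool line above and
# count how many distinct servers actually hold shares.
merged = ("sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, "
          "sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu")

placements = dict(entry.split(": ") for entry in merged.split(", "))
servers = set(placements.values())
print(len(placements), len(servers))  # 6 shares, 5 distinct servers
```

Six shares over five distinct servers is exactly the "placed all 6 shares" / "only 5 server(s)" combination the error message reports.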
tahoe-lafs added the
c/unknown
p/major
t/defect
v/1.9.2
labels 2012-07-06 23:32:42 +00:00
tahoe-lafs added this to the undecided milestone 2012-07-06 23:32:42 +00:00
daira was assigned by tahoe-lafs 2012-07-06 23:32:42 +00:00

Thanks for the log, that's very useful.

The uploader only tried to contact 5 servers, which is the problem. Are you absolutely sure that more than the 5 servers mentioned (i.e. zp6jpfeu, pa2myijh, omkzwfx5, wo6akhxt, ughwvrtu) are connected?

Maybe we should include the set of connected servers in the log or the error message.

gyver commented 2012-07-07 13:38:21 +00:00
Author
Owner

Yes, I'm quite sure all 7 were active. I start all nodes with a single script that in turn:

  • starts the introducer,
  • starts each storage node.

If one of the expected nodes can't be started, I see it right away in the script output when starting the grid. At the time of failure, I even checked the web interface of both the node I use as a client and the introducer, and they both listed all 7 storage nodes. I also checked that there was plenty of free space on each storage node and that no configured reserved space could explain a node refusing to store data.

I just rechecked and noticed something. The server has 3 IP addresses: the loopback, a private IP on a local network and a private IP on a VPN (managed by OpenVPN). Apparently each node advertises its services on all 3 IPs (I assume it's by design).
But the listing of storage nodes given by my "client" node isn't exactly consistent with the one given by the introducer.

Here are the current outputs (there shouldn't be any security problem publishing this so I didn't obfuscate anything):

Introducer's Service Announcements:

Nickname PeerID Advertised IPs Announced Version Service name
store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv 192.168.0.1 10.8.0.10 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_2 d3fycappvcjuma7vwc45spfolaei6n24 192.168.0.1 10.8.0.10 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv 192.168.0.1 10.8.0.10 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_4 pa2myijhnj3ld4owmqxsihow4hv2botv 192.168.0.1 10.8.0.10 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_5 wo6akhxtren7esuhbtjscmpzvjwuvken 192.168.0.1 10.8.0.10 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr 192.168.0.1 10.8.0.10 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr 192.168.0.1 10.8.0.10 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2 storage

Introducer's Subscribed Clients:

Nickname PeerID Advertised IPs Connected From Since Version Subscribed To
store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv 192.168.0.1 10.8.0.10 192.168.0.1:54872 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_5 wo6akhxtren7esuhbtjscmpzvjwuvken 192.168.0.1 10.8.0.10 192.168.0.1:54949 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_2 d3fycappvcjuma7vwc45spfolaei6n24 192.168.0.1 10.8.0.10 192.168.0.1:54884 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv 192.168.0.1 10.8.0.10 127.0.0.1:55124 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_4 pa2myijhnj3ld4owmqxsihow4hv2botv 192.168.0.1 10.8.0.10 127.0.0.1:55143 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr 192.168.0.1 10.8.0.10 127.0.0.1:55207 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr 192.168.0.1 10.8.0.10 127.0.0.1:55252 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2 storage

"The storage node I use as a client"'s status:

Connected to 7 of 7 known storage servers:

Service Name Nickname PeerID Connected? Since First Announced Version
storage store_2 d3fycappvcjuma7vwc45spfolaei6n24 Yes: to 127.0.0.1:48397 00:45:25 07-Jul-2012 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv Yes: to 192.168.0.1:38708 00:45:26 07-Jul-2012 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv Yes: to (loopback) 00:45:24 07-Jul-2012 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_4 pa2myijhnj3ld4owmqxsihow4hv2botv Yes: to 10.8.0.10:40626 00:45:27 07-Jul-2012 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr Yes: to 10.8.0.10:49604 00:45:31 07-Jul-2012 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_5 wo6akhxtren7esuhbtjscmpzvjwuvken Yes: to 127.0.0.1:50958 00:45:29 07-Jul-2012 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr Yes: to 10.8.0.10:52661 00:45:30 07-Jul-2012 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2

I'm not sure how the service announcement and IP selection works, but there seems to be at least some amount of chance involved in the IP selection. All nodes should behave the same way, so AFAIK the same IP should be selected.


Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.

I don't know why the determination of which IP to use is nondeterministic. warner?


BTW, I have seen nondeterministic choice of IP on the Least Authority Enterprises servers (which are EC2 instances running Ubuntu) as well.

gyver commented 2012-07-07 15:11:42 +00:00
Author
Owner

Replying to davidsarah:

Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.

Note that the assigned IPs are not always the same: a restart of all storage nodes reshuffles them. My last failed attempt was with d3fycapp and lehmccp7 both seen on 10.8.0.10 (the VPN IP) by the client node (5 nodes were seen on 10.8.0.10, with ughwvrtu on 192.168.0.1 and the client (omkzwfx5) on loopback). The IP addresses used don't seem to change the error: I've always seen the same one (only 5 servers queried) since the problem appeared.

The problem may not have anything to do with IP address choices but it seems to me these inconsistencies are odd enough to keep in mind.


Please add the following just after line 225 (i.e. after `readonly_servers = ...` and before `# decide upon the renewal/cancel secrets...`) of src/allmydata/immutable/upload.py in 1.9.2 (source:1.9.2/src/allmydata/immutable/upload.py):

        self.log(format="servers before upload: %(connected_servers)r permuted to %(all_servers)r, of which"
                        "%(writeable_servers)r are writeable and %(readonly_servers)r are read-only",
                 connected_servers=storage_broker.get_connected_servers(),
                 all_servers=all_servers,
                 writeable_servers=writeable_servers,
                 readonly_servers=readonly_servers,
                 level=log.OPERATIONAL)

and then show the log for a failing upload.

(You need to restart the gateway after changing the code, but it's not necessary to rebuild it.)

daira added
c/code-peerselection
and removed
c/unknown
labels 2012-07-07 19:40:15 +00:00
daira modified the milestone from undecided to 1.10.0 2012-07-07 19:40:15 +00:00
gyver commented 2012-07-07 21:30:23 +00:00
Author
Owner

Replying to davidsarah:

Please add the following just after line 225 (i.e. after `readonly_servers = ...` and before `# decide upon the renewal/cancel secrets...`) of src/allmydata/immutable/upload.py in 1.9.2 (source:1.9.2/src/allmydata/immutable/upload.py):

I may not have done it right: I got the same output, with this at the end:

23:09:02.238 L23 []#2436 an outbound callRemote (that we [omkz] sent to someone else [zqxq]) failed on the far end
23:09:02.238 L10 []#2437  reqID=873, rref=<RemoteReference at 0x2e780d0>, methname=RILogObserver.foolscap.lothar.com.msg
23:09:02.238 L10 []#2438  the REMOTE failure was:
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib64/python2.7/site-packages/foolscap/slicers/root.py", line 107, in send
     d.callback(None)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback
     self._startRunCallbacks(result)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
 --- <exception caught here> ---
   File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 215, in produce
     slicer = self.newSlicerFor(obj)
   File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 314, in newSlicerFor
     return topSlicer.slicerForObject(obj)
   File "/usr/lib64/python2.7/site-packages/foolscap/slicer.py", line 48,

BUT... I may have a lead looking at the last error message in my original log dump.

server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) [...], merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu

I assume the sh<n> are the shares to be placed. sh1 and sh2 were assigned to pa2myijh. I'm not sure whether this distribution is the result of share detection (my guess) or of a share placement algorithm that could produce an invalid placement and needs a check before upload (late error detection isn't good practice, so I bet it's not the case).

What if these shares were already stored on pa2myijh before the upload attempt (due to past uploads with a buggy version, or whatever happened in the store directory outside Tahoe's control)? Is the code able to detect such a case and re-upload one of the two shares to a free server (one without any of the 6 shares)? If not, it might be the cause of my problem (the file was part of a long list of files I tried to upload with only partial success weeks ago...) and my storage nodes are most probably polluted by "dangling" shares.
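
When pre-existing shares enter the picture, a share can be held by more than one server, and the happiness metric becomes the size of a maximum matching between shares and the servers holding them. A minimal sketch of that computation (my own illustration via augmenting paths, not the actual Tahoe-LAFS implementation):

```python
# Sketch of the servers-of-happiness metric when a share may be found
# on several servers: happiness = size of a maximum bipartite matching
# between shares and the servers that hold them.

def happiness(share_to_servers):
    """share_to_servers: dict mapping share number -> set of server ids."""
    match = {}  # server -> share currently matched to it

    def try_assign(share, seen):
        # Standard augmenting-path search: give `share` a server, moving
        # previously matched shares to alternatives when possible.
        for server in share_to_servers.get(share, ()):
            if server in seen:
                continue
            seen.add(server)
            if server not in match or try_assign(match[server], seen):
                match[server] = share
                return True
        return False

    return sum(try_assign(sh, set()) for sh in share_to_servers)

# sh1 and sh2 are both only on pa2myijh: one of them cannot be matched
# to its own server, so happiness is 5 even though all 6 shares exist.
existing = {0: {"zp6jpfeu"}, 1: {"pa2myijh"}, 2: {"pa2myijh"},
            3: {"omkzwfx5"}, 4: {"wo6akhxt"}, 5: {"ughwvrtu"}}
print(happiness(existing))  # 5
```

Under this metric, a duplicate share left over from an earlier upload only hurts happiness if the uploader fails to place a fresh copy of one of the doubled shares on an otherwise-unused server.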


There was a bug in the statement I asked you to add; please replace it entirely with this one:

        self.log(format="servers before upload: %(connected_servers)s permuted to %(all_servers)s, of which"
                        "%(writeable_servers)s are writeable and %(readonly_servers)s are read-only",
                 connected_servers=repr(storage_broker.get_connected_servers()),
                 all_servers=repr(all_servers),
                 writeable_servers=repr(writeable_servers),
                 readonly_servers=repr(readonly_servers),
                 level=log.OPERATIONAL)

In answer to your last question, the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.

[more precisely it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the UploadUnhappinessError message suggests that it isn't.]

Edit:


I'm hoping that this bug is the same one that has been occasionally reported on VolunteerGrid2 with uploads where shares.happy is close to shares.total (and to the number of servers). It has very similar symptoms, but gyver seems to be able to reproduce it more easily.

gyver commented 2012-07-08 01:12:29 +00:00
Author
Owner

Replying to davidsarah:

> [...] but gyver seems to be able to reproduce it more easily.
You bet: I take more time firing ssh and joining the screen session than the tahoe cp takes to fail :-)

Here's the log with a bit of context:

```
03:09:17.432 L20 []#2487 CHKUploader starting
03:09:17.432 L20 []#2488 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2fe8998>
03:09:17.432 L20 []#2489 creating Encoder <Encoder for unknown storage index>
03:09:17.432 L20 []#2490 file size: 4669394
03:09:17.432 L10 []#2491 my encoding parameters: (4, 6, 6, 131072)
03:09:17.432 L20 []#2492 got encoding parameters: 4/6/6 131072
03:09:17.432 L20 []#2493 now setting up codec
03:09:17.468 L20 []#2494 using storage index k5ga2
03:09:17.468 L20 []#2495 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting
03:09:17.469 L20 []#2496 servers before upload: frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>]) permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>], of which[<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>] are writeable and set([]) are read-only
03:09:17.485 L10 []#2497 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=()
03:09:17.497 L10 []#2498 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,)
03:09:17.500 L20 []#2499 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe
03:09:17.515 L10 []#2500 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=()
03:09:17.523 L10 []#2501 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=()
03:09:17.530 L10 []#2502 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,)
03:09:17.531 L25 []#2503 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
03:09:17.533 L20 []#2504 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
```

Replying to gyver (comment:12):

```
> 03:09:17.469 L20 []#2496 servers before upload:
> frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>])
> permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>],
> of which [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>]
> are writeable and set([]) are read-only
```

OK, that proves that the problem occurs after deciding which servers are writeable. We seem to be logging only responses to remote allocate_buckets requests at the gateway, so the next steps are:

a) Log when the gateway sends an allocate_buckets request.

b) Look at the logs of the storage servers to see how many of them receive an allocate_buckets request (which is logged at src/allmydata/storage/server.py line 248 in 1.9.2 as "storage: allocate_buckets <SI>") and what they do about it.

To do a), add this at line 105 of src/allmydata/immutable/upload.py (in the query method of ServerTracker after rref = self._server.get_rref()):

        log.msg(format="query(%(sharenums)s) for %(rref)s",
                sharenums=repr(sharenums), rref=repr(rref), level=log.OPERATIONAL)
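For step b), a small standalone script can scan each storage node's twistd.log for the "storage: allocate_buckets" line to see which nodes actually received the request. A minimal sketch (the node-directory paths and log location are assumptions; adjust them to wherever your seven storage nodes keep their logs):

```python
import re
from pathlib import Path

# Matches the server-side log line, e.g.
# "03:09:17.500 L20 []#2499 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe"
ALLOC = re.compile(r"storage: allocate_buckets (\w+)")

def nodes_that_allocated(si_prefix, node_dirs):
    """Return names of node directories whose logs/twistd.log records an
    allocate_buckets call for a storage index starting with si_prefix."""
    hits = []
    for nd in node_dirs:
        logfile = Path(nd) / "logs" / "twistd.log"
        if not logfile.exists():
            continue
        for line in logfile.read_text(errors="replace").splitlines():
            m = ALLOC.search(line)
            if m and m.group(1).startswith(si_prefix):
                hits.append(Path(nd).name)
                break
    return hits

# Hypothetical usage for the failed upload above:
# nodes_that_allocated("k5ga2", ["/mnt/usb1/storage", "/mnt/usb2/storage"])
```

Comparing the returned list against the seven connected servers shows directly whether the gateway queried fewer servers than it knew about.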
warner modified the milestone from 1.10.0 to 1.11.0 2012-09-04 16:59:28 +00:00

Same bug as #2016?

From #2016 which has now been marked as a duplicate:

daira wrote (in #2016, comment 390748):

Here's the most important part of the log:

```
local#675138 20:33:50.290: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server i76mi6: alreadygot=(0,), allocated=()
local#675139 20:33:50.457: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server lxmst5: alreadygot=(2,), allocated=(1,)
local#675140 20:33:50.667: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server sf7ehc: alreadygot=(3,), allocated=()
local#675141 20:33:50.822: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server ddvfcd: alreadygot=(4,), allocated=()
local#675142 20:33:50.839: <Tahoe2ServerSelector for upload jbljj>(jbljj): server selection unsuccessful for <Tahoe2ServerSelector for upload jbljj>:
 shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers.
 (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error)),
 merged=sh0: i76mi6en, sh1: lxmst5bx, sh2: lxmst5bx, sh3: sf7ehcpn, sh4: ddvfcdns
```

daira wrote (in #2016, comment 390749):

Here's my interpretation: with h = N = 5, as soon as the Tahoe2ServerSelector decides to put two shares on the same server (here sh1 and sh2 on lxmst5bx), the upload is doomed. The shares all have to be on different servers whenever h = N, but the termination condition is just that all shares have been placed, not that they have been placed in a way that meets the happiness condition.

If that's the problem, then #1382 should fix it. This would also explain why VG2 was unreliable with h close to N.
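Daira's "happiness" count can be made concrete: it is the size of a maximum matching between distinct servers and shares. The brute-force sketch below is only an illustration of that condition, not Tahoe's actual algorithm (the real computation lives in the servers-of-happiness code); it shows why the merged placement in the log above, with sh1 and sh2 both on lxmst5bx, can never reach happy = 5:

```python
from itertools import permutations

def servers_of_happiness(placement):
    """Size of a maximum matching between distinct servers and shares.

    placement maps share number -> set of server ids holding that share.
    Brute force over share orderings; fine for the handful of shares here.
    """
    best = 0
    for order in permutations(placement):
        used = set()
        for sh in order:
            for srv in placement[sh]:
                if srv not in used:
                    used.add(srv)
                    break
        best = max(best, len(used))
    return best

# Merged placement from the log above: sh1 and sh2 both landed on lxmst5bx.
merged = {
    0: {"i76mi6en"},
    1: {"lxmst5bx"},
    2: {"lxmst5bx"},
    3: {"sf7ehcpn"},
    4: {"ddvfcdns"},
}
print(servers_of_happiness(merged))  # 4 -- below happy=5, so the upload fails
```

All five shares are placed, which satisfies the current termination condition, yet only four distinct servers can be matched to shares; with h = N = 5 no recovery is possible without re-placing sh1 or sh2.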

zooko replied (in #2016, comment 390750):

Daira: excellent work diagnosing this!! Ed: thanks so much for the bug report. Daira: it looks like you are right, and I think this does explain those bugs that the volunteergrid2 people reported and that I never understood. Thank you!

kapiteined wrote (in #2016, comment 390752):

And to check if that is the case, I changed to 3-7-10 encoding, and now the upload succeeds! Success: file copied

Does this call for a change in code, or for a big warning sticker: "don't choose h and N too close together"?

daira wrote (in #2016, comment 390753):

We intend to fix it for v1.11 (Mark Berger's branch for #1382 already basically works), but there would be no harm in pointing out this problem on tahoe-dev in the meantime.

daira wrote (in #2016, comment 9):

Same bug as #1791?

Yes, that bug also had h = N and two shares that were placed on the same server, so almost identical. I'll copy the conclusions here to that ticket.


Replying to myself in comment:390757:

> In answer to your last question (gyver in comment:9), the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.
>
> [more precisely it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the UploadUnhappinessError message suggests that it isn't.]

I was wrong here; it is quite similar to #1130, which also has h = N. (#1130 has some additional oddities in the share distribution that was chosen, but I don't think they're relevant.) The fact that we terminate the distribution algorithm as soon as all shares are placed is the underlying problem in all these cases.


So, the branch from #1382 will fix this bug.
Reference: tahoe-lafs/trac#1791