UploadUnhappinessError with available storage nodes > shares.happy #1791

Open
opened 2012-07-06 23:32:42 +00:00 by gyver · 15 comments
gyver commented 2012-07-06 23:32:42 +00:00
Owner

The error happened with 1.9.1 too. I just upgraded to 1.9.2 and fixed some files/dirs that 1.9.1 couldn't repair reliably, hoping the following problem would go away too (it didn't).

There are some peculiarities in my setup: I use USB disks connected to a single server, so all storage nodes run on the same server, although each is physically on a disk that can easily be taken elsewhere to increase the durability of the whole storage. At the time of failure there were 7 such storage nodes in my setup and my whole store was fully repaired on these 7 nodes; all the content is/was uploaded with
shares.needed = 4
shares.happy = 6
shares.total = 6
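
With these parameters an upload is only "happy" when the shares end up spread over at least shares.happy distinct servers. A minimal sketch of that condition (not the actual Tahoe-LAFS code; the real metric is a maximum bipartite matching, which reduces to counting distinct servers when each share sits on exactly one server):

```python
# Sketch of the "servers of happiness" condition, assuming each share
# is placed on exactly one server (function name is illustrative, not
# the Tahoe-LAFS API).

def is_happy(share_to_server, happy):
    """share_to_server: dict mapping share number -> server id."""
    distinct_servers = set(share_to_server.values())
    return len(distinct_servers) >= happy

# 6 shares on 6 distinct servers satisfies happy=6; 6 shares where two
# of them land on the same server yields only 5 distinct servers and fails.
placement = {0: "zp6jpfeu", 1: "pa2myijh", 2: "pa2myijh",
             3: "omkzwfx5", 4: "wo6akhxt", 5: "ughwvrtu"}
print(is_happy(placement, happy=6))  # False: only 5 distinct servers
```

So having 7 connected servers is necessary but not sufficient; the placement itself has to spread the shares out.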

Although 7 >= 6, I get this error when trying to tahoe cp a new file:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/foolscap/call.py", line 677, in _done
    self.request.complete(res)
  File "/usr/lib64/python2.7/site-packages/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback
    self._startRunCallbacks(result)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 553, in _got_response
    return self._loop()
  File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 404, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 566, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

I recently found out about flogtool, so I ran it on the client node (which is one of the 7 storage nodes, btw). I only pasted the last part, from CHKUploader (I can attach the whole log if need be):

01:04:01.314 L20 []#2339 CHKUploader starting
01:04:01.314 L20 []#2340 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2c9b5a8>
01:04:01.314 L20 []#2341 creating Encoder <Encoder for unknown storage index>
01:04:01.314 L20 []#2342 file size: 4669394
01:04:01.314 L10 []#2343 my encoding parameters: (4, 6, 6, 131072)
01:04:01.314 L20 []#2344 got encoding parameters: 4/6/6 131072
01:04:01.314 L20 []#2345 now setting up codec
01:04:01.348 L20 []#2346 using storage index k5ga2
01:04:01.348 L20 []#2347 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting
01:04:01.363 L10 []#2348 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=()
01:04:01.372 L10 []#2349 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,)
01:04:01.375 L20 []#2350 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe
01:04:01.377 L10 []#2351 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=()
01:04:01.381 L10 []#2352 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=()
01:04:01.404 L10 []#2353 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,)
01:04:01.405 L25 []#2354 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
01:04:01.407 L20 []#2355 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
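
The merged= map at the end of the selector log is the key detail: all 6 shares were placed, but sh1 and sh2 both sit on pa2myijh. A quick sketch of reading that map off the log line (the parsing is my own, not a Tahoe-LAFS tool):

```python
# Parse the "merged=..." share map from the flogtool line above and
# count how many distinct servers actually hold shares.
merged = ("sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, "
          "sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu")

placements = dict(entry.split(": ") for entry in merged.split(", "))
servers = set(placements.values())
print(len(placements), len(servers))  # 6 shares, 5 distinct servers
```

Six shares over five distinct servers is exactly the "placed all 6 shares" / "only 5 server(s)" combination the error message reports.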
tahoe-lafs added the
c/unknown
p/major
t/defect
v/1.9.2
labels 2012-07-06 23:32:42 +00:00
tahoe-lafs added this to the undecided milestone 2012-07-06 23:32:42 +00:00
daira was assigned by tahoe-lafs 2012-07-06 23:32:42 +00:00

Thanks for the log, that's very useful.

The uploader only tried to contact 5 servers, which is the problem. Are you absolutely sure that more than the 5 servers mentioned (i.e. zp6jpfeu, pa2myijh, omkzwfx5, wo6akhxt, ughwvrtu) are connected?

Maybe we should include the set of connected servers in the log or the error message.

gyver commented 2012-07-07 13:38:21 +00:00
Author
Owner

Yes, I'm quite sure all 7 were active. I start all nodes with a single script that in turn:

  • starts the introducer,
  • starts each storage node.

If one of the expected nodes can't be started, I see it right away in the script output when starting the grid. At the time of failure, I even checked the web interface of both the node I use as a client and the introducer, and they both listed all 7 storage nodes. I also checked that there was plenty of free space on each storage node and that no configured reserved space could explain a node refusing to store data.

I just rechecked and noticed something. The server has 3 IP addresses: the loopback, a private IP on a local network and a private IP on a VPN (managed by OpenVPN). Apparently each node advertises its services on all 3 IPs (I assume it's by design).
But the listing of storage nodes given by my "client" node isn't exactly consistent with the one given by the introducer.

Here are the current outputs (there shouldn't be any security problem publishing this so I didn't obfuscate anything):

Introducer's Service Announcements:

Nickname PeerID Advertised IPs Announced Version Service name
store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv 192.168.0.1 10.8.0.10 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_2 d3fycappvcjuma7vwc45spfolaei6n24 192.168.0.1 10.8.0.10 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv 192.168.0.1 10.8.0.10 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_4 pa2myijhnj3ld4owmqxsihow4hv2botv 192.168.0.1 10.8.0.10 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_5 wo6akhxtren7esuhbtjscmpzvjwuvken 192.168.0.1 10.8.0.10 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr 192.168.0.1 10.8.0.10 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr 192.168.0.1 10.8.0.10 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2 storage

Introducer's Subscribed Clients:

Nickname PeerID Advertised IPs Connected From Since Version Subscribed To
store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv 192.168.0.1 10.8.0.10 192.168.0.1:54872 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_5 wo6akhxtren7esuhbtjscmpzvjwuvken 192.168.0.1 10.8.0.10 192.168.0.1:54949 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_2 d3fycappvcjuma7vwc45spfolaei6n24 192.168.0.1 10.8.0.10 192.168.0.1:54884 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv 192.168.0.1 10.8.0.10 127.0.0.1:55124 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_4 pa2myijhnj3ld4owmqxsihow4hv2botv 192.168.0.1 10.8.0.10 127.0.0.1:55143 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr 192.168.0.1 10.8.0.10 127.0.0.1:55207 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr 192.168.0.1 10.8.0.10 127.0.0.1:55252 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2 storage

"The storage node I use as a client"'s status:

Connected to 7 of 7 known storage servers:

Service Name Nickname PeerID Connected? Since First Announced Version
storage store_2 d3fycappvcjuma7vwc45spfolaei6n24 Yes: to 127.0.0.1:48397 00:45:25 07-Jul-2012 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv Yes: to 192.168.0.1:38708 00:45:26 07-Jul-2012 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv Yes: to (loopback) 00:45:24 07-Jul-2012 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_4 pa2myijhnj3ld4owmqxsihow4hv2botv Yes: to 10.8.0.10:40626 00:45:27 07-Jul-2012 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr Yes: to 10.8.0.10:49604 00:45:31 07-Jul-2012 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_5 wo6akhxtren7esuhbtjscmpzvjwuvken Yes: to 127.0.0.1:50958 00:45:29 07-Jul-2012 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr Yes: to 10.8.0.10:52661 00:45:30 07-Jul-2012 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2

I'm not sure how the service announcement and IP selection works, but there seems to be at least some amount of chance involved in the IP selection. All nodes should behave the same way, so AFAIK the same IP should be selected.


Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.

I don't know why the determination of which IP to use is nondeterministic. warner?


BTW, I have seen nondeterministic choice of IP on the Least Authority Enterprises servers (which are EC2 instances running Ubuntu) as well.

gyver commented 2012-07-07 15:11:42 +00:00
Author
Owner

Replying to davidsarah:

Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.

Note that the assigned IPs are not always the same: a restart of all storage nodes reshuffles them. My last failed attempt was with d3fycapp and lehmccp7 both seen on 10.8.0.10 (the VPN IP) by the client node (5 nodes were seen on 10.8.0.10, with ughwvrtu on 192.168.0.1 and the client (omkzwfx5) on loopback). The IP addresses used don't seem to change the error: I've always seen the same one (only 5 servers queried) since the problem appeared.

The problem may not have anything to do with IP address choices but it seems to me these inconsistencies are odd enough to keep in mind.


Please add the following just after line 225 (i.e. after `readonly_servers = ...` and before `# decide upon the renewal/cancel secrets...`) of src/allmydata/immutable/upload.py in 1.9.2 (source:1.9.2/src/allmydata/immutable/upload.py):

        self.log(format="servers before upload: %(connected_servers)r permuted to %(all_servers)r, of which"
                        "%(writeable_servers)r are writeable and %(readonly_servers)r are read-only",
                 connected_servers=storage_broker.get_connected_servers(),
                 all_servers=all_servers,
                 writeable_servers=writeable_servers,
                 readonly_servers=readonly_servers,
                 level=log.OPERATIONAL)

and then show the log for a failing upload.

(You need to restart the gateway after changing the code, but it's not necessary to rebuild it.)

daira added
c/code-peerselection
and removed
c/unknown
labels 2012-07-07 19:40:15 +00:00
daira modified the milestone from undecided to 1.10.0 2012-07-07 19:40:15 +00:00
gyver commented 2012-07-07 21:30:23 +00:00
Author
Owner

Replying to davidsarah:

Please add the following just after line 225 (i.e. after `readonly_servers = ...` and before `# decide upon the renewal/cancel secrets...`) of src/allmydata/immutable/upload.py in 1.9.2 (source:1.9.2/src/allmydata/immutable/upload.py):

I may not have done it right: I got the same output, with this at the end:

23:09:02.238 L23 []#2436 an outbound callRemote (that we [omkz] sent to someone else [zqxq]) failed on the far end
23:09:02.238 L10 []#2437  reqID=873, rref=<RemoteReference at 0x2e780d0>, methname=RILogObserver.foolscap.lothar.com.msg
23:09:02.238 L10 []#2438  the REMOTE failure was:
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib64/python2.7/site-packages/foolscap/slicers/root.py", line 107, in send
     d.callback(None)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback
     self._startRunCallbacks(result)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
 --- <exception caught here> ---
   File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 215, in produce
     slicer = self.newSlicerFor(obj)
   File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 314, in newSlicerFor
     return topSlicer.slicerForObject(obj)
   File "/usr/lib64/python2.7/site-packages/foolscap/slicer.py", line 48,

BUT... I may have a lead looking at the last error message in my original log dump.

server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) [...], merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu

I assume the sh<n> are the shares to be placed. sh1 and sh2 were assigned to pa2myijh. I'm not sure whether this distribution is the result of share detection (my guess) or of a share placement algorithm that could produce an invalid placement and needs a check before upload (late error detection isn't good practice, so I bet it's not the case).

What if these shares were already stored on pa2myijh before the upload attempt (due to past uploads with a buggy version, or whatever happened in the store directory outside Tahoe's control)? Is the code able to detect such a case and re-upload one of the two shares to a free server (one without any of the 6 shares)? If not, it might be the cause of my problem (the file was part of a long list of files I tried to upload with only partial success weeks ago...) and my storage nodes are most probably polluted by "dangling" shares.
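
When pre-existing shares enter the picture, a share can be held by more than one server, and the happiness metric becomes the size of a maximum matching between shares and the servers holding them. A minimal sketch of that computation (my own illustration via augmenting paths, not the actual Tahoe-LAFS implementation):

```python
# Sketch of the servers-of-happiness metric when a share may be found
# on several servers: happiness = size of a maximum bipartite matching
# between shares and the servers that hold them.

def happiness(share_to_servers):
    """share_to_servers: dict mapping share number -> set of server ids."""
    match = {}  # server -> share currently matched to it

    def try_assign(share, seen):
        # Standard augmenting-path search: give `share` a server, moving
        # previously matched shares to alternatives when possible.
        for server in share_to_servers.get(share, ()):
            if server in seen:
                continue
            seen.add(server)
            if server not in match or try_assign(match[server], seen):
                match[server] = share
                return True
        return False

    return sum(try_assign(sh, set()) for sh in share_to_servers)

# sh1 and sh2 are both only on pa2myijh: one of them cannot be matched
# to its own server, so happiness is 5 even though all 6 shares exist.
existing = {0: {"zp6jpfeu"}, 1: {"pa2myijh"}, 2: {"pa2myijh"},
            3: {"omkzwfx5"}, 4: {"wo6akhxt"}, 5: {"ughwvrtu"}}
print(happiness(existing))  # 5
```

Under this metric, a duplicate share left over from an earlier upload only hurts happiness if the uploader fails to place a fresh copy of one of the doubled shares on an otherwise-unused server.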


There was a bug in the statement I asked you to add; please replace it entirely with this one:

        self.log(format="servers before upload: %(connected_servers)s permuted to %(all_servers)s, of which"
                        "%(writeable_servers)s are writeable and %(readonly_servers)s are read-only",
                 connected_servers=repr(storage_broker.get_connected_servers()),
                 all_servers=repr(all_servers),
                 writeable_servers=repr(writeable_servers),
                 readonly_servers=repr(readonly_servers),
                 level=log.OPERATIONAL)

In answer to your last question, the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.

[more precisely it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the UploadUnhappinessError message suggests that it isn't.]

Edit:


I'm hoping that this bug is the same one that has been occasionally reported on VolunteerGrid2 with uploads where shares.happy is close to shares.total (and to the number of servers). It has very similar symptoms, but gyver seems to be able to reproduce it more easily.

gyver commented 2012-07-08 01:12:29 +00:00
Author
Owner

Replying to davidsarah:

> [...] but gyver seems to be able to reproduce it more easily.
You bet: I take more time firing ssh and joining the screen session than the tahoe cp takes to fail :-)

Here's the log with a bit of context:

```
03:09:17.432 L20 []#2487 CHKUploader starting
03:09:17.432 L20 []#2488 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2fe8998>
03:09:17.432 L20 []#2489 creating Encoder <Encoder for unknown storage index>
03:09:17.432 L20 []#2490 file size: 4669394
03:09:17.432 L10 []#2491 my encoding parameters: (4, 6, 6, 131072)
03:09:17.432 L20 []#2492 got encoding parameters: 4/6/6 131072
03:09:17.432 L20 []#2493 now setting up codec
03:09:17.468 L20 []#2494 using storage index k5ga2
03:09:17.468 L20 []#2495 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting
03:09:17.469 L20 []#2496 servers before upload: frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>]) permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>], of which[<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>] are writeable and set([]) are read-only
03:09:17.485 L10 []#2497 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=()
03:09:17.497 L10 []#2498 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,)
03:09:17.500 L20 []#2499 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe
03:09:17.515 L10 []#2500 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=()
03:09:17.523 L10 []#2501 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=()
03:09:17.530 L10 []#2502 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,)
03:09:17.531 L25 []#2503 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
03:09:17.533 L20 []#2504 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
```

Replying to gyver (comment:12):

```
> 03:09:17.469 L20 []#2496 servers before upload:
> frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>])
> permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>],
> of which [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>]
> are writeable and set([]) are read-only
```

OK, that proves that the problem occurs after deciding which servers are writeable. We seem to be logging only responses to remote allocate_buckets requests at the gateway, so the next steps are:

a) Log when the gateway sends an allocate_buckets request.

b) Look at the logs of the storage servers to see how many of them receive an allocate_buckets request (which is logged at src/allmydata/storage/server.py line 248 in 1.9.2 as "storage: allocate_buckets <SI>") and what they do about it.

To do a), add this at line 105 of src/allmydata/immutable/upload.py (in the query method of ServerTracker after rref = self._server.get_rref()):

        log.msg(format="query(%(sharenums)s) for %(rref)s",
                sharenums=repr(sharenums), rref=repr(rref), level=log.OPERATIONAL)
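For step b), a small standalone script can scan each storage node's twistd.log for the "storage: allocate_buckets" line to see which nodes actually received the request. A minimal sketch (the node-directory paths and log location are assumptions; adjust them to wherever your seven storage nodes keep their logs):

```python
import re
from pathlib import Path

# Matches the server-side log line, e.g.
# "03:09:17.500 L20 []#2499 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe"
ALLOC = re.compile(r"storage: allocate_buckets (\w+)")

def nodes_that_allocated(si_prefix, node_dirs):
    """Return names of node directories whose logs/twistd.log records an
    allocate_buckets call for a storage index starting with si_prefix."""
    hits = []
    for nd in node_dirs:
        logfile = Path(nd) / "logs" / "twistd.log"
        if not logfile.exists():
            continue
        for line in logfile.read_text(errors="replace").splitlines():
            m = ALLOC.search(line)
            if m and m.group(1).startswith(si_prefix):
                hits.append(Path(nd).name)
                break
    return hits

# Hypothetical usage for the failed upload above:
# nodes_that_allocated("k5ga2", ["/mnt/usb1/storage", "/mnt/usb2/storage"])
```

Comparing the returned list against the seven connected servers shows directly whether the gateway queried fewer servers than it knew about.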
warner modified the milestone from 1.10.0 to 1.11.0 2012-09-04 16:59:28 +00:00

Same bug as #2016?

From #2016 which has now been marked as a duplicate:

daira wrote (in #2016, comment 390748):

Here's the most important part of the log:

```
local#675138 20:33:50.290: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server i76mi6: alreadygot=(0,), allocated=()
local#675139 20:33:50.457: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server lxmst5: alreadygot=(2,), allocated=(1,)
local#675140 20:33:50.667: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server sf7ehc: alreadygot=(3,), allocated=()
local#675141 20:33:50.822: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server ddvfcd: alreadygot=(4,), allocated=()
local#675142 20:33:50.839: <Tahoe2ServerSelector for upload jbljj>(jbljj): server selection unsuccessful for <Tahoe2ServerSelector for upload jbljj>:
 shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers.
 (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error)),
 merged=sh0: i76mi6en, sh1: lxmst5bx, sh2: lxmst5bx, sh3: sf7ehcpn, sh4: ddvfcdns
```

daira wrote (in #2016, comment 390749):

Here's my interpretation: with h = N = 5, as soon as the Tahoe2ServerSelector decides to put two shares on the same server (here sh1 and sh2 on lxmst5bx), the upload is doomed. The shares all have to be on different servers whenever h = N, but the termination condition is just that all shares have been placed, not that they have been placed in a way that meets the happiness condition.

If that's the problem, then #1382 should fix it. This would also explain why VG2 was unreliable with h close to N.
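Daira's "happiness" count can be made concrete: it is the size of a maximum matching between distinct servers and shares. The brute-force sketch below is only an illustration of that condition, not Tahoe's actual algorithm (the real computation lives in the servers-of-happiness code); it shows why the merged placement in the log above, with sh1 and sh2 both on lxmst5bx, can never reach happy = 5:

```python
from itertools import permutations

def servers_of_happiness(placement):
    """Size of a maximum matching between distinct servers and shares.

    placement maps share number -> set of server ids holding that share.
    Brute force over share orderings; fine for the handful of shares here.
    """
    best = 0
    for order in permutations(placement):
        used = set()
        for sh in order:
            for srv in placement[sh]:
                if srv not in used:
                    used.add(srv)
                    break
        best = max(best, len(used))
    return best

# Merged placement from the log above: sh1 and sh2 both landed on lxmst5bx.
merged = {
    0: {"i76mi6en"},
    1: {"lxmst5bx"},
    2: {"lxmst5bx"},
    3: {"sf7ehcpn"},
    4: {"ddvfcdns"},
}
print(servers_of_happiness(merged))  # 4 -- below happy=5, so the upload fails
```

All five shares are placed, which satisfies the current termination condition, yet only four distinct servers can be matched to shares; with h = N = 5 no recovery is possible without re-placing sh1 or sh2.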

zooko replied (in #2016, comment 390750):

Daira: excellent work diagnosing this!! Ed: thanks so much for the bug report. Daira: it looks like you are right, and I think this does explain those bugs that the volunteergrid2 people reported and that I never understood. Thank you!

kapiteined wrote (in #2016, comment 390752):

And to check if that is the case, I changed to 3-7-10 encoding, and now the upload succeeds! Success: file copied

Does this call for a change in code, or for a big warning sticker: "don't choose h and N too close together"?

daira wrote (in #2016, comment 390753):

We intend to fix it for v1.11 (Mark Berger's branch for #1382 already basically works), but there would be no harm in pointing out this problem on tahoe-dev in the meantime.

daira wrote (in #2016, comment 9):

Same bug as #1791?

Yes, that bug also had h = N and two shares that were placed on the same server, so almost identical. I'll copy the conclusions here to that ticket.


Replying to myself in comment:390757:

> In answer to your last question (gyver in comment:9), the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.
>
> [more precisely it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the UploadUnhappinessError message suggests that it isn't.]

I was wrong here; it is quite similar to #1130, which also has h = N. (#1130 has some additional oddities in the share distribution that was chosen, but I don't think they're relevant.) The fact that we terminate the distribution algorithm as soon as all shares are placed is the underlying problem in all these cases.


So, the branch from #1382 will fix this bug.
Reference: tahoe-lafs/trac#1791