rebalance during repair or upload #699

Open
opened 2009-05-08 16:11:28 +00:00 by zooko · 13 comments
zooko commented 2009-05-08 16:11:28 +00:00
Owner

In [this mailing list message](http://allmydata.org/pipermail/tahoe-dev/2009-May/001729.html), Humberto Ortiz-Zuazaga asks how to rebalance the shares of a file. To close this ticket, ensure (either as an option or unconditionally) that repairing or uploading a file attempts to "rebalance" its shares, making it healthy as defined by #614.

See also related tickets #481 (build some share-migration tools), #231 (good handling of small numbers of servers, or strange choice of servers), #232 (peer selection doesn't rebalance shares on overwrite of mutable file), and #543 (rebalancing manager).
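
As a rough illustration (not Tahoe code, and only an informal reading of "healthy" placement), one way to state the rebalancing goal is that no server should hold more than its fair share, ceil(N / num_servers), of a file's N total shares. A minimal sketch over a hypothetical sharemap:

```python
# Sketch only: a hypothetical balance check, not part of Tahoe-LAFS.
# `sharemap` maps a server ID to the set of share numbers that server holds.
from math import ceil

def is_balanced(sharemap: dict[str, set[int]], total_shares: int) -> bool:
    """True if no reachable server holds more than its 'fair share' of shares."""
    if not sharemap:
        return False
    fair_share = ceil(total_shares / len(sharemap))
    return all(len(shares) <= fair_share for shares in sharemap.values())

# Example: 10 shares, 10 servers, but server0 holds 2 shares and server9 holds none.
sharemap = {f"server{i}": {i} for i in range(9)}
sharemap["server0"].add(9)
sharemap["server9"] = set()
print(is_balanced(sharemap, total_shares=10))  # False: 2 > ceil(10/10) = 1
```

Whether #614's "healthy" definition matches this check exactly is beside the point; it is only meant to make the rebalancing goal concrete.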
tahoe-lafs added the
code-peerselection
major
enhancement
1.4.1
labels 2009-05-08 16:11:28 +00:00
tahoe-lafs added this to the eventually milestone 2009-05-08 16:11:28 +00:00
zooko commented 2009-08-10 15:28:29 +00:00
Author
Owner

The following clump of tickets might be of interest to people who are interested in this ticket: #711 (repair to different levels of M), #699 (optionally rebalance during repair or upload), #543 ('rebalancing manager'), #232 (Peer selection doesn't rebalance shares on overwrite of mutable file.), #678 (converge same file, same K, different M), #610 (upload should take better advantage of existing shares), #573 (Allow client to control which storage servers receive shares).

zooko commented 2009-08-10 15:45:34 +00:00
Author
Owner

Also related: #778 ("shares of happiness" is the wrong measure; "servers of happiness" is better).

davidsarah commented 2010-03-24 23:26:36 +00:00
Author
Owner

It seems more usable for this behaviour to be unconditional rather than an option: why would you not want to attempt to ensure that a file is healthy (per #614) on a repair or upload?

tahoe-lafs added
defect
and removed
enhancement
labels 2010-03-24 23:26:36 +00:00
tahoe-lafs modified the milestone from eventually to 1.7.0 2010-03-24 23:26:36 +00:00
tahoe-lafs changed title from optionally rebalance during repair or upload to rebalance during repair or upload 2010-03-24 23:26:36 +00:00
zooko commented 2010-03-25 04:56:28 +00:00
Author
Owner

Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/699#issuecomment-112732):

> why would you not want to attempt to ensure that a file is healthy (per #614) on a repair or upload?

I guess I was thinking about bandwidth and storage space. There could be some case where you want to repair but you don't want -- at that particular time -- to rebalance, because rebalancing might cost more bandwidth or something.

However, I strongly prefer to avoid offering options if we don't have to, and I can't think of a really good answer to this question, so I agree that this should be unconditional.
zooko commented 2010-05-16 05:15:40 +00:00
Author
Owner

#778 ("shares of happiness" is the wrong measure; "servers of happiness" is better) is fixed! I think that this fixes the parts of this ticket that have to do with immutable files. Once we likewise have rebalancing on upload of mutable files then we can close this ticket.

#778 ("shares of happiness" is the wrong measure; "servers of happiness" is better) is fixed! I think that this fixes the parts of this ticket that have to do with immutable files. Once we likewise have rebalancing on upload of mutable files then we can close this ticket.
davidsarah commented 2010-05-16 16:18:57 +00:00
Author
Owner

Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/699#issuecomment-112738):

> #778 ("shares of happiness" is the wrong measure; "servers of happiness" is better) is fixed! I think that this fixes the parts of this ticket that have to do with immutable files.

Doesn't #778 only fix this for upload of immutable files, not repair?
tahoe-lafs modified the milestone from 1.7.0 to 1.8.0 2010-05-16 16:18:57 +00:00
zooko commented 2010-05-16 16:58:12 +00:00
Author
Owner

Hm, let's see... the current repairer (`src/allmydata/immutable/repairer.py`) uses the current immutable uploader, so I think it inherits the improvements from #778. However, I guess this means we need a unit test that would be red if the repairer fails to rebalance (at least up to servers-of-happiness, and at least when not in one of the "tricky cases" for which our current uploader fails to achieve servers-of-happiness).
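
A minimal, self-contained sketch of the assertion such a test could make, with a toy round-robin redistribution standing in for the real repairer (every name here is hypothetical, not an actual Tahoe test fixture):

```python
# Sketch: what a "repairer rebalances" test could assert. toy_rebalance() is
# NOT the Tahoe repairer; a real test would run the actual repairer against a
# grid fixture and then make the same kind of assertion on the resulting sharemap.
from itertools import cycle

def toy_rebalance(sharemap: dict[str, set[int]]) -> dict[str, set[int]]:
    """Redistribute all known shares round-robin across all known servers."""
    shares = sorted(s for held in sharemap.values() for s in held)
    servers = cycle(sorted(sharemap))
    new_map: dict[str, set[int]] = {server: set() for server in sharemap}
    for share in shares:
        new_map[next(servers)].add(share)
    return new_map

def test_repair_rebalances():
    # Before: one server holds two shares, one (recently restarted) holds none.
    before = {f"server{i}": {i} for i in range(9)}
    before["server0"].add(9)
    before["server9"] = set()
    after = toy_rebalance(before)
    # The assertion a real repairer test would be red/green on:
    assert max(len(held) for held in after.values()) <= 1

test_repair_rebalances()
```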
tahoe-lafs modified the milestone from 1.8.0 to eventually 2010-08-12 20:55:04 +00:00
davidsarah commented 2010-08-12 23:44:12 +00:00
Author
Owner

Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/699#issuecomment-112741):

> Hm, let's see... the current repairer (`src/allmydata/immutable/repairer.py`) uses the current immutable uploader, so I think it inherits the improvements from #778. However, I guess this means we need a unit test that would be red if the repairer fails to rebalance (at least up to servers-of-happiness, and at least when not in one of the "tricky cases" for which our current uploader fails to achieve servers-of-happiness).

Those cases are described in #1124 and #1130. I don't think we should consider this ticket resolved for immutable uploads until those have been fixed.

Also, the uploader/repairer should be *attempting* to achieve "full happiness", i.e. a happiness value of N, even though it only reports failure when it fails to meet the happiness threshold, which may be lower than N. See ticket:778#comment:175.
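
For reference, the #778 "servers of happiness" measure is, as described in that ticket, the size of a maximum matching in the bipartite graph between servers and the shares they hold. A self-contained sketch of that computation (not Tahoe's own implementation):

```python
# Sketch: "servers of happiness" as the size of a maximum bipartite matching
# between servers and shares, so no single server counts for more than one share.

def servers_of_happiness(sharemap: dict[str, set[int]]) -> int:
    """Maximum number of servers that can each be matched to a distinct share they hold."""
    share_to_server: dict[int, str] = {}

    def try_assign(server: str, candidates: set[int], visited: set[int]) -> bool:
        # Standard augmenting-path step (Kuhn's algorithm).
        for share in candidates:
            if share in visited:
                continue
            visited.add(share)
            owner = share_to_server.get(share)
            if owner is None or try_assign(owner, sharemap[owner], visited):
                share_to_server[share] = server
                return True
        return False

    return sum(1 for server, held in sharemap.items() if try_assign(server, held, set()))

# 10 shares spread over only 9 servers (one server holds two): happiness is 9,
# not 10, even though all 10 shares exist somewhere.
sharemap = {f"server{i}": {i} for i in range(9)}
sharemap["server0"].add(9)
print(servers_of_happiness(sharemap))  # 9
```

This also illustrates the "full happiness" point above: with N = 10 the uploader/repairer should be trying to reach a happiness of 10, even if it only reports failure below the configured threshold.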
sickness commented 2010-10-14 07:05:25 +00:00
Author
Owner

I've found the same behaviour experimenting with 1.8.0 on 10 nodes with 3-7-10.
I tried to upload a file with one of the servers shut down; the upload said ok and put:
1 share on each of 8 servers, 2 shares on 1 server, and 0 shares on the shut-down server (obviously).
Then, when I powered the shut-down server back on, I tried to repair that file; the repair
said the file needed rebalancing but didn't do it, no matter what.
The solution was to follow zooko's advice in these mails:
<http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001735.html>
and
<http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001739.html>
So basically I deleted 1 of the 2 shares on the server with 2 shares and then repaired the file.
Now all the servers hold 1 share.
I'd like to have an option to tell the uploader to never put more than N shares per server "no matter what" (I'd set it to 1, because I don't want one server to hold more "responsibility" than I've planned for...)
tnx! :)
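
For anyone needing this manual workaround today, it can be scripted. A rough sketch, assuming you have already identified the surplus share file(s) on the overloaded server's disk (the paths and cap below are placeholders) and that a client node with the `tahoe check --repair` command is available:

```python
# Sketch of the manual "prune then repair" workaround described above.
# Assumptions: shell access to the storage server holding the extra share(s),
# example paths that must be adapted to your storage layout, and a client node
# with the `tahoe` CLI available to run the repair.
import os
import subprocess

# Share files the operator has identified as surplus copies on one server.
surplus_share_files = [
    "/var/tahoe/storage/shares/xx/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/9",  # example path
]

file_cap = "URI:CHK:..."  # the cap of the file to rebalance (placeholder)

for path in surplus_share_files:
    if os.path.exists(path):
        print(f"removing surplus share {path}")
        os.remove(path)

# Re-run check-and-repair so the repairer re-uploads the pruned share,
# hopefully landing it on the previously empty server.
subprocess.run(["tahoe", "check", "--repair", file_cap], check=True)
```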
amontero commented 2012-03-03 18:46:33 +00:00
Author
Owner

Replying to [sickness](/tahoe-lafs/trac-2024-07-25/issues/699#issuecomment-112744):

> I've found the same behaviour experimenting with 1.8.0 on 10 nodes with 3-7-10.
> I tried to upload a file with one of the servers shut down; the upload said ok and put:
> 1 share on each of 8 servers, 2 shares on 1 server, and 0 shares on the shut-down server (obviously).
> Then, when I powered the shut-down server back on, I tried to repair that file; the repair
> said the file needed rebalancing but didn't do it, no matter what.
> The solution was to follow zooko's advice in these mails:
> <http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001735.html>
> and
> <http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001739.html>
> So basically I deleted 1 of the 2 shares on the server with 2 shares and then repaired the file.
> Now all the servers hold 1 share.
> I'd like to have an option to tell the uploader to never put more than N shares per server "no matter what" (I'd set it to 1, because I don't want one server to hold more "responsibility" than I've planned for...)
> tnx! :)

Hi sickness.

I'm following zooko's advice, since I have the same problem. My use case is described in #1657 if you're interested.
I've created a [bash script](http://pastebin.com/6w8AibEf) that may help you. It's not the most efficient way of doing it, but it saves me a lot of time.
If some maintainer can upload it to some misc tools folder in the repo, maybe others will be able to use it as a starting point and hopefully improve it, since my bash skills are somewhat limited.

Hope you find it useful.
sickness commented 2012-10-27 18:39:45 +00:00
Author
Owner

Replying to amontero (comment:14):

> Hi sickness.
>
> I'm following zooko's advice, since I have the same problem. My use case is described in #1657 if you're interested.
> I've created a [bash script](http://pastebin.com/6w8AibEf) that may help you. It's not the most efficient way of doing it, but it saves me a lot of time.
> If some maintainer can upload it to some misc tools folder in the repo, maybe others will be able to use it as a starting point and hopefully improve it, since my bash skills are somewhat limited.
>
> Hope you find it useful.

yeah, tnx for this script! It would be really useful, but I've only found it just now :/ and it prints this error:

tahoe-prune.sh: line 49: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
tahoe-prune.sh: line 55: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]

(this is on Debian with bash)

I've also found this tool: <http://killyourtv.i2p.to/tahoe-lafs/rebalance-shares.py/>. It seems to do basically the same thing, but in Python.
amontero commented 2012-10-28 01:08:38 +00:00
Author
Owner

I can't help with why -i doesn't work for you. Anyway, you can just omit the `-i "n"` part, since it's only meant to default the answer to "no" when asking.

Thanks for the Python script. I'll keep it at hand for when my time comes to learn Python :)
For now, either of those scripts can do a "dirty rebalancing".

As a workaround, I think that a "hold no more than Z shares" setting on each server could make this easier. Just firing repairs at regular intervals would eventually create shares on all servers. Any server already holding the desired number of shares would simply not accept any more shares for that file, so there would be no need for pruning. This way, we could easily tune how much "responsibility" each node accepts for each file. Newly arriving servers would eventually catch up when repairing.

I'm interested in developer feedback on this path. In case it's easy enough for a Python novice, I could take a stab at it. I tried months ago, but could not find my way through the code, so implementation advice is very welcome.

A future rebalancer would be way smarter, but meanwhile I think this approach would solve some use cases.
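
To make the proposal concrete, here is a rough sketch of the acceptance check a storage server might perform if such a per-file share limit existed; this option does not exist in Tahoe today, and all names below are made up for illustration:

```python
# Sketch of the proposed (hypothetical, not implemented) per-server limit:
# a storage server refuses new shares of a file once it already holds
# max_shares_per_file of them, so repeated repairs spread shares outward
# instead of piling up on the servers that answer first.

def should_accept_share(held_sharenums: set[int], incoming_sharenum: int,
                        max_shares_per_file: int = 1) -> bool:
    """Return True if this server should accept the offered share for this file."""
    if incoming_sharenum in held_sharenums:
        return True  # we already have this exact share; accepting is a no-op
    return len(held_sharenums) < max_shares_per_file

# Example: a server already holding share 3 of some file, with the limit set to 1.
print(should_accept_share({3}, 7))    # False: would exceed the per-file limit
print(should_accept_share({3}, 3))    # True: duplicate of a share it already has
print(should_accept_share(set(), 7))  # True: holds nothing for this file yet
```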
tahoe-lafs modified the milestone from eventually to 1.11.0 2013-02-15 03:31:02 +00:00
davidsarah commented 2013-02-15 03:51:31 +00:00
Author
Owner

This ticket is likely to be fixed by implementing the repair algorithm in #1130.
