immutable peer selection refactoring and enhancements #1382
I've been working on refactoring and improving immutable peer selection. I have several immediate goals for this project.
These improvements correspond roughly to issues presented in tickets #610, #614, #1115, #1124, #1130, and #1293. Unifying mutable and immutable peer selection is ticket #1057, though I don't expect to address that until after MDMF (#393) is merged and until we're done with this ticket.
First up is a draft of `IPeerSelector`, an abstract peer selection class. The draft is contained in attachment:interface.dpatch.

A significant annoyance with servers of happiness (and a source of many of its shortcomings) is the fact that it wasn't used to motivate share placement or server selection in a well thought out way; rather, it was just used to evaluate the job that the existing share placement strategy did. So, with `IPeerSelector` (and with the implicit assumption that the peer selection and share placement strategies you'd want to use when uploading a file are probably generally motivated by how you define file health), I wanted to combine both file health and share placement in one abstraction. I still have a few small issues to fix with the uploader changes that actually use the interface, so you'll have to wait to see the code that exercises and implements the interface, but the interaction is basically:

* the uploader calls `add_peer_with_share` to tell the peer selector when a peer already holds a share of the file (i.e., about that relationship);
* the uploader reports successes (via `add_peer_with_share`) and failures (via `mark_full_peer` and `mark_bad_peer`) as appropriate.

I have an immutable uploader that works pretty well with this sort of interaction.
Things to look for if you're reviewing:
(of course, you're welcome to look for whatever you want -- that's just what I think I need feedback on :-)
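For readers skimming the thread, here is a rough sketch of the interaction described above. It is hypothetical driver code, not the patch itself: only the method names `add_peer_with_share`, `mark_full_peer`, and `mark_bad_peer` come from the comment above; `get_share_assignments`, the `allocate` callable, and the exception types are invented for illustration.

```python
class PeerFullError(Exception):
    pass

class PeerFailedError(Exception):
    pass

def drive_peer_selector(selector, existing_shares, shares_to_place, allocate):
    """Sketch of an uploader driving an IPeerSelector-like object.

    existing_shares: iterable of (peer, shnum) pairs already on the grid.
    allocate(peer, shnum): attempts a placement; raises the errors above.
    """
    # Tell the selector which peers already hold which shares.
    for peer, shnum in existing_shares:
        selector.add_peer_with_share(peer, shnum)
    # Try to realize an assignment and report the outcome of each attempt.
    for peer, shnum in selector.get_share_assignments(shares_to_place):
        try:
            allocate(peer, shnum)
            selector.add_peer_with_share(peer, shnum)  # success
        except PeerFullError:
            selector.mark_full_peer(peer)
        except PeerFailedError:
            selector.mark_bad_peer(peer)
```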
Attachment interface.dpatch (13846 bytes) added
initial implementation of IPeerSelector interface
Attachment interfaces.dpatch (35792 bytes) added
update IPeerSelector definition
Attachment 1382.dpatch (155132 bytes) added
snapshot of #1382; includes new, simplified uploader, refactored server selection logic, tests
On the Tahoe-LAFS Weekly Call, we agreed that this isn't likely to be ready for 1.9.0.
Kevan says there isn't any specific thing that he knows is wrong with this patch, but he thinks of it as an experiment that would need more consideration and hunting for edge cases before it should go into trunk.
There are a few other tickets that may be affected by this one: #362, #1394, #873, #610, at least.
We discussed #1382 at the second Tahoe-LAFS summit. We concluded that this approach (splitting peer selection into two parts along a well-defined and generally useful interface) was a good approach, and that a more complete and correct share allocation algorithm (share director? share overlord? I don't think we agreed on a name for the object) was a good idea. I think we concluded that the right path to bring #1382 into trunk was something like:
Once that's done, we'll have an easy way for people to add new share placement algorithms to Tahoe-LAFS, and a share placement algorithm that fixes the corner cases users currently experience with servers of happiness.
I should have more free time in December than I've had lately, so I'll be happy to work on #1382. I don't know if we want to plan on #1382 for 1.10, or defer it until a future release; if we discussed that at the summit, I don't remember what we concluded.
Kevan: excellent! I'm looking forward to progress on this. Your summary of the Summit discussion matches my recollections. We did not talk about whether it would go into 1.10 or not. Let's decide later. I don't even know who is going to be Release Wrangler for 1.10. (Hopefully Brian. He had some good ideas about "more automation, less process".)
You can follow my progress on this project by watching the [ticket1382 branch on my GitHub page](https://github.com/isnotajoke/tahoe-lafs/tree/ticket1382). I'm trying to assemble a series of small deltas that take us from our current immutable peer selector to something that can be easily modified to use a pluggable peer selector, and would be particularly interested in knowing whether the deltas that are there now (and show up there as I continue to work on it) are small enough to be easily reviewed.
As Release Manager for 1.10, I say this goes in.
(I effectively renamed 1.10 to 1.11, so that milestone is correct.)
I'm moving this to Milestone: "eventually", which means that we agree it ought to be fixed, but we don't agree that it is going to get fixed in 1.11.
If someone (e.g. kevan or markberger) wants to prioritize this, then I suppose they might finish it in time for Milestone 1.11. In that case they can move it back into this milestone when they start working on it.
Huh. I thought this was the main ticket we wanted to get fixed for 1.11.
Replying to daira:
Oh! Well, it is the main ticket that I want to get fixed for markberger's Google Summer of Code project, but I hope we release Tahoe-LAFS v1.11 before that project is complete...
To view progress on this ticket refer to https://github.com/markberger/tahoe-lafs/tree/1382-immutable-peer-selection
I'm currently figuring out the last few tests that fail with the new uploader. One of them is test_good_share_hosts in src/allmydata/test/test_checker.py. Here is a summary of the test:
However, with the upload strategy of happiness, repair produces
Is this behavior acceptable? The new algorithm views any additional share placements over N as redundant and does not attempt to place any more shares. However, this appears to be different from the old algorithm, which looks like it tried to place shares on as many servers as possible, regardless of N (at least from this test).
Yes, I believe the new behavior is okay. The real question should be: does it match the specification? Well, what is the specification? Could you please make sure it is written in a precise and succinct form in a .rst file under source:docs/ ? (It can refer to Kevan's thesis, but the reader should be able to tell exactly what the specification is without reading Kevan's thesis.)
I'm pretty sure that the result you show in comment:384150 does meet the spec.
It seems non-optimal that the new algorithm does not place share D. The algorithm in /tahoe-lafs/trac/issues/27074#comment:384138 does place share D, in step 5.
Zooko (comment:23), I have added a document under source:docs/specifications that you should be able to see on the github branch.
Daira (comment:384152), the map is shares to servers, so the algorithm is placing share 0 on server A, share 1 on servers A and B, etc. So all of the shares are being placed, but not all of the servers get a share because the set of servers is larger than the set of shares.
Oh I see, I misread it.
I think this behaviour is completely acceptable.
Does the new code also decide which leases need to be renewed?
Yes. The uploader attempts to place the existing shares on their respective servers and this causes the servers to renew their shares instead of uploading them again.
In the 7/9/13 weekly dev chat, Brian, Daira, and Zooko agreed that the upload algorithm associated with this ticket should contact 2n servers. There was also discussion of what to do when existing shares are found. Refer to ticket #610.
I think I've found the error for why `allmydata.test.test_download.Corruption.test_each_byte` fails. Unlike the previous tests I needed to modify for this patch, the downloader is not deterministic. It will download the same shares each time, but not in the same order. Therefore different corruption offsets are not detected each time the test is run. In order for `allmydata.test.test_download.Corruption.test_each_byte` to pass, something must be modified to ensure the downloader is deterministic.

It turns out what I perceived to be download order was not the case. Please ignore comment:384157.
My github branch (https://github.com/markberger/tahoe-lafs/tree/1382-immutable-peer-selection) passes all tests.
The above branch imposes the 2n server limit and a test has been added to check whether 2n servers are contacted.
I rewrote my messy commit history and rebased the branch to trunk. You can find the new branch here: https://github.com/markberger/tahoe-lafs/tree/1382-rewrite
For anyone who is going to review the patch, I suggest reviewing each group in the same sitting.
This would fix #1814.
I believe this would fix #1130.
Mark: I started doing real code review while on a plane to Washington D.C. today. Here are my notes. I apologize if they are confusing or incomplete. I'm deciding I should send you these notes now instead of sleeping on it, in order to not delay.
Overall, these patches are really great! You've done a fantastic job of satisfying all the requirements, recording the patches into coherent chunks, etc. I'm making a lot of requests of you in the following review notes, and I hope it doesn't discourage you! I'm excited about getting #1382 fixed in trunk ASAP!
regarding https://github.com/markberger/tahoe-lafs/commit/739468e69676d30353814248f1fd7dd4b7b9d8f9:
regarding https://github.com/markberger/tahoe-lafs/commit/ff1a77924b03d286477496c0e0a3ddfe45353c61:
to:
No wait, actually remove `self.servermap_shareids` and `self.servermap_peerids` entirely and use the appropriate expression instead of those member variables where they are currently used.

For _flow_network():
Okay, I'm stopping reviewing ff1a77924b03d286477496c0e0a3ddfe45353c61 for now because I don't know how much of it is duplicated or near-duplicated code…
regarding https://github.com/markberger/tahoe-lafs/commit/e7af5f7995bc661d19d5c903359813fa91880966 and https://github.com/markberger/tahoe-lafs/commit/5875c31d32ef50467c572736599066fffe8b73ac:
to:
zooko: Thanks for reviewing my patch! I will start working on these changes. I'm hesitant to implement some of the changes you suggested for https://github.com/markberger/tahoe-lafs/commit/5875c31d32ef50467c572736599066fffe8b73ac because the next commit makes some changes to the same lines. If you also reviewed the next commit and those comments apply to both, please let me know and I'll go ahead with your changes.
Replying to markberger:
Will do.
I've pushed some commits which fix the issues with the documentation and the happiness_upload class.
I tried to refactor those two functions into one, but servers are indexed differently between the two and I couldn't figure out how to normalize the input. I will try again later when I have more time.
During dev chat today (September 11th, 2013), Zooko and I talked about some modifications to `happiness_upload.py` that would make it easier to understand:

* Explain when to use `_server_flow_graph` and when to use `_flow_network`.
* Combine `index_peers` and `index_shares` into one function.
* Rename `peerids` and `shareids` to something else which makes it clear that they are indices used by the flow network.
* Move the reindexing in `generate_mappings` to its own function for clarity. That way reindexing can occur in the function and not distract the reader from what is occurring.

Besides explaining why flow networks are used in the Edmonds-Karp algorithm, all of the above points have been addressed on 1382-rewrite.
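Since the question of why flow networks are used comes up here, the following is a minimal, self-contained sketch (not the project's `happiness_upload.py`) of the underlying idea: happiness is the size of a maximum matching between shares and servers, and that matching size can be computed as a max flow over a unit-capacity network using Edmonds-Karp (BFS augmenting paths).

```python
from collections import deque

def happiness_via_max_flow(sharemap):
    """sharemap: dict mapping share number -> set of server ids.

    Builds source -> shares -> servers -> sink with unit capacities and runs
    Edmonds-Karp; the resulting max flow equals the maximum matching size.
    """
    shares = sorted(sharemap)
    servers = sorted({s for peers in sharemap.values() for s in peers})
    src, sink = 0, 1 + len(shares) + len(servers)
    share_node = {sh: 1 + i for i, sh in enumerate(shares)}
    server_node = {s: 1 + len(shares) + i for i, s in enumerate(servers)}
    n = sink + 1
    cap = [[0] * n for _ in range(n)]  # dense capacity matrix; fine at this scale
    for sh in shares:
        cap[src][share_node[sh]] = 1
        for s in sharemap[sh]:
            cap[share_node[sh]][server_node[s]] = 1
    for s in servers:
        cap[server_node[s]][sink] = 1

    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = [None] * n
        parent[src] = src
        queue = deque([src])
        while queue and parent[sink] is None:
            u = queue.popleft()
            for v in range(n):
                if parent[v] is None and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] is None:
            break  # no augmenting path left; the flow is maximum
        v = sink
        while v != src:  # all capacities are 1, so each path adds exactly 1
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1
    return flow

# e.g. happiness_via_max_flow({0: {"A"}, 1: {"A", "B"}}) == 2
```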
I've opened a pull request for my branch here.
Mark: I looked at the pull request linked from comment:384172 again just now (with Peter Stuge's help), and I recognized most of the commits as things I've already read, and I see a lot of comments I've already left. Did you already take into account all previous comments? And if you did, did you put your latest code into new commits at the end of the 1382-rewrite branch, or a new branch, or what? Sorry if I'm being dumb, but please hold my hand and tell me where your latest code is now. Thanks! ☺ —Z
Hi Zooko! Sorry for the late response.
After the dev chat we had, I forgot to go and implement the changes we discussed. I quickly added the basic changes yesterday, and they can be found on the pull request. I investigated combining the loop over read only servers and the loop over write servers, but I did not have enough time to commit anything that works.
Sadly I don't have as much time to hack on tahoe as I had expected, but I will try to revisit the loop issue this weekend.
I've been working on the suggestions we discussed during last week's dev chat.
It turns out a lot of the query counting mechanics in place on my branch are incorrect. I'm shocked that it actually passes all of the tests in its current state. Also the current version tries to place shares on servers even when we have enough information to determine that a happy upload cannot occur.
I've started to correct these issues, as well as remove the redundant loop, in a branch here.
For the above branch, when `tracker.ask_about_existing_shares()` is called on `FakeStorageServer`, I receive the following error for each tracker:

Sadly, this error does not appear when you run `python setup.py trial -s allmydata.test.test_upload`. It simply fails, stating that the happiness error messages are wrong. To reproduce this error message, `import pdb` and call `pdb.set_trace()` in `_handle_existing_response` within `immutable/upload.py`.

I'm not sure why I'm receiving this error because the same function is called in 1.10. Additionally, there is some code that should give `get_buckets` a proper response here.

If anyone has some advice on how to fix this error, I would appreciate the help. I haven't been able to figure this bug out.
To correct the issue I was having in my previous comment, I added a `get_buckets` method to `FakeStorageServer`. I'm not sure why this is necessary since it seemed to work before, but it appears to fix my problem.

Today during the summit, we addressed the error messages which occur on the `1382-rewrite-2` branch. Daira raised some concerns that the uploader might terminate too early because the upload code is tied to repair. I haven't looked into that issue yet.

At the summit we also discussed the issue that repair should be a best-effort operation and should not exit early, even if it can tell that it isn't going to achieve the happiness set by `shares.happy`. However, I think that should already be addressed by the fixed ticket #1212, which set the happiness threshold to 0 when invoking the repairer. On the other hand, there should be an explicit test of this behaviour, and I'm not sure there is one already.

Adding the resulting happiness to repair and check reports is part of #1784.
Once we land this ticket, let's go revisit #614. I am inspired to say that by the discussion on #2106.
Mark:
Here are some more review comments. These are from looking at upload.py on the current version of your 1382-rewrite-2 branch.
* There is a constant `2` at line 340. A reader might wonder why that is there, and why it isn't `3` or `2.71828` or something. We should add a comment next to it saying something like "If there are a large number of servers, we don't want to spend a lot of time and network traffic talking to all of them before deciding which ones to upload to. Here we limit our consideration to the first 2*N servers. That ought to be enough. See docs/specifications/servers-of-happiness.rst for details.". Then add to servers-of-happiness.rst a discussion about that constant 2.
* Rename `generate_mappings` to `calculate_upload_plan`. ☺
* `servermap` is used in this file on line 54 to denote a data structure which is shaped like `{serverid: set(shnum)}`, but here a "servermap" is a data structure which is shaped like `{shnum: serverid}`, and then further down in the file "servermap" is used to mean something which is shaped like `{shnum: set(serverids)}`. Confusing! (Even though "servermap" is a reasonable word for any of these three shapes.) Could we think of different local names for the latter two? (Better not touch the first one since it is transmitted over the wire in the helper protocol, apparently.) Oh, here's a partial solution: the one on line 482 doesn't need to be named at all — just inline it instead of using a variable to hold it between line 482 and 483. ☺
* Rename `_isHappinessPossible` to `is_happiness_possible` and `Happiness_Upload` to `HappinessUpload`, to match our conventions.

Okay, I didn't really finish studying the entire upload.py (this version of it), but I'm stopping here for tonight. ☺
Okay, so about the "try the first 2*N servers" part (line 340) that I mentioned in comment:384180.

There are two limitations of this. The first one is relatively simple to solve and I'd like to do it in this branch. The second one is more invasive and I'd like to open a separate ticket for it.
The first one is that all 2*N of the first servers might be in read-only mode (and not already have shares of this file). Then the upload would fail. This is actually fixable without waiting for further network results, because whether the servers were already announced to be in read-only mode is already known (through the introduction protocol) by the client, so we could for example do something like the following. Replace:
with:
I guess we ought to write a unit test of this! Demonstrating a situation where the first 2*N servers are all read-only so the upload fails with the old code but succeeds with the new code.
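The Replace/with code blocks in the comment above were not preserved in the ticket migration. As a rough, hypothetical illustration of the idea (not the actual patch): filter out servers that the introducer has already announced as read-only before taking the first 2*N candidates, so that read-only servers at the front of the permuted list can't crowd out writeable ones.

```python
def candidate_servers(permuted_servers, total_shares, is_readonly):
    """Hypothetical helper; names invented for illustration.

    permuted_servers: servers in permuted (selection) order.
    total_shares: N, the total number of shares for this file.
    is_readonly(server): predicate based on the server's announcement.
    """
    writeable = [s for s in permuted_servers if not is_readonly(s)]
    return writeable[:2 * total_shares]
```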
The second limitation is the "deeper" one. Even if we pick 2N writeable servers at this stage, there's no guarantee that any of those 2N writeable servers will actually accept our subsequent writes. It could be that those 2*N of them are all full or read-only-mode now (even though they weren't back when the introduction protocol ran and informed our client about those servers), or they might all fail at this moment, or something. The current behavior is to give up on the upload if this happens. It would almost always be better to instead go ahead and ask the next few servers if they would hold shares.
I'm going to open a separate ticket for that…
Okay, I filed #2108 for the deeper issue raised in comment:384181.
Here is another review of upload.py from the current tip of 1382-rewrite-2. I'm looking at _get_next_allocation() on line 466, and it seems like it is doing some complicated computations in order to yield just a simple data structure — the ordered list of [(server, set(shnums))] that constitute the upload plan. It seems like the "calculate upload plan" (a.k.a. `_calculate_tasks()`) method could just generate that list and return it, and `_get_next_allocation()` could disappear. Am I missing anything?

I've pushed patches that address what zooko mentioned in comment:384180. I have not done anything in response to ticket #2108. I also removed `_get_next_allocation`.

I have a lot of free time to work on the patch for the next couple of weeks. If we could clear up what is necessary to get this merged, I would be able to work on this ticket.
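For illustration of the "calculate upload plan" refactor suggested above (hypothetical names, not the branch code), the plan-calculation step could simply return the ordered [(server, set(shnums))] list itself:

```python
def calculate_upload_plan(placements):
    """placements: dict mapping shnum -> server (a share-to-server matching).

    Returns an ordered list of (server, set_of_shnums) upload tasks, so no
    separate _get_next_allocation()-style iterator is needed.
    """
    tasks = {}
    for shnum, server in placements.items():
        tasks.setdefault(server, set()).add(shnum)
    # A deterministic order keeps tests and logs stable.
    return sorted(tasks.items(), key=lambda item: str(item[0]))
```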
Replying to markberger:
Hi Mark! Thanks for posting this. I just saw it (I've been spending all day on family Xmas stuff until now), and I'll see if I can answer any questions you have. I have a patch myself somewhere around here…
transcribed from IRC:
So, now I'm still confused about https://github.com/markberger/tahoe-lafs/blob/6f5a125283494d3bbf1d2fa143eddd3d5183c7d7/src/allmydata/test/test_upload.py#L734 is_happy_enough()
It uses k.
Why does it do that?
The (modern) definition of happiness doesn't include mention of K, right?
(It's just that nobody ever sets their H requirement to less than their K.)
So I suspect that is_happy_enough() isn't calculating exactly the right function, but we don't notice because it gets called only for well-distributed things, by test_upload, so "return True" also works for the current tests. (I tried this, and `return True` does indeed give the correct answer for all queries that test_upload.py sends to the `is_happy_enough` function.)

But it seems like we ought to replace it with a function that calculates the correct thing, to avoid confusing future test-writers.
So Mark: please write a new `is_happy_enough`, or explain to me why I'm wrong, or replace the call to `is_happy_enough` with a call to `allmydata.util.happinessutil.servers_of_happiness`.

So, I was trying to update docs/specifications/servers-of-happiness.rst to explain the algorithm and provide an "intuition" of it for readers, and look what happened: I confused myself again! I apparently do not have an "intuition" that matches the behavior of servers-of-happiness:
Servers of happiness is limited by the cardinality of the smaller set.
So in example 1 we have:
and then for example 2 we add:
This doesn't increase happiness because share 1 is already matched with server B. Matching shares with extra servers doesn't increase happiness. Since we don't have any homeless shares to match with server C, we are limited by the cardinality of the smallest set, and happiness is set to 2.
In example 4 we have:
We can match server A with share 0 and server B with share 1. Once again we are limited by the cardinality of the smaller set, so even though we matched server A with share 1 and server B with share 2 as well, those pairs do not increase happiness. Our set of matchings already has servers A and B, so any additional pairs with servers A and B will not increase happiness.
However, when we add:
we are no longer limited by the cardinality of 2 because each set has a cardinality of 3. So now we can match server A with share 0, server B with share 1, and server C with share 2. Each of these pairings has a unique server and share that is used only once in our set of matchings; therefore each pair contributes to a happiness of 3.
That explanation would be the intuitive way to think about the maximum matching of bipartite graphs. I think it is clearer if you show pictures of bipartite graphs, but introducing such graphs might be outside the scope of the document.
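For concreteness, the examples described above can be checked mechanically. The share maps below are reconstructed from the prose (and are illustrative only); the brute-force matcher is a self-contained check, not project code.

```python
def happiness_brute_force(sharemap):
    """sharemap: dict mapping share number -> set of candidate server ids.

    Exhaustively finds the size of a maximum matching; fine at toy sizes.
    """
    shares = list(sharemap)

    def best(i, used_servers):
        if i == len(shares):
            return 0
        score = best(i + 1, used_servers)  # leave share i unplaced
        for server in sharemap[shares[i]]:
            if server not in used_servers:
                score = max(score, 1 + best(i + 1, used_servers | {server}))
        return score

    return best(0, frozenset())

# A holds shares 0 and 1, B holds shares 1 and 2: happiness is 2.
assert happiness_brute_force({0: {"A"}, 1: {"A", "B"}, 2: {"B"}}) == 2
# Adding server C with share 2 raises happiness to 3.
assert happiness_brute_force({0: {"A"}, 1: {"A", "B"}, 2: {"B", "C"}}) == 3
```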
Replying to zooko:
Calling `allmydata.util.happinessutil.servers_of_happiness` from `is_happy_enough` and trusting the result would not preserve the intent of the test, because there might be a bug in `happinessutil` that would then propagate to the test and result in a false pass.

Since the test only checks for happiness rather than unhappiness, it would be possible to call a function that returns a purported maximum matching, and check that it actually is a matching. It is not necessary to check that the matching is maximum, because not being maximum could only cause the test to fail; it could not cause a false pass.
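A check along the lines daira describes could look something like the sketch below (`is_valid_matching` is a hypothetical helper, not an existing project function): verify that the returned placement really is a matching, without trying to verify maximality.

```python
def is_valid_matching(matching, sharemap):
    """matching: dict share -> server; sharemap: dict share -> allowed servers.

    Returns True if each matched share is on a permitted server and no two
    shares claim the same server. Maximality is deliberately not checked: a
    non-maximum matching can only make a test fail, never falsely pass.
    """
    servers_used = list(matching.values())
    if len(servers_used) != len(set(servers_used)):
        return False  # two shares claim the same server
    return all(server in sharemap[share] for share, server in matching.items())
```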
Ah, the maximum matching is generated in happiness_upload.py. I was not expecting it to be there; I was expecting it to be in happinessutil.py. Does this mean there is duplication between those two files that should be factored out?
I updated the docs! Please review! https://github.com/zooko/tahoe-lafs/commit/69ec1ad94b29cd06ffc08b19286f6d6ee989a857
comment x-posted from github:
Reviewed; clear and descriptive. Only one quibble: it is not explicitly clear at which point(s) the uploader strategy would fail, that is, determine that it couldn't achieve happiness, either in principle or in practice.
I worked on this for a few minutes today and posted notes about what I found as I went. It is entirely possible that the problems I found will turn out to be non-problems once I look more closely at them, and if they are problems they are shallow ones and I'll experiment with cleaning them up myself: [//pipermail/tahoe-dev/2014-February/008908.html].
Here are some of my notes from previous sessions of work on my branch:
This will also fix #1791.
I believe this will fix #1124. We need to confirm that and make sure there's a unit test that demonstrates that.
I'm commenting on "Calculating Share Placements" in here:
https://github.com/zooko/tahoe-lafs/blob/1382-rewrite-3/docs/specifications/servers-of-happiness.rst
I might be wrong but... I think step 10 should be written as follows:
If any placements from step 9 fail (generated in step 7), remove the server from the set of possible servers and regenerate the matchings. Go back to step 6.
I will re-read Mark's docs and consider comment:384198.
Please read http://markjberger.com/upload-strategy-of-happiness/ for context.
I've agreed (OMG what have I gotten myself into) to try to understand this patch, the math behind it, and try to review it and/or rearrange it into smaller pieces that other folks can help review. We still think of this as a blocker for 1.11 (it's kind of the showcase feature), but it's likely to push the release out way past our target. Oh well.
Okay I've updated the docs and included diagrams! https://github.com/zooko/tahoe-lafs/commits/1382-rewrite-4
I intend to squash it into a few coherent commits on a new branch, to be named `1382-rewrite-5`.

My most recent code is in Zooko's 1382-rewrite-3 (ae77df1297cdd). If you're working on the code, please start there! :-) See also my notes from when I was last working on this: comment:384195.

My most recent doc is in Zooko's 1382-rewrite-4 (3cd00fb32c5ca).

I intend to re-edit and rebase the docs into a new branch, which will be named `1382-rewrite-5`.

We hashed out some more docs in an etherpad, results recorded here: https://gist.github.com/warner/66b604617d307374fe17
Okay, I've written a new introductory doc to complement the specification doc. Brian: please read this and tell me if it makes the servers-of-happiness algorithm seem desirable to you. ☺
Here's the patch that adds the docs:
https://github.com/zooko/tahoe-lafs/commit/89261a922d6fa5165d0341b84c3f09b972f55c95
Here's the introductory doc, rendered:
https://github.com/zooko/tahoe-lafs/blob/89261a922d6fa5165d0341b84c3f09b972f55c95/docs/upload-happiness.rst
Current status of this ticket is: blocked on Brian reading https://github.com/zooko/tahoe-lafs/blob/89261a922d6fa5165d0341b84c3f09b972f55c95/docs/upload-happiness.rst
I've read that, and I like it. Nice explanation!
Milestone renamed
moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders
Mark Berger's code from comment:384172 merged into current master: https://github.com/tahoe-lafs/tahoe-lafs/pull/362
I have a branch based on rebasing instead, which gets to about the same point, but there are tests failing depending on the value of `PYTHONHASHSEED` -- it looks like "because it's choosing different shares" and/or the tests are wrong. But I wouldn't expect the hash seed to affect how we choose share placement.

Looks like review-needed is not the case at the moment, so I'll remove the tag if nobody minds.
At the Tahoe-LAFS summit recently, we clarified and edited the "servers of happiness" document (that's included in this branch, among a bunch of others). So, this should allow us to write proper tests and then move this forward.
meejah and I made some progress in our pairing session today.

I continued work in my dev branch:
https://github.com/david415/tahoe-lafs/tree/1382.markberger-rewrite-rebase.3

I noticed that happinessutil had a duplicate servers-of-happiness subroutine, here:
https://github.com/tahoe-lafs/tahoe-lafs/blob/ce47f6aaee952bfc9872458355533af6afefa481/src/allmydata/util/happinessutil.py#L80

I wrote a wrapper so that servers_of_happiness in happinessutil calls our new servers-of-happiness routine:
https://github.com/david415/tahoe-lafs/blob/b49fc2eff5666d1215dbf4f532f0e14e65921405/src/allmydata/util/happinessutil.py#L81-L91

Is this the right approach?
Here are my current attempts to make all the tests pass:
https://github.com/david415/tahoe-lafs/tree/1382.markberger-rewrite-rebase.3

Currently some tests are still failing, in particular:
allmydata.test.test_upload.EncodingParameters.test_exception_messages_during_server_selection
Recent progress here:
https://github.com/david415/tahoe-lafs/tree/1382.markberger-rewrite-rebase.4

I fixed a few broken unit tests... at least I thought they were broken, because they failed depending on the value set in PYTHONHASHSEED.

I'm currently stuck on trying to get this unit test to pass:
meejah's latest dev branch https://github.com/meejah/tahoe-lafs/tree/1382.markberger-rewrite-rebase.5
The following tests are failing:
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
allmydata.test.test_checker.BalancingAct.test_good_share_hosts
I have submitted a pull request (https://github.com/tahoe-lafs/tahoe-lafs/pull/402) that is a rebased and squashed version of the latest stuff, with fixes for the last couple of tests that failed. One was "not enough sorting", basically (so the answers weren't as stable as they could be), and the other was a test that wasn't quite correct?
In 53130c6/trunk: