use different encoding parameters for dirnodes than for files #1252

Open

opened 2010-11-08 15:14:41 +00:00 by daira · 6 comments

daira commented

2010-11-08 15:14:41 +00:00

As pointed out in [this tahoe-dev subthread](http://tahoe-lafs.org/pipermail/tahoe-dev/2010-November/005558.html), if you only know how to reach a given file via a sequence of dirnodes from some root, then loss of any of those dirnodes will effectively make the file unavailable. This implies that you might want to choose the encoding parameters and happiness threshold to provide greater redundancy for dirnodes.

daira added labels 2010-11-08 15:14:41 +00:00

daira added this to the 1.9.0 milestone 2010-11-08 15:14:41 +00:00

daira commented

2011-01-06 02:00:55 +00:00

Author

This is too intrusive for 1.8.2, but should be a high priority for 1.9.0.

daira self-assigned this 2011-01-06 02:00:55 +00:00

zooko commented

2011-01-06 06:45:10 +00:00

I don't agree with the whole idea. We don't know that a given directory is more important than the files it references -- I often reference files from a directory which I also have directly linked from other directories or from my bookmarks or my personal html files or what have you. And we don't know that directories are smaller than files -- sometimes people use small files and sometimes people use large directories.

I think if you have any non-negligible chance at all of losing an object then you have a problem, and the solution to that problem is not automated configuration in which some objects get more robust encoding than others. That's because the default fixed encoding for *all* objects should be robust enough to make probabilistic loss due to "oops I didn't have enough shares" never happen. If you're still having object loss then you have a deeper problem than "Oh, I needed a more robust encoding."

I also don't like adding to the cognitive complexity for users, who already struggle mightily to understand what erasure coding means, what the default settings are, how the shares are distributed to servers, what happens when you re-upload, what happens if you change the encoding configuration, etc., etc. Having different types of objects encoded differently is only going to make it harder for them to understand and manage their grid and address their *real* risks of data loss.

warner commented

2011-01-06 19:17:32 +00:00

I guess I'm +0 on the general idea of making dirnodes more robust than the default, and -0 about the implementation/configuration complexity involved.

If you have a deep directory tree, and the only path from a rootcap to a filenode is through 10 subdirectories, then your chances of recovering the file are $P(\text{recover\_dirnode})^{10} \cdot P(\text{recover\_filenode})$. We provision things to make sure that P(recover_node) is extremely high, but that $x^{10}$ is a big factor, so making P(recover_dirnode) even higher isn't a bad idea.

But I agree that it's a pretty vague heuristic, and it'd be nicer to have something less uncertain, or at least some data to work from. I'd bet that most people retain a small number of rootcaps and use them to access a much larger number of files, and that making dirnodes more reliable (at the cost of more storage space) would be a good thing for 95% of the use cases. (note that folks who keep track of individual filecaps directly, like a big database or something, would not see more storage space consumed by this change).

On the "data to work from" front, it might be interesting if `tahoe deep-stats` built a histogram of node-depth (i.e. number of dirnodes traversed, from the root, for each file). With the exception of multiply-linked nodes and additional external rootcaps, this might give us a better notion of how much dirnode reliability affects filenode reachability.

I'll also throw in a +0 for Zooko's deeper message, which perhaps he didn't state explicitly this particular time, which is that our P(recover_node) probability is already above the it-makes-sense-to-think-about-it-further threshold: the notion that unmodeled real-world failures are way more likely than the nice-clean-(artificial) modeled all-servers-randomly-independently-fail-simultaneously failures. Once your P(failure) drops below $10^{-5}$ or something, any further modeling is just an act of self-indulgent mathematics.

I go back and forth on this: it feels like a good exercise to do the math and build a system with a theoretical failure probability low enough that we don't need to worry about it, and to keep paying attention to that theoretical number when we make design changes (e.g. the reason we use segmentation instead of chunking is that the math says chunking is highly likely to fail). It's nice to be able to say that, if you have 20 servers with Poisson failure rates X and repair with frequency Y, then your files will have Poisson durability Z (where Z is really good). But it's also important to remind the listener that you'll never really achieve Z because something outside the model will happen first: somebody will pour coffee into your only copy of ~/.tahoe/private/aliases, put a backhoe into the DSL line that connects you to the whole grid, or introduce a software bug into all your storage servers at the same time.

(incidentally, this is one of the big reasons I'd like to move us to a simpler storage protocol: it would allow multiple implementations of the storage server, in different languages, improving diversity and reducing the chance of simultaneous non-independent failures).

So anyways, yeah, I still think reinforcing dirnodes might be a good idea, but I have no idea how good, or how much extra expansion is appropriate, so I'm content to put it off for a while yet. Maybe 1.9.0, but I'd prioritize it lower than most of the other 1.9.0-milestone projects I can think of.
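To make the depth factor in this comment concrete, here is a minimal Python sketch of the path-recovery product. The function name and the per-node probabilities are illustrative assumptions, not measured Tahoe-LAFS figures, and the sketch assumes node recoveries are independent.

```python
def path_recovery_probability(p_dirnode: float, depth: int, p_filenode: float) -> float:
    """Chance of reaching a file through `depth` dirnodes, each recovered
    independently with probability p_dirnode, followed by the filenode itself."""
    return p_dirnode ** depth * p_filenode


# Hypothetical per-node recovery probabilities, chosen only for illustration.
for p in (0.99, 0.999, 0.99999):
    reach = path_recovery_probability(p, depth=10, p_filenode=p)
    print(f"per-node p = {p}: reach a file 10 directories deep with p = {reach:.5f}")
```

With equal per-node probabilities the result is simply the per-node value raised to the 11th power, which is why a deep tree amplifies even a small per-node loss probability.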

swillden commented

2011-01-06 22:09:53 +00:00

Owner

Replying to [warner](/tahoe-lafs/trac/issues/1252#issuecomment-382046):

> I'll also throw in a +0 for Zooko's deeper message, which perhaps he didn't
> state explicitly this particular time, which is that our P(recover_node)
> probability is already above the it-makes-sense-to-think-about-it-further
> threshold: the notion that unmodeled real-world failures are way more likely
> than the nice-clean-(artificial) modeled
> all-servers-randomly-independently-fail-simultaneously failures. Once your
> P(failure) drops below $10^{-5}$ or something, any further modeling is just an
> act of self-indulgent mathematics.

I have to disagree with this, both with Zooko's more generic message and your formulation of it.

Tahoe-LAFS files do NOT have reliabilities above the it-makes-sense-to-think-about-it level. In fact, for some deployment models, Tahoe-LAFS default encoding parameters provide insufficient reliability for practical real-world needs, even ignoring extra-model events.

This fact was amply demonstrated by the problems observed at Allmydata.com. Individual file reliabilities may appear astronomical, but it isn't individual file reliabilities that matter. We're going to be unhappy if ANY files are lost.

When the number of shares N is much smaller than the number of servers in the grid (as was the case at allmydata.com) then failure of a relatively tiny number of servers will destroy files with shares on all of those servers. Given a large enough server set, and enough files, it becomes reasonable to treat each file's survivability as independent and multiply them all to compute the probability of acceptable file system performance -- which means that the probability of the user perceiving a failure isn't just $p^d$ (where d is the directory depth), it's (roughly) $p^t$, where t is the total number of files the user has stored. An $x^{10}$ factor is one thing, but allmydata.com was facing a factor more like $x^{1000}$ or $x^{10000}$ on a per-user basis, and an exponent of many millions (billions?) for the whole system.

Given a grid of 10 servers, what is the probability that 8 of them will be down at one time? What about a grid of 200 servers? This is the factor that kicked allmydata.com's butt, and it wasn't any sort of black swan. I'm not arguing that black swans don't happen; I'm arguing that the model says grids like allmydata.com's have inadequate reliability using 3-of-10 encoding. Then you can toss black swans on top of that.

In fact, I think for large grids you can calculate the probability of any file being lost with, say, eight servers out of action as the number of ways to choose the eight dead boxes divided by the number of ways to choose 10 storage servers for a file. Assuming 200 total servers, that calculation says that with 8 of them down, one out of every 400 files would be unavailable, and that ignores the unreachability problem due to the portion of those unavailable files that are dircaps AND it assumes uniform share distribution, where in practice I'd expect older servers to have more shares, and also to be more likely to fail.
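As a rough check on the arithmetic in the paragraph above, this Python sketch evaluates the ratio exactly as it is described there. For comparison it also prints a hypergeometric estimate, under the added assumption that each file places its 10 shares on 10 distinct servers chosen uniformly at random. The variable names are illustrative. The two figures differ by many orders of magnitude, so the sketch does not settle which model fits a real grid.

```python
from math import comb

SERVERS = 200  # total storage servers (figure from the comment above)
DEAD = 8       # servers assumed to be down at the same time
SHARES = 10    # N in 3-of-10 encoding, one share per distinct server (assumption)

# Ratio as worded in the comment: ways to choose the dead boxes divided by
# ways to choose the servers holding a file's shares.
ratio_as_described = comb(SERVERS, DEAD) / comb(SERVERS, SHARES)

# Hypergeometric view: with 3-of-10, a file is lost only when at least 8 of its
# 10 share-holding servers are dead, i.e. all 8 dead servers hold its shares.
hypergeometric = comb(SERVERS - DEAD, SHARES - DEAD) / comb(SERVERS, SHARES)

print(f"ratio as described: about 1 in {1 / ratio_as_described:.0f}")
print(f"hypergeometric estimate: {hypergeometric:.2e}")
```

The first figure reproduces the "one out of every 400" estimate in the comment. The second depends entirely on the uniform-placement assumption, which the comment itself notes does not hold in practice.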

To achieve acceptable reliability in large grids N must be increased significantly.

The simplest way to think about and model it is to set N equal to the number of storage servers. In that scenario, assuming uniform share distribution and the same K for all files, the entire contents of the grid lives or dies together and the simple single-file reliability calculation works just fine, so if you can get it up to $1-10^{-5}$ (with realistic assumptions) there's really no need to bother further, and there's certainly no need to provide different encoding parameters for dirnodes. There's little point in making sure the directories survive if all the files are gone.

If you don't want to set N that large for large grids, the other option is to accept that you have an exponent in the millions, and choose encoding parameters such that you still have acceptable predicted reliability. If you want to store 100M files, and have an aggregate survival probability of $1-10^{-5}$, you *need* an individual survival probability on the order of $1-10^{-13}$, minimum. Even for a thousand files you need an individual p in the neighborhood of $1-10^{-8}$.
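The per-file requirement can be worked out directly. The Python sketch below assumes file losses are independent, as the comment does, and solves $(1-q)^t = 1 - Q$ for the per-file loss probability q. The function name is hypothetical; `expm1` and `log1p` are used because the quantities are too close to 1 for naive floating-point subtraction.

```python
import math


def per_file_loss_budget(aggregate_loss: float, n_files: int) -> float:
    """Largest per-file loss probability q such that n_files independent files
    all survive with probability at least 1 - aggregate_loss."""
    # Solve (1 - q) ** n_files == 1 - aggregate_loss for q, in log space.
    return -math.expm1(math.log1p(-aggregate_loss) / n_files)


for n_files in (1_000, 100_000_000):
    q = per_file_loss_budget(1e-5, n_files)
    print(f"{n_files:>11,d} files: per-file loss probability must be <= {q:.2e}")
```

For small probabilities this reduces to dividing the aggregate loss budget by the number of files.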

Oh, and when calculating those probabilities it's very important not to overestimate storage server reliability. The point of erasure coding is to reduce the server reliability requirements, which means we tend to choose less-reliable hardware configurations for storage servers -- old boxes, cheap blades, etc. Assuming 99.9% availability on such hardware is foolish. I think 95% is realistic, and I choose 90% to be conservative.

Luckily, in a large grid it is not necessary to increase redundancy in order to get better survival probabilities. Scaling up both K and N in equal proportions increases reliability, fairly rapidly. 9-of-30 encoding produces a per-file reliability of $1-10^{-16}$, for example.
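A simple binomial model illustrates the effect of scaling K and N together. This Python sketch assumes one share per server, independent server failures, and the conservative 90% server availability mentioned above. The function name is hypothetical, and the model ignores repair and share placement.

```python
from math import comb


def file_loss_probability(k: int, n: int, server_availability: float) -> float:
    """Probability that fewer than k of n shares survive, with one share per
    server and each server independently available with the given probability."""
    p = server_availability
    # Sum the binomial terms for 0 .. k-1 surviving shares. Computing the loss
    # directly avoids the cancellation in 1 - P(survive).
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


for k, n in ((3, 10), (9, 30)):
    loss = file_loss_probability(k, n, server_availability=0.90)
    print(f"{k}-of-{n}: per-file loss probability ~ {loss:.2e}")
```

Both encodings have the same 10/3 expansion factor, yet the larger one tolerates 21 simultaneous failures instead of 7, so under this model its loss probability is several orders of magnitude smaller.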

Bringing this line of thought to bear on the question at hand: I don't think it makes much sense to change the encoding parameters for dirnodes. Assuming we choose encoding parameters such that $p^t$ is acceptable, an additional factor of $p^d$ won't make much difference, since $t \gg d$.


daira modified the milestone from 1.9.0 to undecided

2011-05-28 19:30:09 +00:00

daira commented

2013-07-23 19:25:20 +00:00

Author

I suspect part of the difference between Zooko's and my opinion on this issue is that I already see the complexity of potentially having different encoding parameters for different objects as a sunk cost. And I agree completely with "Tahoe-LAFS files do NOT have reliabilities above the it-makes-sense-to-think-about-it level."

srl commented

2013-07-23 19:42:10 +00:00

Owner

Cognitive complexity is a real issue; however, I think erasure coding parameters are already complex enough that having three more optional parameters on top of the existing three seems only slightly more complex. I would definitely use this: I'd set dircaps to (say) 1-of-4 while other files are 2-of-4 or 3-of-4.
