small mutable files #197

Closed
opened 2007-10-29 22:29:57 +00:00 by warner · 17 comments
warner commented 2007-10-29 22:29:57 +00:00
Owner

Last week, zooko and I finished designing "Small Distributed Mutable Files".
The design notes are in source:docs/mutable.txt . Now it's time to implement
them.

Our hope is to have this in an 0.6.2 release within two weeks (since I'm
giving a talk on 9-Nov and it'd be nice to have it ready by then), but if
that doesn't work out, the goal is for the 0.7 release.

Here's the task list, so we can claim different pieces without colliding too
much. Please edit this summary when you claim a piece.

  • adding RSA to allmydata.Crypto
    • not sure quite how yet: see #11
  • backend: new methods in IStorageServer to allocate/locate/read/write
    mutable buckets, MutableBucketWriter class to implement the server-side
    container format (see the interface sketch below this list)
    -CLAIMED BY warner, 80% done-
  • client-side SMDF slot format wrangling, given a chunk of data and the
    right keys, generate the slot contents
  • client-side peer selection: walk through peers, find existing shares,
    decide upon which shares go where
    • recovery algorithm
  • client-side filenode class: API should have replace method.
  • client-side dirnode class -CLAIMED BY warner-
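
A rough, illustrative sketch of what the server-side mutable-slot methods
might look like (the class and method names below are assumptions for
discussion, not the final IStorageServer interface):

```python
class MutableSlotServer:
    """Toy stand-in for the server side of mutable buckets (illustrative only)."""

    def __init__(self, storedir):
        self.storedir = storedir  # directory holding the share containers

    def allocate_mutable_slot(self, storage_index, sharenums, write_enabler):
        """Create empty mutable share containers for the given share numbers,
        protected by the write-enabler secret."""

    def slot_readv(self, storage_index, sharenums, read_vector):
        """Read byte ranges [(offset, length), ...] from each requested share
        and return {sharenum: [data, ...]}."""

    def slot_writev(self, storage_index, write_enabler, write_vector,
                    new_length=None):
        """Apply write vectors to existing shares, optionally resizing the
        container to new_length (resizing must not corrupt the share)."""
```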

Distributed Dirnodes (#115) will be implemented on top of this.

tahoe-lafs added the
code-encoding
major
enhancement
0.6.1
labels 2007-10-29 22:29:57 +00:00
tahoe-lafs added this to the 0.7.0 milestone 2007-10-29 22:29:57 +00:00
warner commented 2007-10-29 22:43:12 +00:00
Author
Owner

I'll start with the backend. We're putting off RSA for a day to give zooko some time to look at alternatives to pycrypto

warner commented 2007-10-31 07:43:46 +00:00
Author
Owner

got most of the server-side slots in place. Needs more test coverage to make sure we don't corrupt shares on resize, etc. Also leases need a lot more test coverage.

Tomorrow I'll start looking at the layers above that.

zooko commented 2007-10-31 15:26:34 +00:00
Author
Owner

This is the definitive feature for v0.7.0 release.

If we get this running in time for the next public presentation of Tahoe (which Brian will be giving), then we'll release v0.7. If we don't, then we'll release v0.6.2.

Oh, except that there might be a backwards-incompatibility due to sha-256. I haven't yet determined if our use of sha-256 would run afoul of that bug. But anyway, if there is a backwards-incompatibility then we have to bump the minor version number, so then it will be v0.7 regardless.

tahoe-lafs added
blocker
and removed
major
labels 2007-10-31 15:26:34 +00:00
zooko commented 2007-11-01 17:10:17 +00:00
Author
Owner

This will fix [#187 security flaw: directory server can escalate read access into write access]. Merged #187 into this ticket.

zooko commented 2007-11-01 20:16:27 +00:00
Author
Owner

Please see comment:/tahoe-lafs/trac-2024-07-25/issues/7607:9, the end of which mentions that perhaps URI isn't a good place for the "dir-or-file" type.

warner commented 2007-11-08 11:01:49 +00:00
Author
Owner

Whew. Two weeks of intense design, followed by two weeks of very intense
coding, and Tahoe now has a shiny new mutable file implementation!

It isn't quite ready yet.. there are about two days of integration work left.
So no release this week, I'm afraid. But early next week.

The integration tasks left:

  • the Client class is where upload/download/create methods should live.
    That means client.upload(target), client.download_to(target),
    client.create_dirnode(), client.get_node_for_uri(uri).
    Direct access to the uploader/downloader should go away, and everything
    should go through the Client. We might want to move this to a separate
    object (or at least a separate Interface), but everything that accesses
    the grid layer should go through the same object (a rough sketch of such
    a facade follows this list).
  • the client should create a new-style dirnode upon first boot instead of
    an old-style one
  • the old dirnode code should be removed, along with the vdrive client-side
    code and the vdrive server (and the vdrive.furl config file)
  • dirnode2.py should replace dirnode.py
  • URIs for the new mutable filenodes and dirnodes are a bit goofy-looking.
    URI:DIR2:... . I suggest we leave them as they are for this release
    but set a task for the next release to go through and clean up all URIs
    (also settle on shorter hash lengths to make the URIs smaller, and
    probably also start storing binary representations of the read/write caps
    in dirnodes rather than printable URI strings)
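
A minimal sketch of such a Client facade (the collaborator objects and their
method names here are assumptions, not the existing allmydata classes):

```python
class Client:
    """Single entry point for everything that touches the grid layer."""

    def __init__(self, uploader, downloader, node_maker):
        self._uploader = uploader        # produces immutable CHK files
        self._downloader = downloader    # fetches and decodes shares
        self._node_maker = node_maker    # maps URIs to filenode/dirnode objects

    def upload(self, uploadable):
        # returns a Deferred that fires with the new file's URI
        return self._uploader.upload(uploadable)

    def download_to(self, uri, target):
        # returns a Deferred that fires when 'target' holds the plaintext
        return self._downloader.download_to(uri, target)

    def create_dirnode(self):
        # returns a Deferred that fires with a new mutable dirnode
        return self._node_maker.create_mutable_dirnode()

    def get_node_for_uri(self, uri):
        # parse the URI and return the matching (mutable or immutable) node
        return self._node_maker.make_node(uri)
```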

Testing tasks left around this stuff:

  • the solaris buildslave is not actually testing new code (#145), so we need
    to know that pycryptopp is working on solaris before release
  • docs/README about the new dependency on libcrypto++
  • .debs for pycryptopp so we can install on the test grid machines

Mutable file tasks, security vulnerabilities, not for this release but soon
thereafter:

  • check the share hash chain (Retrieve._validate_share_and_extract_data). At
    the moment we check the block hash tree but not the share hash chain, so
    byzantine storage servers could trivially corrupt our data.
  • rollback-attacks: we chose a policy of "first retrievable version wins"
    on download, but for small grids and large expansion factors (i.e. small
    values of k) this makes it awfully easy for a single out-of-date server to
    effectively perform a rollback attack against you. I think we should
    define some parameter epsilon and use the highest-seqnum'ed retrievable
    version from k+epsilon servers (see the sketch after this list).
  • consider verifying the signature in Publish._got_query_results. It's
    slightly expensive, so I don't want to do it without thinking it through.
    It would help prevent DoS attacks where a server makes us think that
    there's a colliding write taking place.
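
A sketch of how the proposed k+epsilon policy might pick a version during
download (the data structures are assumptions; for simplicity it treats each
server as holding a single share):

```python
def choose_version(responses, k, epsilon):
    """responses: {server_id: (seqnum, root_hash)} gathered so far.
    Wait for at least k+epsilon answers, then return the highest-seqnum
    version that looks recoverable, or None to keep querying."""
    if len(responses) < k + epsilon:
        return None  # not enough servers heard from yet

    # count how many servers report each (seqnum, root_hash) version
    counts = {}
    for version in responses.values():
        counts[version] = counts.get(version, 0) + 1

    # a version is treated as recoverable if at least k servers report it
    recoverable = [ver for ver, n in counts.items() if n >= k]
    if not recoverable:
        return None  # nothing recoverable yet; query more servers
    return max(recoverable, key=lambda ver: ver[0])  # highest seqnum wins
```

With k=3 and epsilon=2, for example, a single out-of-date server can no
longer win by itself: the client hears from five servers before trusting a
version.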

Mutable file tasks, lower priority:

  • analyze control flow to count the round trips. I was hoping we could get
    an update done in just one RTT but at the moment it's more like 3 or 4.
    It's much more challenging than I originally thought.
  • try to always push a share (perhaps N+1) to ourselves, so we'll have the
    private key around. It would be sad to have a directory that had contents
    that were unrecoverable but which we could no longer modify because we
    couldn't get the privkey anymore.
  • choose one likely server (specifically ourselves) during publish to
    use to fetch our encprivkey. This means doing an extra readv (or perhaps
    just an extra-large readv) for that one server in _query_peers: the rest
    can use pretty small reads, like 1000 bytes. This ought to save us a
    round-trip.
  • error-handling. peers throwing random remote exceptions should not cause
    our publish to fail unless it's for NotEnoughPeersError.
  • the notion of "container size" in the mutable-slot storage API is pretty
    fuzzy. One idea was to allow read vectors to refer to the end of the
    segment (like python string slices using negative index values), for which
    we'd need a well-defined container size. I'm not sure this is actually
    useful for anything, though. (maybe grabbing the encrypted privkey, since
    it's always at the end?). Probably not useful until MDMF where you'd want
    to grab the encprivkey without having to grab the whole share too (a toy
    example of the negative-offset idea follows this list).
  • tests, tests, tests. There are LOTS of corner cases that I want coverage
    on. The easy ones are what download does in the face of out-of-date
    servers. The hard ones are what upload does in the face of simultaneous
    writers.
  • write-time collision recovery. We designed an algorithm (in
    docs/mutable.txt) to handle this well. There is a place for it to be
    implemented in allmydata.mutable.Publish._maybe_recover .
  • Publish peer selection: rebalance shares on each publish, by noticing when
    there are multiple shares on a single peer and also unused peers in the
    permuted list. The idea is that shares created on a small grid should
    automatically spread out when updated after the grid has grown.
  • RSA key generation takes an unfortunately long time (between 0.8 and 3.2
    seconds in my casual tests). This will make an RW deepcopy of a large
    directory structure pretty slow. We should do some benchmarking of this
    thing to determine key size / speed tradeoffs, and maybe at some point
    consider ECC if it could be faster.
  • code terminology: share vs slot vs container, "SSK" vs mutable file vs
    slot. We need to nail down the meanings of some of these and clean up
    the code to match.
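
A toy example of the negative-offset read-vector idea from the "container
size" bullet above, mirroring Python slice semantics (this helper is purely
illustrative and not part of the storage API):

```python
def resolve_readv(read_vector, container_size):
    """Turn [(offset, length), ...] with possibly-negative offsets into
    absolute (offset, length) pairs clipped to the container."""
    resolved = []
    for offset, length in read_vector:
        if offset < 0:
            offset += container_size   # e.g. (-1024, 1024) reads the tail
        offset = max(0, min(offset, container_size))
        length = max(0, min(length, container_size - offset))
        resolved.append((offset, length))
    return resolved

# grabbing the last 1024 bytes (say, the encrypted privkey region) of a
# 50000-byte container:
#   resolve_readv([(-1024, 1024)], 50000)  ->  [(48976, 1024)]
```
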
warner commented 2007-11-08 18:19:34 +00:00
Author
Owner

If we provide web UI access to mutable files (as opposed to merely mutable
directories), say, so a certain developer could put an HTML presentation
together without being limited to a strict tree structure in the HREFs, I
think it should look like the following:

  • the "Upload File" button either acquires a "Mutable?" checkbox, or there
    are two separate buttons, one for immutable, one for mutable.

  • all entries in the directory listing that are mutable files should provide
    a way to upload new contents: either a (choose-file, "Replace"-button)
    pair, or a button labeled "Replace.." that takes you to a separate page
    from which you can choose a file and hit the real "Replace" button.

warner commented 2007-11-08 18:22:48 +00:00
Author
Owner

This is an issue for a different ticket, but I'd also like to see our dirnode
URIs changed to be a clear wrapper around a filenode URI. When the internal
filenode URI is a mutable file (RW or RO), the dirnode is mutable (RW or RO).
When the internal filenode URI is an immutable CHK file, that dirnode is
immutable. When we implement the deep-copy facility, it will have an option
to either create a new tree of mutable RW dirnodes, or a tree of immutable
dirnodes, and the latter can be done much more efficiently by creating CHK
files (no expensive RSA key generation) instead of mutable files.
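
A tiny sketch of the wrapper idea (the prefix and helper names here are made
up for illustration; the real URI formats are defined elsewhere):

```python
def wrap_as_dirnode_uri(filenode_uri):
    # a dirnode cap is just the underlying filenode cap plus a marker;
    # mutability (RW / RO / immutable) follows the inner cap
    return "URI:DIR2:" + filenode_uri

def unwrap_dirnode_uri(dirnode_uri):
    assert dirnode_uri.startswith("URI:DIR2:")
    return dirnode_uri[len("URI:DIR2:"):]
```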

warner commented 2007-11-08 19:48:35 +00:00
Author
Owner

I mentioned this one above:

* consider verifying the signature in Publish._got_query_results. It's
  slightly expensive, so I don't want to do it without thinking it
  through. It would help prevent DoS attacks where a server makes us
  think that there's a colliding write taking place.

I've decided we should verify it here; the cryptopp benchmarks show that my
Mac at home takes 560 microseconds per RSA2048 verification, so adding an
extra 10 checks will only add 5.6ms to the publish time, which is small
compared to the 42ms that the same computer will take to do the RSA2048
signature needed to publish the new contents.
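
The back-of-the-envelope numbers above, written out (the 560-microsecond and
42ms figures are the measurements quoted in this comment):

```python
verify_cost = 560e-6   # seconds per RSA-2048 signature verification
sign_cost = 42e-3      # seconds per RSA-2048 signature (the publish itself)
extra_checks = 10      # roughly one verification per queried server

overhead = extra_checks * verify_cost
print("extra verify time: %.1f ms" % (overhead * 1e3))               # 5.6 ms
print("relative to signing: %.0f%%" % (100 * overhead / sign_cost))  # ~13%
```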

warner commented 2007-11-08 20:03:29 +00:00
Author
Owner

I went ahead and pushed the verify-signatures-in-Publish._got_query_results change. It still needs testing, of course, just like all the other failure cases.

zooko commented 2007-11-13 18:03:10 +00:00
Author
Owner

I've created ticket #207 to hold all the parts of this task that can be deferred until v0.7.1.

Here are the remaining things that I think we need to fix to close this ticket:

  • There is currently [a test failure on solaris](http://allmydata.org/buildbot/builders/solaris/builds/91/steps/test/logs/test.log). I also get a (different) unit test failure from the foolscap tests on that machine ([foolscap #31](http://foolscap.lothar.com/trac/ticket/31)).
  • check the share hash chain (Retrieve._validate_share_and_extract_data). At the moment we check the block hash tree but not the share hash chain, so byzantine storage servers could trivially corrupt our data.
zooko commented 2007-11-14 14:56:04 +00:00
Author
Owner

The test failure on solaris had to do with old, incompatible versions of packages being loaded from the system instead of the current versions of the packages. I removed the old packages from the host, so that problem is currently not happening (although we would still like to fix it so that the unit tests do not load packages from the system -- #145).

zooko commented 2007-11-14 14:56:54 +00:00
Author
Owner

I believe Brian implemented "check the share hash chain" in changeset:8ba10d0155cddeb3, but I'll leave it to him to close this ticket.

warner commented 2007-11-14 20:53:10 +00:00
Author
Owner

nope, not yet. I added a test which provokes a failure in a single share, to see if download succeeds anyways, but that test does not specifically check to make sure that the corrupted share is detected. There is not yet code to check this hash chain. Maybe today, more likely late tomorrow.

warner commented 2007-11-14 21:32:45 +00:00
Author
Owner

Ok, I just pushed the validate-the-share_hash_chain fix in changeset:2eeac5cff8baf05d, so this check is now being done.
I did an eyeball test, but we still need unit tests (but not for 0.7.0).

#207 has the remaining mutable file issues.

I will close this one shortly.. there's an OS-X buildbot failure in the test I just pushed, which I need to investigate before I'll consider this part done.

warner commented 2007-11-14 22:54:46 +00:00
Author
Owner

it looks like the code which is supposed to fall back to other shares when a corrupt share is detected is not working. The lack of deterministic tests (included in #207) makes this a random failure rather than an easily repeatable one.

still investigating..

warner commented 2007-11-17 00:27:04 +00:00
Author
Owner

Ok, the last issue is resolved, and getting multiple shares from a single server (in which one is bad and the other good) no longer causes the good shares to be ignored. The remainder of the mutable file issues (for release after this one) are in #207, I'm closing this one out.

tahoe-lafs added the
fixed
label 2007-11-17 00:27:04 +00:00
warner closed this issue 2007-11-17 00:27:04 +00:00