Allow Tahoe filesystem to be run over a different key-value-store / DHT implementation #869


Open

opened 2009-12-20 23:26:52 +00:00 by daira · 5 comments

daira commented

2009-12-20 23:26:52 +00:00

source:docs/architecture.rst describes Tahoe as comprising three layers: key-value store, filesystem, and application.

Most of what makes Tahoe different from other systems is in the filesystem layer -- the layer that implements a cryptographic capability filesystem. The key-value store layer implements (a little bit more than) a Distributed Hash Table, which is a fairly well-understood primitive with many implementations. The Tahoe filesystem and applications could in principle run on a different DHT, and it would still behave like Tahoe -- with different (perhaps better, depending on the DHT) scalability, performance, and availability properties, but with confidentiality and integrity ensured by Tahoe without relying on the DHT servers.

However, there are some obstacles to running the Tahoe filesystem layer on another DHT:

* The code isn't strictly factored into layers (even though most code files belong mainly to one layer), so there isn't a narrow API between the key-value store and filesystem-related abstractions.
* The communication with servers currently needs to be encrypted (independently of the share encryption), and other DHTs probably wouldn't support that.
* Because the filesystem has only been used with one key-value store layer up to now, it may make assumptions about that layer that haven't been clearly documented.

Note that even if the Tahoe code were strictly layered, we should still expect significant effort to port Tahoe to a particular DHT. The DHT servers would probably have to run some Tahoe code in order to verify shares, for example.


daira added the

labels 2009-12-20 23:26:52 +00:00

daira added this to the undecided milestone 2009-12-20 23:26:52 +00:00

warner commented

2009-12-22 05:28:49 +00:00

Hmm, good points. This ties in closely to the docs outline that we wrote up
(but which we haven't finished by writing the actual documentation it calls
for): source:docs/specifications/outline.rst .

As you note, there are several abstraction-layer leaks which would need to be
plugged or accommodated to switch to a general-purpose DHT for the bottom-most
layer. Here are a few thoughts.

* the main special feature that we require of the bottom-most DHT layer is
  support for mutable files. All of the *immutable*-file stuff is fairly
  standard DHT material. But to implement Tahoe's mutable files, we need a
  distributed slot primitive with capability-based access control: creating
  a slot should return separate read- and write- caps, and there should be
  some means of repairing shares without being able to forge new contents.
* the only need for encrypted server connections is to support the
  shared-secret used to manage mutable-slot access control (which we'd like
  to get rid of anyways, because it makes share-migration harder, and it
  makes repair-from-readcap harder). If we had a different mechanism, e.g.
  representing slot-modify authority with a separate ECDSA private key per
  server*slot, then we could probably drop this requirement. (there is some
  work to do w.r.t. replay attacks and building a suitable protocol with
  which to prove knowledge of the private key, but these are well-understood
  problems).
* on the other hand, the shared-secret slot-modify authority is nice and
  simple, is fast and easy for the server to verify (meaning a slow server
  can still handle lots of traffic), and doesn't require the server to have
  detailed knowledge of the share layout (which decouples server version
  from client version). Most of the schemes we've considered for
  signed-message slot-modify operations require the servers to verify the
  proposed new slot contents thoroughly, making it harder to deploy new
  share types without simultaneously upgrading all the servers.
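To make the trade-off concrete, here is a minimal sketch of shared-secret slot-modify authority. It is an illustration under assumptions, not Tahoe's actual code: `SlotServer`, `derive_write_enabler`, and the derivation tag are hypothetical names. The server only compares an opaque secret and never parses the share, which is why the check is cheap and independent of the share layout. It is also why the connection needs to be encrypted: the secret itself travels to the server.

```python
import hashlib
import hmac


class SlotServer:
    """Toy storage server: mutable slots guarded by a shared secret."""

    def __init__(self):
        self._slots = {}  # slot_id -> (write_enabler, data)

    def create(self, slot_id, write_enabler, data):
        if slot_id in self._slots:
            raise KeyError("slot already exists")
        self._slots[slot_id] = (write_enabler, data)

    def put(self, slot_id, write_enabler, data):
        stored, _ = self._slots[slot_id]
        # Constant-time comparison of opaque bytes; the share is not inspected.
        if not hmac.compare_digest(stored, write_enabler):
            raise PermissionError("bad write enabler")
        self._slots[slot_id] = (stored, data)

    def get(self, slot_id):
        return self._slots[slot_id][1]


def derive_write_enabler(write_secret, server_id):
    """Hypothetical per-server derivation, so one server's secret is useless elsewhere."""
    return hashlib.sha256(b"write-enabler:" + server_id + b":" + write_secret).digest()
```

A client holding the write secret would call `server.put(slot_id, derive_write_enabler(secret, server_id), new_data)`. Anyone without the secret gets `PermissionError`.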

There might also be some better ways of describing Tahoe's nominal layers, in
a sense refactoring the description or shuffling around the dotted lines.
I've been trying to write up a presentation using the following arrangement:

* We could say that the lowermost layer is responsible for providing
  availability, reliability, and integrity: this layer has all the
  distributed stuff, erasure coding, and hashes to guard against corrupted
  shares, but you could replace it with a simple local lookup table if you
  didn't care about that sort of thing. This layer provides a pair of
  immutable operations (key=put(data) and data=get(key)), and a triple of
  mutable operations (writecap,readcap=create(), put(writecap,data),
  data=get(readcap)). The check/verify/repair operations work entirely at
  this level. All of the 'data' at this layer is ciphertext.
* The next layer up gets you plaintext: the immutable operations are
  key=f(readcap), ciphertext=encrypt(key, plaintext), and
  plaintext=decrypt(key, ciphertext). The mutable operations are the same,
  plus something to give you the writecap-accessible-only column of a
  dirnode. If you didn't care about confidentiality, you could make these
  NOPs.
* The layer above that gets you directories, and is mostly about serializing
  the childname->childcap+metadata table into a mutable slot (or immutable
  file). If you have some other mechanism to manage your filecaps, you could
  ignore this layer.
* The layer above that provides some sort of API to non-Tahoe code, making
  all of the other layers accessible somehow. This presents operations
  like data=get(readcap), children=read(dircap), etc.
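The arrangement above can be sketched in a few dozen lines, with the lowermost layer replaced by a local lookup table as suggested. Everything here is an assumption made for illustration: the class names, `readcap_of`, the cap derivation, and especially `toy_cipher`, which is a placeholder XOR stream with no security value. None of it reflects Tahoe's real cap formats or encryption.

```python
import hashlib
import json
import os


def readcap_of(writecap):
    """Toy derivation: the readcap is a one-way hash of the writecap."""
    return hashlib.sha256(writecap.encode()).hexdigest()


class LocalStore:
    """Lowermost layer as a local lookup table; all data here is ciphertext."""

    def __init__(self):
        self._immutable = {}  # key -> data
        self._slots = {}      # readcap -> data

    def put_immutable(self, data):
        key = hashlib.sha256(data).hexdigest()
        self._immutable[key] = data
        return key

    def get_immutable(self, key):
        data = self._immutable[key]
        if hashlib.sha256(data).hexdigest() != key:  # guard against corruption
            raise ValueError("corrupted data")
        return data

    def create(self):
        writecap = os.urandom(16).hex()
        self._slots[readcap_of(writecap)] = b""
        return writecap, readcap_of(writecap)

    def put(self, writecap, data):
        readcap = readcap_of(writecap)
        if readcap not in self._slots:
            raise KeyError("unknown slot")
        self._slots[readcap] = data

    def get(self, readcap):
        return self._slots[readcap]


def toy_cipher(key, data):
    """Placeholder XOR stream (SHA-256 in counter mode). NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], pad))
    return bytes(out)


class CryptoLayer:
    """Next layer up: plaintext in and out, ciphertext below."""

    def __init__(self, store):
        self.store = store

    @staticmethod
    def _key(readcap):  # key = f(readcap)
        return hashlib.sha256(b"enc:" + readcap.encode()).digest()

    def create(self):
        return self.store.create()

    def put(self, writecap, plaintext):
        key = self._key(readcap_of(writecap))
        self.store.put(writecap, toy_cipher(key, plaintext))

    def get(self, readcap):
        return toy_cipher(self._key(readcap), self.store.get(readcap))


class DirLayer:
    """Directory layer: serializes a childname -> childcap table into a slot."""

    def __init__(self, files):
        self.files = files

    def mkdir(self):
        writecap, readcap = self.files.create()
        self.files.put(writecap, json.dumps({}).encode())
        return writecap, readcap

    def link(self, dir_writecap, name, childcap):
        children = self.read(readcap_of(dir_writecap))
        children[name] = childcap
        self.files.put(dir_writecap, json.dumps(children).encode())

    def read(self, dir_readcap):
        return json.loads(self.files.get(dir_readcap).decode())
```

Swapping `LocalStore` for a class with the same five methods backed by some other DHT is the kind of substitution this ticket is about. Replacing `toy_cipher` with the identity function gives the "NOP" variant of the crypto layer.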

One way to look at Tahoe is in terms of that top-most API: you don't care
what it does, you just need to know about filecaps and dircaps. Another view
is about some client code, the API, the gateway node, and the servers that
the gateway connects to: this diagram would show different sorts of message
traversing the different connections. A third view would abstract the servers
and the DHT/erasure-coding stuff into a lookup table, and focus on the
crypto-and-above layers.


daira commented

2010-01-20 07:10:09 +00:00

Author

The "grid layer" is now called the "key-value store layer".

daira changed title from ~~Allow Tahoe filesystem to be run over a different grid/DHT implementation~~ to Allow Tahoe filesystem to be run over a different key-value-store / DHT implementation

2010-01-20 07:10:09 +00:00

daira commented

2010-12-16 01:00:21 +00:00

Author

Other DHTs might have better anti-censorship properties.

daira commented

2011-01-05 23:00:21 +00:00

Author

Replying to warner:

> ... This ties in closely to the docs outline that we wrote up
> (but which we haven't finished by writing the actual documentation it calls
> for): source:docs/specifications/outline.txt .

Now source:docs/specifications/outline.rst.

> * on the other hand, the shared-secret slot-modify authority is nice and
>   simple, is fast and easy for the server to verify (meaning a slow server
>   can still handle lots of traffic), and doesn't require the server to have
>   detailed knowledge of the share layout (which decouples server version
>   from client version). Most of the schemes we've considered for
>   signed-message slot-modify operations require the servers to verify the
>   proposed new slot contents thoroughly, making it harder to deploy new
>   share types without simultaneously upgrading all the servers.

As far as performance is concerned, signature verification is fast with RSA, ECDSA or hash-based signatures (and the hashing can be done incrementally as the share is received, so no significant increase in latency). I don't think this is likely to be a performance bottleneck.
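As a rough illustration of the incremental-hashing point, the sketch below feeds each chunk into the hash as it arrives, so the digest is ready as soon as the last chunk is received. The function name and the use of SHA-256 are assumptions. The signature check itself is omitted: it would run once over the final digest, using whichever signature scheme the share format specifies.

```python
import hashlib


def receive_share(chunks, expected_digest):
    """Hash a share chunk by chunk as it arrives, then compare digests.

    `chunks` is any iterable of bytes (e.g. reads from a socket).
    Returns the assembled share, or raises ValueError on mismatch.
    """
    hasher = hashlib.sha256()
    received = []
    for chunk in chunks:
        hasher.update(chunk)   # work is spread across the transfer
        received.append(chunk)
    if hasher.digest() != expected_digest:
        raise ValueError("share digest mismatch")
    return b"".join(received)
```

The incremental digest is identical to hashing the whole share in one call, so no extra pass over the data is needed after the transfer completes.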

The compatibility impact of changes in the mutable share format would be that an older server is not able to accept mutable shares of the newer version from a newer client. The newer client can still store shares of the older version on that server. Grids with a mixture of server and client versions (and old shares) will still work, subject to that limitation.

On the other hand, suppose that the reason for the change is migration to a new signing algorithm to fix a security flaw. In that case, a given client can't expect any improvements in security until all servers have upgraded, then all shares are migrated to the new format (probably as part of rebalancing), then that client has been upgraded to stop accepting the old format. Relative to the current scheme where servers don't need to be upgraded because they are unaware of the signing algorithm, there is indeed a significant disadvantage. At least the grid can continue operating through the upgrade, though.

The initial switch from write-enablers to share verification also requires upgrading all servers on a grid -- but if you're doing this to support a different DHT, then that would have to be effectively a new grid, which would just start with servers of the required version. The same caps could potentially be kept when migrating files from one grid to another, as long as the cap format has not changed incompatibly.


warner commented

2011-01-06 19:48:54 +00:00

Replying to [davidsarah]comment:6:

> As far as performance is concerned, signature *verification* is fast with
> RSA, ECDSA or hash-based signatures (and the hashing can be done
> incrementally as the share is received, so no significant increase in
> latency). I don't think this is likely to be a performance bottleneck.

I'd want to test this with the lowliest of our potential storage servers:
embedded NAS devices like Pogo-Plugs and OpenWRT boxes with USB drives
attached (like Francois' super-slow ARM buildslave). Moving from Foolscap to
HTTP would help these boxes (which find SSL challenging), and doing less work
per share would help. Ideally, we'd be able to saturate the disk bandwidth
without maxing out the CPU.

Also, one of our selling points is that the storage server is low-impact: we
want to encourage folks on desktops to share their disk space without
worrying about their other applications running slowly. I agree that it might
not be a big bottleneck, but let's just keep in mind that our target is lower
than 100% CPU consumption.

Incremental hashing will require forethought in the CHK share-layout and in
the write protocol (the order in which we send out share bits): there are
plenty of ways to screw it up. Mutable files are harder (you're updating an
existing merkle tree, reading in modified segments, applying deltas,
rehashing, testing, then committing to disk). The simplest approach would
involve writing a whole new proposed share, doing integrity checks, then
replacing the old one.
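A minimal sketch of that simplest approach, assuming shares are stored as plain files: write the proposed share to a temporary file in the same directory, check it, and only then swap it into place. The function name is hypothetical, and the SHA-256 comparison is a stand-in for whatever share verification the server would really perform (signature and Merkle-tree checks).

```python
import hashlib
import os
import tempfile


def replace_share(path, new_share, expected_sha256):
    """Write a proposed share, verify it, then atomically replace the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_share)
            f.flush()
            os.fsync(f.fileno())
        # Stand-in integrity check on what actually reached the disk.
        with open(tmp_path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != expected_sha256:
                raise ValueError("proposed share failed integrity check")
        os.replace(tmp_path, path)  # the old share stays intact until this point
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The cost is that every modification rewrites the whole share, which is the price of avoiding the in-place update sequence described above.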

> The compatibility impact of changes in the mutable share format would be
> that an older server is not able to accept mutable shares of the newer
> version from a newer client. The newer client can still store shares of the
> older version on that server. Grids with a mixture of server and client
> versions (and old shares) will still work, subject to that limitation.

Hm, I think I'm assuming that a new share format really means a new encoding
protocol, so everything about the share is different, and the filecaps
necessarily change. It wouldn't be possible to produce both "old" and "new"
shares for a single file. In that case, clients faced with older servers
either have to reencode the file (and change the filecap, and find everywhere
the old cap was used and replace it), or reduce diversity (you can only store
shares on new servers).

Migrating existing files to the new format can't be done in a simple
rebalancing pass (in which you'd only see ciphertext); you'd need something
closer to a `cp -r`.

My big concern is that this would slow adoption of new formats like MDMF.
Since servers should advertise the formats they can understand, I can imagine
a control panel that shows me grid/server-status on a per-format basis: "if
you upload an SDMF file, you can use servers A/B/C/D, but if you upload MDMF,
you can only use servers B/C". Clients would need to watch the control panel
and not update their config to start using e.g. MDMF until enough servers
were capable to provide reasonable diversity: not exactly a flag day, but not
a painless upgrade either.
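The per-format view could be computed along these lines. This is a sketch only: the `advertised` mapping, the function names, and the `min_servers` threshold are assumptions, and how servers would actually announce their supported formats is not specified here.

```python
def servers_for_format(advertised, share_format):
    """Return the sorted names of servers that advertise `share_format`."""
    return sorted(
        name for name, formats in advertised.items() if share_format in formats
    )


def can_use_format(advertised, share_format, min_servers):
    """True if enough servers understand the format to give reasonable diversity."""
    return len(servers_for_format(advertised, share_format)) >= min_servers


# The example from the comment: every server takes SDMF, only B and C take MDMF.
advertised = {
    "A": {"SDMF"},
    "B": {"SDMF", "MDMF"},
    "C": {"SDMF", "MDMF"},
    "D": {"SDMF"},
}
print(servers_for_format(advertised, "SDMF"))  # ['A', 'B', 'C', 'D']
print(servers_for_format(advertised, "MDMF"))  # ['B', 'C']
```

A client could call `can_use_format(advertised, "MDMF", min_servers)` before switching its default format, with `min_servers` chosen to match its encoding parameters.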

> On the other hand, suppose that the reason for the change is migration to a
> new signing algorithm to fix a security flaw. In that case, a given client
> can't expect any improvements in security until all servers have upgraded,

Incidentally, the security vulnerability induced by such a flaw would be
limited to availability (and possibly rollback), since that's all the server
can threaten anyways. In this scenario, a non-writecap-holding attacker might
be able to convince the server to modify a share in some invalid way, which
will either result in a (detected) integrity failure or worst-case a
rollback. Anyways, it probably wouldn't be a fire-drill.

