verifierid as storage index: not the whole story #5

Closed
opened 2007-04-26 18:36:13 +00:00 by warner · 4 comments
warner commented 2007-04-26 18:36:13 +00:00
Owner

We've talked on and off about what key we should be using when looking up shares (the index sent to RIStorageServer.get_buckets). We're currently using the VerifierId. I'm wondering if we should be using some combination of the verifierid and the encoding parameters to make sure that this index consistently maps to the same set of shares, rather than merely shares generated from the same data.

The peer selection algorithm forces us to pick exactly one index value.

Pros and cons of different index values:

FileId:

  • good: quick to compute, allows potential uploaders to make the barest minimum of passes over the source data (two passes total: one for fileid+encryption_key determination, a second for encryption and encoding).
  • bad: a single plaintext file might be encrypted in different ways. It might also be encoded in different ways. Both will change the share data thus generated, and those shares should not be intermingled
  • bad: privacy leak, reveals more data about what files people are using (such that even custom encryption keys fail to hide the identity of the file)

VerifierId:

  • good: allows custom keys to protect the identity of the file
  • good: changes in encryption key result in different share identity
  • bad: requires an extra pass. The minimum memory/disk footprint approach means we don't want to store the crypttext, so we need a (fileid+key) pass and an (encrypt+discard+verifierid) pass; then we know the verifierid and can ask peers about shares; then, if we need to upload the file for real, we need an (encrypt+encode) pass.
  • bad: variations in encoding parameters (total number of shares, number of required shares, segment size) result in different shares, but these variations are not captured in the index

So I'm thinking that the share index needs to be the verifierid plus a serialized representation of the encoding parameters. The serialized parameters can be compressed by just saying "v1" and having that imply a certain algorithm applied to the filesize, but that should still give us the ability to change encoding parameters in the future and not wind up with incompatible shares that appear identical from the perspective of get_buckets().
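As a rough sketch of that derivation (the function name, the "v1" serialization, and the hash choice are all illustrative, not an actual Tahoe API):

```python
import hashlib

def storage_index(verifierid: bytes, k: int, n: int, segment_size: int) -> bytes:
    """Derive a storage index from the verifierid plus a versioned
    serialization of the encoding parameters, so that shares produced
    under different parameters get distinct indexes."""
    # "v1" would imply the parameter-derivation algorithm; listing
    # k/n/segment_size explicitly keeps the sketch self-describing.
    params = b"v1:%d:%d:%d" % (k, n, segment_size)
    return hashlib.sha256(verifierid + b"|" + params).digest()
```

With this shape, the same crypttext encoded 3-of-10 and 5-of-10 lands under two different indexes, so get_buckets() can never intermingle the two share sets.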

There is certain information that needs to go into peer selection (depending upon the algorithm). The verifierid is one piece; the number of shares that were uploaded is another (at least for PeerSelection/TahoeThree and PeerSelection/DenverAirport; PeerSelection/TahoeTwo does not need it).

There is also information that can affect the shares being generated without influencing peer selection (like segment size): this data could be stored on the peers and retrieved at download time. Peers could store shares from multiple encoded forms of the same crypttext.

The download process would then involve the downloader asking a set of likely peers about a verifierid and learning of a set of encoded forms, where each peer has buckets for some forms and not others. The response that lists the encoded forms includes their encoding parameters, so the downloader learns how many buckets of a given form it needs to recover the file. The second step would be to pick one form and retrieve references to sufficient buckets for that form; finally the data could be fetched and decoded.
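A minimal sketch of that two-step selection (the response shape and function name are hypothetical, and it ignores the possibility of duplicate share numbers across peers):

```python
def choose_form(peer_responses):
    """Tally, per encoded form, how many buckets the queried peers hold,
    then pick a form with enough buckets available to decode.

    peer_responses: list of dicts, one per peer, mapping an
    encoding-parameter tuple (k, n, segment_size) to the number of
    buckets that peer holds for this verifierid."""
    totals = {}
    for response in peer_responses:
        for form, count in response.items():
            totals[form] = totals.get(form, 0) + count
    # a form is recoverable once at least k of its buckets are reachable
    for form, total in sorted(totals.items()):
        k = form[0]
        if total >= k:
            return form
    return None
```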

tahoe-lafs added the
minor
defect
labels 2007-04-26 18:36:13 +00:00
warner commented 2007-04-27 03:24:21 +00:00
Author
Owner

fix some wikinames

tahoe-lafs added the
code
label 2007-04-28 19:17:41 +00:00
warner commented 2007-06-29 23:27:33 +00:00
Author
Owner

we can probably put this one off for a little while. If the storage index is randomly generated (or derived from something randomly generated, like the readkey), then this isn't a problem. We could also say that the storage index should be the hash of (readkey, encoding parameters).

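A minimal sketch of the hash-of-(readkey, encoding parameters) option (the tag string, function name, and hash choice are illustrative, not Tahoe's actual format):

```python
import hashlib

def storage_index_from_readkey(readkey: bytes, encoding_params: bytes) -> bytes:
    # Tagged hash of (readkey, encoding parameters): a new encoding of
    # the same file gets a new storage index, while the index reveals
    # nothing about the plaintext beyond what the readkey already does.
    return hashlib.sha256(b"storage-index-v1:" + readkey + b":" + encoding_params).digest()
```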
tahoe-lafs added the
0.3.0
label 2007-06-29 23:27:33 +00:00
warner commented 2007-07-25 02:59:44 +00:00
Author
Owner

currently (in, say, source:src/allmydata/upload.py@1000) the Uploadable is responsible for generating the readkey, and it is suggested that convergent uploads use a hash of the file's contents and the desired encoding parameters. We don't do that quite yet, but if we did, then the readkey would be different for different encodings of the same file, and we'd have the properties that we want.

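A sketch of that suggested convergent derivation (the function name and tag are hypothetical; a real implementation would use Tahoe's own tagged-hash conventions):

```python
import hashlib

def convergent_readkey(plaintext: bytes, encoding_params: bytes) -> bytes:
    # Hash the file contents together with the desired encoding
    # parameters: two uploads of the same file with the same parameters
    # converge on one readkey, while a different encoding yields a
    # different readkey (and hence different shares and storage index).
    digest = hashlib.sha256(b"convergence:" + encoding_params + b":" + plaintext).digest()
    return digest[:16]  # truncate to a 128-bit AES key
```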
tahoe-lafs added
code-encoding
and removed
code
labels 2007-08-14 18:55:46 +00:00
zooko commented 2007-09-25 04:25:40 +00:00
Author
Owner

Nowadays the storage index is the secure hash of the encryption key. Closing as fixed.

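For reference, a simplified sketch of that relationship (Tahoe's actual derivation uses its own tagged SHA-256 conventions; this only illustrates the one-way property):

```python
import hashlib

def storage_index_from_key(encryption_key: bytes) -> bytes:
    # One-way derivation: holders of the storage index can locate
    # shares, but cannot recover the decryption key from the index.
    return hashlib.sha256(b"storage-index:" + encryption_key).digest()[:16]
```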
tahoe-lafs added the
fixed
label 2007-09-25 04:25:40 +00:00
zooko closed this issue 2007-09-25 04:25:40 +00:00
tahoe-lafs added
0.6.0
and removed
0.3.0
labels 2007-09-25 04:25:51 +00:00
Reference: tahoe-lafs/trac-2024-07-25#5