get_hash method in webapi for extension caching logic.

nejucomo commented

2008-01-17 05:01:47 +00:00

Owner

The webapi could provide a call which returns the content's hash for a given capability:

get_hash(cap, hashtype) -> hash

cap - A string containing a capability.

hashtype - An enumeration type specifying the hash algorithm; example "sha256" (more below).

hash - The result of applying the specified hash to the contents referred to by cap.

Support for different hashtypes allows the backend to implement whichever types are convenient, and extension writers can request specific types for future versions.

As long as the hashtype is convenient for extensions to compute on their own, this allows them to make "smart" caching decisions. For instance, a local file system synchronization command could choose to download (or upload) a file only if get_hash returns a different hash than the one computed from the local file.

The tahoe architecture may provide support for certain algorithms efficiently (because they are innate to the data structures).
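
A minimal sketch of the synchronization decision described above, in Python. The get_hash call is only proposed in this ticket and does not exist in the webapi, so it is passed in as a callable; local_hash and needs_transfer are hypothetical helper names.

```python
import hashlib
from typing import Callable


def local_hash(path: str, hashtype: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a local file incrementally so large files are never fully in memory."""
    h = hashlib.new(hashtype)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_transfer(path: str, cap: str, get_hash: Callable[[str, str], str],
                   hashtype: str = "sha256") -> bool:
    """Return True when the local file differs from the content behind cap.

    get_hash stands in for the proposed (not yet existing) webapi call.
    """
    return get_hash(cap, hashtype) != local_hash(path, hashtype)
```

A sync tool would skip the download or upload whenever needs_transfer returns False. This only works if the hashtype is one the client can compute locally, which is the condition stated above.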

tahoe-lafs added the

labels 2008-01-17 05:01:47 +00:00

tahoe-lafs added this to the eventually milestone 2008-01-17 05:01:47 +00:00

warner commented

2008-02-06 09:59:02 +00:00

Author

Owner

Yeah, my concern is that I'm not sure where we would store these hashes. We
could stash them as metadata on directory edges, but then the API is more
like::

 hash = dirnode.get_hash_of_child("foo.txt", "sha1")

and of course you have to have the dirnode around to ask it anything.

To have a function that just takes an arbitrary cap would either mean that
these hashes are contained inside the cap (so the cap would have to get
bigger), or that there's some magic table somewhere that maps caps to hashes
(and where do we keep this table, who gets to add to it, who gets to read
from it, etc).

I completely agree with the utility of this feature, I just don't yet see how
to implement it.
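
To make the directory-edge option concrete, here is a toy in-memory sketch in Python. It is not Tahoe's real dirnode class: Edge, DirNode, add_child, and the "hash:<type>" metadata key are all made up for illustration. Only the name get_hash_of_child comes from the comment above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Edge:
    """A directory edge: the child's cap plus free-form metadata."""
    cap: str
    metadata: Dict[str, str] = field(default_factory=dict)


class DirNode:
    """Toy in-memory directory; a stand-in, not Tahoe's dirnode."""

    def __init__(self) -> None:
        self._children: Dict[str, Edge] = {}

    def add_child(self, name: str, cap: str, hashes: Dict[str, str]) -> None:
        # Store each hash under a "hash:<type>" key (a made-up convention).
        meta = {f"hash:{t}": v for t, v in hashes.items()}
        self._children[name] = Edge(cap, meta)

    def get_hash_of_child(self, name: str, hashtype: str) -> str:
        # Raises KeyError if the child or the requested hash type is missing.
        return self._children[name].metadata[f"hash:{hashtype}"]
```

The sketch shows the limitation raised above: the hash belongs to the edge, so a caller holding only a bare cap has nothing to query.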

zooko commented

2008-03-27 16:28:00 +00:00

Author

Owner

Here's something we could do:

Store such hashes (encrypted by the readcap) in the UEB (which will hopefully be renamed CEB), so Tahoe can answer queries like

get_hash(cap, hashtype) -> hash

by making a single request (typically) to a storage server. The supported hashtypes would be limited to the hashtypes that were supported by the uploader when they uploaded the file -- either just one (sha256), or maybe two or three (sha256 and Tiger and RIPEMD-160?). Most code which does file validation stuff nowadays still uses MD5, SHA-1, or Tiger, but the first two really shouldn't be used for secure file validation in the future, so I would be happy to not support them.

By the way, storing an encrypted sha256 hash of the plaintext in the CEB is something that Rob and perhaps Brian and perhaps I want to do anyway in order to give further assurance that there wasn't a bug or wrong symmetric key in our decryption of the validated ciphertext.
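
A rough Python sketch of the upload-time side of this idea: compute every supported digest in one pass over the plaintext, then answer later queries only for hashtypes that were recorded. The names hashes_at_upload and lookup_hash are hypothetical, and the sketch leaves out both the encryption under the readcap and the storage in the UEB/CEB. Note that Python's hashlib has no Tiger, and RIPEMD-160 is only available on some OpenSSL builds, so the default here is sha256 alone.

```python
import hashlib
from typing import BinaryIO, Dict, Iterable


def hashes_at_upload(stream: BinaryIO, hashtypes: Iterable[str] = ("sha256",),
                     chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Compute every requested digest in a single pass over the plaintext."""
    hashers = {t: hashlib.new(t) for t in hashtypes}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for h in hashers.values():
            h.update(chunk)
    return {t: h.hexdigest() for t, h in hashers.items()}


def lookup_hash(recorded: Dict[str, str], hashtype: str) -> str:
    """Answer a get_hash-style query from what the uploader recorded."""
    try:
        return recorded[hashtype]
    except KeyError:
        raise ValueError(
            f"hashtype {hashtype!r} was not recorded at upload time") from None
```

A query for a hashtype the uploader did not compute fails cleanly instead of forcing a full download, which matches the limitation described above.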

zooko commented

2008-05-30 20:21:33 +00:00

Author

Owner

A user of allmydata.com's consumer backup service just requested that it display the md5sum of a file on the web site so that he could use that to assure himself that the file had uploaded completely and correctly.

tahoe-lafs modified the milestone from eventually to undecided

2008-06-01 20:58:01 +00:00

tahoe-lafs added

code-frontend-web

and removed

unknown

labels 2009-03-08 22:02:51 +00:00

nejucomo commented

2009-09-29 00:14:10 +00:00

Author

Owner

The comments above seem to only consider a well-known hash function, like SHA256, and indeed it seems like including such a hash would add some overhead or complexity to the storage format. This might be worth it.

However, when I originally wrote this, I imagined there was some hashtype which was "innate" to Tahoe storage structures, and therefore this call could extract that information efficiently from a Cap.

After a quick skim of the architecture doc, it sounds like there is a merkle tree stored in the capability extension block. If this is a tree over the plain text, then the root of this tree could be efficiently returned by the proposed call, such as:

get_hash(myCap, "tahoe_content_merkle_tree_root")

Clients would then need to compute a merkle tree, but I expect this would be somewhat simple and efficient, given the right library for computing merkle trees.
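
As a rough illustration of what computing such a root on the client could look like, here is a generic binary SHA-256 Merkle tree in Python. It does not attempt to match Tahoe's actual hash-tree format: the segment size, the leaf and interior prefixes, and the odd-level padding rule are all assumptions chosen for the sketch.

```python
import hashlib
from typing import List


def merkle_root(data: bytes, segment_size: int = 4096) -> str:
    """Root of a binary SHA-256 Merkle tree over fixed-size segments of data."""
    segments = [data[i:i + segment_size]
                for i in range(0, len(data), segment_size)] or [b""]
    # Prefix leaves with 0x00 and interior nodes with 0x01 so that a leaf
    # hash can never be mistaken for an interior hash.
    level: List[bytes] = [hashlib.sha256(b"\x00" + s).digest() for s in segments]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```

A client and the grid would get equal roots only if they agree on every one of these parameters, so a real implementation would have to reuse Tahoe's own tree construction.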

Because I've noticed a thread on tahoe-dev about caching, and I've seen some tickets related to caching, I'm going to link all of these related tickets and threads together.

nejucomo commented

2009-09-29 00:27:49 +00:00

Author

Owner

See ticket #316 for a built-in caching feature proposal.

I personally prefer this minimal code change which makes it easier for clients to do caching versus a built-in caching feature. Fewer features, fewer configuration states, and more test-coverage per component.

zooko commented

2009-09-29 03:24:56 +00:00

Author

Owner

There is currently no hash of the plaintext stored. See http://allmydata.org/~zooko/lafs.pdf diagram 1 for what is stored for an immutable file currently. We used to have one, but we took it out because it was visible to anyone (it was stored on storage servers unencrypted), and this enables anyone to mount guess-and-check attacks (per http://hacktahoe.org/drew_perttula.html ). #453 (safely add plaintext_hash to immutable UEB) is a ticket to add plaintext hashes back but store them encrypted under the read-cap.
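
To illustrate why an unencrypted plaintext hash is a problem, here is a small Python sketch of a guess-and-check attack. The function name and the candidate list are hypothetical; the point is only that anyone who can read the stored hash can confirm a guess about the file's contents without holding the read-cap.

```python
import hashlib
from typing import Iterable, Optional


def guess_and_check(public_hash: str, candidates: Iterable[bytes]) -> Optional[bytes]:
    """Return the candidate whose SHA-256 matches public_hash, if any."""
    for candidate in candidates:
        if hashlib.sha256(candidate).hexdigest() == public_hash:
            return candidate
    return None


# Example: a low-entropy plaintext is recovered from its visible hash alone.
stored = hashlib.sha256(b"salary: 50000").hexdigest()
guesses = (f"salary: {n}".encode() for n in range(0, 100001, 1000))
recovered = guess_and_check(stored, guesses)
```

If the stored hash is readable only by holders of the read-cap, as #453 proposes, an outside observer has no value to test guesses against.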

If we had #453, we could easily give out the hash-of-plaintext or else the root-of-merkle-tree-of-plaintext to serve this API. But wait a minute, what's the use case of this proposed API again? How come the user can't just use the verify cap instead of this hash-of-the-plaintext?

davidsarah commented

2009-10-28 04:09:24 +00:00

Author

Owner

Tagging issues relevant to new cap protocol design.

zooko commented

2010-05-15 04:39:21 +00:00

Author

Owner

I still don't understand why the use case for this isn't satisfied by verify caps.

nejucomo commented

2012-02-21 21:23:09 +00:00

Author

Owner

Replying to zooko:

I still don't understand why the use case for this isn't satisfied by verify caps.

Here's a use case I advocate:

* I have a large file called myblob.bin and a capability, $C (of any kind), which I believe is associated with some revision of myblob.bin.
* I use a command-line tool to calculate a cryptographic-hash-like value. Example alternatives:
  * $ md5sum myblob.bin > local_hash
  * $ pyeval 'hashlib.sha256(ri).hexdigest()' < myblob.bin > local_hash
  * $ tahoe calculate_hashlike_thingy --input-file myblob.bin > local_hash
* I then ask tahoe for the hash-like value given the capability:
  * $ tahoe calculate_hashlike_thingy --input-uri $C > lafs_hash
  * NOTE: For my use case, I want this command to not do any networking, if possible.
* Compare the results for equality:
  * $ if ! diff -q local_hash lafs_hash ; then echo 'This revision of myblob.bin is not stored at that capability.' ; fi

So for this use case to be satisfied by verify caps I need this command:

$ tahoe spit_out_verify_cap < myblob.bin

This command should only read myblob.bin but should not do any networking or use any state other than the cap and myblob.bin (so that any tahoe user on any grid can run it).

Is it feasible to make this command? That would satisfy my goal for this ticket.

davidsarah commented

2012-02-22 00:53:10 +00:00

Author

Owner

Replying to nejucomo (comment:12):

So for this use case to be satisfied by verify caps I need this command:

$ tahoe spit_out_verify_cap < myblob.bin

This command should only read myblob.bin but should not do any networking or use any state other than the cap and myblob.bin (so that any tahoe user on any grid can run it).

Is it feasible to make this command? That would satisfy my goal for this ticket.

Yes, it is feasible to make this command. Depending on the cap protocol, it might have to do all the work of erasure coding the file and computing a Merkle hash of the ciphertext shares before it can compute the verify cap.

Your use case could also be met with a Merkle hash of the plaintext and convergence secret, which could be computed without erasure coding. But there's a tradeoff between being able to do that and the cap size: in order to be able to recover the plaintext hash from the read cap without network access, the encryption bits and the integrity bits of the read cap must be separate, which means that the minimum immutable read cap size for a security level of 2^K against 2^T targets is 3K + T (2K integrity bits and K+T confidentiality bits). In contrast the scheme with the shortest read caps so far without this constraint is Rainhill 3, which has an immutable read cap size of only 2K, the minimum possible to achieve 2^K security against collision attacks.
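
To get a feel for the size difference, here is a small worked calculation in Python using the two formulas above. The function names are hypothetical, and K = 128 and T = 64 are illustrative values only, not figures from this ticket.

```python
def separated_read_cap_bits(k: int, t: int) -> int:
    """3K + T: 2K integrity bits plus K + T confidentiality bits."""
    integrity_bits = 2 * k
    confidentiality_bits = k + t
    return integrity_bits + confidentiality_bits


def rainhill3_read_cap_bits(k: int) -> int:
    """2K: the Rainhill 3 immutable read cap size stated above."""
    return 2 * k


# Illustrative parameters (assumed): K = 128, T = 64.
print(separated_read_cap_bits(128, 64))  # 448
print(rainhill3_read_cap_bits(128))      # 256
```

With these assumed parameters, keeping the integrity and confidentiality bits separate costs 192 extra bits per read cap.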

(A simplified version of Rainhill 3 without traversal caps is at https://tahoe-lafs.org/~davidsarah/immutable-rainhill-3x.png . It does allow you to compute a plaintext hash P, or an encrypted hash EncP_R, before doing erasure coding, but in order to recover that value from the read cap, you also need EncK_R which is stored on the server.)

davidsarah commented

2012-02-22 01:01:21 +00:00

Author

Owner

BTW, if you drop the feature of being able to derive a verify cap from a read cap off-line, then a verify cap could include the information normally stored on the server that allows verifying a plaintext off-line without doing erasure coding, and read caps could still be optimally short. However, in practice I think off-line derivation of verify caps is the more useful feature.

get_hash method in webapi for extension caching logic. #280