transverse block-hash-trees, plus segment-hash-tree #2387

Open
opened 2015-02-17 20:53:49 +00:00 by warner · 0 comments
warner commented 2015-02-17 20:53:49 +00:00
Owner

While discussing #2386, Daira identified an alternate encoding scheme which might resolve the efficient-mutable-update problem.

The current encoding scheme starts with block-hash-trees (one per share, with one leaf per block, each block being an erasure-coded fragment of a different ciphertext segment) that culminate in the block-hash-tree-root (one per share). On top of that, a "transverse" share-hash-tree is built, which contains one leaf per share (leaf==block-hash-tree-root), culminating in the share-hash-tree-root. The share-hash-tree-root goes into the UEB (immutable) and the signed RSA block (mutable).
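
The shape of this construction, in a minimal Python sketch (this is not Tahoe's actual hash-tree code, which uses tagged hashes and its own tree implementation; a plain pairwise SHA-256 tree is assumed here only to make the layout concrete):

```python
import hashlib

def h(*parts):
    # toy hash: untagged SHA-256 over the concatenated inputs
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_root(leaves):
    nodes = [h(leaf) for leaf in leaves]
    while len(nodes) > 1:
        if len(nodes) % 2:
            nodes.append(nodes[-1])          # duplicate the odd node out
        nodes = [h(a, b) for a, b in zip(nodes[0::2], nodes[1::2])]
    return nodes[0]

def current_scheme(shares):
    # shares[y][x] is the block of share y for segment x
    # one block-hash-tree per share, one leaf per block
    bht_roots = [merkle_root(blocks) for blocks in shares]
    # transverse share-hash-tree: one leaf per share (leaf == block-hash-tree-root)
    share_hash_tree_root = merkle_root(bht_roots)   # goes into the UEB / signed block
    return bht_roots, share_hash_tree_root
```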

The drawback to this scheme is that modifying a single block causes new block-hash-tree-roots to be created for all shares, so we need block-hash-tree data from all shares to modify any of them. As #2386 describes, this means we can't efficiently modify unhealthy files.

An alternate encoding would rotate the trees. In this approach, the lower hash tree is the transverse one. It is still a block-hash-tree, but instead of one leaf per segment (and one block-hash-tree per share), it would contain one leaf per share (and we'd have one block-hash-tree per segment). The block-hash-tree for seg0 uses the first block of each share as its N leaves, and contains enough information to prove that any one block (for segment X and share Y, X in range(numsegs), Y in range(num_servers==N)) belongs to a given segment (block-hash-tree-rootX).

On top of that, we'd have a non-transverse "segment-hash-tree", with one leaf per segment. The segment-hash-tree-root would be included in the UEB as before. The entire segment-hash-tree would be replicated into each share (this is 64 bytes per segment).
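
Sketching the rotated layout with the same toy merkle_root helper as in the sketch above (again an illustration, not a spec):

```python
def rotated_scheme(shares):
    # shares[y][x] is the block of share y for segment x
    num_segs = len(shares[0])
    # transverse block-hash-tree for segment x: one leaf per share
    bht_roots = [merkle_root([share[x] for share in shares])
                 for x in range(num_segs)]
    # non-transverse segment-hash-tree: one leaf per segment; the root goes into
    # the UEB / signed block, and the whole tree is replicated into every share
    segment_hash_tree_root = merkle_root(bht_roots)
    return bht_roots, segment_hash_tree_root
```

The payoff is that rewriting segment X only regenerates bht_roots[X] (from the freshly re-encoded blocks the writer already has in hand) plus the segment-hash-tree nodes above it, instead of touching per-share trees that span the whole file.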

Each block would be paired with the block-hash-tree Merkle chain needed to verify the block's membership in the block-hash-tree up to the block-hash-tree root (32 bytes * log(N)). Every time the client fetches a block, they also fetch this Merkle chain. Fetching this chain is so common that it might be best to interleave the hashes with the data blocks (to reduce seeks), rather than putting all the block-hash-tree data in a separate portion of the share (like #1687, but without the random-access requirement during upload).
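
A hedged sketch of the check a downloader would run on each fetched block, reusing the toy h() helper from the first sketch (the chain ordering and leaf indexing here are assumptions, not an on-disk format):

```python
def verify_block(block, share_index, chain, bht_root):
    """Verify one block against its segment's block-hash-tree root, using the
    uncle chain (~log2(N) hashes) stored alongside the block."""
    node = h(block)
    for sibling in chain:                 # walk bottom-up toward the root
        if share_index % 2 == 0:
            node = h(node, sibling)
        else:
            node = h(sibling, node)
        share_index //= 2
    # bht_root is itself a leaf of the (replicated) segment-hash-tree, which
    # chains up to the segment-hash-tree-root in the UEB / signed block
    return node == bht_root
```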

If it weren't for this last interleaving trick, it might be possible to add this encoding to the current (immutable) scheme, which is somewhat open-ended. We can add new fields to the UEB (which is simple JSON-encoded), and the offsets table doesn't need to cover the entire file, so the UEB could also contain the offset of the new hash tree data. There's not much point to doing this, however, since fixing #2386 requires removing the non-transverse block-hash-trees.

Our encoding scheme is over eight years old; in fact, the very first commit [4968ca9] was to calculate the storage/bandwidth/alacrity overhead of this scheme. It's hard to remember, but I don't think we considered rotating the trees back then. If we did, we might have dismissed it for the following reasons:

  • It makes the shares less independent. There was a brief period when we wanted to make it possible to re-encode a file (with higher N) and reuse (most of) the original shares. Transverse block-hash-trees would interfere with that: increasing N would require changing more data. (Our thinking here was also influenced by Tahoe's predecessor, which used Fountain Codes to produce arbitrary numbers of shares, so fixed-N was a new thing at the time.) #678 and #711 are related.
  • In particular, we've considered making N fixed and large, like 256, and only producing a subset of the shares. If this only costs encoding time, it's a fairly cheap way to enable after-the-fact "reencoding". If it costs space in a Merkle chain, then it's not quite as cheap. Fixed-N would also enable an easier-to-understand share-at-a-time uploader (#1340).
  • It requires downloading extra data all the time, like O(filesize*k*ln(N)); rough numbers are sketched just after this list. For each segment, you need k blocks, and for each block, you need an uncle chain of length ln(N). The original encoding scheme lets you decide between alacrity and bandwidth overhead: if you know you're downloading the entire file and don't need to deliver the first bytes early, you can skip fetching the block-hash-tree and just recompute it entirely. The bandwidth overhead then reduces to the transverse share-hash-tree, which is O(k*ln(N)). Since N is generally small and filesizes can be large, this is a win. Of course, if you want alacrity, you need the block-hash-tree Merkle chain for each segment, so you must download O(k*ln(filesize)) bytes for the first segment, and somewhat less for subsequent ones, requiring something like O(filesize*k*ln(N))/2 for a complete linear read.
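
To put rough numbers on that last bullet (parameters chosen here purely for illustration, not taken from the ticket):

```python
from math import ceil, log2

k, N, hash_len = 3, 10, 32               # illustrative 3-of-10 encoding, 32-byte hashes
seg_size, file_size = 128 * 1024, 2**30  # 128 KiB segments, 1 GiB file
num_segs = ceil(file_size / seg_size)    # 8192 segments

# rotated scheme: k blocks per segment, each with a ~log2(N)-hash uncle chain
rotated = num_segs * k * ceil(log2(N)) * hash_len   # ~3 MiB of hash traffic

# current scheme, full non-alacritous read: recompute the block-hash-trees
# locally and fetch only the transverse share-hash-tree chains
current = k * ceil(log2(N)) * hash_len              # ~384 bytes

print(rotated, current)
```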

It might be fair to say we've optimized for single-segment reads, at the expense of mutable updates.

tahoe-lafs added the code-encoding, normal, enhancement, 1.10.0 labels 2015-02-17 20:53:49 +00:00
tahoe-lafs added this to the undecided milestone 2015-02-17 20:53:49 +00:00
Reference: tahoe-lafs/trac-2024-07-25#2387