store copy of block-hash-chain with each block #1687

Open
opened 2012-03-14 20:57:51 +00:00 by warner · 1 comment
warner commented 2012-03-14 20:57:51 +00:00
Owner

One idea that Zooko and I came up with at pycon last week was to make
downloads more efficient by rearranging the shares to append a copy
of the necessary block-hash-tree nodes (its "uncle chain") to each
block.

The current share layout puts the whole block-hash-tree at the end of
the share. To read block 0, the client must fetch the block data,
plus log(N) hashes from the BHT. If it then reads block 1, it doesn't
need to get any new hashes (it already has the necessary internal
nodes). Reading block 2 requires it to get one new hash. The number
of hash nodes needed for each block of a sequential read follows a
funny sort of inverse-binary pattern: I think it equals the number of
bits that change between the binary representations of the current
segnum and the previous one, so it's log(N)-ish at large powers of
two, and small for every other segment. This is pretty
visible in the download-timeline visualizer layout. This layout
minimizes the number of bytes needed to download any given segment
(download alacrity), and minimizes the storage overhead spent on
integrity data: 2*N hashes (where N is the number of segments:
ceil(filesize/128KiB)).
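For concreteness, here's a small sketch (my own toy node numbering, not the actual allmydata.hashtree code) that counts how many new hash nodes a sequential reader has to fetch for each segment under the current layout; for 8 segments it prints [3, 0, 1, 0, 2, 0, 1, 0], which is the inverse-binary pattern described above:

```python
# Toy model of the current layout: one block-hash-tree per share, reader
# fetches only the uncle hashes it doesn't already know. Nodes are numbered
# as a complete binary tree: root = 0, children of n are 2n+1 and 2n+2,
# leaf i sits at index (padded - 1 + i).

def uncle_chain(leaf, padded):
    """Sibling ("uncle") node indices needed to check one leaf against the root."""
    node = padded - 1 + leaf
    uncles = []
    while node > 0:
        uncles.append(node + 1 if node % 2 else node - 1)
        node = (node - 1) // 2
    return uncles

def ancestors(leaf, padded):
    """Internal nodes recomputed (and thus learned) while verifying this leaf."""
    node = padded - 1 + leaf
    result = []
    while node > 0:
        node = (node - 1) // 2
        result.append(node)
    return result

def new_hashes_per_segment(nsegments):
    padded = 1
    while padded < nsegments:
        padded *= 2
    known, counts = set(), []
    for seg in range(nsegments):
        chain = uncle_chain(seg, padded)
        counts.append(len(set(chain) - known))
        known.update(chain)
        known.update(ancestors(seg, padded))
        known.add(padded - 1 + seg)
    return counts

print(new_hashes_per_segment(8))   # -> [3, 0, 1, 0, 2, 0, 1, 0]
```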

The overhead of fetching those extra nodes has a noticeable effect on
the speed of the new immutable downloader. It's not the number of
bytes being fetched, but rather some per-request overhead: either
Foolscap serialization time, the method-invocation overhead, or
perhaps the extra disk IO (although I expect the filesystem to cache
that part pretty effectively).

We could instead put an extra copy of each chain on each block. So
block 0 would immediately be followed by the log(N)-long uncle-chain
necessary to do an independent "cold" verification of block 0, block
1 would be followed by the chain for block 1, etc. The hashes would
be sorted to put the higher-level nodes at the end, so that you'd
only need to read to the end of the chain on the first block.
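To make the proposed layout concrete, here's a hypothetical bit of offset arithmetic (the names and exact ordering are my invention, not an existing share format): each block is immediately followed by its full uncle chain with the higher-level nodes last, so a cold read of block i is one contiguous span, and a warm reader that already holds the upper levels can stop reading partway into the chain:

```python
# Hypothetical per-block layout: [block bytes][uncle chain, leaf-sibling
# first, top-level uncle last], repeated for each block.

HASH_SIZE = 32

def chain_len(padded):
    # every uncle chain in a tree padded to `padded` leaves has log2(padded) nodes
    return padded.bit_length() - 1

def cold_read_span(blocknum, block_size, padded):
    """(offset, length) of the single contiguous read that fetches block
    `blocknum` plus everything needed to verify it from a cold start."""
    per_block = block_size + chain_len(padded) * HASH_SIZE
    return blocknum * per_block, per_block

def warm_read_span(blocknum, block_size, padded, known_levels):
    """A reader that already holds the top `known_levels` uncles (from an
    earlier block's chain) can stop early, because the higher-level nodes
    are sorted to the end of each chain."""
    per_block = block_size + chain_len(padded) * HASH_SIZE
    needed = chain_len(padded) - known_levels
    return blocknum * per_block, block_size + needed * HASH_SIZE

# e.g. 8 segments with (roughly) 40 KiB blocks:
print(cold_read_span(0, 40960, 8))    # -> (0, 41056): block 0 plus 3 hashes
print(warm_read_span(1, 40960, 8, 3)) # -> (41056, 40960): block 1 data only
```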

Then the downloader could make exactly one contiguous read for block
0, a shorter single read for block 1, etc. The storage overhead would
grow: I think it would need N*log2(N) hashes, maybe N*(1+log2(N)); I'm
not quite sure. But I think this would be pretty reasonable. I need to
build a simulator and measure it.
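As a starting point for that simulator, here's a rough back-of-the-envelope sketch comparing the total hash nodes stored per share under the two layouts (it ignores the ciphertext hash tree and share hash chain, and assumes a tree padded to a power of two):

```python
# Current layout: one shared block-hash-tree per share, about 2*N nodes.
# Proposed layout: a full log2(N)-long uncle chain duplicated after every block.

def padded(n):
    p = 1
    while p < n:
        p *= 2
    return p

def current_layout_hashes(nsegments):
    # a complete binary tree over the padded leaf count has 2*padded - 1 nodes
    return 2 * padded(nsegments) - 1

def per_block_chain_hashes(nsegments):
    chain = padded(nsegments).bit_length() - 1   # log2 of padded leaf count
    return nsegments * chain

for n in (8, 64, 1024, 8192):   # 8192 segments is roughly a 1 GiB file at 128 KiB/segment
    print(n, current_layout_hashes(n), per_block_chain_hashes(n))
# 8     15     24
# 64    127    384
# 1024  2047   10240
# 8192  16383  106496
```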

The biggest problem is that this requires random-access during the
write process: the hash nodes aren't available until after the whole
file has been processed. So we'd need to enhance the storage-server
protocol to let you seek backwards to an earlier position in the file
and overwrite some placeholder bytes that you put there earlier. Or
define shares to be made up of multiple chunks and allow uploaders to
write those chunks out-of-order. Either way complicates the
share-upload process, and probably makes certain backends (e.g. S3)
trickier.
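Purely as an illustration of the "write placeholders, seek back, overwrite" flow (the class and method names here are hypothetical, not the real RIBucketWriter or storage-server API), the uploader side might look roughly like this:

```python
# Hypothetical enhanced writer: accepts writes at arbitrary offsets until
# close(). A backend that can't seek (e.g. S3) could buffer chunks and
# assemble them at close time, which is the "out-of-order chunks" variant.

HASH_SIZE = 32

class HypotheticalShareWriter:
    def __init__(self, share_size):
        self._buf = bytearray(share_size)   # zero-filled placeholders

    def write(self, offset, data):
        self._buf[offset:offset + len(data)] = data

    def close(self):
        return bytes(self._buf)

def upload_share(writer, blocks, chain_len, get_chain):
    """First pass: stream blocks in order, leaving placeholder space after
    each block for its uncle chain. Second pass: once the whole block hash
    tree is known, seek back and fill in each chain."""
    chain_bytes = chain_len * HASH_SIZE
    offsets = []
    pos = 0
    for block in blocks:
        writer.write(pos, block)
        offsets.append(pos + len(block))      # the chain will land here later
        pos += len(block) + chain_bytes
    for i, off in enumerate(offsets):
        # get_chain(i) is assumed to return the list of 32-byte uncle hashes
        # for block i, ordered with the higher-level nodes last
        writer.write(off, b"".join(get_chain(i)))
    return writer.close()
```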

We could also have the client encode the entire file locally, fill in
the hash nodes, then upload the resulting shares. But our experience
with Tahoe's predecessor (the Allmydata derivative of the
Mnet/Hivecache that we retroactively named "Mountain View") was that
this gives horrible disk access patterns: it was like filling in a
grid along one axis, then reading it back out along the other, and
the disk went into a massive seek frenzy. To fix this in Tahoe, we
moved the matrix-transposition burden to the network, and stream each
segment as we finish encoding it. So I'd definitely go for an updated
write-seek-write-close storage protocol before trying to encode the
whole file on the client side.

tahoe-lafs added the code-encoding, major, enhancement, 1.9.1 labels 2012-03-14 20:57:51 +00:00
tahoe-lafs added this to the undecided milestone 2012-03-14 20:57:51 +00:00
tahoe-lafs added the normal label and removed the major label 2012-03-29 19:10:35 +00:00
davidsarah commented 2012-03-29 19:22:12 +00:00
Author
Owner

Closely related to #1543; maybe we should merge these tickets.

Reference: tahoe-lafs/trac-2024-07-25#1687