tahoe-lafs/trac

storage: maybe store buckets as files, not directories #600


Open

opened 2009-01-28 23:39:27 +00:00 by warner · 5 comments

warner commented

2009-01-28 23:39:27 +00:00

Our current storage-server backend share-file format defines a "bucket" for
each storage index, into which some quantity of numbered "shares" are placed.
The "buckets" are each represented as a directory (named with the base32
representation of the storage index), and the shares are files inside that
directory. To make ext3 happier, these bucket directories are contained in a
series of "prefix directories", one for each two-letter base32-alphabet
string. So, if we are storing both share 0 and share 5 of storage index
"aktyxrieysdumjed2hoynwpnl4", they would be located in:

```
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/0
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/5
```

(there are two ways this makes ext3 happier: ext3 cannot have more than 32000
subdirectories in a single directory, and very large directories (lots of
child files or subdirectories) have very slow lookup times)
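
As a rough illustration of the layout described above, the path of a share can be derived from the storage index and share number like this. The function name `share_path` and its arguments are made up for this sketch and are not taken from the Tahoe-LAFS source:

```python
import os

def share_path(nodedir, si_b32, sharenum):
    """Path of one share under the current directory-per-bucket layout.

    Hypothetical helper for illustration only.
    """
    prefix = si_b32[:2]  # two-letter prefix directory, e.g. "ak"
    return os.path.join(nodedir, "storage", "shares", prefix,
                        si_b32, str(sharenum))

for shnum in (0, 5):
    print(share_path("NODEDIR", "aktyxrieysdumjed2hoynwpnl4", shnum))
```

The prefix directory caps how many bucket directories end up under any single parent, which is what keeps ext3 under its subdirectory limit.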

There is a certain amount of metadata associated with each bucket. For
mutable files, this includes the write-enabler. [edit: Both mutable and
immutable container files used to also contain lease information at the end
of the file, but that is no longer true on the leasedb branch, which will be
merged soon.]

To make share-migration easier, we originally decided to make the share files
stand alone, by placing this metadata inside the share files themselves,
even though the metadata is really attached to the bucket.
~~This unfortunately creates a danger for mutable files: some of the
metadata is located at the end of the share, and when the share is enlarged,
the server must copy the metadata to a new location within the file, creating
a window during which it might be shut down, and the metadata lost.~~

Since we might want to add even more metadata (the other-share-location
hints, described in #599), perhaps we should consider moving this
metadata to a separate file, so there would be one copy per bucket, rather
than one copy per share. One approach might be to place a non-numeric
"metadata" file in each bucket directory, so:

```
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/metadata
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/0
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/5
```

Another approach would be to stop using subdirectories for buckets
altogether, and include the share numbers in the metadata file:

```
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.metadata
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.0
NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.5
```

In this latter approach, the `get_buckets` query would be processed by
looking for an "$SI.metadata" file. If present, the file is opened and a list
of share numbers read out of it (as well as other metadata). Those share
numbers are then used to compute the filenames of the shares themselves, and
those files can then be opened.
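
A minimal sketch of that lookup, to make the steps concrete. The on-disk format of the metadata file is not defined anywhere in this ticket, so the JSON encoding and the `"shares"` key below are assumptions, as is the function name `get_buckets_flat`:

```python
import json
import os
import tempfile

def get_buckets_flat(sharedir, si_b32):
    """Return {sharenum: path} for one SI in the flat "$SI.metadata" layout.

    ASSUMPTION: the metadata file is JSON with a "shares" list. The ticket
    does not specify a format; this is only for illustration.
    """
    prefixdir = os.path.join(sharedir, si_b32[:2])
    metadata_path = os.path.join(prefixdir, si_b32 + ".metadata")
    try:
        with open(metadata_path, "r") as f:
            metadata = json.load(f)
    except FileNotFoundError:
        return {}  # no metadata file means we hold no shares for this SI
    # Compute each share's filename from the share numbers in the metadata.
    return {
        int(shnum): os.path.join(prefixdir, "%s.%d" % (si_b32, int(shnum)))
        for shnum in metadata.get("shares", [])
    }

# Usage: a throwaway share directory whose metadata lists sh0 and sh5.
si = "aktyxrieysdumjed2hoynwpnl4"
with tempfile.TemporaryDirectory() as sharedir:
    prefixdir = os.path.join(sharedir, si[:2])
    os.makedirs(prefixdir)
    with open(os.path.join(prefixdir, si + ".metadata"), "w") as f:
        json.dump({"shares": [0, 5]}, f)
    found = get_buckets_flat(sharedir, si)
    absent = get_buckets_flat(sharedir, "zzzzzzzzzzzzzzzzzzzzzzzzzz")

print(sorted(found))
```

Note that the lookup costs one open of a known filename rather than a directory listing, since the share numbers come from the metadata file itself.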

The first approach (SI/metadata) adds an extra inode and an extra block to
the total disk used per SI (probably 8kB). The second approach removes a
directory and adds a file, so the disk space use is probably neutral, except
that there are now multiple copies of the (long) SI-based filename, which
must be stored in the prefix directory's dnode. This approach also at least
doubles the number of children kept in each prefix directory, although they
will all be file children rather than subdir children, and ext3 does not
appear to have an arbitrary limit on the number of file children that a
single directory can hold (at least, not a small arbitrary limit like
32000).

Both of these approaches make an offline share-migration tool slightly
tougher: the tool must copy two files to a new server, not just one. The
second approach is doubly tricky, because the metadata file must be modified
(if, say, the sh0+sh5 pair are split up: the new metadata file must only
reference the share that actually lives next to it). On the other hand, since
metadata files will contain leases that are specific to a given server, they
will likely need to be rewritten anyways.
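
To illustrate the sh0+sh5 split, here is a sketch of how a migration tool might derive the two metadata records. The record shape (a dict with `"shares"` and `"leases"` keys) and the choice to start the moved copy with no leases are assumptions made for this example; the ticket only says that leases are server-specific and would likely need rewriting:

```python
def split_metadata(metadata, moved_shares):
    """Split one bucket's metadata when some of its shares move elsewhere.

    Returns (stays, goes): the record left on the old server and the record
    written on the new one. Hypothetical record shape, for illustration.
    """
    present = set(metadata["shares"])
    moved = set(moved_shares)
    missing = moved - present
    if missing:
        raise ValueError("shares not held here: %s" % sorted(missing))
    # Each record must only reference the shares that live next to it.
    stays = dict(metadata, shares=sorted(present - moved))
    # ASSUMPTION: leases are server-specific, so the new copy starts empty.
    goes = dict(metadata, shares=sorted(moved), leases=[])
    return stays, goes

stays, goes = split_metadata({"shares": [0, 5], "leases": ["lease-a"]}, [5])
print(stays["shares"], goes["shares"])
```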

The main benefit of moving the metadata to a separate file is to reduce the
complexity of the lease-maintenance code, by removing redundancy. With the
current scheme, the code that walks buckets (looking for expired leases, etc)
must really walk shares.


warner added the

labels 2009-01-28 23:39:27 +00:00

warner added this to the undecided milestone 2009-01-28 23:39:27 +00:00

daira commented

2013-07-17 13:45:32 +00:00

I'm not sure this ticket is any longer relevant for the leasedb branch. Brian?


warner was assigned by daira

2013-07-17 13:45:32 +00:00

daira commented

2013-07-17 13:50:54 +00:00

Actually I don't think the suggested change was desirable even pre-leasedb, because we want lease information to be per-share, not per-shareset, as discussed in #1816.


warner commented

2013-10-02 01:23:13 +00:00


Hm. Yeah, buckets are a thing of the past, and lease information
wants to be per-share, not per-anything-larger. Likewise any
metadata we might add in the future should be per-share too.

The real question is: how should the on-disk storage backend
organize its pieces? If we rely upon the leasedb to satisfy
"do-you-have-share" queries (which I think is good), then we don't
need to query the disk each time. We still need to query it for the
crawler, but that can be relatively slow, since it only happens in
the background.

Removing per-bucket subdirectories will probably slow down the
on-disk "do we know anything about this SI" query, because it
basically turns into a large readdir() and a grep through the
results (looking for a prefix-match on the SI). For our nominal
1M-share server, each prefix-directory contains 1k shares, and an
ideal one-share-per-server encoding will result in listing a
1k-entry directory for each query.

If people are doing crazy encodings that put lots of shares on each
server, we'll incur a larger lookup cost.
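
A sketch of what that on-disk query might look like without bucket directories, assuming shares are stored as "$SI.$shnum" files directly in the prefix directory (the naming from the second approach in the ticket description). The function name is made up, and the arithmetic assumes a 32-character base32 alphabet, which gives 32 * 32 two-letter prefix directories:

```python
import os
import tempfile

def shares_for_si(prefixdir, si_b32):
    """List share numbers held for one SI in a flat prefix directory."""
    wanted = si_b32 + "."
    found = []
    for name in os.listdir(prefixdir):  # reads the whole prefix directory
        if name.startswith(wanted):     # the "grep" for a prefix match
            suffix = name[len(wanted):]
            if suffix.isdigit():
                found.append(int(suffix))
    return sorted(found)

# Rough size of each listing for the nominal 1M-share server.
NUM_PREFIX_DIRS = 32 * 32
entries_per_prefix = 1000000 // NUM_PREFIX_DIRS

# Usage: one SI with sh0 and sh5, plus an unrelated share in the same prefix.
si = "aktyxrieysdumjed2hoynwpnl4"
with tempfile.TemporaryDirectory() as prefixdir:
    for name in (si + ".0", si + ".5", "ak" + "b" * 24 + ".1"):
        open(os.path.join(prefixdir, name), "w").close()
    shnums = shares_for_si(prefixdir, si)

print(shnums, NUM_PREFIX_DIRS, entries_per_prefix)
```

The cost of each query grows with the number of entries in the prefix directory, not with the number of shares actually held for that SI, which is why encodings that place many shares per server make it worse.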

So yeah, I think I'm +1 on changing the on-disk format to get rid
of the bucket directories. It should probably be driven by the
pluggable-backend-storage changes y'all (LAE) are making, though..
what would fit best with the scheme you've put together?


daira commented

2013-10-02 10:33:53 +00:00

Hmm. If the motivation for doing this is only performance, I'd like to see some measurements before doing anything. I suspect this would probably be way down the list of changes in order of performance improvement for a given effort (if there's any improvement at all).

For the cloud backend, there would be no performance benefit, only complexity hassle (either compatibility problems if we changed its object keys to match paths in the disk backend, or breakage in the tests if we let them diverge).


zooko commented

2013-10-02 16:46:34 +00:00

What's the motivation to change the layout? I don't think there is any metadata that is per-set-of-shares, is there? There is lease information which is held in the leasedb (and by the way is per-share, not per-set-of-shares), and then there are write-enablers which are held in the mutable shares and which are per-share, not per-set-of-shares. Anything else?


Reference: tahoe-lafs/trac#600
