lease-expiring share crawler #633

New issue

Closed

opened 2009-02-18 18:26:27 +00:00 by warner · 8 comments

warner commented

2009-02-18 18:26:27 +00:00

The IO load of reading a terabyte of shares is considerable (I estimate 5
hours of continuous effort to just see the filenames, and 12 hours to read a
few bytes from each share), so this process needs to be careful to rate-limit
itself. It should slowly cycle its way through all the share directories,
using some persistent marker so that server restarts don't cause it to lose
all progress. The rate should be tunable, and the process should provide some
indication of how long it is likely to take to get through the entire ring.
Alternately, the config file should provide a target ring-traversal time, and
the server should adjust its rate to try and hit the target (i.e. measure how
long each prefixdir takes and slow down or speed up as necessary).

A note on rates:

Our overall GC strategy uses periodically-renewed time-limited leases and
lease-expiration-based GC. If lease expiration were free (i.e. if it
didn't take significant CPU time to walk the list of leases to find the
ones that had expired, like if we put them in an efficiently-sorted
database), then the process would nominally have two input parameters:
lease renewal time and lease duration. If expiration is expensive (as will
be the case to begin with), then really we have renewal time, duration,
and frequency of expiration checking. We can treat a non-infinite check
frequency as simply extending the average lease duration.
if we plot lease duration on the X axis, and renewal time on the Y axis,
then we have three clear tradeoff axes. The -Y/+Y axis is network traffic:
more frequent renewal means more network bandwidth with renewal messages,
and we want to be on the +Y side. The -X/+X axis is garbage: shorter
leases and faster expiration means garbage goes away faster, so we want to
be on the -X side. The +(X-Y)/-(X-Y) axis is safety: if leases aren't
renewed much faster than they expire, then we risk losing files (if a
renewal is late or missed), so we want to be on the +(X-Y) side (i.e
closer to the bottom right than to the top left). Clearly there is no one
place that optimizes all three.
my current working values for these parameters is a renewal time of one
month, and a lease duration of three months. Tahoe's server code is
currently hardwired to use lease durations of one month, so I'm also
thinking about using a renewal time of two weeks.

We don't want to turn on lease-expiration until we're sure that we've got the
lease-renewal code running properly. I'm not yet sure if that means we should
just have a tahoe.cfg flag to enable expiration (and have it default to False
for now), or if there should be a more runtime-based control (either a web
page with a button, or a control.furl method).

I anticipate future code that will need to do something to all shares (like
upgrade them to a new format, or accumulate lease/quota information in a
database) in a process that will take a considerable amount of time (a day or
two), so I'd like to keep that in mind while writing this code. The
persistent "where am I" mechanism would be useful to allow an
upgrade/accumulate process to survive restarts cleanly, and would help us
correctly handle pre-change vs post-change events (i.e. a share is added to
prefixdir AA, which has already been scanned by the quota-into-database
process, so the db should be updated. But a share being added to prefixdir
ZZ, which has not yet been scanned, should not go into the db).

A general framework for performing a process on all shares in the ring would
be useful. Lease expiration is a continuous process, but the upgrade job
described above would be a once-around-the-ring-then-finish process. Both
could be driven by the same mechanism and have similar status displays.

The web page that shows expiration status (or other scan-all-shares jobs)
should indicate where in the ring we're currently working (i.e. which
prefixdir was last processed), how fast the process is going, ETA until the
end of the ring, ETA until a full loop occurs. If this is the first time
we've traversed the ring, that fact should be clearly displayed, so that the
"ETA until end of the ring" really means ETA-till-process-complete.

It should also show status from the job itself. For lease expiration, it
should show how many leases are being expired, how many shares are being
deleted, and how much space is being reclaimed.

As part of the GC effort I'm working on now, the Tahoe storage server needs to have a background process which slowly looks for leases which have expired and removes them. When the last lease is removed from a share, the share is removed too. The IO load of reading a terabyte of shares is considerable (I estimate 5 hours of continuous effort to just see the filenames, and 12 hours to read a few bytes from each share), so this process needs to be careful to rate-limit itself. It should slowly cycle its way through all the share directories, using some persistent marker so that server restarts don't cause it to lose all progress. The rate should be tunable, and the process should provide some indication of how long it is likely to take to get through the entire ring. Alternately, the config file should provide a target ring-traversal time, and the server should adjust its rate to try and hit the target (i.e. measure how long each prefixdir takes and slow down or speed up as necessary). A note on rates: * Our overall GC strategy uses periodically-renewed time-limited leases and lease-expiration-based GC. If lease expiration were free (i.e. if it didn't take significant CPU time to walk the list of leases to find the ones that had expired, like if we put them in an efficiently-sorted database), then the process would nominally have two input parameters: lease renewal time and lease duration. If expiration is expensive (as will be the case to begin with), then really we have renewal time, duration, and frequency of expiration checking. We can treat a non-infinite check frequency as simply extending the average lease duration. * if we plot lease duration on the X axis, and renewal time on the Y axis, then we have three clear tradeoff axes. The -Y/+Y axis is network traffic: more frequent renewal means more network bandwidth with renewal messages, and we want to be on the +Y side. The -X/+X axis is garbage: shorter leases and faster expiration means garbage goes away faster, so we want to be on the -X side. The +(X-Y)/-(X-Y) axis is safety: if leases aren't renewed much faster than they expire, then we risk losing files (if a renewal is late or missed), so we want to be on the +(X-Y) side (i.e closer to the bottom right than to the top left). Clearly there is no one place that optimizes all three. * my current working values for these parameters is a renewal time of one month, and a lease duration of three months. Tahoe's server code is currently hardwired to use lease durations of one month, so I'm also thinking about using a renewal time of two weeks. We don't want to turn on lease-expiration until we're sure that we've got the lease-renewal code running properly. I'm not yet sure if that means we should just have a tahoe.cfg flag to enable expiration (and have it default to False for now), or if there should be a more runtime-based control (either a web page with a button, or a control.furl method). I anticipate future code that will need to do something to all shares (like upgrade them to a new format, or accumulate lease/quota information in a database) in a process that will take a considerable amount of time (a day or two), so I'd like to keep that in mind while writing this code. The persistent "where am I" mechanism would be useful to allow an upgrade/accumulate process to survive restarts cleanly, and would help us correctly handle pre-change vs post-change events (i.e. a share is added to prefixdir AA, which has already been scanned by the quota-into-database process, so the db should be updated. But a share being added to prefixdir ZZ, which has not yet been scanned, should *not* go into the db). A general framework for performing a process on all shares in the ring would be useful. Lease expiration is a continuous process, but the upgrade job described above would be a once-around-the-ring-then-finish process. Both could be driven by the same mechanism and have similar status displays. The web page that shows expiration status (or other scan-all-shares jobs) should indicate where in the ring we're currently working (i.e. which prefixdir was last processed), how fast the process is going, ETA until the end of the ring, ETA until a full loop occurs. If this is the first time we've traversed the ring, that fact should be clearly displayed, so that the "ETA until end of the ring" really means ETA-till-process-complete. It should also show status from the job itself. For lease expiration, it should show how many leases are being expired, how many shares are being deleted, and how much space is being reclaimed.

warner added the

labels 2009-02-18 18:26:27 +00:00

warner added this to the 1.5.0 milestone 2009-02-18 18:26:27 +00:00

warner self-assigned this 2009-02-18 18:26:27 +00:00

zooko commented

2009-02-18 20:19:55 +00:00

Hm... You know, maintaining a (semi-)sorted list can be cheap. What about this, for example:

We have a separate directory called "expiries". In it, there is a hierarchy of directories, the first layer is a set of directories named by the current unix timestamp at a megasecond granularity. Inside each "megasecond" directory there is a set of directories named by the kilosecond, and inside each "kilosecond" directory there is a set of directories named by the second. Inside each "second" directory is a set of 0-length files whose names are the storage indices of all the shares which are due to expire that second.

Whenever you update the lease on a share, you add that share's SI into the new expiry directory, remove its SI from the old expiry directory, and update the expiry stamp stored with the share itself. That's it. You could also remove the SI from the expiry directory whenever you remove a lease on a share.

Now whenever you want to find shares whose leases have expired, you need only look at the appropriate megasecond, kilosecond, and second, thus saving twelve hours of grovelling through all the shares looking for expired leases.

Note that the failure modes introduced by this scheme are "soft" because the expiries directory can be thought of as merely a "cache" of the canonical expiry timestamps which are stored with each share. Corruption of the expiries directory never results in premature deletion of a share, since you always check the canonical timestamp from the share itself before deletion. Corruption of the expiries directory can result in failure to delete an expired share, but this is usually less costly than the other kind of failure, and if can always be corrected by performing one of those 12-hour-grovels to fix or regenerate the expiries directory.

Hm... You know, maintaining a (semi-)sorted list can be cheap. What about this, for example: We have a separate directory called "expiries". In it, there is a hierarchy of directories, the first layer is a set of directories named by the current unix timestamp at a megasecond granularity. Inside each "megasecond" directory there is a set of directories named by the kilosecond, and inside each "kilosecond" directory there is a set of directories named by the second. Inside each "second" directory is a set of 0-length files whose names are the storage indices of all the shares which are due to expire that second. Whenever you update the lease on a share, you add that share's SI into the new expiry directory, remove its SI from the old expiry directory, and update the expiry stamp stored with the share itself. That's it. You could also remove the SI from the expiry directory whenever you remove a lease on a share. Now whenever you want to find shares whose leases have expired, you need only look at the appropriate megasecond, kilosecond, and second, thus saving twelve hours of grovelling through all the shares looking for expired leases. Note that the failure modes introduced by this scheme are "soft" because the expiries directory can be thought of as merely a "cache" of the canonical expiry timestamps which are stored with each share. Corruption of the expiries directory never results in premature deletion of a share, since you always check the canonical timestamp from the share itself before deletion. Corruption of the expiries directory *can* result in failure to delete an expired share, but this is usually less costly than the other kind of failure, and if can always be corrected by performing one of those 12-hour-grovels to fix or regenerate the expiries directory.

warner commented

2009-02-18 20:44:10 +00:00

Author

Attachment lease-tradeoffs.png (32678 bytes) added

diagram showing lease duration/expiration tradeoffs, between traffic, garbage, and safety

**Attachment** lease-tradeoffs.png (32678 bytes) added diagram showing lease duration/expiration tradeoffs, between traffic, garbage, and safety

lease-tradeoffs.png

32 KiB

warner commented

2009-02-18 21:21:15 +00:00

Author

For example, we don't need to maintain an exact sorted order: if leases on A
and B both don't expire for a month, we don't care (right now) whether A
comes first or B does.. we can put off that sort for a couple of weeks.
Likewise we don't care about timestamp resolution smaller than a day.

I definitely like having the share contain the canonical lease information,
and using the ancillary data structures merely as a cache. If we were to go
with a traditional database (sqlite or the like), then I'd have the DB
contain a table with (storageindex, leasedata, expirationtime), with an index
on both storageindex and expirationtime, and the daily or hourly query would
then be "SELECT storageindex FROM table WHERE expirationtime < now". We'd
read the real lease data from the share before acting upon it (which incurs
an IO cost, but share expiration is relatively infrequent, and the safety
benefits are well worth it).

Given the large number of shares we're talking about (a few million per
server), I'm hesitant to create a persistent data structure that needs one
file per share. The shares themselves are already wasting GBs of space on the
minimum block size overhead. Mind you, ext3 is pretty good about zero-length
files, a quick test shows that it spends one 4kB block for each 113 files
(each named with the same length as one of our storage index strings, 26
bytes, which means ext3's per-file overhead is an impressively-small 10.25
bytes), so a million would take about 36MB.. not too bad.

Having a separate directory for each second would probably result in a
million directories, but a tree of expire-time directories (as you described)
that only goes down to the kilosecond might be reasonably-sized. It would
still require a slow initial crawl to set up, though.

Incidentally, a slow-share-crawler could also be used to do local share
verification (slowly read and check hashes on all local shares, to discover
local disk failures before the filecap holder gets around to doing a
bandwidth-expensive remote verification), and even server-driven repair (ask
other servers if they have other shares for this file, perform ciphertext
repair if it looks like the file needs it). Hm, note to self: server-driven
repair should create new shares with the same lease expiration time as the
original shares, so that it doesn't cause a garbage file to live forever like
some infectious epidemic.

Hm, yeah, there are a number of optimizations that can take advantage of the fact that we're allowed to delete shares late. You can think of this as another factor in the tradeoff diagram I just attached to this ticket: with marginally increased complexity, we can reduce the CPU/diskIO costs, by increasing the lease expiration time. For example, we don't need to maintain an exact sorted order: if leases on A and B both don't expire for a month, we don't care (right now) whether A comes first or B does.. we can put off that sort for a couple of weeks. Likewise we don't care about timestamp resolution smaller than a day. I definitely like having the share contain the canonical lease information, and using the ancillary data structures merely as a cache. If we were to go with a traditional database (sqlite or the like), then I'd have the DB contain a table with (storageindex, leasedata, expirationtime), with an index on both storageindex and expirationtime, and the daily or hourly query would then be "SELECT storageindex FROM table WHERE expirationtime < now". We'd read the real lease data from the share before acting upon it (which incurs an IO cost, but share expiration is relatively infrequent, and the safety benefits are well worth it). Given the large number of shares we're talking about (a few million per server), I'm hesitant to create a persistent data structure that needs one file per share. The shares themselves are already wasting GBs of space on the minimum block size overhead. Mind you, ext3 is pretty good about zero-length files, a quick test shows that it spends one 4kB block for each 113 files (each named with the same length as one of our storage index strings, 26 bytes, which means ext3's per-file overhead is an impressively-small 10.25 bytes), so a million would take about 36MB.. not too bad. Having a separate directory for each second would probably result in a million directories, but a tree of expire-time directories (as you described) that only goes down to the kilosecond might be reasonably-sized. It would still require a slow initial crawl to set up, though. Incidentally, a slow-share-crawler could also be used to do local share verification (slowly read and check hashes on all local shares, to discover local disk failures before the filecap holder gets around to doing a bandwidth-expensive remote verification), and even server-driven repair (ask other servers if they have other shares for this file, perform ciphertext repair if it looks like the file needs it). Hm, note to self: server-driven repair should create new shares with the same lease expiration time as the original shares, so that it doesn't cause a garbage file to live forever like some infectious epidemic.

zooko commented

2009-02-18 21:49:43 +00:00

It isn't so easy to remove an element from a set, but we don't really need to do that for this application, and if we do need to do it it isn't that hard -- you just have to scan the whole file to find the SI you want to remove.

Yeah, if the common operations are just appending an SI to a set, and consuming an entire set, then you could easily implement this with each set being a file instead of a directory of 0-length files. Open the file and append the (binary, fixed-length) SI to it to add the element the set. Read through the whole file, processing each SI, and then rm it to consume the set. It isn't so easy to *remove* an element from a set, but we don't really need to do that for this application, and if we do need to do it it isn't that hard -- you just have to scan the whole file to find the SI you want to remove.

warner commented

2009-02-18 23:01:49 +00:00

Author

If we have multiple crawlers, it might be a good idea to combine them
into a single crawler, basically to improve locality of reference and be
kinder to the filesystem's directory cache.

Hm, so it feels like a crawler is either a transition tool (used to
first populate the lease DB, or convert shares to a new format, or
something), or a fallback/error-recovery tool (to detect problems in the
DB, or rebuild it after it gets corrupted), or something to use in the
interim until we build ourselves a fast database (like for a
share-counter, or a local-share-verifier). Maybe it is not deserving of
the hassle of merging multiple crawlers into a single one.

Some tests I just ran on a prodnet storage server (pt4.st4, with about
1TB of shares) show that it takes about 130-200ms to list the buckets in
each prefixdir (with a lukewarm cache.. with a hot one, it's closer to
17ms). There are 1040 prefixdirs, and on this server each one has an
average of 2460 buckets, giving us about 2.56M buckets total. Actually
listing the shares in a prefixdir takes considerably longer, more like
55 seconds, since it requires accessing all 2460 bucketdirs, which
suggests that merely enumerating every share on this server would take
57ksec, or 16 hours. And doing a stat() on every file in a prefixdir
takes 76s, which suggests all bucketdirs would take 79ksec, or 22 hours.
A hot cache again brings down the stat() time considerably, to about
100ms per prefixdir.

Reading something from each file takes even longer. The other data point
I have is from several months ago, and I don't remember which server it
was run on. What I seem to remember was 5 hours to do a 'find' of all
shares, and 12 hours to create a "share catalog", which must read the
header and leases from each share.

The normal upload/download/do-you-have-block traffic of a tahoe storage
server will cause most of the prefixdirs to be cached (this is the
"lukewarm" state I mentioned above), so the crawler can assume that it
will be cheap to learn the bucketdir names. To do anything with the
actual shares, the crawler will have to bring the bucketdir contents
into memory, which should be assumed to be fairly expensive.

Hm, come to think of it, a share-crawler would also be useful to keep track of how many shares are being managed by this server. At the moment we have some broken scripts that try to estimate this by watching a couple of prefixdirs on a few servers. A crawler which loops once every few days could give us a better estimate. Of course, we could get the same information out of a lease DB in O(1) time, in exchange for complexity and a constant-time overhead per share add/remove. If we have multiple crawlers, it might be a good idea to combine them into a single crawler, basically to improve locality of reference and be kinder to the filesystem's directory cache. Hm, so it feels like a crawler is either a transition tool (used to first populate the lease DB, or convert shares to a new format, or something), or a fallback/error-recovery tool (to detect problems in the DB, or rebuild it after it gets corrupted), or something to use in the interim until we build ourselves a fast database (like for a share-counter, or a local-share-verifier). Maybe it is not deserving of the hassle of merging multiple crawlers into a single one. Some tests I just ran on a prodnet storage server (pt4.st4, with about 1TB of shares) show that it takes about 130-200ms to list the buckets in each prefixdir (with a lukewarm cache.. with a hot one, it's closer to 17ms). There are 1040 prefixdirs, and on this server each one has an average of 2460 buckets, giving us about 2.56M buckets total. Actually listing the shares in a prefixdir takes considerably longer, more like 55 seconds, since it requires accessing all 2460 bucketdirs, which suggests that merely enumerating every share on this server would take 57ksec, or 16 hours. And doing a stat() on every file in a prefixdir takes 76s, which suggests all bucketdirs would take 79ksec, or 22 hours. A hot cache again brings down the stat() time considerably, to about 100ms per prefixdir. Reading something from each file takes even longer. The other data point I have is from several months ago, and I don't remember which server it was run on. What I seem to remember was 5 hours to do a 'find' of all shares, and 12 hours to create a "share catalog", which must read the header and leases from each share. The normal upload/download/do-you-have-block traffic of a tahoe storage server will cause most of the prefixdirs to be cached (this is the "lukewarm" state I mentioned above), so the crawler can assume that it will be cheap to learn the bucketdir names. To do anything with the actual shares, the crawler will have to bring the bucketdir contents into memory, which should be assumed to be fairly expensive.

warner commented

2009-03-24 20:22:25 +00:00

Author

Current trunk has code to do what I want. It isn't fast: a prodnet machine with 1TB of shares (about 3M objects) with a crawler limited to 10% runtime is looking to take about 14 days to cycle all the way around. But it's working.

Rows
Columns