bring back sizelimit (i.e. max consumed, not min free) #671


Open

opened 2009-03-29 04:07:10 +00:00 by zooko · 25 comments

zooko commented

2009-03-29 04:07:10 +00:00

We used to have a sizelimit option which would do a recursive examination of the storage directory at startup and calculate approximately how much disk space was used, and refuse to accept new shares if the disk space would exceed the limit. #34 shows when it was implemented. It was later removed because it took a long time -- about 30 minutes -- on allmydata.com storage servers, and the servers remained unavailable to clients during this period, and because it was replaced by the reserved_space configuration, which was very fast and which satisfied the requirements of the allmydata.com storage servers.

This ticket is to reintroduce sizelimit because [//pipermail/tahoe-dev/2009-March/001493.html some users want it]. This might mean that the storage server doesn't start serving clients until it finishes the disk space inspection at startup.

Note that sizelimit would impose a maximum limit on the amount of space consumed by the node's storage/shares/ directory, whereas reserved_space imposes a minimum limit on the amount of remaining available disk space. In general, reserved_space can be implemented by asking the OS for filesystem stats, whereas sizelimit must be implemented by tracking the node's own usage and accumulating the sizes over time.

To close this ticket, you do not need to implement some sort of interleaving of inspecting disk space and serving clients.

To close this ticket, you MUST NOT implement any sort of automatic deletion of shares to get back under the sizelimit if you find yourself over it (for example, if the user has changed the sizelimit to be lower after you've already filled it to the max), but you SHOULD implement some sort of warning message to the log if you detect this condition.
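To make the distinction concrete, here is a minimal sketch of the two checks. It is not the old `sizelimit` code or the current `reserved_space` code; the function names are hypothetical, `os.statvfs` is Unix-only, and the walk measures apparent file sizes rather than blocks actually allocated on disk.

```python
import logging
import os

log = logging.getLogger("sizelimit_sketch")  # hypothetical logger name


def free_space(path):
    """reserved_space-style check: ask the OS how much room is left."""
    st = os.statvfs(path)  # Unix-only
    return st.f_bavail * st.f_frsize


def consumed_space(shares_dir):
    """sizelimit-style check: add up the sizes of all files under shares_dir."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(shares_dir):
        for name in filenames:
            try:
                total += os.stat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # the file vanished while we were walking
    return total


def may_accept(share_size, consumed, sizelimit):
    """Accept a new share only if it keeps usage within sizelimit.

    Never deletes anything. If usage is already over the limit (for
    example because the limit was lowered), it only logs a warning.
    """
    if consumed > sizelimit:
        log.warning("storage uses %d bytes, over sizelimit of %d bytes",
                    consumed, sizelimit)
        return False
    return consumed + share_size <= sizelimit
```

The first function is a single system call, while the second has to touch every share, which is why the recursive walk was slow on large servers.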


zooko added the

labels 2009-03-29 04:07:10 +00:00

zooko added this to the 1.6.0 milestone 2009-03-29 04:07:10 +00:00

warner commented

2009-11-30 21:43:47 +00:00

(updated description)

Note that any sizelimit code is allowed to speed things up by remembering state from one run to the next. The old code did the slow recursive-traversal sharewalk to handle the (important) case where this state was inaccurate or unavailable (i.e. when shares had been deleted by some external process, or to handle the local-fs-level overhead that accounts for the difference between what /bin/ls and /bin/df each report). But we could trade off accuracy for speed: it should be acceptable to just ensure that the sizelimit is eventually approximately correct.

A modern implementation should probably use the "share crawler" mechanism, doing a stat on each share, and adding up the results. It can store state in the normal crawler stash, probably in the form of a single total-bytes value per prefixdir. The do-I-have-space test should use max(last-pass, current-pass), to handle the fact that the current-pass value will be low while the prefixdir is being scanned. The crawler would replace this state on each pass, so any stale information would go away within a few hours or days.

Ideally, the server code should also keep track of new shares that were written into each prefixdir, and add the sizes of those shares to the state value, but only until the next crawler pass had swung by and seen the new shares. You'd also want to do something similar with shares that were deleted (by the lease expirer). To accomplish this, you'd want to make a ShareCrawler subclass that tracks this extra space in a per-prefixdir dict, and have the storage-server/lease-expirer notify it every time a share was created or deleted. The ShareCrawler subclass is in the right position to know when the crawler has reached a bucket.
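The bookkeeping described above might look roughly like this. It is a standalone sketch, not a real `ShareCrawler` subclass: the class and method names are hypothetical, persistence in the crawler stash is left out, and a share written while its own prefixdir is being scanned may be missed until the next pass.

```python
class PrefixSpaceTracker:
    """Per-prefixdir space accounting (hypothetical, in-memory only)."""

    def __init__(self):
        self.last_pass = {}     # prefix -> bytes found by the last complete scan
        self.current_pass = {}  # prefix -> bytes counted so far by a running scan
        self.pending = {}       # prefix -> net bytes added/deleted since last scan

    # Called by the storage server / lease expirer.
    def share_added(self, prefix, size):
        self.pending[prefix] = self.pending.get(prefix, 0) + size

    def share_deleted(self, prefix, size):
        self.pending[prefix] = self.pending.get(prefix, 0) - size

    # Called by the crawler as it walks one prefixdir.
    def start_prefix(self, prefix):
        self.current_pass[prefix] = 0

    def saw_share(self, prefix, size):
        self.current_pass[prefix] += size

    def finish_prefix(self, prefix):
        # The fresh scan replaces stale state and absorbs the pending delta.
        self.last_pass[prefix] = self.current_pass.pop(prefix)
        self.pending.pop(prefix, None)

    def estimate(self):
        prefixes = set(self.last_pass) | set(self.current_pass) | set(self.pending)
        total = 0
        for prefix in prefixes:
            # max(last-pass, current-pass): the current value is low mid-scan.
            scanned = max(self.last_pass.get(prefix, 0),
                          self.current_pass.get(prefix, 0))
            total += max(0, scanned + self.pending.get(prefix, 0))
        return total

    def has_space(self, share_size, sizelimit):
        return self.estimate() + share_size <= sizelimit
```

With this shape the server can answer the do-I-have-space question immediately at startup from whatever state it has, and the estimate converges as the crawler completes each prefixdir.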

Doing this with the crawler would also have the nice side-effect of balancing fast startup with accurate size limiting. Even though this ticket has been defined as not requiring such a feature, I'm sure users would appreciate it.


warner changed title from ~~sizelimit~~ to bring back sizelimit (i.e. max consumed, not min free)

2009-11-30 21:43:47 +00:00

zooko commented

2009-12-13 05:18:15 +00:00

Author

Brian: did you intend to put this into Milestone 1.6? I assume not, so I'm moving it to eventually. Apologies if you meant to put it here and feel free to move it back.


zooko modified the milestone from 1.6.0 to eventually

2009-12-13 05:18:15 +00:00

daira commented

2010-12-30 22:53:31 +00:00

#1285 asks for the df command on a Tahoe filesystem mounted over SFTP to show some estimate for the space used on a grid (as well as the space available). However, by default we shouldn't slow down the startup process of storage servers in order to achieve that.

Note that on a conventional filesystem, the total size of files corresponds roughly to the amount of space used (ignoring per-file overhead). On a Tahoe filesystem, the latter is usually greater than the former by the expansion factor, N/k. However if the encoding parameters have changed, or if different gateways are using different parameters, then dividing the total space used by the current N/k on a given gateway would lead to an inaccurate estimate of total file size.
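A small worked example of that inaccuracy, as a sketch only: the file sizes and encoding parameters below are made up for illustration, and per-share overhead (headers, hashes) is ignored.

```python
from fractions import Fraction


def space_used(file_size, k, n):
    """Space consumed across the grid by one file encoded k-of-N."""
    return file_size * Fraction(n, k)


def estimate_file_size(total_used, k, n):
    """Divide total space used by the gateway's current N/k."""
    return total_used * Fraction(k, n)


# Two files uploaded with different (hypothetical) encoding parameters.
used = space_used(3000, 3, 10) + space_used(1000, 1, 2)  # 10000 + 2000 bytes

# A gateway currently configured for 3-of-10 estimates 3600 bytes of files,
# although 4000 bytes were actually stored.
naive = estimate_file_size(used, 3, 10)
```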

Both the total file size and the total space usage are potentially interesting. If we are periodically crawling all shares as this ticket suggests, then it is not significantly more difficult to compute both (under the assumption that N shares are stored for each file, which is true if the shares are optimally balanced).

OTOH, perhaps the total size of files and the total space usage are just not important enough to do all this work to compute them, given that storing shares on a separate filesystem is sufficient to achieve the goal of limiting total space usage.

OTGH, long-term preservation is improved by occasionally crawling all shares to ensure that they can still be read. (That requires actually reading the shares rather than just the metadata, though.)


daira commented

2011-10-11 03:05:36 +00:00

See also #940 (share-crawler should estimate+display space-used).

daira commented

2012-10-25 20:49:52 +00:00

Our current plan is to support this using the [leasedb](https://github.com/davidsarah/tahoe-lafs/blob/666-accounting/docs/specifications/leasedb.rst).

daira modified the milestone from eventually to 1.11.0

2012-10-25 20:50:06 +00:00

daira self-assigned this 2012-10-25 20:50:06 +00:00

zooko commented

2012-12-14 20:21:19 +00:00

Author

The next step is to implement #1836, then we can use that to implement this ticket!

zooko commented

2012-12-14 20:27:30 +00:00

Author

#1043 was a duplicate of this.

daira modified the milestone from 1.11.0 to 1.12.0

2013-08-13 22:53:01 +00:00

markberger commented

2013-08-19 17:40:07 +00:00

I'm working on this ticket here: https://github.com/markberger/tahoe-lafs/tree/671-bring-back-sizelimit


daira commented

2013-08-19 23:51:18 +00:00

Also needs documentation in source:docs/configuration.rst.

daira removed their assignment 2013-08-19 23:51:31 +00:00

markberger was assigned by daira

2013-08-19 23:51:31 +00:00

markberger commented

2013-08-22 23:15:22 +00:00

I've started to write tests for this patch, but the share overhead seems to be pretty high. When I write a 1000 byte share, leasedb is reporting the share size to be 4098 bytes. Is this the expected behavior?

daira commented

2013-08-22 23:36:07 +00:00

Yes, that's expected behaviour. Filesystems can be surprisingly inefficient. Which fs are you using?

zooko commented

2013-08-22 23:46:48 +00:00

Author

Let's see… currently that is set by [accounting_crawler](https://github.com/LeastAuthority/tahoe-lafs/blame/318b34aac6f7780f7d23cdb6ac3f53fcaf2f27dd/src/allmydata/storage/accounting_crawler.py#L104) (see #1835 to make it so the lease gets added into the leasedb immediately at the same time as the share is added to the store instead of later by a crawler) and I see from [elsewhere in accounting_crawler](https://github.com/LeastAuthority/tahoe-lafs/blame/318b34aac6f7780f7d23cdb6ac3f53fcaf2f27dd/src/allmydata/storage/accounting_crawler.py#L39) that it is using the return value from [the share's get_used_space()](https://github.com/LeastAuthority/tahoe-lafs/blame/ce24a56283c715570613a8cf38605e0c83027ad0/src/allmydata/interfaces.py#l491).

Here are the implementations of `get_used_space()` in the 1819-cloud-merge-opensource branch:

* [null backend](https://github.com/LeastAuthority/tahoe-lafs/blame/ce24a56283c715570613a8cf38605e0c83027ad0/src/allmydata/storage/backends/null/null_backend.py#L138) (returns 0)
* [disk mutable](https://github.com/LeastAuthority/tahoe-lafs/blame/bc2c1c103a1fb798881f31e8bfd2efbe7222500c/src/allmydata/storage/backends/disk/mutable.py#L124) (calls [fileutil.get_used_space(filepath)](https://github.com/LeastAuthority/tahoe-lafs/blame/affc9739ec9e83b011a69f8389c5d3552c3b81bd/src/allmydata/util/fileutil.py#L453))
* [disk immutable](https://github.com/LeastAuthority/tahoe-lafs/blame/ac74881c27c2be1355c347e459f747eb7150da68/src/allmydata/storage/backends/disk/immutable.py) (calls `fileutil.get_used_space(home)+fileutil.get_used_space(finalhome)`; That's interesting! The home/finalhome distinction is that during the upload of an immutable file it is written into a location named home, and only after the upload is complete is it mv'ed into finalhome)
* [cloud_common](https://github.com/LeastAuthority/tahoe-lafs/blame/eaa6e22358f1fcf924b3b26094029fad14d72df7/src/allmydata/storage/backends/cloud/cloud_common.py) (returns just the share's size)

Okay, check out the implementation of `fileutil.get_used_space`.

So, in answer to your question:

> When I write a 1000 byte share, leasedb is reporting the share size to be 4098 bytes. Is this the expected behavior?

Yes. ☺
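For readers who want to see why on-disk usage can exceed a share's length, here is a rough sketch of measuring allocated space instead of apparent size. It is an illustration, not the `fileutil.get_used_space` implementation linked above; the function names are hypothetical, and the 4096-byte block size in the usage note is an assumption about the filesystem.

```python
import os


def used_space(path):
    """Bytes allocated on disk for path, or 0 if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return 0
    blocks = getattr(st, "st_blocks", None)
    if blocks is None:
        # Platforms without st_blocks: fall back to the apparent size.
        return st.st_size
    return blocks * 512  # st_blocks counts 512-byte units


def round_up_to_block(size, block_size):
    """Smallest multiple of block_size that can hold size bytes."""
    return -(-size // block_size) * block_size
```

Under the assumed 4096-byte block size, `round_up_to_block(1000, 4096)` gives 4096, so a 1000-byte share would occupy a whole block before any further filesystem metadata is counted.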

markberger commented

2013-08-23 01:03:03 +00:00

Replying to daira:

> Yes, that's expected behaviour. Filesystems can be surprisingly inefficient. Which fs are you using?

I'm using ext4 but I didn't even think about fs overhead. I assumed the overhead was created by tahoe. Thanks daira.

And thanks for the detailed trace zooko.


markberger commented

2013-08-23 19:26:35 +00:00

My branch now has tests: https://github.com/markberger/tahoe-lafs/tree/671-bring-back-sizelimit

Note that my branch is based on #1836 which also needs to be reviewed.


joepie91 commented

2013-10-14 22:24:01 +00:00

Owner

markberger, first of all, thanks for the patch!

I've been reviewing it (diff against 72b49750d95b0ca01321e8cd0e2bc93cd0c71165), and in web/storage.py, in StorageStatus.render_JSON, the bucket-counter element is changed to always return None. While this appears to be for compatibility reasons when looking at the context, it might be wise to clarify as such in a comment :)


markberger commented

2013-10-15 21:40:08 +00:00

Hi joepie91, thanks for pointing that out.

I'm really busy this week, but I will add some comments this weekend.


markberger commented

2013-10-22 03:34:28 +00:00

I added the comment joepie91 suggested. It can be found on the pull request.

joepie91 commented

2013-10-22 04:13:16 +00:00

Owner

Alright, great. I do have to note... I was told there would be intentional backdoor easter eggs to check how well patches were reviewed, but I have not run across any. I hope this is a good thing, and that it means there aren't any :)

Also, I'll be removing the review-needed tag, but note that I haven't reviewed #1836, which this one is based on; #1836 is still awaiting review.

amontero commented

2013-12-12 17:49:49 +00:00

Owner

Posted a comment on a related article at http://bitcartel.wordpress.com/2012/10/21/rbic-redundant-bunch-of-independent-clouds pointing to this ticket. Its use case might benefit from this.

Lcstyle commented

2014-09-24 04:29:21 +00:00

Owner

I've been looking at #648, and I support the functionality of a size limit.
Looks like #1836 is intended to address the long delays in calculating exactly how much space the storage node is actually using.

As far as what happens if a node originally given a limit of for example 1TB and actually grows to this size, and afterwards an admin seeks to reduce the limit to 500GB, there should be some functionality that allows shares to be transferred or copied to other servers to accommodate the storage node's request to downsize or shrink. It might take time to shrink the node, but at least it would provide the capability of a graceful way of resolving the issue. I don't know if such functionality (or a FR for this functionality) already exists.


zooko commented

2014-09-24 04:40:46 +00:00

Author

Replying to Lcstyle:

> Looks like #1836 is intended to address the long delays in calculating exactly how much space the storage node is actually using.

Yes. And there is a good patch for #1836, but it isn't merged into trunk yet.

> As far as what happens if a node originally given a limit of for example 1TB and actually grows to this size, and afterwards an admin seeks to reduce the limit to 500GB, there should be some functionality that allows shares to be transferred or copied to other servers to accommodate the storage node's request to downsize or shrink. It might take time to shrink the node, but at least it would provide the capability of a graceful way of resolving the issue. I don't know if such functionality (or a FR for this functionality) already exists.

There is another ticket about that, #864.
