tahoe-lafs/trac

consider removing some st_* fields from metadata #1946

Open

opened 2013-04-17 20:44:51 +00:00 by daira · 15 comments

daira commented

2013-04-17 20:44:51 +00:00

Here is typical metadata for a Tahoe directory entry:

  "metadata": {
   "st_gid": 1000,
   "ctime": 1360790214.51,
   "st_ino": 264045,
   "st_dev": 2052,
   "st_mode": 33204,
   "mtime": 1360790214.5,
   "st_uid": 1000
  }

We should consider removing the st_gid, st_uid, st_ino, and st_dev fields, since they are not very meaningful and may constitute a privacy or anonymity risk for some uses of Tahoe.
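As a rough illustration of the proposal, the following sketch drops the four fields from a metadata dictionary shaped like the example above. The helper name `strip_sensitive_metadata` and the constant are hypothetical and are not part of Tahoe-LAFS; this only shows the intended effect on the stored entry.

```python
# Hypothetical helper, not part of Tahoe-LAFS: illustrates which keys
# would disappear from a directory entry's metadata under this proposal.
PRIVACY_SENSITIVE_KEYS = frozenset({"st_gid", "st_uid", "st_ino", "st_dev"})


def strip_sensitive_metadata(metadata):
    """Return a copy of `metadata` without the fields proposed for removal."""
    return {k: v for k, v in metadata.items() if k not in PRIVACY_SENSITIVE_KEYS}


# The example entry from the ticket description.
example = {
    "st_gid": 1000,
    "ctime": 1360790214.51,
    "st_ino": 264045,
    "st_dev": 2052,
    "st_mode": 33204,
    "mtime": 1360790214.5,
    "st_uid": 1000,
}

cleaned = strip_sensitive_metadata(example)
print(sorted(cleaned))
```

Only `ctime`, `mtime`, and `st_mode` survive; the input dictionary is left untouched.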

daira added this to the soon milestone 2013-04-17 20:44:51 +00:00

elb commented

2013-04-17 20:47:55 +00:00

Owner

I would advocate retaining a way to set those fields, and restore them, as they are useful and even desirable when using Tahoe for backups. They need not be stored by default for this use case.

daira commented

2013-04-17 20:59:05 +00:00

Author

Note that for backup, we really need to store more than this, in particular permissions. (Maybe we can just support that for Unix; Windows NT ACLs are complicated.) So the amount of data we are currently storing falls between two stools.

daira commented

2013-04-17 21:00:49 +00:00

Author

Replying to daira:

> Note that for backup, we really need to store more than this, in particular permissions. (Maybe we can just support that for Unix; Windows NT ACLs are complicated.)

Oh, sorry, Unix permissions are what st_mode is.
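To make that concrete, the `st_mode` value from the example metadata (33204) can be decoded with Python's standard `stat` module. This is only a small illustration of what the field encodes, not code from Tahoe itself.

```python
import stat

st_mode = 33204  # value taken from the example metadata in this ticket

# st_mode packs the file type and the permission bits into one integer.
is_regular_file = stat.S_ISREG(st_mode)   # file-type portion
permission_bits = stat.S_IMODE(st_mode)   # permission portion only
symbolic = stat.filemode(st_mode)         # "ls -l" style rendering

print(is_regular_file, oct(permission_bits), symbolic)
# prints: True 0o664 -rw-rw-r--
```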

daira commented

2013-04-17 21:02:15 +00:00

Author

elb: for the record, I would want my 'tahoe backup' to record gid uid and mode

zooko: But, I use "tahoe backup" as my normal way to upload things to the grid.

zooko: Even though I consider uid and gid meaningless once the file has been uploaded to the tahoe-lafs grid.

cehteh: you could add another 'privacy' attribute which other tools then respect

gdt commented

2013-04-17 22:34:42 +00:00

Owner

I think this discussion is pointing out a fundamental confusion in tahoe. On one hand, it's a filesystem, which doesn't implement POSIX semantics, and we're told that FUSE/vfs access is troubling. On the other hand, it has a builtin backup program, even though there aren't really any good reasons to blur backup programs and filesystems (consider bup with the BUPDIR on tahoe).

zooko commented

2013-04-17 23:32:43 +00:00

I think tahoe backup was originally conceived as being, well, for "backup". And there was originally the idea that there would eventually be a tahoe restore. The assumption of a "backup and restore" use case is that the set of files and directories probably won't be shared or published while it is backed-up, and that it will eventually be restored to the same or a "similar" system as the one it came from.

However, I don't use tahoe backup that way. I use it as a good way to publish files to my LAFS grid. I almost always use tahoe backup whenever I'm uploading files, and I never "restore" them, and I often share them. Even if a tahoe restore command existed, and even if I decided to use it, then the system I was restoring to might not have the same uid and gid set as the original system.

So why do I use tahoe backup then if not for backup? Well, it is faster than "tahoe cp" or "tahoe put" or FUSE because it maintains a local cache db of files that it has already backed-up. It also keeps time-stamp-keyed snapshots of all previous versions that have been written to the grid. These are two very nice features.
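The speed-up from a local cache can be sketched roughly as follows. The schema, the function names, and the placeholder capability string are all assumptions made for illustration; the real `tahoe backup` database may differ in every detail. The idea is simply that a file whose path, size, and mtime match a previous run does not need to be uploaded again.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a hypothetical cache of already-uploaded files."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS uploaded ("
        " path TEXT PRIMARY KEY, size INTEGER, mtime REAL, filecap TEXT)"
    )
    return db


def lookup(db, path, size, mtime):
    """Return the cached capability if the file looks unchanged, else None."""
    row = db.execute(
        "SELECT filecap FROM uploaded WHERE path=? AND size=? AND mtime=?",
        (path, size, mtime),
    ).fetchone()
    return row[0] if row else None


def record(db, path, size, mtime, filecap):
    """Remember that `path` was uploaded with the given size and mtime."""
    db.execute(
        "INSERT OR REPLACE INTO uploaded VALUES (?, ?, ?, ?)",
        (path, size, mtime, filecap),
    )
    db.commit()


db = open_cache()
first = lookup(db, "notes.txt", 120, 1360790214.5)       # not cached yet
record(db, "notes.txt", 120, 1360790214.5, "URI:placeholder")  # made-up cap
second = lookup(db, "notes.txt", 120, 1360790214.5)      # cache hit
changed = lookup(db, "notes.txt", 121, 1360790214.5)     # size differs
```

A hit lets the tool reuse the stored capability instead of re-reading and re-uploading the file; any change in size or mtime falls through to a fresh upload.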

What if we create a command named something like tahoe snapshot or tahoe mirror that has those two features, but does not have the st_* metadata, which I do not think is meaningful in my use case, and which could be a privacy leak? An advantage of this new tahoe snapshot command is that then more people would discover its existence and its usefulness, even when they have a "publish" kind of use-case instead of a "backup" kind of use case.

If you like this ticket, you might also like #1865, #897, or even [/query?status=!closed&keywords=~tahoe-backup&order=priority all open tickets about tahoe-backup]

zooko commented

2013-04-17 23:33:17 +00:00

Here's why I don't like the way uid and gid are currently stored here. This is similar to Daira's complaint about falling between two stools. It might be cool to store extra information if it were unambiguously interpretable. For example, if you had uid, gid, and "UUID of the disk partition from which the root filesystem was loaded", then you might later be able to tell (in a hypothetical future "restore" command) whether the stored uid and gid could be meaningfully copied back into the target of the restore. Or maybe that wouldn't work, I don't know. Maybe instead you need to store a copy of /etc/passwd and /etc/group so that you can check whether the target system has a sufficiently similar entry for this uid and gid as the source system had?

But in any case I don't like "floating pieces of data which have broken from their anchors", because you can never safely re-anchor them, except by guessing or by asking a human user to guess. To me, uid and gid numbers without any way to recognize their context are that sort of "floating pieces of data". I know it's the Unix Way, but that doesn't mean I have to like it. Also, that tradition originated in a use case where you might reasonably expect the sysadmin to write down what is needed to recognize their context (i.e. the name of the system from which this tarball was produced), and that's less true, at least in my experience, of the way tahoe backup is used.
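One way to picture the "anchoring" idea is a check that a stored uid still means the same user on the restore target. The sketch below is purely hypothetical: the record layout (`uid` plus `username`) and the function name are assumptions, and nothing like this exists in Tahoe today.

```python
def owner_is_restorable(stored, target_passwd):
    """Return True if the stored uid maps to the same username on the target.

    `stored` is a hypothetical record such as {"uid": 1000, "username": "alice"}.
    `target_passwd` maps uid -> username on the system being restored to.
    """
    return target_passwd.get(stored["uid"]) == stored["username"]


stored_owner = {"uid": 1000, "username": "alice"}

same_system = {0: "root", 1000: "alice"}
other_system = {0: "root", 1000: "bob"}

ok = owner_is_restorable(stored_owner, same_system)
mismatch = owner_is_restorable(stored_owner, other_system)
```

On a Unix system the `target_passwd` mapping could be built from the standard `pwd` module, for example `{e.pw_uid: e.pw_name for e in pwd.getpwall()}`. As noted in the discussion, a matching name is a heuristic and does not fully solve the problem.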

elb commented

2013-04-18 02:17:21 +00:00

Owner

zooko:

I get that, but I think it neglects the use case where a user really is backing up and restoring a particular system. That's a use case I care about. Now, I don't care if the stored information is a uid or a username or gid or group name or what (and the symbolic names may provide some measure of robustness to the disassociation you describe, but can't really be said to fix the problem), but I do care that the information is stored.

gdt commented

2013-04-18 13:04:44 +00:00

Owner

Stepping back, I think metadata gets used for three things, and we should think about them separately:

* Normal VFS operations, like any other filesystem. This is hard, and I think we should look to other filesystems to see what they do about uid/gid and username mappings. zooko's point about unanchored uids is totally sensible. AFS is perhaps a good example, because while it lacks the cypherpunk flavor, it has large scale. But it's a bit centralized in terms of administration.
* Wide-scale sharing. Here, filesystems blur into publishing systems.
* Backup. Here, we want to store metadata to be used later, but it's really part of the data of the backup, not metadata about files.

My complaint about 'tahoe backup' is that I don't see any good reason to couple the backup program and the filesystem. All the reasons about local storage of metadata about what's there, etc. should apply to any back-end storage.

zooko commented

2013-04-20 02:13:22 +00:00

By the way, I have, in one tree, 695,000 files. Can that be right? Anyway, I have a lot of them, so the space cost of a little bit of metadata on each one might be significant for me.

elb commented

2013-04-20 16:09:52 +00:00

Owner

I just checked one of the trees I want to back up; 1.1M files. That's certainly an argument for some efficiency in metadata, anyway. :-)

daira commented

2013-04-20 18:47:51 +00:00

Author

Replying to zooko:

> By the way, I have, in one tree, 695,000 files. Can that be right? Anyway, I have a lot of them, so the space cost of a little bit of metadata on each one might be significant for me.

A deep-stats operation will tell you the total size of directories, which is an upper bound on the amount of space that could possibly be saved by reducing metadata, in the "size-directories" field.
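A back-of-the-envelope version of that bound might look like the sketch below. It assumes the deep-stats result has been loaded into a dictionary and that, besides "size-directories", it carries "count-files" and "count-directories" keys (an assumption about the result format). The numbers in `sample` are invented for illustration, apart from the 695,000 file count quoted above.

```python
def metadata_saving_upper_bound(deep_stats):
    """Return (total directory bytes, rough bytes per directory entry).

    The total size of all directories is an upper bound on what trimming
    metadata could save, since directories also hold names and caps.
    """
    total_dir_bytes = deep_stats["size-directories"]
    entries = deep_stats["count-files"] + deep_stats["count-directories"]
    per_entry = total_dir_bytes / entries if entries else 0.0
    return total_dir_bytes, per_entry


# Illustrative values only; not measured from any real grid.
sample = {
    "size-directories": 140_000_000,
    "count-files": 695_000,
    "count-directories": 5_000,
}

total, per_entry = metadata_saving_upper_bound(sample)
print(total, per_entry)
```

The per-entry figure is only a rough average, but it gives a ceiling to compare against the size of the `st_*` fields in each entry.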

zooko commented

2013-04-23 03:34:41 +00:00

Replying to gdt:

> On the other hand, it has a builtin backup program, even though there aren't really any good reasons to blur backup programs and filesystems (consider bup with the BUPDIR on tahoe).

gdt, you might be interested in [//pipermail/tahoe-dev/2008-September/000814.html this old thread], where I failed to deter Brian from inventing tahoe backup. The thing is, neither bup, duplicity, nor any of the other ones that I named would have the same behavior that tahoe backup does. I use tahoe backup a lot, and I like its behavior, and I suspect we couldn't get the same behavior by composing a backup tool and a storage system through a generic POSIX API.

zooko commented

2013-04-23 03:42:58 +00:00

Replying to elb:

> I get that, but I think it neglects the use case where a user really is backing up and restoring a particular system. That's a use case I care about. Now, I don't care if the stored information is a uid or a username or gid or group name or what (and the symbolic names may provide some measure of robustness to the disassociation you describe, but can't really be said to fix the problem), but I do care that the information is stored.

Maybe that use case is better served by an archive-oriented backup tool such as duplicity or duplicati instead of tahoe backup. All of the other backup tools that I know of are archive-oriented instead of file-oriented. tahoe backup, on the other hand, makes a separate copy of each file that it encounters. So there are a bunch of things which archivers (including venerable "tar") can handle that tahoe backup can't handle, such as symlinks. Traditional backup is better thought of as something that you do to a filesystem or at least a subtree, rather than to a set of independent files. They also achieve much better efficiency by compressing between files and across subsequent versions than tahoe backup does, for the same reason -- tahoe backup can't compress across files or across subsequent versions of the filesystem, because it has to store every file as an independently addressable and accessible entity.
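The efficiency point can be demonstrated with a toy example using Python's standard `zlib` module. This is not how any of the named tools work internally; it only shows that files sharing content compress better as one stream than as independent objects, which is the trade-off described above. The file contents are made up.

```python
import zlib

# 256 distinct bytes: shared by every file, but not compressible on its own.
shared = bytes(range(256))

# Three made-up files that differ only in a short tail.
files = [shared + b"first file\n", shared + b"second file\n", shared + b"third file\n"]

# File-oriented storage: each file is compressed as an independent entity.
separate = sum(len(zlib.compress(f)) for f in files)

# Archive-oriented storage: one stream, so later files can refer back
# to content already seen in earlier files.
together = len(zlib.compress(b"".join(files)))

print(separate, together)
```

The combined stream comes out smaller because the shared bytes are encoded once and referenced afterwards, whereas independently addressable files each have to carry their own copy.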

You can use duplicity or duplicati to archive, delta-compress, store metadata, etc. and use their tahoe-lafs backends to securely store the actual data. That sounds like a good solution for this use case.

I kind of think that Brian was right to invent tahoe backup, because it is cool and useful, but that I was right that it isn't a good solution for the "backup a filesystem" use case. Maybe it should be renamed to tahoe mirror or tahoe snapshot.
