"tahoe backup" on the same immutable content when some shares are missing does not repair that content. #2035

Open
opened 2013-07-23 16:46:43 +00:00 by nejucomo · 20 comments
nejucomo commented 2013-07-23 16:46:43 +00:00
Owner

From srl295 on IRC:

Interesting behavior- Had some stuff on our grid backed up with 'tahoe backup'  which creates Latest, Archive etc- some of them are immutable dirs.   Then, I lost a drive.  So that particular storage node went down and never came back.   Some directories were irrecoverable, OK fine. I create new directories and run tahoe backup again on the new directory URI.    However, I'm still getting errors on deep check. - ERROR: NotEnoughSharesError(ran out of shares: complete= pending=Share(sh1-on-sslg5v) overdue= unused= need 2. Last failure: None)   -- I wonder, is tahoe reusing some unrecoverable URIs since the same immutable directory was created?
tahoe-lafs added the unknown, normal, defect, 1.10.0 labels 2013-07-23 16:46:43 +00:00
tahoe-lafs added this to the undecided milestone 2013-07-23 16:46:43 +00:00
Author
Owner

I think it was set to need 2, happy 2, total 3 on 1.9.2 when the original directory upload happened. Same settings under 1.10 when the failure and re-publish happened.

nejucomo commented 2013-07-23 16:49:05 +00:00
Author
Owner

Also from IRC, a recommended reproduction:

a good test case would probably be- upload an immutable file, make it unhealthy or unrecoverable, then later try to upload it again
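A minimal sketch of that reproduction, driving the standard CLI from Python; it assumes a running local test grid, and the file name and the share-deletion step are placeholders:

```python
# Hypothetical reproduction sketch, driving the tahoe CLI from Python.
# Assumes a running local grid; "testfile.dat" and the share-deletion step
# are placeholders for whatever damage is inflicted on the storage nodes.
import subprocess

def tahoe(*args):
    """Run a tahoe CLI command and return its stdout as text."""
    return subprocess.check_output(("tahoe",) + args, text=True)

# 1. Upload an immutable file and remember its cap.
cap = tahoe("put", "testfile.dat").strip()

# 2. Make it unhealthy or unrecoverable, e.g. stop a storage node or delete
#    some of its shares out of band, then confirm the damage:
print(tahoe("check", cap))

# 3. Upload the same content again (or run "tahoe backup" over it) and
#    re-check; this ticket is about the re-run not restoring lost shares.
tahoe("put", "testfile.dat")
print(tahoe("check", cap))
```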
nejucomo commented 2013-07-23 17:05:34 +00:00
Author
Owner

More IRC from nejucomo (me):

Ah!  I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded.  Then it checks that as an optimization.  If you can find that cache db, try renaming it, then rerun the backup.
nejucomo commented 2013-07-23 17:22:39 +00:00
Author
Owner

Replying to nejucomo:

More IRC from nejucomo (me):

Ah!  I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded.  Then it checks that as an optimization.  If you can find that cache db, try renaming it, then rerun the backup.

srl verified that removing backupdb.sqlite, deleting the backup directories, and then rerunning backup successfully stored their data into the same immutable caps.

Therefore I propose that this is a bug in the backupdb caching logic. If possible, it should verify the health of items in the cache. If this is expensive, maybe it could be opt-in behavior with a command-line option.

I'm going to update the keywords to reflect this new information.
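A minimal sketch of that idea, assuming the node's web gateway is at its default local URL; the helper name and the exact JSON fields consulted are illustrative, not the real backupdb code:

```python
# Hypothetical sketch: verify a cached cap's health via the gateway's
# check operation before trusting the backupdb entry. NODE_URL, the helper
# name, and the JSON fields consulted are assumptions, not Tahoe internals.
import json
import urllib.parse
import urllib.request

NODE_URL = "http://127.0.0.1:3456"

def cap_seems_healthy(cap):
    """Ask the gateway to check `cap`; return True only if it reports healthy."""
    url = "%s/uri/%s?t=check&output=JSON" % (NODE_URL, urllib.parse.quote(cap))
    with urllib.request.urlopen(url, data=b"") as resp:  # t=check is a POST
        report = json.load(resp)
    return bool(report.get("results", {}).get("healthy"))

# backupdb idea: only reuse the cached filecap when cap_seems_healthy(cap);
# otherwise fall through to a fresh upload so missing shares get re-placed.
```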

nejucomo commented 2013-07-23 17:26:30 +00:00
Author
Owner

I'm not certain the current keywords are accurate. I attempted to err on the side of caution and apply them liberally.

  • I removed upload because I believe upload does the right thing.
  • repair may not be relevant because although this is about repairing backups, it's not using any specialized repair mechanism outside of immutable-dedup upload.
  • I added usability because, without knowing the trick of nuking backupdb.sqlite, users may believe they've successfully made a backup even though some files remain unrecoverable due to the cache.
Author
Owner

Replying to nejucomo (comment 5):

Therefore I propose this is a bug in backupdb caching logic. If possible it should verify the health of items in the cache. If this is expensive, maybe it could be an opt-in behavior with a commandline option.

A backup-and-verify mode would be nice. I would think that backup could be efficient here: check whether the shares are there before reusing its cache.

Also note the trick is to nuke the database AND unlink the directories, so that trick probably can't work while preserving the Archived items.

daira commented 2013-07-23 19:03:02 +00:00
Author
Owner

The backupdb is a performance hack to avoid the latency cost of asking servers whether each file exists on the grid. If the latter were fast enough (which would probably require batching requests for multiple files), then it wouldn't be needed. (tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.)

In the meantime, how about adding a --repair option to tahoe backup, which would bypass the backupdb-based conditional upload and upload/repair every file?
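A minimal sketch of what that option could mean for the backup loop; the backupdb call names here are made up for illustration, not the real internals:

```python
# Hypothetical sketch of a "--repair" mode for "tahoe backup": skip the
# backupdb short-circuit so every file goes through a full upload, which for
# immutable content also re-places any missing shares.
def back_up_file(path, backupdb, upload, repair=False):
    if not repair:
        cached_cap = backupdb.lookup(path)   # hypothetical cache lookup
        if cached_cap is not None:
            return cached_cap                # trust the cache, skip uploading
    cap = upload(path)                       # always upload in --repair mode
    backupdb.record(path, cap)               # hypothetical cache update
    return cap
```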

daira commented 2013-07-23 19:10:29 +00:00
Author
Owner

Hmm, it looks from this code in the method BackupDB_v2.check_file in source:src/allmydata/scripts/backupdb.py, as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only ignoring timestamps.

Perhaps we just need to rename --ignore-timestamps or document it better?

daira commented 2013-07-23 19:13:05 +00:00
Author
Owner

Sorry, intended to paste the code:

        if ((last_size != size
             or not use_timestamps
             or last_mtime != mtime
             or last_ctime != ctime) # the file has been changed
            or (not row2) # we somehow forgot where we put the file last time
            ):
            c.execute("DELETE FROM local_files WHERE path=?", (path,))
            self.connection.commit()
            return FileResult(self, None, False, path, mtime, ctime, size)

So when not use_timestamps, the existing db entry is deleted and the FileResult has None for the existing file URI. (Note that we still might not repair the file very well; see #1382.)

Author
Owner

Replying to daira:

Hmm, it looks from this code in the method BackupDB_v2.check_file in source:src/allmydata/scripts/backupdb.py, as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only ignoring timestamps.

Perhaps we just need to rename --ignore-timestamps or document it better?

Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.

zooko commented 2013-07-24 15:01:17 +00:00
Author
Owner

Replying to srl (comment 13):

Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.

Unlink them from where?

Author
Owner

Replying to zooko (comment 15):

Replying to srl (comment 13):

Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.

Unlink them from where?

I unlinked the Latest and Archives directories that tahoe backup created.

daira commented 2013-09-01 16:42:01 +00:00
Author
Owner

Nitpick: use "uploading" for immutable files or shares, and "publishing" for (versions of) mutable files or shares.

markberger: do any of your improvements address this?

tahoe-lafs modified the milestone from undecided to 1.12.0 2013-09-01 16:42:01 +00:00
tahoe-lafs changed title from Publishing the same immutable content when some shares are unrecoverable does not repair that content. to Uploading the same immutable content when some shares are unrecoverable does not repair that content. 2013-09-01 16:42:01 +00:00
tahoe-lafs added the code-encoding label and removed the unknown label 2013-09-01 16:42:14 +00:00
daira commented 2013-09-01 18:23:15 +00:00
Author
Owner

It's unclear to me whether this is just a duplicate of other bugs (e.g. #1130 and #1124) that are being fixed in #1382, or whether it is a separate problem in tahoe backup.

zooko commented 2013-09-01 21:55:27 +00:00
Author
Owner

Replying to daira:

It's unclear to me whether this is just a duplicate of other bugs (e.g. #1130 and #1124) that are being fixed in #1382, or whether it is a separate problem in tahoe backup.

I think this is a different problem from #1382. I think this problem has to do with the fact that "tahoe backup" inspects its local cache "backupdb" and decides that the file is already backed up, and then does not issue any network requests, which would otherwise allow it to find out that the file is damaged or even broken.

If that's the issue, possible solutions include:

  • a "backup-and-check" or "backup-and-verify" feature, as mentioned in comment:134192 and other comments,
  • causing all checks and verifies to update the backupdb and record the fact that the file was detected as damaged or broken; then, the next time you run "tahoe backup", tahoe could notice this recorded fact and automatically trigger the "-and-check" or "-and-verify" behavior (sketched below)

Changing the Summary of this ticket to reflect what I think the issue is.
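A rough sketch of that second idea, with an invented column name and a simplified table layout (the real backupdb schema differs):

```python
# Hypothetical sketch: let check/verify record their outcome in the backupdb,
# and let "tahoe backup" distrust cache entries whose last known check was
# unhealthy. Table and column names are illustrative, not the real schema.
import sqlite3

conn = sqlite3.connect("backupdb.sqlite")

# One-time, hypothetical schema extension.
conn.execute("ALTER TABLE caps ADD COLUMN last_check_healthy INTEGER")

def record_check_result(filecap, healthy):
    conn.execute("UPDATE caps SET last_check_healthy=? WHERE filecap=?",
                 (1 if healthy else 0, filecap))
    conn.commit()

def cache_entry_trusted(filecap):
    row = conn.execute("SELECT last_check_healthy FROM caps WHERE filecap=?",
                       (filecap,)).fetchone()
    # Re-upload if we positively know the file was unhealthy at last check.
    return not (row is not None and row[0] == 0)
```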

tahoe-lafs changed title from Uploading the same immutable content when some shares are unrecoverable does not repair that content. to "tahoe backup" on the same immutable content when some shares are unrecoverable does not repair that content. 2013-09-01 21:55:27 +00:00
daira commented 2013-09-01 22:26:27 +00:00
Author
Owner

Shares can be missing; only files/directories can be unrecoverable.

tahoe-lafs changed title from "tahoe backup" on the same immutable content when some shares are unrecoverable does not repair that content. to "tahoe backup" on the same immutable content when some shares are missing does not repair that content. 2013-09-01 22:26:27 +00:00
daira commented 2013-09-02 01:34:08 +00:00
Author
Owner

Oh, I missed this:

nejucomo (comment 5):

srl verified that removing backupdb.sqlite, deleting the backup directories, and then rerunning backup successfully stored their data into the same immutable caps.

So it's definitely the backupdb logic.

tahoe-lafs modified the milestone from 1.12.0 to soon 2013-09-02 01:34:08 +00:00
amontero commented 2013-11-28 01:40:30 +00:00
Author
Owner

I strongly +1 an "ignore-backupdb" kind of option that could ensure all files are uploaded at backup time without any backupdb optimization, even at the added time cost.
If I'm not wrong, unmodified files would produce shares identical to the already-stored ones, and no bandwidth would be used.

daira commented 2013-11-28 22:14:07 +00:00
Author
Owner

See also #1331 (--verify option for tahoe backup).

tlhonmey commented 2018-08-21 21:56:16 +00:00
Author
Owner

"(tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.) "

If the random check is to be helpful, then when it encounters a file that the backupdb says should be there, but isn't, it should discard the backupdb and start over assuming that it needs to check every file.

It would also be good to be able to specify a frequency for the random checking since the size and composition of the data in question affects what the best tradeoff is between speed and thoroughness.
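A minimal sketch of a tunable random-check policy; the option name and default here are invented:

```python
# Hypothetical sketch: make the probabilistic re-check rate configurable,
# e.g. via an (invented) "--check-fraction" option, so users can trade
# backup speed against how quickly silent share loss is noticed.
import random

def should_recheck(check_fraction=0.01):
    """Re-verify roughly `check_fraction` of otherwise-cached files per run."""
    return random.random() < check_fraction
```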

"(tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.) " If the random check is to be helpful, then when it encounters a file that the backupdb says should be there, but isn't, it should discard the backupdb and start over assuming that it needs to check every file. It would also be good to be able to specify a frequency for the random checking since the size and composition of the data in question affects what the best tradeoff is between speed and thoroughness.