tahoe-lafs/trac-2024-07-25

1.3.0

Release focus: checker/verifier/repairer:

checking: count the number of available shares, without reading their contents
verifying: read share contents and check their hashes
repair: create new shares as necessary to replace bad or missing ones.
- Mutable shares are repaired in place. Note that mutable repair requires
  a write-cap, to make sure the write-enabler shared secrets are created
  correctly. It would be nice to be able to repair from just a read-cap or
  a verify-cap, but this may need to wait until we switch to DSA mutable
  files, and/or change the way we control server-side write access.
- Immutable shares must be manually deleted from the storage servers, so
  repair needs a mechanism to report which shares should be examined and
  removed. Immutable repair really means creating new shares to make up
  for the bad ones.
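
The three levels above can be sketched roughly as follows. This is a minimal illustration only: the `Share` class, the single flat SHA-256 per share, and the function names are assumptions made for the example, not the actual share format or checker API.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Share:
    """Hypothetical stand-in for one share held by a storage server."""
    sharenum: int
    data: bytes
    expected_sha256: str  # assumption: one flat hash per share


def check(shares):
    """Checking: count distinct share numbers without reading contents."""
    return len({s.sharenum for s in shares})


def verify(shares):
    """Verifying: read each share and compare its hash to the expected one."""
    good, bad = [], []
    for s in shares:
        ok = hashlib.sha256(s.data).hexdigest() == s.expected_sha256
        (good if ok else bad).append(s.sharenum)
    return good, bad


def shares_to_create(good_sharenums, total_shares):
    """Repair planning: share numbers that are bad or missing."""
    return sorted(set(range(total_shares)) - set(good_sharenums))
```

With four expected shares, one of them missing and one corrupted, `check` would still count three shares, while `verify` is what notices the corrupted one; `shares_to_create` then lists both the missing and the bad share numbers.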
There should be a "check" button for each file to initiate a check, or a
verify, with or without auto-repair. The page that this button displays
should contain the results of the operation: which shares were found
where, how much verification was performed, whether repair was deemed
necessary, whether repair was actually done, and the success or failure of
the repair operation.
There should be a "deep-check" button for each directory to perform a
recursive traversal and check/verify/repair everything reachable from that
point. The page this returns should show aggregate information about the
check/repair: a count of how many files/dirs were examined, how many were
healthy, how many had problems, etc. The page should have a line or two
about each problem.
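
A rough sketch of the aggregate a deep-check could collect. The `Node` class, its `healthy`/`problem` fields, and the summary keys are hypothetical and chosen for the example; a real traversal would check each object instead of reading a precomputed flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical file or directory; files have no children."""
    name: str
    healthy: bool = True
    problem: str = ""
    children: list = field(default_factory=list)


def deep_check(root):
    """Visit everything reachable from root and aggregate the results."""
    summary = {"examined": 0, "healthy": 0, "unhealthy": 0, "problems": []}
    seen = set()
    stack = [(root, root.name)]
    while stack:
        node, path = stack.pop()
        if id(node) in seen:
            # An object may be reachable by more than one path; count it once.
            continue
        seen.add(id(node))
        summary["examined"] += 1
        if node.healthy:
            summary["healthy"] += 1
        else:
            summary["unhealthy"] += 1
            # One line per problem, as the results page would show.
            summary["problems"].append(f"{path}: {node.problem}")
        for child in node.children:
            stack.append((child, f"{path}/{child.name}"))
    return summary
```

The `seen` set is there because the same file or directory can be linked from several places, and the counts should not be inflated by revisiting it.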
There should be machine-parseable versions of these buttons: POST
operations that return JSON with the same information as the
human-targeted HTML described above.
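
As an illustration of what such a JSON response might carry, here is a small sketch that serializes the per-file results listed earlier. The key names and the function are invented for the example and are not a defined wire format.

```python
import json


def render_check_results(shares_found, verified, repair_needed,
                         repair_attempted, repair_successful):
    """Serialize one file's check results; all key names are hypothetical."""
    report = {
        # which shares were found where: share number -> server ids
        "shares-found": {str(num): sorted(servers)
                         for num, servers in shares_found.items()},
        # how much verification was performed
        "verified": verified,
        "repair-needed": repair_needed,
        "repair-attempted": repair_attempted,
        # None when no repair was attempted
        "repair-successful": repair_successful,
    }
    return json.dumps(report, indent=1, sort_keys=True)
```

Building the HTML page and the JSON from the same results object would keep the two views from drifting apart.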
Serious problems (like hash failures) should be automatically reported to
some centralized Incident Gatherer, so we can discover bugs, failing
drives, or malice.
allmydata should be able to run a periodic checker/repairer on customer
rootcaps. We should be able to count the number of missing/bad shares and
track it over time (to inform us of the impact of bouncing/moving storage
servers, discover failing drives, etc). We need to find out how long the
full check/verify/repair process takes, so we can decide upon a suitable
repeat rate (perhaps once a month).
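
Once the per-object cost has been measured, choosing a repeat rate is simple arithmetic. A sketch, with made-up parameter names and a made-up rule of thumb that a full pass should take at most half of the repeat interval:

```python
SECONDS_PER_DAY = 86400.0


def full_pass_days(num_objects, seconds_per_object, parallelism=1):
    """Estimated wall-clock days for one full check/verify/repair pass."""
    return num_objects * seconds_per_object / parallelism / SECONDS_PER_DAY


def fits_interval(pass_days, interval_days=30.0, max_fraction=0.5):
    """True if a pass uses at most max_fraction of the repeat interval.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    return pass_days <= interval_days * max_fraction
```

For example, 864000 objects at a measured one second each would take ten days with no parallelism, which fits a monthly schedule under this rule; the inputs are placeholders, not measurements.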
To handle bad immutable shares, we should add a 'tahoe check-share'
command that can be run on the storage-server side and check all the
hashes of a single share file on disk. If file-verify observes a bad hash,
we should be able to go to the disk and use this tool to see if the
problem is transient or persistent, to make decisions about the stability
of that disk.
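
The transient-versus-persistent distinction could be approximated by hashing the same file several times and comparing the outcomes. This sketch assumes a single expected SHA-256 for the whole file, which is a simplification and not the real share layout; the function names are hypothetical.

```python
import hashlib


def hash_file(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def classify_share(path, expected_hex, attempts=3):
    """Re-read the file several times and classify the result."""
    results = [hash_file(path) == expected_hex for _ in range(attempts)]
    if all(results):
        return "ok"
    if any(results):
        return "transient"   # some reads matched, some did not
    return "persistent"      # every read failed the hash check
```

Repeated reads may be served from the operating system's cache rather than the disk, so a real tool would probably need to bypass or flush the cache before a "persistent" result says much about the drive.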