
Release focus: checker/verifier/repairer:

  • checking: just count the number of available shares

  • verifying: read share contents, check hashes

  • repair: create new shares as necessary to replace bad/missing ones.

    • Mutable shares are repaired in place. Note that mutable repair requires
      a write-cap, to make sure the write-enabler shared secrets are created
      correctly. It would be nice to be able to repair from just a read-cap or
      a verify-cap, but this may need to wait until we switch to DSA mutable
      files, and/or change the way we control server-side write access.
    • Bad immutable shares must be manually deleted from the storage servers,
      so repair needs a mechanism to report which shares should be examined
      and removed. Immutable repair itself only creates new shares to make up
      for the bad ones.

  • there should be a "check" button for each file to initiate a check or a
    verify, with or without auto-repair. The page that this button returns
    should contain the results of the operation: which shares were found
    where, how much verification was performed, whether repair was deemed
    necessary, whether repair was actually done, and the success or failure of
    the repair operation.

  • there should be a "deep-check" button for each directory to perform a
    recursive traversal and check/verify/repair everything reachable from that
    point. The page this returns should show aggregate information about the
    check/repair: a count of how many files/dirs were examined, how many were
    healthy, how many had problems, etc. The page should have a line or two
    about each problem.

  • there should be machine-parseable versions of these buttons: POST
    operations that return JSON with the same information as the
    human-targeted HTML described above.

  • serious problems (like hash failures) should be automatically reported to
    some centralized Incident Gatherer, so we can discover bugs, failing
    drives, or malice.

  • allmydata should be able to run a periodic checker/repairer on customer
    rootcaps. We should be able to count the number of missing/bad shares and
    track it over time (to inform us of the impact of bouncing/moving storage
    servers, discover failing drives, etc). We need to find out how long the
    full check/verify/repair process takes, so we can decide upon a suitable
    repeat rate (perhaps once a month).

  • to handle bad immutable shares, we should add a 'tahoe check-share'
    command that can be run on the storage-server side and check all the
    hashes of a single share file on disk. If file-verify observes a bad hash,
    we should be able to go to the disk and use this tool to see if the
    problem is transient or persistent, to make decisions about the stability
    of that disk.
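
The distinction between checking and verifying, and the kind of aggregate
information a deep-check might return as JSON, can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the real
implementation: Share, CheckResults, check, verify, needs_repair and
deep_summary are hypothetical names, a single SHA-256 digest stands in for the
real per-share hashes, and the "repair whenever any share is missing or bad"
policy is assumed rather than taken from the plan above.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Share:
    """Hypothetical stand-in for one share held by a storage server."""
    sharenum: int
    data: bytes
    expected_sha256: str  # simplification: one digest instead of real share hashes


@dataclass
class CheckResults:
    verified: bool  # False: presence only; True: contents were hashed
    good_shares: list = field(default_factory=list)
    bad_shares: list = field(default_factory=list)


def check(shares):
    """Checking: just count the shares that are available."""
    return CheckResults(verified=False,
                        good_shares=sorted(s.sharenum for s in shares))


def verify(shares):
    """Verifying: read each share's contents and check its hash."""
    results = CheckResults(verified=True)
    for s in sorted(shares, key=lambda s: s.sharenum):
        matches = hashlib.sha256(s.data).hexdigest() == s.expected_sha256
        (results.good_shares if matches else results.bad_shares).append(s.sharenum)
    return results


def needs_repair(results, total_shares):
    """Assumed policy: repair whenever fewer than total_shares are good."""
    return len(results.good_shares) < total_shares


def deep_summary(results_by_path, total_shares):
    """Aggregate per-object results, with one entry per problem found."""
    summary = {"examined": 0, "healthy": 0, "unhealthy": 0, "problems": []}
    for path, results in sorted(results_by_path.items()):
        summary["examined"] += 1
        if needs_repair(results, total_shares):
            summary["unhealthy"] += 1
            summary["problems"].append({
                "path": path,
                "good_shares": results.good_shares,
                "bad_shares": results.bad_shares,
            })
        else:
            summary["healthy"] += 1
    return summary


def make_share(sharenum, data):
    """Build a share whose recorded digest matches its data."""
    return Share(sharenum, data, hashlib.sha256(data).hexdigest())


if __name__ == "__main__":
    shares = [make_share(i, b"block-%d" % i) for i in range(3)]
    shares[1].data = b"corrupted"  # simulate on-disk corruption of one share
    report = deep_summary({"dir/file": verify(shares)}, total_shares=3)
    print(json.dumps(report, indent=2))
```

Note that in this sketch a plain check would still count the corrupted share
as available; only verify notices the hash mismatch, which is why the results
record how much verification was performed.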

100% Completed
#542 by warner was closed 2008-12-01 23:48:24 +00:00