large directories take a long time to modify #383

Open
opened 2008-04-12 18:58:21 +00:00 by warner · 3 comments
warner commented 2008-04-12 18:58:21 +00:00
Owner

We found that the prodnet webapi servers were taking about 35 seconds to modify a large (about 10k entries) dirnode. That time is measured from the end of the Retrieve to the beginning of the Publish. We're pretty sure this is because the loop that decrypts and verifies the write-cap in each row is in Python (whereas decrypting the mutable file contents as a whole, in a single pycryptopp call, runs in 8 milliseconds). The other loop, which re-encrypts everything, takes a similar amount of time, so each loop probably accounts for about 17 seconds.

We don't actually need to decrypt the whole thing. Most of the modifications we're doing add or replace specific children. Since the dirnode is represented as a concatenation of netstrings (one per child), we could have a loop that walks the string, reads each netstring's length prefix, extracts the child name, checks whether it matches, and skips ahead to the next child if not. This would yield a big string of everything before the match, the match itself, and a big string of everything after the match. We could then modify the small matching piece and concatenate everything back together when we're done. Only the piece we're changing needs to be decrypted and re-encrypted.
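
To make that concrete, here is a minimal sketch of the scan, assuming the dirnode body is a plain concatenation of netstrings of the form `<len>:<bytes>,` (one per child) and that each child's payload begins with a netstring holding the encoded child name. The real on-disk entry layout may differ, and `netstring`/`parse_netstring`/`replace_child` are illustrative names, not existing Tahoe functions:

```python
def netstring(payload):
    """Encode bytes as a netstring: b'<len>:<payload>,'."""
    return b"%d:%s," % (len(payload), payload)

def parse_netstring(data, offset):
    """Return (payload, offset_after) for the netstring starting at offset."""
    colon = data.index(b":", offset)
    length = int(data[offset:colon])
    start = colon + 1
    end = start + length
    if data[end:end + 1] != b",":
        raise ValueError("malformed netstring")
    return data[start:end], end + 1

def replace_child(packed, target_name, new_entry):
    """Splice a new packed entry in place of the child named target_name.

    Everything before and after the matching entry stays as opaque bytes;
    only the matched entry would need to be decrypted, modified,
    re-encrypted, and re-packed (here we just splice in new_entry).
    """
    offset = 0
    while offset < len(packed):
        entry, next_offset = parse_netstring(packed, offset)
        name, _ = parse_netstring(entry, 0)   # assumed: first field is the child name
        if name == target_name:
            return packed[:offset] + new_entry + packed[next_offset:]
        offset = next_offset
    raise KeyError(target_name)

# Hypothetical usage, with opaque ciphertext blobs standing in for the real fields:
# packed = (netstring(netstring(b"foo") + b"<ciphertext>") +
#           netstring(netstring(b"bar") + b"<ciphertext>"))
# packed = replace_child(packed, b"bar", netstring(netstring(b"bar") + b"<new>"))
```

With this approach the per-child decrypt/verify/re-encrypt work drops from O(all children) to O(1); the rest of the dirnode is copied around as uninterpreted bytes.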

In addition, we could probably get rid of the HMAC on those writecaps now; I think they're leftover from the central-vdrive-server days. But we should put that compatibility break off until we move to DSA directories (if we choose to go with the 'deep-verify' caps).

tahoe-lafs added the major, enhancement, 1.0.0 labels 2008-04-12 18:58:21 +00:00
tahoe-lafs added this to the eventually milestone 2008-04-12 18:58:21 +00:00
tahoe-lafs added the code-dirnodes label 2008-04-24 23:50:10 +00:00
zooko commented 2009-05-04 16:53:43 +00:00
Author
Owner

See also #327 (performance measurement of directories), #414 (profiling on directory unpacking), and #329 (dirnodes could cache encrypted/serialized entries for speed).

zooko commented 2009-06-25 16:30:40 +00:00
Author
Owner

Tahoe-LAFS hasn't checked the HMAC since changeset:f1fbd4feae1fb5d7 (2008-12-21), a patch first released in Tahoe-LAFS v1.3.0 on 2009-02-13.

If we produced dirnode entries that didn't have the HMAC tag (or that had blank space instead of the correct tag bytes there -- I don't know how the parsing works), then clients older than v1.3.0 would get some sort of integrity error when trying to read that entry. Our backward-compatibility tradition typically covers a longer span than that. For example, [the most recent release notes]source:relnotes.txt@20090414025430-92b7f-6e06ebbd16f80e68a6141d44fc25cc1d49726b22 say that Tahoe-LAFS v1.4.1 is backwards-compatible with v1.0, and in fact it is actually compatible with v0.8 or so (unless you try to upload large files -- files with shares larger than about 4 GiB).

So, let's not yet break compatibility by ceasing to emit the HMAC tags.

Also, let this be a lesson to us that if we notice forward-compatibility issues and fix them early, this frees us up to evolve the protocols sooner. We actually stopped *needing* the HMAC tags when we released Tahoe-LAFS v0.7 on 2008-01-07, but we didn't notice that we were still checking them and erroring out if they were wrong until the v1.3.0 release. So, everybody go look at [forward-compatibility issues](http://allmydata.org/trac/tahoe/search?q=forward-compatibility) and fix them!

zooko commented 2009-06-25 16:39:02 +00:00
Author
Owner

Oh, by the way, the time to actually compute and write the HMAC tags is really tiny compared to the other performance issues. (#327 (performance measurement of directories) and #414 (profiling on directory unpacking) are how we can be sure of this.) If we could stop producing the HMAC tags, I would be happier about the simplification than about the speed-up...
