padding to hide the size of plaintexts #2018

New issue

Open

opened 2013-07-07 22:50:46 +00:00 by zooko · 14 comments

zooko commented

2013-07-07 22:50:46 +00:00

I propose that LAFS automatically add a randomized number of padding bytes to files when encrypting. Concretely, how about something like this. With F as the file size in bytes,

Let the "max padding", X, be 32*ceil(log₂(F)).
Choose a number of padding bytes, P, evenly from [0..X) as determined by the encryption key. Note: this is important that the number is deterministic from the key, so that multiple encryptions of the same-keyed file will not pick different random numbers and allow an attacker to statistically observe the padding's size. Be sure the pad length gets derived from the key via a strongly one-way path.
Append P bytes of padding (0 bytes) to the plaintext before encryption. (This does not affect how the key is derived from the plaintext in the case of convergent encryption.)

Even though LAFS keeps the contents of files secret from attackers, it currently exposes the length (in bytes) of that content. This can be useful information to an attacker in various ways. For one thing, an attacker might be able to "recognize" specific files or kinds of files from a pattern of file sizes. More subtle dangers may also exist, depending on the circumstances, for example the famous "CRIME" attack on SSL (<http://security.stackexchange.com/questions/19911/crime-how-to-beat-the-beast-successor/19914#19914>) which depends crucially on the attacker being able to measure the exact size of certain encrypted output. Ticket #925 is about how potentially interesting metadata about the LAFS filesystem itself can be inferred from the byte-level granularity of exposed sizes. I propose that LAFS automatically add a randomized number of padding bytes to files when encrypting. Concretely, how about something like this. With `F` as the file size in bytes, 1. Let the "max padding", `X`, be `32*ceil(log₂(F))`. 2. Choose a number of padding bytes, `P`, evenly from `[0..X)` as determined by the encryption key. *Note: this is important that the number is deterministic from the key, so that multiple encryptions of the same-keyed file will not pick different random numbers and allow an attacker to statistically observe the padding's size.* Be sure the pad length gets derived from the key via a strongly one-way path. 3. Append `P` bytes of padding (0 bytes) to the plaintext before encryption. (This does not affect how the key is derived from the plaintext in the case of convergent encryption.)

zooko added the

labels 2013-07-07 22:50:46 +00:00

zooko added this to the undecided milestone 2013-07-07 22:50:46 +00:00

zooko commented

2013-07-07 23:13:31 +00:00

Author

Started a mailing list conversation: [//pipermail/tahoe-dev/2013-July/008492.html https://tahoe-lafs.org/pipermail/tahoe-dev/2013-July/008492.html]

Started a mailing list conversation: [//pipermail/tahoe-dev/2013-July/008492.html <https://tahoe-lafs.org/pipermail/tahoe-dev/2013-July/008492.html>]

leif commented

2013-07-08 18:12:56 +00:00

Is there an advantage to random padding instead of just padding up to some fixed interval?

If someone uploads many different files which are all the exact same size, random padding will not stop attackers from inferring what that size is.

Is there an advantage to random padding instead of just padding up to some fixed interval? If someone uploads many different files which are all the exact same size, random padding will not stop attackers from inferring what that size is.

nickm commented

2013-07-08 20:01:05 +00:00

Owner

Here's how to do the analysis.

Look at what information the attacker sees over time, and what the
attacker is trying to learn. Consider how fast they can learn what
they like as Tahoe stands today. Then consider how fast they can
learn that with the proposed padding scheme.

Generally, padding many things to the same size tends to work better
than adding random amounts of padding to a lot of things. In the "pad
to same size" case, the attacker learns less from seeing the size of a
single object.

Don't forget object linkability in your analysis. That is, if certain
messages are likelier to be received together than two messages chosen
at random, then the attacker can make inferences over time, so you
can't just look at single-object probabilities in isolation.

Feel free to share this message wherever it will do good.

Here's how to do the analysis. Look at what information the attacker sees over time, and what the attacker is trying to learn. Consider how fast they can learn what they like as Tahoe stands today. Then consider how fast they can learn that with the proposed padding scheme. Generally, padding many things to the same size tends to work better than adding random amounts of padding to a lot of things. In the "pad to same size" case, the attacker learns less from seeing the size of a single object. Don't forget object linkability in your analysis. That is, if certain messages are likelier to be received together than two messages chosen at random, then the attacker can make inferences over time, so you can't just look at single-object probabilities in isolation. Feel free to share this message wherever it will do good.

zooko commented

2013-07-12 14:14:18 +00:00

Author

Comments from Marsh Ray:

I like the length hiding idea. Be sure the pad length gets derived from the key via a strongly one-way path.
Would file access patterns reveal the amount of padding? Would it ever make sense to distribute padding over the whole file?
"you might use 32*ceil(log₂(F)) to hide F a little better"

from https://twitter.com/marshray/status/354028204446584832

Comments from Marsh Ray: * I like the length hiding idea. Be sure the pad length gets derived from the key via a strongly one-way path. * Would file access patterns reveal the amount of padding? Would it ever make sense to distribute padding over the whole file? * "you might use 32*ceil(log₂(F)) to hide F a little better" from <https://twitter.com/marshray/status/354028204446584832>

nejucomo commented

2013-07-21 07:30:15 +00:00

+1 on the need for a threat model (mentioned on the list by Greg Troxel).

A threat model is really important so that we notice conflicting design goals, or unnecessary complexity.

An example conflict of goals: consider a threat model with an attacker who only operates a storage node and has no resources outside of that storage node, and consider two features: range requests versus "size confidentiality" through padding.

An incremental update to a byte range reveals that that range is interesting, and probably not padding. A lack of byte range updates means updates require full file uploads, which is a large usability cost.

Range updates can also potentially reveal information through layers outside of LAFS! Suppose a user is using an encrypted loop-back filesystem stored in a single "local filesystem file", but that single file happens to be backed by some magic LAFS goo that "smartly" notices only a range has been altered, and only sends updates for that range. Now the user changes a small secret stored inside the loop-back encrypted filesystem, and that translates to a tiny range request a storage node operator could see, whose size is close to the tiny secret size.

So, are bup-style hash splitting or LDMF-style deltas with individual padding superior to range updates? We can't answer this unless we have a threat model and we also prioritize other features against defense-features for that threat model.

+1 on the need for a threat model (mentioned on the list by Greg Troxel). A threat model is really important so that we notice conflicting design goals, or unnecessary complexity. An example conflict of goals: consider a threat model with an attacker who only operates a storage node and has no resources outside of that storage node, and consider two features: range requests versus "size confidentiality" through padding. An incremental update to a byte range reveals that that range is interesting, and probably not padding. A lack of byte range updates means updates require full file uploads, which is a large usability cost. Range updates can also potentially reveal information through layers outside of LAFS! Suppose a user is using an encrypted loop-back filesystem stored in a single "local filesystem file", *but* that single file happens to be backed by some magic LAFS goo that "smartly" notices only a range has been altered, and only sends updates for that range. Now the user changes a small secret stored inside the loop-back encrypted filesystem, and that translates to a tiny range request a storage node operator could see, whose size is close to the tiny secret size. So, are bup-style hash splitting or `LDMF`-style deltas with individual padding superior to range updates? We can't answer this unless we have a threat model and we also prioritize other features against defense-features for that threat model.

nejucomo commented

2013-07-21 07:43:37 +00:00

A natural starting place for threat modeling attacker capabilities would be the operator of a single storage node. Here's how to get started in your career of blackmailing LAFS users:

run a storage node and use find, ls, and the like to examine the filesystem metadata on shares. (This could give sizes, creation times, modification times, access times.)
examine local share contents using any tools at your disposal. (What can this tell an attacker about shares? Serial numbers? Signing keys? merkle tree roots?)
turn up logging, and modify the storage node code to log protocol-level requests data of interest. (This could give client IPs, more precise timing information, range requests on shares.)

A natural starting place for threat modeling attacker capabilities would be the operator of a single storage node. Here's how to get started in your career of blackmailing `LAFS` users: 1. run a storage node and use `find`, `ls`, and the like to examine the filesystem metadata on shares. (This could give sizes, creation times, modification times, access times.) 1. examine local share contents using any tools at your disposal. (What can this tell an attacker about shares? Serial numbers? Signing keys? merkle tree roots?) 1. turn up logging, and modify the storage node code to log protocol-level requests data of interest. (This could give client IPs, more precise timing information, range requests on shares.)

zooko commented

2013-07-23 00:33:53 +00:00

Author

I'm not sure how to proceed with the threat model you suggest, nejucomo. The avenue of information that you mention (range updates) is closely to the one Marsh Ray mentioned (comment:394136).

At the moment I feel blocked, because I'm not sure how to proceed with the threat model. My feeling is, if I want to make progress on this I should ignore the whole idea of a threat model and move ahead with implementation! That sounds wrong, but what's the next-step on the threat model?

I'm not sure how to proceed with the threat model you suggest, nejucomo. The avenue of information that you mention (range updates) is closely to the one Marsh Ray mentioned ([comment:394136](/tahoe-lafs/trac/issues/2018#issuecomment-394136)). At the moment I feel blocked, because I'm not sure how to proceed with the threat model. My feeling is, if I want to make progress on this I should ignore the whole idea of a threat model and move ahead with implementation! That sounds wrong, but what's the next-step on the threat model?

nejucomo was assigned by zooko

2013-07-23 00:33:53 +00:00

nejucomo commented

2013-07-23 17:28:39 +00:00

I added keyword research because I believe there's opportunity for academic research contributing greatly here.

Note: I'm quite unfamiliar with the relevant literature so this may already be a well understood problem.

I added keyword `research` because I believe there's opportunity for academic research contributing greatly here. Note: I'm quite unfamiliar with the relevant literature so this may already be a well understood problem.

nejucomo commented

2013-07-23 17:56:01 +00:00

Replying to zooko:

I'm not sure how to proceed with the threat model you suggest, nejucomo. The avenue of information that you mention (range updates) is closely to the one Marsh Ray mentioned (comment:394136).

At the moment I feel blocked, because I'm not sure how to proceed with the threat model. My feeling is, if I want to make progress on this I should ignore the whole idea of a threat model and move ahead with implementation! That sounds wrong, but what's the next-step on the threat model?

Let's start here:

What specific confidentiality claims can we make to users about this feature?

Perhaps we should review existing threat models for other storage applications which have confidentiality-from-storage-operator requirements.

Another way to approach the threat model is to show a tangible vulnerability for confidentiality of the current LAFS storage (without padding). Then demonstrate that padding thwarts that to some degree. Some interesting challenges:

As a storage operator can I deduce with > 50% accuracy that a particular share:

represents a directory?
represents a directory with N children.
represents a directory with N children with a total of K bytes of child edge names.
is part of a well known specific directory structure? eg. the linux source.
is part of a well known general directory structure? eg. "this is a git repository" or "this is a linux home directory".
is a well known general file format? eg. "this is an ssh private key"
correlations amongst the above. eg. given the probability that share A is a home directory and B is a .ssh directory inside it, the probability that share C is a private ssh key is P.

Replying to [zooko](/tahoe-lafs/trac/issues/2018#issuecomment-394139): > I'm not sure how to proceed with the threat model you suggest, nejucomo. The avenue of information that you mention (range updates) is closely to the one Marsh Ray mentioned ([comment:394136](/tahoe-lafs/trac/issues/2018#issuecomment-394136)). > > At the moment I feel blocked, because I'm not sure how to proceed with the threat model. My feeling is, if I want to make progress on this I should ignore the whole idea of a threat model and move ahead with implementation! That sounds wrong, but what's the next-step on the threat model? Let's start here: * What specific confidentiality claims can we make to users about this feature? Perhaps we should review existing threat models for other storage applications which have confidentiality-from-storage-operator requirements. Another way to approach the threat model is to show a tangible vulnerability for confidentiality of the current `LAFS` storage (without padding). Then demonstrate that padding thwarts that to some degree. Some interesting challenges: As a storage operator can I deduce with > 50% accuracy that a particular share: * represents a directory? * represents a directory with N children. * represents a directory with N children with a total of K bytes of child edge names. * is part of a well known specific directory structure? *eg.* the `linux` source. * is part of a well known general directory structure? *eg.* "this is a git repository" or "this is a linux home directory". * is a well known general file format? *eg.* "this is an ssh private key" * correlations amongst the above. *eg.* given the probability that share A is a home directory and B is a `.ssh` directory inside it, the probability that share C is a private ssh key is P.

nejucomo commented

2013-07-23 18:17:12 +00:00

Replying to zooko:

[...]

My feeling is, if I want to make progress on this I should ignore the whole idea of a threat model and move ahead with implementation! That sounds wrong, but what's the next-step on the threat model?

I wouldn't want to implement this feature yet without hearing more people's answers to these questions:

Do we believe range updates compromise the benefits of padding? Under what conditions?
Will the implementation only work for files which disallow range updates (such as immutables or SDMF files)?
If so, how will the UI indicate to users which files / file types have which size confidentiality features?
-or, will users decide whether they prefer range updates or size confidentiality on a global scale, and change a configuration setting?
What should the default setting be?
Will we spend more effort improving confidentiality in this way, or more time improving update efficiency, or will we split our time between two features which may counteract each other?
Are there other approaches to efficient update that work well with other approaches to size confidentiality?

For the last question, I'm interested in the proposed LDMF format which would store file data as snapshots with deltas. The range offsets in the deltas can be encrypted for confidentiality. The file content of the delta itself could be padded. Now a storage operator can see the timing and padded size of deltas, but not which parts of a file they update, nor the exact size of the delta contents.

So that last question is an example of how we might realize that we'd rather invest engineering into snapshot+delta formats instead of flat files with range updates.

Replying to [zooko](/tahoe-lafs/trac/issues/2018#issuecomment-394139): [...] > My feeling is, if I want to make progress on this I should ignore the whole idea of a threat model and move ahead with implementation! That sounds wrong, but what's the next-step on the threat model? I wouldn't want to implement this feature yet without hearing more people's answers to these questions: * Do we believe range updates compromise the benefits of padding? Under what conditions? * Will the implementation only work for files which disallow range updates (such as immutables or `SDMF` files)? * If so, how will the UI indicate to users which files / file types have which size confidentiality features? * -or, will users decide whether they prefer range updates or size confidentiality on a global scale, and change a configuration setting? * What should the default setting be? * Will we spend more effort improving confidentiality in this way, or more time improving update efficiency, or will we split our time between two features which may counteract each other? * Are there other approaches to efficient update that work well with other approaches to size confidentiality? For the last question, I'm interested in the proposed `LDMF` format which would store file data as snapshots with deltas. The range offsets in the deltas can be encrypted for confidentiality. The file content of the delta itself could be padded. Now a storage operator can see the timing and padded size of deltas, but not which parts of a file they update, nor the exact size of the delta contents. So that last question is an example of how we might realize that we'd rather invest engineering into snapshot+delta formats instead of flat files with range updates.

zooko commented

2014-02-10 04:04:41 +00:00

Author

Edited the ticket description to use two of Marsh's suggestions from comment:394136. (The third suggestion is potentially really important, too, but doesn't easily fit into the ticket description.)

Edited the ticket description to use two of Marsh's suggestions from [comment:394136](/tahoe-lafs/trac/issues/2018#issuecomment-394136). (The third suggestion is potentially really important, too, but doesn't easily fit into the ticket description.)

nickm commented

2014-02-10 17:24:49 +00:00

Owner

32*ceil(log2(F)) doesn't really hide file sizes so well; are you sure it's what you mean? If a file is 1.1MB long, should the maximum padding really be only 32 * 21 == 672 bytes? That doesn't seem big enough to obscure the true file size.

Here's a simple experiment I just did. It's not a very good one, but with any luck people can improve it to make it something more scientifically sound. I took all the music files on my computer (about 16000 of them), and built a database of their file sizes. (I chose music files because their sizes are already in a similar range to one another, without being overly homogeneous. And because I had a lot of them. More thinking will identify a better corpus.)

% find ~/Music/*/ -type f -print0 |xargs -0 ls -l  | perl -ne '@a = split; print "$a[4]\n";' > sizes
% wc -l sizes
16275 sizes

At the start, nearly every file size was unique.

% sort -n sizes | uniq -c  |grep '^ *1 ' |wc -l
16094

Next I tried a "round the size up to the nearest multiple of max_pad" rule, taking max_pad as 32*ceil(log2(F)).

% python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l 
9671

So, more than half of my files still have a unique size. That's probably not so good.

(I chose "rounding up" rather than "randomizing" since it makes the attacker's job strictly harder, and the analysis much simpler.)

Next I tried the rule: Let N = ceil(log2(F)). If N >= 5, let X = N - 5; otherwise let X = N. Let MaxPad = 2^X^. Round the file size up to the nearest multiple of MaxPad. This time:

% python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l 
65

Only 65 files had unique sizes. Note also that the padding overhead is about 1/32 of the file size, for large files.

(A better experiment would pick a better metric than "how many files have a unique size?" The count of unique-sized files doesn't capture the notion that having three files with the same padded size is better than having two files with the same padded size. Maybe something entropy-based would be the right thing here.)

32*ceil(log2(F)) doesn't really hide file sizes so well; are you sure it's what you mean? If a file is 1.1MB long, should the maximum padding _really_ be only 32 * 21 == 672 bytes? That doesn't seem big enough to obscure the true file size. Here's a simple experiment I just did. It's not a very good one, but with any luck people can improve it to make it something more scientifically sound. I took all the music files on my computer (about 16000 of them), and built a database of their file sizes. (I chose music files because their sizes are already in a similar range to one another, without being overly homogeneous. And because I had a lot of them. More thinking will identify a better corpus.) ``` % find ~/Music/*/ -type f -print0 |xargs -0 ls -l | perl -ne '@a = split; print "$a[4]\n";' > sizes % wc -l sizes 16275 sizes ``` At the start, nearly every file size was unique. ``` % sort -n sizes | uniq -c |grep '^ *1 ' |wc -l 16094 ``` Next I tried a "round the size up to the nearest multiple of max_pad" rule, taking max_pad as 32*ceil(log2(F)). ``` % python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l 9671 ``` So, more than half of my files still have a unique size. That's probably not so good. (I chose "rounding up" rather than "randomizing" since it makes the attacker's job strictly harder, and the analysis much simpler.) Next I tried the rule: Let N = ceil(log2(F)). If N >= 5, let X = N - 5; otherwise let X = N. Let MaxPad = 2^X^. Round the file size up to the nearest multiple of MaxPad. This time: ``` % python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l 65 ``` Only 65 files had unique sizes. Note also that the padding overhead is about 1/32 of the file size, for large files. (A better experiment would pick a better metric than "how many files have a unique size?" The count of unique-sized files doesn't capture the notion that having three files with the same padded size is better than having two files with the same padded size. Maybe something entropy-based would be the right thing here.)

nikita commented

2014-02-10 19:47:39 +00:00

Owner

Some other useful metrics: the minimum bucket size (which is essentially min-entropy) and the average bucket size (which I think is related to guessing entropy but I haven't done the math yet), where a bucket is a collection of files that pad to the same length.

zooko commented

2014-02-12 22:29:24 +00:00

Author

Hi, Nikita and NickM. I'm really perplexed about this whole topic, because I don't know what formal model applies to the real-world cases that I care about.

Just to throw out an example of a model, there's a set S of files, each with a size, and a player A picks some one file from S and reveals some function of that file's size and/or compression, and then the other player M tries to guess which file A picked. Is that something anyone cares about? I really don't know.

Something that I'm sure that I do care about is the CRIME/BEAST-style attacks on compression. I would feel better about LAFS's resistance to those if we added padding (of the kind described in the Description of this ticket).

Hi, Nikita and NickM. I'm really perplexed about this whole topic, because I don't know what formal model applies to the real-world cases that I care about. Just to throw out an example of a model, there's a set **S** of files, each with a size, and a player **A** picks some one file from **S** and reveals some function of that file's size and/or compression, and then the other player **M** tries to guess which file **A** picked. Is that something anyone cares about? I really don't know. Something that I'm sure that I do care about is the CRIME/BEAST-style attacks on compression. I would feel better about LAFS's resistance to those if we added padding (of the kind described in the Description of this ticket).

Rows
Columns