capabilities from the future could have non-ascii characters #1051

Closed
opened 2010-05-19 05:50:10 +00:00 by zooko · 24 comments
zooko commented 2010-05-19 05:50:10 +00:00
Owner

In the work around ticket #833 we implemented forward-compatibility for future capability formats. However we required all future cap formats to be expressed in ASCII without (as far as I know) actually thinking it through and deciding that we really wanted to constrain future cap formats in that way.

This ticket is to loosen that constraint on future cap formats. Note that this doesn't require future cap formats to have non-ASCII characters in them -- it just makes it so that if they do then they will still enjoy the same (limited) backward-compatibility to Tahoe-LAFS v1.7 that pure-ASCII caps from the future enjoy back to Tahoe-LAFS v1.6.1.

In the work around ticket #833 we implemented forward-compatibility for future capability formats. However we required all future cap formats to be expressed in ASCII without (as far as I know) actually thinking it through and deciding that we really wanted to constrain future cap formats in that way. This ticket is to loosen that constraint on future cap formats. Note that this doesn't require future cap formats to have non-ASCII characters in them -- it just makes it so that if they *do* then they will still enjoy the same (limited) backward-compatibility to Tahoe-LAFS v1.7 that pure-ASCII caps from the future enjoy back to Tahoe-LAFS v1.6.1.
tahoe-lafs added the
unknown
major
enhancement
1.6.1
labels 2010-05-19 05:50:10 +00:00
tahoe-lafs added this to the 1.7.0 milestone 2010-05-19 05:50:10 +00:00
zooko commented 2010-05-19 06:07:39 +00:00
Author
Owner

Attachment refactor-test-web.dpatch (24338 bytes) added

**Attachment** refactor-test-web.dpatch (24338 bytes) added
zooko commented 2010-05-19 06:08:29 +00:00
Author
Owner

Please review attachment:refactor-test-web.dpatch.

Please review attachment:refactor-test-web.dpatch.
tahoe-lafs added
code
and removed
unknown
labels 2010-05-19 06:08:29 +00:00
zooko commented 2010-05-19 06:21:58 +00:00
Author
Owner

Attachment test-nonascii-future-caps.dpatch (35715 bytes) added

**Attachment** test-nonascii-future-caps.dpatch (35715 bytes) added
zooko commented 2010-05-19 06:23:50 +00:00
Author
Owner

Here are unit tests which trunk currently fails. They test what happens if someone gives you a cap (through wui/wapi, cli, or in a child of a dir) that has non-ascii chars in the cap itself. Please review these test patches!

Here are unit tests which trunk currently fails. They test what happens if someone gives you a cap (through wui/wapi, cli, or in a child of a dir) that has non-ascii chars in the cap itself. Please review these test patches!
zooko commented 2010-05-19 06:28:04 +00:00
Author
Owner

FWIW I agree with Brian's and David-Sarah's comments on IRC that the next version of caps (hopefully coming out this year) should not use non-ASCII chars.

I would still like to see this forward-compatibility feature in Tahoe-LAFS as soon as possible though to give our future selves and our successors more graceful options.

FWIW I agree with Brian's and David-Sarah's comments on IRC that the next version of caps (hopefully coming out this year) should not use non-ASCII chars. I would still like to see this forward-compatibility feature in Tahoe-LAFS as soon as possible though to give our future selves and our successors more graceful options.
davidsarah commented 2010-06-03 04:48:28 +00:00
Author
Owner

Tests look ok, but we don't have a patch that makes them pass. This may have to wait until 1.8.

Tests look ok, but we don't have a patch that makes them pass. This may have to wait until 1.8.
davidsarah commented 2010-06-03 04:49:51 +00:00
Author
Owner

refactor-test-web.dpatch can be applied now.

refactor-test-web.dpatch can be applied now.
davidsarah commented 2010-06-06 17:46:19 +00:00
Author
Owner

When we parse a JSON bytestring using simplejson.loads, the result is a mixture of ASCII and Unicode strings. Therefore any "rw_uri" and "ro_uri" fields in the result should be coerced using allmydata.util.stringutils.to_str, which is defined like this:

def to_str(s):
    if s is None or isinstance(s, str):
        return s
    return s.encode('utf-8')

I've been doing this for the CLI scripts as part of fixing the ticket534 branch.

(The stringutils module is likely to be renamed, maybe to encodingutil.)

When we parse a JSON bytestring using `simplejson.loads`, the result is a mixture of ASCII and Unicode strings. Therefore any "`rw_uri`" and "`ro_uri`" fields in the result should be coerced using `allmydata.util.stringutils.to_str`, which is defined like this: ``` def to_str(s): if s is None or isinstance(s, str): return s return s.encode('utf-8') ``` I've been doing this for the CLI scripts as part of fixing the ticket534 branch. (The `stringutils` module is likely to be renamed, maybe to `encodingutil`.)
davidsarah commented 2010-06-08 04:06:03 +00:00
Author
Owner

The test patch is testing that (some of) our internal APIs directly support URIs that can be Unicode strings as well as byte strings. I don't think we should do it that way: we should represent URIs as UTF-8 (also in the encoded form of directories), and convert when reading caps/paths from the command line, and when displaying caps to stdout/stderr and in the WUI. (I think the former may already work, due to the changes for #534 and #565.)

The test patch is testing that (some of) our internal APIs directly support URIs that can be Unicode strings as well as byte strings. I don't think we should do it that way: we should represent URIs as UTF-8 (also in the encoded form of directories), and convert when reading caps/paths from the command line, and when displaying caps to stdout/stderr and in the WUI. (I think the former *may* already work, due to the changes for #534 and #565.)
davidsarah commented 2010-06-08 04:37:08 +00:00
Author
Owner

refactor-test-web.dpatch was applied in changeset:9e2da058372cad56.

refactor-test-web.dpatch was applied in changeset:9e2da058372cad56.
tahoe-lafs modified the milestone from 1.7.0 to 1.8.0 2010-06-08 04:37:26 +00:00
warner commented 2010-06-19 00:22:16 +00:00
Author
Owner

I concur: filecaps are either ASCII or arbitrary bytestrings, not unicode
objects.

Actually, I think of it this way:

  • filecaps are abstract bundles of location/identification information about
    files/directories (think about our URI objects)
  • we currently have one concrete expression syntax for filecaps, call it V1,
    which starts with "URI:" and always contains printable ASCII
  • we can imagine other expression syntaxes in the future, in particular a
    dense binary form (call it V2), and a more-official
    follows-the-RFC-about-URIs URI form, with :// and everything (call
    it V3)
  • the internal dirnode-traversing code must be able to understand the syntax
    used in the dirnodes that it unpacks. The dirnodes contain a bytestring
    (packed with netstrings, not JSON). This is the first constraint on what
    future code can put into dirnodes
  • the WUI which displays caps in dirnodes shouldn't explode when it sees a
    cap it doesn't recognize. This is the second constraint.
  • the URLs passed into the webapi can contain both a filecap and child name
    (subdir path and/or filename). HTTP enforces a specific type here:
    %-encoding of a bytestring, and it is common to expect the bytestring to
    be a UTF-8 encoding of a unicode string.
  • the WAPI (and our CLI tools), in particular the t=json bodies, shouldn't
    explode when they see unrecognized caps, in the . Furthermore, it'd be
    nice if they could treat such caps as opaque objects and still be able to
    do certain manipulation of them (like copying/moving them from one
    directory to another).

The way we build dirnodes tells us that we can put arbitrary bytestrings into
them: whatever syntax we use is not constrained by the container they are put
in, as long as any non-bytestring syntax we use is encoded into a bytestring
on the way in.

The WUI display is not a significant constraint: we rarely accept filecaps as
input in the WUI, and we only display them on the "More Info" page, which
could use repr(). We don't need to construct URLs out of unrecognized caps,
because the WUI doesn't provide any operations to work with them.

The WAPI t=json encoding is big constraint: since we pass filecaps as
dictionary values in the JSON body, which limits them to being ASCII or
unicode objects, unless we change the definition of the t=json API to declare
that the values it contains are something else.

(sometimes you can retroactively change protocol definitions like this in
ways that magically retain backwards compatibility: for example, if we
declared that the V1 encoding of filecaps has in fact actually been unicode,
but that everybody thought it was just ASCII because nobody had ever created
a unicode filecap before, then the t=json body definition could similarly be
retroactively defined as "UTF-8 encoding of the V1 filecap", and all the
ASCII filecaps coming from those APIs would still look the same)

So anyways, the reason that I think filecaps are ASCII or bytestrings is
because they contain machine-readable data like cryptovalues and numbers, not
human-generated/human-readable data like names.

I'd strongly recommend that, if we're going to plan for expansion of the "V1"
filecap-as-string syntax beyond ASCII, then we should plan for them to be
bytestrings. You can reliably compare bytestrings for equality (which is not
generally the case for unicode strings), there is an unambigious mapping from
filecap-as-bundle-of-data to filecap-as-bytestring (which is not the case for
unicode: even if we tell everyone to use UTF-8 instead of UTF-16/etc, there
are still too many options).

A related but separate issue is how to plan for expansion to the V2/V3/etc
syntaxes. The V1 syntax, as currently (narrowly) defined, is always printable
ASCII and always starts with "URI:". We could define a V2 dense-binary syntax
which, given a single leading version byte, would not overlap with the V1
syntax. Likewise a V3 real-URI syntax, which started with tahoe://,
would not overlap. We might then retroactively define the filecaps stored in
dirnodes to be bytestrings that parse in one of these three forms (allowing
smaller dirnodes with dense binary caps). If the current dirnode-handling
code can tolerate+ignore arbitrary bytestrings, then this might be safe.
(t=json might not, however).

I concur: filecaps are either ASCII or arbitrary bytestrings, not unicode objects. Actually, I think of it this way: * filecaps are abstract bundles of location/identification information about files/directories (think about our URI objects) * we currently have one concrete expression syntax for filecaps, call it V1, which starts with "URI:" and always contains printable ASCII * we can imagine other expression syntaxes in the future, in particular a dense binary form (call it V2), and a more-official follows-the-RFC-about-URIs URI form, with `://` and everything (call it V3) * the internal dirnode-traversing code must be able to understand the syntax used in the dirnodes that it unpacks. The dirnodes contain a bytestring (packed with netstrings, not JSON). This is the first constraint on what future code can put into dirnodes * the WUI which displays caps in dirnodes shouldn't explode when it sees a cap it doesn't recognize. This is the second constraint. * the URLs passed into the webapi can contain both a filecap and child name (subdir path and/or filename). HTTP enforces a specific type here: %-encoding of a bytestring, and it is common to expect the bytestring to be a UTF-8 encoding of a unicode string. * the WAPI (and our CLI tools), in particular the t=json bodies, shouldn't explode when they see unrecognized caps, in the . Furthermore, it'd be nice if they could treat such caps as opaque objects and still be able to do certain manipulation of them (like copying/moving them from one directory to another). The way we build dirnodes tells us that we can put arbitrary bytestrings into them: whatever syntax we use is not constrained by the container they are put in, as long as any non-bytestring syntax we use is encoded into a bytestring on the way in. The WUI display is not a significant constraint: we rarely accept filecaps as input in the WUI, and we only display them on the "More Info" page, which could use repr(). We don't need to construct URLs out of unrecognized caps, because the WUI doesn't provide any operations to work with them. The WAPI t=json encoding is big constraint: since we pass filecaps as dictionary values in the JSON body, which limits them to being ASCII or unicode objects, unless we change the definition of the t=json API to declare that the values it contains are something else. (sometimes you can retroactively change protocol definitions like this in ways that magically retain backwards compatibility: for example, if we declared that the V1 encoding of filecaps has in fact actually been unicode, but that everybody thought it was just ASCII because nobody had ever created a unicode filecap before, then the t=json body definition could similarly be retroactively defined as "UTF-8 encoding of the V1 filecap", and all the ASCII filecaps coming from those APIs would still look the same) So anyways, the reason that I think filecaps are ASCII or bytestrings is because they contain machine-readable data like cryptovalues and numbers, not human-generated/human-readable data like names. I'd strongly recommend that, if we're going to plan for expansion of the "V1" filecap-as-string syntax beyond ASCII, then we should plan for them to be bytestrings. You can reliably compare bytestrings for equality (which is not generally the case for unicode strings), there is an unambigious mapping from filecap-as-bundle-of-data to filecap-as-bytestring (which is not the case for unicode: even if we tell everyone to use UTF-8 instead of UTF-16/etc, there are still too many options). A related but separate issue is how to plan for expansion to the V2/V3/etc syntaxes. The V1 syntax, as currently (narrowly) defined, is always printable ASCII and always starts with "URI:". We could define a V2 dense-binary syntax which, given a single leading version byte, would not overlap with the V1 syntax. Likewise a V3 real-URI syntax, which started with `tahoe://`, would not overlap. We might then retroactively define the filecaps stored in dirnodes to be bytestrings that parse in one of these three forms (allowing smaller dirnodes with dense binary caps). If the current dirnode-handling code can tolerate+ignore arbitrary bytestrings, then this might be safe. (t=json might not, however).
zooko commented 2010-07-08 16:54:38 +00:00
Author
Owner

I need to read, understand, respond to Brian's objection.

I need to read, understand, respond to Brian's objection.
zooko commented 2010-07-11 05:42:31 +00:00
Author
Owner

Okay I've read this through a few times now and I'm not sure I understand all of it.

To start with, the "related but separate issue" at the end of comment:119136 can safely go into a separate ticket, right?

Next, I'm fairly sure that this can also go into a separate ticket, possibly that same one: I'd strongly recommend that, if we're going to plan for expansion of the "V1" filecap-as-string syntax beyond ASCII, then we should plan for them to be bytestrings..

Maybe this ticket could be named "capabilities from the future could be non-ascii and non-unicode bytestrings". Does that make sense at all?

So, to focus on what I see as the point of this ticket I would like to ask Brian and David-Sarah a few questions. "Socratic" questioning often sounds condescending and irritating to me. These are actual questions that I don't already know the "right answers" to.

Suppose Alice is running Tahoe-LAFS v1.8.0, in the year 2020, and suppose hypothetically that for some reason that is currently unimaginable to us, we have in the year 2019 defined an "expression syntax" for Tahoe-LAFS caps which are unicode, like this: lafs://from_the_future_fw-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧.

Now suppose, that Bob has a cap like that, and he conveys it to Alice, either by sending it to her and inviting her to click on it or cut and paste it (or wave her magic wand at it or whatever they do in 2020) to enter it into her Tahoe-LAFS v1.8.0 client. Or, suppose Bob puts it into a Tahoe-LAFS directory which Alice has read-access to and asks her to look at that directory.

Question 1: What would you want to happen (in this hypothetical scenario) when Alice waves her wand at it or lists that Tahoe-LAFS directory?

Question 2: What would happen if Alice were using Tahoe-LAFS v1.7.0? (I don't know the answer to this question. Wouldn't her client incur an internal TypeError of some kind?)

Question 3: How would you write a unit test which answers Question 2? My attempt at that was test-nonascii-future-caps.dpatch , but maybe that test doesn't actually answer Question 2. I'm not sure.

Okay I've read this through a few times now and I'm not sure I understand all of it. To start with, the "related but separate issue" at the end of [comment:119136](/tahoe-lafs/trac-2024-07-25/issues/1051#issuecomment-119136) can safely go into a separate ticket, right? Next, I'm *fairly* sure that this can also go into a separate ticket, possibly that same one: *I'd strongly recommend that, if we're going to plan for expansion of the "V1" filecap-as-string syntax beyond ASCII, then we should plan for them to be bytestrings.*. Maybe this ticket could be named "capabilities from the future could be non-ascii and non-unicode bytestrings". Does that make sense at all? So, to focus on what I see as the point of *this* ticket I would like to ask Brian and David-Sarah a few questions. "Socratic" questioning often sounds condescending and irritating to me. These are actual questions that I don't already know the "right answers" to. Suppose Alice is running Tahoe-LAFS v1.8.0, in the year 2020, and suppose *hypothetically* that for some reason that is currently unimaginable to us, we have in the year 2019 defined an "expression syntax" for Tahoe-LAFS caps which are unicode, like this: `lafs://from_the_future_fw-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧`. Now suppose, that Bob has a cap like that, and he conveys it to Alice, either by sending it to her and inviting her to click on it or cut and paste it (or wave her magic wand at it or whatever they do in 2020) to enter it into her Tahoe-LAFS v1.8.0 client. Or, suppose Bob puts it into a Tahoe-LAFS directory which Alice has read-access to and asks her to look at that directory. Question 1: What would you want to happen (in this hypothetical scenario) when Alice waves her wand at it or lists that Tahoe-LAFS directory? Question 2: What would happen if Alice were using Tahoe-LAFS v1.7.0? (I don't know the answer to this question. Wouldn't her client incur an internal TypeError of some kind?) Question 3: How would you write a unit test which answers Question 2? My attempt at that was [test-nonascii-future-caps.dpatch](/tahoe-lafs/trac-2024-07-25/attachments/000078ac-929f-9640-bc82-d383dad09a16) , but maybe that test doesn't actually answer Question 2. I'm not sure.
davidsarah commented 2010-07-11 18:02:57 +00:00
Author
Owner

My objection was slightly different to Brian's. I was objecting to the internal representation of URIs being "either a Unicode string or a bytestring", as the patch assumed. From experience, that's extremely error-prone and leads to horrible problems with implicit conversions.

Representing URIs as UTF-8 doesn't have that problem, and satisfies Brian's criteria in comment:119136 :

  • you can reliably compare UTF-8 bytestrings for equality
  • there is an unambiguous mapping from filecap-as-bundle-of-data to filecap-as-UTF-8-string (some bundles-of-data will be invalid, but that's fine)

I can't speak for Brian, but I believe from IRC conversations that he was skeptical of the whole idea of Unicode in caps. I share some of that skepticism, but I think it's relatively harmless to allow the possibility; it doesn't add much complexity.

Note that if we do use Unicode in caps in future, we should limit the character set to characters for which normalization is not an issue. (There are big blocks of Han characters with no equivalences, for example.)

Re: the JSON encoding -- JSON strings are by definition Unicode, and the JSON bodies are already assumed to be UTF-8 (which is necessary for filenames). The only other compatible option for encoding URIs in JSON would be ISO-Latin-1 (i.e. encode bytes 0x80..0xFF as \x80..\xFF), but it makes no sense to me to use a mixture of UTF-8 and ISO-Latin-1. Also see below for the current behaviour when using simplejson.dumps.

Dense cap encoding is a separate issue that has nothing to do with Unicode. At some point we will probably change the dirnode format for other reasons (e.g. to support deep-verify caps), and then we can consider whether the new format uses dense encoding. I don't think there's much space to be saved, though.

Replying to warner:

We don't need to construct URLs out of unrecognized caps, because the WUI doesn't provide any operations to work with them.

That's not quite true; the WUI accepts unrecognized caps in the form field to link a cap into a directory.

Replying to zooko:

Okay I've read this through a few times now and I'm not sure I understand all of it.

To start with, the "related but separate issue" at the end of comment:119136 can safely go into a separate ticket, right?

I think so.

Next, I'm fairly sure that this can also go into a separate ticket, possibly that same one: I'd strongly recommend that, if we're going to plan for expansion of the "V1" filecap-as-string syntax beyond ASCII, then we should plan for them to be bytestrings..

No; we have to make a choice between UTF-8 and arbitrary bytestrings. That's part of this ticket.

Maybe this ticket could be named "capabilities from the future could be non-ascii and non-unicode bytestrings". Does that make sense at all?

Err, no. All bytestrings are non-Unicode. The question is whether they represent Unicode, i.e. whether they need to be valid UTF-8.

[...]

Suppose Alice is running Tahoe-LAFS v1.8.0, in the year 2020, and suppose hypothetically that for some reason that is currently unimaginable to us, we have in the year 2019 defined an "expression syntax" for Tahoe-LAFS caps which are unicode, like this: lafs://from_the_future_fw-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧.

OK.

Now suppose, that Bob has a cap like that, and he conveys it to Alice, either by sending it to her and inviting her to click on it or cut and paste it (or wave her magic wand at it or whatever they do in 2020) to enter it into her Tahoe-LAFS v1.8.0 client. Or, suppose Bob puts it into a Tahoe-LAFS directory which Alice has read-access to and asks her to look at that directory.

Question 1: What would you want to happen (in this hypothetical scenario) when Alice waves her wand at it or lists that Tahoe-LAFS directory?

She should get an unlinked entry with '?', '?-IMM', or '?-RO' in the first column.

Question 2: What would happen if Alice were using Tahoe-LAFS v1.7.0? (I don't know the answer to this question. Wouldn't her client incur an internal TypeError of some kind?)

I haven't tried an end-to-end test, but from browsing the code for HTML directory listings (web/directory.py), I believe that it won't attempt to decode the URI, so it will be treated like any other unknown URI -- i.e. she will get the desired unlinked entry. I don't know what will happen for the Info page.

For JSON directory listings, again I haven't tried an end-to-end test, but the behaviour of simplejson.dumps is to assume that bytestrings in the input are UTF-8, and (if they are valid UTF-8) encode them with a \u escape in the resulting JSON. In other words, I think this may entirely accidentally do the right thing. If the directory contains an URI that is not valid UTF-8, then a UnicodeDecodeError will probably occur [here]source:src/allmydata/web/directory.py@4527#L851.

Question 3: How would you write a unit test which answers Question 2? My attempt at that was test-nonascii-future-caps.dpatch , but maybe that test doesn't actually answer Question 2. I'm not sure.

I'd use something like that patch, but with .encode('utf-8') added to all of the Unicode URI strings. (Also I prefer to use Unicode escapes rather than UTF-8 in source files.)

My objection was slightly different to Brian's. I was objecting to the internal representation of URIs being "either a Unicode string or a bytestring", as the patch assumed. From experience, that's extremely error-prone and leads to horrible problems with implicit conversions. Representing URIs as UTF-8 doesn't have that problem, and satisfies Brian's criteria in [comment:119136](/tahoe-lafs/trac-2024-07-25/issues/1051#issuecomment-119136) : * you can reliably compare UTF-8 bytestrings for equality * there is an unambiguous mapping from filecap-as-bundle-of-data to filecap-as-UTF-8-string (some bundles-of-data will be invalid, but that's fine) I can't speak for Brian, but I believe from IRC conversations that he was skeptical of the whole idea of Unicode in caps. I share some of that skepticism, but I think it's relatively harmless to allow the possibility; it doesn't add much complexity. Note that if we do use Unicode in caps in future, we should limit the character set to characters for which normalization is not an issue. (There are big blocks of Han characters with no equivalences, for example.) Re: the JSON encoding -- JSON strings are by definition Unicode, and the JSON bodies are already assumed to be UTF-8 (which is necessary for filenames). The only other compatible option for encoding URIs in JSON would be ISO-Latin-1 (i.e. encode bytes 0x80..0xFF as \x80..\xFF), but it makes no sense to me to use a mixture of UTF-8 and ISO-Latin-1. Also see below for the current behaviour when using `simplejson.dumps`. Dense cap encoding is a separate issue that has nothing to do with Unicode. At some point we will probably change the dirnode format for other reasons (e.g. to support deep-verify caps), and then we can consider whether the new format uses dense encoding. I don't think there's much space to be saved, though. Replying to [warner](/tahoe-lafs/trac-2024-07-25/issues/1051#issuecomment-119136): > We don't need to construct URLs out of unrecognized caps, because the WUI doesn't provide any operations to work with them. That's not quite true; the WUI accepts unrecognized caps in the form field to link a cap into a directory. Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/1051#issuecomment-119138): > Okay I've read this through a few times now and I'm not sure I understand all of it. > > To start with, the "related but separate issue" at the end of [comment:119136](/tahoe-lafs/trac-2024-07-25/issues/1051#issuecomment-119136) can safely go into a separate ticket, right? I think so. > Next, I'm *fairly* sure that this can also go into a separate ticket, possibly that same one: *I'd strongly recommend that, if we're going to plan for expansion of the "V1" filecap-as-string syntax beyond ASCII, then we should plan for them to be bytestrings.*. No; we have to make a choice between UTF-8 and arbitrary bytestrings. That's part of this ticket. > Maybe this ticket could be named "capabilities from the future could be non-ascii and non-unicode bytestrings". Does that make sense at all? Err, no. All bytestrings are non-Unicode. The question is whether they *represent* Unicode, i.e. whether they need to be valid UTF-8. [...] > Suppose Alice is running Tahoe-LAFS v1.8.0, in the year 2020, and suppose *hypothetically* that for some reason that is currently unimaginable to us, we have in the year 2019 defined an "expression syntax" for Tahoe-LAFS caps which are unicode, like this: `lafs://from_the_future_fw-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧`. OK. > Now suppose, that Bob has a cap like that, and he conveys it to Alice, either by sending it to her and inviting her to click on it or cut and paste it (or wave her magic wand at it or whatever they do in 2020) to enter it into her Tahoe-LAFS v1.8.0 client. Or, suppose Bob puts it into a Tahoe-LAFS directory which Alice has read-access to and asks her to look at that directory. > > Question 1: What would you want to happen (in this hypothetical scenario) when Alice waves her wand at it or lists that Tahoe-LAFS directory? She should get an unlinked entry with '?', '?-IMM', or '?-RO' in the first column. > Question 2: What would happen if Alice were using Tahoe-LAFS v1.7.0? (I don't know the answer to this question. Wouldn't her client incur an internal TypeError of some kind?) I haven't tried an end-to-end test, but from browsing the code for HTML directory listings (web/directory.py), I believe that it won't attempt to decode the URI, so it will be treated like any other unknown URI -- i.e. she will get the desired unlinked entry. I don't know what will happen for the Info page. For JSON directory listings, again I haven't tried an end-to-end test, but the behaviour of `simplejson.dumps` is to assume that bytestrings in the input are UTF-8, and (if they are valid UTF-8) encode them with a `\u` escape in the resulting JSON. In other words, I think this *may* entirely accidentally do the right thing. If the directory contains an URI that is not valid UTF-8, then a `UnicodeDecodeError` will probably occur [here]source:src/allmydata/web/directory.py@4527#L851. > Question 3: How would you write a unit test which answers Question 2? My attempt at that was [test-nonascii-future-caps.dpatch](/tahoe-lafs/trac-2024-07-25/attachments/000078ac-929f-9640-bc82-d383dad09a16) , but maybe that test doesn't actually answer Question 2. I'm not sure. I'd use something like that patch, but with `.encode('utf-8')` added to all of the Unicode URI strings. (Also I prefer to use Unicode escapes rather than UTF-8 in source files.)
davidsarah commented 2010-07-11 20:19:55 +00:00
Author
Owner

Attachment test-utf8-future-caps.dpatch (162820 bytes) added

Add tests of caps from the future that have non-ASCII characters in them (encoded as UTF-8). The changes to test_uri .py, test_client.py, and test_dirnode.py add tests of non-ASCII future caps in addition to the current tests. The change s to test_web.py just replace the tests of all-ASCII future caps with tests of non-ASCII future caps. We also change use s of failUnlessEqual to failUnlessReallyEqual, in order to catch cases where the type of a string is not as expected.

**Attachment** test-utf8-future-caps.dpatch (162820 bytes) added Add tests of caps from the future that have non-ASCII characters in them (encoded as UTF-8). The changes to test_uri .py, test_client.py, and test_dirnode.py add tests of non-ASCII future caps in addition to the current tests. The change s to test_web.py just replace the tests of all-ASCII future caps with tests of non-ASCII future caps. We also change use s of failUnlessEqual to failUnlessReallyEqual, in order to catch cases where the type of a string is not as expected.
davidsarah commented 2010-07-11 20:21:34 +00:00
Author
Owner

Attachment behaviour-utf8-future-caps.dpatch (5160 bytes) added

Allow URIs passed in the initial JSON for t=mkdir-with-children, t=mkdir-immutable to be Unicode (this makes 'test-utf8-future-caps.dpatch' pass). Also pass the name of each child into nodemaker.create_from_cap for error reporting.

**Attachment** behaviour-utf8-future-caps.dpatch (5160 bytes) added Allow URIs passed in the initial JSON for t=mkdir-with-children, t=mkdir-immutable to be Unicode (this makes 'test-utf8-future-caps.dpatch' pass). Also pass the name of each child into nodemaker.create_from_cap for error reporting.
tahoe-lafs modified the milestone from 1.8.0 to 1.7.1 2010-07-11 20:51:52 +00:00
davidsarah commented 2010-07-11 21:04:44 +00:00
Author
Owner

Incidentally, these tests show that the directory listing case already worked in 1.7.0 (as far as I can tell having only run them on Windows). I think the only thing that didn't work was passing Unicode URIs in the JSON for t=mkdir-with-children and t=mkdir-immutable.

Incidentally, these tests show that the directory listing case already worked in 1.7.0 (as far as I can tell having only run them on Windows). I think the only thing that didn't work was passing Unicode URIs in the JSON for `t=mkdir-with-children` and `t=mkdir-immutable`.
zooko commented 2010-07-12 04:16:02 +00:00
Author
Owner

Okay, I still haven't fully understood Brian's objection. Brian: please review!

As far as I dimly understand the whole issue, with these patches we will effectively have no constraints on future representations of caps (except that they can't start with 'URI:'). Tahoe-LAFS v1.7.0 turns out to already allow any future-caps in Tahoe-LAFS dirs and just show them as '?', but would have raised an exception if you tried to write such future-caps in with the WAPI's t=mkdir-with-children and t=mkdir-immutable. With these patches, Tahoe-LAFS v1.7.1 would also accept any future-caps through the WAPI.

Okay, I still haven't fully understood Brian's objection. Brian: please review! As far as I dimly understand the whole issue, with these patches we will effectively have no constraints on future representations of caps (except that they can't start with 'URI:'). Tahoe-LAFS v1.7.0 turns out to already allow any future-caps in Tahoe-LAFS dirs and just show them as '?', but would have raised an exception if you tried to write such future-caps in with the WAPI's `t=mkdir-with-children` and `t=mkdir-immutable`. With these patches, Tahoe-LAFS v1.7.1 would also accept any future-caps through the WAPI.
zooko commented 2010-07-12 04:47:00 +00:00
Author
Owner

I reviewed behaviour-utf8-future-caps.dpatch and it looks correct to me.

I reviewed [behaviour-utf8-future-caps.dpatch](/tahoe-lafs/trac-2024-07-25/attachments/000078ac-929f-9640-bc82-3a0801ebc6eb) and it looks correct to me.
warner commented 2010-07-14 23:29:57 +00:00
Author
Owner

David-Sarah's analysis in comment:14 is mostly in line with my thinking.

I object less to "filecaps are UTF-8 encoding of some unicode string" than
"filecaps are unicode strings". This would let us say that filecaps are
bytestrings but with a constraint that filecap.decode("utf-8") must not
throw an exception, and perhaps the additional constraint that
filecap.decode("utf-8").encode("utf-8")==filecap. If we went this way,
we should say that the UTF-8 -encoded form is the primary one (i.e., if you
want to compare two filecaps, use filecap1==filecap2, not
filecap1.decode("utf-8")==filecap2.decode("utf-8").

That still feels weird, though: UTF-8 is an encoding of something else, and
in general you want to be comparing the primary form, not some encoding
thereof. And filecaps must be unambiguous. If you wanted to visually
compare two ASCII filecaps, you could do it easily (in fact the base32 takes
out the o/0 1/i/I/l/L homoglyphs). While I don't expect people to do this
much, the fact that two unicode strings simply cannot be safely compared this
way has got to be a bad sign.

If we really must accept more than just ASCII, then I'd prefer to accept
completely arbitrary bytestrings. The biggest problem with doing this is the
t=json WAPI: if I'd taken this issue at all seriously when I built the
webapi, I would have defined the t=json format to emit base64-encoded
filecaps or something similar. (actually, at that point I did not yet realize
that JSON could not handle arbitrary binary data.. if I had, I might have
skipped JSON altogether and used protocol buffers or netstrings or
something).

But one option would be to have the t=json response leave out any filecap
that cannot be expressed in printable ASCII (i.e., run a regexp against it
before populating the child-info dictionary, replace it with an "unknown cap"
marker if that fails). I can't remember if we covered this one during the
earlier caps-from-the-future discussion.

If we go with "filecaps are UTF-8 encoding of a unicode string", then the
t=json API doesn't give enough information to clients to compare the real
filecaps: all they can get is filecap.decode("utf-8") . In addition, at
some point inside the webapi, we'd have to convert the filecaps into unicode
before adding them to the JSON response. I'm really nervous about the
information-losing behavior of unicode conversions, and security problems
that can result.

Note that if we do use Unicode in caps in future, we should limit the
character set to characters for which normalization is not an issue. (There
are big blocks of Han characters with no equivalences, for example.)

Ugh.. how can we make this safe? That is, when somebody pastes in a cap, how
do we verify that it isn't using any characters in this set? Is this set even
constant? When we're all speaking Lojban or Ilaksh or Marain or something in
the future, won't there be new codepoints which the old code can't recognize
as being non-normalizable?

A related but separate issue is how to plan for expansion to the V2/V3/etc
syntaxes.

While parts of this may belong in other tickets, I think it remains relevant
for this one. Your desire to plan for new things in our V1 filecaps might
actually be a desire to define and implement those V2/V3 syntaxes (and
improve the webapi to accept them, etc). So it may be better to leave the V1
syntax definition alone, leave certain Tahoe interfaces intolerant to the
potential new forms, and declare that we'll replace those interfaces with
V2+-tolerant ones before we start using those forms.

== Re: behaviour-utf8-future-caps.dpatch ==

Why the s/name/namex/g ? Did you maybe mean to say "name = unicode(namex)"
to highlight the transition from "unicode or bytestring" to "really unicode",
and then leave the other instances of "name" alone?

The writecap = to_str(propropdict.get("rw_uri")) line performs the
unicode-to-UTF8 conversion. This means that webapi users calling
t=mkdir-with-children or t=mkdir-immutable are giving us unicode,
not UTF-8 bytestrings (i.e. tahoe gets
callerwritecap.decode("utf-8").encode("utf-8"), because the JSON
library is doing a decode before tahoe proper sees the data). Worse yet, the
decode and the encode are being done by different pieces of code (I'd hope
that the JSON library uses python's .decode logic, but who knows?).
That's the best way to implement the unicode-caps design, but it also makes
it clear that this is not an exact transformation.

I didn't review it earlier, but nodemaker.create_from_cap(name=) is weird.
I'd be concerned about unicode creeping into an exception instance and then
causing bytestring-only logging to break (such as when it is written to
twistd.log). I'm not sure what a good solution is: I see how it's a bit
easier to pass "extraneous" information down into a function that might raise
an exception (and stuff it into the exception message down there), rather
than e.g. catch the exception higher up (where knowing name= is a bit more
natural) and somehow gluing the name into the already-constructed exception
object.

== Re: test-utf8-future-caps.dpatch ==

Hrm, could you reduce the instances of "failUnlessReallyEqual" to things that
just test caps? Seeing it on things like
(c.getServiceNamed("storage"}.reserved_space, 0) makes the patch
awfully big. Hm, and if there were some clever way to make it the same length
as "failUnlessEqual", that would reduce the noise even further (if you do
this, which I don't think you should, note that
len(assertTypeEqual)==len(failUnlessEqual)).

I don't think using failUnlessReallyEqual in test_dirnode.py on things
like set(metadata.keys() does everything you want it to: it will assert
that both sides are of type Set, but it won't assert that the members of
those sets are both of type string.

In test_dirnode.py, I would call the new variables
"future_unicode_write_uri", rather than "future_nonascii_write_uri", to make
it clear that this is one possible direction (and that there are others).

== Conclusions ==

behaviour-utf8-future-caps.dpatch: yes, this patch is pretty harmless, I
don't mind it going in.

test-utf8-future-caps.dpatch: I see no problems with the patch per se, but I
think the examples it uses set a bad precedent, by causing anyone reading the
test to believe that tahoe's future caps will be unicode, which I think is a
bad idea.

I don't object to these two patches going in, but I will continue to object
to the idea that the filecaps accepted by our existing interfaces (and stored
in existing dirnodes) should be defined as unicode-encoded-to-UTF8. I think
the best approaches are, in order of preference:

  1. continue to restrict filecaps to printable ASCII
  2. define filecaps as arbitrary bytestrings and replace the t=json WAPI
    interface which is unable to tolerate such a wide range

I don't want to define filecaps to be unicode. Unicode exists to represent
strings of written human languages. Filecaps are records/structs of
cryptovalues. We have more tools to manipulate printable/copypastable strings
than to manipulate abstract records of cryptovalues, so expressing filecaps
as strings is convenient, but we should pick the encoding to serve tahoe's
needs, rather than trying to make any conceivable written-human-language
string meaningful as a tahoe filecap.

That said, for users who have a solid unicode-friendly set of tools and want
to tweet their filecaps, I don't object to an encoding scheme that somehow
takes a filecap and expresses it as a string of unicode characters (this
would be a "V4", in my V1/V2/V3 scheme from comment:11). But the tahoe
interfaces that accept this need to be clearly marked, and I think the
current t=json is not one of them.

David-Sarah's analysis in comment:14 is mostly in line with my thinking. I object less to "filecaps are UTF-8 encoding of some unicode string" than "filecaps are unicode strings". This would let us say that filecaps are bytestrings but with a constraint that `filecap.decode("utf-8")` must not throw an exception, and perhaps the additional constraint that `filecap.decode("utf-8").encode("utf-8")==filecap`. If we went this way, we should say that the UTF-8 -encoded form is the primary one (i.e., if you want to compare two filecaps, use `filecap1==filecap2`, not `filecap1.decode("utf-8")==filecap2.decode("utf-8")`. That still feels weird, though: UTF-8 is an encoding of something else, and in general you want to be comparing the primary form, not some encoding thereof. And filecaps *must* be unambiguous. If you wanted to visually compare two ASCII filecaps, you could do it easily (in fact the base32 takes out the o/0 1/i/I/l/L homoglyphs). While I don't expect people to do this much, the fact that two unicode strings simply cannot be safely compared this way has got to be a bad sign. If we really must accept more than just ASCII, then I'd prefer to accept completely arbitrary bytestrings. The biggest problem with doing this is the t=json WAPI: if I'd taken this issue at all seriously when I built the webapi, I would have defined the t=json format to emit base64-encoded filecaps or something similar. (actually, at that point I did not yet realize that JSON could not handle arbitrary binary data.. if I had, I might have skipped JSON altogether and used protocol buffers or netstrings or something). But one option would be to have the t=json response leave out any filecap that cannot be expressed in printable ASCII (i.e., run a regexp against it before populating the child-info dictionary, replace it with an "unknown cap" marker if that fails). I can't remember if we covered this one during the earlier caps-from-the-future discussion. If we go with "filecaps are UTF-8 encoding of a unicode string", then the t=json API doesn't give enough information to clients to compare the real filecaps: all they can get is `filecap.decode("utf-8")` . In addition, at some point inside the webapi, we'd have to convert the filecaps into unicode before adding them to the JSON response. I'm really nervous about the information-losing behavior of unicode conversions, and security problems that can result. > Note that if we do use Unicode in caps in future, we should limit the > character set to characters for which normalization is not an issue. (There > are big blocks of Han characters with no equivalences, for example.) Ugh.. how can we make this safe? That is, when somebody pastes in a cap, how do we verify that it isn't using any characters in this set? Is this set even constant? When we're all speaking Lojban or Ilaksh or Marain or something in the future, won't there be new codepoints which the old code can't recognize as being non-normalizable? > A related but separate issue is how to plan for expansion to the V2/V3/etc > syntaxes. While parts of this may belong in other tickets, I think it remains relevant for this one. Your desire to plan for new things in our V1 filecaps might actually be a desire to define and implement those V2/V3 syntaxes (and improve the webapi to accept them, etc). So it may be better to leave the V1 syntax definition alone, leave certain Tahoe interfaces intolerant to the potential new forms, and declare that we'll replace those interfaces with V2+-tolerant ones before we start using those forms. == Re: behaviour-utf8-future-caps.dpatch == Why the s/name/namex/g ? Did you maybe mean to say "`name = unicode(namex)`" to highlight the transition from "unicode or bytestring" to "really unicode", and then leave the other instances of "name" alone? The `writecap = to_str(propropdict.get("rw_uri"))` line performs the unicode-to-UTF8 conversion. This means that webapi users calling `t=mkdir-with-children` or `t=mkdir-immutable` are giving us unicode, not UTF-8 bytestrings (i.e. tahoe gets `callerwritecap.decode("utf-8").encode("utf-8")`, because the JSON library is doing a decode before tahoe proper sees the data). Worse yet, the decode and the encode are being done by different pieces of code (I'd hope that the JSON library uses python's `.decode` logic, but who knows?). That's the best way to implement the unicode-caps design, but it also makes it clear that this is not an exact transformation. I didn't review it earlier, but nodemaker.create_from_cap(name=) is weird. I'd be concerned about unicode creeping into an exception instance and then causing bytestring-only logging to break (such as when it is written to twistd.log). I'm not sure what a good solution is: I see how it's a bit easier to pass "extraneous" information down into a function that might raise an exception (and stuff it into the exception message down there), rather than e.g. catch the exception higher up (where knowing name= is a bit more natural) and somehow gluing the name into the already-constructed exception object. == Re: test-utf8-future-caps.dpatch == Hrm, could you reduce the instances of "failUnlessReallyEqual" to things that just test caps? Seeing it on things like `(c.getServiceNamed("storage"}.reserved_space, 0)` makes the patch awfully big. Hm, and if there were some clever way to make it the same length as "failUnlessEqual", that would reduce the noise even further (if you do this, which I don't think you should, note that len(assertTypeEqual)==len(failUnlessEqual)). I don't think using `failUnlessReallyEqual` in test_dirnode.py on things like `set(metadata.keys()` does everything you want it to: it will assert that both sides are of type Set, but it won't assert that the members of those sets are both of type string. In test_dirnode.py, I would call the new variables "future_unicode_write_uri", rather than "future_nonascii_write_uri", to make it clear that this is one possible direction (and that there are others). == Conclusions == behaviour-utf8-future-caps.dpatch: yes, this patch is pretty harmless, I don't mind it going in. test-utf8-future-caps.dpatch: I see no problems with the patch per se, but I think the examples it uses set a bad precedent, by causing anyone reading the test to believe that tahoe's future caps will be unicode, which I think is a bad idea. I don't object to these two patches going in, but I will continue to object to the idea that the filecaps accepted by our existing interfaces (and stored in existing dirnodes) should be defined as unicode-encoded-to-UTF8. I think the best approaches are, in order of preference: 1. continue to restrict filecaps to printable ASCII 2. define filecaps as arbitrary bytestrings and replace the t=json WAPI interface which is unable to tolerate such a wide range I don't want to define filecaps to be unicode. Unicode exists to represent strings of written human languages. Filecaps are records/structs of cryptovalues. We have more tools to manipulate printable/copypastable strings than to manipulate abstract records of cryptovalues, so expressing filecaps as strings is convenient, but we should pick the encoding to serve tahoe's needs, rather than trying to make any conceivable written-human-language string meaningful as a tahoe filecap. That said, for users who have a solid unicode-friendly set of tools and want to tweet their filecaps, I don't object to an encoding scheme that somehow takes a filecap and expresses it as a string of unicode characters (this would be a "V4", in my V1/V2/V3 scheme from comment:11). But the tahoe interfaces that accept this need to be clearly marked, and I think the current t=json is not one of them.
zooko commented 2010-07-15 00:01:20 +00:00
Author
Owner

Wow! Thanks for the detailed review.

I think we need to take the broader precedent-setting discussion somewhere else, such as the mailing list and then once it starts to gel move it to the wiki/NewCapDesign page.

I'll try to focus on the narrower issues in this comment.

Replying to warner:

But one option would be to have the t=json response leave out any filecap
that cannot be expressed in printable ASCII (i.e., run a regexp against it
before populating the child-info dictionary, replace it with an "unknown cap"
marker if that fails). I can't remember if we covered this one during the
earlier caps-from-the-future discussion.

What purpose would that serve?

test-utf8-future-caps.dpatch: I see no problems with the patch per se, but I
think the examples it uses set a bad precedent, by causing anyone reading the
test to believe that tahoe's future caps will be unicode, which I think is a
bad idea.

Maybe add a comment saying that we do not intend to invent caps like these--these are only examples of possibilities for testing.

Wow! Thanks for the detailed review. I think we need to take the broader precedent-setting discussion somewhere else, such as the mailing list and then once it starts to gel move it to the [wiki/NewCapDesign](wiki/NewCapDesign) page. I'll try to focus on the narrower issues in this comment. Replying to [warner](/tahoe-lafs/trac-2024-07-25/issues/1051#issuecomment-119144): > > But one option would be to have the t=json response leave out any filecap > that cannot be expressed in printable ASCII (i.e., run a regexp against it > before populating the child-info dictionary, replace it with an "unknown cap" > marker if that fails). I can't remember if we covered this one during the > earlier caps-from-the-future discussion. What purpose would that serve? > test-utf8-future-caps.dpatch: I see no problems with the patch per se, but I > think the examples it uses set a bad precedent, by causing anyone reading the > test to believe that tahoe's future caps will be unicode, which I think is a > bad idea. Maybe add a comment saying that we do not intend to invent caps like these--these are only examples of possibilities for testing.
davidsarah commented 2010-07-17 22:48:55 +00:00
Author
Owner

behaviour-utf8-future-caps.dpatch applied in changeset:fa0fd66e17fe845b.

behaviour-utf8-future-caps.dpatch applied in changeset:fa0fd66e17fe845b.
zooko commented 2010-07-18 05:51:04 +00:00
Author
Owner

applied tests as changeset:d346e0853d9b0b4b. added comment about the tests of caps "from the future" being actually from an alternate reality future: changeset:7cc98759bd1baca3

applied tests as changeset:d346e0853d9b0b4b. added comment about the tests of caps "from the future" being actually from an alternate reality future: changeset:7cc98759bd1baca3
tahoe-lafs added the
fixed
label 2010-07-18 05:51:04 +00:00
zooko closed this issue 2010-07-18 05:51:04 +00:00
davidsarah commented 2010-07-18 23:08:00 +00:00
Author
Owner

Some of the tests added in changeset:d346e0853d9b0b4b were too strict in testing the type of values parsed from JSON (which is different depending on the simplejson version). This was fixed in changeset:74c41ebb8bb772c2.

Some of the tests added in changeset:d346e0853d9b0b4b were too strict in testing the type of values parsed from JSON (which is different depending on the `simplejson` version). This was fixed in changeset:74c41ebb8bb772c2.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: tahoe-lafs/trac-2024-07-25#1051
No description provided.