CLI (e.g. tahoe cp) does not correctly handle or display Unicode file/directory names, either in arguments or the local filesystem

francois commented

2008-11-11 20:37:57 +00:00

Owner

"tahoe cp" doesn't correctly handle unicode filenames.

Steps to reproduce:

francois@korn:~/tmp$ touch Motörhead
francois@korn:~/tmp$ tahoe cp Motörhead tahoe:
Traceback (most recent call last):
  File "/home/francois/dev/tahoe/support/bin/tahoe", line 8, in <module>
    load_entry_point('allmydata-tahoe==1.2.0-r3183', 'console_scripts', 'tahoe')()
  File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 84, in run
    rc = runner(sys.argv[1:])
  File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 73, in runner
    rc = cli.dispatch[command](so)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/cli.py", line 261, in cp
    rc = tahoe_cp.copy(options)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 722, in copy
    return Copier().do_copy(options)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 478, in do_copy
    return self.copy_to_directory(sources, target)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 594, in copy_to_directory
    target.populate(False)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 343, in populate
    urllib.quote(name)])
  File "/usr/lib/python2.5/urllib.py", line 1205, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\u0129'
francois@korn:~/tmp$

"tahoe cp" doesn't correctly handle unicode filenames. Steps to reproduce: ``` francois@korn:~/tmp$ touch Motörhead francois@korn:~/tmp$ tahoe cp Motörhead tahoe: Traceback (most recent call last): File "/home/francois/dev/tahoe/support/bin/tahoe", line 8, in <module> load_entry_point('allmydata-tahoe==1.2.0-r3183', 'console_scripts', 'tahoe')() File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 84, in run rc = runner(sys.argv[1:]) File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 73, in runner rc = cli.dispatch[command](so) File "/home/francois/dev/tahoe/src/allmydata/scripts/cli.py", line 261, in cp rc = tahoe_cp.copy(options) File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 722, in copy return Copier().do_copy(options) File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 478, in do_copy return self.copy_to_directory(sources, target) File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 594, in copy_to_directory target.populate(False) File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 343, in populate urllib.quote(name)]) File "/usr/lib/python2.5/urllib.py", line 1205, in quote res = map(safe_map.__getitem__, s) KeyError: u'\u0129' francois@korn:~/tmp$ ```

tahoe-lafs added the

labels 2008-11-11 20:37:57 +00:00

tahoe-lafs added this to the undecided milestone 2008-11-11 20:37:57 +00:00

francois commented

2008-11-11 20:38:51 +00:00

Author

Owner

Attached patch seems to fix this issue.

zooko commented

2008-11-12 13:47:34 +00:00

Dear Francois:

Thank you very much for the bug report and patch. It looks good to me. Is there any chance you can write a unit test for this bug? The test module for the command-line functions is here:

http://allmydata.org/trac/tahoe/browser/src/allmydata/test/test_cli.py

It looks like there are tests for get and put but not for cp in there.

Please let me know. Thanks!

--Zooko

Dear Francois: Thank you very much for the bug report and patch. It looks good to me. Is there any chance you can write a unit test for this bug? The test module for the command-line functions is here: <http://allmydata.org/trac/tahoe/browser/src/allmydata/test/test_cli.py> It looks like there are tests for get and put but not for cp in there. Please let me know. Thanks! --Zooko

francois commented

2008-11-13 11:30:50 +00:00

Author

Owner

Attachment cp-encoding.patch (86547 bytes) added

Bugfix + test

**Attachment** cp-encoding.patch (86547 bytes) added Bugfix + test

cp-encoding.patch

84 KiB

francois commented

2008-11-13 11:33:01 +00:00

Author

Owner

This newer patch contains the bugfix itself and associated test in test_cli.py.

Thanks for your advice Zooko !

This newer patch contains the bugfix itself and associated test in test_cli.py. Thanks for your advice Zooko !

zooko commented

2008-11-13 13:52:06 +00:00

fixed by changeset:c1f639d230bb32a1 and changeset:5c0c5bfc81e1f0cb.

zooko added the

r/fixed

label 2008-11-13 13:52:06 +00:00

zooko closed this issue

2008-11-13 13:52:06 +00:00

warner commented

2008-11-13 20:32:59 +00:00

cygwin, solaris, and yukyu-hardy buildslaves are all failing this test.

Also, the test should probably read back from those files (with a 'tahoe get') to make sure the cp actually worked (instead of perhaps failing silently).

cygwin, solaris, and yukyu-hardy buildslaves are all failing this test. Also, the test should probably read back from those files (with a 'tahoe get') to make sure the cp actually worked (instead of perhaps failing silently).

warner removed the

r/fixed

label 2008-11-13 20:32:59 +00:00

warner reopened this issue

2008-11-13 20:32:59 +00:00

warner commented

2008-11-13 21:03:14 +00:00

the test is also failing on my personal workstation (linux, debian/sid), in the same way as the other buildslaves:

[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
  File "/home/warner/trees/tahoe/src/allmydata/test/test_cli.py", line 554, in test_unicode_filename
    open(fn1, "wb").write("unicode file content")
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 56: ordinal not in range(128)

the test is also failing on my personal workstation (linux, debian/sid), in the same way as the other buildslaves: ``` [ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename Traceback (most recent call last): File "/home/warner/trees/tahoe/src/allmydata/test/test_cli.py", line 554, in test_unicode_filename open(fn1, "wb").write("unicode file content") exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 56: ordinal not in range(128) ```

francois commented

2008-11-14 10:24:41 +00:00

Author

Owner

Ok, I can reproduce this failing test on my environment (Ubuntu hardy, 64bits, LC_ALL=en_US.UTF8).

$ env -i PATH=$PATH make test TEST="allmydata.test.test_cli.Cp"

It is related to the locales environment settings.

Ok, I can reproduce this failing test on my environment (Ubuntu hardy, 64bits, LC_ALL=en_US.UTF8). ``` $ env -i PATH=$PATH make test TEST="allmydata.test.test_cli.Cp" ``` It is related to the locales environment settings.

francois commented

2008-11-14 14:04:15 +00:00

Author

Owner

Attachment fix-failing-test.2.patch (87347 bytes) added

This patch should hopefully fix failing test on systems with LC_ALL=C

**Attachment** fix-failing-test.2.patch (87347 bytes) added This patch should hopefully fix failing test on systems with LC_ALL=C

fix-failing-test.2.patch

85 KiB

francois commented

2008-11-14 14:07:56 +00:00

Author

Owner

Attachment credit.patch (86319 bytes) added

Add credit entry

**Attachment** credit.patch (86319 bytes) added Add credit entry

credit.patch

84 KiB

warner commented

2008-11-15 00:01:00 +00:00

Thanks, I'll apply those.

What do you think about the idea of using escaped binary strings instead of native UTF-8 in those filenames, i.e. removing the "coding=utf-8" line from the beginning of the file and using "\xc3\x84rtonwall" instead of embedding the A-with-umlaut in the string? I'm wondering if it might make our intentions clearer. At the moment we have a unicode name expressed as a non-unicode type (without the leading u", that string is just a bytestring), and the casual observer might not realize that we intended to get the binary strings into the stuff on disk.

Thanks, I'll apply those. What do you think about the idea of using escaped binary strings instead of native UTF-8 in those filenames, i.e. removing the "coding=utf-8" line from the beginning of the file and using `"\xc3\x84rtonwall"` instead of embedding the A-with-umlaut in the string? I'm wondering if it might make our intentions clearer. At the moment we have a unicode name expressed as a non-unicode type (without the leading `u"`, that string is just a bytestring), and the casual observer might not realize that we intended to get the binary strings into the stuff on disk.

warner commented

2008-11-15 00:10:35 +00:00

Oops, Zooko must have beat me to it.. those patches were already applied.

But, the buildbot is still indicating test failures: see http://allmydata.org/buildbot/waterfall . This time they're on dapper and feisty. That's really weird, I don't know what would cause those to fail but not the others.

Oops, Zooko must have beat me to it.. those patches were already applied. But, the buildbot is still indicating test failures: see <http://allmydata.org/buildbot/waterfall> . This time they're on dapper and feisty. That's really weird, I don't know what would cause those to fail but not the others.

francois commented

2008-11-16 00:11:21 +00:00

Author

Owner

Replying to warner:

What do you think about the idea of using escaped binary strings instead of native UTF-8 in those filenames, i.e. removing the "coding=utf-8" line from the beginning of the file and using "\xc3\x84rtonwall" instead of embedding the A-with-umlaut in the string? I'm wondering if it might make our intentions clearer. At the moment we have a unicode name expressed as a non-unicode type (without the leading u", that string is just a bytestring), and the casual observer might not realize that we intended to get the binary strings into the stuff on disk.

Yes, this would have been a good idea but the encoding specification looks mandatory for both "\xc3\x84rtonwall" and "Ärtonwall".

exceptions.SyntaxError: Non-ASCII character '\xc3' in file /home/francois/dev/tahoe/src/allmydata/test/test_cli.py on line 567, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details (test_cli.py, line 567)

Replying to warner:

But, the buildbot is still indicating test failures: see http://allmydata.org/buildbot/waterfall . This time they're on dapper and feisty. That's really weird, I don't know what would cause those to fail but not the others.

Unfortunately, I'm not able to reproduce those failures on my dapper and feisty virtual machines. Can you please try running the attached test script (test-unicode.py) on your machines ?

Replying to [warner](/tahoe-lafs/trac/issues/534#issuecomment-368976): > What do you think about the idea of using escaped binary strings instead of native UTF-8 in those filenames, i.e. removing the "coding=utf-8" line from the beginning of the file and using `"\xc3\x84rtonwall"` instead of embedding the A-with-umlaut in the string? I'm wondering if it might make our intentions clearer. At the moment we have a unicode name expressed as a non-unicode type (without the leading `u"`, that string is just a bytestring), and the casual observer might not realize that we intended to get the binary strings into the stuff on disk. Yes, this would have been a good idea but the encoding specification looks mandatory for both "\xc3\x84rtonwall" and "Ärtonwall". ``` exceptions.SyntaxError: Non-ASCII character '\xc3' in file /home/francois/dev/tahoe/src/allmydata/test/test_cli.py on line 567, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details (test_cli.py, line 567) ``` Replying to [warner](/tahoe-lafs/trac/issues/534#issuecomment-368977): > But, the buildbot is still indicating test failures: see <http://allmydata.org/buildbot/waterfall> . This time they're on dapper and feisty. That's really weird, I don't know what would cause those to fail but not the others. Unfortunately, I'm not able to reproduce those failures on my dapper and feisty virtual machines. Can you please try running the attached test script (test-unicode.py) on your machines ?

francois commented

2008-11-16 00:11:47 +00:00

Author

Owner

Attachment test-unicode.py (205 bytes) added

Test script for unicode filename issues

**Attachment** test-unicode.py (205 bytes) added Test script for unicode filename issues

test-unicode.py

205 B

zooko commented

2008-11-16 01:48:10 +00:00

test-unicode.py prints "OK" on our dapper machine. I'm investigating in more detail why the unit test allmydata.test.test_cli.Cp.test_unicode_filename fails on that same machine and user.

zooko commented

2008-11-16 02:11:29 +00:00

Well I've gotten as far as determining that the "tahoe_get" spawned by source:src/allmydata/test/test_cli.py@20081114134458-c0148-c011dbf3b2ce743b0237e192c3f5ed33c1b81c49#L569 is getting a 404 result:

2008-11-16 02:04:50.702Z [-] xxx 4 url: 'http://127.0.0.1:44984/uri/URI%3ADIR2%3Apobw3tvboe5ztqvt472hvnlsbm%3Apyunlfwglp5lziw4dlyybqsclplm2komshqvcqa3vp6nnrpnmzda/%C3%84rtonwall', resp: <httplib.HTTPResponse instance at 0xb0772ac> status 404

Now I need to read a bed-time story to a 7-year-old.

Well I've gotten as far as determining that the "tahoe_get" spawned by source:src/allmydata/test/test_cli.py@20081114134458-c0148-c011dbf3b2ce743b0237e192c3f5ed33c1b81c49#L569 is getting a 404 result: ``` 2008-11-16 02:04:50.702Z [-] xxx 4 url: 'http://127.0.0.1:44984/uri/URI%3ADIR2%3Apobw3tvboe5ztqvt472hvnlsbm%3Apyunlfwglp5lziw4dlyybqsclplm2komshqvcqa3vp6nnrpnmzda/%C3%84rtonwall', resp: <httplib.HTTPResponse instance at 0xb0772ac> status 404 ``` Now I need to read a bed-time story to a 7-year-old.

zooko commented

2008-11-18 20:21:02 +00:00

So I switched from working on the dapper buildslave to my own workstation -- yukyuk, running Ubuntu Hardy.

The first thing I see is that that the current trunk gives a coding error:

$ python setup.py trial -a " allmydata.test.test_cli.Cp.test_unicode_filename"
running trial
(run setup.py with -vv for trial command-line details)
allmydata.test.test_cli
  Cp
    test_unicode_filename ...                                           [ERROR]

===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
  File "/home/zooko/playground/allmydata/tahoe/trunk/pristine-really/src/allmydata/test/test_cli.py", line 554, in test_unicode_filename
    open(fn1, "wb").write("unicode file content")
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 56: ordinal not in range(128)
-------------------------------------------------------------------------------
Ran 1 tests in 0.032s

FAILED (errors=1)

The next thing I see is that if I replace the unicode object with the string "\xc3\x84rtonwall" then the test never completes:

$ python setup.py trial -a " allmydata.test.test_cli.Cp.test_unicode_filename"
running trial
(run setup.py with -vv for trial command-line details)
allmydata.test.test_cli
  Cp
    test_unicode_filename ...                                           [ERROR]
                                          [ERROR]
                                          [ERROR]

===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_cli.Cp testMethod=test_unicode_filename> (test_unicode_filename) still running at 120.0 secs
===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 31004360 [0.0012834072113s] called=0 cancelled=0 LoopingCall<0.01>(Cp._poll, *(<bound method Cp._check_connections of <allmydata.test.test_cli.Cp testMethod=test_unicode_filename>>, 1227039706.426708), **{})()>
<DelayedCall 36538560 [0.0546979904175s] called=0 cancelled=0 _resetLogDateTime()>
<DelayedCall 31579816 [0.0570547580719s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 31579672 [0.23085975647s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 26005304 [0.253265857697s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 25395856 [0.192253112793s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 25384432 [0.211790084839s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 30849576 [76.1554584503s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 29971880 [171.961197615s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 30836280 [254.607980013s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 34387584 [143.276460648s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 26019816 [3480.25343394s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 26005448 [3480.21197605s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 31697592 [18.9836809635s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 25383712 [3480.19285607s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 33256872 [135.306014299s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 25397800 [3480.056885s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 26013568 [3480.23269224s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
Selectables:
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 41841>
<<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 45178>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 60650>
<<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 33634>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 38985>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 57046>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 60445>
<<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 58303>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 54338>
-------------------------------------------------------------------------------
Ran 1 tests in 120.079s

FAILED (errors=3)

So I switched from working on the dapper buildslave to my own workstation -- yukyuk, running Ubuntu Hardy. The first thing I see is that that the current trunk gives a coding error: ``` $ python setup.py trial -a " allmydata.test.test_cli.Cp.test_unicode_filename" running trial (run setup.py with -vv for trial command-line details) allmydata.test.test_cli Cp test_unicode_filename ... [ERROR] =============================================================================== [ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename Traceback (most recent call last): File "/home/zooko/playground/allmydata/tahoe/trunk/pristine-really/src/allmydata/test/test_cli.py", line 554, in test_unicode_filename open(fn1, "wb").write("unicode file content") exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 56: ordinal not in range(128) ------------------------------------------------------------------------------- Ran 1 tests in 0.032s FAILED (errors=1) ``` The next thing I see is that if I replace the unicode object with the string "\xc3\x84rtonwall" then the test never completes: ``` $ python setup.py trial -a " allmydata.test.test_cli.Cp.test_unicode_filename" running trial (run setup.py with -vv for trial command-line details) allmydata.test.test_cli Cp test_unicode_filename ... [ERROR] [ERROR] [ERROR] =============================================================================== [ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_cli.Cp testMethod=test_unicode_filename> (test_unicode_filename) still running at 120.0 secs =============================================================================== [ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 31004360 [0.0012834072113s] called=0 cancelled=0 LoopingCall<0.01>(Cp._poll, *(<bound method Cp._check_connections of <allmydata.test.test_cli.Cp testMethod=test_unicode_filename>>, 1227039706.426708), **{})()> <DelayedCall 36538560 [0.0546979904175s] called=0 cancelled=0 _resetLogDateTime()> <DelayedCall 31579816 [0.0570547580719s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()> <DelayedCall 31579672 [0.23085975647s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()> <DelayedCall 26005304 [0.253265857697s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()> <DelayedCall 25395856 [0.192253112793s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()> <DelayedCall 25384432 [0.211790084839s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()> <DelayedCall 30849576 [76.1554584503s] called=0 cancelled=0 Reconnector._timer_expired()> <DelayedCall 29971880 [171.961197615s] called=0 cancelled=0 Reconnector._timer_expired()> <DelayedCall 30836280 [254.607980013s] called=0 cancelled=0 Reconnector._timer_expired()> <DelayedCall 34387584 [143.276460648s] called=0 cancelled=0 Reconnector._timer_expired()> <DelayedCall 26019816 [3480.25343394s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()> <DelayedCall 26005448 [3480.21197605s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()> <DelayedCall 31697592 [18.9836809635s] called=0 cancelled=0 Reconnector._timer_expired()> <DelayedCall 25383712 [3480.19285607s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()> <DelayedCall 33256872 [135.306014299s] called=0 cancelled=0 Reconnector._timer_expired()> <DelayedCall 25397800 [3480.056885s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()> <DelayedCall 26013568 [3480.23269224s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()> =============================================================================== [ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. Selectables: <<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 41841> <<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 45178> <<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 60650> <<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 33634> <<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 38985> <<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 57046> <<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 60445> <<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 58303> <<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 54338> ------------------------------------------------------------------------------- Ran 1 tests in 120.079s FAILED (errors=3) ```

zooko commented

2008-11-18 20:22:12 +00:00

The next thing I notice is that if I replace it with the string "thisisjustastring" then the test still hangs.

zooko commented

2008-11-18 21:25:53 +00:00

So, it looks like the problems on yukyuk don't necessarily have anything to do with unicode, but it's just that the test_cli.Cp test is the first one that tries to construct a TLS connection, or something:

local#247 14:08:37.958: TubConnector created from bduf2tsgfciu2j4zrs7oxvt52h5ikgtj to njeaxgfqknd2as37fxdkl4skyybywey6
local#248 14:08:37.958: connectTCP to ('127.0.0.1', 57617)
local#249 14:08:37.958: Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from bduf2tsg] [to njeaxgfq] at 0x19c8b50>
local#250 14:08:37.958: connectTCP to ('192.168.1.126', 57617)
local#251 14:08:37.958: Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from bduf2tsg] [to njeaxgfq] at 0x19c8c90>
local#252 14:08:37.959: want to subscribe, but no introducer yet
local#253 14:08:37.959: want to publish, but no introducer yet
local#254 14:08:37.960: want to publish, but no introducer yet
local#255 14:08:37.962: dataReceived(isClient=False,phase=0,options={}): 'GET /id/njeaxgfqknd2as37fxdkl4skyybywey6 HTTP/1.1\r\nHost: 127.0.0.1\r\nUpgrade: TLS/1.0\r\nConnection: Upgrade\r\n\r\n'
local#256 14:08:37.962: handlePLAINTEXTServer: targetTubID='njeaxgfqknd2as37fxdkl4skyybywey6'
local#257 14:08:37.962: handlePLAINTEXTServer: wantEncrypted=True
local#258 14:08:37.963: startENCRYPTED(isClient=False, encrypted=True)
local#259 14:08:37.963: startTLS, client=False
local#260 14:08:37.966: negotiation had exception
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 37, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/pollreactor.py", line 186, in _doReadOrWrite
     why = selectable.doRead()
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 362, in doRead
     return self.protocol.dataReceived(data)
 --- <exception caught here> ---
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 402, in dataReceived
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 527, in handlePLAINTEXTServer
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 554, in sendPlaintextServerAndStartENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 589, in startENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 1139, in startTLS
     
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 660, in startTLS
     holder = Connection.startTLS(self, ctx)
 exceptions.AttributeError: type object 'Connection' has no attribute 'startTLS'
 ]
local#261 14:08:37.971: negotiation had internal error:
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 37, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/pollreactor.py", line 186, in _doReadOrWrite
     why = selectable.doRead()
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 362, in doRead
     return self.protocol.dataReceived(data)
 --- <exception caught here> ---
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 402, in dataReceived
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 527, in handlePLAINTEXTServer
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 554, in sendPlaintextServerAndStartENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 589, in startENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 1139, in startTLS
     
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 660, in startTLS
     holder = Connection.startTLS(self, ctx)
 exceptions.AttributeError: type object 'Connection' has no attribute 'startTLS'
 ]
local#262 14:08:37.976: Tub location set to 192.168.1.126:33701,127.0.0.1:33701
local#263 14:08:37.977: client running
local#264 14:08:37.977: dataReceived(isClient=False,phase=0,options={}): 'GET /id/njeaxgfqknd2as37fxdkl4skyybywey6 HTTP/1.1\r\nHost: 192.168.1.126\r\nUpgrade: TLS/1.0\r\nConnection: Upgrade\r\n\r\n'
local#265 14:08:37.977: handlePLAINTEXTServer: targetTubID='njeaxgfqknd2as37fxdkl4skyybywey6'
local#266 14:08:37.977: handlePLAINTEXTServer: wantEncrypted=True
local#267 14:08:37.978: startENCRYPTED(isClient=False, encrypted=True)
local#268 14:08:37.978: startTLS, client=False
local#269 14:08:37.986: negotiation had exception
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext

So, it looks like the problems on yukyuk don't necessarily have anything to do with unicode, but it's just that the test_cli.Cp test is the first one that tries to construct a TLS connection, or something: ``` local#247 14:08:37.958: TubConnector created from bduf2tsgfciu2j4zrs7oxvt52h5ikgtj to njeaxgfqknd2as37fxdkl4skyybywey6 local#248 14:08:37.958: connectTCP to ('127.0.0.1', 57617) local#249 14:08:37.958: Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from bduf2tsg] [to njeaxgfq] at 0x19c8b50> local#250 14:08:37.958: connectTCP to ('192.168.1.126', 57617) local#251 14:08:37.958: Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from bduf2tsg] [to njeaxgfq] at 0x19c8c90> local#252 14:08:37.959: want to subscribe, but no introducer yet local#253 14:08:37.959: want to publish, but no introducer yet local#254 14:08:37.960: want to publish, but no introducer yet local#255 14:08:37.962: dataReceived(isClient=False,phase=0,options={}): 'GET /id/njeaxgfqknd2as37fxdkl4skyybywey6 HTTP/1.1\r\nHost: 127.0.0.1\r\nUpgrade: TLS/1.0\r\nConnection: Upgrade\r\n\r\n' local#256 14:08:37.962: handlePLAINTEXTServer: targetTubID='njeaxgfqknd2as37fxdkl4skyybywey6' local#257 14:08:37.962: handlePLAINTEXTServer: wantEncrypted=True local#258 14:08:37.963: startENCRYPTED(isClient=False, encrypted=True) local#259 14:08:37.963: startTLS, client=False local#260 14:08:37.966: negotiation had exception FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/pollreactor.py", line 186, in _doReadOrWrite why = selectable.doRead() File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 362, in doRead return self.protocol.dataReceived(data) --- <exception caught here> --- File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 402, in dataReceived File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 527, in handlePLAINTEXTServer File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 554, in sendPlaintextServerAndStartENCRYPTED File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 589, in startENCRYPTED File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 1139, in startTLS File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 660, in startTLS holder = Connection.startTLS(self, ctx) exceptions.AttributeError: type object 'Connection' has no attribute 'startTLS' ] local#261 14:08:37.971: negotiation had internal error: FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/pollreactor.py", line 186, in _doReadOrWrite why = selectable.doRead() File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 362, in doRead return self.protocol.dataReceived(data) --- <exception caught here> --- File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 402, in dataReceived File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 527, in handlePLAINTEXTServer File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 554, in sendPlaintextServerAndStartENCRYPTED File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 589, in startENCRYPTED File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 1139, in startTLS File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 660, in startTLS holder = Connection.startTLS(self, ctx) exceptions.AttributeError: type object 'Connection' has no attribute 'startTLS' ] local#262 14:08:37.976: Tub location set to 192.168.1.126:33701,127.0.0.1:33701 local#263 14:08:37.977: client running local#264 14:08:37.977: dataReceived(isClient=False,phase=0,options={}): 'GET /id/njeaxgfqknd2as37fxdkl4skyybywey6 HTTP/1.1\r\nHost: 192.168.1.126\r\nUpgrade: TLS/1.0\r\nConnection: Upgrade\r\n\r\n' local#265 14:08:37.977: handlePLAINTEXTServer: targetTubID='njeaxgfqknd2as37fxdkl4skyybywey6' local#266 14:08:37.977: handlePLAINTEXTServer: wantEncrypted=True local#267 14:08:37.978: startENCRYPTED(isClient=False, encrypted=True) local#268 14:08:37.978: startTLS, client=False local#269 14:08:37.986: negotiation had exception FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext ```

zooko commented

2008-11-18 23:34:31 +00:00

Okay, that problem on yukyuk was that I had deliberately introduced an uncaught exception into my system-wide installation of Twisted when testing something else. Please disregard all that.

So now to debug the unicode issue I'm moving back to our dapper buildslave..

Okay, that problem on yukyuk was that I had deliberately introduced an uncaught exception into my system-wide installation of Twisted when testing something else. Please disregard all that. So now to debug the unicode issue I'm moving back to our dapper buildslave..

warner commented

2008-11-20 00:42:57 +00:00

So, as an extra datapoint, sys.getfilesystemencoding() on the two buildslaves that are failing reports "UTF-8". My local workstation (on which this test passes) reports "ANSI_X3.4-1968". However, on our gutsy and hardy buildslaves (which are also passing), it reports UTF-8 too.

Some differential analysis shows that one early sign of trouble is during the webapi t=set_children call that 'tahoe cp' does. This API uses the under-documented t=set_children call, which (from what I can tell) accepts a JSON-encoded POST body that has a mapping from child name to a description of that child (including the URI). I see a difference (between the working platform and the failing platform) in the JSON that is sent with this request.

The working platform (my workstation 'fluxx') sends:

{"\u00c4rtonwall": ["filenode", {"rw_uri": "URI:LIT:ovxgsy3pmrssaztjnrssay3pnz2gk3tu"}]}

The failing platform (our feisty buildslave) sends:

{"\u00c3\u0084rtonwall": ["filenode", {"rw_uri": "URI:LIT:ovxgsy3pmrssaztjnrssay3pnz2gk3tu"}]}

Which suggests that the failing platform is double-encoding the filename.

Still investigating.. so far this is pointing at the client-side code (in source:src/allmydata/scripts/tahoe_cp.py).

So, as an extra datapoint, `sys.getfilesystemencoding()` on the two buildslaves that are failing reports "UTF-8". My local workstation (on which this test passes) reports "ANSI_X3.4-1968". However, on our gutsy and hardy buildslaves (which are also passing), it reports UTF-8 too. Some differential analysis shows that one early sign of trouble is during the webapi t=set_children call that 'tahoe cp' does. This API uses the under-documented t=set_children call, which (from what I can tell) accepts a JSON-encoded POST body that has a mapping from child name to a description of that child (including the URI). I see a difference (between the working platform and the failing platform) in the JSON that is sent with this request. The working platform (my workstation 'fluxx') sends: ``` {"\u00c4rtonwall": ["filenode", {"rw_uri": "URI:LIT:ovxgsy3pmrssaztjnrssay3pnz2gk3tu"}]} ``` The failing platform (our feisty buildslave) sends: ``` {"\u00c3\u0084rtonwall": ["filenode", {"rw_uri": "URI:LIT:ovxgsy3pmrssaztjnrssay3pnz2gk3tu"}]} ``` Which suggests that the failing platform is double-encoding the filename. Still investigating.. so far this is pointing at the client-side code (in source:src/allmydata/scripts/tahoe_cp.py).

francois commented

2008-11-24 11:20:52 +00:00

Author

Owner

Zooko, Brian:

If you give me a copy of your virtual machine disk, I could have a look into this failing test. What do you think ?

Zooko, Brian: If you give me a copy of your virtual machine disk, I could have a look into this failing test. What do you think ?

zooko commented

2008-11-24 13:56:23 +00:00

That's a nice offer, Francois. I don't know how to get a copy of that vm image and I suspect that it would take hours to transfer over the net (or, heh, upload to a Tahoe grid).

I have an idea! I'll give you an account on the dapper buildslave. There. Please send me an ssh public key to zooko@zooko.com and I'll put it into francois@slave3.allmydata.com:.ssh/authorized_keys and then you should be able to log in!

That's a nice offer, Francois. I don't know how to get a copy of that vm image and I suspect that it would take hours to transfer over the net (or, heh, upload to a Tahoe grid). I have an idea! I'll give you an account on the dapper buildslave. There. Please send me an ssh public key to zooko@zooko.com and I'll put it into `francois@slave3.allmydata.com:.ssh/authorized_keys` and then you should be able to log in!

francois commented

2008-11-25 11:28:56 +00:00

Author

Owner

I can see a difference in simplejson between the failing platform (dapper buildbot) and a working one (my workstation).

Test script:

# encoding=utf-8

import simplejson

a = "Ärtonwall"
b = simplejson.dumps(a)

print type(a), a
print type(b), b

Working platform:

<type 'str'> Ärtonwall
<type 'str'> "\u00c4rtonwall"

Failing platform:

<type 'str'> Ärtonwall
<type 'str'> "\uffc3\uff84rtonwall"

However, in both cases having \u control sequences in str objects sounds weird.

What is the current policy in tahoe according to string handling ? It looks like utf-8 encoded str objects are used in some places while unicode objects are used in others.

More on that later...

I can see a difference in simplejson between the failing platform (dapper buildbot) and a working one (my workstation). Test script: ```/usr/bin/env python # encoding=utf-8 import simplejson a = "Ärtonwall" b = simplejson.dumps(a) print type(a), a print type(b), b ``` Working platform: ``` <type 'str'> Ärtonwall <type 'str'> "\u00c4rtonwall" ``` Failing platform: ``` <type 'str'> Ärtonwall <type 'str'> "\uffc3\uff84rtonwall" ``` However, in both cases having \u control sequences in str objects sounds weird. What is the current policy in tahoe according to string handling ? It looks like utf-8 encoded str objects are used in some places while unicode objects are used in others. More on that later...

zooko commented

2008-11-25 15:15:05 +00:00

Oh, perhaps this has something to do with the version of simplejson!

The failing dapper has

2008-11-24 21:03:29.998Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.3, platform: Linux-Ubuntu_6.06-i686-32bit, simplejson: 1.8.1, pyopenssl: 0.7, setuptools: 0.7a1

The failing feisty2.5 has

2008-11-24 20:47:45.960Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.04-i686-32bit, simplejson: 1.4, pyopenssl: 0.6, setuptools: 0.7a1

Hm.

Passing buildslaves:

2008-11-24 20:53:04.924Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.18, python: 2.5.1, platform: CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:52:39.742Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4c1, platform: Linux-Ubuntu_6.10-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 09:54:08.473Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.1, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4, platform: Linux-debian_4.0-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:44:52.993Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.10-i686-32bit, simplejson: 1.7.1, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:51:45.037Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.1, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:52:14.527Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-Ubuntu_8.04-i686-32bit_ELF, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:46:17.288Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.4, zfec: 1.4.2, twisted: 8.0.1, nevow: 0.9.32, python: 2.4.3, platform: SunOS-5.11-i86pc-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:49:20.293Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Windows-XP-5.1.2600-SP2, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:46:38.028Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.5.1, platform: Darwin-9.5.0-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:45:18.952Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.4.1-2, twisted: 8.1.0, nevow: 0.9.31, python: 2.5.2, platform: Linux-Ubuntu_8.04-x86_64-64bit, simplejson: 2.0.3, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-19 23:36:30.603Z [-] tahoe versions: allmydata: 1.2.0-r3225, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-_-x86_64-64bit_ELF, simplejson: 2.0.4, pyopenssl: 0.8, setuptools: 0.7a1

Oh, perhaps this has something to do with the version of simplejson! The failing dapper has 2008-11-24 21:03:29.998Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.3, platform: Linux-Ubuntu_6.06-i686-32bit, simplejson: 1.8.1, pyopenssl: 0.7, setuptools: 0.7a1 The failing feisty2.5 has 2008-11-24 20:47:45.960Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.04-i686-32bit, simplejson: 1.4, pyopenssl: 0.6, setuptools: 0.7a1 Hm. Passing buildslaves: 2008-11-24 20:53:04.924Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.18, python: 2.5.1, platform: CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1 2008-11-24 20:52:39.742Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4c1, platform: Linux-Ubuntu_6.10-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 09:54:08.473Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.1, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4, platform: Linux-debian_4.0-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 20:44:52.993Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.10-i686-32bit, simplejson: 1.7.1, pyopenssl: 0.6, setuptools: 0.7a1 2008-11-24 20:51:45.037Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.1, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8 2008-11-24 20:52:14.527Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-Ubuntu_8.04-i686-32bit_ELF, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 20:46:17.288Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.4, zfec: 1.4.2, twisted: 8.0.1, nevow: 0.9.32, python: 2.4.3, platform: SunOS-5.11-i86pc-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 20:49:20.293Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Windows-XP-5.1.2600-SP2, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1 2008-11-24 20:46:38.028Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.5.1, platform: Darwin-9.5.0-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8 2008-11-24 20:45:18.952Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.4.1-2, twisted: 8.1.0, nevow: 0.9.31, python: 2.5.2, platform: Linux-Ubuntu_8.04-x86_64-64bit, simplejson: 2.0.3, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-19 23:36:30.603Z [-] tahoe versions: allmydata: 1.2.0-r3225, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-_-x86_64-64bit_ELF, simplejson: 2.0.4, pyopenssl: 0.8, setuptools: 0.7a1

zooko commented

2008-11-25 15:16:27 +00:00

Sigh. Forgot wiki markup

2008-11-24 20:53:04.924Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.18, python: 2.5.1, platform: CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:52:39.742Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4c1, platform: Linux-Ubuntu_6.10-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 09:54:08.473Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.1, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4, platform: Linux-debian_4.0-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:44:52.993Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.10-i686-32bit, simplejson: 1.7.1, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:51:45.037Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.1, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:52:14.527Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-Ubuntu_8.04-i686-32bit_ELF, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:46:17.288Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.4, zfec: 1.4.2, twisted: 8.0.1, nevow: 0.9.32, python: 2.4.3, platform: SunOS-5.11-i86pc-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:49:20.293Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Windows-XP-5.1.2600-SP2, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:46:38.028Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.5.1, platform: Darwin-9.5.0-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:45:18.952Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.4.1-2, twisted: 8.1.0, nevow: 0.9.31, python: 2.5.2, platform: Linux-Ubuntu_8.04-x86_64-64bit, simplejson: 2.0.3, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-19 23:36:30.603Z [-] tahoe versions: allmydata: 1.2.0-r3225, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-_-x86_64-64bit_ELF, simplejson: 2.0.4, pyopenssl: 0.8, setuptools: 0.7a1

Sigh. Forgot wiki markup ``` 2008-11-24 20:53:04.924Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.18, python: 2.5.1, platform: CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1 2008-11-24 20:52:39.742Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4c1, platform: Linux-Ubuntu_6.10-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 09:54:08.473Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.1, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4, platform: Linux-debian_4.0-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 20:44:52.993Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.10-i686-32bit, simplejson: 1.7.1, pyopenssl: 0.6, setuptools: 0.7a1 2008-11-24 20:51:45.037Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.1, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8 2008-11-24 20:52:14.527Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-Ubuntu_8.04-i686-32bit_ELF, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 20:46:17.288Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.4, zfec: 1.4.2, twisted: 8.0.1, nevow: 0.9.32, python: 2.4.3, platform: SunOS-5.11-i86pc-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-24 20:49:20.293Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Windows-XP-5.1.2600-SP2, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1 2008-11-24 20:46:38.028Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.5.1, platform: Darwin-9.5.0-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8 2008-11-24 20:45:18.952Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.4.1-2, twisted: 8.1.0, nevow: 0.9.31, python: 2.5.2, platform: Linux-Ubuntu_8.04-x86_64-64bit, simplejson: 2.0.3, pyopenssl: 0.7, setuptools: 0.7a1 2008-11-19 23:36:30.603Z [-] tahoe versions: allmydata: 1.2.0-r3225, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-_-x86_64-64bit_ELF, simplejson: 2.0.4, pyopenssl: 0.8, setuptools: 0.7a1 ```

francois commented

2008-11-25 16:37:57 +00:00

Author

Owner

Yes, seems plausible.

I tried patching setup.py to require simplejson 2.0.5.

francois@slave3:~/tahoe$ darcs diff
diff -rN old-tahoe/setup.py new-tahoe/setup.py                
210a211,212
> setup_requires.append('simplejson >= 2.0.5')
>

But this fail because simplejson-1.8.1 is already installed system-wide.

python setup.py build_tahoe
Traceback (most recent call last):
  File "setup.py", line 451, in ?
    zip_safe=False, # We prefer unzipped for easier access.
  File "/usr/lib/python2.4/distutils/core.py", line 110, in setup
    _setup_distribution = dist = klass(attrs)
  File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/setuptools/dist.py", line 219, in __init__ 
    self.fetch_build_eggs(attrs.pop('setup_requires'))
  File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/setuptools/dist.py", line 242, in fetch_build_eggs
    for dist in working_set.resolve(
  File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/pkg_resources.py", line 528, in resolve
    raise VersionConflict(dist,req) # XXX put more info here
pkg_resources.VersionConflict: (simplejson 1.8.1 (/usr/lib/python2.4/site-packages/simplejson-1.8.1-py2.4-linux-i686.egg), Requirement.parse('simplejson>=2.0.5'))
make[1]: *** [build-once] Error 1

Would it be possible to remove this system-wide egg ?

Yes, seems plausible. I tried patching setup.py to require simplejson 2.0.5. ``` francois@slave3:~/tahoe$ darcs diff diff -rN old-tahoe/setup.py new-tahoe/setup.py 210a211,212 > setup_requires.append('simplejson >= 2.0.5') > ``` But this fail because simplejson-1.8.1 is already installed system-wide. ``` python setup.py build_tahoe Traceback (most recent call last): File "setup.py", line 451, in ? zip_safe=False, # We prefer unzipped for easier access. File "/usr/lib/python2.4/distutils/core.py", line 110, in setup _setup_distribution = dist = klass(attrs) File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/setuptools/dist.py", line 219, in __init__ self.fetch_build_eggs(attrs.pop('setup_requires')) File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/setuptools/dist.py", line 242, in fetch_build_eggs for dist in working_set.resolve( File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/pkg_resources.py", line 528, in resolve raise VersionConflict(dist,req) # XXX put more info here pkg_resources.VersionConflict: (simplejson 1.8.1 (/usr/lib/python2.4/site-packages/simplejson-1.8.1-py2.4-linux-i686.egg), Requirement.parse('simplejson>=2.0.5')) make[1]: *** [build-once] Error 1 ``` Would it be possible to remove this system-wide egg ?

francois commented

2008-11-25 17:19:00 +00:00

Author

Owner

I can confirm that upgrading simplejson to version 2.0.5 does fix this bug on slave3.

This was verified by installing a "clean" python environment with "virtualenv --no-site-package" after applying attached patch.

I can confirm that upgrading simplejson to version 2.0.5 does fix this bug on slave3. This was verified by installing a "clean" python environment with "virtualenv --no-site-package" after applying attached patch.

francois commented

2008-11-25 17:19:39 +00:00

Author

Owner

Attachment require-simplejson.patch (92385 bytes) added

Require simplejson version >= 2.0.5

**Attachment** require-simplejson.patch (92385 bytes) added Require simplejson version >= 2.0.5

require-simplejson.patch

90 KiB

zooko commented

2008-11-25 20:07:34 +00:00

Regarding the versioning conflict, see #530 (use setuptools's --multi-version mode).

Thanks for the fix! I'll apply your patch post haste!

Regarding the versioning conflict, see #530 (use setuptools's --multi-version mode). Thanks for the fix! I'll apply your patch post haste!

zooko commented

2008-11-25 20:09:45 +00:00

But waitasecond, how is it that one of the successful buildbots was using simplejson 2.0.4 and another was using simplejson 1.7.1?

zooko commented

2008-11-25 20:11:44 +00:00

And one of the failing buildbots (dapper) was using simplejson 1.8.1?

zooko commented

2008-11-25 22:05:39 +00:00

Ok, changeset:5ebd7319822836ed, which raises the setuptools requirement from >= 1.4 to > 1.8.1, made the buildbots turn green. I don't really know why, but I'm closing this ticket as "fixed". Thanks a lot, Francois!

(P.S. Please pursue your earlier line of thought of "What exactly is the policy for handling strings within the Tahoe codebase?".)

Ok, changeset:5ebd7319822836ed, which raises the setuptools requirement from >= 1.4 to > 1.8.1, made the buildbots turn green. I don't really know why, but I'm closing this ticket as "fixed". Thanks a lot, Francois! (P.S. Please pursue your earlier line of thought of "What exactly is the policy for handling strings within the Tahoe codebase?".)

zooko added the

r/fixed

label 2008-11-25 22:05:46 +00:00

zooko closed this issue

2008-11-25 22:05:46 +00:00

zooko commented

2008-11-25 22:59:00 +00:00

For anybody who is investigating the history of which version of simplejson different versions of tahoe have required:

changeset:f76a81fda216e4fb
changeset:49dccbd4668a2b12
changeset:1f2e3fc912bb897e
changeset:b0dd88158a3e5b0f
changeset:44c73492704144e8

For anybody who is investigating the history of which version of simplejson different versions of tahoe have required: changeset:f76a81fda216e4fb changeset:49dccbd4668a2b12 changeset:1f2e3fc912bb897e changeset:b0dd88158a3e5b0f changeset:44c73492704144e8

warner commented

2008-11-26 02:45:37 +00:00

So, our policy for handling strings should be:

each variable holds either a bytestring or a unicode string
no variables are uncertain
all function arguments accept one or the other, never both

and comments, variable names, and argument names should help us make this
distinction.

I have a vague feeling that our code was somehow asking simplejson to do
something unreasonable.

In particular, that test code above
((@@http://allmydata.org/trac/tahoe/ticket/534#comment:368988@@)) is asking simplejson
to take a bytestring with non-ASCII characters and JSONify it, which is an
unreasonable thing to ask of it (JSON can encode unicode, but not arbitrary
bytestrings). The input to simplejson.dumps() should always be unicode
objects or bytestrings with only ASCII characters.

The tahoe_cp code must take responsibility for transforming the sys.argv
bytestring (which, hopefully, is UTF-8 encoded) into a unicode object before
doing anything else with it. When it is then put into the webapi set_children
body (which is defined to be a JSON-encoded dictionary), simplejson.dumps()
should be able to handle it. Likewise, if the sys.argv name gets used in a
URL (maybe for tahoe_cp, more likely for tahoe_put), it must be encoded into
UTF-8 and then URL-encoded.

I'm glad this fix seems to improve matters, but I'd like to make sure that
our code is doing the right thing.

So, our policy for handling strings should be: * each variable holds either a bytestring or a unicode string * no variables are uncertain * all function arguments accept one or the other, never both and comments, variable names, and argument names should help us make this distinction. I have a vague feeling that our code was somehow asking simplejson to do something unreasonable. In particular, that test code above ((@@http://allmydata.org/trac/tahoe/ticket/534#[comment:368988](/tahoe-lafs/trac/issues/534#issuecomment-368988)@@)) is asking simplejson to take a bytestring with non-ASCII characters and JSONify it, which is an unreasonable thing to ask of it (JSON can encode unicode, but not arbitrary bytestrings). The input to simplejson.dumps() should always be unicode objects or bytestrings with only ASCII characters. The tahoe_cp code must take responsibility for transforming the sys.argv bytestring (which, hopefully, is UTF-8 encoded) into a unicode object before doing anything else with it. When it is then put into the webapi set_children body (which is defined to be a JSON-encoded dictionary), simplejson.dumps() should be able to handle it. Likewise, if the sys.argv name gets used in a URL (maybe for tahoe_cp, more likely for tahoe_put), it must be encoded into UTF-8 and then URL-encoded. I'm glad this fix seems to improve matters, but I'd like to make sure that our code is doing the right thing.

zooko commented

2008-12-08 04:33:55 +00:00

I'm re-opening this ticket because, per Brian's most recent comment, it seems like the test code is asking simplejson to do something unreasonable, and because per the new #555 (tahoe .deb cannot be installed on hardy: simplejson dependency is too new), it is problematic for Tahoe to require simplejson > 1.7.1.

zooko commented

2008-12-08 04:34:44 +00:00

See also changeset:9d729109d2d4c24a where I change the requirement to simplejson >= 1.7.1 (again).

See also changeset:9d729109d2d4c24a where I change the requirement to `simplejson >= 1.7.1` (again).

tahoe-lafs removed the

r/fixed

label 2008-12-08 10:59:40 +00:00

francois reopened this issue

2008-12-08 10:59:40 +00:00

zooko commented

2008-12-08 14:33:40 +00:00

So there is something about our Ubuntu Dapper buildslave that makes it fail allmydata.test.test_cli.Cp.test_unicode_filename even when other buildslaves using the same version of simplejson pass. Anyway, I just went and experimented on the Dapper buildslave, by easy_install'ing different versions of simplejson from here: http://pypi.python.org/simple/simplejson

With simplejson v1.7.5, v1.8, v1.8.1, and v1.9 this unit test fails. With simplejson v2.0 it passes.

I still haven't understood what this test does or why Brian thinks it is an invalid test.

So there is something about our Ubuntu Dapper buildslave that makes it fail `allmydata.test.test_cli.Cp.test_unicode_filename` even when other buildslaves using the same version of `simplejson` pass. Anyway, I just went and experimented on the Dapper buildslave, by `easy_install`'ing different versions of `simplejson` from here: <http://pypi.python.org/simple/simplejson> With `simplejson` v1.7.5, v1.8, v1.8.1, and v1.9 this unit test fails. With `simplejson` v2.0 it passes. I still haven't understood what this test does or why Brian thinks it is an invalid test.

warner commented

2008-12-09 02:13:30 +00:00

I'm sorry if it sounded like I was accusing test_unicode_filename of
doing something invalid. I think test_unicode_filename is ok (although I
haven't looked at it very closely yet).

My comment about "that test code above .. asking .. an unreasonable thing"
was specifically about the test script which Francois posted to this ticket:

# encoding=utf-8

import simplejson

a = "Ärtonwall"   # <- LATIN CAPITAL LETTER A WITH DIAERESIS (U+00C4)
b = simplejson.dumps(a)

print type(a), a
print type(b), b

This code is passing a bytestring with non-ASCII characters to
simplejson.dumps, but JSON is not capable of encoding non-ASCII bytestrings.
So it's pushing simplejson.dumps beyond its normal territory, into the range
where things are not usually as well tested (i.e. how does dumps() respond to
invalid inputs?). I'm not surprised that its response will vary a lot from
one minor version to the next: error cases are usually driven by bug reports.

I'd find it more reasonable if that code instead used:

a = u"Ärtonwall"

The tahoe_cp.py code should be careful to never send non-ASCII
bytestrings to simplejson.dumps. In fact, it should never send bytestrings to
dumps() at all, but instead should be sending unicode objects (well, at least
for filenames. for other tokens it's pretty benign).

So the analysis that I want to do on this issue is to trace the code in
tahoe_cp.py, from the point where a filename is read off the local disk
(using os.listdir), to the point where it is passed to simplejson.dumps. This
is an area of considerable disagreement within the Python community (the
python-dev list has been very busy in the last week with a discussion of how
unicode filenames and environment variables ought to be handled in the
just-released Python 3.0). I fear we're going to need some platform-specific
code here, but I think it's safe for us to declare that on linux at least we
expect os.listdir to give us bytestrings with UTF-8 encoded names.

Note that this raises the question of what 'tahoe cp' ought to do when a
filename on local disk has non-UTF8 non-ASCII bytes in it: apparently
(according to one of Glyph's recent posts, IIRC) some systems (KDE?) stuff
the high-bit-set bytes into some magical reserved unicode space, so that they
can at least survive a roundtrip. It might not be a bad idea to use this same
technique: if filename.decode("utf-8") on the response from os.listdir
raises UnicodeDecodeError, then map it into this special space instead.

I'm sorry if it sounded like I was accusing `test_unicode_filename` of doing something invalid. I think `test_unicode_filename` is ok (although I haven't looked at it very closely yet). My comment about "that test code above .. asking .. an unreasonable thing" was specifically about the test script which Francois posted to this ticket: ```/usr/bin/env python # encoding=utf-8 import simplejson a = "Ärtonwall" # <- LATIN CAPITAL LETTER A WITH DIAERESIS (U+00C4) b = simplejson.dumps(a) print type(a), a print type(b), b ``` This code is passing a bytestring with non-ASCII characters to simplejson.dumps, but JSON is not capable of encoding non-ASCII bytestrings. So it's pushing simplejson.dumps beyond its normal territory, into the range where things are not usually as well tested (i.e. how does dumps() respond to invalid inputs?). I'm not surprised that its response will vary a lot from one minor version to the next: error cases are usually driven by bug reports. I'd find it more reasonable if that code instead used: ``` a = u"Ärtonwall" ``` The `tahoe_cp.py` code should be careful to never send non-ASCII bytestrings to simplejson.dumps. In fact, it should never send bytestrings to dumps() at all, but instead should be sending unicode objects (well, at least for filenames. for other tokens it's pretty benign). So the analysis that I want to do on this issue is to trace the code in tahoe_cp.py, from the point where a filename is read off the local disk (using os.listdir), to the point where it is passed to simplejson.dumps. This is an area of considerable disagreement within the Python community (the python-dev list has been very busy in the last week with a discussion of how unicode filenames and environment variables ought to be handled in the just-released Python 3.0). I fear we're going to need some platform-specific code here, but I think it's safe for us to declare that on linux at least we expect os.listdir to give us bytestrings with UTF-8 encoded names. Note that this raises the question of what 'tahoe cp' ought to do when a filename on local disk has non-UTF8 non-ASCII bytes in it: apparently (according to one of Glyph's recent posts, IIRC) some systems (KDE?) stuff the high-bit-set bytes into some magical reserved unicode space, so that they can at least survive a roundtrip. It might not be a bad idea to use this same technique: if `filename.decode("utf-8")` on the response from os.listdir raises [UnicodeDecodeError](wiki/UnicodeDecodeError), then map it into this special space instead.

francois commented

2008-12-22 11:39:48 +00:00

Author

Owner

I fully agree with Brian's comment on my test script being broken.

However, the test_unicode_filename test looks reasonable because it calls do_cli with UTF-8 bytestrings, in much the same way as if directly called on the command line under Linux.

I'm going to investigate to find if and where bytestrings are being sent to simplejson, following up on warner.

I fully agree with Brian's comment on my test script being broken. However, the test_unicode_filename test looks reasonable because it calls do_cli with UTF-8 bytestrings, in much the same way as if directly called on the command line under Linux. I'm going to investigate to find if and where bytestrings are being sent to simplejson, following up on [warner](/tahoe-lafs/trac/issues/534#issuecomment-368985).

zooko commented

2008-12-24 20:43:40 +00:00

Per recent discussion http://allmydata.org/pipermail/tahoe-dev/2008-December/000948.html , I've reverted some recent changes in changeset:883e51b02dae732a, and marked the cli unicode test as todo in changeset:9f117dbe8f7260a0.

Per recent discussion <http://allmydata.org/pipermail/tahoe-dev/2008-December/000948.html> , I've reverted some recent changes in changeset:883e51b02dae732a, and marked the cli unicode test as `todo` in changeset:9f117dbe8f7260a0.

francois commented

2009-02-17 11:57:45 +00:00

Author

Owner

Attachment stringutils.dpatch (194898 bytes) added

**Attachment** stringutils.dpatch (194898 bytes) added

stringutils.dpatch

190 KiB

zooko commented

2009-02-17 18:02:10 +00:00

Francois:

Thanks for the patch! I'm glad that it adds a test of mkdir as well.

Two things: 1. We prefer functions which are stricter about their inputs, so could you change to_unicode() so that it raises an exception if the argument is not a str? Like this:

from util.assertutil import precondition

def to_unicode(s):
    precondition(isinstance(s, str), s)

Could you please add some doc to CLI.txt explaining that it currently works only for utf-8 strings on argv or from the filesystem, and that it tries to detect if your filesystem is providing utf-8 and raise an exception if not. Oh, and also there is a third thing:
What happens if I run a command-line like "tahoe cp A B" and sys.getdefaultencoding() returns something other than utf-8? Do I see a Python backtrace on stdout? Please make it so that this causes an explanatory "USAGE"-style string.

Oh, and

Isn't sys.getdefaultencoding() just telling how the Python interpreter is configured, not the underlying operating system? Oh, actually look: there is a "sys.getfilesystemencoding()":

http://docs.python.org/library/sys.html#sys.getfilesystemencoding

So I guess 4 is to use sys.getfilesystemencoding() instead of sys.getdefaultencoding() when trying to determine what encoding we get back from os.listdir(), and 5 is to document in CLI.txt the distinction between filesystem encoding and command-line-arguments encoding.

Phewf. Too bad this isn't easier.

Thanks a lot for your help, François!

Regards,

Zooko

Francois: Thanks for the patch! I'm glad that it adds a test of mkdir as well. Two things: 1. We prefer functions which are stricter about their inputs, so could you change to_unicode() so that it raises an exception if the argument is not a str? Like this: ``` from util.assertutil import precondition def to_unicode(s): precondition(isinstance(s, str), s) ``` 2. Could you please add some doc to CLI.txt explaining that it currently works only for utf-8 strings on argv or from the filesystem, and that it tries to detect if your filesystem is providing utf-8 and raise an exception if not. Oh, and also there is a third thing: 3. What happens if I run a command-line like "tahoe cp A B" and sys.getdefaultencoding() returns something other than utf-8? Do I see a Python backtrace on stdout? Please make it so that this causes an explanatory "USAGE"-style string. Oh, and 4. Isn't sys.getdefaultencoding() just telling how the Python interpreter is configured, not the underlying operating system? Oh, actually look: there is a "sys.getfilesystemencoding()": <http://docs.python.org/library/sys.html#sys.getfilesystemencoding> So I guess 4 is to use sys.getfilesystemencoding() instead of sys.getdefaultencoding() when trying to determine what encoding we get back from os.listdir(), and 5 is to document in CLI.txt the distinction between filesystem encoding and command-line-arguments encoding. Phewf. Too bad this isn't easier. Thanks a lot for your help, François! Regards, Zooko

zooko commented

2009-02-17 18:23:05 +00:00

See also François's post to tahoe-dev, which also mentions #565 (unicode arguments on the command-line) and #629 ('tahoe backup' doesn't tolerate 8-bit filenames).

See also [François's post to tahoe-dev](http://allmydata.org/pipermail/tahoe-dev/2009-February/001200.html), which also mentions #565 (unicode arguments on the command-line) and #629 ('tahoe backup' doesn't tolerate 8-bit filenames).

francois commented

2009-02-24 01:09:43 +00:00

Author

Owner

Ok, the latest patch bundle (unicode.dpatch) will hopefully fix all those unicode issues (including #629) in a much cleaner way.

All encoding conversion code is now located in stringutils.py. It provides a single location where we might be able to use better encoding detection methods such as those discussed on the mailing-list.

Methods currently in used are:

for command line arguments (sys.argv)
- convert from UTF-8 to unicode
- if it fails, returns an UsageError
for text display (sys.stdout)
- convert from unicode to sys.stdout.encoding and replace non-representable characters by '?'
for filename encoding on the filesystem (os.listdir(), open())
- convert between sys.getfilesystemencoding() and unicode
- if it fails, returns an UsageError

Many precondition checks have been added to ensure that filenames are treated as unicode objects.

Ok, the latest patch bundle (unicode.dpatch) will hopefully fix all those unicode issues (including #629) in a much cleaner way. All encoding conversion code is now located in stringutils.py. It provides a single location where we might be able to use better encoding detection methods such as those discussed on the mailing-list. Methods currently in used are: * for command line arguments (sys.argv) * convert from UTF-8 to unicode * if it fails, returns an [UsageError](wiki/UsageError) * for text display (sys.stdout) * convert from unicode to sys.stdout.encoding and replace non-representable characters by '?' * for filename encoding on the filesystem (os.listdir(), open()) * convert between sys.getfilesystemencoding() and unicode * if it fails, returns an [UsageError](wiki/UsageError) Many precondition checks have been added to ensure that filenames are treated as unicode objects.

francois commented

2009-02-24 11:50:41 +00:00

Author

Owner

Attachment unicode.dpatch (37990 bytes) added

**Attachment** unicode.dpatch (37990 bytes) added

unicode.dpatch

37 KiB

francois commented

2009-02-24 11:53:37 +00:00

Author

Owner

Uploaded a new version of unicode.dpatch which include a patch to fix a conflict introduced by changeset:5d57da93fd336692.

francois commented

2009-02-25 09:31:55 +00:00

Author

Owner

Attachment unicode-v2.dpatch (40174 bytes) added

**Attachment** unicode-v2.dpatch (40174 bytes) added

unicode-v2.dpatch

39 KiB

francois commented

2009-02-25 09:32:27 +00:00

Author

Owner

Uploaded a new version of my patch (unicode-v2.dpatch) which fix a conflict introduced by changeset:7e8958671b2e07f4.

francois commented

2009-03-29 14:10:43 +00:00

Author

Owner

Ok, we had plenty of discussion on the mailing-list. Zooko posted a summary available at http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html

Basically, there are two cases:

Filenames which, based on locale settings, can be decoded into unicode strings
Filenames which, based on locale settings, cannot be decode into unicode strings

Case 1 is what's currently implemented by my patches.

Case 2 requires to optionally add two attributes to each file, a bytestring and a guessed encoding. As Brian pointed out, the bytesting attribute must be encoded as base32 to survive to a JSON trip.

Now, I first want to focus on case 1 while raising an helpful error on case 2.

Does it sound any good ?

Ok, we had plenty of discussion on the mailing-list. Zooko posted a summary available at <http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html> Basically, there are two cases: 1. Filenames which, based on locale settings, can be decoded into unicode strings 2. Filenames which, based on locale settings, cannot be decode into unicode strings Case 1 is what's currently implemented by my patches. Case 2 requires to optionally add two attributes to each file, a bytestring and a guessed encoding. As Brian pointed out, the bytesting attribute must be encoded as base32 to survive to a JSON trip. Now, I first want to focus on case 1 while raising an helpful error on case 2. Does it sound any good ?

zooko commented

2009-03-31 03:48:44 +00:00

Francois: thanks for working on this! I was planning to amend your patch myself, but I'll let you do it.

Here is my most recent idea about how this should be done:

http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html

Except that this isn't my most recent idea after all. I amended my intent a little, as prompted by pointed questions from nejucomo on IRC, and by looking at the actual source code where directories are processed:

http://allmydata.org/trac/tahoe/browser/src/allmydata/dirnode.py?rev=6e57576f2eeedcb8#L168

Then I tried to write down my ideas in detail and this forced me to realize that they were incomplete and wrong and I had to amend them a whole lot more in order to finish this letter. Finally, I asked JP Calderone for help, and he helped me understand how to write filenames back into a local Linux filesystem without risking that the user will accidentally overwrite their local files with tahoe files (because the tahoe files were written out under different representation than they were displayed), and how to do normalization, and how to cheaply ensure that silent misdecodings could be repaired in some future generation.

Okay, here's the best design yet:

I think that the unicode representation of the filename should continue to be the unique key in the directory (which current Tahoe 1.3.0 requires).

So there should be a data structure with a required "filename" part, and a required "failed_decode" flag, and an optional "alleged_encoding" part. The "filename" part is the canonical value of the filename, but we recognize that sometimes we can't actually get the real filename into unicode form. If our attempt to interpret the filename into unicode fails, then we set the "failed_decode" flag and put the iso-8859-1-decoding of it into the "filename" part.

Here are the steps of reading a filename from the filesystem and adding that filename into an existing Tahoe directory.

On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave out the "alleged_encoding" part. Set the "failed_decode" flag to False.
On Linux read the filename with the string APIs to get "bytes" and call sys.getfilesystemencoding() to get "alleged_encoding". Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and the alleged_encoding, and set the "failed_decode" to False. (Storing the alleged_encoding is for the benefit of future generations, who may discover that the decoding was actually wrong even though it didn't raise an error, and who could then use the alleged_encoding to undo the damage. For example Shawn Willden has a prototype tool which lets a human examine the filename as decoded with different encodings and pick the one that means something in a language they know.)

2.b. If this decoding fails, then we decode it again with bytes.decode('iso-8859-1', 'strict'). Do not normalize it. Put the resulting unicode object into the "filename" part, set the "failed_decode" flag to True, and leave the "alleged_encoding" field out. This is a case of mojibake:

http://en.wikipedia.org/wiki/Mojibake

The reason to go the mojibake route is that it preserves the information, and in theory someone could later decode it and figure out the original filename. This has actually happened at least once, as shown by the photograph on that wikipedia page of the package which was delivered to the Russian recipient. Mojibake! (footnote 1)

How does that sound?

Phewf. Okay, now for the trip in the other direction. Suppose you have a Tahoe filename object, and you need to create a file in the local filesystem, because for example the user runs "tahoe cp -r $DIRCAP/subdir .". There are four cases:

Case 1: You are using a unicode-safe filesystem such as Windows or Mac, and you have a unicode object with failed_decode=False.

This is easy: use the Python unicode filesystem APIs to create the file and be happy.

Case 2: You are using a unicode-safe filesystem and you have a unicode object with failed_decode=True.

This is easy: use the Python unicode filesystem APIs to create the file, passing the latin-1-decoded filename (mojibake!).

Case 3: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=False.

This is easy: use the Python unicode filesystem APIs to create the file.

Case 4: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=True.

Now we should encode the filename using iso-8859-1 to get a sequence of bytes, and then write those bytes into the filesystem using the Python string filesystem API. This is no worse than any alternative, and in the case that the target filesystem has the same encoding as the original filesystem (such as because it is the same as the original filesystem, or because it is owned by a friend of the owner of the original filesystem), then this will restore the file to its proper name.

By the way, please see David Wheeler's recent proposal to start enforcing filename constraints in Linux: http://lwn.net/Articles/325304 . His proposals include changing Linux to require utf-8-encoding of all filenames.

Regards,

Zooko

footnote 1: I know that Alberto Berti has previously argued on tahoe-dev and on IRC that mojibake is less clean than the alternative of using bytes.decode(alleged_encoding, 'replace'). The latter is lossy, but it more clearly shows to the user that some or all of the filename couldn't be decoded. Alberto and others had convinced me of the wisdom of this, and I actually wrote this entire document specifying the 'decode-with-replace' approach instead of the mojibake approach, but I eventually realized that it wouldn't work. For one thing it was rather complicated to decide how to handle multiple filenames that all decode-with-replace to the same unicode name (you could imagine a whole directory full of files all named '????' because the locale is wrong). But the real killer is what to do when you are going to write the file back into the local filesystem. If you write a decoded-with-replace file back, then this means a round-trip from linux to tahoe and back can mess up all of your filenames. If you write the actual original bytes into the filesystem, then this means that you might accidentally overwrite files with a "tahoe cp", since "tahoe ls" just shows files with "???" in their names, but "tahoe cp" writes files out with actual characters instead of question marks.

Francois: thanks for working on this! I was planning to amend your patch myself, but I'll let you do it. Here is my most recent idea about how this should be done: <http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html> Except that this *isn't* my most recent idea after all. I amended my intent a little, as prompted by pointed questions from nejucomo on IRC, and by looking at the actual source code where directories are processed: <http://allmydata.org/trac/tahoe/browser/src/allmydata/dirnode.py?rev=6e57576f2eeedcb8#L168> Then I tried to write down my ideas in detail and this forced me to realize that they were incomplete and wrong and I had to amend them a whole lot more in order to finish this letter. Finally, I asked JP Calderone for help, and he helped me understand how to write filenames back into a local Linux filesystem without risking that the user will accidentally overwrite their local files with tahoe files (because the tahoe files were written out under different representation than they were displayed), and how to do normalization, and how to cheaply ensure that silent misdecodings could be repaired in some future generation. Okay, here's the best design yet: I think that the unicode representation of the filename should continue to be the unique key in the directory (which current Tahoe 1.3.0 requires). So there should be a data structure with a required "filename" part, and a required "failed_decode" flag, and an optional "alleged_encoding" part. The "filename" part is the canonical value of the filename, but we recognize that sometimes we can't actually get the *real* filename into unicode form. If our attempt to interpret the filename into unicode fails, then we set the "failed_decode" flag and put the iso-8859-1-decoding of it into the "filename" part. Here are the steps of reading a filename from the filesystem and adding that filename into an existing Tahoe directory. 1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave out the "alleged_encoding" part. Set the "failed_decode" flag to False. 2. On Linux read the filename with the string APIs to get "bytes" and call sys.getfilesystemencoding() to get "alleged_encoding". Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object. 2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and the alleged_encoding, and set the "failed_decode" to False. (Storing the alleged_encoding is for the benefit of future generations, who may discover that the decoding was actually wrong even though it didn't raise an error, and who could then use the alleged_encoding to undo the damage. For example Shawn Willden has a prototype tool which lets a human examine the filename as decoded with different encodings and pick the one that means something in a language they know.) 2.b. If this decoding fails, then we decode it again with bytes.decode('iso-8859-1', 'strict'). Do not normalize it. Put the resulting unicode object into the "filename" part, set the "failed_decode" flag to True, and leave the "alleged_encoding" field out. This is a case of mojibake: <http://en.wikipedia.org/wiki/Mojibake> The reason to go the mojibake route is that it preserves the information, and in theory someone could later decode it and figure out the original filename. This has actually happened at least once, as shown by the photograph on that wikipedia page of the package which was delivered to the Russian recipient. Mojibake! (footnote 1) How does that sound? Phewf. Okay, now for the trip in the other direction. Suppose you have a Tahoe filename object, and you need to create a file in the local filesystem, because for example the user runs "tahoe cp -r $DIRCAP/subdir .". There are four cases: Case 1: You are using a unicode-safe filesystem such as Windows or Mac, and you have a unicode object with failed_decode=False. This is easy: use the Python unicode filesystem APIs to create the file and be happy. Case 2: You are using a unicode-safe filesystem and you have a unicode object with failed_decode=True. This is easy: use the Python unicode filesystem APIs to create the file, passing the latin-1-decoded filename (mojibake!). Case 3: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=False. This is easy: use the Python unicode filesystem APIs to create the file. Case 4: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=True. Now we should *encode* the filename using iso-8859-1 to get a sequence of bytes, and then write those bytes into the filesystem using the Python string filesystem API. This is no worse than any alternative, and in the case that the target filesystem has the same encoding as the original filesystem (such as because it is the *same* as the original filesystem, or because it is owned by a friend of the owner of the original filesystem), then this will restore the file to its proper name. By the way, please see David Wheeler's recent proposal to start enforcing filename constraints in Linux: <http://lwn.net/Articles/325304> . His proposals include changing Linux to require utf-8-encoding of all filenames. Regards, Zooko footnote 1: I know that Alberto Berti has previously argued on tahoe-dev and on IRC that mojibake is less clean than the alternative of using bytes.decode(alleged_encoding, 'replace'). The latter is lossy, but it more clearly shows to the user that some or all of the filename couldn't be decoded. Alberto and others had convinced me of the wisdom of this, and I actually wrote this entire document specifying the 'decode-with-replace' approach instead of the mojibake approach, but I eventually realized that it wouldn't work. For one thing it was rather complicated to decide how to handle multiple filenames that all decode-with-replace to the same unicode name (you could imagine a whole directory full of files all named '????' because the locale is wrong). But the real killer is what to do when you are going to write the file back into the local filesystem. If you write a decoded-with-replace file back, then this means a round-trip from linux to tahoe and back can mess up all of your filenames. If you write the actual original bytes into the filesystem, then this means that you might accidentally overwrite files with a "tahoe cp", since "tahoe ls" just shows files with "???" in their names, but "tahoe cp" writes files out with actual characters instead of question marks.

warner commented

2009-04-01 07:26:51 +00:00

Zooko: this sounds great to me!

zooko commented

2009-04-01 14:20:40 +00:00

We're going to go ahead and release tahoe-1.4.0 ASAP, but this will be one of the first improvements to land in tahoe-1.5.0, I hope! :-)

zooko added this to the 1.5.0 milestone 2009-04-01 14:20:40 +00:00

francois commented

2009-04-01 19:41:52 +00:00

Author

Owner

Attachment unicode-v3.dpatch (54865 bytes) added

**Attachment** unicode-v3.dpatch (54865 bytes) added

unicode-v3.dpatch

54 KiB

francois commented

2009-04-01 19:44:26 +00:00

Author

Owner

Ok, here an amended version of all the previous patches.

It would be great if people could give it a try under Windows and MacOS X and publish their experiences here.

It currently only handle to simple case where filename encoding is coherent with what's detected by Python (sys.getfilesystemencoding()).

Ok, here an amended version of all the previous patches. It would be great if people could give it a try under Windows and MacOS X and publish their experiences here. It currently only handle to simple case where filename encoding is coherent with what's detected by Python (sys.getfilesystemencoding()).

zooko commented

2009-04-08 23:26:31 +00:00

I'm reviewing your most recent patch, François.

I'll be posting my observations in separate comments as I understand more of the patch.

Here's the first observation:

The patch seems to assume that the terminal handles either ascii or utf-8 on stdout, but what about terminals that handle a different encoding, such as Windows cmd.exe (which presumably handles whatever the current Windows codepage is, or else utf-16le)? Apparently sys.stdout.encoding will tell us what python thinks it should use if you pass a unicode string to it with print myunicstr or sys.stdout.write(myunicstr).

In any case the documentation should explain this -- that what you see when you run tahoe ls will depend on the configuration of your terminal. Hm, this also suggests that it isn't correct for tahoe to have a unicode_to_stdout() function and instead we should just rely on the python sys.stdout encoding behavior. What do you think?

I guess one place where I would be willing to second-guess python on this is, if the sys.stdout.encoding says the encoding is ascii or says that it doesn't know what the encoding is, then pre-encode your unicode strings with utf-8 (or, if on Windows, with utf-16le), before printing them or sys.stdout.write()'ing them. This is because of the following set of reasons:

A misconfigured environment will result in python defaulting to ascii when utf-8 will actually work better (I just now discovered that my own Mac laptop on which I am writing this was so misconfigured, and when I tried to fix it I then misconfigured it in a different way that had the same result! The first was: LANG and LC_ALL were being cleared out in my .bash_profile, the second was: I set LANG and LC_ALL to en_DK.UTF-8, but this version of Mac doesn't support that locale, so I had to change it to en_US.UTF-8.)
Terminals that actually can't handle utf-8 and can only handle ascii are increasingly rare.
If there is something that can handle only ascii and you give it utf-8, you'll be emitting garbage instead of raising an exception, which might be better in some cases. On the other hand I suppose it could be worse in others. (Especially when it happens to produce control characters and screws up your terminal emulator...)

I'm not entirely sure that this second-guessing of python is really going to yield better results more often than it yields worse results, and it is certainly more code, so I would also be happy with just emitting unicode objects to stdout and letting python and the local system config do the work from there.

small details and English spelling and editing:
s/Tahoe v1.3.1/Tahoe v1.5.0/
s/aliase/alias/
s/commande/command/
s/moderns/modern/

I'm reviewing your most recent patch, François. I'll be posting my observations in separate comments as I understand more of the patch. Here's the first observation: The patch seems to assume that the terminal handles either `ascii` or `utf-8` on stdout, but what about terminals that handle a different encoding, such as Windows `cmd.exe` (which presumably handles whatever the current Windows codepage is, or else `utf-16le`)? Apparently `sys.stdout.encoding` will tell us what python thinks it should use if you pass a unicode string to it with `print myunicstr` or `sys.stdout.write(myunicstr)`. In any case the documentation should explain this -- that what you see when you run `tahoe ls` will depend on the configuration of your terminal. Hm, this also suggests that it isn't correct for tahoe to have a `unicode_to_stdout()` function and instead we should just rely on the python `sys.stdout` encoding behavior. What do you think? I guess one place where I would be willing to second-guess python on this is, if the `sys.stdout.encoding` says the encoding is `ascii` or says that it doesn't know what the encoding is, then pre-encode your unicode strings with `utf-8` (or, if on Windows, with `utf-16le`), before printing them or `sys.stdout.write()`'ing them. This is because of the following set of reasons: 1. A misconfigured environment will result in python defaulting to `ascii` when `utf-8` will actually work better (I just now discovered that my own Mac laptop on which I am writing this was so misconfigured, and when I tried to fix it I then misconfigured it in a different way that had the same result! The first was: `LANG` and `LC_ALL` were being cleared out in my `.bash_profile`, the second was: I set `LANG` and `LC_ALL` to `en_DK.UTF-8`, but this version of Mac doesn't support that locale, so I had to change it to `en_US.UTF-8`.) 2. Terminals that actually can't handle `utf-8` and can only handle `ascii` are increasingly rare. 3. If there *is* something that can handle only `ascii` and you give it `utf-8`, you'll be emitting garbage instead of raising an exception, which might be better in some cases. On the other hand I suppose it could be worse in others. (Especially when it happens to produce control characters and screws up your terminal emulator...) I'm not entirely sure that this second-guessing of python is really going to yield better results more often than it yields worse results, and it is certainly more code, so I would also be happy with just emitting unicode objects to stdout and letting python and the local system config do the work from there. small details and English spelling and editing: s/Tahoe v1.3.1/Tahoe v1.5.0/ s/aliase/alias/ s/commande/command/ s/moderns/modern/

zooko commented

2009-04-09 02:54:59 +00:00

Oh, I see that your unicode_to_stdout function is exactly like the builtin python one except that it uses 'replace' mode. That's interesting. I guess maybe that is an improvement over the builtin python behavior (for our particular uses), and it also provides a function where we can add further second-guessing of python such as falling back to utf-8 or utf-16le.

Oh, I see that your `unicode_to_stdout` function is exactly like the builtin python one except that it uses `'replace'` mode. That's interesting. I guess maybe that *is* an improvement over the builtin python behavior (for our particular uses), and it also provides a function where we can add further second-guessing of python such as falling back to `utf-8` or `utf-16le`.

zooko commented

2009-04-09 03:02:29 +00:00

I'm trying to understand this patch better by splitting it up into smaller patches. Here is a version of it which strips out all the unicode_to_stdout() (which I subsequently realized is not a bad idea). The next step probably ought to be splitting this one into yet smaller patches -- such as everything to do with aliases in one, and everything to do with files in another, and everything to do with cmdline args in a third. Not sure if these can be cleanly separated (since filenames and aliases can come in through the cmdline args, for example), but I might try and see if I understand it better by the attempt. :-)

I'm trying to understand this patch better by splitting it up into smaller patches. Here is a version of it which strips out all the `unicode_to_stdout()` (which I subsequently realized is not a bad idea). The next step probably ought to be splitting this one into yet smaller patches -- such as everything to do with aliases in one, and everything to do with files in another, and everything to do with cmdline args in a third. Not sure if these can be cleanly separated (since filenames and aliases can come in through the cmdline args, for example), but I might try and see if I understand it better by the attempt. :-)

zooko commented

2009-04-09 03:59:02 +00:00

Attachment unicode-v3-minus-the-stdout-parts.patch.txt (31445 bytes) added

**Attachment** unicode-v3-minus-the-stdout-parts.patch.txt (31445 bytes) added

unicode-v3-minus-the-stdout-parts.patch.txt

31 KiB

zooko commented

2009-04-09 04:01:34 +00:00

And here is a version of it where I stripped out the aliases parts. (Note: I definitely want the aliases part, I'm just trying to separate them out to understand it better, and I'm posting the intermediate work here just in case François or anyone wants to see what I'm doing.

And here is a version of it where I stripped out the aliases parts. (Note: I definitely *want* the aliases part, I'm just trying to separate them out to understand it better, and I'm posting the intermediate work here just in case François or anyone wants to see what I'm doing.

zooko commented

2009-04-09 04:02:14 +00:00

Attachment unicode-v3-minus-the-stdout-and-aliases-parts.patch.txt (29387 bytes) added

**Attachment** unicode-v3-minus-the-stdout-and-aliases-parts.patch.txt (29387 bytes) added

unicode-v3-minus-the-stdout-and-aliases-parts.patch.txt

29 KiB

zooko commented

2009-04-09 04:04:46 +00:00

Okay and here I've stripped out the argv parts:

zooko commented

2009-04-09 04:09:10 +00:00

Attachment unicode-v3-minus-the-stdout-and-aliases-and-argv-parts.patch.txt (23518 bytes) added

**Attachment** unicode-v3-minus-the-stdout-and-aliases-and-argv-parts.patch.txt (23518 bytes) added

unicode-v3-minus-the-stdout-and-aliases-and-argv-parts.patch.txt

23 KiB

zooko commented

2009-04-09 04:54:37 +00:00

Here I've stripped out the parts having to do with encoding for URLs. Also I noticed some kludgy broken tweaks that I had put in during earlier steps of this process and stripped those out too.

zooko commented

2009-04-09 04:56:05 +00:00

Attachment unicode-v3-minus-the-stdout-and-aliases-and-argv-and-url-parts.patch.txt (16919 bytes) added

**Attachment** unicode-v3-minus-the-stdout-and-aliases-and-argv-and-url-parts.patch.txt (16919 bytes) added

unicode-v3-minus-the-stdout-and-aliases-and-argv-and-url-parts.patch.txt

16 KiB

zooko commented

2009-04-10 16:51:16 +00:00

Hm. I just learned that the windows-1252 encoding is a superset of the iso-8859-1 a.k.a. latin-1 encoding:

http://en.wikipedia.org/wiki/Windows-1252

The difference is that some bytes which are mapped to control characters in iso-8859-1 are mapped to characters in windows-1252. (Also maybe some of the characters are in a different order but that doesn't matter for this purpose.)

Does that mean that when doing the mojibake fallback when decoding fails, if we decode with windows-1252 instead of iso-8859-1 then we'll have fewer control characters in the resulting unicode string? That sounds like an improvement.

Hm. I just learned that the `windows-1252` encoding is a superset of the `iso-8859-1` a.k.a. `latin-1` encoding: <http://en.wikipedia.org/wiki/Windows-1252> The difference is that some bytes which are mapped to control characters in `iso-8859-1` are mapped to characters in `windows-1252`. (Also maybe some of the characters are in a different order but that doesn't matter for this purpose.) Does that mean that when doing the mojibake fallback when decoding fails, if we decode with `windows-1252` instead of `iso-8859-1` then we'll have fewer control characters in the resulting unicode string? That sounds like an improvement.

zooko commented

2009-04-28 16:08:48 +00:00

Hm, so there is this idea by Markus Kuhn called utf-8b. utf-8b decoding is just like utf-8 decoding, except that if the input string turns out not to be valid utf-8 encoding, then utf-8b stores the invalid bytes of the string as invalid code points in the resulting unicode object. This means that utf8b_encode(utf8b_decode(x)) == x for any x (not just for x's which are utf-8-encodings of a unicode string).

I wonder if utf-8b provides a simpler/cleaner way to accomplish the above. It would look like this. Take the design written in (@@http://allmydata.org/trac/tahoe/ticket/534#comment:369015@@) and change step 2 to be like this:

On Linux read the filename with the string APIs to get "bytes" and call sys.getfilesystemencoding() to get "alleged_encoding". If the alleged encoding is ascii or utf-8, or if it absent or invalid or denotes a codec that we don't have an implementation for, then set alleged_encoding = 'utf-8b' instead. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and if the encoding that was used was not utf-8b then store the alleged_encoding. (If the encoding that was used was utf-8b, then don't store the alleged_encoding -- utf-8b is the default and we can save space by omitting it.)

2.b. If this decoding fails, then we decode it with bytes.decode('utf-8b'). Do not normalize it. Put the resulting unicode object into the "filename" part. Do not store an "alleged_encoding".

Using utf-8b to store bytes from a failed decoding instead of iso-8859-1 means that if the name or part of the name is actually ascii or utf-8, then it will be (at least partially) legible. It also means that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in koi8-r, but failed to decode using the koi8-r codec, and so was instead decoded using utf-8b, or whether the filename was originally alleged to be in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's right.)

An implementation, including a Python codec module, by Eric S. Tiedemann (1966-2008; I miss him):
http://hyperreal.org/~est/utf-8b

An implementation for GNU iconv by Ben Sittler:
http://bsittler.livejournal.com/10381.html

A PEP by Martin v. Löwis to automatically use utf-8b whenever you would otherwise use utf-8:
http://www.python.org/dev/peps/pep-0383

Hm, so there is this idea by Markus Kuhn called `utf-8b`. `utf-8b` decoding is just like `utf-8` decoding, except that if the input string turns out not to be valid `utf-8` encoding, then `utf-8b` stores the invalid bytes of the string as invalid code points in the resulting unicode object. This means that `utf8b_encode(utf8b_decode(x)) == x` for any `x` (not just for `x`'s which are `utf-8`-encodings of a unicode string). I wonder if `utf-8b` provides a simpler/cleaner way to accomplish the above. It would look like this. Take the design written in (@@http://allmydata.org/trac/tahoe/ticket/534#[comment:369015](/tahoe-lafs/trac/issues/534#issuecomment-369015)@@) and change step 2 to be like this: 2. On Linux read the filename with the string APIs to get "bytes" and call `sys.getfilesystemencoding()` to get "alleged_encoding". If the alleged encoding is `ascii` or `utf-8`, or if it absent or invalid or denotes a codec that we don't have an implementation for, then set `alleged_encoding = 'utf-8b'` instead. Then, call `bytes.decode(alleged_encoding, 'strict')` to try to get a unicode object. 2.a. If this decoding succeeds then normalize the unicode filename with `filename = unicodedata.normalize('NFC', filename)`, store the resulting filename and if the encoding that was used was *not* `utf-8b` then store the alleged_encoding. (If the encoding that was used was `utf-8b`, then don't store the alleged_encoding -- `utf-8b` is the default and we can save space by omitting it.) 2.b. If this decoding fails, then we decode it with `bytes.decode('utf-8b')`. Do not normalize it. Put the resulting unicode object into the "filename" part. Do not store an "alleged_encoding". Using `utf-8b` to store bytes from a failed decoding instead of `iso-8859-1` means that if the name or part of the name is actually `ascii` or `utf-8`, then it will be (at least partially) legible. It also means that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in `koi8-r`, but failed to decode using the `koi8-r` codec, and so was instead decoded using `utf-8b`, or whether the filename was originally alleged to be in `ascii` or `utf-8`, and was decoded using `utf-8b`. (Right? I think that's right.) An implementation, including a Python codec module, by Eric S. Tiedemann (1966-2008; I miss him): <http://hyperreal.org/~est/utf-8b> An implementation for GNU iconv by Ben Sittler: <http://bsittler.livejournal.com/10381.html> A PEP by Martin v. Löwis to automatically use `utf-8b` whenever you would otherwise use `utf-8`: <http://www.python.org/dev/peps/pep-0383>

zooko commented

2009-04-28 18:00:54 +00:00

I wrote: """Using utf-8b to store bytes from a failed decoding instead of iso-8859-1 means ... that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in koi8-r, but failed to decode using the koi8-r codec, and so was instead decoded using utf-8b, or whether the filename was originally alleged to be in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's right.)"""

Oh no, I am wrong about this because of the existence of byte-oriented systems where the filesystem encoding is not utf-8. When outputting a filename into such a system, you ought to check the "failed_decode" flag and, if it is set, reconstitute the original bytes before proceeding to emit the name using the byte-oriented API.

Here is some code which attempts to explain what I mean. It doesn't actually run -- for example it is missing its import statements -- but writing it helped me think this through a little more:

# A wrapper around the Python Standard Library's filename access functions to
# provide a uniform API for all platforms and to prevent lossy en/de-coding.

class Fname:
    def __init__(self, name, failed_decode=False, alleged_encoding=None):
        self.name = name
        self.failed_decode = failed_decode
        self.alleged_encoding = alleged_encoding

if platform.system() in ('Linux', 'Solaris'):
    # on byte-oriented filesystems, such as Linux and Solaris

    def unicode_to_fs(fn):
        """ Encode an unicode object to bytes. """
        precondition(isinstance(fn, Fname), fn)
        precondition(isinstance(fn.name, unicode), fn.name)

        if fn.failed_decode:
            # This means that the unicode string in .name is not actually the
            # result of a successful decoding with a suggested codec, but is
            # instead the result of stuffing the bytes into a unicode by dint
            # of the utf-8b trick.  This means that on a byte-oriented system,
            # you shouldn't treat the .name as a unicode string containing
            # chars, but instead you should get the original bytes back out of
            # it.
            return fn.name.encode('utf-8b', 'python-replace')
        else:
            fsencoding = sys.getfilesystemencoding()
            if fsencoding in (None, '', 'ascii', 'utf-8'):
                fsencoding = 'utf-8b'
            try:
                return fn.name.encode(encoding, 'python-escape')
            except UnicodeEncodeError:
                raise usage.UsageError("Filename '%s' cannot be encoded using  \
the current encoding of your filesystem (%s). Please configure your locale \
correctly or rename this file." % (s, sys.getfilesystemencoding()))

    def fs_to_unicode(bytesfn):
        """ Decode bytes from the filesystem to a unicode object. """
        precondition(isinstance(bytesfn, str), str)

        alleged_encoding = sys.getfilesystemencoding()
        if alleged_encoding in (None, '', 'ascii', 'utf-8'):
            alleged_encoding = 'utf-8b'
            
        try:
            unicodefn = bytesfn.decode(alleged_encoding, 'strict')
        except UnicodeEncodeError:
            unicodefn = bytesfn.decode('utf-8b', 'python-escape')
            return Fname(unicodefn)
        else:
            unicodefn = unicodedata.normalize('NFC', unicodefn)
            if alleged_encoding == 'utf-8b':
                return Fname(unicodefn)
            else:
                return Fname(unicodefn, alleged_encoding)

    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        bytesfn = unicode_to_fs(fn.name)
        res = os.listdir(bytesfn)
        return([fs_to_unicode(fn) for fn in res])

else:
    # on unicode-oriented filesystems, such as Mac and Windows
    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        return [Fname(n) for n in os.listdir(fn.name)]

Also attached to this ticket...

I wrote: """Using utf-8b to store bytes from a failed decoding instead of iso-8859-1 means ... that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in koi8-r, but failed to decode using the koi8-r codec, and so was instead decoded using utf-8b, or whether the filename was originally alleged to be in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's right.)""" Oh no, I am wrong about this because of the existence of byte-oriented systems where the filesystem encoding is not `utf-8`. When outputting a filename into such a system, you ought to check the "failed_decode" flag and, if it is set, reconstitute the original bytes before proceeding to emit the name using the byte-oriented API. Here is some code which attempts to explain what I mean. It doesn't actually run -- for example it is missing its `import` statements -- but writing it helped me think this through a little more: ``` # A wrapper around the Python Standard Library's filename access functions to # provide a uniform API for all platforms and to prevent lossy en/de-coding. class Fname: def __init__(self, name, failed_decode=False, alleged_encoding=None): self.name = name self.failed_decode = failed_decode self.alleged_encoding = alleged_encoding if platform.system() in ('Linux', 'Solaris'): # on byte-oriented filesystems, such as Linux and Solaris def unicode_to_fs(fn): """ Encode an unicode object to bytes. """ precondition(isinstance(fn, Fname), fn) precondition(isinstance(fn.name, unicode), fn.name) if fn.failed_decode: # This means that the unicode string in .name is not actually the # result of a successful decoding with a suggested codec, but is # instead the result of stuffing the bytes into a unicode by dint # of the utf-8b trick. This means that on a byte-oriented system, # you shouldn't treat the .name as a unicode string containing # chars, but instead you should get the original bytes back out of # it. return fn.name.encode('utf-8b', 'python-replace') else: fsencoding = sys.getfilesystemencoding() if fsencoding in (None, '', 'ascii', 'utf-8'): fsencoding = 'utf-8b' try: return fn.name.encode(encoding, 'python-escape') except UnicodeEncodeError: raise usage.UsageError("Filename '%s' cannot be encoded using \ the current encoding of your filesystem (%s). Please configure your locale \ correctly or rename this file." % (s, sys.getfilesystemencoding())) def fs_to_unicode(bytesfn): """ Decode bytes from the filesystem to a unicode object. """ precondition(isinstance(bytesfn, str), str) alleged_encoding = sys.getfilesystemencoding() if alleged_encoding in (None, '', 'ascii', 'utf-8'): alleged_encoding = 'utf-8b' try: unicodefn = bytesfn.decode(alleged_encoding, 'strict') except UnicodeEncodeError: unicodefn = bytesfn.decode('utf-8b', 'python-escape') return Fname(unicodefn) else: unicodefn = unicodedata.normalize('NFC', unicodefn) if alleged_encoding == 'utf-8b': return Fname(unicodefn) else: return Fname(unicodefn, alleged_encoding) def listdir(fn): assert isinstance(fn, Fname), fn assert isinstance(fn.name, unicode), fn.name bytesfn = unicode_to_fs(fn.name) res = os.listdir(bytesfn) return([fs_to_unicode(fn) for fn in res]) else: # on unicode-oriented filesystems, such as Mac and Windows def listdir(fn): assert isinstance(fn, Fname), fn assert isinstance(fn.name, unicode), fn.name return [Fname(n) for n in os.listdir(fn.name)] ``` Also attached to this ticket...

zooko commented

2009-04-28 18:11:37 +00:00

Attachment fsencoding.py (3005 bytes) added

PEP 383'ish implementation of listdir()

**Attachment** fsencoding.py (3005 bytes) added PEP 383'ish implementation of listdir()

fsencoding.py

2.9 KiB

zooko commented

2009-04-29 06:11:02 +00:00

I was referencing this ticket and my (untested) example code here in the PEP 383 thread on python-dev (http://mail.python.org/pipermail/python-dev/2009-April/089170.html ), and I realized that I forgot to set the "failed_decode" flag in my example code. Here is a new version of it that sets the "failed_decode" flag, and that uses utf-8 for the attempted decode and only uses utf-8b when doing the fallback (for clarity -- should make no difference since the attempted decode uses error handling mode 'strict').

I will also attach it.

# A wrapper around the Python Standard Library's filename access functions to
# provide a uniform API for all platforms and to prevent lossy en/de-coding.

class Fname:
    def __init__(self, name, failed_decode=False, alleged_encoding=None):
        self.name = name
        self.failed_decode = failed_decode
        self.alleged_encoding = alleged_encoding

if platform.system() in ('Linux', 'Solaris'):
    # on byte-oriented filesystems, such as Linux and Solaris

    def unicode_to_fs(fn):
        """ Encode an unicode object to bytes. """
        precondition(isinstance(fn, Fname), fn)
        precondition(isinstance(fn.name, unicode), fn.name)

        if fn.failed_decode:
            # This means that the unicode string in .name is not
            # actually the result of a successful decoding with a
            # suggested codec, but is instead the result of stuffing the
            # bytes into a unicode by dint of the utf-8b trick.  This
            # means that on a byte-oriented system, you shouldn't treat
            # the .name as a unicode string containing chars, but
            # instead you should get the original bytes back out of it.
            return fn.name.encode('utf-8b', 'python-replace')
        else:
            fsencoding = sys.getfilesystemencoding()
            if fsencoding in (None, '', 'ascii', 'utf-8'):
                fsencoding = 'utf-8b'
            try:
                return fn.name.encode(encoding, 'python-escape')
            except UnicodeEncodeError:
                raise usage.UsageError("Filename '%s' cannot be \
encoded using the current encoding of your filesystem (%s). Please \
configure your locale correctly or rename this file." %
                                       (s, sys.getfilesystemencoding()))

    def fs_to_unicode(bytesfn):
        """ Decode bytes from the filesystem to a unicode object. """
        precondition(isinstance(bytesfn, str), str)

        alleged_encoding = sys.getfilesystemencoding()
        if alleged_encoding in (None, '', 'ascii'):
            alleged_encoding = 'utf-8'
            
        try:
            unicodefn = bytesfn.decode(alleged_encoding, 'strict')
        except UnicodeEncodeError:
            unicodefn = bytesfn.decode('utf-8b', 'python-escape')
            return Fname(unicodefn, failed_decode=True)
        else:
            unicodefn = unicodedata.normalize('NFC', unicodefn)
            if alleged_encoding == 'utf-8':
                return Fname(unicodefn)
            else:
                return Fname(unicodefn, alleged_encoding)

    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        bytesfn = unicode_to_fs(fn.name)
        res = os.listdir(bytesfn)
        return([fs_to_unicode(fn) for fn in res])

else:
    # on unicode-oriented filesystems, such as Mac and Windows
    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        return [Fname(n) for n in os.listdir(fn.name)]

I was referencing this ticket and my (untested) example code here in the PEP 383 thread on python-dev (<http://mail.python.org/pipermail/python-dev/2009-April/089170.html> ), and I realized that I forgot to set the "failed_decode" flag in my example code. Here is a new version of it that sets the "failed_decode" flag, and that uses `utf-8` for the attempted decode and only uses `utf-8b` when doing the fallback (for clarity -- should make no difference since the attempted decode uses error handling mode 'strict'). I will also attach it. ``` # A wrapper around the Python Standard Library's filename access functions to # provide a uniform API for all platforms and to prevent lossy en/de-coding. class Fname: def __init__(self, name, failed_decode=False, alleged_encoding=None): self.name = name self.failed_decode = failed_decode self.alleged_encoding = alleged_encoding if platform.system() in ('Linux', 'Solaris'): # on byte-oriented filesystems, such as Linux and Solaris def unicode_to_fs(fn): """ Encode an unicode object to bytes. """ precondition(isinstance(fn, Fname), fn) precondition(isinstance(fn.name, unicode), fn.name) if fn.failed_decode: # This means that the unicode string in .name is not # actually the result of a successful decoding with a # suggested codec, but is instead the result of stuffing the # bytes into a unicode by dint of the utf-8b trick. This # means that on a byte-oriented system, you shouldn't treat # the .name as a unicode string containing chars, but # instead you should get the original bytes back out of it. return fn.name.encode('utf-8b', 'python-replace') else: fsencoding = sys.getfilesystemencoding() if fsencoding in (None, '', 'ascii', 'utf-8'): fsencoding = 'utf-8b' try: return fn.name.encode(encoding, 'python-escape') except UnicodeEncodeError: raise usage.UsageError("Filename '%s' cannot be \ encoded using the current encoding of your filesystem (%s). Please \ configure your locale correctly or rename this file." % (s, sys.getfilesystemencoding())) def fs_to_unicode(bytesfn): """ Decode bytes from the filesystem to a unicode object. """ precondition(isinstance(bytesfn, str), str) alleged_encoding = sys.getfilesystemencoding() if alleged_encoding in (None, '', 'ascii'): alleged_encoding = 'utf-8' try: unicodefn = bytesfn.decode(alleged_encoding, 'strict') except UnicodeEncodeError: unicodefn = bytesfn.decode('utf-8b', 'python-escape') return Fname(unicodefn, failed_decode=True) else: unicodefn = unicodedata.normalize('NFC', unicodefn) if alleged_encoding == 'utf-8': return Fname(unicodefn) else: return Fname(unicodefn, alleged_encoding) def listdir(fn): assert isinstance(fn, Fname), fn assert isinstance(fn.name, unicode), fn.name bytesfn = unicode_to_fs(fn.name) res = os.listdir(bytesfn) return([fs_to_unicode(fn) for fn in res]) else: # on unicode-oriented filesystems, such as Mac and Windows def listdir(fn): assert isinstance(fn, Fname), fn assert isinstance(fn.name, unicode), fn.name return [Fname(n) for n in os.listdir(fn.name)] ```

zooko commented

2009-04-29 06:11:26 +00:00

Attachment fsencode.py (3052 bytes) added

**Attachment** fsencode.py (3052 bytes) added

fsencode.py

3 KiB

francois commented

2009-04-29 16:35:39 +00:00

Author

Owner

As promised, here's a darcs patch bundle containing the 17 patches which implements basic Unicode filename support for review purpose.

As promised, here's a darcs patch bundle containing the 17 patches which implements *basic* Unicode filename support for review purpose.

francois commented

2009-04-29 16:36:10 +00:00

Author

Owner

Attachment tahoe-534-bundle.dpatch (70453 bytes) added

**Attachment** tahoe-534-bundle.dpatch (70453 bytes) added

tahoe-534-bundle.dpatch

69 KiB

zooko commented

2009-04-29 16:55:48 +00:00

Hooray! Thanks!

zooko commented

2009-05-01 04:53:47 +00:00

Fixed another small error that I discovered while writing to python-dev. Attaching latest version.

zooko commented

2009-05-01 04:54:47 +00:00

Attachment fsencode.2.py (3052 bytes) added

**Attachment** fsencode.2.py (3052 bytes) added

fsencode.2.py

3 KiB

zooko commented

2009-05-01 05:09:15 +00:00

Fixed a couple more bugs or unnecessaries found by inspection while writing to tahoe-dev.

zooko commented

2009-05-01 05:09:31 +00:00

Attachment fsencode.3.py (2912 bytes) added

**Attachment** fsencode.3.py (2912 bytes) added

fsencode.3.py

2.8 KiB

zooko commented

2009-05-01 15:33:16 +00:00

Yet again I realize that my previous understanding was incomplete or subtly flawed. This encoding stuff is very tricky! Here's my post to python-dev where I postulate that PEP 383 actually does solve most of our problems after all:

http://mail.python.org/pipermail/python-dev/2009-May/089318.html

Please read!

Yet again I realize that my previous understanding was incomplete or subtly flawed. This encoding stuff is very tricky! Here's my post to python-dev where I postulate that PEP 383 actually does solve most of our problems after all: <http://mail.python.org/pipermail/python-dev/2009-May/089318.html> Please read!

zooko commented

2009-05-03 15:14:28 +00:00

Okay, my current design is at the end of this message, including rationale:

http://allmydata.org/pipermail/tahoe-dev/2009-May/001670.html

Here is the summary:

To copy an entry from a local filesystem into Tahoe:

On Windows or Mac read the filename with the unicode APIs.
Normalize the string with filename = unicodedata.normalize('NFC',
filename). Leave the "original_bytes" key and the "failed_decode" flag
out of the metadata.
On Linux or Solaris read the filename with the string APIs, and
store the result in the "original_bytes" part of the metadata. Call
sys.getfilesystemencoding() to get an alleged_encoding. Then, call
bytes.decode(alleged_encoding, 'strict') to try to get a unicode
object.

2.a. If this decoding succeeds then normalize the unicode filename
with filename = unicodedata.normalize('NFC', filename), store the
resulting filename and leave the "failed_decode" flag out of the
metadata.

2.b. If this decoding fails, then we decode it again with
bytes.decode('latin-1', 'strict'). Do not normalize it. Store the
resulting unicode object into the "filename" part, set the
"failed_decode" flag to True. This is mojibake!

(handling collisions) In either case 2.a or 2.b the resulting
unicode string may already be present in the directory. If so, check
the failed_decode flags on the current entry and the new entry. If
they are both set or both unset then the new entry overwrites the old
entry -- they had the same name. If the failed_decode flags differ
then this is a case of collision -- the old entry and the new entry
had (as far as we are concerned) different names that accidentally
generated the same unicode. Alter the new entry's name, for example by
appending "~1" and then trying again and incrementing the number until
it doesn't match any extant entry.

To copy an entry from Tahoe into a local filesystem:

Always use the Python unicode API. The original_bytes field and the
failed_decode field in the metadata are not consulted.

Okay, my current design is at the end of this message, including rationale: <http://allmydata.org/pipermail/tahoe-dev/2009-May/001670.html> Here is the summary: To copy an entry from a local filesystem into Tahoe: 1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata. 2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object. 2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata. 2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake! 3. (handling collisions) In either case 2.a or 2.b the resulting unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name. If the failed_decode flags differ then this is a case of collision -- the old entry and the new entry had (as far as we are concerned) different names that accidentally generated the same unicode. Alter the new entry's name, for example by appending "~1" and then trying again and incrementing the number until it doesn't match any extant entry. To copy an entry from Tahoe into a local filesystem: Always use the Python unicode API. The original_bytes field and the failed_decode field in the metadata are not consulted.

zooko commented

2009-06-09 19:48:05 +00:00

François nicely broke his patch up into many small patches and attached the whole set as "tahoe-534-bundle.dpatch" ((68.8 KB) - added by francois 6 weeks ago). Here I'm attaching them as separate attachments so that I can link to them. Also I re-recorded one in order to avoid a conflict in [test_cli.py]source:src/allmydata/test/test_cli.py.

zooko commented

2009-06-09 19:49:58 +00:00

Attachment plumbing for unicode support.darcspatch (11899 bytes) added

**Attachment** plumbing for unicode support.darcspatch (11899 bytes) added

plumbing for unicode support.darcspatch

12 KiB

zooko commented

2009-06-09 19:50:08 +00:00

Attachment tahoe manifest unicode support.darcspatch (9482 bytes) added

**Attachment** tahoe manifest unicode support.darcspatch (9482 bytes) added

tahoe manifest unicode support.darcspatch

9.3 KiB

zooko commented

2009-06-09 19:50:17 +00:00

Attachment tahoe ls unicode support.darcspatch (9627 bytes) added

**Attachment** tahoe ls unicode support.darcspatch (9627 bytes) added

tahoe ls unicode support.darcspatch

9.4 KiB

zooko commented

2009-06-09 19:51:37 +00:00

Attachment tahoe get unicode support.darcspatch (9144 bytes) added

**Attachment** tahoe get unicode support.darcspatch (9144 bytes) added

tahoe get unicode support.darcspatch

8.9 KiB

zooko commented

2009-06-09 19:51:46 +00:00

Attachment tahoe webopen unicode support.darcspatch (8998 bytes) added

**Attachment** tahoe webopen unicode support.darcspatch (8998 bytes) added

tahoe webopen unicode support.darcspatch

8.8 KiB

zooko commented

2009-06-09 19:51:57 +00:00

Attachment tahoe rm unicode support.darcspatch (9074 bytes) added

**Attachment** tahoe rm unicode support.darcspatch (9074 bytes) added

tahoe rm unicode support.darcspatch

8.9 KiB

zooko commented

2009-06-09 19:52:09 +00:00

Attachment tahoe ln unicode support.darcspatch (9074 bytes) added

**Attachment** tahoe ln unicode support.darcspatch (9074 bytes) added

tahoe ln unicode support.darcspatch

8.9 KiB

zooko commented

2009-06-09 19:52:17 +00:00

Attachment tahoe status unicode support.darcspatch (9031 bytes) added

**Attachment** tahoe status unicode support.darcspatch (9031 bytes) added

tahoe status unicode support.darcspatch

8.8 KiB

zooko commented

2009-06-09 19:52:27 +00:00

Attachment tahoe check unicode support.darcspatch (9026 bytes) added

**Attachment** tahoe check unicode support.darcspatch (9026 bytes) added

tahoe check unicode support.darcspatch

8.8 KiB

zooko commented

2009-06-09 19:52:35 +00:00

Attachment tahoe deep-check unicode support.darcspatch (9042 bytes) added

**Attachment** tahoe deep-check unicode support.darcspatch (9042 bytes) added

tahoe deep-check unicode support.darcspatch

8.8 KiB

zooko commented

2009-06-09 19:52:45 +00:00

Attachment tahoe mkdir unicode support.darcspatch (10364 bytes) added

**Attachment** tahoe mkdir unicode support.darcspatch (10364 bytes) added

tahoe mkdir unicode support.darcspatch

10 KiB

zooko commented

2009-06-09 19:52:58 +00:00

Attachment explain the current unicode support in CLI.darcspatch (9891 bytes) added

**Attachment** explain the current unicode support in CLI.darcspatch (9891 bytes) added

explain the current unicode support in CLI.darcspatch

9.7 KiB

zooko commented

2009-06-09 19:53:07 +00:00

Attachment tahoe cp unicode support.darcspatch (15604 bytes) added

**Attachment** tahoe cp unicode support.darcspatch (15604 bytes) added

tahoe cp unicode support.darcspatch

15 KiB

zooko commented

2009-06-09 19:53:18 +00:00

Attachment tahoe put unicode support.darcspatch (11435 bytes) added

**Attachment** tahoe put unicode support.darcspatch (11435 bytes) added

tahoe put unicode support.darcspatch

11 KiB

zooko commented

2009-06-09 19:53:26 +00:00

Attachment aliases unicode support.darcspatch (12817 bytes) added

**Attachment** aliases unicode support.darcspatch (12817 bytes) added

aliases unicode support.darcspatch

12 KiB

zooko commented

2009-06-09 19:53:39 +00:00

Attachment previous cli related patches should.darcspatch (9150 bytes) added

**Attachment** previous cli related patches should.darcspatch (9150 bytes) added

8.9 KiB

zooko commented

2009-06-09 19:53:48 +00:00

Attachment tahoe backup unicode support.darcspatch (15101 bytes) added

**Attachment** tahoe backup unicode support.darcspatch (15101 bytes) added

tahoe backup unicode support.darcspatch

15 KiB

zooko commented

2009-06-21 20:01:35 +00:00

Unsetting the "review" keyword because this patch is not ready to review until François (or someone) reworks it to use os.listdir(a_unicode_thing) on Windows and Mac and os.listdir(a_string_thing) on other platforms.

Unsetting the "review" keyword because this patch is not ready to review until François (or someone) reworks it to use `os.listdir(a_unicode_thing)` on Windows and Mac and `os.listdir(a_string_thing)` on other platforms.

zooko modified the milestone from 1.5.0 to eventually

2009-06-30 18:08:07 +00:00

daira commented

2009-11-23 03:22:23 +00:00

Presumably these are the patches #734 is referring to.

daira commented

2009-12-06 21:18:06 +00:00

To fix #734, unicode_to_stdout in the patched util/stringutil.py should be something like:

def unicode_to_stdout(s):
    """
    Encode an unicode object for representation on stdout.
    """

    if s is None:
        return None
    precondition(isinstance(s, unicode), s)

    try:
        return s.encode(sys.stdout.encoding, 'replace')
    catch LookupError:
        return s.encode('utf-8', 'replace')
}

(This doesn't explain why tahoe_ls.py} was attempting to use None for name it was trying to convert, but that's presumably a separate issue not caused by the patch.)

The reason for the catch [LookupError](wiki/LookupError) is that Python can sometimes set sys.stdout.encoding to an encoding that it does not recognize. For example, if you have Windows cmd.exe set to use UTF-8 by chcp 65001, sys.stdin.encoding and sys.stdout.encoding will be 'cp65001', which is not recognized as being the same as UTF-8. (If the encoding was actually something other than UTF-8, I think that producing mojibake on stdout is better than throwing an exception. Note that terminals that do bad things when they receive control characters are already broken; this isn't making it worse, because plain ASCII can include such control characters.)

There are other issues with the patch; I just wanted to comment on this part while it's swapped into my brain (and people are now calling me to go and play scrabble, which sounds more fun :-)

To fix #734, `unicode_to_stdout` in the patched `util/stringutil.py` should be something like: ``` def unicode_to_stdout(s): """ Encode an unicode object for representation on stdout. """ if s is None: return None precondition(isinstance(s, unicode), s) try: return s.encode(sys.stdout.encoding, 'replace') catch LookupError: return s.encode('utf-8', 'replace') } ``` (This doesn't explain why `tahoe_ls.py`} was attempting to use None for `name` it was trying to convert, but that's presumably a separate issue not caused by the patch.) The reason for the `catch [LookupError](wiki/LookupError)` is that Python can sometimes set `sys.stdout.encoding` to an encoding that it does not recognize. For example, if you have Windows `cmd.exe` set to use UTF-8 by `chcp 65001`, `sys.stdin.encoding` and `sys.stdout.encoding` will be 'cp65001', which is not recognized as being the same as UTF-8. (If the encoding was actually something other than UTF-8, I think that producing mojibake on stdout is better than throwing an exception. Note that terminals that do bad things when they receive control characters are already broken; this isn't making it worse, because plain ASCII can include such control characters.) There are other issues with the patch; I just wanted to comment on this part while it's swapped into my brain (and people are now calling me to go and play scrabble, which sounds more fun :-)

daira commented

2009-12-06 22:29:25 +00:00

Replying to davidsarah:

catch [LookupError](wiki/LookupError):

Python was obviously not fully swapped in -- s/catch/except/.

Replying to [davidsarah](/tahoe-lafs/trac/issues/534#issuecomment-369040): > ` catch [LookupError](wiki/LookupError):` Python was obviously not fully swapped in -- s/`catch`/`except`/.

daira added

p/major

and removed

p/minor

labels 2009-12-07 03:18:12 +00:00

daira changed title from ~~"tahoe cp" command encoding issue~~ to CLI (e.g. tahoe cp) does not handle Unicode file/directory names, either in arguments or the local filesystem

2009-12-07 03:18:12 +00:00

daira changed title from ~~CLI (e.g. tahoe cp) does not handle Unicode file/directory names, either in arguments or the local filesystem~~ to CLI (e.g. tahoe cp) does not correctly handle or display Unicode file/directory names, either in arguments or the local filesystem

2009-12-07 03:19:21 +00:00

daira commented

2009-12-07 03:52:53 +00:00

Sigh. I misread the exception message in #734. It is complaining about sys.stdout.encoding being None when stdout is redirected, not about the input string being None. So the fix for that should be something like

def unicode_to_stdout(s):
    """
    Encode an unicode object for representation on stdout.
    """

    precondition(isinstance(s, unicode), s)

    enc = sys.stdout.encoding
    if enc is None or enc == 'cp65001':
        enc = 'utf-8'

    try:
        return s.encode(enc, 'replace')
    catch LookupError:
        return s.encode('utf-8', 'replace')  # maybe

It might also be a good idea to write a BOM at the start of the output. That would allow the encoding to be autodetected if a file containing redirected output is edited, which is helpful at least on Windows, and should be harmless on other platforms.

At least I did better at scrabble (won by 3 points :-)

Sigh. I misread the exception message in #734. It is complaining about `sys.stdout.encoding` being None when stdout is redirected, not about the input string being None. So the fix for that should be something like ``` def unicode_to_stdout(s): """ Encode an unicode object for representation on stdout. """ precondition(isinstance(s, unicode), s) enc = sys.stdout.encoding if enc is None or enc == 'cp65001': enc = 'utf-8' try: return s.encode(enc, 'replace') catch LookupError: return s.encode('utf-8', 'replace') # maybe ``` It might also be a good idea to write a BOM at the start of the output. That would allow the encoding to be autodetected if a file containing redirected output is edited, which is helpful at least on Windows, and should be harmless on other platforms. At least I did better at scrabble (won by 3 points :-)

zooko modified the milestone from eventually to 1.7.0

2010-01-27 06:07:41 +00:00

francois commented

2010-04-24 00:11:27 +00:00

Author

Owner

Attachment unicode-helper-functions.diff (10347 bytes) added

This patch contains Unicode helper functions (stringutils.py) and associated tests

**Attachment** unicode-helper-functions.diff (10347 bytes) added This patch contains Unicode helper functions (stringutils.py) and associated tests

unicode-helper-functions.diff

10 KiB

francois commented

2010-04-24 00:22:43 +00:00

Author

Owner

I'd be very grateful if someone could review patch attachment:unicode-helper-functions.diff

Then, I'll post a second patch which will modify the CLI code to actually use those helper functions.

I'd be very grateful if someone could review patch [attachment:unicode-helper-functions.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-2933aeed2776) Then, I'll post a second patch which will modify the CLI code to actually use those helper functions.

zooko commented

2010-04-24 04:37:10 +00:00

I will review it. If I understand correctly François's strategy is to first write patches to handle the case that the bytes decode without error using the nominal encoding. This avoids all the tricky problems with bytes that don't decode according to the nominal encoding.

I'm excited about having this mocking tool so we can conveniently express platform-specific behavior in a deterministic, automated, cross-platform test.

I will review it. If I understand correctly François's strategy is to first write patches to handle the case that the bytes decode without error using the nominal encoding. This avoids all the tricky problems with bytes that don't decode according to the nominal encoding. I'm excited about having this mocking tool so we can conveniently express platform-specific behavior in a deterministic, automated, cross-platform test.

francois commented

2010-04-25 20:30:19 +00:00

Author

Owner

I plan to add the following text to the NEWS file to tell users the current state of Unicode support.

** Bugfixes

*** Unicode filenames handling

Tahoe CLI commands working on local files, for instance 'tahoe cp' or 'tahoe
backup', have been improved to correctly handle filenames containing non-ASCII
characters.

In the case where Tahoe encounters a filename which cannot be decoded using the
system encoding, an error will be returned and the operation will fail.  Under
Linux, this typically happens when the filesystem contains filenames encoded
with another encoding, for instance latin1, than the system locale, for
instance UTF-8.  In such case, you'll need to fix your system with tools such
as 'convmv' before using Tahoe CLI.

All CLI commands have been improved to support non-ASCII parameters such as 
filenames and aliases on all supported Operating Systems.

I plan to add the following text to the **NEWS** file to tell users the current state of Unicode support. ``` ** Bugfixes *** Unicode filenames handling Tahoe CLI commands working on local files, for instance 'tahoe cp' or 'tahoe backup', have been improved to correctly handle filenames containing non-ASCII characters. In the case where Tahoe encounters a filename which cannot be decoded using the system encoding, an error will be returned and the operation will fail. Under Linux, this typically happens when the filesystem contains filenames encoded with another encoding, for instance latin1, than the system locale, for instance UTF-8. In such case, you'll need to fix your system with tools such as 'convmv' before using Tahoe CLI. All CLI commands have been improved to support non-ASCII parameters such as filenames and aliases on all supported Operating Systems. ```

francois commented

2010-05-04 07:27:42 +00:00

Author

Owner

Attachment unicode-filenames-handling.diff (29300 bytes) added

**Attachment** unicode-filenames-handling.diff (29300 bytes) added

unicode-filenames-handling.diff

29 KiB

francois commented

2010-05-04 07:33:06 +00:00

Author

Owner

Patch attachment:unicode-filenames-handling.diff makes use of helper functions from attachment:unicode-helper-functions.diff to handle Unicode filenames in the CLI.

As for now, only filenames which are encoded in a coherent way according to the system locale are handled. In the other case, an error message is displayed and the current operation is stopped.

Unicode CLI argument in Windows cmd.exe doesn't work. See ticket #565 on this issue.

Patch [attachment:unicode-filenames-handling.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-83c5e2386dd7) makes use of helper functions from [attachment:unicode-helper-functions.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-2933aeed2776) to handle Unicode filenames in the CLI. As for now, only filenames which are encoded in a coherent way according to the system locale are handled. In the other case, an error message is displayed and the current operation is stopped. Unicode CLI argument in Windows cmd.exe doesn't work. See ticket #565 on this issue.

zooko commented

2010-05-11 15:52:49 +00:00

I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.

zooko commented

2010-05-14 04:36:13 +00:00

Here is my review of attachment:unicode-filenames-handling.diff and attachment:unicode-helper-functions.diff .

Overall this is a good start. I'm glad we have a way to do real thorough tests of encoding (by mocking up stdout, sys, and other components of the system). This patch isn't ready to go into trunk so we won't ship it in v1.7 but at this rate we can definitely get it all polished up for v1.8.

Here are my comments:

I like the NEWS file text. :-)

The unit tests of the helper functions look good! Please add tests of the two cases where listdir_unicode_unix() would raise FilenameEncodingError.

     # If you force Windows cmd.exe set to use UTF-8 by typing 'chcp 65001',
     # sys.stdin.encoding and sys.stdout.encoding will be set to 'cp65001',
     # which is not recognized as being the same as UTF-8.

What does this comment mean? Recognized by whom? The code looks like it detects this and corrects it, so I guess you don't mean recognized by Tahoe-LAFS.

     if enc == 'cp850':
         enc = 'ISO-8859-1'

This looks wrong! cp850 and iso-8859-1 are different encodings. If I am right that this is wrong then it should be possible to write a test case that causes Tahoe-LAFS to incorrectly decode an input as long as this statement is still in the code.

Oh and look what I just found! It appears that cp65001 and utf-8 are also different: http://bugs.python.org/issue6058 (specifically http://bugs.python.org/msg97731 ). Let's see if we can write a unit test that goes red when Tahoe-LAFS (or perhaps the underlying Python libraries) incorrectly equate cp65001 and utf-8.

def argv_to_unicode(s):
...
         return unicode(s, get_stdout_encoding())
...
         raise usageError("Argument '%s' cannot be decoded as UTF-8." % s)

Is stdout encoding guaranteed to be the same as argv encoding? Maybe we should rename get_stdout_encoding() or at least add a comment to argv_to_unicode() explaining that we use stdout encoding for this. Also the UsageError has the wrong text. Also it has the wrong name of the exception class! This proves that the line is not covered by unit tests. Please add a unit test of what happens with something that can't be decoded as the stdout encoding. :-)

def unicode_to_stdout(s):
...
    except LookupError:
        return s.encode('utf-8', 'replace')  # maybe

If we're going to do this fallback let's have a unit test of it. This means a test of the case that get_stdout_encoding() returns an encoding that Python doesn't know how to do, right!? Maybe let's not have a test of that and also not have that code at all and let it raise LookupError in that case. :-)

"encoutered" -> "encountered"

Finish the test for open_unicode().

Add a test that the normalization is useful, i.e. a test that is red if we skip the unicodedata.normalize() step.

There are a couple of hard-coded 'utf-8' and a literal 'ascii' in attachment:unicode-filenames-handling.diff . Some of them may be correct, for example if it is encoding to a URL, but in that case maybe we should invoke unicode_to_url() instead (mostly for documentation purposes). Some of them appear to be wrong to me -- aliases files shouldn't be hard-coded to utf-8... Oh wait, yes they should, right? Then please add doc stating so. Finally the encoding for stdout in tahoe_manifest.py definitely seems wrong to be hard-coded to 'utf-8'. Add a unit test showing that the current code with the hard-coded 'utf-8' fails on a system where stdout needs a different encoding.

In test_cli.py, if the test_listdir_unicode_bad() test doesn't make sense on the current platform, then raise SkipTest instead of returning.
Finally, use a code coverage tool (I recommend trialcoverage) to see if any of the lines changed by these patches are not-fully-tested by these tests.

Here is my review of [attachment:unicode-filenames-handling.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-83c5e2386dd7) and [attachment:unicode-helper-functions.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-2933aeed2776) . Overall this is a good start. I'm glad we have a way to do real thorough tests of encoding (by mocking up stdout, sys, and other components of the system). This patch isn't ready to go into trunk so we won't ship it in v1.7 but at this rate we can definitely get it all polished up for v1.8. Here are my comments: I like the NEWS file text. :-) The unit tests of the helper functions look good! Please add tests of the two cases where `listdir_unicode_unix()` would raise `FilenameEncodingError`. ``` # If you force Windows cmd.exe set to use UTF-8 by typing 'chcp 65001', # sys.stdin.encoding and sys.stdout.encoding will be set to 'cp65001', # which is not recognized as being the same as UTF-8. ``` What does this comment mean? Recognized by whom? The code looks like it detects this and corrects it, so I guess you don't mean recognized by Tahoe-LAFS. ``` if enc == 'cp850': enc = 'ISO-8859-1' ``` This looks wrong! cp850 and iso-8859-1 are different encodings. If I am right that this is wrong then it should be possible to write a test case that causes Tahoe-LAFS to incorrectly decode an input as long as this statement is still in the code. Oh and look what I just found! It appears that cp65001 and utf-8 are also different: <http://bugs.python.org/issue6058> (specifically <http://bugs.python.org/msg97731> ). Let's see if we can write a unit test that goes red when Tahoe-LAFS (or perhaps the underlying Python libraries) incorrectly equate cp65001 and utf-8. ``` def argv_to_unicode(s): ... return unicode(s, get_stdout_encoding()) ... raise usageError("Argument '%s' cannot be decoded as UTF-8." % s) ``` Is stdout encoding guaranteed to be the same as argv encoding? Maybe we should rename `get_stdout_encoding()` or at least add a comment to `argv_to_unicode()` explaining that we use stdout encoding for this. Also the UsageError has the wrong text. Also it has the wrong name of the exception class! This proves that the line is not covered by unit tests. Please add a unit test of what happens with something that can't be decoded as the stdout encoding. :-) ``` def unicode_to_stdout(s): ... except LookupError: return s.encode('utf-8', 'replace') # maybe ``` If we're going to do this fallback let's have a unit test of it. This means a test of the case that `get_stdout_encoding()` returns an encoding that Python doesn't know how to do, right!? Maybe let's not have a test of that and also not have that code at all and let it raise LookupError in that case. :-) "encoutered" -> "encountered" Finish the test for `open_unicode()`. Add a test that the normalization is useful, i.e. a test that is red if we skip the `unicodedata.normalize()` step. There are a couple of hard-coded `'utf-8'` and a literal `'ascii'` in [attachment:unicode-filenames-handling.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-83c5e2386dd7) . Some of them may be correct, for example if it is encoding to a URL, but in that case maybe we should invoke `unicode_to_url()` instead (mostly for documentation purposes). Some of them appear to be wrong to me -- aliases files shouldn't be hard-coded to utf-8... Oh wait, yes they should, right? Then please add doc stating so. Finally the encoding for stdout in tahoe_manifest.py definitely seems wrong to be hard-coded to 'utf-8'. Add a unit test showing that the current code with the hard-coded 'utf-8' fails on a system where stdout needs a different encoding. In test_cli.py, if the `test_listdir_unicode_bad()` test doesn't make sense on the current platform, then raise `SkipTest` instead of returning. Finally, use a code coverage tool (I recommend [trialcoverage](http://pypi.python.org/pypi/trialcoverage)) to see if any of the lines changed by these patches are not-fully-tested by these tests.

zooko modified the milestone from 1.7.0 to 1.8.0

2010-05-14 04:36:13 +00:00

francois commented

2010-05-16 22:06:49 +00:00

Author

Owner

Zooko, thank you very much for the review.

Replying to zooko:

Here is my review of attachment:unicode-filenames-handling.diff and
attachment:unicode-helper-functions.diff .

Overall this is a good start. I'm glad we have a way to do real
thorough tests of encoding (by mocking up stdout, sys, and other
components of the system). This patch isn't ready to go into trunk so
we won't ship it in v1.7 but at this rate we can definitely get it all
polished up for v1.8.

Please find an updated version of the patches which takes all your
comments into account and has now 100% test coverage.

The unit tests of the helper functions look good! Please add tests of
the two cases where listdir_unicode_unix() would raise
FilenameEncodingError.

Done

     # If you force Windows cmd.exe set to use UTF-8 by typing 'chcp 65001',
     # sys.stdin.encoding and sys.stdout.encoding will be set to 'cp65001',
     # which is not recognized as being the same as UTF-8.
What does this comment mean? Recognized by whom? The code looks like
it detects this and corrects it, so I guess you don't mean recognized
by Tahoe-LAFS.
     if enc == 'cp850':
         enc = 'ISO-8859-1'
This looks wrong! cp850 and iso-8859-1 are different encodings. If I
am right that this is wrong then it should be possible to write a test
case that causes Tahoe-LAFS to incorrectly decode an input as long as
this statement is still in the code.

Oh and look what I just found! It appears that cp65001 and utf-8 are
also different: http://bugs.python.org/issue6058 (specifically
http://bugs.python.org/msg97731 ). Let's see if we can write a unit
test that goes red when Tahoe-LAFS (or perhaps the underlying Python
libraries) incorrectly equate cp65001 and utf-8.

The idea behind this piece code came from comment:369040 and comment:369046.
Anyway, I removed it completely because getting correct Unicode
command-line arguments under Windows will require calling the Unicode
Windows API function GetCommandLineW, see ticket #565.

In the meantime, I've marked all Windows test cases as todo.

def argv_to_unicode(s):
...
         return unicode(s, get_stdout_encoding())
...
         raise usageError("Argument '%s' cannot be decoded as UTF-8." % s)
Is stdout encoding guaranteed to be the same as argv encoding? Maybe
we should rename get_stdout_encoding() or at least add a comment
to argv_to_unicode() explaining that we use stdout encoding for
this. Also the UsageError has the wrong text. Also it has the wrong
name of the exception class! This proves that the line is not covered
by unit tests. Please add a unit test of what happens with something
that can't be decoded as the stdout encoding. :-)

The function was renamed get_term_encoding(), the wrong comment was
rewritten and the error message now contains the guessed terminal
encoding.

def unicode_to_stdout(s):
...
    except [LookupError](wiki/LookupError):
        return s.encode('utf-8', 'replace')  # maybe
If we're going to do this fallback let's have a unit test of it. This
means a test of the case that get_stdout_encoding() returns an
encoding that Python doesn't know how to do, right!? Maybe let's not
have a test of that and also not have that code at all and let it
raise LookupError in that case. :-)

According to the email message linked from ticket #734, users probably
prefers having a slightly modified ouput in the terminal (unknown
characters replaced by '?') than an exception.

"encoutered" -> "encountered"

Fixed :)

Finish the test for open_unicode().

Done.

Add a test that the normalization is useful, i.e. a test that is red
if we skip the unicodedata.normalize() step.

Done in test_unicode_normalization().

There are a couple of hard-coded 'utf-8' and a literal
'ascii' in attachment:unicode-filenames-handling.diff . Some of
them may be correct, for example if it is encoding to a URL, but in
that case maybe we should invoke unicode_to_url() instead
(mostly for documentation purposes).

I already invoke unicode_to_url() instead of
.encode('utf-8') in every place. The two remaining hard-coded
'utf-8' occurrences are for the alias file encoding.

Hard-coded 'ascii' is used for representing capabilities, it might
be better to create Capability objects instead of using str
but that's most likely out of the scope of this patch.

Some of them appear to be wrong to me -- aliases files shouldn't be
hard-coded to utf-8... Oh wait, yes they should, right? Then please
add doc stating so. Finally the encoding for stdout in
tahoe_manifest.py definitely seems wrong to be hard-coded to 'utf-8'.
Add a unit test showing that the current code with the hard-coded
'utf-8' fails on a system where stdout needs a different encoding.

Yes, the alias file is read and written as UTF-8. A note was added in
frontends/CLI.txt.

In test_cli.py, if the test_listdir_unicode_bad() test doesn't
make sense on the current platform, then raise SkipTest instead
of returning.

Done.

Finally, use a code coverage tool (I recommend
[http://pypi.python.org/pypi/trialcoverage trialcoverage]) to see if
any of the lines changed by these patches are not-fully-tested by
these tests.

I had trouble making use of test-coverage Makefile target, so I
finally used coverage.py directly.

Zooko, thank you very much for the review. Replying to [zooko](/tahoe-lafs/trac/issues/534#issuecomment-369053): > Here is my review of [attachment:unicode-filenames-handling.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-83c5e2386dd7) and > [attachment:unicode-helper-functions.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-2933aeed2776) . > > Overall this is a good start. I'm glad we have a way to do real > thorough tests of encoding (by mocking up stdout, sys, and other > components of the system). This patch isn't ready to go into trunk so > we won't ship it in v1.7 but at this rate we can definitely get it all > polished up for v1.8. Please find an updated version of the patches which takes all your comments into account and has now 100% test coverage. > The unit tests of the helper functions look good! Please add tests of > the two cases where `listdir_unicode_unix()` would raise > `FilenameEncodingError`. Done > ``` > # If you force Windows cmd.exe set to use UTF-8 by typing 'chcp 65001', > # sys.stdin.encoding and sys.stdout.encoding will be set to 'cp65001', > # which is not recognized as being the same as UTF-8. > ``` > What does this comment mean? Recognized by whom? The code looks like > it detects this and corrects it, so I guess you don't mean recognized > by Tahoe-LAFS. > > ``` > if enc == 'cp850': > enc = 'ISO-8859-1' > ``` > This looks wrong! cp850 and iso-8859-1 are different encodings. If I > am right that this is wrong then it should be possible to write a test > case that causes Tahoe-LAFS to incorrectly decode an input as long as > this statement is still in the code. > > Oh and look what I just found! It appears that cp65001 and utf-8 are > also different: <http://bugs.python.org/issue6058> (specifically > <http://bugs.python.org/msg97731> ). Let's see if we can write a unit > test that goes red when Tahoe-LAFS (or perhaps the underlying Python > libraries) incorrectly equate cp65001 and utf-8. The idea behind this piece code came from [comment:369040](/tahoe-lafs/trac/issues/534#issuecomment-369040) and [comment:369046](/tahoe-lafs/trac/issues/534#issuecomment-369046). Anyway, I removed it completely because getting correct Unicode command-line arguments under Windows will require calling the Unicode Windows API function `GetCommandLineW`, see ticket #565. In the meantime, I've marked all Windows test cases as todo. > ``` > def argv_to_unicode(s): > ... > return unicode(s, get_stdout_encoding()) > ... > raise usageError("Argument '%s' cannot be decoded as UTF-8." % s) > ``` > Is stdout encoding guaranteed to be the same as argv encoding? Maybe > we should rename `get_stdout_encoding()` or at least add a comment > to `argv_to_unicode()` explaining that we use stdout encoding for > this. Also the UsageError has the wrong text. Also it has the wrong > name of the exception class! This proves that the line is not covered > by unit tests. Please add a unit test of what happens with something > that can't be decoded as the stdout encoding. :-) The function was renamed get_term_encoding(), the wrong comment was rewritten and the error message now contains the guessed terminal encoding. > ``` > def unicode_to_stdout(s): > ... > except [LookupError](wiki/LookupError): > return s.encode('utf-8', 'replace') # maybe > ``` > If we're going to do this fallback let's have a unit test of it. This > means a test of the case that `get_stdout_encoding()` returns an > encoding that Python doesn't know how to do, right!? Maybe let's not > have a test of that and also not have that code at all and let it > raise LookupError in that case. :-) According to the email message linked from ticket #734, users probably prefers having a slightly modified ouput in the terminal (unknown characters replaced by '?') than an exception. > "encoutered" -> "encountered" Fixed :) > Finish the test for `open_unicode()`. Done. > Add a test that the normalization is useful, i.e. a test that is red > if we skip the `unicodedata.normalize()` step. Done in `test_unicode_normalization()`. > There are a couple of hard-coded `'utf-8'` and a literal > `'ascii'` in [attachment:unicode-filenames-handling.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-83c5e2386dd7) . Some of > them may be correct, for example if it is encoding to a URL, but in > that case maybe we should invoke `unicode_to_url()` instead > (mostly for documentation purposes). I already invoke `unicode_to_url()` instead of `.encode('utf-8')` in every place. The two remaining hard-coded `'utf-8'` occurrences are for the alias file encoding. Hard-coded `'ascii'` is used for representing capabilities, it might be better to create `Capability` objects instead of using `str` but that's most likely out of the scope of this patch. > Some of them appear to be wrong to me -- aliases files shouldn't be > hard-coded to utf-8... Oh wait, yes they should, right? Then please > add doc stating so. Finally the encoding for stdout in > tahoe_manifest.py definitely seems wrong to be hard-coded to 'utf-8'. > Add a unit test showing that the current code with the hard-coded > 'utf-8' fails on a system where stdout needs a different encoding. Yes, the alias file is read and written as UTF-8. A note was added in `frontends/CLI.txt`. > In test_cli.py, if the `test_listdir_unicode_bad()` test doesn't > make sense on the current platform, then raise `SkipTest` instead > of returning. Done. > Finally, use a code coverage tool (I recommend > [http://pypi.python.org/pypi/trialcoverage trialcoverage]) to see if > any of the lines changed by these patches are not-fully-tested by > these tests. I had trouble making use of `test-coverage` Makefile target, so I finally used `coverage.py` directly.

francois commented

2010-05-17 08:07:19 +00:00

Author

Owner

Attachment unicode-helper-functions-v2.diff (12927 bytes) added

**Attachment** unicode-helper-functions-v2.diff (12927 bytes) added

unicode-helper-functions-v2.diff

13 KiB

francois commented

2010-05-17 08:07:44 +00:00

Author

Owner

Attachment unicode-filenames-handling-v2.diff (30979 bytes) added

**Attachment** unicode-filenames-handling-v2.diff (30979 bytes) added

unicode-filenames-handling-v2.diff

30 KiB

zooko commented

2010-05-17 14:35:46 +00:00

Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.)

Can you (or have you already) written a unit test which exercises the fallback: except [LookupError](wiki/LookupError)?: return s.encode('utf-8', 'replace')?

As we discussed on IRC, this version doesn't do any lossy decoding from local filesystem into unicode -- if a filename can't be decoded with the declared sys.getfilesystemencoding() then an exception is raised. Is this exercised by a unit test?

I can't think of any other issues -- we just need someone to review the latest patch.

Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.) Can you (or have you already) written a unit test which exercises the fallback: `except [LookupError](wiki/LookupError)?: return s.encode('utf-8', 'replace')`? As we discussed on IRC, this version doesn't do any lossy decoding from local filesystem into unicode -- if a filename can't be decoded with the declared `sys.getfilesystemencoding()` then an exception is raised. Is this exercised by a unit test? I can't think of any other issues -- we just need someone to review the latest patch.

francois commented

2010-05-17 23:14:23 +00:00

Author

Owner

Replying to zooko:

Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.)

Shouldn't capabilities be defined as plain ASCII?

Can you (or have you already) written a unit test which exercises the fallback: except [LookupError](wiki/LookupError)?: return s.encode('utf-8', 'replace')?

Yes, in test_stringutils.py line 70:

  @patch('sys.stdout') 
  def test_unicode_to_stdout(self, mock): 
    # Encoding koi8-r cannot represent 'è' 
    mock.encoding = 'koi8-r' 
    self.failUnlessEqual(unicode_to_stdout(u'lumière'), 'lumi?re')

As we discussed on IRC, this version doesn't do any lossy decoding from local filesystem into unicode -- if a filename can't be decoded with the declared sys.getfilesystemencoding() then an exception is raised. Is this exercised by a unit test?

Yes, line 98 of test_stringutils.py:

  @patch('sys.getfilesystemencoding') 
    def test_open_unicode(self, mock): 
    mock.return_value = 'ascii' 
  
    self.failUnlessRaises(FilenameEncodingError, 
                          open_unicode, 
                          u'lumière')

I can't think of any other issues -- we just need someone to review the latest patch.

Replying to [zooko](/tahoe-lafs/trac/issues/534#issuecomment-369056): > Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.) Shouldn't capabilities be defined as plain ASCII? > Can you (or have you already) written a unit test which exercises the fallback: `except [LookupError](wiki/LookupError)?: return s.encode('utf-8', 'replace')`? Yes, in `test_stringutils.py` line 70: ``` @patch('sys.stdout') def test_unicode_to_stdout(self, mock): # Encoding koi8-r cannot represent 'è' mock.encoding = 'koi8-r' self.failUnlessEqual(unicode_to_stdout(u'lumière'), 'lumi?re') ``` > As we discussed on IRC, this version doesn't do any lossy decoding from local filesystem into unicode -- if a filename can't be decoded with the declared `sys.getfilesystemencoding()` then an exception is raised. Is this exercised by a unit test? Yes, line 98 of `test_stringutils.py`: ``` @patch('sys.getfilesystemencoding') def test_open_unicode(self, mock): mock.return_value = 'ascii' self.failUnlessRaises(FilenameEncodingError, open_unicode, u'lumière') ``` > I can't think of any other issues -- we just need someone to review the latest patch.

zooko commented

2010-05-18 03:15:10 +00:00

Replying to [francois]comment:93:

Replying to zooko:

Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.)

Shouldn't capabilities be defined as plain ASCII?

Future caps might include non-ascii characters. And some of the current code might be able to process caps of a future version. (Forward-compatibility) There's no reason not to define the current ones as utf-8 is there?

Replying to [francois]comment:93: > Replying to [zooko](/tahoe-lafs/trac/issues/534#issuecomment-369056): > > > Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.) > > Shouldn't capabilities be defined as plain ASCII? Future caps might include non-ascii characters. And some of the current code might be able to process caps of a future version. (Forward-compatibility) There's no reason not to define the current ones as utf-8 is there?

zooko commented

2010-05-18 12:57:28 +00:00

USSJoin and drewp had this to say on IRC:

<USSJoin> Zooko: Still need someone to look at that?		        [22:19]
<drewp> that does -not- look like anything i'd commit 10 hours before a
	release (unless it has already had a long testing period). Maybe with
	an env var guard, i would
<drewp> oh i see, it's had 1+ year of work already		        [22:20]
<davidsarah> the latest patches haven't, though			        [22:21]
<davidsarah> the approach changed quite a bit
<USSJoin> Zooko, Oh Hudson Master: Do the tests run? :-)	        [22:23]
<USSJoin> Zooko: Only other thing I'd add is that I'm concerned about the
	  preconditions— does that library fail more gracefully than a simple
	  assert()? If not, the whole pancake house is going to fall down,
	  very very hard.
<USSJoin> Other than that, it's been incredibly well-harassed by the whole
	  Tahoe team, I see. So I'd +1.				        [22:27]

USSJoin and drewp had this to say on IRC: ``` <USSJoin> Zooko: Still need someone to look at that? [22:19] <drewp> that does -not- look like anything i'd commit 10 hours before a release (unless it has already had a long testing period). Maybe with an env var guard, i would <drewp> oh i see, it's had 1+ year of work already [22:20] <davidsarah> the latest patches haven't, though [22:21] <davidsarah> the approach changed quite a bit <USSJoin> Zooko, Oh Hudson Master: Do the tests run? :-) [22:23] <USSJoin> Zooko: Only other thing I'd add is that I'm concerned about the preconditions— does that library fail more gracefully than a simple assert()? If not, the whole pancake house is going to fall down, very very hard. <USSJoin> Other than that, it's been incredibly well-harassed by the whole Tahoe team, I see. So I'd +1. [22:27] ```

francois commented

2010-05-18 23:19:29 +00:00

Author

Owner

Replying to [zooko]comment:94:

Future caps might include non-ascii characters. And some of the current code might be able to process caps of a future version. (Forward-compatibility) There's no reason not to define the current ones as utf-8 is there?

Ok, I see your point, I'll modify the two .encode('ascii') into .encode('utf-8') into file src/allmydata/scripts/common.py and write a unit test which add an alias to capability from the future ('fi-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧迴') as proposed and tries to get it back.

Replying to [zooko]comment:94: > Future caps might include non-ascii characters. And some of the current code might be able to process caps of a future version. (Forward-compatibility) There's no reason not to define the current ones as utf-8 is there? Ok, I see your point, I'll modify the two `.encode('ascii')` into `.encode('utf-8')` into file `src/allmydata/scripts/common.py` and write a unit test which add an alias to capability from the future ('fi-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧迴') as proposed and tries to get it back.

francois commented

2010-05-18 23:26:26 +00:00

Author

Owner

Attachment unicode-filenames-handling-v3.2.diff (30969 bytes) added

**Attachment** unicode-filenames-handling-v3.2.diff (30969 bytes) added

unicode-filenames-handling-v3.2.diff

30 KiB

francois commented

2010-05-19 00:41:41 +00:00

Author

Owner

Attachment additionnal-tests.diff (1127 bytes) added

**Attachment** additionnal-tests.diff (1127 bytes) added

additionnal-tests.diff

1.1 KiB

francois commented

2010-05-19 00:52:37 +00:00

Author

Owner

Replying to zooko:

Zooko: Only other thing I'd add is that I'm concerned about the
preconditions— does that library fail more gracefully than a simple
assert()? If not, the whole pancake house is going to fall down,
very very hard.

The preconditions are only there for extra safety to prevent programmers from using the library in unexpected ways. In normal usage, they should not be triggered unless because of a bug.

The only user visible exception for now is FilenameEncodingError. It will probably get catched and logged in the future to avoid breaking a complete backup session because of a single badly encoded file.

Replying to [zooko](/tahoe-lafs/trac/issues/534#issuecomment-369059): > <USSJoin> Zooko: Only other thing I'd add is that I'm concerned about the > preconditions— does that library fail more gracefully than a simple > assert()? If not, the whole pancake house is going to fall down, > very very hard. The preconditions are only there for extra safety to prevent programmers from using the library in unexpected ways. In normal usage, they should not be triggered unless because of a bug. The only user visible exception for now is `FilenameEncodingError`. It will probably get catched and logged in the future to avoid breaking a complete backup session because of a single badly encoded file.

francois commented

2010-05-19 00:56:31 +00:00

Author

Owner

Ok, those are the patches which should be applied to trunk if the review is considered sufficient enough.

Ok, those are the patches which should be applied to trunk if the review is considered sufficient enough. * [attachment:unicode-helper-functions-v2.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-71b849e66b19) * [attachment:unicode-filenames-handling-v3.2.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-99173ffe2bc7) * [attachment:additionnal-tests.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-29f19769b9d1)

tahoe-lafs modified the milestone from 1.8.0 to 1.7.0

2010-05-19 00:56:31 +00:00

zooko commented

2010-05-19 02:58:21 +00:00

This patch seems to be a little bit messed up by the "END OF DESCRIPTION" line being accidentally included in the description part, apparently because of whitespace at the beginning of the "END OF DESCRIPTION": http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/534/unicode-filenames-handling-v3.2.diff

This patch seems to be a little bit messed up by the "***END OF DESCRIPTION***" line being accidentally included in the description part, apparently because of whitespace at the beginning of the "***END OF DESCRIPTION***": <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/534/unicode-filenames-handling-v3.2.diff>

zooko commented

2010-05-19 03:39:39 +00:00

Also that patch doesn't apply cleanly to current trunk. Could you please attach your darcs patch (the result of darcs send -o darcs.patch.txt)?
http://tahoe-lafs.org/trac/tahoe-lafs/wiki/Patches

Also that patch doesn't apply cleanly to current trunk. Could you please attach your darcs patch (the result of `darcs send -o darcs.patch.txt`)? <http://tahoe-lafs.org/trac/tahoe-lafs/wiki/Patches>

zooko commented

2010-05-19 04:23:41 +00:00

Make that darcs send -o unicode-filenames-handling-v3.2.dpatch.

Make that `darcs send -o unicode-filenames-handling-v3.2.dpatch`.

zooko commented

2010-05-19 04:28:18 +00:00

I applied the patch attachment:unicode-helper-functions-v2.diff and ran the tests and got some failures:

Ran 42 tests in 0.155s

FAILED (expectedFailures=6, failures=2, unexpectedSuccesses=12, successes=22)

The failure were:

[FAIL]: allmydata.test.test_stringutils.StringUtilsErrors.test_listdir_unicode

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 89, in test_listdir_unicode
    u'/dummy')
twisted.trial.unittest.FailTest: <type 'exceptions.TypeError'> raised instead of FilenameEncodingError:
 Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 117, in maybeDeferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/utils.py", line 191, in runWithWarningsSuppressed
    result = f(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 89, in test_listdir_unicode
    u'/dummy')
--- <exception caught here> ---
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/trial/unittest.py", line 260, in failUnlessRaises
    result = f(*args, **kwargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/util/stringutils.py", line 117, in listdir_unicode
    return [unicodedata.normalize('NFC', fname) for fname in dirlist]
exceptions.TypeError: normalize() argument 2 must be unicode, not str

and

[FAIL]: allmydata.test.test_stringutils.StringUtilsErrors.test_open_unicode

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 104, in test_open_unicode
    u'lumière')
twisted.trial.unittest.FailTest: <type 'exceptions.IOError'> raised instead of FilenameEncodingError:
 Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 117, in maybeDeferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/utils.py", line 191, in runWithWarningsSuppressed
    result = f(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 104, in test_open_unicode
    u'lumière')
--- <exception caught here> ---
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/trial/unittest.py", line 260, in failUnlessRaises
    result = f(*args, **kwargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/util/stringutils.py", line 128, in open_unicode
    return open(path, mode)
exceptions.IOError: [Errno 2] No such file or directory: u'lumi\xe8re'

This is on Mac OS 10.4.11.

I applied the patch [attachment:unicode-helper-functions-v2.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-71b849e66b19) and ran the tests and got some failures: ``` Ran 42 tests in 0.155s FAILED (expectedFailures=6, failures=2, unexpectedSuccesses=12, successes=22) ``` The failure were: ``` [FAIL]: allmydata.test.test_stringutils.StringUtilsErrors.test_listdir_unicode Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched return func(*args, **keywargs) File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 89, in test_listdir_unicode u'/dummy') twisted.trial.unittest.FailTest: <type 'exceptions.TypeError'> raised instead of FilenameEncodingError: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 117, in maybeDeferred result = f(*args, **kw) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/utils.py", line 191, in runWithWarningsSuppressed result = f(*a, **kw) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched return func(*args, **keywargs) File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 89, in test_listdir_unicode u'/dummy') --- <exception caught here> --- File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/trial/unittest.py", line 260, in failUnlessRaises result = f(*args, **kwargs) File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/util/stringutils.py", line 117, in listdir_unicode return [unicodedata.normalize('NFC', fname) for fname in dirlist] exceptions.TypeError: normalize() argument 2 must be unicode, not str ``` and ``` [FAIL]: allmydata.test.test_stringutils.StringUtilsErrors.test_open_unicode Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched return func(*args, **keywargs) File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 104, in test_open_unicode u'lumière') twisted.trial.unittest.FailTest: <type 'exceptions.IOError'> raised instead of FilenameEncodingError: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 117, in maybeDeferred result = f(*args, **kw) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/utils.py", line 191, in runWithWarningsSuppressed result = f(*a, **kw) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched return func(*args, **keywargs) File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 104, in test_open_unicode u'lumière') --- <exception caught here> --- File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/trial/unittest.py", line 260, in failUnlessRaises result = f(*args, **kwargs) File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/util/stringutils.py", line 128, in open_unicode return open(path, mode) exceptions.IOError: [Errno 2] No such file or directory: u'lumi\xe8re' ``` This is on Mac OS 10.4.11.

zooko commented

2010-05-19 04:48:48 +00:00

I applied attachment:additionnal-tests.diff and ran the tests and got a failure:

allmydata.test.test_cli
  CreateAlias
    test_create ...                                                     [ERROR]

===============================================================================
[ERROR]: allmydata.test.test_cli.CreateAlias.test_create

Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_cli.py", line 600, in <lambda>
    get_aliases(self.get_clientdir())[u"études"] + "/lumière.txt"))
exceptions.KeyError: u'\xe9tudes'
-------------------------------------------------------------------------------
Ran 1 tests in 1.064s

FAILED (errors=1)

I applied [attachment:additionnal-tests.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-29f19769b9d1) and ran the tests and got a failure: ``` allmydata.test.test_cli CreateAlias test_create ... [ERROR] =============================================================================== [ERROR]: allmydata.test.test_cli.CreateAlias.test_create Traceback (most recent call last): File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_cli.py", line 600, in <lambda> get_aliases(self.get_clientdir())[u"études"] + "/lumière.txt")) exceptions.KeyError: u'\xe9tudes' ------------------------------------------------------------------------------- Ran 1 tests in 1.064s FAILED (errors=1) ```

zooko commented

2010-05-19 06:30:16 +00:00

The issue of caps with non-ASCII chars has been spun out into #1051 (capabilities from the future could have non-ascii characters).

zooko commented

2010-05-19 07:02:35 +00:00

Replying to francois:

Ok, those are the patches which should be applied to trunk if the review is considered sufficient enough.

attachment:unicode-helper-functions-v2.diff

attachment:unicode-filenames-handling-v3.2.diff

attachment:additionnal-tests.diff

Reviewing these three now.

Some tests fail on my Mac (see comment:369067 and comment:369068). Can we use the magic of mockery to make the same tests fail in the same way on every platform where they are run? (Oh yeah and also then fix them. :-))
In unicode_to_stdout() I think that if we get UnicodeEncodeError then we should add the 'replace' flag but using the same encoding (the return value of get_term_encoding()), but we should not catch and try to handle LookupError. Oh, in fact that means that we should omit the try: … except: block entirely and just do return s.encode(get_term_encoding(), 'replace'). What do you think? If you think we should do something different then construct a unit test that we fail unless we do something different. :-)
Consider this code:

        # Use unicode strings when calling os functions 
        if sys.getfilesystemencoding() == "ANSI_X3.4-1968": 
            fn1 = os.path.join(self.basedir, u"Artonwall") 
        else: 
            fn1 = os.path.join(self.basedir, u"Ärtonwall")

Is this if statement obsolete now that we have proper mockery-based tests? Or is do_cli() a higher layer of functionality which should be tested in addition to the current tests? Or should do_cli() itself get mockerified so that it will run the same on all platforms? (Maybe it is good to test the actual non-mockerified underlying platform sometimes?) What is the intent of this if -- to test how do_cli() handles non-ASCII chars on those filesystems which can handle them? If that is the case then add a comment to that effect and make the test actually test whether the underlying system can encode Ä rather than whether the underlying system is ANSI_X3.4-1968. (For example, I guess a Japanese system in SHIFT-JIS isn't ANSI_X3.4-1968 and also can't represent Ä.

Way to go on adding a test whose sole job is to go red if we remove the call to unicodedata.normalize(). I just went and removed that call to see that test go red and finally gained a dim understanding of what that call is for. :-)

Replying to [francois](/tahoe-lafs/trac/issues/534#issuecomment-369062): > Ok, those are the patches which should be applied to trunk if the review is considered sufficient enough. > > * [attachment:unicode-helper-functions-v2.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-71b849e66b19) > * [attachment:unicode-filenames-handling-v3.2.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-99173ffe2bc7) > * [attachment:additionnal-tests.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-29f19769b9d1) Reviewing these three now. * Some tests fail on my Mac (see [comment:369067](/tahoe-lafs/trac/issues/534#issuecomment-369067) and [comment:369068](/tahoe-lafs/trac/issues/534#issuecomment-369068)). Can we use the magic of mockery to make the same tests fail in the same way on every platform where they are run? (Oh yeah and also then fix them. :-)) * In `unicode_to_stdout()` I think that if we get `UnicodeEncodeError` then we should add the `'replace'` flag but using the same encoding (the return value of `get_term_encoding()`), but we should not catch and try to handle `LookupError`. Oh, in fact that means that we should omit the `try: … except:` block entirely and just do `return s.encode(get_term_encoding(), 'replace')`. What do you think? If you think we should do something different then construct a unit test that we fail unless we do something different. :-) * Consider this code: ``` # Use unicode strings when calling os functions if sys.getfilesystemencoding() == "ANSI_X3.4-1968": fn1 = os.path.join(self.basedir, u"Artonwall") else: fn1 = os.path.join(self.basedir, u"Ärtonwall") ``` Is this `if` statement obsolete now that we have proper mockery-based tests? Or is `do_cli()` a higher layer of functionality which should be tested in addition to the current tests? Or should `do_cli()` itself get mockerified so that it will run the same on all platforms? (Maybe it is good to test the actual non-mockerified underlying platform sometimes?) What is the intent of this `if` -- to test how `do_cli()` handles non-ASCII chars on those filesystems which can handle them? If that is the case then add a comment to that effect and make the test actually test whether the underlying system can encode `Ä` rather than whether the underlying system is `ANSI_X3.4-1968`. (For example, I guess a Japanese system in SHIFT-JIS isn't `ANSI_X3.4-1968` and also can't represent `Ä`. * Way to go on adding a test whose sole job is to go red if we remove the call to `unicodedata.normalize()`. I just went and removed that call to see that test go red and finally gained a dim understanding of what that call is for. :-)

zooko commented

2010-05-19 07:03:44 +00:00

Unsetting the "review-needed" flag until François addresses the comments so far.

zooko commented

2010-05-19 07:14:07 +00:00

To test this feature on your system apply François's most recent patches (from comment:369062 unless he has posted new ones since then) and try to upload, download, move, rename, and backup files with non-ASCII characters in their names. Also directories. Use the WUI, the CLI, the WAPI, or tahoe backup as you prefer. For bonus points try installing the SFTP patches from #1037 and test filenames with non-ASCII characters through SFTP, sshfs, Nautilus, etc. :-)

To test this feature on your system apply François's most recent patches (from [comment:369062](/tahoe-lafs/trac/issues/534#issuecomment-369062) unless he has posted new ones since then) and try to upload, download, move, rename, and backup files with non-ASCII characters in their names. Also directories. Use the WUI, the CLI, the WAPI, or `tahoe backup` as you prefer. For bonus points try installing the SFTP patches from #1037 and test filenames with non-ASCII characters through SFTP, sshfs, Nautilus, etc. :-)

freestorm commented

2010-05-19 23:37:53 +00:00

Author

Owner

I did some tests on my Windows Box.

The directory TEST contain this filename (in one line):

LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñÑªº¿®¬½¼¡«»ÁÂÀ©¢¥ãÃ¤ðÐÊËÈÍÎÏ¦ÌÓßÔÒõÕµþÞÚÛÙýÝ¯´±¾¶§÷¸°¨·¹³² _1.txt

Test 1:

D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe cp --recursive TEST ROOT:test_534

Success: files copied

=> I copied back this file on my disk and the md5sum was the same as the source file.

Test 2:

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir ROOT:à_testé

URI:DIR2:wskq3uymjffqq6zidua7xrc3oq:crnp2xhec76wwjwghx2ddae5nenxv365id7nx56tepwdzwlet3ha
=> no errors,but next test fails.

Test 3:

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:

Ó_testÚBR
''=> '''Ó'''_test'''Ú''' <> '''à'''_test'''é'''BR
''PS: same output with the WUI''

'''Test 4:'''BR

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir "ROOT:LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñÑªº¿®¬½¼¡«»ÁÂÀ©¢¥ãÃ¤ðÐÊËÈÍÎÏ¦ÌÓßÔÒõÕµþÞÚÛÙýÝ¯´±¾¶§÷
¸°¨·¹³²"BR
URI:DIR2:64cxbqxkd2rj4lxpuragkltrpu:5rjnyv3cnvdavg7uprmkb2f2fvncasdi6t6k3ert3t5htbpffaca

=> no errors,but next test fails.

Test 5:

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:

LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~'''Ã³ÚÔõÓÕþÛÙÞ´¯ý─┼╔µã¶÷‗¹¨ Í▄°úÏÎßÝ¾·±Ð¬║┐«¼¢╝í½╗┴┬└®óÑÒ├ñð╩╦╚═╬¤ª╠Ë▀ÈÊ§ıÁ■Ì┌█┘²¦»┤¡▒¥Âº¸©░¿À╣│▓'''BR

''=> result is the same as test 3, not the same name as source.''BR

'''test 6:'''BR

D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe backup TEST ROOT:test_534

1 files uploaded (0 reused), 0 files skipped, 1 directories created (0 reused), 0 directories skipped backup done, elapsed time: 0:00:15BR

'''Test 7:'''BR
D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:test_534/Archives/2010-05-19_22:08:36Z

LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñÑªº¿®¬½¼¡«»ÁÂÀ©¢¥ãÃ¤ðÐÊËÈÍÎÏ¦ÌÓßÔÒõÕµþÞÚÛÙýÝ¯´±¾¶§÷¸°¨·¹³² _1.txt

=> the filename is the same as source.

Fred

I did some tests on my Windows Box. The directory TEST contain this filename (in one line): `LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñÑªº¿®¬½¼¡«»ÁÂÀ©¢¥ãÃ¤ðÐÊËÈÍÎÏ¦ÌÓßÔÒõÕµþÞÚÛÙýÝ¯´±¾¶§÷¸°¨·¹³² _1.txt` **Test 1:** D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe cp --recursive TEST ROOT:test_534 `Success: files copied` *=> I copied back this file on my disk and the md5sum was the same as the source file.* **Test 2:** D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir ROOT:à_testé `URI:DIR2:wskq3uymjffqq6zidua7xrc3oq:crnp2xhec76wwjwghx2ddae5nenxv365id7nx56tepwdzwlet3ha` *=> no errors,but next test fails.* **Test 3:** D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT: `Ó_testÚ`[[BR]] ''=> '''Ó'''_test'''Ú''' <> '''à'''_test'''é'''[[BR]] ''PS: same output with the WUI'' '''Test 4:'''[[BR]] D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir "ROOT:LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñÑªº¿®¬½¼¡«»ÁÂÀ©¢¥ãÃ¤ðÐÊËÈÍÎÏ¦ÌÓßÔÒõÕµþÞÚÛÙýÝ¯´±¾¶§÷ ¸°¨·¹³²"[[BR]] `URI:DIR2:64cxbqxkd2rj4lxpuragkltrpu:5rjnyv3cnvdavg7uprmkb2f2fvncasdi6t6k3ert3t5htbpffaca` *=> no errors,but next test fails.* **Test 5:** D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT: `LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~'''Ã³ÚÔõÓÕþÛÙÞ´¯ý─┼╔µã¶÷‗¹¨ Í▄°úÏÎßÝ¾·±Ð¬║┐«¼¢╝í½╗┴┬└®óÑÒ├ñð╩╦╚═╬¤ª╠Ë▀ÈÊ§ıÁ■Ì┌█┘²¦»┤¡▒¥Âº¸©░¿À╣│▓'''`[[BR]] ''=> result is the same as test 3, not the same name as source.''[[BR]] '''test 6:'''[[BR]] D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe backup TEST ROOT:test_534 `1 files uploaded (0 reused), 0 files skipped, 1 directories created (0 reused), 0 directories skipped backup done, elapsed time: 0:00:15`[[BR]] '''Test 7:'''[[BR]] D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:test_534/Archives/2010-05-19_22:08:36Z `LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñÑªº¿®¬½¼¡«»ÁÂÀ©¢¥ãÃ¤ðÐÊËÈÍÎÏ¦ÌÓßÔÒõÕµþÞÚÛÙýÝ¯´±¾¶§÷¸°¨·¹³² _1.txt` *=> the filename is the same as source.* Fred

francois commented

2010-05-20 00:43:46 +00:00

Author

Owner

Replying to [zooko]comment:105:

Some tests fail on my Mac (see comment:369067 and comment:369068). Can we
use the magic of mockery to make the same tests fail in the same way
on every platform where they are run? (Oh yeah and also then fix them.
:-))

IMHO, tests in test_cli.py should exercise the actual usage of the
CLI tools and thus should not use mocking at all. Whereas, tests
test_stringutils.py exercise the stringutils ''library'' and use
mocking to simulate all possible platforms.

In unicode_to_stdout() I think that if we get
UnicodeEncodeError then we should add the 'replace' flag
but using the same encoding (the return value of
get_term_encoding()), but we should not catch and try to handle
LookupError. Oh, in fact that means that we should omit the
try: … except: block entirely and just do return s.encode(get_term_encoding(), 'replace'). What do you think? If
you think we should do something different then construct a unit test
that we fail unless we do something different. :-)

You're right, this exception handling is not very pretty. I chose your
proposition and modified get_term_encoding() to return
'ascii' for sys.stdout.encoding value is None. A new
unit test has been written as well.

def get_term_encoding():
    """
    Returns expected encoding for writing to the terminal and reading
    arguments from the command-line.
    """

    if sys.stdout.encoding == None:
        return 'ascii'
    else:
        return sys.stdout.encoding

Consider this code:
        # Use unicode strings when calling os functions 
        if sys.getfilesystemencoding() == "ANSI_X3.4-1968": 
            fn1 = os.path.join(self.basedir, u"Artonwall") 
        else: 
            fn1 = os.path.join(self.basedir, u"Ärtonwall") 
Is this if statement obsolete now that we have proper
mockery-based tests? Or is do_cli() a higher layer of
functionality which should be tested in addition to the current tests?
Or should do_cli() itself get mockerified so that it will run
the same on all platforms? (Maybe it is good to test the actual
non-mockerified underlying platform sometimes?) What is the intent of
this if -- to test how do_cli() handles non-ASCII chars on
those filesystems which can handle them? If that is the case then add
a comment to that effect and make the test actually test whether the
underlying system can encode Ä rather than whether the
underlying system is ANSI_X3.4-1968. (For example, I guess a
Japanese system in SHIFT-JIS isn't ANSI_X3.4-1968 and also can't
represent Ä.

Yes, this statement is obsolete and has been removed. I'd say that the
test should be skipped if it is not possible to write abritrary
filenames on the filesystem or pass arbitrary arguments on the command
line on the platform on which the tests are being run.

Replying to [zooko]comment:105: > Some tests fail on my Mac (see [comment:369067](/tahoe-lafs/trac/issues/534#issuecomment-369067) and [comment:369068](/tahoe-lafs/trac/issues/534#issuecomment-369068)). Can we > use the magic of mockery to make the same tests fail in the same way > on every platform where they are run? (Oh yeah and also then fix them. > :-)) IMHO, tests in `test_cli.py` should exercise the actual usage of the CLI tools and thus should not use mocking at all. Whereas, tests `test_stringutils.py` exercise the stringutils ''library'' and use mocking to simulate all possible platforms. > In `unicode_to_stdout()` I think that if we get > `UnicodeEncodeError` then we should add the `'replace'` flag > but using the same encoding (the return value of > `get_term_encoding()`), but we should not catch and try to handle > `LookupError`. Oh, in fact that means that we should omit the > `try: … except:` block entirely and just do ```return > s.encode(get_term_encoding(), 'replace')```. What do you think? If > you think we should do something different then construct a unit test > that we fail unless we do something different. :-) You're right, this exception handling is not very pretty. I chose your proposition and modified `get_term_encoding()` to return `'ascii'` for `sys.stdout.encoding` value is `None`. A new unit test has been written as well. ``` def get_term_encoding(): """ Returns expected encoding for writing to the terminal and reading arguments from the command-line. """ if sys.stdout.encoding == None: return 'ascii' else: return sys.stdout.encoding ``` > Consider this code: > ``` > # Use unicode strings when calling os functions > if sys.getfilesystemencoding() == "ANSI_X3.4-1968": > fn1 = os.path.join(self.basedir, u"Artonwall") > else: > fn1 = os.path.join(self.basedir, u"Ärtonwall") > ``` > Is this `if` statement obsolete now that we have proper > mockery-based tests? Or is `do_cli()` a higher layer of > functionality which should be tested in addition to the current tests? > Or should `do_cli()` itself get mockerified so that it will run > the same on all platforms? (Maybe it is good to test the actual > non-mockerified underlying platform sometimes?) What is the intent of > this `if` -- to test how `do_cli()` handles non-ASCII chars on > those filesystems which can handle them? If that is the case then add > a comment to that effect and make the test actually test whether the > underlying system can encode `Ä` rather than whether the > underlying system is `ANSI_X3.4-1968`. (For example, I guess a > Japanese system in SHIFT-JIS isn't `ANSI_X3.4-1968` and also can't > represent `Ä`. Yes, this statement is obsolete and has been removed. I'd say that the test should be skipped if it is not possible to write abritrary filenames on the filesystem or pass arbitrary arguments on the command line on the platform on which the tests are being run.

francois commented

2010-05-20 00:54:48 +00:00

Author

Owner

Attachment unicode-helper-functions-v4.diff (13891 bytes) added

**Attachment** unicode-helper-functions-v4.diff (13891 bytes) added

unicode-helper-functions-v4.diff

14 KiB

francois commented

2010-05-20 00:55:07 +00:00

Author

Owner

Attachment unicode-filenames-handling-v4.diff (32422 bytes) added

**Attachment** unicode-filenames-handling-v4.diff (32422 bytes) added

unicode-filenames-handling-v4.diff

32 KiB

francois commented

2010-05-20 00:55:19 +00:00

Author

Owner

Attachment unicode-bundle-v4.darcspatch (61069 bytes) added

**Attachment** unicode-bundle-v4.darcspatch (61069 bytes) added

unicode-bundle-v4.darcspatch

60 KiB

francois commented

2010-05-20 01:00:14 +00:00

Author

Owner

The latest patches which take comment comment:109 into account have been uploaded. They run without failure under Ubuntu 10.04 and Windows XP.

And a darcs patch which correctly apply to trunk and contains the two previous patches and a third one which add a dependency on mock library (#1016).

attachment:unicode-bundle-v4.darcspatch

The latest patches which take comment comment:109 into account have been uploaded. They run without failure under Ubuntu 10.04 and Windows XP. * [attachment:unicode-helper-functions-v4.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-9b21e643c9ce) * [attachment:unicode-filenames-handling-v4.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-2a0176d9bbfd) And a darcs patch which correctly apply to trunk and contains the two previous patches and a third one which add a dependency on `mock` library (#1016). * [attachment:unicode-bundle-v4.darcspatch](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-e44c31327a24)

zooko commented

2010-05-20 05:31:15 +00:00

The latest attachment:unicode-bundle-v4.darcspatch still fails unit test on my Mac. It occurred to me that I can program the buildbots to run your branch and report! This could be useful! Here is your branch on a new trac instance:

http://tahoe-lafs.org/trac/tahoe-lafs-ticket534/log/

and here is the result of me forcing all the buildslaves to try to build and test your branch at 22:12:09 just now:

http://tahoe-lafs.org/buildbot/waterfall?show_events=true&branch=ticket534&reload=none

:-)

Now you can see how your branch fares on all our buildslaves!

Send me your ssh pub key and I'll set it up so that you can push patches directly to that branch on http://tahoe-lafs.org and then push a button on the buildbot web page to force all buildslaves to build your latest patches.

The latest [attachment:unicode-bundle-v4.darcspatch](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-e44c31327a24) still fails unit test on my Mac. It occurred to me that I can program the buildbots to run your branch and report! This could be useful! Here is your branch on a new trac instance: <http://tahoe-lafs.org/trac/tahoe-lafs-ticket534/log/> and here is the result of me forcing all the buildslaves to try to build and test your branch at 22:12:09 just now: <http://tahoe-lafs.org/buildbot/waterfall?show_events=true&branch=ticket534&reload=none> :-) Now you can see how your branch fares on all our buildslaves! Send me your ssh pub key and I'll set it up so that you can push patches directly to that branch on <http://tahoe-lafs.org> and then push a button on the buildbot web page to force all buildslaves to build your latest patches.

francois commented

2010-05-22 14:07:43 +00:00

Author

Owner

Ok, all tests now pass on all supported platforms or are skipped if irrelevant to the platform.

So, please review the 7 new patches which have been added after unicode-helper-functions-v4.diff and unicode-filenames-handling-v4.diff.

Ok, all tests now pass on all supported platforms or are skipped if irrelevant to the platform. So, please review the [7 new patches](http://tahoe-lafs.org/trac/tahoe-lafs-ticket534/log/?action=stop_on_copy&mode=stop_on_copy&rev=4329&stop_rev=4319&limit=100) which have been added after [unicode-helper-functions-v4.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-9b21e643c9ce) and [unicode-filenames-handling-v4.diff](/tahoe-lafs/trac/attachments/000078ac-ae73-29c9-79b3-2a0176d9bbfd).

francois commented

2010-06-04 22:39:08 +00:00

Author

Owner

The patches were commited in trunk by Zooko: changeset:b2542b87085be095 -> changeset:0eb4d83937a97094.

I'm now reviewing the three patches for this functionnality which were added by Zooko.

About patch changeset:442008a6905d7374, the change in {tahoe_manifest.py} makes sense.

However, I don't agree with the changes made in test_cli.py. Because calling method self.do_cli with UTF-8 encoded arguments only makes sense of the platform on which the test is run does also encodes CLI arguments in UTF-8. This assumption is wrong in all other
cases, such as terminals which only support ASCII characters. This is probably why the tests are failing on many buildbots.

About patch changeset:db8a6f3aa6d78122, using locale.getpreferredencoding()}} instead ```sys.stdout.encoding looks like a good idea. Changes in functions skip_non_unicode_fs and skip_non_unicode_stdout looks also good. Same comment than before on the usage of self.do_cli.

Patch changeset:5bcca5151eb897e6, Hey this one is easy, OK ;)

The patches were commited in trunk by Zooko: changeset:b2542b87085be095 -> changeset:0eb4d83937a97094. I'm now reviewing the three patches for this functionnality which were added by Zooko. About patch changeset:442008a6905d7374, the change in {tahoe_manifest.py} makes sense. However, I don't agree with the changes made in `test_cli.py`. Because calling method `self.do_cli` with UTF-8 encoded arguments only makes sense of the platform on which the test is run does also encodes CLI arguments in UTF-8. This assumption is wrong in all other cases, such as terminals which only support ASCII characters. This is probably why the tests are failing on many buildbots. About patch changeset:db8a6f3aa6d78122, using `locale.getpreferredencoding()}} instead ```sys.stdout.encoding` looks like a good idea. Changes in functions `skip_non_unicode_fs` and `skip_non_unicode_stdout` looks also good. Same comment than before on the usage of `self.do_cli`. Patch changeset:5bcca5151eb897e6, Hey this one is easy, OK ;)

zooko commented

2010-06-07 04:51:43 +00:00

David-Sarah added to the tests some nice extensions. They and Brian Warner and I chatted on IRC and agreed that ```tahoe ls}}] ought to report the fact that a filename was unrepresentable if it was, and which filename (as represented by an escaped/quoted representation such as \u1234 for each unicode code point). David-Sarah implemented that. David-Sarah posted their new patch here and I just reviewed it: http://allmydata.org/trac/tahoe-lafs-ticket534/changeset/4434

François: please by all means review this patch and any other new patches from David-Sarah on the topic of unicode!

Oh. I found one thing in my review: open_unicode() should not do anything that the builtin open() function doesn't already do. Fortunately now that we have such thorough tests we should be able to verify that by redefining it like this:

def open_unicode(filename, mode):
    return open(filename, mode)

and see if the tests pass.
In David-Sarah's latest patches, they added a call to os.path.expanduser() into open_unicode(). I think that putting expanduser() into some central location so that we'll never forget to use it makes sense, so perhaps we'll end up renaming open_unicode() to open_cu():

def open_xu(filename, mode):
    return open(os.path.expanduser(filename, mode))

David-Sarah added to the tests some nice extensions. They and Brian Warner and I chatted on IRC and agreed that ```tahoe ls}}] ought to report the fact that a filename was unrepresentable if it was, and which filename (as represented by an escaped/quoted representation such as \u1234 for each unicode code point). David-Sarah implemented that. David-Sarah posted their new patch here and I just reviewed it: <http://allmydata.org/trac/tahoe-lafs-ticket534/changeset/4434> François: please by all means review this patch and any other new patches from David-Sarah on the topic of unicode! Oh. I found one thing in my review: `open_unicode()` should not do anything that the builtin `open()` function doesn't already do. Fortunately now that we have such thorough tests we should be able to verify that by redefining it like this: ``` def open_unicode(filename, mode): return open(filename, mode) ``` and see if the tests pass. In David-Sarah's latest patches, they added a call to `os.path.expanduser()` into `open_unicode()`. I think that putting `expanduser()` into some central location so that we'll never forget to use it makes sense, so perhaps we'll end up renaming `open_unicode()` to `open_cu()`: ``` def open_xu(filename, mode): return open(os.path.expanduser(filename, mode)) ```

francois commented

2010-06-08 10:12:56 +00:00

Author

Owner

I'm currently reviewing changeset changeset:80252debcd94fc28.

In util/stringutils.py:

- def open_unicode(path, mode='r'): 
+ def open_unicode(path, mode):

Why shouldn't open_unicode behave just like open by default?

I see that a bunch of preconditions were left out, especially in to_str but that's hopefully not too dangerous.

I'm currently reviewing changeset changeset:80252debcd94fc28. In util/stringutils.py: ``` - def open_unicode(path, mode='r'): + def open_unicode(path, mode): ``` Why shouldn't `open_unicode` behave just like `open` by default? I see that a bunch of preconditions were left out, especially in `to_str` but that's hopefully not too dangerous.

francois commented

2010-06-08 10:52:20 +00:00

Author

Owner

OK, I reviewed changeset:65b6f4e3ce4cdf13, changeset:529c9f673a89f7fa, changeset:8b014372b1ab9460, changeset:731e3d68dff0f8c9 and changeset:3c883e6e44e82d1e.

I don't really understand why test_listdir_unicode_bad was removed in changeset:3c883e6e44e82d1e.

OK, I reviewed changeset:65b6f4e3ce4cdf13, changeset:529c9f673a89f7fa, changeset:8b014372b1ab9460, changeset:731e3d68dff0f8c9 and changeset:3c883e6e44e82d1e. I don't really understand why `test_listdir_unicode_bad` was removed in changeset:3c883e6e44e82d1e.

zooko commented

2010-06-09 04:22:53 +00:00

Attachment misc-update.dpatch (46461 bytes) added

**Attachment** misc-update.dpatch (46461 bytes) added

misc-update.dpatch

45 KiB

daira commented

2010-06-16 04:13:54 +00:00

Replying to francois:

I don't really understand why test_listdir_unicode_bad was removed in changeset:3c883e6e44e82d1e.

The test was failing on Mac OS X, which allows badly-encoded filenames to be written, but then lists them using %-escaping. This is reasonable, but couldn't have been predicted, and it's impossible to predict what behaviour other filesystems might have that would break the test. So Zooko and I decided that the test was invalid.

Note that the test_listdir_unicode test in StringUtilsNonUnicodePlatform mocks the return from os.listdir and checks that listdir_unicode raises a FilenameEncodingError when it should. We decided that this was a sufficient test, without actually trying to create the badly encoded filenames on the local filesystem.

Replying to [francois](/tahoe-lafs/trac/issues/534#issuecomment-369081): > I don't really understand why `test_listdir_unicode_bad` was removed in changeset:3c883e6e44e82d1e. The test was failing on Mac OS X, which allows badly-encoded filenames to be written, but then lists them using %-escaping. This is reasonable, but couldn't have been predicted, and it's impossible to predict what behaviour other filesystems might have that would break the test. So Zooko and I decided that the test was invalid. Note that the `test_listdir_unicode` test in `StringUtilsNonUnicodePlatform` mocks the return from `os.listdir` and checks that `listdir_unicode` raises a `FilenameEncodingError` when it should. We decided that this was a sufficient test, without actually trying to create the badly encoded filenames on the local filesystem.

daira commented

2010-06-16 04:17:26 +00:00

Replying to francois:

I'm currently reviewing changeset changeset:80252debcd94fc28.

In util/stringutils.py:
- def open_unicode(path, mode='r'): 
+ def open_unicode(path, mode): 
Why shouldn't open_unicode behave just like open by default?

We should never be opening files in text mode.

I see that a bunch of preconditions were left out, especially in to_str but that's hopefully not too dangerous.

to_str is intended to accept either a Unicode or (UTF-8) byte string. It is used to coerce strings from the output of simplejson.loads, which might be either.

Replying to [francois](/tahoe-lafs/trac/issues/534#issuecomment-369080): > I'm currently reviewing changeset changeset:80252debcd94fc28. > > In util/stringutils.py: > > ``` > - def open_unicode(path, mode='r'): > + def open_unicode(path, mode): > ``` > > Why shouldn't `open_unicode` behave just like `open` by default? We should never be opening files in text mode. > I see that a bunch of preconditions were left out, especially in `to_str` but that's hopefully not too dangerous. `to_str` is intended to accept either a Unicode or (UTF-8) byte string. It is used to coerce strings from the output of `simplejson.loads`, which might be either.

daira added the

r/fixed

label 2010-06-16 04:18:21 +00:00

daira closed this issue

2010-06-16 04:18:21 +00:00

daira commented

2010-06-16 04:19:38 +00:00

Fixed modulo the normalization issue in ticket #1076, which we intend to address for 1.7final.

Rows
Columns

CLI (e.g. tahoe cp) does not correctly handle or display Unicode file/directory names, either in arguments or the local filesystem #534