CLI (e.g. tahoe cp) does not correctly handle or display Unicode file/directory names, either in arguments or the local filesystem #534
"tahoe cp" doesn't correctly handle unicode filenames.
Steps to reproduce:
Attached patch seems to fix this issue.
Dear Francois:
Thank you very much for the bug report and patch. It looks good to me. Is there any chance you can write a unit test for this bug? The test module for the command-line functions is here:
http://allmydata.org/trac/tahoe/browser/src/allmydata/test/test_cli.py
It looks like there are tests for get and put but not for cp in there.
Please let me know. Thanks!
--Zooko
Attachment cp-encoding.patch (86547 bytes) added
Bugfix + test
This newer patch contains the bugfix itself and associated test in test_cli.py.
Thanks for your advice, Zooko!
Fixed by changeset:c1f639d230bb32a1 and changeset:5c0c5bfc81e1f0cb.
The cygwin, solaris, and yukyu-hardy buildslaves are all failing this test.
Also, the test should probably read back from those files (with a 'tahoe get') to make sure the cp actually worked (instead of perhaps failing silently).
The test is also failing on my personal workstation (linux, debian/sid), in the same way as on the other buildslaves:
Ok, I can reproduce this failing test in my environment (Ubuntu hardy, 64-bit, LC_ALL=en_US.UTF8). It is related to the locale environment settings.
Attachment fix-failing-test.2.patch (87347 bytes) added
This patch should hopefully fix failing test on systems with LC_ALL=C
Attachment credit.patch (86319 bytes) added
Add credit entry
Thanks, I'll apply those.
What do you think about the idea of using escaped binary strings instead of native UTF-8 in those filenames, i.e. removing the "coding=utf-8" line from the beginning of the file and using "\xc3\x84rtonwall" instead of embedding the A-with-umlaut in the string? I'm wondering if it might make our intentions clearer. At the moment we have a unicode name expressed as a non-unicode type (without the leading u", that string is just a bytestring), and the casual observer might not realize that we intended to get the binary strings into the stuff on disk.

Oops, Zooko must have beat me to it.. those patches were already applied.
But, the buildbot is still indicating test failures: see http://allmydata.org/buildbot/waterfall . This time they're on dapper and feisty. That's really weird, I don't know what would cause those to fail but not the others.
Replying to warner:
Yes, this would have been a good idea, but the encoding specification looks mandatory for both "\xc3\x84rtonwall" and "Ärtonwall".
Replying to warner:
Unfortunately, I'm not able to reproduce those failures on my dapper and feisty virtual machines. Can you please try running the attached test script (test-unicode.py) on your machines?
Attachment test-unicode.py (205 bytes) added
Test script for unicode filename issues
test-unicode.py prints "OK" on our dapper machine. I'm investigating in more detail why the unit test allmydata.test.test_cli.Cp.test_unicode_filename fails on that same machine and user.
Well I've gotten as far as determining that the "tahoe_get" spawned by source:src/allmydata/test/test_cli.py@20081114134458-c0148-c011dbf3b2ce743b0237e192c3f5ed33c1b81c49#L569 is getting a 404 result:
Now I need to read a bed-time story to a 7-year-old.
So I switched from working on the dapper buildslave to my own workstation -- yukyuk, running Ubuntu Hardy.
The first thing I see is that that the current trunk gives a coding error:
The next thing I see is that if I replace the unicode object with the string "\xc3\x84rtonwall" then the test never completes:
The next thing I notice is that if I replace it with the string "thisisjustastring" then the test still hangs.
So, it looks like the problems on yukyuk don't necessarily have anything to do with unicode, but it's just that the test_cli.Cp test is the first one that tries to construct a TLS connection, or something:
Okay, that problem on yukyuk was that I had deliberately introduced an uncaught exception into my system-wide installation of Twisted when testing something else. Please disregard all that.
So now to debug the unicode issue I'm moving back to our dapper buildslave..
So, as an extra datapoint, sys.getfilesystemencoding() on the two buildslaves that are failing reports "UTF-8". My local workstation (on which this test passes) reports "ANSI_X3.4-1968". However, on our gutsy and hardy buildslaves (which are also passing), it reports UTF-8 too.

Some differential analysis shows that one early sign of trouble is during the webapi t=set_children call that 'tahoe cp' does. This API uses the under-documented t=set_children call, which (from what I can tell) accepts a JSON-encoded POST body that has a mapping from child name to a description of that child (including the URI). I see a difference (between the working platform and the failing platform) in the JSON that is sent with this request.
The working platform (my workstation 'fluxx') sends:
The failing platform (our feisty buildslave) sends:
Which suggests that the failing platform is double-encoding the filename.
Still investigating.. so far this is pointing at the client-side code (in source:src/allmydata/scripts/tahoe_cp.py).
Zooko, Brian:
If you give me a copy of your virtual machine disk, I could have a look into this failing test. What do you think ?
That's a nice offer, Francois. I don't know how to get a copy of that vm image and I suspect that it would take hours to transfer over the net (or, heh, upload to a Tahoe grid).
I have an idea! I'll give you an account on the dapper buildslave. There. Please send me an ssh public key to zooko@zooko.com and I'll put it into francois@slave3.allmydata.com:.ssh/authorized_keys and then you should be able to log in!

I can see a difference in simplejson between the failing platform (dapper buildbot) and a working one (my workstation).
Test script:
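(The script itself was lost in the ticket migration; judging from Brian's later comment that it passed a bytestring with non-ASCII characters to simplejson.dumps, it was presumably a minimal sketch along these lines:)

```python
import simplejson

# "Ärtonwall" as a raw UTF-8 bytestring (a str, not a unicode object)
print simplejson.dumps("\xc3\x84rtonwall")
```

The working and failing platforms printed different JSON for this input, which is what pointed at the simplejson version.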
Working platform:
Failing platform:
However, in both cases having \u control sequences in str objects sounds weird.
What is the current policy in Tahoe regarding string handling? It looks like utf-8 encoded str objects are used in some places while unicode objects are used in others.
More on that later...
Oh, perhaps this has something to do with the version of simplejson!
The failing dapper has
2008-11-24 21:03:29.998Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.3, platform: Linux-Ubuntu_6.06-i686-32bit, simplejson: 1.8.1, pyopenssl: 0.7, setuptools: 0.7a1
The failing feisty2.5 has
2008-11-24 20:47:45.960Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.04-i686-32bit, simplejson: 1.4, pyopenssl: 0.6, setuptools: 0.7a1
Hm.
Passing buildslaves:
2008-11-24 20:53:04.924Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.18, python: 2.5.1, platform: CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:52:39.742Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4c1, platform: Linux-Ubuntu_6.10-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 09:54:08.473Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.1, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4, platform: Linux-debian_4.0-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:44:52.993Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.10-i686-32bit, simplejson: 1.7.1, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:51:45.037Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.1, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:52:14.527Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-Ubuntu_8.04-i686-32bit_ELF, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:46:17.288Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.4, zfec: 1.4.2, twisted: 8.0.1, nevow: 0.9.32, python: 2.4.3, platform: SunOS-5.11-i86pc-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:49:20.293Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Windows-XP-5.1.2600-SP2, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:46:38.028Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.5.1, platform: Darwin-9.5.0-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:45:18.952Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.4.1-2, twisted: 8.1.0, nevow: 0.9.31, python: 2.5.2, platform: Linux-Ubuntu_8.04-x86_64-64bit, simplejson: 2.0.3, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-19 23:36:30.603Z [-] tahoe versions: allmydata: 1.2.0-r3225, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-_-x86_64-64bit_ELF, simplejson: 2.0.4, pyopenssl: 0.8, setuptools: 0.7a1
Sigh. Forgot wiki markup
Yes, seems plausible.
I tried patching setup.py to require simplejson 2.0.5, but this fails because simplejson-1.8.1 is already installed system-wide. Would it be possible to remove this system-wide egg?
I can confirm that upgrading simplejson to version 2.0.5 does fix this bug on slave3. This was verified by installing a "clean" Python environment with "virtualenv --no-site-packages" after applying the attached patch.
Attachment require-simplejson.patch (92385 bytes) added
Require simplejson version >= 2.0.5
Regarding the versioning conflict, see #530 (use setuptools's --multi-version mode).
Thanks for the fix! I'll apply your patch post haste!
But wait a second, how is it that one of the successful buildbots was using simplejson 2.0.4 and another was using simplejson 1.7.1?
And one of the failing buildbots (dapper) was using simplejson 1.8.1?
Ok, changeset:5ebd7319822836ed, which raises the simplejson requirement from >= 1.4 to > 1.8.1, made the buildbots turn green. I don't really know why, but I'm closing this ticket as "fixed". Thanks a lot, Francois!
(P.S. Please pursue your earlier line of thought of "What exactly is the policy for handling strings within the Tahoe codebase?".)
For anybody who is investigating the history of which version of simplejson different versions of tahoe have required:
changeset:f76a81fda216e4fb
changeset:49dccbd4668a2b12
changeset:1f2e3fc912bb897e
changeset:b0dd88158a3e5b0f
changeset:44c73492704144e8
So, our policy for handling strings should be that we always know whether a given string is a unicode object or a UTF-8 encoded bytestring, and comments, variable names, and argument names should help us make this distinction.
I have a vague feeling that our code was somehow asking simplejson to do
something unreasonable.
In particular, that test code above (http://allmydata.org/trac/tahoe/ticket/534#comment:368988) is asking simplejson to take a bytestring with non-ASCII characters and JSONify it, which is an unreasonable thing to ask of it (JSON can encode unicode, but not arbitrary bytestrings). The input to simplejson.dumps() should always be unicode objects or bytestrings with only ASCII characters.
The tahoe_cp code must take responsibility for transforming the sys.argv
bytestring (which, hopefully, is UTF-8 encoded) into a unicode object before
doing anything else with it. When it is then put into the webapi set_children
body (which is defined to be a JSON-encoded dictionary), simplejson.dumps()
should be able to handle it. Likewise, if the sys.argv name gets used in a
URL (maybe for tahoe_cp, more likely for tahoe_put), it must be encoded into
UTF-8 and then URL-encoded.
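To make that concrete, here is a minimal sketch of that flow (hypothetical helper name and an illustrative child description, not the actual tahoe_cp code):

```python
import sys, urllib
import simplejson

def argv_to_unicode(arg):
    # On Linux we expect sys.argv to contain UTF-8 encoded bytestrings.
    return arg.decode('utf-8')

name = argv_to_unicode(sys.argv[1])

# For the JSON-encoded set_children body, hand simplejson a unicode
# object, never a non-ASCII bytestring.
body = simplejson.dumps({name: ["filenode", {"ro_uri": "URI:..."}]})

# For use in a URL, encode to UTF-8 first and then URL-quote the bytes.
quoted = urllib.quote(name.encode('utf-8'))
```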
I'm glad this fix seems to improve matters, but I'd like to make sure that
our code is doing the right thing.
I'm re-opening this ticket because, per Brian's most recent comment, it seems like the test code is asking simplejson to do something unreasonable, and because per the new #555 (tahoe .deb cannot be installed on hardy: simplejson dependency is too new), it is problematic for Tahoe to require simplejson > 1.7.1.
See also changeset:9d729109d2d4c24a where I change the requirement to simplejson >= 1.7.1 (again).

So there is something about our Ubuntu Dapper buildslave that makes it fail allmydata.test.test_cli.Cp.test_unicode_filename even when other buildslaves using the same version of simplejson pass. Anyway, I just went and experimented on the Dapper buildslave, by easy_install'ing different versions of simplejson from here: http://pypi.python.org/simple/simplejson

With simplejson v1.7.5, v1.8, v1.8.1, and v1.9 this unit test fails. With simplejson v2.0 it passes.

I still haven't understood what this test does or why Brian thinks it is an invalid test.
I'm sorry if it sounded like I was accusing test_unicode_filename of doing something invalid. I think test_unicode_filename is ok (although I haven't looked at it very closely yet).
My comment about "that test code above .. asking .. an unreasonable thing"
was specifically about the test script which Francois posted to this ticket:
This code is passing a bytestring with non-ASCII characters to
simplejson.dumps, but JSON is not capable of encoding non-ASCII bytestrings.
So it's pushing simplejson.dumps beyond its normal territory, into the range
where things are not usually as well tested (i.e. how does dumps() respond to
invalid inputs?). I'm not surprised that its response will vary a lot from
one minor version to the next: error cases are usually driven by bug reports.
I'd find it more reasonable if that code instead used:
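(The code block that followed was lost in migration; given the preceding sentence it was presumably the decoded-first variant, something like:)

```python
import simplejson

# decode the UTF-8 bytestring into a unicode object before JSONifying it
print simplejson.dumps("\xc3\x84rtonwall".decode("utf-8"))
```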
The tahoe_cp.py code should be careful to never send non-ASCII bytestrings to simplejson.dumps. In fact, it should never send bytestrings to dumps() at all, but instead should be sending unicode objects (well, at least for filenames; for other tokens it's pretty benign).
So the analysis that I want to do on this issue is to trace the code in
tahoe_cp.py, from the point where a filename is read off the local disk
(using os.listdir), to the point where it is passed to simplejson.dumps. This
is an area of considerable disagreement within the Python community (the
python-dev list has been very busy in the last week with a discussion of how
unicode filenames and environment variables ought to be handled in the
just-released Python 3.0). I fear we're going to need some platform-specific
code here, but I think it's safe for us to declare that on linux at least we
expect os.listdir to give us bytestrings with UTF-8 encoded names.
Note that this raises the question of what 'tahoe cp' ought to do when a
filename on local disk has non-UTF8 non-ASCII bytes in it: apparently
(according to one of Glyph's recent posts, IIRC) some systems (KDE?) stuff
the high-bit-set bytes into some magical reserved unicode space, so that they
can at least survive a roundtrip. It might not be a bad idea to use this same
technique: if filename.decode("utf-8") on the response from os.listdir raises UnicodeDecodeError, then map it into this special space instead.
I fully agree with Brian's comment on my test script being broken.
However, the test_unicode_filename test looks reasonable because it calls do_cli with UTF-8 bytestrings, in much the same way as if directly called on the command line under Linux.
I'm going to investigate to find if and where bytestrings are being sent to simplejson, following up on warner.
Per recent discussion http://allmydata.org/pipermail/tahoe-dev/2008-December/000948.html , I've reverted some recent changes in changeset:883e51b02dae732a, and marked the cli unicode test as todo in changeset:9f117dbe8f7260a0.

Attachment stringutils.dpatch (194898 bytes) added
Francois:
Thanks for the patch! I'm glad that it adds a test of mkdir as well.
Two things: 1. We prefer functions which are stricter about their inputs, so could you change to_unicode() so that it raises an exception if the argument is not a str? Like this:
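(The snippet that followed was lost in migration; based on the description, it was presumably something like the following, using Tahoe's precondition helper:)

```python
from allmydata.util.assertutil import precondition

def to_unicode(s):
    # be strict about the input: only accept UTF-8 encoded bytestrings
    precondition(isinstance(s, str), s)
    return s.decode('utf-8')
```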
2. Could you please add some doc to CLI.txt explaining that it currently works only for utf-8 strings on argv or from the filesystem, and that it tries to detect if your filesystem is providing utf-8 and raise an exception if not. Oh, and also there is a third thing:

3. What happens if I run a command-line like "tahoe cp A B" and sys.getdefaultencoding() returns something other than utf-8? Do I see a Python backtrace on stdout? Please make it so that this causes an explanatory "USAGE"-style string.
Oh, and: http://docs.python.org/library/sys.html#sys.getfilesystemencoding
So I guess 4 is to use sys.getfilesystemencoding() instead of sys.getdefaultencoding() when trying to determine what encoding we get back from os.listdir(), and 5 is to document in CLI.txt the distinction between filesystem encoding and command-line-arguments encoding.
Phewf. Too bad this isn't easier.
Thanks a lot for your help, François!
Regards,
Zooko
See also François's post to tahoe-dev, which also mentions #565 (unicode arguments on the command-line) and #629 ('tahoe backup' doesn't tolerate 8-bit filenames).
Ok, the latest patch bundle (unicode.dpatch) will hopefully fix all those unicode issues (including #629) in a much cleaner way.
All encoding conversion code is now located in stringutils.py. It provides a single location where we might be able to use better encoding detection methods such as those discussed on the mailing-list.
Methods currently in use are:
- for command line arguments (sys.argv)
- for text display (sys.stdout)
- for filename encoding on the filesystem (os.listdir(), open())
Many precondition checks have been added to ensure that filenames are treated as unicode objects.
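(A rough sketch of the shape of such helpers -- hypothetical simplifications, assuming detection via sys.stdout.encoding and sys.getfilesystemencoding(), not the actual stringutils.py code:)

```python
import os, sys

def argv_to_unicode(s):
    # command-line arguments: decode using the terminal's encoding
    return s.decode(sys.stdout.encoding or 'utf-8')

def unicode_to_stdout(s):
    # text display: encode for the terminal, replacing unknown characters
    return s.encode(sys.stdout.encoding or 'utf-8', 'replace')

def listdir_unicode(path):
    # filenames: decode using the declared filesystem encoding
    enc = sys.getfilesystemencoding()
    return [fn.decode(enc) for fn in os.listdir(path)]
```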
Attachment unicode.dpatch (37990 bytes) added
Uploaded a new version of unicode.dpatch which includes a patch to fix a conflict introduced by changeset:5d57da93fd336692.
Attachment unicode-v2.dpatch (40174 bytes) added
Uploaded a new version of my patch (unicode-v2.dpatch) which fixes a conflict introduced by changeset:7e8958671b2e07f4.
Ok, we had plenty of discussion on the mailing-list. Zooko posted a summary available at http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html
Basically, there are two cases: (1) filenames that can be decoded using the declared filesystem encoding, and (2) filenames that cannot.

Case 1 is what's currently implemented by my patches.

Case 2 requires optionally adding two attributes to each file: a bytestring and a guessed encoding. As Brian pointed out, the bytestring attribute must be encoded as base32 to survive a JSON trip.

Now, I first want to focus on case 1 while raising a helpful error in case 2.

Does that sound good?
Francois: thanks for working on this! I was planning to amend your patch myself, but I'll let you do it.
Here is my most recent idea about how this should be done:
http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html
Except that this isn't my most recent idea after all. I amended my intent a little, as prompted by pointed questions from nejucomo on IRC, and by looking at the actual source code where directories are processed:
http://allmydata.org/trac/tahoe/browser/src/allmydata/dirnode.py?rev=6e57576f2eeedcb8#L168
Then I tried to write down my ideas in detail, and this forced me to realize that they were incomplete and wrong, so I had to amend them a whole lot more in order to finish this letter. Finally, I asked JP Calderone for help. He helped me understand how to write filenames back into a local Linux filesystem without risking that the user will accidentally overwrite their local files with tahoe files (because the tahoe files were written out under a different representation than they were displayed), how to do normalization, and how to cheaply ensure that silent misdecodings could be repaired in some future generation.
Okay, here's the best design yet:
I think that the unicode representation of the filename should continue to be the unique key in the directory (which current Tahoe 1.3.0 requires).
So there should be a data structure with a required "filename" part, and a required "failed_decode" flag, and an optional "alleged_encoding" part. The "filename" part is the canonical value of the filename, but we recognize that sometimes we can't actually get the real filename into unicode form. If our attempt to interpret the filename into unicode fails, then we set the "failed_decode" flag and put the iso-8859-1-decoding of it into the "filename" part.
Here are the steps of reading a filename from the filesystem and adding that filename into an existing Tahoe directory.

1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave out the "alleged_encoding" part. Set the "failed_decode" flag to False.

2. On Linux read the filename with the string APIs to get "bytes" and call sys.getfilesystemencoding() to get "alleged_encoding". Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and the alleged_encoding, and set "failed_decode" to False. (Storing the alleged_encoding is for the benefit of future generations, who may discover that the decoding was actually wrong even though it didn't raise an error, and who could then use the alleged_encoding to undo the damage. For example Shawn Willden has a prototype tool which lets a human examine the filename as decoded with different encodings and pick the one that means something in a language they know.)

2.b. If this decoding fails, then we decode it again with bytes.decode('iso-8859-1', 'strict'). Do not normalize it. Put the resulting unicode object into the "filename" part, set the "failed_decode" flag to True, and leave the "alleged_encoding" field out. This is a case of mojibake:
http://en.wikipedia.org/wiki/Mojibake
The reason to go the mojibake route is that it preserves the information, and in theory someone could later decode it and figure out the original filename. This has actually happened at least once, as shown by the photograph on that wikipedia page of the package which was delivered to the Russian recipient. Mojibake! (footnote 1)
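A compact sketch of steps 2, 2.a, and 2.b for the Linux case (a hypothetical helper, not actual patch code):

```python
import sys, unicodedata

def read_filename(bytes_name):
    # Decode a filename read from a byte-oriented filesystem into the
    # (filename, failed_decode, alleged_encoding) structure described above.
    alleged_encoding = sys.getfilesystemencoding()
    try:
        filename = bytes_name.decode(alleged_encoding, 'strict')
    except UnicodeDecodeError:
        # Mojibake fallback: iso-8859-1 maps every byte to a code point,
        # so the original bytes survive a round trip. Do not normalize.
        return {'filename': bytes_name.decode('iso-8859-1', 'strict'),
                'failed_decode': True}
    return {'filename': unicodedata.normalize('NFC', filename),
            'failed_decode': False,
            'alleged_encoding': alleged_encoding}
```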
How does that sound?
Phewf. Okay, now for the trip in the other direction. Suppose you have a Tahoe filename object, and you need to create a file in the local filesystem, because for example the user runs "tahoe cp -r $DIRCAP/subdir .". There are four cases:
Case 1: You are using a unicode-safe filesystem such as Windows or Mac, and you have a unicode object with failed_decode=False.
This is easy: use the Python unicode filesystem APIs to create the file and be happy.
Case 2: You are using a unicode-safe filesystem and you have a unicode object with failed_decode=True.
This is easy: use the Python unicode filesystem APIs to create the file, passing the latin-1-decoded filename (mojibake!).
Case 3: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=False.
This is easy: use the Python unicode filesystem APIs to create the file.
Case 4: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=True.
Now we should encode the filename using iso-8859-1 to get a sequence of bytes, and then write those bytes into the filesystem using the Python string filesystem API. This is no worse than any alternative, and in the case that the target filesystem has the same encoding as the original filesystem (such as because it is the same as the original filesystem, or because it is owned by a friend of the owner of the original filesystem), then this will restore the file to its proper name.
By the way, please see David Wheeler's recent proposal to start enforcing filename constraints in Linux: http://lwn.net/Articles/325304 . His proposals include changing Linux to require utf-8-encoding of all filenames.
Regards,
Zooko
footnote 1: I know that Alberto Berti has previously argued on tahoe-dev and on IRC that mojibake is less clean than the alternative of using bytes.decode(alleged_encoding, 'replace'). The latter is lossy, but it more clearly shows to the user that some or all of the filename couldn't be decoded. Alberto and others had convinced me of the wisdom of this, and I actually wrote this entire document specifying the 'decode-with-replace' approach instead of the mojibake approach, but I eventually realized that it wouldn't work. For one thing it was rather complicated to decide how to handle multiple filenames that all decode-with-replace to the same unicode name (you could imagine a whole directory full of files all named '????' because the locale is wrong). But the real killer is what to do when you are going to write the file back into the local filesystem. If you write a decoded-with-replace file back, then this means a round-trip from linux to tahoe and back can mess up all of your filenames. If you write the actual original bytes into the filesystem, then this means that you might accidentally overwrite files with a "tahoe cp", since "tahoe ls" just shows files with "???" in their names, but "tahoe cp" writes files out with actual characters instead of question marks.
Zooko: this sounds great to me!
We're going to go ahead and release tahoe-1.4.0 ASAP, but this will be one of the first improvements to land in tahoe-1.5.0, I hope! :-)
Attachment unicode-v3.dpatch (54865 bytes) added
Ok, here is an amended version of all the previous patches. It would be great if people could give it a try under Windows and MacOS X and publish their experiences here.

It currently only handles the simple case where the filename encoding is coherent with what's detected by Python (sys.getfilesystemencoding()).
I'm reviewing your most recent patch, François.
I'll be posting my observations in separate comments as I understand more of the patch.
Here's the first observation:
The patch seems to assume that the terminal handles either ascii or utf-8 on stdout, but what about terminals that handle a different encoding, such as Windows cmd.exe (which presumably handles whatever the current Windows codepage is, or else utf-16le)? Apparently sys.stdout.encoding will tell us what python thinks it should use if you pass a unicode string to it with print myunicstr or sys.stdout.write(myunicstr).

In any case the documentation should explain this -- that what you see when you run tahoe ls will depend on the configuration of your terminal. Hm, this also suggests that it isn't correct for tahoe to have a unicode_to_stdout() function and instead we should just rely on the python sys.stdout encoding behavior. What do you think?

I guess one place where I would be willing to second-guess python on this is: if sys.stdout.encoding says the encoding is ascii or says that it doesn't know what the encoding is, then pre-encode your unicode strings with utf-8 (or, if on Windows, with utf-16le) before printing them or sys.stdout.write()'ing them. This is because of the following set of reasons:

1. A misconfigured environment will result in python defaulting to ascii when utf-8 will actually work better. (I just now discovered that my own Mac laptop on which I am writing this was so misconfigured, and when I tried to fix it I then misconfigured it in a different way that had the same result! The first was: LANG and LC_ALL were being cleared out in my .bash_profile; the second was: I set LANG and LC_ALL to en_DK.UTF-8, but this version of Mac doesn't support that locale, so I had to change it to en_US.UTF-8.)

2. Terminals that actually can't handle utf-8 and can only handle ascii are increasingly rare.

3. If there is something that can handle only ascii and you give it utf-8, you'll be emitting garbage instead of raising an exception, which might be better in some cases. On the other hand I suppose it could be worse in others. (Especially when it happens to produce control characters and screws up your terminal emulator...)

I'm not entirely sure that this second-guessing of python is really going to yield better results more often than it yields worse results, and it is certainly more code, so I would also be happy with just emitting unicode objects to stdout and letting python and the local system config do the work from there.
small details and English spelling and editing:
s/Tahoe v1.3.1/Tahoe v1.5.0/
s/aliase/alias/
s/commande/command/
s/moderns/modern/
Oh, I see that your unicode_to_stdout function is exactly like the builtin python one except that it uses 'replace' mode. That's interesting. I guess maybe that is an improvement over the builtin python behavior (for our particular uses), and it also provides a function where we can add further second-guessing of python, such as falling back to utf-8 or utf-16le.

I'm trying to understand this patch better by splitting it up into smaller patches. Here is a version of it which strips out all the unicode_to_stdout() changes (which I subsequently realized are not a bad idea). The next step probably ought to be splitting this one into yet smaller patches -- such as everything to do with aliases in one, and everything to do with files in another, and everything to do with cmdline args in a third. Not sure if these can be cleanly separated (since filenames and aliases can come in through the cmdline args, for example), but I might try and see if I understand it better by the attempt. :-)
(which I subsequently realized is not a bad idea). The next step probably ought to be splitting this one into yet smaller patches -- such as everything to do with aliases in one, and everything to do with files in another, and everything to do with cmdline args in a third. Not sure if these can be cleanly separated (since filenames and aliases can come in through the cmdline args, for example), but I might try and see if I understand it better by the attempt. :-)Attachment unicode-v3-minus-the-stdout-parts.patch.txt (31445 bytes) added
And here is a version of it where I stripped out the aliases parts. (Note: I definitely want the aliases part, I'm just trying to separate them out to understand it better, and I'm posting the intermediate work here just in case François or anyone wants to see what I'm doing.)
Attachment unicode-v3-minus-the-stdout-and-aliases-parts.patch.txt (29387 bytes) added
Okay and here I've stripped out the argv parts:
Attachment unicode-v3-minus-the-stdout-and-aliases-and-argv-parts.patch.txt (23518 bytes) added
Here I've stripped out the parts having to do with encoding for URLs. Also I noticed some kludgy broken tweaks that I had put in during earlier steps of this process and stripped those out too.
Attachment unicode-v3-minus-the-stdout-and-aliases-and-argv-and-url-parts.patch.txt (16919 bytes) added
Hm. I just learned that the windows-1252 encoding is a superset of the iso-8859-1 (a.k.a. latin-1) encoding: http://en.wikipedia.org/wiki/Windows-1252

The difference is that some bytes which are mapped to control characters in iso-8859-1 are mapped to printable characters in windows-1252. (Also maybe some of the characters are in a different order, but that doesn't matter for this purpose.)

Does that mean that when doing the mojibake fallback when decoding fails, if we decode with windows-1252 instead of iso-8859-1 then we'll have fewer control characters in the resulting unicode string? That sounds like an improvement.

Hm, so there is this idea by Markus Kuhn called utf-8b. utf-8b decoding is just like utf-8 decoding, except that if the input string turns out not to be valid utf-8, then utf-8b stores the invalid bytes of the string as invalid code points in the resulting unicode object. This means that utf8b_encode(utf8b_decode(x)) == x for any x (not just for x's which are utf-8 encodings of a unicode string).

I wonder if utf-8b provides a simpler/cleaner way to accomplish the above. It would look like this. Take the design written in comment:369015 (http://allmydata.org/trac/tahoe/ticket/534#comment:369015) and change step 2 to be like this:

2. Call sys.getfilesystemencoding() to get "alleged_encoding". If the alleged encoding is ascii or utf-8, or if it is absent or invalid or denotes a codec that we don't have an implementation for, then set alleged_encoding = 'utf-8b' instead. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename, and if the encoding that was used was not utf-8b then store the alleged_encoding. (If the encoding that was used was utf-8b, then don't store the alleged_encoding -- utf-8b is the default and we can save space by omitting it.)

2.b. If this decoding fails, then we decode it with bytes.decode('utf-8b'). Do not normalize it. Put the resulting unicode object into the "filename" part. Do not store an "alleged_encoding".

Using utf-8b to store bytes from a failed decoding instead of iso-8859-1 means that if the name or part of the name is actually ascii or utf-8, then it will be (at least partially) legible. It also means that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in koi8-r, but failed to decode using the koi8-r codec, and so was instead decoded using utf-8b, or whether the filename was originally alleged to be in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's right.)

An implementation, including a Python codec module, by Eric S. Tiedemann (1966-2008; I miss him):
http://hyperreal.org/~est/utf-8b
An implementation for GNU iconv by Ben Sittler:
http://bsittler.livejournal.com/10381.html
A PEP by Martin v. Löwis to automatically use utf-8b whenever you would otherwise use utf-8: http://www.python.org/dev/peps/pep-0383
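(For illustration only, a toy decoder in the spirit of utf-8b: greedy utf-8 decoding with invalid bytes parked in the low surrogate range so the original byte values are preserved; this is a sketch, not Tiedemann's codec:)

```python
def utf8b_decode(bs):
    out, i = [], 0
    while i < len(bs):
        # try the longest valid utf-8 sequence starting at i (4..1 bytes)
        for j in range(min(4, len(bs) - i), 0, -1):
            try:
                out.append(bs[i:i + j].decode('utf-8'))
                i += j
                break
            except UnicodeDecodeError:
                pass
        else:
            # invalid byte: park it at U+DC80..U+DCFF so it can be
            # recovered later
            out.append(unichr(0xDC00 + ord(bs[i])))
            i += 1
    return u''.join(out)
```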
I wrote: """Using utf-8b to store bytes from a failed decoding instead of iso-8859-1 means ... that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in koi8-r, but failed to decode using the koi8-r codec, and so was instead decoded using utf-8b, or whether the filename was originally alleged to be in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's right.)"""
Oh no, I am wrong about this because of the existence of byte-oriented systems where the filesystem encoding is not utf-8. When outputting a filename into such a system, you ought to check the "failed_decode" flag and, if it is set, reconstitute the original bytes before proceeding to emit the name using the byte-oriented API.

Here is some code which attempts to explain what I mean. It doesn't actually run -- for example it is missing its import statements -- but writing it helped me think this through a little more. Also attached to this ticket...
Attachment fsencoding.py (3005 bytes) added
PEP 383'ish implementation of listdir()
I was referencing this ticket and my (untested) example code here in the PEP 383 thread on python-dev (http://mail.python.org/pipermail/python-dev/2009-April/089170.html ), and I realized that I forgot to set the "failed_decode" flag in my example code. Here is a new version of it that sets the "failed_decode" flag, and that uses utf-8 for the attempted decode and only uses utf-8b when doing the fallback (for clarity -- should make no difference since the attempted decode uses error handling mode 'strict').

I will also attach it.
Attachment fsencode.py (3052 bytes) added
As promised, here's a darcs patch bundle containing the 17 patches which implement basic Unicode filename support, for review purposes.
Attachment tahoe-534-bundle.dpatch (70453 bytes) added
Hooray! Thanks!
Fixed another small error that I discovered while writing to python-dev. Attaching latest version.
Attachment fsencode.2.py (3052 bytes) added
Fixed a couple more bugs or unnecessaries found by inspection while writing to tahoe-dev.
Attachment fsencode.3.py (2912 bytes) added
Yet again I realize that my previous understanding was incomplete or subtly flawed. This encoding stuff is very tricky! Here's my post to python-dev where I postulate that PEP 383 actually does solve most of our problems after all:
http://mail.python.org/pipermail/python-dev/2009-May/089318.html
Please read!
Okay, my current design is at the end of this message, including rationale:
http://allmydata.org/pipermail/tahoe-dev/2009-May/001670.html
Here is the summary:
To copy an entry from a local filesystem into Tahoe:
1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata.

2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata.

2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!

3. An entry with the same unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name. If the failed_decode flags differ then this is a case of collision -- the old entry and the new entry had (as far as we are concerned) different names that accidentally generated the same unicode. Alter the new entry's name, for example by appending "~1" and then trying again and incrementing the number until it doesn't match any extant entry (see the sketch after this list).
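A sketch of the collision handling in step 3 (hypothetical helper; entries maps unicode filenames to metadata dicts carrying the failed_decode flag):

```python
def choose_name(entries, name, failed_decode):
    # Same name and same failed_decode flag: overwrite. Flags differ:
    # the names only collided accidentally, so append ~1, ~2, ... until
    # the candidate matches no extant entry.
    candidate, n = name, 0
    while (candidate in entries
           and entries[candidate]['failed_decode'] != failed_decode):
        n += 1
        candidate = u'%s~%d' % (name, n)
    return candidate
```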
To copy an entry from Tahoe into a local filesystem:
Always use the Python unicode API. The original_bytes field and the
failed_decode field in the metadata are not consulted.
François nicely broke his patch up into many small patches and attached the whole set as "tahoe-534-bundle.dpatch" (68.8 KB, added by francois 6 weeks ago). Here I'm attaching them as separate attachments so that I can link to them. Also I re-recorded one in order to avoid a conflict in src/allmydata/test/test_cli.py.
Attachment plumbing for unicode support.darcspatch (11899 bytes) added
Attachment tahoe manifest unicode support.darcspatch (9482 bytes) added
Attachment tahoe ls unicode support.darcspatch (9627 bytes) added
Attachment tahoe get unicode support.darcspatch (9144 bytes) added
Attachment tahoe webopen unicode support.darcspatch (8998 bytes) added
Attachment tahoe rm unicode support.darcspatch (9074 bytes) added
Attachment tahoe ln unicode support.darcspatch (9074 bytes) added
Attachment tahoe status unicode support.darcspatch (9031 bytes) added
Attachment tahoe check unicode support.darcspatch (9026 bytes) added
Attachment tahoe deep-check unicode support.darcspatch (9042 bytes) added
Attachment tahoe mkdir unicode support.darcspatch (10364 bytes) added
Attachment explain the current unicode support in CLI.darcspatch (9891 bytes) added
Attachment tahoe cp unicode support.darcspatch (15604 bytes) added
Attachment tahoe put unicode support.darcspatch (11435 bytes) added
Attachment aliases unicode support.darcspatch (12817 bytes) added
Attachment previous cli related patches should.darcspatch (9150 bytes) added
Attachment tahoe backup unicode support.darcspatch (15101 bytes) added
Unsetting the "review" keyword because this patch is not ready to review until François (or someone) reworks it to use os.listdir(a_unicode_thing) on Windows and Mac and os.listdir(a_string_thing) on other platforms.

Presumably these are the patches #734 is referring to.
To fix #734, unicode_to_stdout in the patched util/stringutil.py should be something like:
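(The code block that followed was lost in migration; a minimal reconstruction consistent with this comment and with the fallback quoted later in this ticket -- note it still assumes sys.stdout.encoding is a string, which comes up again below:)

```python
import sys

def unicode_to_stdout(s):
    # encode a unicode object for stdout, replacing unencodable
    # characters, and fall back to UTF-8 when Python reports an
    # encoding it has no codec for (e.g. 'cp65001')
    try:
        return s.encode(sys.stdout.encoding, 'replace')
    except LookupError:
        return s.encode('utf-8', 'replace')
```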
(This doesn't explain why tahoe_ls.py was attempting to use None for the name it was trying to convert, but that's presumably a separate issue not caused by the patch.)
The reason for the except LookupError is that Python can sometimes set sys.stdout.encoding to an encoding that it does not recognize. For example, if you have Windows cmd.exe set to use UTF-8 by chcp 65001, sys.stdin.encoding and sys.stdout.encoding will be 'cp65001', which is not recognized as being the same as UTF-8. (If the encoding was actually something other than UTF-8, I think that producing mojibake on stdout is better than throwing an exception. Note that terminals that do bad things when they receive control characters are already broken; this isn't making it worse, because plain ASCII can include such control characters.)

There are other issues with the patch; I just wanted to comment on this part while it's swapped into my brain (and people are now calling me to go and play scrabble, which sounds more fun :-)
Replying to davidsarah:
Python was obviously not fully swapped in -- s/catch/except/.
/."tahoe cp" command encoding issueto CLI (e.g. tahoe cp) does not handle Unicode file/directory names, either in arguments or the local filesystemCLI (e.g. tahoe cp) does not handle Unicode file/directory names, either in arguments or the local filesystemto CLI (e.g. tahoe cp) does not correctly handle or display Unicode file/directory names, either in arguments or the local filesystemSigh. I misread the exception message in #734. It is complaining about
sys.stdout.encoding
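(The snippet was lost; presumably it defaults the encoding when sys.stdout.encoding is None, along these lines:)

```python
import sys

def unicode_to_stdout(s):
    # sys.stdout.encoding is None when stdout is redirected to a file
    # or pipe, so fall back to UTF-8 in that case as well
    try:
        return s.encode(sys.stdout.encoding or 'utf-8', 'replace')
    except LookupError:
        return s.encode('utf-8', 'replace')
```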
It might also be a good idea to write a BOM at the start of the output. That would allow the encoding to be autodetected if a file containing redirected output is edited, which is helpful at least on Windows, and should be harmless on other platforms.
At least I did better at scrabble (won by 3 points :-)
Attachment unicode-helper-functions.diff (10347 bytes) added
This patch contains Unicode helper functions (stringutils.py) and associated tests
I'd be very grateful if someone could review patch attachment:unicode-helper-functions.diff
Then, I'll post a second patch which will modify the CLI code to actually use those helper functions.
I will review it. If I understand correctly, François's strategy is to first write patches to handle the case where the bytes decode without error using the nominal encoding. This avoids all the tricky problems with bytes that don't decode according to the nominal encoding.
I'm excited about having this mocking tool so we can conveniently express platform-specific behavior in a deterministic, automated, cross-platform test.
I plan to add the following text to the NEWS file to tell users the current state of Unicode support.
Attachment unicode-filenames-handling.diff (29300 bytes) added
Patch attachment:unicode-filenames-handling.diff makes use of helper functions from attachment:unicode-helper-functions.diff to handle Unicode filenames in the CLI.
For now, only filenames which are encoded coherently with the system locale are handled. Otherwise, an error message is displayed and the current operation is stopped.

Unicode CLI arguments in Windows cmd.exe don't work; see ticket #565 on this issue.
I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.
Here is my review of attachment:unicode-filenames-handling.diff and attachment:unicode-helper-functions.diff .
Overall this is a good start. I'm glad we have a way to do real thorough tests of encoding (by mocking up stdout, sys, and other components of the system). This patch isn't ready to go into trunk so we won't ship it in v1.7 but at this rate we can definitely get it all polished up for v1.8.
Here are my comments:
I like the NEWS file text. :-)
The unit tests of the helper functions look good! Please add tests of the two cases where listdir_unicode_unix() would raise FilenameEncodingError.

What does this comment mean? Recognized by whom? The code looks like it detects this and corrects it, so I guess you don't mean recognized by Tahoe-LAFS.
This looks wrong! cp850 and iso-8859-1 are different encodings. If I am right that this is wrong then it should be possible to write a test case that causes Tahoe-LAFS to incorrectly decode an input as long as this statement is still in the code.
Oh and look what I just found! It appears that cp65001 and utf-8 are also different: http://bugs.python.org/issue6058 (specifically http://bugs.python.org/msg97731 ). Let's see if we can write a unit test that goes red when Tahoe-LAFS (or perhaps the underlying Python libraries) incorrectly equate cp65001 and utf-8.
Is stdout encoding guaranteed to be the same as argv encoding? Maybe we should rename get_stdout_encoding(), or at least add a comment to argv_to_unicode() explaining that we use stdout encoding for this. Also the UsageError has the wrong text. Also it has the wrong name of the exception class! This proves that the line is not covered by unit tests. Please add a unit test of what happens with something that can't be decoded as the stdout encoding. :-)

If we're going to do this fallback let's have a unit test of it. This means a test of the case that get_stdout_encoding() returns an encoding that Python doesn't know how to do, right!? Maybe let's not have a test of that, and also not have that code at all, and let it raise LookupError in that case. :-)

"encoutered" -> "encountered"
Finish the test for open_unicode().

Add a test that the normalization is useful, i.e. a test that is red if we skip the unicodedata.normalize() step.
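To illustrate what such a test needs to pin down, a minimal standalone sketch (not the patch's test):

```python
# -*- coding: utf-8 -*-
import unicodedata

# u'Ä' can arrive from the OS either precomposed (U+00C4) or decomposed
# (U+0041 followed by U+0308 COMBINING DIAERESIS). Without an NFC
# normalization step the two spellings compare unequal even though they
# denote the same character -- so a test built on this pair goes red if
# the normalize() call is removed.
precomposed = u'\u00c4rtonwall'
decomposed = u'A\u0308rtonwall'

assert precomposed != decomposed
assert unicodedata.normalize('NFC', decomposed) == precomposed
```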
There are a couple of hard-coded 'utf-8' and a literal 'ascii' in attachment:unicode-filenames-handling.diff. Some of them may be correct, for example if it is encoding to a URL, but in that case maybe we should invoke unicode_to_url() instead (mostly for documentation purposes). Some of them appear to be wrong to me -- aliases files shouldn't be hard-coded to utf-8... Oh wait, yes they should, right? Then please add doc stating so. Finally, the encoding for stdout in tahoe_manifest.py definitely seems wrong to be hard-coded to 'utf-8'. Add a unit test showing that the current code with the hard-coded 'utf-8' fails on a system where stdout needs a different encoding.

In test_cli.py, if the test_listdir_unicode_bad() test doesn't make sense on the current platform, then raise SkipTest instead of returning.

Finally, use a code coverage tool (I recommend trialcoverage) to see if any of the lines changed by these patches are not fully tested by these tests.
Zooko, thank you very much for the review.
Replying to zooko:
Please find an updated version of the patches which takes all your comments into account and now has 100% test coverage.
Done
The idea behind this piece of code came from comment:369040 and comment:369046.
Anyway, I removed it completely because getting correct Unicode command-line arguments under Windows will require calling the Unicode Windows API function GetCommandLineW; see ticket #565. In the meantime, I've marked all Windows test cases as todo.
The function was renamed get_term_encoding(), the wrong comment was
rewritten and the error message now contains the guessed terminal
encoding.
According to the email message linked from ticket #734, users probably prefer having slightly modified output in the terminal (unknown characters replaced by '?') rather than an exception.
Fixed :)
Done.
Done in test_unicode_normalization().

I already invoke unicode_to_url() instead of .encode('utf-8') in every place. The two remaining hard-coded 'utf-8' occurrences are for the alias file encoding.

Hard-coded 'ascii' is used for representing capabilities; it might be better to create Capability objects instead of using str, but that's most likely out of the scope of this patch.
Yes, the alias file is read and written as UTF-8. A note was added in frontends/CLI.txt.

Done.

I had trouble making use of the test-coverage Makefile target, so I finally used coverage.py directly.

Attachment unicode-helper-functions-v2.diff (12927 bytes) added
Attachment unicode-filenames-handling-v2.diff (30979 bytes) added
Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.)
Have you written (or can you write) a unit test which exercises the fallback except LookupError: return s.encode('utf-8', 'replace')?

As we discussed on IRC, this version doesn't do any lossy decoding from the local filesystem into unicode -- if a filename can't be decoded with the declared sys.getfilesystemencoding() then an exception is raised. Is this exercised by a unit test?
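In outline, the strict (non-lossy) behavior is this; FilenameEncodingError is the patch's exception class, but the body is my sketch:

```python
import os, sys

class FilenameEncodingError(Exception):
    """A filename could not be decoded using the declared filesystem encoding."""

def listdir_unicode(path):
    # Strict decoding: an undecodable name aborts the listing instead of
    # being silently mangled or replaced.
    enc = sys.getfilesystemencoding()
    names = []
    for fname in os.listdir(path):
        try:
            names.append(fname.decode(enc))
        except UnicodeDecodeError:
            raise FilenameEncodingError(fname)
    return names
```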
I can't think of any other issues -- we just need someone to review the latest patch.
Replying to zooko:
Shouldn't capabilities be defined as plain ASCII?
Yes, in test_stringutils.py line 70:

Yes, line 98 of test_stringutils.py:

Replying to francois, comment:93:
Future caps might include non-ASCII characters. And some of the current code might be able to process caps of a future version (forward-compatibility). There's no reason not to define the current ones as utf-8, is there?
USSJoin and drewp had this to say on IRC:
Replying to zooko, comment:94:
Ok, I see your point. I'll modify the two .encode('ascii') into .encode('utf-8') in src/allmydata/scripts/common.py and write a unit test which adds an alias pointing to a capability from the future ('fi-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧迴') as proposed, then tries to get it back.
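Sketched out, the round trip could look like this; add_alias() and get_aliases() here are simplified stand-ins for the real helpers in src/allmydata/scripts/common.py:

```python
# -*- coding: utf-8 -*-
# Simplified stand-ins: the real aliases-file format and helpers live in
# src/allmydata/scripts/common.py; this only shows the UTF-8 round trip.
def add_alias(path, name, cap):
    f = open(path, 'ab')            # aliases files are defined to be UTF-8
    f.write('%s: %s\n' % (name.encode('utf-8'), cap.encode('utf-8')))
    f.close()

def get_aliases(path):
    aliases = {}
    for line in open(path, 'rb'):
        name, cap = line.decode('utf-8').rstrip(u'\n').split(u': ', 1)
        aliases[name] = cap
    return aliases

cap_from_the_future = u'fi-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧迴'
add_alias('test_aliases', u'future', cap_from_the_future)
assert get_aliases('test_aliases')[u'future'] == cap_from_the_future
```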
Attachment unicode-filenames-handling-v3.2.diff (30969 bytes) added
Attachment additionnal-tests.diff (1127 bytes) added
Replying to zooko:
The preconditions are only there for extra safety, to prevent programmers from using the library in unexpected ways. In normal usage, they should never be triggered except by a bug.

The only user-visible exception for now is FilenameEncodingError. It will probably get caught and logged in the future, to avoid breaking a complete backup session because of a single badly encoded file.

Ok, those are the patches which should be applied to trunk if the review is considered sufficient.
This patch seems to be a little bit messed up by the "END OF DESCRIPTION" line being accidentally included in the description part, apparently because of whitespace at the beginning of the "END OF DESCRIPTION": http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/534/unicode-filenames-handling-v3.2.diff
Also, that patch doesn't apply cleanly to current trunk. Could you please attach your darcs patch (the result of darcs send -o darcs.patch.txt)? See http://tahoe-lafs.org/trac/tahoe-lafs/wiki/Patches
Make that darcs send -o unicode-filenames-handling-v3.2.dpatch.

I applied the patch attachment:unicode-helper-functions-v2.diff and ran the tests and got some failures:
The failures were:
and
This is on Mac OS 10.4.11.
I applied attachment:additionnal-tests.diff and ran the tests and got a failure:
The issue of caps with non-ASCII chars has been spun out into #1051 (capabilities from the future could have non-ascii characters).
Replying to francois:
Reviewing these three now.
unicode_to_stdout(): I think that if we get UnicodeEncodeError then we should add the 'replace' flag but keep using the same encoding (the return value of get_term_encoding()), and we should not catch and try to handle LookupError. Oh, in fact that means that we should omit the try: … except: block entirely and just do return s.encode(get_term_encoding(), 'replace'). What do you think? If you think we should do something different then construct a unit test that we fail unless we do something different. :-)
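Spelled out, that proposal amounts to something like this (a sketch, not the patch as posted; the 'ascii' fallback in get_term_encoding() is an assumption that comes up later in the thread):

```python
import sys

def get_term_encoding():
    # Assumed fallback: 'ascii' when Python reports no encoding for stdout
    # (e.g. when output is redirected to a pipe).
    return sys.stdout.encoding or 'ascii'

def unicode_to_stdout(s):
    # No try/except at all: 'replace' guarantees encode() cannot raise
    # UnicodeEncodeError, and an unknown encoding name surfaces as a
    # LookupError instead of being silently papered over.
    return s.encode(get_term_encoding(), 'replace')
```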
Is this if statement obsolete now that we have proper mockery-based tests? Or is do_cli() a higher layer of functionality which should be tested in addition to the current tests? Or should do_cli() itself get mockerified so that it will run the same on all platforms? (Maybe it is good to test the actual non-mockerified underlying platform sometimes?) What is the intent of this if -- to test how do_cli() handles non-ASCII chars on those filesystems which can handle them? If that is the case, then add a comment to that effect and make the test actually test whether the underlying system can encode Ä, rather than whether the underlying system is ANSI_X3.4-1968. (For example, I guess a Japanese system in SHIFT-JIS isn't ANSI_X3.4-1968 and also can't represent Ä.)

unicodedata.normalize(): I just went and removed that call to see that test go red and finally gained a dim understanding of what that call is for. :-)

Unsetting the "review-needed" flag until François addresses the comments so far.
To test this feature on your system, apply François's most recent patches (from comment:369062 unless he has posted new ones since then) and try to upload, download, move, rename, and backup files with non-ASCII characters in their names. Also directories. Use the WUI, the CLI, the WAPI, or tahoe backup as you prefer. For bonus points, try installing the SFTP patches from #1037 and test filenames with non-ASCII characters through SFTP, sshfs, Nautilus, etc. :-)

I did some tests on my Windows box.
The directory TEST contains this filename (on one line):
LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´±¾¶§÷¸°¨·¹³² _1.txt
Test 1:
D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe cp --recursive TEST ROOT:test_534
Success: files copied
=> I copied this file back to my disk and the md5sum was the same as the source file.
Test 2:
D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir ROOT:à_testé
URI:DIR2:wskq3uymjffqq6zidua7xrc3oq:crnp2xhec76wwjwghx2ddae5nenxv365id7nx56tepwdzwlet3ha
=> no errors, but the next test fails.
Test 3:
D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:
Ó_testÚ
=> Ó_testÚ <> à_testé
PS: same output with the WUI
Test 4:
D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir "ROOT:LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´±¾¶§÷¸°¨·¹³²"
URI:DIR2:64cxbqxkd2rj4lxpuragkltrpu:5rjnyv3cnvdavg7uprmkb2f2fvncasdi6t6k3ert3t5htbpffaca
=> no errors, but the next test fails.
Test 5:
D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:
LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~óÚÔõÓÕþÛÙÞ´¯ý─┼╔µã¶÷‗¹¨ Í▄°úÏÎßݾ·±Ð¬║┐«¼¢╝í½╗┴┬└®óÑÒ├ñð╩╦╚═╬¤ª╠Ë▀ÈʧıÁ■Ì┌█┘²¦»┤¡▒¥Âº¸©░¿À╣│▓
=> the result is the same as test 3: not the same name as the source.
Test 6:
D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe backup TEST ROOT:test_534
1 files uploaded (0 reused), 0 files skipped, 1 directories created (0 reused), 0 directories skipped
backup done, elapsed time: 0:00:15
Test 7:
D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:test_534/Archives/2010-05-19_22:08:36Z
LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´±¾¶§÷¸°¨·¹³² _1.txt
=> the filename is the same as the source.
Fred
Replying to zooko, comment:105:
IMHO, tests in test_cli.py should exercise the actual usage of the CLI tools and thus should not use mocking at all. Whereas tests in test_stringutils.py exercise the stringutils library itself and use mocking to simulate all possible platforms.
You're right, this exception handling is not very pretty. I chose your proposition and modified get_term_encoding() to return 'ascii' when the sys.stdout.encoding value is None. A new unit test has been written as well.
Yes, this statement is obsolete and has been removed. I'd say that the test should be skipped if it is not possible to write arbitrary filenames on the filesystem, or to pass arbitrary arguments on the command line, on the platform on which the tests are being run.
Attachment unicode-helper-functions-v4.diff (13891 bytes) added
Attachment unicode-filenames-handling-v4.diff (32422 bytes) added
Attachment unicode-bundle-v4.darcspatch (61069 bytes) added
The latest patches, which take comment:109 into account, have been uploaded. They run without failure under Ubuntu 10.04 and Windows XP.
And here is a darcs patch which applies correctly to trunk and contains the two previous patches plus a third one which adds a dependency on the mock library (#1016).

The latest attachment:unicode-bundle-v4.darcspatch still fails unit tests on my Mac. It occurred to me that I can program the buildbots to run your branch and report! This could be useful! Here is your branch on a new trac instance:
http://tahoe-lafs.org/trac/tahoe-lafs-ticket534/log/
and here is the result of me forcing all the buildslaves to try to build and test your branch at 22:12:09 just now:
http://tahoe-lafs.org/buildbot/waterfall?show_events=true&branch=ticket534&reload=none
:-)
Now you can see how your branch fares on all our buildslaves!
Send me your ssh pub key and I'll set it up so that you can push patches directly to that branch on http://tahoe-lafs.org and then push a button on the buildbot web page to force all buildslaves to build your latest patches.
Ok, all tests now pass on all supported platforms or are skipped if irrelevant to the platform.
So, please review the 7 new patches which have been added after unicode-helper-functions-v4.diff and unicode-filenames-handling-v4.diff.
The patches were committed to trunk by Zooko: changeset:b2542b87085be095 -> changeset:0eb4d83937a97094.
I'm now reviewing the three patches for this functionality which were added by Zooko.
About patch changeset:442008a6905d7374: the change in tahoe_manifest.py makes sense.
However, I don't agree with the changes made in test_cli.py, because calling the method self.do_cli with UTF-8 encoded arguments only makes sense if the platform on which the test is run also encodes CLI arguments in UTF-8. This assumption is wrong in all other cases, such as terminals which only support ASCII characters. This is probably why the tests are failing on many buildbots.
About patch changeset:db8a6f3aa6d78122: using locale.getpreferredencoding() instead of sys.stdout.encoding looks like a good idea. The changes in the functions skip_non_unicode_fs and skip_non_unicode_stdout also look good. Same comment as before on the usage of self.do_cli.

Patch changeset:5bcca5151eb897e6 -- hey, this one is easy, OK ;)
David-Sarah added some nice extensions to the tests. They and Brian Warner and I chatted on IRC and agreed that tahoe ls ought to report the fact that a filename was unrepresentable if it was, and which filename (as an escaped/quoted representation such as \u1234 for each unicode code point). David-Sarah implemented that, posted their new patch here, and I just reviewed it: http://allmydata.org/trac/tahoe-lafs-ticket534/changeset/4434
François: please by all means review this patch and any other new patches from David-Sarah on the topic of unicode!
Oh, I found one thing in my review: open_unicode() should not do anything that the builtin open() function doesn't already do. Fortunately, now that we have such thorough tests, we should be able to verify that by redefining it like this and seeing if the tests pass:
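Presumably the trivial pass-through, i.e. (a sketch, assuming the signature used elsewhere in the patches):

```python
def open_unicode(path, mode='r'):
    # Nothing beyond the builtin open(): if the tests still pass with this
    # definition, open_unicode() adds no behavior of its own.
    return open(path, mode)
```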
In David-Sarah's latest patches, they added a call to os.path.expanduser() into open_unicode(). I think that putting expanduser() into some central location, so that we'll never forget to use it, makes sense; so perhaps we'll end up renaming open_unicode() to open_cu():
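Roughly like this (my sketch; the real signature and name were still under discussion):

```python
import os

def open_cu(path, mode='r'):
    # "Centralized" open: expanduser() lives here so that no call site can
    # forget to apply it.
    return open(os.path.expanduser(path), mode)
```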
I'm currently reviewing changeset:80252debcd94fc28.
In util/stringutils.py: why shouldn't open_unicode behave just like open by default?

I see that a bunch of preconditions were left out, especially in to_str, but that's hopefully not too dangerous.

OK, I reviewed changeset:65b6f4e3ce4cdf13, changeset:529c9f673a89f7fa, changeset:8b014372b1ab9460, changeset:731e3d68dff0f8c9 and changeset:3c883e6e44e82d1e.
I don't really understand why test_listdir_unicode_bad was removed in changeset:3c883e6e44e82d1e.

Attachment misc-update.dpatch (46461 bytes) added
Replying to francois:
The test was failing on Mac OS X, which allows badly-encoded filenames to be written, but then lists them using %-escaping. This is reasonable, but couldn't have been predicted, and it's impossible to predict what behaviour other filesystems might have that would break the test. So Zooko and I decided that the test was invalid.
Note that the test_listdir_unicode test in StringUtilsNonUnicodePlatform mocks the return from os.listdir and checks that listdir_unicode raises a FilenameEncodingError when it should. We decided that this was a sufficient test, without actually trying to create the badly encoded filenames on the local filesystem.
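In outline, that style of test looks something like this (a condensed sketch assuming the mock library from #1016 and the module path allmydata.util.stringutils, not the actual test code):

```python
from mock import patch
from allmydata.util.stringutils import listdir_unicode, FilenameEncodingError

@patch('sys.getfilesystemencoding')
@patch('os.listdir')
def test_listdir_unicode_bad(mock_listdir, mock_getfsenc):
    mock_getfsenc.return_value = 'utf-8'
    # A lone 0xe9 byte (latin-1 'é') is not valid UTF-8, so decoding fails.
    mock_listdir.return_value = ['ok.txt', 'b\xe9d.txt']
    try:
        listdir_unicode(u'/dummy')
    except FilenameEncodingError:
        pass  # expected
    else:
        raise AssertionError("expected FilenameEncodingError")
```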
Replying to francois:
We should never be opening files in text mode.

to_str is intended to accept either a Unicode or (UTF-8) byte string. It is used to coerce strings from the output of simplejson.loads, which might be either.
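Which is to say, roughly (a sketch, not the actual implementation):

```python
def to_str(s):
    # simplejson.loads() may hand back either unicode or (UTF-8) str;
    # coerce both to a UTF-8 byte string.
    if isinstance(s, unicode):
        return s.encode('utf-8')
    return s
```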
Fixed modulo the normalization issue in ticket #1076, which we intend to address for 1.7final.