add streaming (on-line) upload to HTTP interface #320
In 0.8.0, the upload interfaces visible to HTTP all require the file to be
completely present on the tahoe node before any upload work can be
accomplished. For a FUSE plugin (talking to a local tahoe node) that provides
an open/write/close POSIX-like API to some application, this means that the
write() calls all finish quickly, while the close() call takes a long time.
Many applications cannot handle this. These apps enforce timeouts on the
close() call on the order of 30-60 seconds. If these apps can handle network
filesystems at all, my hunch is that they will be more tolerant of delays in
the write() calls than in the close().
This effectively imposes a maximum file size on uploads, determined by the
link speed times the close() timeout. Using the helper can improve this by a
factor of 'N/k' relative to non-assisted uploads. The current FUSE plugin has
a number of unpleasant workarounds that involve lying to the close() call
(pretending that the file has been uploaded when in fact it has not), which
have a bunch of knock-on effects (like how to handle the subsequent open+read
of the file that we've supposedly just written).
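As a rough back-of-the-envelope, here is that size limit worked out in code. The 3 MB/s uplink, 30-second timeout, and 3-of-10 encoding are assumptions for illustration, though they match the numbers quoted later in this ticket:

```python
# Largest file that can finish inside a close() timeout.
# All numbers are illustrative assumptions, not measurements.
link_speed = 3 * 10**6       # bytes/sec of upstream bandwidth
close_timeout = 30           # seconds before the app gives up on close()
k, N = 3, 10                 # erasure-coding parameters (Tahoe's defaults)

# Without a helper, the node pushes N/k times the file size in shares.
max_size_direct = link_speed * close_timeout * k // N   # 27 MB

# With a helper, only the ciphertext (1x the file size) crosses the link,
# an improvement of N/k.
max_size_helper = link_speed * close_timeout             # 90 MB

print(max_size_direct, max_size_helper)
```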
To accommodate this better, we need to move the slow part of upload from
close() into write(). That means that whatever slow DSL link we're traversing
(either ciphertext to the helper or shares to the grid) needs to get data
during write().
This requires a number of items:

- an HTTP interface that will accept partial data. twisted.web does not hand a request to application code until the request body has been fully received, so to continue using twisted.web we must either hack it or add something application-visible (like "upload handles" which accept multiple PUTs or POSTs and then a final "close" action).
- alternatively, twisted.web2 has a streaming request interface, but 1) it has not been released, 2) all the Twisted folks I've spoken to say we shouldn't use it yet, and 3) it doesn't work with Nevow. To use it, we would probably need to include a copy of twisted.web2 with Tahoe, which either means renaming it to something that doesn't conflict with the twisted package, or including a copy of twisted as well.
- some way to use randomly-generated encryption keys instead of CHK-based ones (see the sketch after this list). At the very least we must make sure that we can start sending data over the slow link before we've read the entire file. The FUSE interface (with open/write/close) doesn't give the FUSE plugin knowledge of the full file before the close() call. Our current helper remote interface requires knowledge of the storage index (and thus the key) before the helper is contacted. This introduces a tension between de-duplication and streaming upload.
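To make that tension concrete, here is a minimal sketch of the two key-derivation strategies. SHA-256 and a 16-byte AES key are stand-ins for Tahoe's actual tagged-hash construction, so treat this as an illustration rather than the real key-derivation code:

```python
import hashlib
import os

def chk_key(plaintext: bytes) -> bytes:
    """Content-hash key: derived from the whole file, so repeated uploads
    of the same file converge on the same key (de-duplication), but the
    entire file must be read before the first byte can be encrypted."""
    # SHA-256 stands in for Tahoe's tagged SHA-256d construction.
    return hashlib.sha256(plaintext).digest()[:16]

def random_key() -> bytes:
    """Random key: available immediately, so encryption and upload can
    start as soon as the first write() arrives -- but de-duplication
    (convergence) is lost."""
    return os.urandom(16)
```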
I've got more notes on this stuff.. will add them later.
Since it looks like twisted.web2 won't be ready for production use for a
while (if ever), and the hacks we'd have to make to twisted.web1 would be
effectively the same as rewriting twisted.web2, we decided to go with the
application-visible approach. This means upload handles.
Rob was more comfortable with server-generated handles than with
client-generated ones, so the web-API I'm planning to build will use a series
of POSTs like so:
- The initial POST creates an upload handle and specifies how the encryption key will be chosen: either buffer the whole file locally until close, then compute the CHK encryption key (this defeats streaming); use a randomly-generated encryption key (this enables streaming); or supply the key yourself as an ASCII argument, in one of two encodings, converted to the equivalent binary form as the encryption key (this enables streaming; the second encoding is the same form as the output of 'tahoe dump-cap'). The response is the upload handle: an unguessable string consisting entirely of URL-safe ASCII characters. All further calls will use it.
- Subsequent POSTs each deliver a chunk of file data. Chunks must be written in order; no seek calls are supported at this time. The Content-Type of the POST can be anything except one of the usual HTML form encoding types (multipart/form-data or application/x-www-form-urlencoded), to prevent the twisted.web request handler from attempting to parse the chunk. Each of these POSTs does not return until the chunk has been consumed in the client. If the upload is occurring in a streaming fashion, this will attempt to push the chunk over the slow link before returning, to accomplish the goal of moving the upload time from the close() call to the write() calls.
- A final "close" POST, with the body empty.
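A minimal client-side sketch of that flow follows; the URL paths, query parameters, and response format are invented for illustration (the ticket doesn't pin them down), so only the create/write/close shape matters:

```python
# Hypothetical client for the proposed upload-handle web-API.
# Endpoint names (/uploads, t=create/write/close) are illustrative only.
import requests

BASE = "http://127.0.0.1:3456"  # a local Tahoe gateway (assumed port)

def streaming_put(fileobj, chunk_size=1 << 20):
    # 1. Create an upload handle, asking for a random key so we can stream.
    resp = requests.post(BASE + "/uploads?t=create&key=random")
    handle = resp.text.strip()  # unguessable URL-safe ASCII string

    # 2. Push chunks in order; a non-form Content-Type keeps twisted.web
    #    from trying to parse the body.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        requests.post(
            BASE + "/uploads/" + handle + "?t=write",
            data=chunk,
            headers={"Content-Type": "application/octet-stream"},
        )

    # 3. Close: returns only once the upload has fully completed.
    resp = requests.post(BASE + "/uploads/" + handle + "?t=close")
    return resp.text  # presumably the new file's read-cap
```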
This API is all the application needs to know about, but to make streaming
work, we need a bit more under the hood. The largest current challenge is
that immutable lease requests must be accompanied by an accurate size value,
so we can't start encoding until we know the size of the file. That means we
can only get streaming with a helper. We need a new helper protocol that will
start with a storage index and then push ciphertext to the helper (instead of
having the helper pull ciphertext), then tell the helper that we're done. At
that point, the helper knows the size of the file, so it can encode and push.
So I'm going to build these two protocols: the POST /upload one and the
push-to-helper one, since that will enable streaming in our current
most-important use case. Later, we can investigate a different storage-server
protocol that will let us declare a maximum size, then push data until we're
done, then reset the size to the correct value. With that one in place, we
will be able to stream without a helper. Note, however, that CHK (computed by
the tahoe node) always disables streaming.
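A sketch of what the push-oriented helper protocol could look like, expressed with zope.interface (which Tahoe already uses elsewhere); the interface name and method signatures are guesses at the shape described above, not the actual protocol:

```python
# Hypothetical push-based helper interface; all names are illustrative.
from zope.interface import Interface

class IPushUploadHelper(Interface):
    def open_upload(storage_index):
        """The client announces the storage index up front (it already
        knows the key, since streaming implies a random or caller-supplied
        key rather than CHK). Returns a write handle."""

    def write(handle, ciphertext_chunk):
        """Push the next chunk of ciphertext, in order. The helper buffers
        it; encoding cannot start yet because the total size is unknown."""

    def close(handle):
        """No more data is coming. The helper now knows the file size, so
        it can erasure-code the ciphertext and push shares to the grid."""
```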
Ugh -- I was excited about making Tahoe do streaming using the simple old RESTful API. I'm not very excited about changing the wapi to facilitate streaming. If we're going the direction of extending the wapi to enable more sophisticated file semantics, then we should probably head in the direction of making it be a subset of WebDAV.
http://webdav.org/
Basically, there is value in better streaming performance with the current simple wapi, and there is value in a more complex API that allows things like seek() and versioning (i.e. WebDAV), but extending the wapi to do this chunked streaming is a "sour spot" in the trade-off which uglifies the wapi and enables only a little bit of added functionality.
Another reason that I'm unhappy about this decision is that code to handle the current wapi in streaming fashion already exists and works:
http://twistedmatrix.com/trac/browser/branches/web2-new-stream-1937-2
Brian wrote "twisted.web2 won't be ready for production use for a while", but I'm skeptical about what this "ready for production use" actually means concretely -- I think it has more to do with the Twisted project not having working release automation and volunteers to do release management than with there actually being bugs that would prevent that code from sufficing for this ticket.
:-(
After much discussion and prioritizing, we've decided to back down from this
goal, and put this project on hold for a month or more.
The problem that we hoped to solve with this feature was that native apps
that use Tahoe through a FUSE plugin could behave badly if the close() call
took a long time to finish. A secondary goal was to make the OS's built-in
progress bar (for drag-and-drop copies) more accurate. There are three basic
approaches we can take:
1. streaming: push the data over the slow link during write(). Even if we're using a helper, close() still has to flush the backlog at about 3MBps, so it isn't instantaneous, and if some windows app has a 30-second timeout on close(), this still limits us to 90MB files. Also this kind of streaming means that we must give up convergence. Progress bars are fairly accurate. Close means close.
2. the current behavior: do the whole upload inside close(). write() is fast, close() is slow. Progress bar is wrong. Close means close.
3. write cache: write() is fast, close() is fast, apps are happy, progress bar is wrong, close means "we'll work on it".
We decided that approach 3 was the way to go. We plan to implement sync() in
the FUSE layer to block until the write cache is empty (at least on systems
where it exists.. we aren't yet sure if the SMB protocol that windows-FUSE
uses provides such a call). Backup apps are likely to use something like
sync() to be sure the data is really flushed out, and therefore they ought to
be safe (although they might enforce some other sort of timeout on sync(),
who knows).
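A toy sketch of approach 3's contract (a fast write(), a close() that lies, and a sync() that blocks until the cache is really flushed), assuming a background thread drains the cache to the grid; none of this is actual Tahoe or FUSE code:

```python
# Toy write cache illustrating approach 3.
import queue
import threading

class WriteCache:
    """write() and close() return immediately; sync() blocks until every
    buffered chunk has actually been pushed to the grid."""

    def __init__(self, upload_chunk):
        self._q = queue.Queue()
        self._upload_chunk = upload_chunk   # the slow push over the link
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, data: bytes):
        self._q.put(data)                   # fast: just buffer locally

    def close(self):
        pass                                # fast: "we'll work on it"

    def sync(self):
        self._q.join()                      # block until the cache drains

    def _drain(self):
        while True:
            chunk = self._q.get()
            try:
                self._upload_chunk(chunk)   # the slow part
            finally:
                self._q.task_done()
```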
We'll use a separate progress indication mechanism (a toolbar icon?) to let
the user know that the write cache is non-empty, and that therefore they
should not shut down their computer quite yet. The FUSE plugin should be able
to display status information about its cache and an ETA of how long it will
take to finish pushing.
This also ties in to the dirnode batching. If we're batching directory
additions to make them go faster, we're doing write caching anyways, and have
already committed to making the close() call lie about its completion status.
We may consider exposing tahoe's current-operation progress information in a
machine-readable format to the FUSE plugin, so it can include that status in
its own. To make this accurate, we need to add some sort of "task-id" (a
unique number) to each webapi request. These task-ids can then be put in the
JSON status output web page, so the FUSE plugin can correlate the tasks.
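For instance, each entry in the JSON status output might carry the task-id along these lines; the field names here are invented to illustrate the correlation idea, not an actual Tahoe format:

```python
# Hypothetical JSON status entry (shown as a Python dict).
status_entry = {
    "task-id": 1423,               # unique number per webapi request
    "type": "upload",
    "total-size": 104857600,
    "bytes-pushed": 31457280,      # lets the FUSE plugin show progress/ETA
    "status": "pushing ciphertext to helper",
}
```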
re: twisted.web2 not being ready for a while:

When we asked the twisted.web IRC folks last week, we identified the following problems:

- nobody has stepped up to fix nevow, despite the offer of money
- nobody has stepped up to release web2, despite the offer of money. web2 is in a strange place, where its existence is inhibiting work on web1, and the existence of web1 is inhibiting work on web2.
- although web2's existing stream mechanism is (to my mind) well-designed, the consensus among the twisted folks was that it wasn't worth using, and that the code from that web2-new-stream branch might be better. The fact that there exist two functional streaming mechanisms and that the twisted community hasn't settled upon either of them makes me even less confident that web2 will be released any time soon. (it feels like they're arguing about things that don't need fixing). I may be completely wrong about this one, though.
Using an unreleased copy of twisted.web2 is difficult, because python's
import mechanism makes it hard to have your twisted.internet come from one
place and your twisted.web2 come from somewhere else. (setuptools "namespace
packages" are one attempt to solve this, as is the divmod "combinator", and
both appear to be pretty ugly hacks).
So I think the easiest approach would be to make a private copy of web2 in the allmydata tree, perhaps under allmydata.tw_web2. To do this, we'd have to touch most of the 103 .py files and change their import statements to pull from allmydata.tw_web2.FOO instead of twisted.web2.FOO. This would make it
difficult to apply later upstream patches, although we might get lucky and
'darcs replace' could do much of the work for us. However, I don't trust
'darcs replace' to do this correctly in the long term: I think each upstream
update would need to be applied by hand and the results carefully inspected.
We'd have to play darcs games (i.e. maintain a separate web2-tracking repo
and merge its contents into the tahoe one with some directory-renaming
patches) to enable ongoing updates. And we'd have to add 876kB of an external
library to the Tahoe source tree, which is already much larger than I'd
prefer.
The best outcome would be if the twisted folks made up their mind about web2,
made a release, and then made a release of Twisted that included it. Then we
could simply declare a dependency upon Twisted-2.6.0 or Twisted-8.0 or
whatever they're going to call it this week and we'd be done. But that's
certainly not going to happen before we ship 1.0 in a week, and I don't
believe it is going to happen within the next three months either.
So, I'm glad that we were able to decide to punt on the streaming features,
because I didn't see a happy way to implement them in a single PUT or POST,
and I too did not like the multiple-POST app-visible approach described
above.
Brian, your summary is good. One thing you overlooked is the option of shipping our own entire twisted including twisted.web2, thus avoiding renaming issues.
Also, please be more specific about what you fear might go wrong with using darcs replace. On IRC you said that a potential problem is that the token might not match other uses, for example the token "twisted.web2" wouldn't match "from twisted import web2". This is a valid concern, but I want to be clear that there is nothing buggy or vague or complicated about darcs's replace-token functionality -- you just have to spell out all tokens that you want replaced. There are no funny merge edge cases or anything with token-replace patches.

Thanks! Yes, we could ship all of twisted with tahoe, at a cost of 853 files,
89 directories, and 7.8MB of python code (roughly 8 times larger than Tahoe
itself: 97 files, 7 directories, and 1.2MB in src/allmydata/). In addition,
we would be making it more difficult for users (and developers!) to use any
other version of twisted along with Tahoe.
We are effectively doing this for/to our Mac and Windows users, by virtue of
using py2app/py2exe, for the goal of making a single-file install. For that
purpose, I think it's a win, and I wouldn't mind having a custom version of
twisted in those application bundles. But for developers I think it would be
a loss.
re: 'darcs replace'. My first concern is the set of filenames on which the
operations are performed. I believe that darcs requires you to enumerate the
filenames when you perform the replace command, and later patches could add
files that contain tokens that you want to replace. The 'darcs replace' that replaces twisted.web2 with allmydata.tw_web2 in foo.py, performed in January when we first started the process, will not catch the tokens in the new bar.py that got added in a later version of web2 released in June.
My second (weaker) concern is the variety of forms that the import statement
might take:
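For example, a replace of the token 'twisted.web2' would need to catch forms like these (reconstructed for illustration from the examples discussed just below):

```python
# Import forms a token-based replace would need to catch:
import twisted.web2
import twisted.web2.dav
from twisted.web2 import dav
from twisted.web2.dav import noneprops
from twisted import web2          # the token "twisted.web2" never appears
```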
This mainly depends upon the regexp that darcs uses to define a 'token',
versus non-token boundaries. I think that if you just do 'darcs replace
twisted.web2 allmydata.tw_web2' then it declares '.' to be a yes-token
character, which means it can't be a token-boundary, which means that it
won't be replaced in 'twisted.web2.dav'. But there may be a way to explicitly
tell darcs what you want to use as an is-a-token regexp.
Once Python 2.5 and relative imports are more common, there could be other forms, although again it is unlikely that we'd see 'from ..web2.dav import noneprops', since that would be a dumb equivalent of 'from dav import noneprops'.
I don't believe web2 does dynamically-computed import statements, but I think
nevow does (using twisted.python.reflect.namedAny, for example). These would
also be likely missed by 'darcs replace'.
I haven't used 'darcs replace' enough to be comfortable with it, but I agree
that there is nothing buggy or magical about it.
this isn't going to happen for 1.1.0
We've discussed some of the storage-server protocol changes that would support this, in http://allmydata.org/pipermail/tahoe-dev/2008-May/000630.html
Also #392 (pipeline upload segments) is related.
I mentioned this ticket as one of the most important-to-me improvements that we could make in the Tahoe code: http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
Argh! The lack of this feature just caused me to lose data!
My drive is nearly full on my MacBook Pro. I tried to back up a file to Tahoe so that I could delete that file to make room. While it was uploading, I started editing a very difficult, delicate, emotional letter to the OSI license-discuss mailing list about the Transitive Grace Period Public Licence.
Tahoe tried to make a temporary copy of the large file in order to hash it before uploading it, thus running my system out of disk space and causing the editor that I was using to crash and lose some of the letter I was composing. How frustrating!
The biggest reason why Tahoe doesn't already do streaming uploads was that we liked "hash it before uploading" as a way to achieve convergence, so that successive uploads of the same file by the same person would not waste upload bandwidth and storage space. Now that we have backupdb, that same goal can be handled much more efficiently (most of the time) by backupdb. Hopefully now we can move to proper streaming upload.
#684 is about the part of this in which the client can specify what encryption key to use. There is a patch submitted by Shawn Willden.
(summary changed from "add streaming upload to HTTP interface" to "add streaming (on-line) upload to HTTP interface")

If you love this ticket, you might also like #809 (Measure how segment size affects upload/download speed.) and #398 (allow users to disable use of helper: direct uploads might be faster).
#684 (specifying the encryption key) is wontfixed, but I don't think it would be necessary for this ticket if random keys were used.
Are uploads using a helper streaming?
Replying to jsgf:
Currently the Tahoe-LAFS gateway (storage client) receives the entire file plaintext, writes it out to a temp file on disk (while computing the secure hash of it), then generates an encryption key (using that secure hash), then reads it back from the temp file on disk, encrypting as it goes. This is all the same whether you're using an immutable upload helper or not. The difference is that without the immutable upload helper you also do erasure coding during this second pass while you are doing encryption. With the immutable upload helper you just do the encryption, streaming the ciphertext to the immutable upload helper, which does the erasure coding.
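In sketch form, that two-pass flow looks like the following (spool to a temp file while hashing, derive the key, then re-read and encrypt); SHA-256 and the callables are stand-ins for Tahoe's real tagged-hash and AES machinery:

```python
# Illustrative two-pass CHK upload; not actual Tahoe code.
import hashlib
import tempfile

def two_pass_upload(source, send_ciphertext, encryptor_for_key):
    digest = hashlib.sha256()
    with tempfile.TemporaryFile() as tmp:
        # Pass 1: spool the plaintext to disk while hashing it.
        while True:
            block = source.read(2**16)
            if not block:
                break
            digest.update(block)
            tmp.write(block)
        key = digest.digest()[:16]      # CHK: key derived from the content
        encrypt = encryptor_for_key(key)
        # Pass 2: read the plaintext back, encrypting as we go.
        tmp.seek(0)
        while True:
            block = tmp.read(2**16)
            if not block:
                break
            send_ciphertext(encrypt(block))
    return key
```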
#294 (make the option of random-key encryption available through the wui and cli) was about a related issue. In order to do streaming upload the Tahoe-LAFS gateway will of course have to do random-key encryption. However, I don't think users actually need to have a switch to control random-key encryption as such, so I've closed #294 and marked it as a duplicate of this ticket.
I intend to have a go at this for Tahoe-LAFS v1.8. The part that I'm likely to have the most trouble with is getting access to the first part of the file which has been uploaded from e.g. the web browser to the twisted.web web server before the entire file has been uploaded. There is a longstanding, stale twisted ticket which is in the context of the now abandoned twisted.web2 project:
http://twistedmatrix.com/trac/ticket/1937 # in twisted.web2, change "stream" to use newfangled not yet defined stream api
There may be some other way to get access to the data incrementally before the entire file has been completely uploaded. Help?
Other tickets that we would hopefully also be able to close as part of this work:
In the case of the SFTP frontend, there is no problem with getting at the upload stream, unlike HTTP. So we could implement streaming upload immediately for SFTP at least in some cases (see #1041 for details), if the uploader itself supported it.
Perhaps we should leave this ticket for the issue of getting at the upload stream of an HTTP request in twisted.web (which is what most of the above comments are about), and open a ticket for streaming support in the new uploader. It looks like the current IUploadable interface isn't really suited to streaming (for example it has a get_size method, and it pulls the data when a "push" approach would be more appropriate), so there is some design work to do on that new ticket that is independent of HTTP.

Although I would dearly love to get this ticket fixed, I think we have enough other important issues in front of us for v1.8.0, so I'm moving this into the "soon" Milestone. If you think you can fix this in the next couple of weeks, move it back into the "1.8" Milestone, but then you either have to move an equivalent mass of tickets out of "1.8" or you have to commit to spending an extra strong dose of volunteer energy to get this fixed. ;-)
Replying to davidsarah:
That ticket is #1288.
The correct ticket in the Twisted issue tracker is: http://twistedmatrix.com/trac/ticket/288 (no way to access the data of an upload which is in-progress), not http://twistedmatrix.com/trac/ticket/1937 (in twisted.web2, change "stream" to use newfangled not yet defined stream api).
There is a preliminary patch by exarkun attached to Twisted ticket 288.