Adjust the probability of selecting a node according to its storage capacity (or other fitness measure) #872
If the probability of the peer selection algorithm putting a node close to the beginning of the list were proportional to its storage capacity, then that would better tolerate grids with a wide range of node capacities.
With a uniform selection probability, as at present, small-capacity nodes will be expected to receive many requests to store shares that they don't have room for, and to download shares that they don't have.
See http://allmydata.org/pipermail/tahoe-dev/2009-December/003408.html and followups for mailing list discussion.
Also see /tahoe-lafs/trac/issues/26246#comment:-1.
bwahaha, welcome to a big can of worms :)
source:docs/specifications/outline.txt (section 3: "Server Selection
Algorithm, filecap format") is worth reading. It points out the requirement
that all of the uploader's choices are somehow recorded and made available to
the downloader. Or, rather, the downloader's sequence of servers needs to be
"well" correlated with the uploader's sequence.
So any upload-time code which is influenced by things like current remaining
server space will need a way to record its choices (or the information which
went into that choice) in the filecap, so it can influence the downloader in
the same way.
That said, choosing servers according to capacity would serve the purpose of
filling servers at the same time as opposed to filling them at the same rate.
(i.e., all servers become full at the same moment, versus each server sees
the same inbound bytes-per-second rate). If all servers had the same
capacity, these two options would be identical.
Part of the discussion in #302 is about whether this is good, important, or
irrelevant. In general, I think that full-at-the-same-time is good, but I'm
not sure it's actually better than fill-at-the-same-rate. I believe that
maximum reliability occurs when each file has as many choices for servers as
possible, but those options will dwindle over time as servers get full. A
system which probabilistically favors some servers over others (based upon
capacity or whatever) will have fewer choices to work with.
Hm, I think there's a rigorous argument in there somewhere. The entropy of
the server-selection process (given a random storage-index) should be fairly
well-defined. A non-probabilistic algorithm will just give you log2 of
the number of possible choices. A probabilistic algorithm would be like that,
but with each choice weighted by the probability of its selection. (I used to
know this stuff, really I did.. I'll look up my old Information Theory
textbook when I get home).
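(A minimal sketch of that entropy measure, treating the selection as a distribution over the possible server choices for a random storage index; the numbers are illustrative:)

{{{
#!python
from math import log

def selection_entropy(probabilities):
    # Shannon entropy, in bits, of a distribution over the possible
    # server choices for a given storage index.
    return -sum(p * log(p, 2) for p in probabilities if p > 0)

# A non-probabilistic algorithm that picks uniformly among M possible
# choices gives exactly log2(M) bits:
M = 184756  # e.g. C(20,10) ways to choose 10 servers out of 20
print(selection_entropy([1.0 / M] * M))  # ~17.5 bits, == log2(M)

# A capacity-weighted algorithm assigns unequal probabilities to those
# same choices, so its entropy is strictly lower unless all weights are equal.
}}}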
With that sort of definition, we could evaluate different algorithms
according to how well they maximize that entropy. Moreover, the entropy is a
function of the current state of the grid (like how many free servers are
left), and that state will evolve in different ways according to the
algorithm we choose. So we can further evaluate that entropy over time. Any
non-homogeneous grid will see the entropy drop over time, as the grid fills
up and the choices dwindle. We could develop a metric to talk about the
entropy averaged across all files: maybe the best algorithm is the one that
manages the highest average entropy, or perhaps the lowest variance, or
something.
A probabilistic selection algorithm will always have lower per-file entropy
than a non-probabilistic one, given the same number of potential servers
(well, a non-uniform-probability algorithm, to be precise). But if it manages
to preserve server availability longer, then the entropy averaged over the
life of the grid (from empty to full) might be higher. That's probably the
way we should investigate the value of a different algorithm.
Replying to warner:
I agree with the first sentence, but not the second. The "expected full at the same time" property will tend to maximize the number of storage nodes available to accept shares for as long as possible; that's why I believe it is better for file preservation.
There's a fairly straightforward way to change the selection algorithm to achieve this property. Suppose that capacity estimates are multiples of some unit C. If a storage node has capacity estimate eC, we give it e entries in the list to be permuted (there's a way to make this more efficient; see below). That is, permute the list and then discard duplicates later than the first occurrence of a given node. The effect is similar to splitting each storage server into e virtual nodes that share the same disk space, but with the important difference that the upload algorithm will still try not to put multiple shares on a given server.
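(A minimal sketch of that algorithm, assuming capacity estimates are small integers e in units of C; the hashing scheme and names are illustrative, not the actual Tahoe code:)

{{{
#!python
import hashlib

def permuted_servers(storage_index, capacity_estimates):
    # capacity_estimates maps peerid (bytes) -> e, the server's capacity
    # estimate in units of C.  Each server gets e entries in the list to
    # be permuted; after sorting by hash, only the first occurrence of
    # each server is kept.
    entries = []
    for peerid, e in capacity_estimates.items():
        for i in range(e):
            h = hashlib.sha256(storage_index + peerid + b"%d" % i).digest()
            entries.append((h, peerid))
    entries.sort()
    order, seen = [], set()
    for _, peerid in entries:
        if peerid not in seen:
            seen.add(peerid)
            order.append(peerid)
    return order

# e.g. permuted_servers(b"SI1", {b"server-a": 5, b"server-b": 1}) puts
# server-a first about 5/6 of the time, but server-b still appears in
# the permuted list for every file.
}}}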
This means that the capacity estimate of a given storage node can't change, and must be known by the Introducer so that it can tell all other nodes. (The actual capacity can be different to the estimate; that won't cause any greater problems than at present.)
The performance of this algorithm as given above is poor when the sum of all e is large, but it can be improved by selecting the servers using a binary search tree rather than an explicit list. That is, each step of a Fisher-Yates shuffle would choose a random element from the search tree weighted by its capacity estimate, then delete that element from the tree. This is equivalent to using an arbitrarily small C.
I'm not sure that selection entropy is the main issue. The two most important things we want are:
Both of these are affected primarily by the proportion of servers that are available, not by their probability of selection.
Replying to davidsarah (comment:3):
I think there are two levels of reliability in action here. The first and
most important is to avoid doubling-up of shares (the "servers of happiness"
threshold, but really it's strictly >=N servers). Certainly your reliability
drops significantly when you go below this number of available servers.
The second order effect is the decorrelation of per-file server sets, which
is the entropy thing I'm talking about. It only makes sense to talk about
this one after you've ensured that you have the first level for everything.
Imagine that you had 20 servers, the usual 3-of-10 encoding, and the
selection rule was that on even days you used servers 1-10, and on odd days
you used servers 11-20. Each file would have the first kind of reliability
(every file would use N distinct servers). But the second kind of reliability
would be marginal: an attacker who destroys the right 50% of the servers
would completely kill half the files (in fact they could fatally wound half
the files with just 40% of the servers).
In contrast, if each file gets a random selection of all twenty servers, then
there's minimal correlation between the servers used by any two files. An
attacker who destroys servers 1-10 would expect to kill just 2126/184756 =
1.15% of the files.
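(That figure is easy to check: with 3-of-10 encoding, a file is lost when fewer than 3 of its 10 servers survive, i.e. at least 8 of them are among the 10 destroyed:)

{{{
#!python
from math import comb  # Python 3.8+

killed = comb(10, 8) * comb(10, 2) + comb(10, 9) * comb(10, 1) + comb(10, 10)
total = comb(20, 10)
print(killed, total, killed / total)  # 2126 184756 0.0115...
}}}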
So I think the first goal is to keep >=N servers free for as long as possible
(ideally until the very last file fills the grid), but if we can achieve
that, then our second goal should be to maximize the number of ways in which
files are uploaded.
Yeah, I like the simplicity of that. But we need a stable way to inform the
downloader about the capacities we saw, so they can get to the same list.
Maybe a layer of indirection could help: the serverlist is stored in stable,
well-known places that do not depend upon server capacity (and the serverlist
isn't big enough to fill those places much), but the shares can go elsewhere
(to places chosen for the fill-at-the-same-time goal).
I'll agree with all of that. Certainly selection entropy is less important
than the servers-of-happiness (really >=N) criterion. I don't know how it
should compare against download performance.. probably below. I guess I'd put
selection entropy as the third item in your list.
I hope to explain my entropy concept better.
Here's another example of why I think the probabilistic approach needs to be
evaluated against the entropy concept. Imagine that you've got 3-of-10
encoding and 15 servers: 10 big ones and 5 tiny ones. The probabilistic
algorithm will almost always pick the 10 big ones and ignore the 5 tiny ones.
So even though we've nominally got 15 free servers, we rarely actually use
them all. So almost every file we upload will share a server-set
(big1-big10), making them more vulnerable (as a group). The entropy of the
selection algorithm will be nearly zero, since the tiny servers are chosen
with such low probability. The entropy will remain mostly constant over time,
though, since you'll probably fill the tiny servers at the same time as the
big servers, so your choices will remain about the same for the whole time.
Of course, if you send any more traffic towards those tiny servers (such as
if you went for same-rate instead of same-time), they'll fill up sooner than
the big ones, and they'll quickly be full. At that point, the entropy drops
to zero, because you have exactly one option.
Since the servers in this example are not of uniform size, this loss of
entropy is inevitable. There's a finite number of possibilities, and each
byte you upload consumes some of them. A completely homogeneous grid with
uniformly-sized uploads will run out of selection entropy all at the same
time, just as the last file causes the grid to be full. The
entropy-versus-time graph (from t=0 to t=grid-is-full) is flat. For
heterogeneous grids and a non-probabilistic algorithm the graph looks like a
step-wise decrementing function, starting high, dropping a bit after each
server fills up, but flat inside each region (the last plateau is at 0, when
there are exactly N servers left, then there's a region when we're doubling
up shares that we'd represent with a red line or negative numbers or
something else). I think a capacity-sensitive algorithm's graph would look
completely flat: since all servers should fill at the same time, there should
be no steps, but the overall entropy will be lower than if you chose freely
between the initially-available servers.
A flat graph would mean that late-uploaded files are just as good as
early-uploaded files. A decreasing curve means that early files have it
better than late files (or, to be more precise, that a batch of files
uploaded early will have less mutual-correlation than a similar batch
uploaded late: killing a random set of servers would be expected to kill more
of the late files than the early ones).
I suspect that the area under this curve is constant, independent of the
selection algorithm, and that the area is a function of the set of server
capacities. It would be maximal for homogeneous servers.
I'm merely thinking that it might be possible to measure the shape of this
curve for different selection algorithms, and that it's something to keep in
mind when picking one. If my suspicion about those shapes is correct, then
the probabilistic approach seems the "fairest", in that it would give equal
distributions to both early and late files.
I still don't know how to record enough information about the server choices
into the filecap, though. Capacities will change over time, and to make this
work right for the uploader, they'll need to keep changing their
probabilities in response to new "how much space do you have left" updates
from the servers.
Replying to warner (comment:4):
Agreed, but I think that "expected full at the same time" is likely to help with this decorrelation as well. The reason is that if some servers are full -- even if >= N servers are not full -- then the choice of servers has been reduced.
For example, suppose you have N small-capacity servers and N large-capacity servers. If you choose servers uniformly, then all of the small-capacity servers will fill up first, and then the choice of servers for shares of subsequent files will be reduced to N. So by attacking only the N large-capacity servers, the attacker is disproportionately likely to kill the more recently added files. (Reading the later part of your comment, it seems we're in violent agreement on this.)
Note that this is a very likely situation in practice given the tendency to add servers in batches with increasing capacities (as in the case of allmydata); and in that case even rebalancing all shares would not help. With the non-uniform choice, OTOH, then rebalancing would restore the random distribution of all shares (regardless of when their file was originally uploaded) across servers.
The attacker still gets a greater advantage from killing a server with a higher capacity, but only to the extent that we would expect because it holds more shares. When the servers are mostly full, we cannot avoid that property.
The extent of the bias is also limited by the attempt to place only one share on each server when uploading. Suppose you have one server that has 100 times the capacity of all the others. The algorithm I suggested will almost always place one share of each file on that server -- but only one. This seems like reasonable behaviour (even for this unreasonably extreme configuration): it uses the capacity of the supersized server as well as possible without relying on it excessively.
If capacity estimates are fixed, then informing the downloader about them is no more difficult than informing the downloader about public keys. If the actual capacity of a server increases relative to its estimate, then the effect of that is never any worse than the uniform-probability selection. So I think there's a good case for just following the "worse is better" approach of assuming fixed capacity estimates (which makes the selection algorithm stable -- or as stable given server changes as it is now).
Agreed.
The same is true of the uniform algorithm as soon as the tiny servers fill up, which will be soon. The main difference seems to be that the non-uniform algorithm spreads out the non-uniformity in server selection over time.
[...]
You mean it will start at a lower value, I assume? That also matches my intuition (although we should do some simulations), and I think that the flat graph is preferable, because we don't want to favour earlier files at the expense of later ones.
Now there's a bold prediction (that we should be able to test by simulation).
As explained above, I don't think this is necessary.
Replying to davidsarah (comment:3):
I didn't explain this well. The idea is that for each node of the search tree, you keep track of the total weights of its left and right subtrees. That allows you to pick a random node in the tree with probability proportional to its weight, by making depth binary choices where depth ~= log2 n for n servers. Deleting a node also takes depth time, because you have to update the total weights on the ancestors of the deleted node. The overall time is therefore O(n log n).
(This is more like the original Fisher-Yates shuffle than the Durstenfeld variant.)
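(For concreteness, a sketch of that weighted sampling-without-replacement step; this version uses a Fenwick/binary-indexed tree of cumulative weights rather than an explicit search tree, and the names are illustrative:)

{{{
#!python
import random

class WeightedSampler(object):
    # Weighted sampling without replacement: each draw picks index i with
    # probability weights[i] / (remaining total), then removes it.  Both the
    # pick and the removal cost O(log n), so a full weighted permutation
    # costs O(n log n).
    def __init__(self, weights):
        self.n = len(weights)
        self.weights = list(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self._update(i, w)

    def _update(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _total(self):
        i, s = self.n, 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def draw(self):
        target = random.random() * self._total()
        pos, mask = 0, 1
        while mask * 2 <= self.n:
            mask *= 2
        while mask:
            nxt = pos + mask
            if nxt <= self.n and self.tree[nxt] < target:
                pos = nxt
                target -= self.tree[nxt]
            mask //= 2
        # pos is the 0-based index whose cumulative weight range contains target
        self._update(pos, -self.weights[pos])
        self.weights[pos] = 0.0
        return pos

def weighted_permutation(weights):
    sampler = WeightedSampler(weights)
    return [sampler.draw() for _ in range(len(weights))]

# e.g. weighted_permutation([5, 1, 1]) tends to put server 0 near the front.
}}}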
Part of the goal of #543 ('rebalancing manager') is "to smooth out disk usage among all servers (more by percentage than by absolute usage)." This ticket might help by giving such a rebalancer less to do.
Ah, there is another constraint on the shuffle algorithm: it must be approximately stable when servers are added or removed. The existing algorithm (essentially, hash each peerid and sort by the hashes) is stable because adding or removing a server just adds or removes its hash, and the other hashes are sorted in the same order. The first algorithm described in comment:3 is also stable in this sense, because it can be defined in a similar way by hashing the peerid and a small integer. (It's easy to make this compatible with the existing scheme when all capacity estimates are equal.)
However the Fisher-Yates-based algorithm described in comment:6 is not stable in the required sense, and I don't see how to make it so (a pity, because I'd just finished implementing it :-/ ).
I'm not convinced that this stability or "consistent hashing" property is a hard requirement. All of the Tahoe-LAFS grids that have been deployed so far (with one exception) have so few storage servers that most reads query every server. The one exception is the allmydata.com production grid, which has about a hundred servers. It might work just fine to query all one hundred of them on every read, too.
Whether the consistent hashing property is important to real deployments is an empirical measurement question, IMO, and my guess is that for all of the current small grids the answer is "no measurable impact" and for allmydata.com the answer is "measurable impact, but not a critical problem".
Even if this stability property is not critical, it seems that losing it would be an unnecessary regression that might prevent us from scaling up to larger grids.
The original algorithm in comment:3 keeps this property while still meeting the goals of this ticket. I don't think the fact that it is less efficient when (sum of all e) is large would be a serious obstacle. Besides, I have an idea about how to do better, but I'll have to think about it some more.
Replying to davidsarah:
The comment:3 algorithm is equivalent to picking the minimum hash value out of e independent hash values for each server. We can get the same result by taking a single hash, and transforming it so that it follows the same distribution as the minimum of e hashes would have done.
Let Xe be the distribution given by the minimum of e independent uniform distributions U1..e, each on [0, 1). The cumulative distribution function of Xe is given by F_Xe(x) = 1 - (1 - x)^e for x in [0, 1).
Then we can use inverse transform sampling to generate samples from Xe. For that we need the inverse of F_Xe, which is (F_Xe)^-1(y) = 1 - (1 - y)^(1/e).
So if we let y be the peer id hash for a given server scaled to the range [0, 1), and e be its capacity estimate, then sorting according to 1 - (1-y)^(1/e) will give the same distribution of permutations that we would have got by the comment:3 algorithm.
Plotting (F_Xe)^-1 for various e gives results that are intuitively reasonable for this to work: increasing e biases the transformed hash toward lower values that are more likely to be near the start of the list (but for any e, there is still some chance that the server will not be picked).
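(A quick simulation, with made-up weights, showing that the single transformed hash and the minimum of e independent hashes bias the head of the permuted list identically:)

{{{
#!python
import random
from collections import Counter

def first_pick_freq(keyfunc, weights, trials=100000):
    # Empirical frequency with which each server sorts first under keyfunc.
    wins = Counter()
    for _ in range(trials):
        keys = dict((s, keyfunc(e)) for s, e in weights.items())
        wins[min(keys, key=keys.get)] += 1
    return dict((s, wins[s] / float(trials)) for s in weights)

weights = {"A": 1, "B": 2, "C": 5}

# Method 1: the comment:3 algorithm -- minimum of e independent uniform "hashes".
min_of_e = lambda e: min(random.random() for _ in range(e))
# Method 2: one uniform "hash", biased by the inverse CDF of that minimum.
transformed = lambda e: 1 - (1 - random.random()) ** (1.0 / e)

print(first_pick_freq(min_of_e, weights))
print(first_pick_freq(transformed, weights))
# Both should give roughly P(first) = e / sum(e) = 0.125, 0.25, 0.625.
}}}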
Also notice that:
To be more concrete, at source:src/allmydata/storage_client.py#L121 :
would change to something like
using this utility function to convert a binary string to an integer (which inexplicably doesn't seem to be in the stdlib):
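(A minimal sketch of the kind of change being described; the helper and the sorting key shown here are illustrative reconstructions, not the actual storage_client.py code:)

{{{
#!python
import hashlib

def bytes_to_int(b):
    # The sort of helper referred to above: interpret a byte string as a
    # big-endian unsigned integer.
    result = 0
    for byte in bytearray(b):
        result = (result << 8) | byte
    return result

def permutation_key(peer_selection_index, peerid, capacity_estimate):
    # Hash the (storage index, peer id) pair as the existing algorithm does,
    # scale the hash into [0, 1), then bias it with the inverse CDF
    # 1 - (1 - y)**(1/e) so that higher-capacity servers tend to sort earlier.
    digest = hashlib.sha1(peer_selection_index + peerid).digest()
    y = bytes_to_int(digest) / float(1 << (8 * len(digest)))
    return 1 - (1 - y) ** (1.0 / capacity_estimate)

# Servers would then be permuted by sorting on this key instead of on the
# raw hash, e.g.:
#   servers.sort(key=lambda s: permutation_key(index, s.peerid, s.capacity_estimate))
}}}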
Incidentally, I know that Python floating point arithmetic might not give exactly the same results between machines. That shouldn't matter because it can only have the effect of swapping two servers next to each other in the permuted order, which we should be tolerant of.
neat!
The stability of this all still depends upon the stability of the capacity
estimates, right? I gather you've been assuming that any given server would
publish its total capacity in a fixed record, along with its nodeid. I've
been assuming that each server would publish its current-remaining-capacity
periodically, in a record that changes over time, like the one that contains
the location hints and version info.
I like the adaptiveness of schemes that keep publishing updates as remaining
space dwindles. There will be a lot of random noise in our traffic rates, and
if these rates are adjusted over time to match the remaining space, then
we'll get a nice feedback loop to compensate for accidental fluctuations.
Also, server operators are likely to add or remove space at various times,
and it would be nice to be able to adapt to that.
But I don't know how to build a system with all of these nice properties at
once: smooth filling of servers by percentage instead of rate, stable
ordering of servers between upload time and download time, short filecaps,
minimal auxiliary information (like an explicit serverlist stored in some
intermediate location).
Good argument. (Zooko and I have discussed download-time query flooding
before, and we usually tend to land on the same sides each time). I don't yet
know how to scale tahoe up to millions of nodes, but I think it will be
important to have a well-defined and stable place to find your shares (even
if you have to do O(log(N)) queries to find them). Giving up on that now, by
requiring an O(N) search, feels like it will make that sort of scaling much
much harder.
Maybe we should discuss the properties of a protocol with an intermediate
step. I wrote up some of this in #599. The idea would be that upload-time
could place shares anywhere it likes (to achieve good load-balancing, or
geographical diversity, or ideal download bandwidth, whatever), but it would
then write down a list of which servers got used, and store that list in a
specific (well-known, stable) set of places.
Download reliability would depend upon having both one copy of the sharelist
available and >=k shares. But the list should be small enough to let us
afford 1-of-N encoding and have lots of copies, so the sharelist's impact on
reliability should be dwarfed by the share's impact (if you can get 10 or 20
dBA better, it'll be lost in the noise).
However, repairers and rebalancers need to participate in the protocol: they
have to find and update the sharelist. And we have to quantify how much of a
problem it would be for the sharelist to be wrong, because some of the
sharelist-holding servers might be unavailable when you go to move or create
some shares. It's effectively adding a bit of mutable state to your
normally-immutable shares, with all the CAP Theorem consequences that
entails.
#599 suggests putting list of "where are the other shares" hints on each
share, which would turn your download algorithm into "search normally for the
first share, then use the hints to accelerate the search for the others".
This would get rid of the potential reliability penalty (since you get
fate-sharing between sharelists and shares), but couldn't accommodate
completely arbitrary share placement: you'd have to find at least one share
before you found all the other ones. So it might help improve performance on
large grids (where, due to regular churn, you might normally have to query
hundreds or thousands of servers to find enough shares), but still wouldn't
really permit the use of fancy load-balancing share-placement algorithms like
what we're discussing here.
Replying to warner:
Thanks.
Increasing the capacity estimate of a node can only move it nearer the start of the list for any given file (storage index). Similarly, decreasing the capacity estimate of a node can only move it further from the start of the list for any given file. I think this is the strongest stability property that could be expected.
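(A quick way to see this: for a fixed hash y in (0, 1), the transformed key 1 - (1-y)^(1/e) decreases as e increases, so a larger capacity estimate can only give a server a smaller key, i.e. an earlier position in the permuted list:)

{{{
#!python
y = 0.7
for e in (1, 2, 5, 100):
    print(e, 1 - (1 - y) ** (1.0 / e))
# 0.7, 0.452..., 0.214..., 0.0119... -- monotonically decreasing in e
}}}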
(When I said "the peer id hash for a given server" in comment:11, I actually meant the hash of the peer id and the storage id. The properties above hold because the sample biasing is done after computing the hash, and doesn't affect its input.)
Using the remaining capacity rather than the total capacity would make the peer selection less stable. It should still be quite stable while the grid is not close to being full (since servers will tend to fill at a rate proportional to their initial capacity), but when any given server is nearly full, its remaining space relative to other servers would no longer be a good approximation to its initial capacity relative to other servers.
Yes. I'm not sure yet whether this outweighs the stability issue.
That could work if file caps contain an epoch number, and if the history of all remaining capacities at each epoch can be obtained by all nodes. However,
The epoch number scheme has most of these properties, with the caveats above, but the auxiliary information is the full history of remaining capacities (or fitness values; see below) at each epoch.
The nodes don't need to maintain connections to all other nodes, so that's not the scaling constraint. The obvious scaling constraint is the size of the location info for other nodes. When you get to the point where that information takes an unreasonable amount of memory, you can split the network into subgrids, with each subgrid having a set of supernodes (which should be the most available nodes in each subgrid). Then each node only needs to know the locations of all the supernodes, and each supernode only needs to know the locations of other nodes within its subgrid. This creates a small-world network in which any node is at most two hops from any other. So, you can scale to roughly the square of the number of nodes that would otherwise be feasible: use the permuted list algorithm to pick the supernodes for a given file, and have them use the algorithm again to route to the actual storage servers.
But this isn't the right ticket to discuss scaling to many nodes; that would be #235.
I think that the epoch scheme is a refinement of that. Note that the bias doesn't have to be by capacity; it could use any fitness function. Using a single fitness value for each server wouldn't give you geographic diversity, but it would allow biasing by bandwidth etc.
Yes. The size of a history of fitness values need not be much greater than the location and public key hash info, as long as there are not too many epochs.
Count me as a skeptic of the relevance of the CAP theorem; it depends on a very strong notion of consistency. In any case, if the auxiliary information is the history of fitness values, then that history is only extended, not changed, so we don't really have mutable state.
I'm not sure this has a significant advantage over the epoch number approach. It has the same disadvantages wrt. convergent immutable files, although it wouldn't necessarily leak information about when a file was uploaded.
Fun analogy for anyone who knows image processing: if the cdf of the uniform distribution corresponds to a greyscale ramp, then applying the '1 - (1-x)^e' bias is effectively applying a gamma function to lighten or darken it, increasing the chance of picking a lighter or darker shade.
On the p2p-hackers list, Tony Arcieri wrote: