add notes on load testing

[Imported from Trac: page Performance, version 21]
warner 2007-12-19 00:46:10 +00:00
parent 9ce3cfc7a1
commit d191d4b47c

@@ -103,6 +103,8 @@ upload speed is the constant per-file overhead, and the FEC expansion factor.
## Storage Servers
### storage index count
ext3 (on tahoebs1) refuses to create more than 32000 subdirectories in a
single parent directory. In 0.5.1, this appears as a limit on the number of
buckets (one per storage index) that any [StorageServer](StorageServer) can hold. A simple
@@ -124,3 +126,49 @@ server design (perhaps with a database to locate shares).
I was unable to measure a consistent slowdown resulting from having 30000
buckets in a single storage server.
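The ext3 limit mentioned above is easy to reproduce outside of Tahoe. A
minimal sketch (Linux-specific; point it at a scratch directory on the
filesystem under test) that counts how many subdirectories one parent
directory will accept:

```python
# Sketch: count how many subdirectories one parent directory will accept.
# On ext3 this loop typically stops around 32000 with OSError (EMLINK),
# because each subdirectory consumes a hard link on the parent.
import errno
import os
import sys

parent = sys.argv[1]  # a scratch directory on the filesystem under test
count = 0
try:
    while True:
        os.mkdir(os.path.join(parent, "d%08d" % count))
        count += 1
except OSError as e:
    if e.errno != errno.EMLINK:
        raise
print("created %d subdirectories before hitting the limit" % count)
```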
## System Load
The source:src/allmydata/test/check_load.py tool can be used to generate
random upload/download traffic, to see how much load a Tahoe grid imposes on
its hosts.
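For reference, the two operations such a load generator exercises are plain
webapi requests. Here is a minimal sketch of one upload and one download; it
is not the check_load.py interface itself, and it assumes a client node with
its webapi listening at http://127.0.0.1:3456 (adjust to your node's
configured web port):

```python
# Sketch: one upload and one download through the Tahoe webapi.
# PUT /uri with the file body returns a capability string; GET /uri/<cap>
# retrieves the file contents again.
import os
import urllib.request

GATEWAY = "http://127.0.0.1:3456"  # assumed webapi address

def upload(data: bytes) -> str:
    req = urllib.request.Request(GATEWAY + "/uri", data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("ascii").strip()  # the file's read-cap

def download(cap: str) -> bytes:
    with urllib.request.urlopen(GATEWAY + "/uri/" + cap) as resp:
        return resp.read()

if __name__ == "__main__":
    cap = upload(os.urandom(10 * 1024))
    assert len(download(cap)) == 10 * 1024
```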
Preliminary results on the Allmydata test grid (14 storage servers spread
across four machines, each a roughly 3 GHz P4, plus two web servers): we used
three check_load.py clients running with a 100ms delay between requests, an
80%-download/20%-upload traffic mix, and file sizes distributed exponentially
with a mean of 10kB. Together these three clients achieved about 8-15kBps of
download traffic and 2.5kBps of upload traffic, performing about one download
per second and 0.25 uploads per second. These rates were higher at the
beginning of the run (when the directories were smaller and thus faster to
traverse).
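That traffic mix is straightforward to reproduce. A sketch of the client
loop, reusing the hypothetical `upload`/`download` helpers from the previous
example and the parameters quoted above (80/20 split, 10kB mean size, 100ms
delay):

```python
# Sketch: 80% downloads / 20% uploads, exponentially distributed file
# sizes with a 10kB mean, and a 100ms pause between requests.
import os
import random
import time

MEAN_SIZE = 10 * 1000      # bytes
DOWNLOAD_FRACTION = 0.8
DELAY = 0.100              # seconds between requests

caps = []                  # read-caps of files uploaded so far

while True:
    if caps and random.random() < DOWNLOAD_FRACTION:
        download(random.choice(caps))
    else:
        size = int(random.expovariate(1.0 / MEAN_SIZE))
        caps.append(upload(os.urandom(size)))
    time.sleep(DELAY)
```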
The storage servers were minimally loaded. Each storage node was consuming
about 9% of its CPU at the start of the test, 5% at the end. These nodes were
receiving about 50kbps throughout, and sending 50kbps initially (increasing
to 150kbps as the dirnodes got larger). Memory usage was trivial, about 35MB
[VmSize](VmSize) per node, 25MB RSS. The load average on a box hosting four storage nodes was about 0.3.
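The memory and load figures above come from standard Linux sources and are
easy to re-collect. A sketch, assuming Linux /proc and a known node PID:

```python
# Sketch: sample VmSize/VmRSS for a node process and the host's load average.
import os

def vm_stats(pid):
    """Return (VmSize, VmRSS) in kB, parsed from /proc/<pid>/status."""
    stats = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                stats[key] = int(value.split()[0])  # value is "<n> kB"
    return stats["VmSize"], stats["VmRSS"]

vmsize, vmrss = vm_stats(os.getpid())     # or a storage/web node's PID
one_min_load = os.getloadavg()[0]
print("VmSize=%dkB VmRSS=%dkB load=%.2f" % (vmsize, vmrss, one_min_load))
```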
The two machines serving as web servers (performing all encryption, hashing,
and erasure-coding) were the most heavily loaded. The clients distributed
their requests randomly between the two web servers. Each server averaged
60%-80% CPU usage. Memory consumption was minor: 37MB [VmSize](VmSize) and
29MB RSS on one server, 45MB/33MB on the other. The load average grew from
about 0.6 at the start of the test to about 0.8 at the end. Outbound network
traffic (including both client-side plaintext and server-side shares) held at
about 600Kbps for the whole test, while inbound traffic started at 200Kbps
and rose to about 1Mbps by the end.
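The gap between plaintext traffic and share traffic is what the FEC expansion
factor predicts. A back-of-the-envelope sketch, assuming for illustration a
k-of-N encoding of 3-of-10 (the rates below are hypothetical, not
measurements from this test):

```python
# Sketch: relate plaintext upload rate to outbound share traffic for
# k-of-N erasure coding.  Each plaintext byte becomes roughly N/k bytes
# of shares (plus hashes and other per-share overhead, ignored here).
k, N = 3, 10                      # assumed encoding parameters
expansion = N / k                 # ~3.33x

plaintext_upload_kbps = 100       # hypothetical plaintext rate into a web server
share_output_kbps = plaintext_upload_kbps * expansion
print("%.0f kbps of plaintext -> roughly %.0f kbps of shares"
      % (plaintext_upload_kbps, share_output_kbps))
```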
### initial conclusions
So far, Tahoe is scaling as designed: the client nodes are the ones doing
most of the work, since they are the easiest to scale. In a deployment where
central machines are doing the encoding work, CPU on those machines will be
the first bottleneck. Profiling can be used to determine how the upload
process might be optimized: we don't yet know whether encryption, hashing, or
encoding is the primary CPU consumer. We can change the upload/download ratio
to examine upload and download separately.
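One low-effort way to answer the encryption-vs-hashing-vs-encoding question
is to profile a single upload. A sketch using the standard-library profiler;
`do_one_upload` is a hypothetical stand-in for whatever performs one upload
in your harness, and the profile only covers work done inside the process
being profiled, so to see encryption/hashing/encoding costs it would have to
run inside the web/encoding node rather than in the external load generator:

```python
# Sketch: profile one upload and print the 20 most expensive call sites
# by cumulative time, to see where the CPU goes (encryption, hashing,
# FEC encoding, or something else).
import cProfile
import pstats

def profile_once(do_one_upload):
    prof = cProfile.Profile()
    prof.runcall(do_one_upload)
    pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```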
Deploying large networks in which clients do not do their own encoding will
require provisioning enough CPU on the central encoding machines. Storage
servers use minimal CPU, so having every storage server also act as a
web/encoding server is a natural approach.