add notes on load testing

[Imported from Trac: page Performance, version 21]
warner 2007-12-19 00:46:10 +00:00
parent 9ce3cfc7a1
commit d191d4b47c

@@ -103,6 +103,8 @@ upload speed is the constant per-file overhead, and the FEC expansion factor.
## Storage Servers
### storage index count
ext3 (on tahoebs1) refuses to create more than 32000 subdirectories in a
single parent directory. In 0.5.1, this appears as a limit on the number of
buckets (one per storage index) that any [StorageServer](StorageServer) can hold. A simple
@@ -124,3 +126,49 @@ server design (perhaps with a database to locate shares).
I was unable to measure a consistent slowdown resulting from having 30000
buckets in a single storage server.
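The ext3 limit mentioned above is easy to reproduce outside of Tahoe. A
minimal sketch (Linux-specific; point it at a scratch directory on the
filesystem under test) that counts how many subdirectories one parent
directory will accept:

```python
# Sketch: count how many subdirectories one parent directory will accept.
# On ext3 this loop typically stops around 32000 with OSError (EMLINK),
# because each subdirectory consumes a hard link on the parent.
import errno
import os
import sys

parent = sys.argv[1]  # a scratch directory on the filesystem under test
count = 0
try:
    while True:
        os.mkdir(os.path.join(parent, "d%08d" % count))
        count += 1
except OSError as e:
    if e.errno != errno.EMLINK:
        raise
print("created %d subdirectories before hitting the limit" % count)
```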
## System Load
The source:src/allmydata/test/check_load.py tool can be used to generate
random upload/download traffic, to see how much load a Tahoe grid imposes on
its hosts.
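For reference, the two operations such a load generator exercises are plain
webapi requests. Here is a minimal sketch of one upload and one download; it
is not the check_load.py interface itself, and it assumes a client node with
its webapi listening at http://127.0.0.1:3456 (adjust to your node's
configured web port):

```python
# Sketch: one upload and one download through the Tahoe webapi.
# PUT /uri with the file body returns a capability string; GET /uri/<cap>
# retrieves the file contents again.
import os
import urllib.request

GATEWAY = "http://127.0.0.1:3456"  # assumed webapi address

def upload(data: bytes) -> str:
    req = urllib.request.Request(GATEWAY + "/uri", data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("ascii").strip()  # the file's read-cap

def download(cap: str) -> bytes:
    with urllib.request.urlopen(GATEWAY + "/uri/" + cap) as resp:
        return resp.read()

if __name__ == "__main__":
    cap = upload(os.urandom(10 * 1024))
    assert len(download(cap)) == 10 * 1024
```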
Preliminary results on the Allmydata test grid (14 storage servers spread
across four machines, each a roughly 3 GHz P4, plus two web servers): we used
three check_load.py clients running with a 100ms delay between requests, an
80%-download/20%-upload traffic mix, and file sizes distributed exponentially
with a mean of 10kB. Together these three clients achieved about 8-15kBps of
download traffic and 2.5kBps of upload traffic, performing about one download
per second and 0.25 uploads per second. These rates were higher at the
beginning of the run (when the directories were smaller and thus faster to
traverse).
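That traffic mix is straightforward to reproduce. A sketch of the client
loop, reusing the hypothetical `upload`/`download` helpers from the previous
example and the parameters quoted above (80/20 split, 10kB mean size, 100ms
delay):

```python
# Sketch: 80% downloads / 20% uploads, exponentially distributed file
# sizes with a 10kB mean, and a 100ms pause between requests.
import os
import random
import time

MEAN_SIZE = 10 * 1000      # bytes
DOWNLOAD_FRACTION = 0.8
DELAY = 0.100              # seconds between requests

caps = []                  # read-caps of files uploaded so far

while True:
    if caps and random.random() < DOWNLOAD_FRACTION:
        download(random.choice(caps))
    else:
        size = int(random.expovariate(1.0 / MEAN_SIZE))
        caps.append(upload(os.urandom(size)))
    time.sleep(DELAY)
```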
The storage servers were minimally loaded. Each storage node was consuming
about 9% of its CPU at the start of the test, 5% at the end. These nodes were
receiving about 50kbps throughout, and sending 50kbps initially (increasing
to 150kbps as the dirnodes got larger). Memory usage was trivial, about 35MB
[VmSize](VmSize) per node, 25MB RSS. The load average on a box hosting four storage nodes was about 0.3.
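The memory and load figures above come from standard Linux sources and are
easy to re-collect. A sketch, assuming Linux /proc and a known node PID:

```python
# Sketch: sample VmSize/VmRSS for a node process and the host's load average.
import os

def vm_stats(pid):
    """Return (VmSize, VmRSS) in kB, parsed from /proc/<pid>/status."""
    stats = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                stats[key] = int(value.split()[0])  # value is "<n> kB"
    return stats["VmSize"], stats["VmRSS"]

vmsize, vmrss = vm_stats(os.getpid())     # or a storage/web node's PID
one_min_load = os.getloadavg()[0]
print("VmSize=%dkB VmRSS=%dkB load=%.2f" % (vmsize, vmrss, one_min_load))
```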
The two machines serving as web servers (performing all encryption, hashing,
and erasure-coding) were the most heavily loaded. The clients distributed
their requests randomly between the two web servers. Each server averaged
60%-80% CPU usage. Memory consumption was minor: 37MB [VmSize](VmSize) and
29MB RSS on one server, 45MB/33MB on the other. The load average grew from
about 0.6 at the start of the test to about 0.8 at the end. Outbound network
traffic (including both client-side plaintext and server-side shares) held at
about 600Kbps for the whole test, while inbound traffic started at 200Kbps
and rose to about 1Mbps by the end.
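The gap between plaintext traffic and share traffic is what the FEC expansion
factor predicts. A back-of-the-envelope sketch, assuming for illustration a
k-of-N encoding of 3-of-10 (the rates below are hypothetical, not
measurements from this test):

```python
# Sketch: relate plaintext upload rate to outbound share traffic for
# k-of-N erasure coding.  Each plaintext byte becomes roughly N/k bytes
# of shares (plus hashes and other per-share overhead, ignored here).
k, N = 3, 10                      # assumed encoding parameters
expansion = N / k                 # ~3.33x

plaintext_upload_kbps = 100       # hypothetical plaintext rate into a web server
share_output_kbps = plaintext_upload_kbps * expansion
print("%.0f kbps of plaintext -> roughly %.0f kbps of shares"
      % (plaintext_upload_kbps, share_output_kbps))
```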
### initial conclusions
So far, Tahoe is scaling as designed: the client nodes are the ones doing
most of the work, since they are the easiest to scale. In a deployment where
central machines are doing the encoding work, CPU on those machines will be
the first bottleneck. Profiling can be used to determine how the upload
process might be optimized: we don't yet know whether encryption, hashing, or
encoding is the primary CPU consumer. We can change the upload/download ratio
to examine upload and download separately.
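One low-effort way to answer the encryption-vs-hashing-vs-encoding question
is to profile a single upload. A sketch using the standard-library profiler;
`do_one_upload` is a hypothetical stand-in for whatever performs one upload
in your harness, and the profile only covers work done inside the process
being profiled, so to see encryption/hashing/encoding costs it would have to
run inside the web/encoding node rather than in the external load generator:

```python
# Sketch: profile one upload and print the 20 most expensive call sites
# by cumulative time, to see where the CPU goes (encryption, hashing,
# FEC encoding, or something else).
import cProfile
import pstats

def profile_once(do_one_upload):
    prof = cProfile.Profile()
    prof.runcall(do_one_upload)
    pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```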
Deploying large networks in which clients do not do their own encoding will
require provisioning enough CPU on the central encoding machines. Storage
servers use minimal CPU, so having every storage server also act as a
web/encoding server is a natural approach.