From d191d4b47cb27434b91387d8cd23ca2fbaa9a9c5 Mon Sep 17 00:00:00 2001
From: warner <>
Date: Wed, 19 Dec 2007 00:46:10 +0000
Subject: [PATCH] add notes on load testing

[Imported from Trac: page Performance, version 21]
---
 Performance.md | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/Performance.md b/Performance.md
index 21bf651..a221da2 100644
--- a/Performance.md
+++ b/Performance.md
@@ -103,6 +103,8 @@ upload speed is the constant per-file overhead, and the FEC expansion factor.
 
 ## Storage Servers
 
+### storage index count
+
 ext3 (on tahoebs1) refuses to create more than 32000 subdirectories in a
 single parent directory. In 0.5.1, this appears as a limit on the number of
 buckets (one per storage index) that any [StorageServer](StorageServer) can hold. A simple
@@ -124,3 +126,118 @@ server design (perhaps with a database to locate shares).
 
 I was unable to measure a consistent slowdown resulting from having 30000
 buckets in a single storage server.
+
+## System Load
+
+The source:src/allmydata/test/check_load.py tool can be used to generate
+random upload/download traffic, to see how much load a Tahoe grid imposes on
+its hosts.
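+
+As a rough illustration of what such a traffic generator does, the loop
+below mixes uploads and downloads in the same proportions used for the test
+described next. It is a simplified sketch, not check_load.py itself: the
+webapi address and the PUT /uri / GET /uri/CAP requests are assumptions and
+may not match the webapi of the version measured here.
+
+```python
+# Hedged sketch of a check_load.py-style traffic generator (not the real tool).
+import os
+import random
+import time
+import urllib.request
+
+NODE_URL = "http://127.0.0.1:3456"  # assumed webapi address of a client node
+DELAY = 0.1                         # 100ms between requests
+UPLOAD_FRACTION = 0.2               # 80% downloads / 20% uploads
+MEAN_SIZE = 10 * 1000               # exponential file sizes, mean 10kB
+
+caps = []  # capabilities of the files uploaded so far
+
+def upload():
+    # pick an exponentially-distributed size, upload random bytes, keep the cap
+    size = int(random.expovariate(1.0 / MEAN_SIZE))
+    data = os.urandom(size)
+    req = urllib.request.Request(NODE_URL + "/uri", data=data, method="PUT")
+    caps.append(urllib.request.urlopen(req).read().strip().decode("ascii"))
+
+def download():
+    # fetch a previously-uploaded file by its capability
+    urllib.request.urlopen(NODE_URL + "/uri/" + random.choice(caps)).read()
+
+while True:
+    if not caps or random.random() < UPLOAD_FRACTION:
+        upload()
+    else:
+        download()
+    time.sleep(DELAY)
+```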
+
+Preliminary results on the Allmydata test grid (14 storage servers spread
+across four machines, each a roughly 3GHz Pentium 4, plus two web servers):
+we ran three check_load.py clients with a 100ms delay between requests, an
+80%-download/20%-upload traffic mix, and file sizes distributed
+exponentially with a mean of 10kB. These three clients together saw about
+8-15kBps of download traffic and 2.5kBps of upload traffic, performing
+roughly one download per second and 0.25 uploads per second. These rates
+were higher at the beginning of the run, when the directories were smaller
+and thus faster to traverse.
+
+The storage servers were minimally loaded. Each storage node consumed about
+9% of its CPU at the start of the test and 5% at the end. These nodes
+received about 50kbps throughout, and sent 50kbps initially, rising to
+150kbps as the dirnodes grew larger. Memory usage was trivial: about 35MB
+VmSize per node and 25MB RSS. The load average on a box hosting four
+storage nodes was about 0.3.
+
+The two machines serving as web servers (performing all encryption,
+hashing, and erasure coding) were the most heavily loaded; the clients
+distribute their requests randomly between the two. Each web server
+averaged 60%-80% CPU usage. Memory consumption was minor: 37MB VmSize and
+29MB RSS on one server, 45MB/33MB on the other. The load average grew from
+about 0.6 at the start of the test to about 0.8 at the end. Outbound
+network traffic (including both client-side plaintext and server-side
+shares) held at about 600kbps for the whole test, while inbound traffic
+started at 200kbps and rose to about 1Mbps by the end.
+
+### initial conclusions
+
+So far, Tahoe is scaling as designed: the client nodes are doing most of
+the work, since these are the easiest to scale. In a deployment where
+central machines perform the encoding work, CPU on those machines will be
+the first bottleneck. Profiling (see the sketch at the end of this
+section) can be used to determine how the upload process might be
+optimized: we do not yet know whether encryption, hashing, or erasure
+coding is the primary CPU consumer. We can also change the upload/download
+ratio to examine upload and download separately.
+
+Deploying large networks in which clients do not perform their own
+encoding will require sufficient CPU resources on the central machines.
+Storage servers use minimal CPU, so having every storage server double as
+a web/encoding server is a natural approach.
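+
+To follow up on the profiling suggestion above, here is one way the node's
+upload path could be examined with the standard-library profiler. The
+do_one_upload() function is a hypothetical stand-in, not an existing Tahoe
+API; it would need to be replaced with whatever drives a single upload
+through the webapi node, since that node is where the CPU time is spent.
+
+```python
+# Hedged sketch: profile repeated uploads to see where the CPU time goes.
+import cProfile
+import pstats
+
+def do_one_upload():
+    # hypothetical stand-in: replace with code that performs one upload
+    # through the node (e.g. an HTTP PUT to its webapi)
+    pass
+
+profiler = cProfile.Profile()
+profiler.enable()
+for _ in range(100):
+    do_one_upload()
+profiler.disable()
+
+# Sort by cumulative time to see whether encryption, hashing, or erasure
+# coding dominates the upload path.
+pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
+```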