scale up to many nodes #235

Open
opened 2007-12-18 16:52:32 +00:00 by zooko · 3 comments
zooko commented 2007-12-18 16:52:32 +00:00
Owner

I updated [The UseCases Page](wiki/UseCases) to reflect that someone might want to run a managed Tahoe grid comprising one thousand nodes. (If each node has a single 1 TB hard drive, that's a 1 PB grid. Obviously there are lots of other options, such as each node having six 1 TB hard drives in a RAID-6 configuration, resulting in 4 usable TB per node or a 4 PB grid.)
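For concreteness, the capacity arithmetic as a quick sketch (the node count, drive size, and RAID-6 overhead are just the illustrative numbers from the paragraph above):

```python
def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """RAID-6 reserves two drives' worth of capacity for parity."""
    return (drives - 2) * drive_tb

NODES = 1000

# Option 1: one 1 TB drive per node -> 1 PB grid.
print(NODES * 1.0 / 1000, "PB")

# Option 2: six 1 TB drives per node in RAID-6 -> 4 usable TB/node, 4 PB grid.
print(NODES * raid6_usable_tb(6, 1.0) / 1000, "PB")
```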

Anyway, we expect that the current Tahoe code would have problems handling that many simultaneously connected nodes. One known problem is that pyOpenSSL uses almost 1 MB of RAM per SSL connection. (See also #11.)

This ticket can be closed when Tahoe is demonstrated to handle one thousand simultaneously connected nodes smoothly.

tahoe-lafs added the code-network, major, enhancement, 0.7.0 labels 2007-12-18 16:52:32 +00:00
tahoe-lafs added this to the eventually milestone 2007-12-18 16:52:32 +00:00
tahoe-lafs modified the milestone from eventually to undecided 2008-06-01 21:04:59 +00:00
davidsarah commented 2010-01-03 02:56:00 +00:00
Author
Owner

From ticket:872#comment:16:

>> I don't yet know how to scale tahoe up to millions of nodes, but I think it will be important to have a well-defined and stable place to find your shares (even if you have to do O(log(N)) queries to find them).

> The nodes don't need to maintain connections to all other nodes, so that's not the scaling constraint. The obvious scaling constraint is the size of the location info for other nodes. When you get to the point where that information takes an unreasonable amount of memory, you can split the network into subgrids, with each subgrid having a set of supernodes (which should be the most available nodes in each subgrid). Then each node only needs to know the locations of all the supernodes, and each supernode only needs to know the locations of other nodes within its subgrid. This creates a small-world network in which any node is at most two hops from any other. So, you can scale to roughly the square of the number of nodes that would otherwise be feasible: use the permuted-list algorithm to pick the supernodes for a given file, and have them use the algorithm again to route to the actual storage servers.

> Note that this shouldn't be needed in order to scale to 1000 nodes -- the size of the location and public key info for 1000 nodes should easily be small enough to fit into memory. Do we need another ticket for scaling to grids with hundreds of thousands of nodes, or am I being too prematurely ambitious? :-)
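A minimal sketch of that two-hop selection, assuming Tahoe's usual permuted-list order (nodes sorted by the hash of the storage index concatenated with each node ID); the function names, the use of SHA-256, and the flat `subgrids` mapping are illustrative assumptions, not Tahoe's actual API:

```python
from hashlib import sha256

def permuted(ids, storage_index: bytes):
    """Permuted-list order: sort node IDs by H(storage_index + id)."""
    return sorted(ids, key=lambda i: sha256(storage_index + i).digest())

def route(storage_index: bytes, supernode_ids, subgrids, k: int):
    """Two hops: permute the supernode list to pick k supernodes, then
    have each chosen supernode apply the same algorithm to the servers
    in its own subgrid.  A node needs location info only for the
    supernodes plus its own subgrid."""
    targets = []
    for sn in permuted(supernode_ids, storage_index)[:k]:
        # Second hop: the supernode permutes its own subgrid's servers.
        targets.append(permuted(subgrids[sn], storage_index)[0])
    return targets

# Toy example: 3 subgrids of 4 servers each.
subgrids = {b"sn%d" % i: [b"s%d-%d" % (i, j) for j in range(4)]
            for i in range(3)}
print(route(b"\x01" * 16, list(subgrids), subgrids, k=2))
```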

warner commented 2010-01-06 07:36:57 +00:00
Author
Owner

There are a number of hurdles to scale up to lots of nodes. This ticket is sort of a reminder to enumerate some of them, or record known limitations and potential solutions.

The limitation alluded to in the summary is probably the first hurdle: Foolscap, at least, was observed (around the time this ticket was created, some two years ago) to consume roughly 1 MB of RAM per open connection, which is unreasonable. I seem to remember doing some analysis and deciding that pyOpenSSL was to blame rather than Foolscap, but that was a long time ago, and the tests should be run again before putting too much energy into it. There's no good reason for it to use that much memory: the connection state and buffers should fit into a couple of kilobytes.
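Before chasing it, a rough re-measurement would be worthwhile. Something like the following could confirm or refute the ~1 MB figure, with caveats: it is a sketch, Unix-only (it uses `resource` for RSS), it assumes a modern pyOpenSSL (`SSL.TLS_METHOD`), and it only measures idle connection objects, not connections that have completed a handshake and filled their buffers:

```python
# Rough per-connection memory probe; requires pyOpenSSL, Unix-only.
import resource
from OpenSSL import SSL

def rss_kb() -> int:
    # ru_maxrss is kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

N = 1000
ctx = SSL.Context(SSL.TLS_METHOD)
before = rss_kb()
# Allocate N connection objects (SSL state, no sockets, no handshake).
conns = [SSL.Connection(ctx) for _ in range(N)]
after = rss_kb()
print("approx %.1f KB per idle connection" % ((after - before) / N))
```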

The next hurdle will be the current practice of maintaining open connections to all known storage servers. If we left the protocols alone, we could change this to open connections on-demand, but that would incur a significant per-file latency (for both upload and download), and of course things like file-check and mutable-file publish would become really really slow because both want to query lots of servers. So changing the peer-selection protocols would probably be necessary to effectively remove this limitation.
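As a sketch of what connect-on-demand could look like (hypothetical names and a deliberately naive eviction policy; Foolscap's real connection management is more involved), note where the latency cost lands: the first operation to touch a server pays the dial.

```python
import time
from typing import Callable, Dict

class OnDemandConnections:
    """Open a connection only when a server is first needed, and cache it.
    Every cache miss adds connection-setup latency to that file operation."""

    def __init__(self, connect: Callable[[str], object], max_open: int = 50):
        self._connect = connect              # e.g. TCP + TLS dial
        self._cache: Dict[str, object] = {}
        self._max_open = max_open

    def get(self, server_id: str):
        conn = self._cache.get(server_id)
        if conn is None:
            if len(self._cache) >= self._max_open:
                # Evict an arbitrary idle connection (LRU in a real design).
                self._cache.pop(next(iter(self._cache))).close()
            conn = self._connect(server_id)  # the added per-file latency
            self._cache[server_id] = conn
        return conn

# Toy demo: each miss costs a simulated 50 ms dial; hits are nearly free.
class FakeConn:
    def close(self): pass

def dial(server_id: str) -> FakeConn:
    time.sleep(0.05)
    return FakeConn()

pool = OnDemandConnections(dial)
t0 = time.time(); pool.get("server-1")
print("miss: %.0f ms" % ((time.time() - t0) * 1e3))
t0 = time.time(); pool.get("server-1")
print("hit:  %.3f ms" % ((time.time() - t0) * 1e3))
```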

davidsarah commented 2010-01-16 00:10:07 +00:00
Author
Owner

If you like this ticket, you might like #444 (reduce number of active connections: connect on demand).
