Timeout of Servermap Update #1138

Open
opened 2010-07-28 06:29:57 +00:00 by eurekafag · 3 comments
eurekafag commented 2010-07-28 06:29:57 +00:00
Owner

I wrote to the mailing list but no one answered. I hope this is a better place for such a report.

I have a problem with updating directories. Sometimes a server drops out of
the network for various reasons but still appears connected on the welcome
page. When I try to access any directory, my node starts a servermap update,
which can be a very long operation during which all work with directories
hangs. It takes from 10 to 14 minutes, and I think that's unacceptable for
such a system. Where can I set the timeout for this? I set up:

```
timeout.keepalive = 60
timeout.disconnect = 300
```

But this doesn't help. One of the servers suddenly lost its internet
connection, and this is what I got when accessing one of the directories
(there were 4 servers and only 3 requests succeeded):

```
Started: 15:11:01 26-Jul-2010
Finished: 15:25:51 26-Jul-2010
Storage Index: o4wudooveaakzeqdhvspjsjgtm
Helper?: No
Progress: 100.0%
Status: Finished

Update Results

Timings:
 Total: 14 minutes
 Initial Queries: 4.5ms
 Cumulative Verify: 1.0ms
Per-Server Response Times:
 [nfyml6h3]: 3.4ms
 [t726lxet]: 67ms
 [35jflwuq]: 32ms
```
Almost 15 minutes! That becomes critical if you mount a directory via
sshfs+sftpd: any process that lists the mounted directory may get stuck, and
the only options are to kill sshfs (you can't kill the process itself, it's
in the D+ state) or to wait, not knowing for how long. Please point me to the
right timeout option for such requests! 10 to 30 seconds would be very nice.
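
For reference, these options would normally sit in the `[node]` section of `tahoe.cfg`, roughly like this (a minimal sketch; unrelated settings omitted, values in seconds):

```
[node]
# low-level foolscap connection timeouts, in seconds
timeout.keepalive = 60
timeout.disconnect = 300
```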

tahoe-lafs added the
unknown
major
defect
1.7.1
labels 2010-07-28 06:29:57 +00:00
tahoe-lafs added this to the undecided milestone 2010-07-28 06:29:57 +00:00
zooko commented 2010-07-29 04:57:20 +00:00
Author
Owner

Thank you for the bug report. I think you are right that a ticket is a better way to get answers about this kind of thing than a mailing list message. (Although sending a mailing list message first is often a good way to start.)

Our long-term plan to fix this problem is to make it so that uploads don't wait for a long time to get a response from a specific server but instead fail-over to another server. That's #873 (upload: tolerate lost or unacceptably slow servers).

In the short-term, I'm not sure why setting the foolscap timeouts did not cause the upload to complete (whether successfully or with a failure) more quickly. It is a mystery to me. Perhaps someone else could dig into the upload code and figure it out. One potentially productive way to do that would be to add more diagnostics to the status page, showing which requests to servers are currently outstanding and for how long they have been outstanding.
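
As a rough illustration of the kind of diagnostic meant here (a hypothetical sketch, not actual Tahoe-LAFS code; the class and method names are made up), the status page could record a timestamp for every remote call that has been sent but not yet answered, and display how long each has been pending:

```python
import time

class OutstandingRequestTracker:
    """Hypothetical sketch (not actual Tahoe-LAFS code): remember when each
    remote call was sent so the status page can show how long any
    still-unanswered request has been pending."""

    def __init__(self):
        self._pending = {}   # (serverid, methodname) -> time the call was sent

    def started(self, serverid, methodname):
        self._pending[(serverid, methodname)] = time.time()

    def finished(self, serverid, methodname):
        self._pending.pop((serverid, methodname), None)

    def report(self):
        """Return (serverid, methodname, seconds_outstanding) tuples,
        longest-outstanding first."""
        now = time.time()
        return sorted(
            ((sid, meth, now - sent)
             for (sid, meth), sent in self._pending.items()),
            key=lambda item: item[2],
            reverse=True,
        )

# The status page could render report() as lines like:
#   [t726lxet] read request: outstanding for 412.3s
```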

davidsarah commented 2010-07-30 06:19:02 +00:00
Author
Owner

Replying to zooko:

> Our long-term plan to fix this problem is to make it so that uploads don't wait for a long time to get a response from a specific server but instead fail-over to another server. That's #873 (upload: tolerate lost or unacceptably slow servers).

From the description, the case at hand seems to be mutable download.

Will this be addressed by the New Downloader, or does that only handle immutable download?

tahoe-lafs added
code-network
and removed
unknown
labels 2010-07-30 06:19:02 +00:00
tahoe-lafs modified the milestone from undecided to soon 2010-07-30 06:19:02 +00:00
warner commented 2010-08-06 06:26:38 +00:00
Author
Owner

The #798 new-downloader is only for immutable files, sorry.

The `timeout.disconnect` timer, due to the low-overhead way it is implemented, may take up to twice the value to finally sever a connection. So a value of 300 could take up to 10 minutes to disconnect the server connection. But it shouldn't have let the connection stay up for 14 minutes. Two ideas come to mind: the `timeout.disconnect` clause might have been in the wrong section (it should be in the `[node]` section), or there might have been other traffic on that connection that kept it alive (but not the response to the mutable read query). Neither seems likely: the only way I can imagine traffic keeping it alive is if the server were having weird out-of-memory or hardware errors and dropped one request while accepting others (we've seen things like this happen before, but it was on a server that had run out of memory).
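
To see where the "up to twice the value" comes from, here is a simplified sketch of a low-overhead idle timer in that general style (illustrative only, not foolscap's actual implementation): the check runs once per interval and only inspects a flag, so a connection that goes quiet just after a check survives that whole interval plus the next one before it is dropped.

```python
import threading

class IdleDisconnecter:
    """Illustrative sketch only -- not foolscap's real code.

    A periodic check fires every `timeout` seconds and looks at a single
    flag, so a connection can survive anywhere between 1x and 2x `timeout`
    after its last traffic before `disconnect()` is called.
    """

    def __init__(self, timeout, disconnect):
        self.timeout = timeout
        self.disconnect = disconnect   # callback that severs the connection
        self.saw_traffic = False
        self._schedule()

    def data_received(self):
        # called by the protocol whenever any bytes arrive
        self.saw_traffic = True

    def _schedule(self):
        self._timer = threading.Timer(self.timeout, self._check)
        self._timer.daemon = True
        self._timer.start()

    def _check(self):
        if self.saw_traffic:
            # some traffic arrived during the last interval: reset, keep going
            self.saw_traffic = False
            self._schedule()
        else:
            # a full interval passed with no traffic at all: sever the link.
            # Traffic that arrived just after the previous check still counts
            # for that whole interval, hence the up-to-2x delay.
            self.disconnect()
```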

It might help to collect some log information from your box after it does this. If you go to the front "Welcome" page, there's a button at the bottom that says "Report An Incident". Push that, and a few seconds later, a new "flogfile" will appear in your BASEDIR/logs/incidents/ directory. Upload and attach that file here: it will contain a record of the important events that occurred up to the moment you hit the button. We're looking for information about any messages sent to the lost server. If there's something weird like an out-of-memory condition, this might show up in the logs.
