occasional failure in iputil (timeout in test_runner): use 'netifaces' package? #532

Closed
opened 2008-11-03 22:00:13 +00:00 by warner · 8 comments
warner commented 2008-11-03 22:00:13 +00:00
Owner

I'm seeing very occasional failures in the allmydata.test.test_runner.RunNode.test_introducer test. To reproduce it, in one shell I run:

run_to_death.pl 'make quicktest TEST=allmydata.test.test_runner.RunNode.test_introducer'

(where run_to_death.pl is a little perl script I've got to just keep running the same command over and over again until the exit status is nonzero)

while in another shell I slow things down by doing python -c "while 1: pass".

This usually fails after about 10 minutes.
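For readers without the perl script, the "run it until it fails" loop can be sketched in Python. This is only an illustration of the idea, not the actual run_to_death.pl; the function name `run_until_failure` and the `max_runs` parameter are assumptions made for this example.

```python
import subprocess


def run_until_failure(command, max_runs=None):
    """Run a shell command repeatedly until it exits with a nonzero status.

    Returns (runs, status): the number of runs performed and the exit status
    of the last one. With max_runs set, gives up after that many successful
    runs and returns (max_runs, 0). This is a hypothetical stand-in for
    run_to_death.pl, not a port of it.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        # shell=True so the command can be given as one quoted string,
        # the same way it is passed to run_to_death.pl above.
        status = subprocess.call(command, shell=True)
        if status != 0:
            return runs, status
    return runs, 0
```

Calling `run_until_failure("make quicktest TEST=allmydata.test.test_runner.RunNode.test_introducer")` would then loop until the test fails.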

The actual failure is a timeout. It appears that the iputil.py routine that uses reactor.spawnProcess to run /sbin/ifconfig (to figure out which interfaces are available and therefore what local IP addresses we should advertise) just plain fails: the Deferred never fires. My hunch is that somehow the SIGCHLD handler is broken, so the child process has finished but the parent doesn't notice.

This doesn't happen frequently enough to really worry about, but some day it'd be nice to fix it.

One possibility is to switch to the 'python-netifaces' tool, which unfortunately has compiled C code, but which claims to be fairly cross-platform and probably doesn't require a separate command to be spawned.
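A rough sketch of what an in-process lookup might look like with that package. The helper name `local_ipv4_addresses` and the choice to skip loopback addresses are assumptions for illustration, not how iputil.py works. The module is passed in as a parameter so the helper can be exercised without the C extension installed; the only calls it relies on are `interfaces()`, `ifaddresses()`, and the `AF_INET` constant.

```python
def local_ipv4_addresses(nif):
    """Collect non-loopback IPv4 addresses from a netifaces-like module.

    `nif` is expected to provide interfaces(), ifaddresses(name), and
    AF_INET, as the netifaces package does. No child process is spawned.
    """
    addresses = set()
    for name in nif.interfaces():
        # ifaddresses() maps address family -> list of dicts; an interface
        # with no IPv4 address simply has no AF_INET key.
        for entry in nif.ifaddresses(name).get(nif.AF_INET, []):
            addr = entry.get("addr")
            if addr and not addr.startswith("127."):
                addresses.add(addr)
    return sorted(addresses)
```

With the package installed, usage would be `import netifaces; local_ipv4_addresses(netifaces)`.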

tahoe-lafs added the code, minor, defect, 1.2.0 labels 2008-11-03 22:00:13 +00:00
tahoe-lafs added this to the undecided milestone 2008-11-03 22:00:13 +00:00
zooko commented 2008-11-03 22:18:29 +00:00
Author
Owner

So if your hunch is correct then this reveals the existence of a bug in Twisted?

warner commented 2008-11-03 23:30:54 +00:00
Author
Owner

seems plausible, yes. A smaller test case (which I don't quite have the time to build right now) would be to just run the /sbin/ifconfig command via reactor.spawnProcess, gathering but mostly ignoring the output, and then see if that can be made to fail.

zooko commented 2009-01-18 15:42:12 +00:00
Author
Owner

Should trial --until-failure allmydata.test.test_runner.RunNode.test_introducer also trigger the bug, then?

I'm running trial --until-failure pyutil.test.test_iputil.

zooko commented 2009-01-18 17:35:44 +00:00
Author
Owner

trial --until-failure pyutil.test.test_iputil wasn't able to reproduce this failure after about an hour of running. I'll try Brian's script next.

zooko commented 2009-01-18 17:37:55 +00:00
Author
Owner

Okay now I'm running this script:

time ( /bin/true; while [ $? = 0 ] ; do trial pyutil.test.test_iputil; done ) &> x.txt
zooko commented 2009-01-19 02:45:01 +00:00
Author
Owner

Okay, I let that script run all day and it didn't fail. Also the workstation (yukyuk) was loaded down with other jobs at the same time.

warner commented 2009-06-21 20:22:02 +00:00
Author
Owner

I ran this test in a loop on a loaded box for a while and it didn't fail either, so maybe it's been fixed in whatever new version of Twisted I'm using now. It sounds like we can let this one go. Closing as "works for me".

tahoe-lafs added the worksforme label 2009-06-21 20:22:02 +00:00
tahoe-lafs modified the milestone from undecided to 1.5.0 2009-06-21 20:22:02 +00:00
warner closed this issue 2009-06-21 20:22:02 +00:00
davidsarah commented 2012-12-16 15:22:50 +00:00
Author
Owner

Note that there definitely are nondeterministic bugs due to how we spawn the command for iputil; see #1381. I think that bug would not cause a timeout, though.

Reference: tahoe-lafs/trac-2024-07-25#532