crash on cygwin: doWrite on a Port, failure during test_system #31

Closed
opened 2007-05-03 19:14:31 +00:00 by zooko · 10 comments
zooko commented 2007-05-03 19:14:31 +00:00
Owner

allmydata.test.test_system.SystemTest.test_upload_and_download will run for kilosecs, and sometimes seg fault. Also my client won't connect to the introducer.

So something is deeply wrong in cygwin land...

tahoe-lafs added the
code
major
defect
labels 2007-05-03 19:14:31 +00:00
zooko commented 2007-05-11 18:38:02 +00:00
Author
Owner

Ah, I also have Windows native python 2.4.3 installed on this machine which cannot be uninstalled, nor can a new Windows native python package be installed. (In both cases I get installer error 2003.)

So this bug is probably specific to my machine...

tahoe-lafs added
minor
and removed
major
labels 2007-05-11 18:38:02 +00:00
zooko commented 2007-05-11 18:39:40 +00:00
Author
Owner

Actually the error code is 2203, at least on uninstall.

zooko commented 2007-05-11 22:35:47 +00:00
Author
Owner

Fixed my install of Windows. (learned about cacls.exe.)

zooko commented 2007-05-18 16:48:44 +00:00
Author
Owner

So, I just need to boot up my vmware machine that has cygwin and try the unit tests on it now that I've fixed it and then I can close this ticket as invalid...

zooko commented 2007-06-06 05:59:06 +00:00
Author
Owner

Hm, I finally rebooted (so that I could run vmware again) and tried again, but again there is badness. The first time I ran the test it passed in 10s. The second time I ran it it hung. :-(

warner commented 2007-06-06 16:37:54 +00:00
Author
Owner

I've seen an intermittent failure on the cygwin buildslave. The investigation I've done so far suggests that one of the Services (the webserver) is being shut down too early. No idea why... maybe a socket error as it tries to accept() a connection?

warner commented 2007-07-24 23:05:34 +00:00
Author
Owner

Just an update here: we've tracked this down to a bug in Python's poll() support (the select module), which appears to be triggered only under cygwin. Modules/selectmodule.c:poll_poll() has code to skip over file descriptors that have not fired (those with a zero .revents field), but if this code is used, the list that poll() returns will be filled with random garbage.

When Twisted sees the junk file descriptors, it usually ignores them, because it knows that a fileno of two kabillion is not valid. But every once in a while that random garbage looks like a real descriptor. In the case of the test_system failure, it looks like the fileno of one of the listening sockets, and the random junk in the revents field looks like select.POLLOUT, so the reactor tries to do a doWrite to the listening socket, which throws an exception because listening sockets are never writable.
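
To make the failure mode above concrete, here is a minimal sketch of how a poll-based reactor might dispatch (fd, revents) pairs. It is a simplified model, not Twisted's actual pollreactor code: the `ListeningPort` class, the `dispatch` function, and the locally defined `POLLIN`/`POLLOUT` constants are all hypothetical names introduced for illustration.

```python
# Bit flags as commonly defined on Linux; used here only as illustrative values.
POLLIN = 0x001
POLLOUT = 0x004


class ListeningPort:
    """Hypothetical stand-in for a listening socket wrapper."""

    def __init__(self, fd):
        self.fd = fd
        self.accepted = 0

    def doRead(self):
        # A readable listening socket means a connection is waiting.
        self.accepted += 1

    def doWrite(self):
        # Listening sockets are never writable, so this should never be called.
        raise RuntimeError("doWrite called on a listening Port")


def dispatch(events, selectables):
    """Deliver (fd, revents) pairs to registered objects, collecting errors."""
    errors = []
    for fd, revents in events:
        target = selectables.get(fd)
        if target is None:
            # Unknown fd (for example, garbage): silently ignored.
            continue
        try:
            if revents & POLLIN:
                target.doRead()
            if revents & POLLOUT:
                target.doWrite()
        except RuntimeError as exc:
            errors.append((fd, str(exc)))
    return errors
```

In this model, a garbage pair such as `(2000000000, 0x5)` is dropped because no object is registered under that fd, while a garbage pair that happens to read `(7, POLLOUT)` reaches the listening port registered as fd 7 and raises.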

I'm working on a patch now, trying to make it pass the unit tests.

warner commented 2007-07-25 01:16:38 +00:00
Author
Owner

Attachment cygwin-poll.diff (1049 bytes) added

patch for python-2.5.1's Modules/selectmodule.c to work around cygwin bug

warner commented 2007-07-25 01:25:50 +00:00
Author
Owner

More data: cygwin's poll() function sometimes violates POSIX and returns an overly large fd count. The return value is supposed to be equal to the number of pollfd structures that have non-zero .revents fields, but sometimes cygwin's count is too high.

This causes python's selectmodule.c/poll_poll() to overrun the pollfd array, and copy random data into the python list that it returns to the python-side caller of select.poll().poll(). When Twisted's pollreactor sees this, most of the bogus fds are invalid and ignored, but sometimes one of them looks like a real fd and causes the corresponding doWrite or doRead method to be invoked. Usually these are harmless too, but in our case the random fds finally overlapped with a real listening socket (probably because that fd's fileno was sitting in nearby memory), and we hit the exception.

cygwin-poll.diff (attached above) is a patch for Python-2.5.1 to work around the issue by ignoring the return value from poll() and counting the active fds manually. I've applied this patch and rebuilt the select.dll module on our cygwin buildslave, and now the buildbot is green.
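
The overrun and the workaround can be modeled in a few lines of Python. This is only a sketch under stated assumptions: it is neither the C code in selectmodule.c nor the contents of cygwin-poll.diff. The function names are hypothetical, and `memory` is an illustrative list holding the `nfds` registered (fd, revents) slots followed by whatever happens to sit after the array.

```python
def collect_trusting_count(memory, nfds, reported):
    """Model of the overrun: trust `reported` and skip zero-revents slots.

    `nfds` is deliberately unused, mirroring the missing bound check.
    """
    results = []
    i = 0
    for _ in range(reported):
        # Skip to the next "fired" slot, with no check against nfds.
        while memory[i][1] == 0:
            i += 1
        results.append(memory[i])
        i += 1
    return results


def collect_counting_manually(memory, nfds, reported):
    """Model of the workaround: ignore `reported`, scan only the nfds slots."""
    return [(fd, revents) for fd, revents in memory[:nfds] if revents != 0]


# Three registered slots (only fd 7 fired), then two slots of "nearby memory".
memory = [(5, 0), (7, 1), (9, 0), (2000000000, 4), (7, 4)]

# An inflated count of 3 pulls both garbage slots into the result.
overrun = collect_trusting_count(memory, nfds=3, reported=3)

# Counting manually returns only the one slot that really fired.
fixed = collect_counting_manually(memory, nfds=3, reported=3)
```

With a correct count of 1, both functions return the same single entry; they diverge only when the reported count is too high.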

This problem has been reported to the python folks at http://sourceforge.net/tracker/index.php?func=detail&aid=1759997&group_id=5470&atid=105470 along with the patch. Since it's cygwin's buggy poll() that's the root cause, I'm not sure if they'll accept the patch or not.

Zooko said he was tempted to work on a cygwin patch, so I'm going to hold off notifying the cygwin mailing list until we have that patch in hand (or decide that we aren't going to bother).
Either way, I think we understand this issue well enough to close out this bug.

tahoe-lafs added the
fixed
label 2007-07-25 01:25:50 +00:00
warner closed this issue 2007-07-25 01:25:50 +00:00
tahoe-lafs changed title from crash on cygwin to crash on cygwin: doWrite on a Port, failure during test_system 2007-07-25 01:25:50 +00:00
zooko commented 2008-01-19 14:17:04 +00:00
Author
Owner

This bug was fixed in cygwin 1.5.25-7, released 2007-12-17.

Reference: tahoe-lafs/trac-2024-07-25#31