crash on cygwin: doWrite on a Port, failure during test_system #31
Reference: tahoe-lafs/trac-2024-07-25#31
allmydata.test.test_system.SystemTest.test_upload_and_download will run for kilosecs, and sometimes seg fault. Also my client won't connect to the introducer.
So something is deeply wrong in cygwin land...
Ah, I also have Windows native python 2.4.3 installed on this machine which cannot be uninstalled, nor can a new Windows native python package be installed. (In both cases I get installer error 2003.)
So this bug is probably specific to my machine...
Actually the error code is 2203, at least on uninstall.
Fixed my install of Windows. (learned about cacls.exe.)
So, I just need to boot up my vmware machine that has cygwin and try the unit tests on it now that I've fixed it and then I can close this ticket as invalid...
Hm, I finally rebooted (so that I could run vmware again) and tried again, but again there is badness. The first time I ran the test it passed in 10s. The second time I ran it it hung. :-(
I've seen an intermittent failure on the cygwin buildslave. The investigation I've done so far suggests that one of the Services is being shut down (the webserver) too early. No idea why.. maybe a socket error as it tries to accept() a connection?
Just an update here: we've tracked this down to a bug in the poll() support in Python's select module, which appears to be triggered only under cygwin. Modules/selectmodule.c:poll_poll() has code to skip over file descriptors that have not fired (those with a zero .revents field), but when that skipping code is used, the list that poll() returns ends up filled with random garbage.
When Twisted sees the junk file descriptors, it usually ignores them, because it knows that a fileno of two kabillion is not valid. But every once in a while that random garbage looks like a real descriptor. In the case of the test_system failure, it looks like the fileno of one of the listening sockets, and the random junk in the revents field looks like select.POLLOUT, so the reactor tries to do a doWrite to the listening socket, which throws an exception because listening sockets are never writable.
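To illustrate the kind of sanity check that would reject such a junk event, here is a minimal Python sketch. It is not Twisted's actual reactor code: `filter_poll_results` and the `registered` dict (fd to registered event mask) are hypothetical names made up for this example, and the numeric fallbacks for the poll constants are the usual Linux values, assumed only for platforms where the `select` module lacks them.

```python
import select

# Poll event constants; the fallbacks are the common Linux values (an assumption).
POLLIN = getattr(select, "POLLIN", 0x001)
POLLOUT = getattr(select, "POLLOUT", 0x004)
POLLERR = getattr(select, "POLLERR", 0x008)
POLLHUP = getattr(select, "POLLHUP", 0x010)
POLLNVAL = getattr(select, "POLLNVAL", 0x020)

# poll() may report these conditions even if they were never requested.
ALWAYS_REPORTED = POLLERR | POLLHUP | POLLNVAL


def filter_poll_results(results, registered):
    """Drop (fd, revents) pairs that cannot be genuine.

    results    -- list of (fd, revents) pairs as returned by a poll object
    registered -- hypothetical dict mapping each registered fd to its event mask
    """
    clean = []
    for fd, revents in results:
        mask = registered.get(fd)
        if mask is None:
            # fd was never registered, e.g. a garbage value like 2000000000
            continue
        # Keep only the bits that were requested or are always reportable.
        plausible = revents & (mask | ALWAYS_REPORTED)
        if plausible:
            clean.append((fd, plausible))
    return clean
```

With a listening socket registered only for POLLIN, a bogus POLLOUT event for that fd is dropped by this check, so no doWrite would be attempted on it.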
I'm working on a patch now, trying to make it pass the unit tests.
Attachment cygwin-poll.diff (1049 bytes) added
patch for python-2.5.1's Modules/selectmodule.c to work around cygwin bug
More data: cygwin's poll() function sometimes violates POSIX and returns an overly-large fd count. The return value is supposed to be equal to the number of pollfd structures that have non-zero .revents fields, but sometimes cygwin's count is too high.
This causes python's selectmodule.c/poll_poll() to overrun the pollfd array, and copy random data into the python list that it returns to the python-side caller of select.poll().poll(). When Twisted's pollreactor sees this, most of the bogus fds are invalid and ignored, but sometimes one of them looks like a real fd and causes the corresponding doWrite or doRead method to be invoked. Usually these are harmless too, but in our case the random fds finally overlapped with a real listening socket (probably because that fd's fileno was sitting in nearby memory), and we hit the exception.
The attached cygwin-poll.diff is a patch for Python-2.5.1 to work around the issue, by ignoring the return value from poll() and counting the active fds manually. I've applied this patch and rebuilt the select.dll module on our cygwin buildslave, and now the buildbot is green.
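The difference between trusting the count and counting manually can be modelled in a few lines of Python. This is only a rough model of the idea, not the attached C patch: the pollfd array is represented as a list of (fd, revents) tuples, and both function names are hypothetical. Where the C loop would read past the end of the array and pick up garbage, the Python model raises IndexError instead.

```python
def collect_trusting_count(ufds, reported_count):
    """Model of the vulnerable loop: keep skipping unfired entries until
    `reported_count` fired entries have been collected."""
    ready = []
    i = 0
    for _ in range(reported_count):
        # In C this can walk past the end of the array when the count is
        # too high; Python raises IndexError at that point instead.
        while ufds[i][1] == 0:
            i += 1
        ready.append(ufds[i])
        i += 1
    return ready


def collect_by_scanning(ufds, reported_count=None):
    """Model of the workaround: ignore the reported count and scan every
    entry, keeping those whose revents field is non-zero."""
    return [(fd, revents) for fd, revents in ufds if revents != 0]
```

For example, with `ufds = [(3, 0), (4, 1), (5, 0), (6, 4)]` and an inflated count of 3, `collect_by_scanning` still returns only the two entries that fired, while `collect_trusting_count` runs off the end of the list.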
This problem has been reported to the python folks at http://sourceforge.net/tracker/index.php?func=detail&aid=1759997&group_id=5470&atid=105470 along with the patch. Since it's cygwin's buggy poll() that's the root cause, I'm not sure if they'll accept the patch or not.
Zooko said he was tempted to work on a cygwin patch, so I'm going to hold off notifying the cygwin mailing list until we have that patch in hand (or decide that we aren't going to bother).
Either way, I think we understand this issue well enough to close out this bug.
Title changed from "crash on cygwin" to "crash on cygwin: doWrite on a Port, failure during test_system"
This bug was fixed in cygwin 1.5.25-7, released 2007-12-17.