P_WEXIT
- Subject: P_WEXIT
- From: Joseph Oreste Bruni <email@hidden>
- Date: Wed, 13 Dec 2006 11:45:34 -0700
Hello list,
Back in October we had a short discussion involving processes getting
stuck with a P_WEXIT status flag. My process is multi-threaded and
long-running (as in weeks) and handles up to 2000 simultaneous SSL
connections.
Our discussion focused on the potential problem with a thread
blocking on a call to accept() and how it is not necessarily woken if
a second thread close()s the fd out from under the sleeping thread,
especially if there are pending connections. The remedy recommended
by Terry Lambert was to use pthread_kill() to signal the sleeping
thread so that it wakes and can then explicitly call
pthread_testcancel() as part of its loop. (This helped to remove a
non-portable implementation in the code. Thanks, Terry.)
I have re-implemented all connection-handling threads in this manner,
using select() as my blocking point with a timeout of two seconds. I
also send SIGUSR1 from the main thread to wake blocked threads
(probably overkill, since they will wake within two seconds anyway).
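A minimal sketch of the wake-up scheme described above, assuming a
C++11 compiler. The names g_shutdown, wake_handler,
install_wake_handler, worker and stop_worker are hypothetical and are
not taken from the actual daemon; only the select() timeout of two
seconds, SIGUSR1, pthread_kill(), pthread_testcancel() and
pthread_join() come from the description.

```cpp
#include <atomic>
#include <cerrno>
#include <csignal>
#include <pthread.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical shutdown flag, set by the main thread (e.g. on SIGTERM).
static std::atomic<bool> g_shutdown(false);

// Empty handler: its only purpose is to make a blocked select()
// return early with EINTR.
extern "C" void wake_handler(int) {}

// Install the SIGUSR1 handler without SA_RESTART.
bool install_wake_handler() {
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = wake_handler;
    return sigaction(SIGUSR1, &sa, nullptr) == 0;
}

// Worker loop: blocks in select() for at most two seconds per pass.
// arg points to the descriptor to watch (assumed to outlive the thread).
extern "C" void* worker(void* arg) {
    int fd = *static_cast<int*>(arg);
    while (!g_shutdown.load()) {
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(fd, &rd);
        struct timeval tv = {2, 0};   // two-second timeout
        int n = select(fd + 1, &rd, nullptr, nullptr, &tv);
        if (n < 0 && errno != EINTR)
            break;                    // real error: leave the loop
        pthread_testcancel();         // explicit cancellation point
        if (n > 0 && FD_ISSET(fd, &rd)) {
            // accept() or service the connection here
        }
    }
    return nullptr;
}

// Main-thread side: request shutdown, nudge the worker, then join it.
// If the signal lands between the flag check and select(), the worker
// still notices the flag once the two-second timeout expires.
void stop_worker(pthread_t tid) {
    g_shutdown.store(true);
    pthread_kill(tid, SIGUSR1);
    pthread_join(tid, nullptr);
}
```

In this sketch the main thread would call install_wake_handler()
once at startup, pthread_create() each worker with a pointer to its
descriptor, and stop_worker() for each thread at shutdown.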
At shutdown, on receiving a SIGTERM, the program properly calls
pthread_join() on all threads. All sockets have been closed. I
have been painstakingly logging each resource's entire life-cycle
from startup to shutdown and I have not missed anything. No memory
leaks, no thread leaks, no socket leaks.
Even so, the process occasionally hangs with a P_WEXIT status flag
after returning from main(), as confirmed by a parting call to
syslog() just before my "return" statement.
I'm convinced this is a kernel bug since I'm hanging just outside of
main(). My process is started and monitored using launchd, but
manually started processes hang this way, too.
If it's relevant, this is a mostly-POSIX C++ daemon which links with
the OpenSSL library. The OS is 10.4.7 ppc on a dual G5 (we are
scheduling to update to 10.4.8 soon).
One other thing I'd like to bring up that I forgot initially: A while
back I had a bug in the usage of an STL list that wasn't properly
protected by a mutex, based on my misinterpretation of STL and
thread-safety. This was occasionally causing my program to segfault.
Oddly, even a segfault would land my program in the P_WEXIT state!!
(I have since guarded access to the list with a mutex.)
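For illustration, a minimal sketch of what mutex-guarded access to an
STL list can look like, assuming a C++11 compiler. GuardedList is a
hypothetical wrapper, not the code from the daemon; the point is only
that every operation on the shared std::list takes the same lock.

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical wrapper: all access to the list goes through one mutex.
template <typename T>
class GuardedList {
public:
    // Append a copy of v while holding the lock.
    void push_back(const T& v) {
        std::lock_guard<std::mutex> lk(m_);
        items_.push_back(v);
    }

    // Remove the first element into out; returns false if the list is
    // empty. The emptiness check and the pop happen under one lock.
    bool pop_front(T& out) {
        std::lock_guard<std::mutex> lk(m_);
        if (items_.empty())
            return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

    // Number of elements at the moment of the call.
    std::size_t size() const {
        std::lock_guard<std::mutex> lk(m_);
        return items_.size();
    }

private:
    mutable std::mutex m_;
    std::list<T> items_;
};
```

Iterators are deliberately not exposed here, since an iterator handed
out of the wrapper could be used after the lock has been released.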
Are there any other suggestions? This one is very difficult to
troubleshoot since the hang is outside my code. I'm willing to do a
ktrace during the shutdown if someone thinks it will help.
Joe Bruni
Darwin-kernel mailing list (email@hidden)