Hello list,

Back in October we had a short discussion about processes getting stuck with the P_WEXIT status flag set. My process is multi-threaded and long-running (as in weeks), and it handles up to 2000 simultaneous SSL connections.

That discussion focused on a potential problem with a thread blocked in accept(): it is not necessarily woken if a second thread close()s the fd out from under it, especially if there are pending connections. The remedy Terry Lambert recommended was to use pthread_kill() to signal the sleeping thread so that it wakes, and to have that thread call pthread_testcancel() explicitly as part of its loop. (This helped remove a non-portable implementation from the code. Thanks, Terry.)

I have re-implemented all connection-handling threads in this manner, using select() with a two-second timeout as my blocking point. I also send SIGUSR1 from the main thread to wake blocked threads (probably overkill, since they will wake within two seconds anyway).

At shutdown, on receiving SIGTERM, the program properly calls pthread_join() on all threads. All sockets have been closed. I have been painstakingly logging each resource's entire life cycle from startup to shutdown, and I have not missed anything: no memory leaks, no thread leaks, no socket leaks.

Even so, after returning from main() (confirmed by a parting call to syslog() just before my "return" statement), the process still occasionally hangs with the P_WEXIT status flag set. I'm convinced this is a kernel bug, since the hang occurs just outside of main().
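For readers following along, here is a minimal sketch of the wake-up pattern described above. It is not the daemon's actual code. The names g_shutdown, install_wake_handler, worker, and stop_worker are hypothetical, and an atomic shutdown flag stands in for the pthread_testcancel() check. It assumes SIGUSR1 is handled by a no-op handler installed without SA_RESTART, so that a blocked select() returns EINTR when signaled.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

// Hypothetical shutdown flag checked by every worker loop.
static std::atomic<bool> g_shutdown{false};

// No-op handler: its only purpose is to interrupt a blocked select().
extern "C" void on_sigusr1(int) {}

// Install the handler without SA_RESTART so select() fails with EINTR.
void install_wake_handler() {
    struct sigaction sa = {};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = on_sigusr1;
    const int rc = sigaction(SIGUSR1, &sa, nullptr);
    assert(rc == 0);
    (void)rc;
}

// Worker loop: block in select() for at most two seconds, then re-check
// the flag. arg points to the file descriptor to watch.
void* worker(void* arg) {
    const int fd = *static_cast<int*>(arg);
    while (!g_shutdown.load()) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fd, &readable);
        timeval timeout = {2, 0};  // select() may modify it, so reset each pass
        const int n = select(fd + 1, &readable, nullptr, nullptr, &timeout);
        if (n < 0) {
            if (errno == EINTR) continue;  // woken by SIGUSR1; re-check flag
            break;                         // unexpected error
        }
        if (n > 0) {
            char buf[256];
            if (read(fd, buf, sizeof buf) <= 0) break;  // EOF or error
            // Real connection handling would go here.
        }
    }
    return nullptr;
}

// Called from the main thread: set the flag, wake the worker, join it.
int stop_worker(pthread_t thread) {
    g_shutdown.store(true);
    pthread_kill(thread, SIGUSR1);
    return pthread_join(thread, nullptr);
}
```

If the signal arrives after the worker has checked the flag but before it enters select(), the wake-up is missed and the thread exits only when the two-second timeout expires. That is the case the timeout covers.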
My process is started and monitored by launchd, but manually started processes hang this way, too. If it's relevant, this is a mostly-POSIX C++ daemon that links with the OpenSSL library. The OS is 10.4.7 ppc on a dual G5 (we are scheduling an update to 10.4.8 soon).

One other thing that I forgot to bring up initially: a while back I had a bug in my use of an STL list that wasn't properly protected by a mutex, owing to my misinterpretation of STL thread safety. This occasionally caused my program to segfault. Oddly, even a segfault would land my program in the P_WEXIT state! (I have since mutexed access to the list.)

Are there any other suggestions? This one is very difficult to troubleshoot, since the hang is outside my code. I'm willing to run ktrace during the shutdown if someone thinks it will help.

Joe Bruni
participants (1)
-
Joseph Oreste Bruni