Re: P_WEXIT
- Subject: Re: P_WEXIT
- From: Terry Lambert <email@hidden>
- Date: Fri, 18 Aug 2006 14:43:50 -0700
On Aug 15, 2006, at 10:15 PM, Joseph Oreste Bruni wrote:
When it receives either a TERM or an INT, the main thread breaks
from its loop and proceeds to cancel all threads and close any
listener sockets (one AF_INET and two AF_UNIX sockets). The main
thread performs a join on every thread.
I have a single "listen" thread that blocks on accept() waiting for
a connection. Since accept() is not a cancelation point, I have
accept() wrapped in a loop that includes an explicit
pthread_testcancel(). There is no way to exit from this loop other than at
pthread_testcancel().
The main thread that is intending to shut everything down first
calls pthread_cancel() on the listener thread's ID to queue up a
cancelation, and then closes the listener socket. The listener
thread receives the EINVALID from accept() and continues on its
loop until it hits the pthread_testcancel(), at which point it exits
and is joined by the main thread.
[I thought about using select() on the listener socket but that
isn't a cancelation point either. I'd just be in a spin-loop between
select() and pthread_testcancel().]
After looking through my code, I am joining with every thread that
has ever been created. My program then seems to get stuck after
returning from main(). I did spend a bit of time on this shutdown
code to make sure I got everything right with anywhere from 3 to 100
threads. I've never leaked a thread that I could tell. I keep all my
thread_id's in a vector and join on each one of them. If any of them
didn't come back, my main thread would get stuck on the call to
pthread_join, but that doesn't happen.
I still don't know what is causing me to get stuck in the P_WEXIT
state after I've joined all threads and returned from main(). This
process typically runs for weeks at a time handling around 2000
simultaneous SSL connections until you send it a TERM signal. Most
of the time it shuts down clean and does not leave any E processes.
Once in a while I get an E.
But, back to a point you made: rather than closing the socket in the
main thread, you suggested using pthread_kill to wake the listener
thread. Would this result in an EINTR from accept()?
Yes.
Is there a particular signal that I should use?
Any one you didn't block; my suggestion would be SIGUSR1 or something
like that.
Is the kernel creating any threads on my behalf other than the ones
from pthread_create() that might be getting stuck?
No. If you are using frameworks, they can be creating threads on your
behalf, but the kernel doesn't.
Before creating any threads, I block some signals with
pthread_sigmask() so that those signals are blocked in any
subsequently created threads. The main thread also creates any
needed sockets. After all threads have been created, the main thread
calls sigwait() with the same set of signals (TERM, INT, HUP), so
that only the main thread actually deals with signals. There are no
signal handlers installed -- I'm only using pthread_sigmask() and
sigwait() to deal with signals. The only signal I explicitly ignore
with a sigaction() is PIPE.
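
The signal setup described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the names `worker_main` and `wait_for_shutdown_signal_demo` are hypothetical, there is a single worker instead of many, and the `pthread_kill()` to self exists only so the sketch terminates without an external `kill`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* Hypothetical worker: records whether it inherited the blocked SIGTERM. */
static void *worker_main(void *arg)
{
    int *inherited = arg;
    sigset_t cur;
    sigemptyset(&cur);
    /* A NULL "set" argument only queries the current mask. */
    pthread_sigmask(SIG_SETMASK, NULL, &cur);
    *inherited = (sigismember(&cur, SIGTERM) == 1);
    return NULL;
}

/* Returns the signal number collected by sigwait(), or -1 on error. */
int wait_for_shutdown_signal_demo(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGHUP);

    /* Block before creating threads so that every new thread inherits the mask. */
    if (pthread_sigmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;

    /* SIGPIPE is ignored outright rather than waited for. */
    struct sigaction ign;
    memset(&ign, 0, sizeof ign);
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    if (sigaction(SIGPIPE, &ign, NULL) != 0)
        return -1;

    int inherited = 0;
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker_main, &inherited) != 0)
        return -1;

    /* Demo only: make a SIGTERM pending on this thread so sigwait() returns. */
    pthread_kill(pthread_self(), SIGTERM);

    int signo = 0;
    if (sigwait(&set, &signo) != 0)   /* no handler runs; the signal is consumed here */
        return -1;

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return inherited ? signo : -1;
}
```

In a real daemon the `pthread_kill()` line would be absent and `sigwait()` would simply block until an operator sends TERM, INT or HUP.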
There is no explicit call for:
wakeup((caddr_t)&so->so_timeo);
in soo_close, as there is in soo_drain, and since you are not
connected, I'd not expect a close to interrupt things, unless you
happened to have a connection coming in at the same time (ironically,
if you close it from somewhere else, unless it's already getting a
connection, another connection won't wake it up, because it will never
make it in after the close). That's probably what's hanging you, and
you mostly don't see it because all the clients you closed out are
trying to reconnect, and you tend to luck out because of the delay
between the cancellation and the close.
If this same code works on another OS, you might want to file a radar
suggesting that a wakeup be sent in the soo_close, after the close
happens (this should cause the accept to return EBADF the next time
it's called, but only if we do some extra work to expect the wakeup on
close; otherwise, as the code sits, the spurious wakeup would likely
cause a panic to happen - if you file the radar, you may just wish to
cut and paste this email).
But for your immediate problem, a signal sent to the thread will
definitely wake it out of the msleep() in accept(), which is
likely where it's blocked. This is preferable to testing for the
cancellation and then dropping out and closing it, if you see it
(which would be another approach), since you'd be sitting in the "E"
until another connection attempt happened.
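
A minimal sketch of the signal-based wakeup, under some assumptions that are not in the thread: the names `wake_handler`, `listener_main`, `shutdown_listener_demo`, `stop_requested` and `listener_done` are hypothetical, the listener uses a stop flag rather than pthread_cancel(), and the socket is a loopback socket on a kernel-chosen port. Since the default action of SIGUSR1 is to terminate the process, the sketch installs an empty handler, without SA_RESTART, so that accept() returns EINTR instead of being restarted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static atomic_int stop_requested;   /* set by the main thread */
static atomic_int listener_done;    /* set by the listener just before it exits */

/* Empty handler: its only purpose is to interrupt the blocking accept(). */
static void wake_handler(int signo) { (void)signo; }

static void *listener_main(void *arg)
{
    int fd = *(int *)arg;
    while (!atomic_load(&stop_requested)) {
        int c = accept(fd, NULL, NULL);
        if (c >= 0) { close(c); continue; }  /* a real server would hand this off */
        if (errno == EINTR) continue;        /* woken by the signal: re-check the flag */
        break;                               /* any other error ends the loop */
    }
    atomic_store(&listener_done, 1);
    return NULL;
}

/* Returns 0 if the listener was woken and joined, -1 on setup failure. */
int shutdown_listener_demo(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = wake_handler;            /* sa_flags stays 0: no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0) return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a free port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 || listen(fd, 16) != 0) {
        close(fd);
        return -1;
    }

    pthread_t tid;
    if (pthread_create(&tid, NULL, listener_main, &fd) != 0) { close(fd); return -1; }

    atomic_store(&stop_requested, 1);
    /* A signal that lands between the flag check and accept() is lost,
       so keep signalling until the listener reports that it has left the loop. */
    struct timespec pause = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    while (!atomic_load(&listener_done)) {
        pthread_kill(tid, SIGUSR1);
        nanosleep(&pause, NULL);
    }

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    close(fd);                               /* close only after the listener is gone */
    return 0;
}
```

The socket is closed only after the join, so the shutdown never depends on close() waking a thread that is asleep in accept().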
-- Terry