Re: Process Signal Bug On Intel Dual Core Machines?
- Subject: Re: Process Signal Bug On Intel Dual Core Machines?
- From: Terry Lambert <email@hidden>
- Date: Tue, 29 Aug 2006 18:45:18 -0700
Whether or not you are permitted to send a signal to a given process
depends on your credentials vs. the target process's credentials, and
whether or not the process has already zombied at the time you attempt
to send the signal. So, for example, if your daemon program is
started by launchd as root, and you are a non-root user (no, it does
not matter if you are an administrator or not), unless your
credentials match, or you are the process group leader and in the same
process group, etc., you will not be able to send a signal.
Effectively, you should *always* look at your result code; if you get
a -1, and errno is ESRCH, it means the process does not exist (as far
as the system can tell); if you get an EPERM, then this means that the
process exists, but you have insufficient privilege to send the signal
to it. You can also get an EINVAL if the signal is out of range, but
that does not look like the case.
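As a minimal sketch of the result-code check described above (the helper name send_signal_checked is hypothetical, not something from your code), it could look roughly like this:

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: send `sig` to `pid` and report why it failed.
 * Returns 0 on success, otherwise the errno value set by kill(). */
int send_signal_checked(pid_t pid, int sig)
{
    if (kill(pid, sig) == 0)
        return 0;

    int err = errno;  /* save it before any call that might change errno */
    switch (err) {
    case ESRCH:
        fprintf(stderr, "pid %ld: no such process (or already gone)\n",
                (long)pid);
        break;
    case EPERM:
        fprintf(stderr, "pid %ld: exists, but we lack permission\n",
                (long)pid);
        break;
    case EINVAL:
        fprintf(stderr, "signal %d is out of range\n", sig);
        break;
    default:
        fprintf(stderr, "kill failed: %s\n", strerror(err));
        break;
    }
    return err;
}
```

Passing 0 as the signal makes kill() perform only the existence and permission checks without sending anything, which can help separate "not permitted" from "not delivered".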
You can also effectively mask signals on a per thread basis, as well
as just a process basis, except for unmaskable signals - so you might
want to try SIGKILL from your process, rather than SIGINT, to see if
it's not delivered, or if it was just masked (perhaps by a library
routine you didn't know was masking it). If SIGKILL will always kill
it, then it's pretty likely that someone is masking the other
(maskable) signals.
You should also be aware that a signal sent to a process is delivered
via a Mach AST, and that gdb traps these ASTs, and can block,
redirect, or otherwise cause them to not be acted upon. Also, since
ASTs only fire on the way out of a system call (or cancellation point,
if the call can be precancelled), the signal will not be delivered
until one or more of your threads run up through the trap handler into
the trampoline code in user space (and then call back into the kernel
to return to user space not on the signal context). So it's possible
for you to, for example, start a read, send a signal to the process,
and not see the read interrupted (i.e. it could complete prior to the
signal handler firing). This is the same thing DEC Tru64 UNIX, and
a number of other OS's with signals built on AST-style implementations
will do: you get the signal, but the operation completes before the
signal delivery actually happens.
One common problem with debugging signals is that you don't want the
signals sent to a subshell used to invoke your program; the default in
gdb on Mac OS X is "start-with-shell" set to "on". If you plan on
debugging signals, you will want to turn this off, e.g.:
(gdb) set start-with-shell off
before attempting to debug anything to do with signals.
Hope that helps.
-- Terry
On Aug 28, 2006, at 8:46 AM, Markus Hanauska wrote:
Actually, after doing some more testing, the issue seems worse than
I thought. While KILL or KILLALL from shell always work with the
SIGINT signal, doing the same from within my controller app only
works *RANDOMLY*. Very simple code:
if (result == 0) {
if ((result = kill(pid, SIGINT)) != 0) {
Sometimes this kills the process and sometimes not. If it does not
work the first time (the signal is never delivered), it works when
I call it a second time. Same code, once the signal is delivered,
once not.
On 06-08-28, at 16:46, Markus Hanauska wrote:
Hello!
I have a daemon process, that has a signal handler which listens
for SIGHUP, SIGINT, SIGTERM, SIGUSR1, SIGUSR2, SIGPIPE. The problem
is, this process never reacts to SIGTERM, which it does handle.
Funny thing is, it does not react to SIGQUIT, although it does not
block it (the default action should take place). It does however
react to SIGINT.
Don't get me wrong: The problem is, that the signals are really
*not* delivered to the process! I can prove it with GDB. E.g. I
start the process, it gets ID 927.
Now here's my GDB session:
~ root# gdb
GNU gdb 6.3.50-20050815 (Apple version gdb-609) (Fri Jul 28
05:21:24 UTC 2006)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License,
and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i386-apple-darwin".
(gdb) attach 927
Attaching to process 927.
Reading symbols for shared libraries . done
Reading symbols for shared libraries .............. done
0x900f9294 in __select ()
(gdb)
I'll now take a look at the signal table in GDB:
(gdb) info signals
Signal        Stop  Print  Pass to program  Description
SIGHUP        Yes   Yes    Yes              Hangup
SIGINT        Yes   Yes    No               Interrupt
SIGQUIT       Yes   Yes    Yes              Quit
SIGILL        Yes   Yes    Yes              Illegal instruction
SIGTRAP       Yes   Yes    No               Trace/breakpoint trap
SIGABRT       Yes   Yes    Yes              Aborted
SIGEMT        Yes   Yes    Yes              Emulation trap
SIGFPE        Yes   Yes    Yes              Arithmetic exception
SIGKILL       Yes   Yes    Yes              Killed
SIGBUS        Yes   Yes    Yes              Bus error
SIGSEGV       Yes   Yes    Yes              Segmentation fault
SIGSYS        Yes   Yes    Yes              Bad system call
SIGPIPE       Yes   Yes    Yes              Broken pipe
SIGALRM       No    No     Yes              Alarm clock
SIGTERM       Yes   Yes    Yes              Terminated
:
As you can see, it shall stop at every signal and print every
received signal except SIGALRM. Further it should pass all signals
to the app except SIGINT and SIGTRAP.
Now I continue running the app
(gdb) cont
Continuing.
In another shell I do the following:
~ root# kill -TERM 927
What happens in GDB? Nothing. Is the signal handler called? No. I
have verified that by setting a break point at the signal handler
in a previous test. Even if I have no signal handler for TERM, even
if I block that signal or ignore that signal, GDB should still
*stop* and *print* it. It does not. Why not? Because _no signal_ is
delivered. Why not? How can this be?
Ok, let's try QUIT. QUIT is not even handled by my signal handler,
it should do the default action. Here we go:
~ root# kill -QUIT 927
Again, nothing!
Okay, but now, let's try SIGINT:
Program received signal SIGINT, Interrupt.
0x900f9294 in __select ()
(gdb)
Huh? How can this be? How can it be that SIGINT is delivered, but
SIGTERM and SIGQUIT are not?!? Wouldn't GDB show the signal
regardless of whether my app ignores or blocks it (which it does not;
nowhere in the code do I see anything like this taking place)?
Now you may say, how's that a possible kernel bug? Very simple: I
can't reproduce that on any PPC machine. I also can't reproduce
that on my Mac Mini Core Solo, but I can reproduce this to 100% on
an iMac Intel with Dual Core.
This bug is driving me really nuts and makes me doubt my sanity. And
why only on Dual Core Intel? (not in Rosetta, the daemon is
universal) Can it be that this is some kernel layer bug in the
signal delivery?
The workaround for me is to use SIGINT on all machines, which is
working fine as far as I can see. But this daemon has existed since
10.2 and it has always worked, 10.2 to 10.4, on any machine,
always using SIGTERM, and now, all of a sudden, it fails on iMac and
many MacBooks and MacBook Pros with Intel Dual Core. It is not always
100% reproducible; for some it sometimes works and sometimes does
not, which makes me believe even more that this is a really,
really nasty kernel bug.
I can provide you with every debug output from GDB, Shark or any
other tool you like. I just can't post any source here. Any help is
appreciated.
--
Best Regards,
Markus Hanauska
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Darwin-kernel mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden