Re: Process Signal Bug On Intel Dual Core Machines?
- Subject: Re: Process Signal Bug On Intel Dual Core Machines?
- From: Markus Hanauska <email@hidden>
- Date: Mon, 28 Aug 2006 17:46:47 +0200
Actually, after doing some more testing, the issue seems worse than I
thought. While kill or killall from the shell always works with
SIGINT, sending the same signal from within my controller app only
works *RANDOMLY*. Very simple code:
    if (result == 0) {
        if ((result = kill(pid, SIGINT)) != 0) {
Sometimes this kills the process and sometimes not. If it does not
work the first time (the signal is never delivered), it works when I
call it a second time. Same code: one time the signal is delivered,
the next time it is not.
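To narrow down a failure like this, it can help to separate "kill()
failed" from "kill() succeeded but the target never reacted". Below is
a minimal sketch, not the code from my controller: the helper names
send_signal_checked and process_exists are made up for illustration.
Note that a return value of 0 from kill() only means the request was
accepted, not that the target has handled the signal yet.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send `sig` to `pid` and report why kill() failed,
 * if it did. Returns 0 on success, otherwise the errno value. */
static int send_signal_checked(pid_t pid, int sig)
{
    if (kill(pid, sig) != 0) {
        int err = errno;   /* save before any other library call */
        fprintf(stderr, "kill(%ld, %d) failed: %s\n",
                (long)pid, sig, strerror(err));
        return err;
    }
    return 0;
}

/* Hypothetical helper: signal 0 performs the existence and permission
 * checks without sending anything. Returns 1 if the process exists. */
static int process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;   /* exists, but we may not signal it */
}
```

A controller could call send_signal_checked(pid, SIGINT), wait a
moment, and then use process_exists(pid) to decide whether a second
attempt is needed. A child that has exited but has not been reaped yet
still counts as existing.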
On 06-08-28, at 16:46, Markus Hanauska wrote:
Hello!
I have a daemon process with a signal handler that listens for
SIGHUP, SIGINT, SIGTERM, SIGUSR1, SIGUSR2 and SIGPIPE. The problem
is, this process never reacts to SIGTERM, which it does handle.
Funny thing is, it also does not react to SIGQUIT, although it does
not block it (the default action should take place). It does,
however, react to SIGINT.
Don't get me wrong: the problem is that the signals are really
*not* delivered to the process! I can prove it with GDB. E.g. I
start the process, and it gets ID 927.
Now here's my GDB session:
~ root# gdb
GNU gdb 6.3.50-20050815 (Apple version gdb-609) (Fri Jul 28 05:21:24 UTC 2006)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-apple-darwin".
(gdb) attach 927
Attaching to process 927.
Reading symbols for shared libraries . done
Reading symbols for shared libraries .............. done
0x900f9294 in __select ()
(gdb)
I'll now take a look at the signal table in GDB:
(gdb) info signals
Signal    Stop  Print  Pass to program  Description
SIGHUP    Yes   Yes    Yes              Hangup
SIGINT    Yes   Yes    No               Interrupt
SIGQUIT   Yes   Yes    Yes              Quit
SIGILL    Yes   Yes    Yes              Illegal instruction
SIGTRAP   Yes   Yes    No               Trace/breakpoint trap
SIGABRT   Yes   Yes    Yes              Aborted
SIGEMT    Yes   Yes    Yes              Emulation trap
SIGFPE    Yes   Yes    Yes              Arithmetic exception
SIGKILL   Yes   Yes    Yes              Killed
SIGBUS    Yes   Yes    Yes              Bus error
SIGSEGV   Yes   Yes    Yes              Segmentation fault
SIGSYS    Yes   Yes    Yes              Bad system call
SIGPIPE   Yes   Yes    Yes              Broken pipe
SIGALRM   No    No     Yes              Alarm clock
SIGTERM   Yes   Yes    Yes              Terminated
:
As you can see, it should stop at and print every received signal
except SIGALRM. Further, it should pass all signals to the app
except SIGINT and SIGTRAP.
Now I continue running the app
(gdb) cont
Continuing.
In another shell I do the following:
~ root# kill -TERM 927
What happens in GDB? Nothing. Is the signal handler called? No. I
verified that by setting a breakpoint at the signal handler in a
previous test. Even if I had no signal handler for TERM, even if I
blocked or ignored that signal, GDB should still *stop* and *print*
it. It does not. Why not? Because _no signal_ is delivered. Why
not? How can this be?
Ok, let's try QUIT. QUIT is not even handled by my signal handler,
it should do the default action. Here we go:
~ root# kill -QUIT 927
Again, nothing!
Okay, but now, let's try SIGINT:
Program received signal SIGINT, Interrupt.
0x900f9294 in __select ()
(gdb)
Huh? How can this be? How can it be that SIGINT is delivered, but
SIGTERM and SIGQUIT are not?!? Wouldn't GDB show the signal
regardless of whether my app ignores or blocks it (which it does
not; nowhere in the code do I see anything like this taking place)?
Now you may ask: how is that possibly a kernel bug? Very simple: I
can't reproduce it on any PPC machine. I also can't reproduce it
on my Mac Mini Core Solo, but I can reproduce it 100% of the time
on an iMac Intel with Dual Core.
This bug is driving me really nuts and makes me doubt my sanity.
And why only on Dual Core Intel? (Not in Rosetta; the daemon is
universal.) Can it be that this is some kernel layer bug in the
signal delivery?
The workaround for me is to use SIGINT on all machines, which is
working fine as far as I can see. But this daemon has existed since
10.2 and it has always worked, from 10.2 to 10.4, on any machine,
always using SIGTERM. Now, all of a sudden, it fails on iMac and
many MacBooks and MacBook Pros with Intel Dual Core. It is not
always 100% reproducible; for some it sometimes works and sometimes
does not, which makes me believe even more that this is a really,
really nasty kernel bug.
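A less drastic variant of that workaround would be to keep SIGTERM as
the first choice and only fall back to SIGINT when the daemon does not
exit. The following is a sketch under one big assumption: the
controller is the parent of the daemon, so it can use waitpid(). The
helper name terminate_child and the timing (ten polls of 100 ms per
signal) are arbitrary choices for illustration.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper for a controller that is the PARENT of the daemon.
 * Sends each signal in `sigs` in turn and waits up to about one second
 * for the child to exit after each one. Returns the signal after which
 * the child exited, 0 if it survived all of them, -1 on error. */
static int terminate_child(pid_t pid, const int *sigs, int nsigs)
{
    struct timespec pause_100ms = { 0, 100 * 1000 * 1000 };
    int i, tries, status;

    for (i = 0; i < nsigs; i++) {
        if (kill(pid, sigs[i]) != 0)
            return -1;
        for (tries = 0; tries < 10; tries++) {
            pid_t r = waitpid(pid, &status, WNOHANG);
            if (r == pid)
                return sigs[i];          /* child is gone and reaped */
            if (r < 0 && errno != EINTR)
                return -1;
            nanosleep(&pause_100ms, NULL);
        }
    }
    return 0;
}
```

Called with the list { SIGTERM, SIGINT }, the return value also tells
you which signal actually did the job, which would be useful data for
a bug report.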
I can provide you with every debug output from GDB, Shark or any
other tool you like. I just can't post any source here. Any help is
appreciated.
--
Best Regards,
Markus Hanauska