Re: Process Signal Bug On Intel Dual Core Machines?
- Subject: Re: Process Signal Bug On Intel Dual Core Machines?
- From: Terry Lambert <email@hidden>
- Date: Wed, 30 Aug 2006 14:09:55 -0700
On Aug 30, 2006, at 1:48 AM, Markus Hanauska wrote:
On 06-08-30, at 03:45, Terry Lambert wrote:
Whether or not you are permitted to send a signal to a given
process depends on your credentials vs. the target process's
credentials, and whether or not the process has already zombied at
the time you attempt to send the signal.
No, the process has not zombied for sure, since it can still process
network packets.
Regarding permissions, the daemon runs as root, that is true, but
the control app belongs to root:wheel and has the SUID bit set, that
means it also runs as root, always (you can't let it run as another
user).
It's still not clear to me whether or not you are running anything
from launchd; are you?
Here is something else I have found out:
I found out that the HUP signal (SIGHUP) is the most reliable one. It
*always* is delivered to the process. Since re-reading the config is
not supported anyway at the moment, I changed my daemon to use this
signal to shut down and now it works *every time*. SIGHUP is always
delivered, the signal handler is always called, the app closes down
as expected.
I found out that SIGQUIT (which the app is not handling in the
signal handler) also works occasionally, but very seldom and it
kills the process as expected. SIGTERM still never works, SIGINT
works, but not reliably (fine from command line with kill and
killall, only on every second call from within my controller
application).
The command line is completely uninteresting for the purposes of
identifying the root cause of the bug: it does not help anyone who
could help you, to have that information; using different code paths
via kill/killall doesn't provide any useful information on top of what
you are already observing.
The differences between SIGQUIT / SIGTERM / SIGINT are mildly
interesting.
If you are using either of the library routines getpass() or
readpassphrase(), you should be aware that SIGINT / SIGHUP / SIGQUIT /
SIGTERM / SIGTSTP / SIGTTIN / SIGTTOU are moved to an inactive signal
handler by Libc code for the duration of either call, and are only
returned to active status following input or interruption of input.
If you are using the system() function in order to run a subprogram,
rather than dealing with the fork()/execve() yourself, then SIGINT and
SIGQUIT are both ignored during the call (meaning that if they are
delivered during that time period, they will simply be ignored), and
SIGCHLD is blocked (meaning that if it is delivered during that time
period, it will not be delivered until it is unblocked later on).
There's no libc code that blocks, masks, or ignores SIGTERM explicitly.
So in general, unless your code is using SIGQUIT / SIGINT, then the
problem is going to be in one of the libraries that your code is
linked against.
Considering that the app runs as expected using the SIGHUP signal on
Intel and PPC (and this is treated exactly the way SIGTERM used to be
treated), however, SIGTERM only works on PPC, I rather wonder how
this could be the fault of my application code.
There are many ways this could be the case; it's not really useful to
enumerate them: I mention several below, in answer to a different
question. If you insist, you can file a bug, and it will likely get
to me, at which point I will enumerate all of them in the context of
the bug.
NB: If you do file a bug, please include a cut down test case; also
include whether launchd is involved at all, and the list of libraries
against which you are linked, since the problem is likely either in
your code (in which case, the cut-down test case won't exhibit the
problem), or in one of those other libraries/frameworks (in which
case, the test case would be useful).
NNB: The volume of information noted above for diagnosing the problem
is not of general interest to the list, and it's not of interest to me
in private email instead of a bug filing; if you want to communicate
it to anyone, you should file a bug, so that the resulting information
is captured in the bug database; I absolutely hate solving any problem
more than once.
Effectively, you should *always* look at your result code; if you
get a -1, and errno is ESRCH, it means the process does not exist
(as far as the system can tell); if you get an EPERM, then this
means that the process exists, but you have insufficient privilege
to send the signal to it. You can also get an EINVAL if the signal
is out of range, but that does not look like the case.
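The result-code check described above can be sketched roughly as follows. This is only an illustration: send_signal_logged() is a hypothetical helper name, not something from the original post, and the log messages are made up. It combines the errno cases listed above with the syslog() suggestion made later in this message.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes kill() under strict -std=c99/c11 */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical helper: send `sig` to the single process `pid` and log the
 * reason if kill() fails. Returns 0 on success, else the errno from kill(). */
int send_signal_logged(pid_t pid, int sig)
{
    assert(pid > 0);  /* pid <= 0 would address a process group or everything */

    if (kill(pid, sig) == 0)
        return 0;     /* accepted by the kernel; says nothing about masking */

    int err = errno;  /* save it before any other call can overwrite it */
    switch (err) {
    case ESRCH:
        syslog(LOG_ERR, "kill(%ld, %d): no such process", (long)pid, sig);
        break;
    case EPERM:
        syslog(LOG_ERR, "kill(%ld, %d): insufficient privilege", (long)pid, sig);
        break;
    case EINVAL:
        syslog(LOG_ERR, "kill(%ld, %d): signal out of range", (long)pid, sig);
        break;
    default:
        syslog(LOG_ERR, "kill(%ld, %d): %s", (long)pid, sig, strerror(err));
        break;
    }
    return err;
}
```

Note that a 0 return only means the kernel accepted the signal for the target; the target can still have it blocked or ignored.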
The result code of the ineffective kill/killall calls from command
line (as root!) is always 0, which means signal successfully
delivered, still nothing happens. Same in my app. I never get any
error back.
See my code example:
if (result == 0) {
if ((result = kill(pid, SIGINT)) != 0) {
result is always 0, still no signal arrives at the destination
process. I verified pid to be correct.
The example didn't indicate what happened when you got the error; my
suggestion would be a syslog(). The reason for this is that many
people tend to use longjmp() or printf() in signal handlers, or child
processes, etc. When they do this, they tend to ignore that the
stdio processing changed as of Tiger; as a result, without an
explicit fflush() in the case of a printf()/perror(), and without
explicit information about the sigaction state (i.e. is SA_RESTART set
in the flags field?, etc.), it's not possible to reproduce the exact
problem you are having. This is also why it's likely no one else
responded to your posting to the mailing list.
You can also effectively mask signals on a per thread basis, as
well as just a process basis, except for unmaskable signals - so
you might want to try SIGKILL from your process, rather than
SIGINT, to see if it's not delivered, or if it was just masked
(perhaps by a library routine you didn't know was masking it).
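One way to test the "was it just masked?" hypothesis from inside the daemon is to ask the system directly. The following is a minimal sketch, assuming a POSIX threads environment; query_signal_state() and struct signal_state are hypothetical names introduced only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes the POSIX signal API in strict modes */
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
struct signal_state {
    int blocked;  /* 1 if `sig` is in the calling thread's signal mask */
    int ignored;  /* 1 if the process-wide disposition is SIG_IGN */
};

/* Fill `out` for `sig`. Returns 0 on success, -1 on error (e.g. bad signal). */
int query_signal_state(int sig, struct signal_state *out)
{
    assert(out != NULL);

    sigset_t mask;
    struct sigaction sa;

    /* With a NULL new set, pthread_sigmask() only reports the current mask. */
    if (pthread_sigmask(SIG_SETMASK, NULL, &mask) != 0)
        return -1;

    int member = sigismember(&mask, sig);
    if (member < 0)
        return -1;

    /* With a NULL new action, sigaction() only reports the current one. */
    if (sigaction(sig, NULL, &sa) != 0)
        return -1;

    out->blocked = member;
    out->ignored = (sa.sa_handler == SIG_IGN);
    return 0;
}
```

Calling this for SIGTERM right before the select() in the affected thread, and logging the result, would show whether some library routine left the signal blocked or ignored.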
But doesn't GDB ignore the masking?
No. GDB intercepts the ASTs using its Mach port for the process, and
filters those ASTs to the process based on internal decisions it
makes, some of which are controlled by that table you posted earlier,
and some of which are a side effect of the way it's invoked.
Even if I mask the signal, wouldn't GDB show the arriving signal
anyway and just the process ignore it?
No, it would not. If the signal is masked, then we do not generate
the AST.
The part of this that most people ignore is that signals are not
events, they are persistent conditions. If you masked the signal off,
and we delivered it anyway, and then attempted to filter it at the
AST, then we would have potentially interrupted a sleep on resource
availability. Just because an operation can run to completion before
the AST fires does not make the operation in progress non-
interruptible by the signal; for example, if you are sleeping in a
read waiting for buffer availability from the block I/O subsystem,
it's indeterminate whether your interruption of that wait will get
there first, or the interruption of the block I/O.
So it's not possible for an implementation to be reliable, if it were
to take the approach you suggest: reliable in this case refers to
masked signals not resulting in spurious EINTRs to system calls for
programs which have masked the signals to prevent such interruptions
occurring.
Also, because they are persistent conditions, rather than events, the
proper way to deal with them (e.g. SIGCHLD), is to fire the handler and
set a volatile variable from the handler, and then in your main loop
following the system call interruption, poll all potential sources
until various things that could have set the persistent condition have
all been handled.
For example, if you were writing "init" or "launchd" yourself, you
would take a SIGCHLD, set a volatile variable (e.g. "volatile int
saw_sigchld = 0;") to flag that you saw it in the handler (e.g.
"saw_sigchld = 1;"), and then in the main loop following the
interruption, loop, zeroing saw_sigchld and then calling wait4() with
WNOHANG until it quits returning child process information, doing
whatever per-child cleanup needs to be done in that loop.
NB: Doing it in this order avoids the race from zeroing it to it being
re-raised before you go into the loop processing dead children - so if
you had another one happen, it would end up cleaned up in the meantime
as a side effect of the structure of your code.
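The pattern described above can be sketched as follows. This is an illustration, not code from launchd or init: the function names are hypothetical, sig_atomic_t is used for the flag, and waitpid() stands in for wait4() because the resource usage information is not needed here.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes sigaction()/waitpid() in strict modes */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t saw_sigchld = 0;

/* The handler only records the condition; the real work happens later. */
static void on_sigchld(int sig)
{
    (void)sig;
    saw_sigchld = 1;
}

/* Install the SIGCHLD handler. Returns 0 on success, -1 on error. */
int install_sigchld_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  /* no SA_RESTART: blocking calls fail with EINTR */
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Call from the main loop after a system call was interrupted.
 * Returns the number of children reaped in this pass. */
int reap_children(void)
{
    int reaped = 0;
    pid_t pid = 0;

    if (!saw_sigchld)
        return 0;

    saw_sigchld = 0;  /* zero first, then drain: a re-raise is not lost */
    for (;;) {
        int status;
        pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            reaped++;  /* per-child cleanup would go here */
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;
        break;         /* 0: none ready; -1: no children left */
    }
    assert(pid == 0 || errno == ECHILD);
    return reaped;
}
```

Because the flag is cleared before the drain loop, a child that exits during the loop is either reaped by the loop itself or sets the flag again for the next pass.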
And why would different signals be masked on Intel than on PPC
machines?
Different code in your application or one of its libraries, resulting
from variant compilation based on __i386__ or __ppc__ or one of the
other manifest constants from the compiler being in scope; different
output from GNU configure; some signals depend on hardware support
(e.g. SIGFPE), etc.
You should also be aware that a signal sent to a process is
delivered via a Mach AST, and that gdb traps these ASTs, and can
block, redirect, or otherwise cause them to not be acted upon.
Yes, I know that. The problem is that GDB never sees any signals
arriving. If they'd arrive in GDB and just not at the process, I
wouldn't complain, but they are never shown in GDB to begin with.
And even without GDB running they don't arrive at the process.
gdb will never see signals, period. It will potentially see ASTs.
Are you running the standard gdb which came with the system, and is in
the proper group for "procwatch" or "procmod", or did you compile your
own?
If you compiled your own, or renamed/moved, or otherwise did not use
the proper install process on it, the gdb you are using may fail to
obtain the task_for_pid() which gdb uses to establish the Mach port on
which it will receive, among other things, the signal AST
notifications. This is a difference between current PPC and Intel:
task_for_pid() is a privileged operation. It is our intention to take
it away from all but privileged processes.
You should check your system and console logs to see if you see any
denials of the form:
(109: gdb)tfp: failed on 117
Also, since ASTs only fire on the way out of a system call (or
cancellation point, if the call can be precancelled), the signal
will not be delivered until one or more of your threads run up
through the trap handler into the trampoline code in user space
(and then call back into the kernel to return to user space not on
the signal context). So it's possible for you to, for example,
start a read, send a signal to the process, and not see the read
interrupted (i.e. it could complete prior to the signal handler
firing). This is the same thing DEC Tru64 UNIX, and a number of
other OSes with signals built on AST-style implementations, will do:
you get the signal, but the operation completes before the signal
delivery actually happens.
This is all a very nice theory,
Not theory; I am the current maintainer of the signal code.
but why is SIGHUP then always delivered (also from my controller
app), always interrupts the running select, but SIGTERM is never,
SIGQUIT only on rare occasions and SIGINT on every second call from
my controller app? This all sounds extremely nondeterministic, and
nondeterministic behavior of the same piece of code sounds a lot like
a system bug to me.
Most likely because you are calling library routines which result in
things being masked, or you are running a multithreaded program, and
have failed to mask the signals on the other threads (either
intentionally, or as a result of some framework/library starting a
worker thread on your behalf). As a result, the select() is not
interrupted either because the signal attempting to interrupt it is
masked, or the signal is not delivered to the thread on which the
select is running, and therefore the select is never interrupted (and
remains blocked in the kernel).
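For the multithreaded case, one common idiom is to block the shutdown signals in the creating thread just before pthread_create(), so that the new thread inherits the blocked mask, and then restore the creator's mask. Below is a minimal sketch under that assumption; create_thread_signals_blocked() and probe_sigterm_blocked() are hypothetical names, and the set of signals is simply the ones discussed in this thread. It does not help with threads that a framework starts on its own.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes pthread_sigmask() in strict modes */
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Create a thread that starts with SIGHUP/SIGINT/SIGQUIT/SIGTERM blocked,
 * leaving the caller's own mask unchanged. Returns 0 or an error number. */
int create_thread_signals_blocked(pthread_t *tid,
                                  void *(*start)(void *), void *arg)
{
    assert(tid != NULL && start != NULL);

    sigset_t block, saved;
    sigemptyset(&block);
    sigaddset(&block, SIGHUP);
    sigaddset(&block, SIGINT);
    sigaddset(&block, SIGQUIT);
    sigaddset(&block, SIGTERM);

    /* Block in the creator so there is no window in the new thread. */
    int rc = pthread_sigmask(SIG_BLOCK, &block, &saved);
    if (rc != 0)
        return rc;

    rc = pthread_create(tid, NULL, start, arg);  /* inherits the mask */

    /* Restore the creator's mask whether or not creation succeeded. */
    int restore_rc = pthread_sigmask(SIG_SETMASK, &saved, NULL);
    return rc != 0 ? rc : restore_rc;
}

/* Example worker: stores 1 in *arg if SIGTERM is blocked in this thread. */
void *probe_sigterm_blocked(void *arg)
{
    int *result = arg;
    sigset_t mask;
    if (pthread_sigmask(SIG_SETMASK, NULL, &mask) == 0)
        *result = sigismember(&mask, SIGTERM);
    return NULL;
}
```

With every worker created this way, a process-directed SIGTERM can only be taken by a thread that has it unblocked, e.g. the one sitting in select().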
One common problem with debugging signals is that you don't want
the signals sent to a subshell used to invoke your program; the
default in gdb on Mac OS X is "start-with-shell" set to "on". If
you plan on debugging signals, you will want to turn this off, e.g.:
(gdb) set start-with-shell off
before attempting to debug anything to do with signals.
I'll try that. I doubt that I will be able to explain the SIGHUP/
SIGTERM riddle that way.
I'll keep you updated. I have also filed a bug report for this issue.
It would probably be best to resolve this in the context of the bug
report, then.
Please make sure you've included the information I noted above as
being relevant to a bug report, or it's going to bounce back and forth
through ADCBugs several times while they collect the information from
you.
-- Terry