Re: Process Signal Bug On Intel Dual Core Machines?


Subject: Re: Process Signal Bug On Intel Dual Core Machines?
From: Terry Lambert <email@hidden>
Date: Wed, 30 Aug 2006 14:09:55 -0700

On Aug 30, 2006, at 1:48 AM, Markus Hanauska wrote:

On 06-08-30, at 03:45, Terry Lambert wrote:
Whether or not you are permitted to send a signal to a given process depends on your credentials vs. the target process's credentials, and whether or not the process has already zombied at the time you attempt to send the signal.
No, the process has not zombied for sure, since it can still process network packets.

Regarding permissions, the daemon runs as root, that is true, but the control app belongs to root:wheel and has the SUID bit set, that means it also runs as root, always (you can't let it run as another user).

It's still not clear to me whether or not you are running anything from launchd; are you?

Here is something else I have found out:
I found out that the HUP signal (SIGHUP) is the most reliable one. It *always* is delivered to the process. Since re-reading the config is not supported anyway at the moment, I changed my daemon to use this signal to shut down and now it works *every time*. SIGHUP is always delivered, the signal handler is always called, the app closes down as expected.

I found out that SIGQUIT (which the app is not handling in the signal handler) also works occasionally, but very seldom and it kills the process as expected. SIGTERM still never works, SIGINT works, but not reliably (fine from command line with kill and killall, only on every second call from within my controller application).

The command line is completely uninteresting for the purposes of identifying the root cause of the bug: it does not help anyone who could help you, to have that information; using different code paths via kill/killall doesn't provide any useful information on top of what you are already observing.

The differences between SIGQUIT / SIGTERM / SIGINT are mildly interesting.

If you are using either of the getpass()/readpassphrase() library routines, you should be aware that SIGINT / SIGHUP / SIGQUIT / SIGTERM / SIGTSTP / SIGTTIN / SIGTTOU are moved to an inactive signal handler during either function call by Libc code, and only returned to active status following input or interruption of input.

If you are using the system() function in order to run a subprogram, rather than dealing with the fork()/execve() yourself, then SIGINT and SIGQUIT are both ignored during the call (meaning that if they are delivered during that time period, they will simply be ignored), and SIGCHLD is blocked (meaning that if it is delivered during that time period, it will not be delivered until it is unblocked later on).

There's no libc code that blocks, masks, or ignores SIGTERM explicitly.

So in general, unless your code is using SIGQUIT / SIGINT, then the problem is going to be in one of the libraries that your code is linked against.
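To take system() out of the picture while testing, the fork()/execve() route mentioned above can be sketched roughly as follows. This is only a minimal sketch: run_child() is a hypothetical helper name, and it uses execv() and waitpid() as one reasonable way to do it, not as a description of what your code or Libc actually does.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the program argv[0] with arguments argv and
 * return its exit status, or -1 on failure. Unlike system(), nothing here
 * ignores SIGINT/SIGQUIT or blocks SIGCHLD in the parent.
 */
int run_child(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;                  /* fork failed; errno is set */

    if (pid == 0) {
        execv(argv[0], argv);       /* only returns on error */
        _exit(127);                 /* same convention the shell uses */
    }

    int status;
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;              /* real error, e.g. ECHILD */
        /* interrupted by a signal handler: just wait again */
    }

    /* -1 also covers a child that was killed by a signal */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the signals behave differently once system() is replaced by something like this, that narrows the problem down considerably.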

Considering that the app runs as expected using the SIGHUP signal on Intel and PPC (and this is treated exactly the way SIGTERM used to be treated), whereas SIGTERM only works on PPC, I rather wonder how this could be the fault of my application code.

There are many ways this could be the case; it's not really useful to enumerate them: I mention several below, in answer to a different question. If you insist, you can file a bug, and it will likely get to me, at which point I will enumerate all of them in the context of the bug.

NB: If you do file a bug, please include a cut down test case; also include whether launchd is involved at all, and the list of libraries against which you are linked, since the problem is likely either in your code (in which case, the cut-down test case won't exhibit the problem), or in one of those other libraries/frameworks (in which case, the test case would be useful).

NNB: The volume of information noted above for diagnosing the problem is not of general interest to the list, and it's not of interest to me in private email instead of a bug filing; if you want to communicate it to anyone, you should file a bug, so that the resulting information is captured in the bug database; I absolutely hate solving any problem more than once.

Effectively, you should *always* look at your result code; if you get a -1, and errno is ESRCH, it means the process does not exist (as far as the system can tell); if you get an EPERM, then this means that the process exists, but you have insufficient privilege to send the signal to it. You can also get an EINVAL if the signal is out of range, but that does not look like the case.
The result code of the ineffective kill/killall calls from command line (as root!) is always 0, which means signal successfully delivered, still nothing happens. Same in my app. I never get any error back.
See my code example:
if (result == 0) {
    if ((result = kill(pid, SIGINT)) != 0) {
result is always 0, still no signal arrives at the destination process. I verified pid to be correct.

The example didn't indicate what happened when you got the error; my suggestion would be a syslog(). The reason for this is that many people tend to use longjmp() or printf() in signal handlers, or child processes, etc. When they do this, they tend to ignore that the stdio processing changed as of Tiger, and as a result, without an explicit fflush() in the case of a printf()/perror(), and without explicit information about the sigaction state (i.e. is SA_RESTART set in the flags field?, etc.), it's not possible to reproduce the exact problem you are having. This is also why it's likely no one else responded to your posting to the mailing list.
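As a rough illustration of the kind of error reporting meant here, a kill() wrapper that logs through syslog() might look like the sketch below. The name send_signal_logged() and the message wording are made up for the example; only kill(), errno and syslog() are the real interfaces.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <syslog.h>

/*
 * Hypothetical helper: send sig to pid and log the reason if kill() fails.
 * Returns 0 on success, or the errno value from kill() on failure.
 */
int send_signal_logged(pid_t pid, int sig)
{
    if (kill(pid, sig) == 0)
        return 0;   /* signal was sent; this says nothing about masking */

    int err = errno;    /* save it before another call can overwrite it */

    switch (err) {
    case ESRCH:
        syslog(LOG_ERR, "kill(%d, %d): no such process", (int)pid, sig);
        break;
    case EPERM:
        syslog(LOG_ERR, "kill(%d, %d): insufficient privilege", (int)pid, sig);
        break;
    case EINVAL:
        syslog(LOG_ERR, "kill(%d, %d): signal out of range", (int)pid, sig);
        break;
    default:
        syslog(LOG_ERR, "kill(%d, %d): %s", (int)pid, sig, strerror(err));
        break;
    }
    return err;
}
```

Note that a return of 0 only means the signal was accepted for the target; it does not tell you that a handler ran, which is why the mask and sigaction state still matter.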

You can also effectively mask signals on a per thread basis, as well as just a process basis, except for unmaskable signals - so you might want to try SIGKILL from your process, rather than SIGINT, to see if it's not delivered, or if it was just masked (perhaps by a library routine you didn't know was masking it).
But doesn't GDB ignore the masking?

No. GDB intercepts the ASTs using its Mach port for the process, and filters those ASTs to the process based on internal decisions it makes, some of which are controlled by that table you posted earlier, and some of which are a side effect of the way it's invoked.

Even if I mask the signal, wouldn't GDB show the arriving signal anyway and just the process ignore it?

No, it would not. If the signal is masked, then we do not generate the AST.

The part of this that most people ignore is that signals are not events, they are persistent conditions. If you masked the signal off, and we delivered it anyway, and then attempted to filter it at the AST, then we would have potentially interrupted a sleep on resource availability. Just because an operation can run to completion before the AST fires does not make the operation in progress non-interruptible by the signal; for example, if you are sleeping in a read waiting for buffer availability from the block I/O subsystem, it's indeterminate whether your interruption of that wait will get there first, or the interruption of the block I/O.

So it's not possible for an implementation to be reliable, if it were to take the approach you suggest: reliable in this case refers to masked signals not resulting in spurious EINTRs to system calls for programs which have masked the signals to prevent such interruptions occurring.

Also, because they are persistent conditions, rather than events, the proper way to deal with them (e.g. SIGCHLD) is to fire the handler and set a volatile variable from the handler, and then in your main loop following the system call interruption, poll all potential sources until the various things that could have set the persistent condition have all been handled.

For example, if you were writing "init" or "launchd" yourself, you would take a SIGCHLD, set a volatile variable (e.g. "volatile int saw_sigchld = 0;") to flag that you saw it in the handler (e.g. "saw_sigchld = 1;"), and then in the main loop following the interruption, loop, zeroing saw_sigchld and calling wait4() with WNOHANG until it quit returning child process information, doing whatever per-child cleanup needs to be done in that loop.

NB: Doing it in this order avoids the race from zeroing it to it being re-raised before you go into the loop processing dead children - so if you had another one happen, it would end up cleaned up in the meantime as a side effect of the structure of your code.

And why would different signals be masked on Intel than on PPC machines?

Different code in your application or one of its libraries, resulting from variant compilation based on __i386__ or __ppc__ or one of the other manifest constants from the compiler being in scope; different output from GNU configure; some signals depend on hardware support (e.g. SIGFPE), etc.

You should also be aware that a signal sent to a process is delivered via a Mach AST, and that gdb traps these ASTs, and can block, redirect, or otherwise cause them to not be acted upon.
Yes, I know that. The problem is that GDB never sees any signals arriving. If they'd arrive in GDB and just not at the process, I wouldn't complain, but they are never shown in GDB to begin with. And even without GDB running they don't arrive at the process.


The gdb will never see signals, period.  It will potentially see ASTs.

Are you running the standard gdb which came with the system, and is in the proper group for "procwatch" or "procmod", or did you compile your own?

If you compiled your own, or renamed/moved, or otherwise did not use the proper install process on it, the gdb you are using may fail to obtain the task_for_pid() which gdb uses to establish the Mach port on which it will receive, among other things, the signal AST notifications. This is a difference between current PPC and Intel: task_for_pid() is a privileged operation. It is our intention to take it away from all but privileged processes.

You should check your system and console logs to see if you see any denials of the form:

	(109: gdb)tfp: failed on 117

Also, since ASTs only fire on the way out of a system call (or cancellation point, if the call can be precancelled), the signal will not be delivered until one or more of your threads run up through the trap handler into the trampoline code in user space (and then call back into the kernel to return to user space not on the signal context). So it's possible for you to, for example, start a read, send a signal to the process, and not see the read interrupted (i.e. it could complete prior to the signal handler firing). This is the same thing DEC Tru64 UNIX, and a number of other OSes with signals built on AST-style implementations, will do: you get the signal, but the operation completes before the signal delivery actually happens.
This is all a very nice theory,


Not theory; I am the current maintainer of the signal code.

but why is SIGHUP then always delivered (also from my controller app), always interrupts the running select, but SIGTERM is never, SIGQUIT only on rare occasions and SIGINT on every second call from my controller app? This all sounds extremely nondeterministic, and nondeterministic behavior of the same piece of code sounds a lot like a system bug to me.

Most likely because you are calling library routines which result in things being masked, or you are running a multithreaded program, and have failed to mask the signals on the other threads (either intentionally, or as a result of some framework/library starting a worker thread on your behalf). As a result, the select() is not interrupted either because the signal attempting to interrupt it is masked, or the signal is not delivered to the thread on which the select is running, and therefore the select is never interrupted (and remains blocked in the kernel).

One common problem with debugging signals is that you don't want the signals sent to a subshell used to invoke your program; the default in gdb on Mac OS X is "start-with-shell" set to "on". If you plan on debugging signals, you will want to turn this off, e.g.:
	(gdb) set start-with-shell off
before attempting to debug anything to do with signals.
I'll try that. I doubt that I will be able to explain the SIGHUP/SIGTERM riddle that way. I'll keep you updated. I have also filed a bug report for this issue.

It would probably be best to resolve this in the context of the bug report, then.

Please make sure you've included the information I noted above as being relevant to a bug report, or it's going to bounce back and forth through ADCBugs several times while they collect the information from you.

-- Terry
_______________________________________________
Darwin-kernel mailing list