Re: P_WEXIT

16 Dec 2006


On Dec 15, 2006, at 7:04 PM, Joseph Oreste Bruni wrote:
On Dec 15, 2006, at 7:35 PM, Terry Lambert wrote:
<http://developer.apple.com/technotes/tn2004/tn2118.html>
1) create N number of threads at startup
2) do network and file I/O for days, weeks, etc.
3) receive SIGTERM
4) join with N number of threads
5) exit()
6) hang in the kernel, which thinks there are still more threads
7) ???
-- Terry
On Dec 15, 2006, at 9:20 AM, Quinn wrote:

At 20:24 -0700 13/12/06, Joseph Oreste Bruni wrote:

On my test system, I can start and kill my process all day long
and never reproduce this. In production, where the system runs
for weeks, the condition occurs...

You might want to set up kernel core dumps on your production
machine.  When you get a stuck process, you can press NMI to dump
a kernel core.  From that you could look at the state of the stuck
process.

I concur; I meant to respond to this earlier, and my opinion is
that there's nothing for this except the kernel debugger.
If you have a second machine to debug with, then that would be
better than a core dump server (unless there was an intent to send
the core to Apple in a bug report), since a lot of the kgmacros
that would be useful for debugging this issue do not work on core
files because of gdb limitations that have not been addressed
(specifically, local shadow register state being used in place of
image register state, so that things like "switchtoact" actually switch).
-- Terry

I had already opened a bug (4680335), but it was closed with a
"behaves correctly" status. Should I create a new one, or can I
file this under the original?
I figured this would eventually end up being a kernel problem since
the basic outline of my program is:
I've had to send kernel dumps before so I'm familiar with the
process. (Bug in original PCI fibrechannel card in G5 Xserve.)
Now I get to figure out how to generate an NMI on the headless
Xserve...

The guy in charge of the developer interaction shut it down after 7
days of not hearing any response to a ping.  The statement in it is
"Will re-open if Developer writes back in."
If you set the debug flags for kernel debugging (e.g. 0x144),
momentarily hitting the power button will generate the required NMI
and drop you into the debugger (or, if the flags configure a dump
server instead, send the core to that server), rather than putting
the machine to sleep.
Let me say up front that I don't believe this is a kernel problem, and
the "showallstacks" command should confirm that there are still threads
running in the process.  If the process has already been taken off the
active process list, it may not show up in the task list, and so won't
show in "showallstacks"; in that case you will have to track down the
task from the zombie list.
If that's the case, then walk the list to find the process, then look
in that for the task, and then "showtaskstacks 0xdsfgsdfg", where the
argument is the hex address of the task.
Also, if that's the case, it's probably easier to NMI before you shut
down the process, get the information from "showalltasks", "c"
(continue), cause the problem, NMI again, and then use the
information you got previously for the "showtaskstacks".

There will be at least one thread in a strange state, and whatever it
(or they, if there's more than one) blocks on is the source of your
problem.
It'd be a lot easier if I could attach to the machine (8-)), but I really
don't have the hardware or software resources necessary to recreate
your problem and diagnose it in house.

Terry Lambert
