On Dec 15, 2006, at 7:04 PM, Joseph Oreste Bruni wrote:
> On Dec 15, 2006, at 7:35 PM, Terry Lambert wrote:
>> On Dec 15, 2006, at 9:20 AM, Quinn wrote:
>>> At 20:24 -0700 13/12/06, Joseph Oreste Bruni wrote:
>>>> On my test system, I can start and kill my process all day long
>>>> and never reproduce this. In production, where the system runs
>>>> for weeks, the condition occurs...
>>>
>>> You might want to set up kernel core dumps on your production
>>> machine.
>>>
>>> <http://developer.apple.com/technotes/tn2004/tn2118.html>
>>>
>>> When you get a stuck process, you can press NMI to dump a kernel
>>> core. From that you could look at the state of the stuck process.
>>
>> I concur; I meant to respond to this earlier, and my opinion is
>> that there's nothing for this except the kernel debugger.
>>
>> If you have a second machine to debug with, then that would be
>> better than a core dump server (unless there was an intent to send
>> the core to Apple in a bug report), since a lot of the kgmacros
>> that would be useful for debugging this issue do not work on core
>> files because of gdb limitations that have not been addressed
>> (specifically, local shadow register state being used in place of
>> image register state, so that things like "switchtoact" actually
>> switch).
>>
>> -- Terry
>
> I had already opened a bug (4680335), but it was closed with a
> "behaves correctly" status. Should I create a new one, or can I file
> this under the original?

The guy in charge of the developer interaction closed it after 7 days
of not hearing anything back in response to a ping. The statement in
it is "Will re-open if Developer writes back in.".

> I figured this would eventually end up being a kernel problem since
> the basic outline of my program is:
>
> 1) create N number of threads at startup
> 2) do network and file I/O for days, weeks, etc.
> 3) receive SIGTERM
> 4) join with N number of threads
> 5) exit()
> 6) hang in the kernel, which thinks there are still more threads
> 7) ???
>
> I've had to send kernel dumps before so I'm familiar with the
> process. (Bug in original PCI fibrechannel card in G5 Xserve.) Now I
> get to figure out how to generate an NMI on the headless Xserve...

If you set the debug flags for kernel debugging (e.g. 0x144),
momentarily hitting the power button will generate the required NMI to
drop you into the debugger (or send the core to a dump server instead,
if you have a dump server configured in the flags), rather than
putting the machine to sleep.

Let me say up front that I don't believe this is a kernel problem, and
the "showallstacks" command should confirm that there are still stacks
running in the process.

If the process has already been taken off the active process list, it
won't show up in the task list, which means it won't show in
"showallstacks" either, and you may end up having to track down the
task from the zombie list. If that's the case, then walk the list to
find the process, look in that for the task, and then run
"showtaskstacks 0xdsfgsdfg", where the argument is the hex address of
the task.

Also, if that's the case, it's probably easier to NMI before you shut
down the process, get the information from "showalltasks", "c"
(continue) it, cause the problem, NMI it again, and then use the
information you got previously for the "showtaskstacks" command.

There will be at least one thread in a strange state, and whatever it
(or they, if there's more than one) blocks on is the source of your
problem.

It'd be a lot easier if I could attach to the machine (8-)), but I
really don't have the hardware or software resources necessary to
recreate your problem and diagnose it in house.

-- Terry
_______________________________________________
Darwin-kernel mailing list (Darwin-kernel@lists.apple.com)
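For reference, steps 1 through 5 of the program outline quoted in the thread can be modelled roughly as below. This is only a Python sketch of the control flow described, not the original program, which is never shown in the thread. The names worker, start_workers, join_workers, handle_sigterm and run are hypothetical, the sleep loop stands in for the real network and file I/O, and the join timeout is an added assumption so that a worker that never stops gets reported rather than blocking the join forever.

```python
import signal
import threading
import time

# Set once SIGTERM has been received; workers poll it to know when to stop.
stop_requested = threading.Event()


def worker(index):
    # Placeholder for the real network and file I/O loop (step 2).
    while not stop_requested.is_set():
        time.sleep(0.01)


def start_workers(count):
    # Step 1: create N threads at startup.
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    return threads


def join_workers(threads, timeout=5.0):
    # Step 4: join every worker; return the names of any still running.
    stuck = []
    for t in threads:
        t.join(timeout)
        if t.is_alive():
            stuck.append(t.name)
    return stuck


def handle_sigterm(signum, frame):
    # Step 3: only record the request; the main thread does the joining.
    stop_requested.set()


def run(count=4):
    # Must be called from the main thread, since it installs a signal handler.
    signal.signal(signal.SIGTERM, handle_sigterm)
    threads = start_workers(count)
    while not stop_requested.wait(0.1):
        pass
    # Step 5 (exit) is left to the caller; an empty list means a clean join.
    return join_workers(threads)
```

Calling run() and then sending the process a SIGTERM should return an empty list when every worker stops. A non-empty list would name the threads that were still alive after the timeout, which is the user-space analogue of the "thread in a strange state" that Terry expects the kernel debugger to show.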
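As a worked example of the debug flags value mentioned in the thread, 0x144 can be split into its individual bits. The bit names and values in the table below are the ones commonly documented for the xnu "debug" boot argument (for instance in TN2118); they are not stated in the thread itself, so treat them as an assumption and check them against the technote for your OS release. decode_debug_flags is a hypothetical helper name.

```python
# Assumed bit assignments for the xnu "debug" boot argument.
# Verify against TN2118 or the kernel sources for your release.
DEBUG_FLAGS = {
    0x001: "DB_HALT",
    0x002: "DB_PRT",
    0x004: "DB_NMI",
    0x008: "DB_KPRT",
    0x010: "DB_KDB",
    0x020: "DB_SLOG",
    0x040: "DB_ARP",
    0x080: "DB_KDP_BP_DIS",
    0x100: "DB_LOG_PI_SCRN",
    0x200: "DB_KDP_GETC_ENA",
    0x400: "DB_KERN_DUMP_ON_PANIC",
    0x800: "DB_KERN_DUMP_ON_NMI",
}


def decode_debug_flags(value):
    # Return the names of the bits set in value, lowest bit first.
    return [name for bit, name in sorted(DEBUG_FLAGS.items()) if value & bit]


print(decode_debug_flags(0x144))  # ['DB_NMI', 'DB_ARP', 'DB_LOG_PI_SCRN']
```

Under these assumed assignments, 0x144 = 0x100 + 0x040 + 0x004, and DB_NMI is the bit that makes the power button drop into the debugger. The "send the core to a dump server" variant would then correspond to also setting 0x800 (DB_KERN_DUMP_ON_NMI).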
Participants (1): Terry Lambert