Re: Understanding cores...

9 Jan 2007

Delivered-To: darwin-kernel@lists.apple.com

On Jan 8, 2007, at 4:05 PM, Michael Tuexen wrote:
Well, on that system the NKE is always loaded, because it is required
by the application running on that Mac Pro. In the meantime another
system was set up (on different hardware) and it also crashes a lot.

BTW: A lot means a couple of times per day. And yes, the cores we
get (using a core dump server) are all like this (some pointed to bugs
in the NKE in the past, but these could be fixed; back then SCTP.kext
was also explicitly mentioned in the paniclog).

Any idea how to narrow down the problem?

If the EIP values in the "paniclog" register dump are identical or
similar across all the crashes you've observed (note that "unresolved
kernel trap" is just a generic label), and the crashes only occur when
your driver is loaded, it's likely to be memory corruption as I noted
previously. Is it always an EBP based access (typically a local or
parameter) in idle_thread() that's causing the fault? The loop in
idle_thread() briefly enables interrupts and then disables them again,
so if you have an interrupt filter routine (which executes at interrupt
context), that could be another point where corruption could occur.
This is in addition to the saved context at the base of the thread's
kernel stack that I mentioned previously: corruption of the register
context below the interrupt stack frame that contains the saved value
of the EBP register, for instance.
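One way to check whether the faulting addresses really cluster is to pull the EIP field out of each saved panic log and count how many panics land near each address. The following is only a minimal sketch: it assumes the register dump contains a field of the form "EIP: 0x...", the sample lines are made up for illustration, and the helper names (extract_eip, group_eips) are hypothetical.

```python
import re
from collections import Counter

# Matches the EIP field of a register dump, e.g. "EIP: 0x0019d5b7".
# ASSUMPTION: the panic log prints the register in this "EIP: 0x..." form.
EIP_RE = re.compile(r"\bEIP:\s*0x([0-9a-fA-F]+)")


def extract_eip(paniclog_text):
    """Return the first EIP value found in the text as an int, or None."""
    match = EIP_RE.search(paniclog_text)
    return int(match.group(1), 16) if match else None


def group_eips(logs, window=0x40):
    """Count panics per address bucket of `window` bytes.

    Each EIP is rounded down to a multiple of `window`, so nearby faulting
    addresses usually share a bucket (two close addresses can still fall on
    either side of a bucket boundary).
    """
    buckets = Counter()
    for text in logs:
        eip = extract_eip(text)
        if eip is not None:
            buckets[eip // window * window] += 1
    return buckets


# Made-up register dump lines, for illustration only.
SAMPLE_LOGS = [
    "EFL: 0x00010246, EIP: 0x0019d5b7, CS: 0x00000008, DS: 0x00000010",
    "EFL: 0x00010246, EIP: 0x0019d5ba, CS: 0x00000008, DS: 0x00000010",
    "EFL: 0x00010202, EIP: 0x003a1c40, CS: 0x00000008, DS: 0x00000010",
]

if __name__ == "__main__":
    for base, count in sorted(group_eips(SAMPLE_LOGS).items()):
        print(f"0x{base:08x}: {count} panic(s)")
```

If nearly all panics fall into one bucket, that supports the idea that the same piece of saved state is being corrupted each time; a wide spread of addresses would point elsewhere.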

Unfortunately, there's no single magic bullet when it comes to
identifying sources of memory corruption of this type. One approach is
to first walk carefully through your code looking for erroneous stores
to memory, bad DMA bounds, stack overflows etc., then determine the
patterns and location of the corruption and binary search via
logging/tracing. I don't think page protection/debug register type
schemes to trap the bad store (assuming it's not a physical mode store)
would be useful here, since the register context would be very
frequently accessed. Logic analyzers (very expensive) would be a last
resort.

The kernel trace facility (/usr/local/bin/trace -h) can tell you what
events (such as interrupts and context switches) occurred on that
processor, but given that it panics, you'd probably have to examine
the trace buffer in memory (see xnu/bsd/kern/kdebug.c in the kernel
sources for the internals of the trace facility) to extract the last
few trace events.
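Recovering the last few events from a dumped trace buffer is mostly a matter of walking a circular buffer backwards from the write position. The sketch below illustrates only that idea. The record layout (8-byte timestamp, 4-byte event code, 4-byte argument) and the names last_events and build_demo_buffer are hypothetical; the real record format and buffer bookkeeping are defined in the kernel sources and differ from this.

```python
import struct

# HYPOTHETICAL record layout: 8-byte timestamp, 4-byte event code,
# 4-byte argument, little-endian. This is NOT the real kernel layout.
RECORD = struct.Struct("<QII")


def last_events(buf, write_index, wrapped, count):
    """Return up to `count` most recent records, oldest first.

    buf:         raw bytes of the dumped trace buffer
    write_index: slot where the next record would have been written
    wrapped:     True if the buffer has wrapped at least once
    """
    slots = len(buf) // RECORD.size
    valid = slots if wrapped else write_index  # slots holding real data
    n = min(count, valid)
    events = []
    for i in range(n, 0, -1):
        slot = (write_index - i) % slots  # step back i records, with wraparound
        events.append(RECORD.unpack_from(buf, slot * RECORD.size))
    return events


def build_demo_buffer(slots, n_events):
    """Fill a small ring buffer with made-up events for demonstration."""
    buf = bytearray(slots * RECORD.size)
    for k in range(n_events):
        RECORD.pack_into(buf, (k % slots) * RECORD.size, k + 1, 0x100 + k, k)
    return bytes(buf), n_events % slots, n_events >= slots


if __name__ == "__main__":
    buf, write_index, wrapped = build_demo_buffer(slots=4, n_events=6)
    for timestamp, code, arg in last_events(buf, write_index, wrapped, 3):
        print(f"t={timestamp} code=0x{code:x} arg={arg}")
```

With a real dump, the buffer bytes, write position and wrap state would have to be read out of the core file using the definitions in kdebug.c rather than the placeholders used here.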


Derek Kumar