Re: Understanding cores...

Subject: Re: Understanding cores...
From: Derek Kumar <email@hidden>
Date: Tue, 9 Jan 2007 13:28:48 -0500

On Jan 8, 2007, at 4:05 PM, Michael Tuexen wrote:

Well, on that system the NKE is always loaded, because it is required by the application running on that Mac Pro. In the meantime another system was set up (on different hardware), and it also crashes a lot. BTW: "a lot" means a couple of times per day. And yes, the cores we get (using a core dump server) are all like this (some pointed to bugs in the NKE in the past, but these could be fixed; in those cases SCTP.kext was also explicitly mentioned in the paniclog).

Any idea how to narrow down the problem?

If the EIP values in the "paniclog" register dump are identical or similar across all the crashes you've observed (note that "unresolved kernel trap" is just a generic label), and the crashes occur only when your driver is loaded, it's likely to be memory corruption, as I noted previously. Is it always an EBP-based access (typically a local or parameter) in idle_thread() that's causing the fault? The loop in idle_thread() briefly enables interrupts and then disables them again, so if you have an interrupt filter routine (which executes at interrupt context), that could be another point where corruption occurs. This is in addition to the saved context at the base of the thread's kernel stack that I mentioned previously (corruption of the register context below the interrupt stack frame that contains the saved value of the EBP register, for instance).

Unfortunately, there's no single magic bullet when it comes to identifying sources of memory corruption of this type. One approach is to first walk carefully through your code looking for erroneous stores to memory, bad DMA bounds, stack overflows, etc., and then determine the patterns and location of the corruption and binary search via logging/tracing. I don't think page protection or debug register schemes to trap the bad store (assuming it's not a physical mode store) would be useful here, since the register context is accessed very frequently. Logic analyzers (very expensive) would be a last resort.

The kernel trace facility (/usr/local/bin/trace -h) can tell you what events (such as interrupts and context switches) occurred on that processor, but given that it panics, you'd probably have to examine the trace buffer in memory (see xnu/bsd/kern/kdebug.c in the kernel sources for the internals of the trace facility) to extract the last few trace events.
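To make the "binary search via logging" idea a bit more concrete, here is a minimal user-space sketch in C. It assumes you have identified a region that should not change between two points in your code; the names region_guard, guard_arm() and guard_check() are hypothetical and not part of any kernel API. In a kext you would log with the kernel's printf rather than fprintf, and this only works for memory that is not legitimately rewritten between checkpoints (so not for the frequently updated register context itself).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical guard over a region that should stay unchanged
 * between checkpoints. */
struct region_guard {
    const uint8_t *base;     /* start of the watched region */
    size_t         len;      /* length of the region in bytes */
    uint32_t       expected; /* checksum taken when the region was known good */
};

/* FNV-1a hash over the region; any cheap checksum would do here. */
static uint32_t region_checksum(const uint8_t *p, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Record the current contents of the region as the known-good state. */
static void guard_arm(struct region_guard *g, const void *base, size_t len)
{
    assert(g != NULL && base != NULL);
    g->base = (const uint8_t *)base;
    g->len = len;
    g->expected = region_checksum(g->base, len);
}

/* Returns 1 if the region still matches the known-good checksum,
 * 0 (after logging the checkpoint name) if it has changed. */
static int guard_check(const struct region_guard *g, const char *checkpoint)
{
    assert(g != NULL && checkpoint != NULL);
    uint32_t now = region_checksum(g->base, g->len);
    if (now != g->expected) {
        fprintf(stderr,
                "corruption detected at checkpoint \"%s\" (0x%08x != 0x%08x)\n",
                checkpoint, (unsigned)now, (unsigned)g->expected);
        return 0;
    }
    return 1;
}
```

You would call guard_arm() once the region is in a known-good state and then sprinkle guard_check() calls with distinct checkpoint names through the suspect code paths. The first checkpoint that reports a mismatch brackets the bad store between itself and the previous checkpoint, and you can then add more checkpoints inside that window to halve it again.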

Derek