Re: Debugging a machine check panic
- Subject: Re: Debugging a machine check panic
- From: Terry Lambert <email@hidden>
- Date: Thu, 12 Feb 2009 21:20:00 -0800
On Feb 12, 2009, at 2:46 PM, Ryan McGann wrote:
I have an NKE that intermittently, but reliably, panics in what
looks like an interrupt handler:
panic(cpu 0 caller 0x001A8Fd4): Machine-check at 0x0014329a, thread:
0x3639790, code:0x12, registers:
CR0: 0x8001003b, CR2: 0x37d0bfec, CR3: 0x0107d000, CR4: 0x0000006e0
EAX: 0x00522d08, EBX: 0x00000000, ECX: 0x00000000, EDX: 0x00001000
ESP: 0x37d0bff0, EBP: 0x37d0c068, ESI: 0x00000000, EDI: 0x00522d08
EFL: 0x00010206, EIP: 0x0019f967
I can reproduce the panic fairly easily but I cannot see in the code
where it is panicking. The EIP given in the panic log is somewhere in
hw_simplelock OR zalloc_canblock, but never in my code (though I'm
sure it's my kext causing the panic). I just don't know how to tell
where it's going wrong. I don't even know what this panic indicates
since I don't know exactly what machine-check is, other than some
sort of interrupt. My guess is stack overflow or stack corruption,
but I could be dead wrong. The ESP and CR2 registers may point inside
my kext, since showallkmods shows my kext is loaded at 0x2bc7a000,
though that'd be a large jump (the kext is 800k).
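
As a rough check of that last guess, the register values can be compared
against the kext's load range. This is only a sketch: it assumes "800k"
means 800 KiB, and the names KEXT_BASE, KEXT_SIZE, in_range and report
are made up for illustration; the addresses are copied from the panic
log and the showallkmods output above.

```python
# Values copied from the panic log and the showallkmods output quoted above.
KEXT_BASE = 0x2bc7a000
KEXT_SIZE = 800 * 1024  # assumption: "800k" means 800 KiB

REGISTERS = {
    "ESP": 0x37d0bff0,
    "EBP": 0x37d0c068,
    "CR2": 0x37d0bfec,
    "EIP": 0x0019f967,
}


def in_range(addr, base, size):
    """Return True if addr lies in the half-open range [base, base + size)."""
    return base <= addr < base + size


def report(registers, base, size):
    """Describe where each register value sits relative to the kext image."""
    lines = []
    for name, value in registers.items():
        where = "inside" if in_range(value, base, size) else "outside"
        lines.append(
            f"{name} = {value:#010x} is {where} the kext "
            f"(offset from base: {value - base:+#x})"
        )
    return lines


if __name__ == "__main__":
    print(f"kext range: {KEXT_BASE:#010x} .. {KEXT_BASE + KEXT_SIZE:#010x}")
    for line in report(REGISTERS, KEXT_BASE, KEXT_SIZE):
        print(line)
```

With these numbers none of the four values falls inside the kext image:
ESP and CR2 are roughly 192 MB above the load address, and CR2 sits
4 bytes below ESP.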
I've done all the usual debugging that I can think of, but two-machine
debugging is pretty useless since the backtrace is not in my
code, and printf can only tell me the last function in my code that
is called, not any kernel functions I call (plus it's not as
reliable as one would hope). I've commented out a lot of code to the
point that it doesn't panic anymore, but also doesn't do anything
useful anymore. So I'm pretty much at a dead end.
Anybody have any other ideas on where to begin, like what machine-
check means?
Definition:
<http://en.wikipedia.org/wiki/Machine_Check_Exception>
...basically, it thinks you have a hardware problem. There are
typically three causes for something like this:
(1) You have a hardware problem (usually, bad RAM that can't clock up
    fast enough, within the manufacturer's spec, after it's been
    clocked down during a power management operation)
(2) You have software that's stomping saved state information that,
    when restored, causes your hardware to barf
(3) You are trying to change the contents of MSRs, not realizing that
    these are adjusted dynamically by the OS for architectural
    reasons, and must be left alone by third parties
...I'm going to guess #2 for this one, given the information you've
provided so far.
Using two-machine debugging, you can enter the kgmacros command:
(gdb) showMCAstate
...to dump the machine check state for your CPU(s).
Then you can build a copy of mcelog from:
<http://freshmeat.net/projects/mcelog/>
...and shovel the register data through there, and hope it tells you
something Andi Kleen was able to figure out from the architecture
manuals, excluding the Intel NDA'd information, so you can figure out
what you're causing to be stomped.
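
For a first look before running the data through mcelog, the
architectural part of a bank's status register can be decoded by hand.
The sketch below uses the IA32_MCi_STATUS bit layout from the Intel
architecture manuals; the function name and the sample status value are
invented for illustration, and the exact layout of the showMCAstate
output is not assumed here. The model-specific bits still need mcelog
or the manuals to interpret.

```python
# Architectural flag bits in an IA32_MCi_STATUS register (Intel SDM, Vol. 3).
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # IA32_MCi_MISC holds valid data
    58: "ADDRV",  # IA32_MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    return {
        "flags": [
            name
            for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1
        ],
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


if __name__ == "__main__":
    sample = 0xB200000000000135  # hypothetical value, not from the panic above
    fields = decode_mci_status(sample)
    print("flags:", " ".join(fields["flags"]))
    print(f"model-specific code: {fields['model_specific_code']:#06x}")
    print(f"MCA error code: {fields['mca_error_code']:#06x}")
```

A bank with VAL clear holds no error and can be skipped; PCC set means
the saved processor state should not be trusted.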
Alternatively, you can file a bug report and hope you can gather enough
information that whoever it lands on can do this for you, or you could
probably find the problem if you were to go through your NKE with a
fine-toothed comb, doing a source inspection for logic errors.
-- Terry